Instead of having to reimplement a series of linear algebra transformations as coarse-grained operations on ndarrays/tensors, you create a domain-specific language that keeps the mathematical feel. This produces a computation graph that is optimized and compiled across the whole processing pipeline, similar to the delayed computations of Dask and TensorFlow, but with manual tuning or autotuning of the block size (to best use the L1, L2, or GPU cache), of memoization versus recomputation of intermediate stages, and of swizzling/reordering/repacking of inputs to improve cache access for micro-kernels and/or enable their SIMD vectorization.
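To make this concrete, here is a minimal sketch of the idea in Python (the names `Expr`, `Input`, and `compile_and_run` are illustrative, not an actual API): arithmetic on expressions builds a graph instead of computing eagerly, and the graph is then executed block by block, with the block size exposed as the tuning knob mentioned above.

```python
import numpy as np

class Expr:
    """Graph node: arithmetic operators build the graph instead of computing."""
    def __add__(self, other): return BinOp(np.add, self, other)
    def __mul__(self, other): return BinOp(np.multiply, self, other)

class Input(Expr):
    """Leaf node wrapping a concrete ndarray."""
    def __init__(self, array): self.array = array
    def evaluate(self, lo, hi): return self.array[lo:hi]

class BinOp(Expr):
    """Fused elementwise operation, evaluated one block at a time."""
    def __init__(self, op, lhs, rhs):
        self.op, self.lhs, self.rhs = op, lhs, rhs
    def evaluate(self, lo, hi):
        # Operands are materialized only for the current block, so
        # intermediates stay small enough to live in L1/L2 cache.
        return self.op(self.lhs.evaluate(lo, hi), self.rhs.evaluate(lo, hi))

def compile_and_run(expr, n, block=8192):
    """Walk the whole graph once per block: every fused stage runs on a
    cache-sized chunk before moving on. `block` is the tuning knob."""
    out = np.empty(n)
    for lo in range(0, n, block):
        hi = min(lo + block, n)
        out[lo:hi] = expr.evaluate(lo, hi)
    return out

# Usage: a*x + y builds a graph; no full-size temporary is ever allocated.
n = 1_000_000
a, x, y = (Input(np.random.rand(n)) for _ in range(3))
result = compile_and_run(a * x + y, n, block=4096)
```

A real compiler would go further, fusing the per-block loop into a single SIMD micro-kernel instead of dispatching per node, but the structure is the same: build the graph first, decide the execution schedule later.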
Furthermore, a compiler approach lends itself to automatic differentiation, as Julia, Swift for TensorFlow, and Halide demonstrate.
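As a rough illustration of why the graph representation makes this natural (again with hypothetical names; the systems above differentiate a compiler IR rather than Python objects), reverse-mode autodifferentiation is just a backward sweep over the recorded graph, applying each node's local derivative rule:

```python
# f(x, y) = x*y + x  =>  df/dx = y + 1, df/dy = x
class Var:
    """Scalar graph node recording its parents and the local derivatives."""
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0
    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])
    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

def backward(out):
    """Reverse-mode AD: order the graph topologically, then sweep backward,
    applying the chain rule at every node."""
    order, seen = [], set()
    def topo(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                topo(parent)
            order.append(node)
    topo(out)
    out.grad = 1.0
    for node in reversed(order):
        for parent, local_grad in node.parents:
            parent.grad += node.grad * local_grad

x, y = Var(2.0), Var(3.0)
f = x * y + x
backward(f)
print(f.value, x.grad, y.grad)  # 8.0 4.0 2.0
```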