The Linear Algebra Mapping Problem

StefanKarpinski · on Nov 22, 2019

Whether you want a language to do this or not is a fundamental matter of design philosophy. I would be quite unhappy if a language changed multiplication by the inverse of a matrix into backslash without my permission: that is not the computation I told you to do. In a language that did that automatically, how would you check the difference in the result between using the inverse and using backslash? You would try to compare them and conclude that they produce identical results, because the system had surreptitiously changed both computations to do the same thing behind your back. Then you’d get user reports about this and have to add a way to opt out of the optimization when it wasn’t wanted. And of course that’s not the only time you want to suppress an optimization like this—sometimes it will give incorrect results. Optimizations should either be truly indistinguishable from the correct answer or be opt-in.

In Julia we take a very different approach: make it as easy and convenient as possible to ask for the efficient version, but not automatic. For example, you can wrap a matrix that you know to be symmetric in the `Symmetric` wrapper type and leave the rest of the code the same and dispatch will call specialized BLAS and LAPACK routines for symmetric matrices. This approach had been designed by people who live and breathe numerical linear algebra and it allows them to write the naive version of an algorithm, and then tweak it to use the most efficient available kernels with minimal effort and without fundamentally changing the shape of their code.

What that means for the evaluation in this paper is that while the naive code may not do the fastest, cleverest thing, a very small variation on the naive code will do the fastest thing, using optimized kernels. You ask for this explicitly, but it’s easy and convenient to ask for. We’ve found this to be a very pragmatic and effective approach. It keeps optimization explicit but easy.

It would be an easy layer on top of that to add a code analyzer to look for patterns that might be possible to optimize further, but that seems like a problem that is better suited for tooling like a linter than being baked into the language itself.

fyp · on Nov 22, 2019

The paper also talks about reordering matrix multiplications by automatically adding parentheses (despite the fact that floating point multiplication is non-associative).

Is there a way to achieve this in Julia? If not what syntax might you introduce to do it? You can't really annotate just the matrices or multiplication operators since you are optimizing a whole expression.

photon-torpedo · on Nov 22, 2019

Matrix chain multiplication can be seen a special case of tensor contraction, and the https://github.com/Jutho/TensorOperations.jl package has an implementation which tries to optimize the contraction order (macro @tensoropt).

ChrisRackauckas · on Nov 22, 2019

Needs an update, but https://github.com/AustinPrivett/MatrixChainMultiply.jl does it. Could probably be handled more automatically.

mratsim · on Nov 22, 2019

The follow-up paper should include the deep learning and linear algebra compilers:

- Halide

- Tiramisu

- Ngraph

- Weld

- PyTorch JIT & PyTorch Glow and Tensor Comprehensions

- TVM

- Futhark

- A toy MLIR language build on the linalg dialect. I'm sure Google would love to help them.

to name the main ones.

Everyone is moving towards a compiler approach.

I'm actively researching the domain and I think Halide is the most promising: by separating the algorithm from the schedule you can leave the algorithm side to the researchers and optimization is done by HPC devs or automated methods without having to rewrite in C/Fortran.

Much more attempts in the past:

- https://github.com/mratsim/Arraymancer/issues/347#issuecomme...

- My own domain specific language and compiler: https://github.com/numforge/laser/tree/master/laser/lux_comp...

  there are a low_level_design_considerations.md and challenges.md that should be interesting as well.

Unfortunately I didn't go beyond the proof of concept as I need to revamp the Nim multithreading runtime to improve composition of parallel linear algebra kernels.

One thing for sure is that OpenMP is today something that needs to be workaround due to it's lack of composability and GCC implementation is a contention point due to using a single queue protected by a lock. While traditional linear algebra algorithms can deal with static distribution and no load balancing, some algorithms would hugely benefits from a more flexible task system, for example Beam Search.

pretty_dumm_guy · on Nov 22, 2019

Hi,

Does one need to understand compiler technology fully to get into this field? I have been trying to get a grip over MLIR and GLOW source code but have failed miserably so far. Could you please give me some pointers which would help me in this regard ? Thanks

Jhsto · on Nov 22, 2019

How viable do you think SPIR-V is for computation? I am thinking of researching (that is, creating my own) DSLs targeting SPIR-V.

mratsim · on Nov 22, 2019

Good question.

I can't find a single SPIR-V AXPY (a*X+y) example which is the Hello World of linear algebra in Vulkan/SPIR-V so I'm not optimistic on ease of use.

You can probably do texture hacks like what people did in 2006 when they realized that GPU were fast for linear algebra and OpenCL/Cuda did not exist (https://igoro.com/archive/how-gpu-came-to-be-used-for-genera...)

The best would probably be to follow this Halide issue and the corresponding branches: https://github.com/halide/Halide/issues/2296. Note that Halide is a compiler so it automates lots of the boilerplate that is probably needed to Vulkan for compute compared to Cuda.

Loq · on Nov 22, 2019

   moving towards a compiler 
   approach.

What do you mean by "compiler approach" in this context?

mratsim · on Nov 22, 2019

Instead of having to reimplement a series of linear algebra transformation as coarse-grained transformation on ndarrays/tensors, you create a domain-specific language that still give you the mathematical feel. This produce a computation graph that is optimized and compiled across the whole processing pipelines i.e. similar to Dask and Tensorflow delayed computations but you allow manual tuning or autotuning on block size (to best use L1, L2 or GPU cache), on memoization versus recomputing intermediate stages, on swizzling/re-ordering/re-packing inputs to improve cache access for micro-kernels and/or allow their SIMD vectorization.

Furthermore a compiler approach will lend itself to autodifferentiation like what Julia, Swift for Tensorflow and Halide are doing.

pfortuny · on Nov 22, 2019

Looks like “you cannot automate every decision”, which is rather true and a very relevant problem imho: I teach at a school of Engineering and it looks like everything now boils down to “let the computer do it”.

commandlinefan · on Nov 22, 2019

> you cannot automate every decision

Disturbingly too many people don’t seem to recognize the difference between letting the computer do something they don’t have time to do and trusting the computer to do something they don’t understand.

JadeNB · on Nov 22, 2019

I think that people think that they are doing the former when they do the latter: "I don't have time to understand this technique, so I'll let the computer understand it for me."

hinkley · on Nov 23, 2019

Which is quite a relief, if you think about it.

Imagine the nihilism if you could automate every decision.

vmchale · on Nov 22, 2019

Numpy was quite difficult to get working efficiently. J actually outperformed numpy pretty much any way I compiled it: https://github.com/vmchale/matrix-benchmarks, even doing basic stuff like multiplying two matrices.

ur-whale · on Nov 22, 2019

Looks to me like the age-old standard problem of using higher abstractions and expecting the machinery underneath to be smart enough to deploy it optimally to the underlying hardware.

For similar instances of the same story, see: compilers, java, haskell, cuda, etc...

News at 10: you always pay a performance price when you use high level abstractions, aka you can't have your lunch and eat it too.

TheWizardofOdds · on Nov 22, 2019

Theoretically you are correct, but taking the skills of the user into account, I think the very opposite is true. In practice, hardly anyone has the skills to actually implement anything with high performance anymore (especially in numerics) and it is thus (in most cases ) more performant to simply stick to stuff like numpy.

In most cases, "working code, fast" is more important than "fast working code".

commandlinefan · on Nov 22, 2019

> "working code, fast" is more important than "fast working code"

That depends on who you ask. If you ask the people who are managing the schedules, whose job reviews and bonuses are tied to meeting a specific date? Yes, absolutely, performance is a "nice to have" as long as the dates don't "slip". Now ask the users, and the number one complaint I hear most often is "why is this thing so damned slow?"

TheWizardofOdds · on Nov 22, 2019

> Now ask the users, and the number one complaint I hear most often is "why is this thing so damned slow?"

Good point. My viewpoint is rather focused on numerical linear algebra, since I worked there and the article is about it. If you skim through the paper mentioned, in the article you can e.g. compare numpy and armadillo. You will see that the speedups from using a (arguably) much more complex C++ framework instead of numpy are marginal and will not be visible for the user. The increased production/maintenance costs due to a more complex code, will be visible for the user.

geoalchimista · on Nov 22, 2019

In other words, we need an optimizing linear algebra compiler to compile high-level abstractions to efficient BLAS/MKL calls, instead of just interpreting them for a literal translation. For example, in the case of NumPy, when given x = inv(A) @ b it should attempt to optimize it as x = A \ b under safe conditions. So, why not the best of both worlds?

adrianN · on Nov 22, 2019

The performance price can be made pretty much 0. For example compilers for C can produce optimal machine code for many realistic example programs.

yowlingcat · on Nov 22, 2019

By machinery do you mean compiler? The Java JIT has become extremely, extremely fast. All those $B that have collectively put into it have yielded reaped rewards. It is of course challenging to have your lunch and eat it too if you're not aware of the best way to order it.