The Tensor Algebra Compiler (tensor-compiler.org)
139 points by dharma1 on Nov 1, 2017 | 35 comments



Hi Hacker News! I’m one of the developers. This project was also featured in MIT News yesterday:

http://news.mit.edu/2017/faster-big-data-analysis-tensor-alg...

The code is available at:

https://github.com/tensor-compiler/taco

I’m happy to discuss the project and to answer any questions :)


The MIT story focuses on "big data," but it looks like this might be applicable to coupled-cluster calculations in physics/chemistry too. Is it? Can you compare/contrast with e.g. the Cyclops Tensor Framework (http://solomon2.web.engr.illinois.edu/ctf/) or NWChem's Tensor Contraction Engine (http://www.csc.lsu.edu/~gb/TCE/)?


Larry chose to focus on the big data part because it is intuitive. But I think you're absolutely correct that it has applications in physics/chemistry (and machine learning too). We're actually talking to people in our theoretical physics department who may want to use taco for their QCD computations. There's also a new issue on our tracker about adding complex number support for nuclear computations and quantum computing: https://github.com/tensor-compiler/taco/issues/116.

The Tensor Contraction Engine is great work, and it focuses on dense tensors. We currently optimize for sparse tensors, so TCE will do better than us on pure dense expressions. We want to bridge this gap next semester though.

The Cyclops framework is also great. We discuss it in our related work, but we did not compare directly against it in our evaluation. The first version of it, like TCE, focused on dense tensors, and their main focus is on distributed computing, which we don't support yet (we do shared-memory parallel execution at the moment). They have some follow-up work on sparse computations. The difference from our work is that they, if I read their paper correctly, transpose the data until they can call pre-existing matrix multiplication routines, which causes data-movement overhead. Our work compiles expressions to work on the data at hand, without moving it.


This looks like it might be nice, but regular coupled cluster doesn't really have very many sparse matrices, so mapping things into BLAS 3 is fine. Also, if you go the reduced-scaling route and use PNOs or OSVs, everything is mapped into matrix-matrix products for CCSD anyway.


Just saw your talk. Man you are one hell of a speaker. Great slides, animations and presentation. I enjoyed it.

It seems like a strange thing that no one thought of doing this before! Nice that you identified a cool problem and solved it :)


Not entirely true. Sparse x dense matrix multiplication is an important component of the DSSTNE deep learning framework:

https://github.com/amzn/amazon-dsstne

But this library is clearly more thorough.


For sure, as you point out, many such kernels have been written, going at least as far back as 1967.

We believe our contribution is being able to generate kernels for arbitrary tensor algebra expressions.

Thanks for the reference though! We’re trying to learn where sparsity may be important in neural networks.


Thanks, that is really nice of you :)


Skimming it quickly, I didn't see anything about GPU or vector instruction support as compilation targets. Is this planned? Or did I miss something, and this is handled at a higher layer?

Ps: excellent name


We currently compile to C code and use the system compiler to compile it further. For dense loop nests it does a good job of auto-vectorizing, but we believe there are good opportunities for doing something custom, knowing the high-level algebraic structure.

taco does not target GPUs yet, and we want to work on that this spring. It's clearly needed, for example for neural networks.


Hi, I just saw the presentation too, and it looks impressive.

I'm hoping that you can (at some point) write a language-agnostic library, so that your work can be used in many other development tools (which are not using C++). I suppose it would amount to publishing the documentation of the intermediate code you are already using. This would also save you the trouble of writing and maintaining a GPU back-end because other people could do that using your library.

Anyway, great work!


That is a good idea, but there are only so many of us. One student in the Julia group is developing Julia bindings though, and we hope to make C/Python bindings. The inner workings are published in the paper “The Tensor Algebra Compiler”, so the approach can be implemented in other languages. We like the idea of bindings because then our future work can benefit more users.


Any chance you'd be interested in adding more direct support for nonlinear systems? That seems to be missing from most computation graph compiler libraries, e.g. Tensorflow.


Hi, that sounds very interesting and I hope someone works on it. There are only so many of us though, and we have several things we want to do that will occupy us at least until next summer.


I entered "x(i) = y(i) + z(i)" in the online demo, all three sparse, and I noticed some useless variable definitions: int32_t iz0 = z1_idx[pz1];

Is this intentional? (if so, why)


We've tried to engineer it to emit clean code, but it seems we left in some unused variables.

The C compiler's dead-code elimination should remove those though.
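For anyone curious what such a kernel is doing conceptually, here is a rough C sketch of the merged iteration for x(i) = y(i) + z(i) with all three vectors sparse. This is my own illustration, not taco's actual output; the array names and the assumption that x is pre-allocated are just for exposition.

    // Two-way merge over two sparse vectors stored as sorted (index, value) arrays.
    // Returns the number of nonzeros written to x; x is assumed pre-allocated to
    // hold up to y_nnz + z_nnz entries.
    int sparse_add(int y_nnz, const int *y_idx, const double *y_val,
                   int z_nnz, const int *z_idx, const double *z_val,
                   int *x_idx, double *x_val)
    {
      int py = 0, pz = 0, px = 0;
      while (py < y_nnz && pz < z_nnz) {
        int iy = y_idx[py], iz = z_idx[pz];
        if (iy == iz) {                 // index present in both y and z
          x_idx[px] = iy; x_val[px] = y_val[py] + z_val[pz]; px++; py++; pz++;
        } else if (iy < iz) {           // index only in y
          x_idx[px] = iy; x_val[px] = y_val[py]; px++; py++;
        } else {                        // index only in z
          x_idx[px] = iz; x_val[px] = z_val[pz]; px++; pz++;
        }
      }
      while (py < y_nnz) { x_idx[px] = y_idx[py]; x_val[px] = y_val[py]; px++; py++; }
      while (pz < z_nnz) { x_idx[px] = z_idx[pz]; x_val[px] = z_val[pz]; px++; pz++; }
      return px;
    }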


We did some work on computing derivative expressions (goal is application to deep learning) for such tensor algebras. I was going to release it on arXiv in the next few weeks, but now seems to be a good time. Here you go (preliminary version):

https://github.com/surban/TensorAlgDiff/raw/master/elemdiff....

Our system takes a tensor algebra expression (we call it an element-wise defined tensor) and outputs expressions for the derivatives w.r.t. all of its arguments. The expression may contain sums, and the indices of the argument tensors can be any linear combination of the function indices (see our example for more details). It correctly handles the cases where an index does not appear in an argument, appears twice, appears as (i+j), etc.

Code to play with at https://github.com/surban/TensorAlgDiff
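To make the "(i+j)" case concrete, here is a small illustrative example of my own (not one from the paper), writing df_i for the derivative of the loss w.r.t. f_i:

    f_i = \sum_j x_{i+j} k_j                                   (a 1-D correlation)
    dx_m = \sum_{i,j : i+j=m} df_i k_j = \sum_j df_{m-j} k_j   (ignoring boundaries)

Each element of df contributes to several elements of dx, and which ones has to be read off from the linear index expression i+j.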


I'm not sure if this is a sensible question, but what is the use of this compared to (say) an autodiff library like FADBAD++ [1]? Is performance the main advantage, or expressive power, or something else?

[1] http://www.fadbad.com/fadbad.html


If you use autodiff and don't have a 1:1 relationship between function and argument indices, you might have to do atomic sums or locking when computing the derivative, because multiple elements of the function derivative correspond to one element of the argument derivative. On a CPU this might be okay, but on CUDA GPUs it usually has a performance impact. Thus we transform the derivatives so that we have an explicit expression for each derivative element and can use one CUDA thread per element.
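As a hypothetical C sketch (not the TensorAlgDiff code), take f[i][j] = x[i]^2, where all J function elements in row i map back to the same argument element x[i]:

    // Scatter form: what naive reverse-mode accumulation produces.  If this were
    // parallelized over (i, j) on a GPU, the updates to dx[i] would need atomic
    // adds or locking.  dx is assumed zero-initialized.
    void backward_scatter(int I, int J, const double *x, const double *df, double *dx) {
      for (int i = 0; i < I; i++)
        for (int j = 0; j < J; j++)
          dx[i] += df[i * J + j] * 2.0 * x[i];
    }

    // Gather form: one explicit expression per derivative element, so each dx[i]
    // can be computed by its own independent thread with no synchronization.
    void backward_gather(int I, int J, const double *x, const double *df, double *dx) {
      for (int i = 0; i < I; i++) {
        double s = 0.0;
        for (int j = 0; j < J; j++)
          s += df[i * J + j];
        dx[i] = 2.0 * x[i] * s;
      }
    }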


Oh wow, interesting. Thanks!


I'm not getting it. If you have a tensor expression, then differentiation gives you another tensor expression. In principle you can then just feed it through a compiler like taco, right? Or is there some optimization you can do, knowing that you are computing a derivative?


Yes, but you need to take care of the indices of the arguments when calculating the derivative expression. This is what we do.

For example, if you have f_{i,j} = (x_i)^2, then the derivative (of some loss) w.r.t. x will be: dx_i = \sum_j df_{i,j} 2 x_i. The sum is needed because the argument x does not depend on the index j, and thus x_i is affected by all j elements of df_{i,j}.

Another example would be f_i = (x_{i,i})^2, i.e. taking the squares of the diagonal of the matrix x. Here the derivative is dx_{i,j} = kronecker_{i=j} 2 df_i x_{i,i}, because off-diagonal elements have zero derivatives.

For such simple expressions it's trivial, but for complex expressions it's error-prone when you do it by hand.


Often (in the scientific computing communities that I am in) large tensor contractions are done by reshaping tensors into matrices, transposing, and using matrix multiplication. The contraction time can change dramatically based on the order one contracts the tensors and the relative sizes of the tensors. It seems this just uses a large nested for loop - how does this strategy compare to using dedicated matrix multiplication algorithms?


An efficient xgemm kernel is probably faster than the code that taco generates if you don't need to permute your tensor. A contraction like B(i,k) = T(i,j,k) * A(j) would require a permutation of T before you could run the matrix multiplication, though, while taco can just keep the data in place.
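For concreteness, here is a minimal dense C sketch of that contraction (my own illustration, not taco's generated code), with T stored row-major as (i, j, k):

    // B(i,k) = sum_j T(i,j,k) * A(j).  The direct loop nest reads T in storage
    // order, so no permutation or copy of T is needed.
    void contract(int I, int J, int K, const double *T, const double *A, double *B) {
      for (int i = 0; i < I; i++)
        for (int k = 0; k < K; k++)
          B[i * K + k] = 0.0;
      for (int i = 0; i < I; i++)
        for (int j = 0; j < J; j++)
          for (int k = 0; k < K; k++)
            B[i * K + k] += T[(i * J + j) * K + k] * A[j];
    }
    // Mapping the same contraction onto a single BLAS call would typically mean
    // first permuting T so the contracted index j lines up with a matrix
    // dimension; that permutation is the data movement being discussed here.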


Awesome!

I made something similar recently, a GLSL code printer for SymPy. I use it for generating GLSL out of Clifford algebras, and I'm using it to do higher-dimensional & conformal geometry.

https://github.com/sympy/sympy/pull/12713


Conventional wisdom has it that dense matrix-matrix multiplication is generally faster if you use some cache-aware scheme instead of the textbook three nested for loops.

Is there ever a similar story for sparse matrices?


The story for sparse matrices is more complicated. Because of the dependencies imposed by sparse data structures (you don't have random access and often have to traverse them), you cannot do loop tiling without adding if statements (I think you can probably get rid of these by precomputing, e.g., row halfway points). It may still make sense, but there's a bigger cost. However, you can lay out data in a tiled way to get better cache blocking; taco lets you do this by storing a blocked matrix as a 4-tensor.

Because of the lack of random access, some algorithms, like linear-combination sparse matrix-matrix multiplication, benefit from adding a dense workspace that gives you a view into one row. Then you can scatter into it and, when you're done, copy the nonzeros to the sparse result matrix. (This is sometimes called Gustavson's algorithm, after the person who first published it.) We have worked out this optimization within the taco framework (it applies to other kernels too) and are now writing it up for publication.
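For the curious, here is a C sketch of that dense-workspace approach for C = A * B with all matrices in CSR. It is my own illustration, not the taco-generated kernel; pos/crd/val stand for row pointers, column indices, and values, C's arrays are assumed pre-allocated large enough, and the column indices within each output row are left unsorted.

    #include <stdlib.h>

    void spgemm_gustavson(int m, int n,
                          const int *A_pos, const int *A_crd, const double *A_val,
                          const int *B_pos, const int *B_crd, const double *B_val,
                          int *C_pos, int *C_crd, double *C_val)
    {
      double *w    = calloc(n, sizeof *w);     // dense workspace: one row of C
      int    *mark = malloc(n * sizeof *mark); // which row last touched each column
      for (int j = 0; j < n; j++) mark[j] = -1;

      int nnz = 0;
      C_pos[0] = 0;
      for (int i = 0; i < m; i++) {
        int row_start = nnz;
        // Row i of C is a linear combination of the rows of B picked out by row i of A.
        for (int pA = A_pos[i]; pA < A_pos[i + 1]; pA++) {
          int    k = A_crd[pA];
          double a = A_val[pA];
          for (int pB = B_pos[k]; pB < B_pos[k + 1]; pB++) {
            int j = B_crd[pB];
            if (mark[j] != i) {        // first contribution to column j in this row
              mark[j] = i;
              w[j] = 0.0;
              C_crd[nnz++] = j;
            }
            w[j] += a * B_val[pB];     // scatter into the dense workspace
          }
        }
        // Compress the accumulated row back out of the workspace.
        for (int p = row_start; p < nnz; p++)
          C_val[p] = w[C_crd[p]];
        C_pos[i + 1] = nnz;
      }
      free(w);
      free(mark);
    }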


Hm, the lack of GPU-specific support seems unfortunate. Cool work though, the formulation as a lattice is pretty sweet :)


Yeah, would be nice to see one of these compute frameworks compile to SPIR-V and run on Vulkan…


When they add GPU-specific support they can call it taco supreme.


Is there a list of applications this may be especially geared toward? I’m a little out of practice and this seems interesting.


I don't know of an explicit list like this, but some areas that we are particularly interested in are: data analytics (tensor factorization), machine learning (anything sparse, including perhaps sparse neural networks), and scientific computing/graphics (general relativity, quantum mechanics such as QCD, finite element simulations, and apparently nuclear physics). It is particularly well suited anywhere you have sparse matrices or tensors.


Saw you guys at SPLASH; good shit!


Thanks! I really like that conference!


Man, this is awesome!



