Hacker News new | past | comments | ask | show | jobs | submit login

We did some work on computing derivative expressions (goal is application to deep learning) for such tensor algebras. I was going to release it on arXiv in the next few weeks, but now seems to be a good time. Here you go (preliminary version):

https://github.com/surban/TensorAlgDiff/raw/master/elemdiff....

Our system takes a tensor algebra (we call it element-wise defined tensor) and outputs expressions for the derivatives w.r.t. all of its arguments. The expression may contain sums and the indices of the argument tensors can be any linear combination of the function indices (see our example for more details). It correctly handles the cases where an index does not appear in an argument, appears twice, appears as (i+j) etc.

Code to play with at https://github.com/surban/TensorAlgDiff




I'm not sure if this is a sensible question, but what is the use of this compared to (say) an autodiff library like FABAD++ [1]? Is performance the main advantage, or expressive power, or something else?

[1] http://www.fadbad.com/fadbad.html


If you use autodiff and don't have an 1:1 relationship between function and argument indices you might have to do atomic sums or locking when computing the derivative because multiple elements of the function derivative correspond to one element of the argument derivative. On a CPU this might be okay but on CUDA GPUs this usually has a performance impact. Thus we transform the derivatives so that we have an explicit expression for each derivative element and thus can use one CUDA thread per derivative element.


Oh wow, interesting. Thanks!


I'm not getting it. If you have a tensor expression, then differentiation gives you another tensor expression. In principle you can then just feed it through a compiler like taco, right? Or is there some optimization you can do, knowing that you are computing a derivative?


Yes, but you need to take care of the indices of the arguments when calculating the derivative expression. This is what we do.

For example if you have f_{i,j} = (x_i)^2, then the derivative (of some loss) w.r.t. x will be: dx_i = \sum_j df_{i,j} 2 x_i. The sum is needed because the argument x does not depend on the index j and thus x_i is affected by all j elements of df_{i,j}.

Another example would be: f_i = (x_ii)^2, i.e. taking the squares of the diagonal of the matrix x. Here the derivative is x_{i,j} = kronecker_{i=j} 2 df_i x_ii because off-diagonal elements have zero derivatives.

For such simple expressions it's trivial, but for complex expressions it's error-prone when you do it by hand.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: