
I presume you're already familiar with computing the numerical and analytical Jacobian[1][2] and are just wishing for a better way? :) They're memory intensive as all hell and pretty finicky, but at least it's something. I'll admit that when floating point calculations are involved it can all go to hell anyway.
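
For anyone who hasn't done this before, here's a minimal sketch of the idea in plain NumPy (function and variable names are my own, just for illustration): build the numerical Jacobian with central differences and compare it against the analytical one.

    import numpy as np

    def numerical_jacobian(f, x, eps=1e-6):
        # Central-difference Jacobian of f at x; needs 2*n extra evaluations of f.
        x = np.asarray(x, dtype=np.float64)
        y = np.asarray(f(x))
        J = np.zeros((y.size, x.size))
        for i in range(x.size):
            dx = np.zeros_like(x)
            dx.flat[i] = eps
            J[:, i] = (np.asarray(f(x + dx)) - np.asarray(f(x - dx))).ravel() / (2 * eps)
        return J

    # Example: f(x) = tanh(W @ x), whose analytical Jacobian is diag(1 - tanh^2) @ W.
    W = np.random.randn(3, 4)
    f = lambda x: np.tanh(W @ x)
    x0 = np.random.randn(4)
    J_analytic = np.diag(1 - np.tanh(W @ x0) ** 2) @ W
    J_numeric = numerical_jacobian(f, x0)
    assert np.allclose(J_analytic, J_numeric, atol=1e-5)

The memory pain the parent mentions comes from that J matrix: it's (output size) x (input size), which blows up fast for real layers.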

I recently had to implement gradient calculations by hand (writing custom CUDA code) and had a pretty terrible time. Mixing the complications of CUDA code with my iffy manual differentiation and floating point silliness can drive you a little bonkers. I ended up implementing a slow, automatically differentiated version and comparing the resulting outputs and gradients to help work through my bugs.

Here's hoping that TensorFlow's XLA and other JIT-style CUDA compilers/optimizers will make much of this obsolete in the near future.

For those not familiar, the overhead of launching a CUDA kernel can be insanely high, especially when you're just doing an elementwise operation such as an add. Given that your neural network likely has many of these, fusing several of them into one small piece of custom CUDA can result in substantial speed increases. Unfortunately there's not really any automatic way of doing that yet. We're stuck in the equivalent of either writing manual assembly or being fine with suboptimal compiled C.
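
As a concrete illustration (not from the parent comment), here's one way to hand-fuse a chain of elementwise ops using CuPy's ElementwiseKernel: scale, bias, and ReLU become a single kernel launch instead of three launches plus temporaries. The kernel name and values are made up for the example.

    import cupy as cp

    # y = max(x * scale + bias, 0), written as one small piece of custom CUDA
    # instead of three separate elementwise kernels.
    fused_scale_bias_relu = cp.ElementwiseKernel(
        'float32 x, float32 scale, float32 bias',   # inputs
        'float32 y',                                # output
        'y = fmaxf(x * scale + bias, 0.0f)',        # CUDA body, run once per element
        'fused_scale_bias_relu')

    x = cp.random.randn(1 << 20, dtype=cp.float32)
    y = fused_scale_bias_relu(x, 2.0, 0.5)          # one launch, no temporaries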

[1]: https://www.tensorflow.org/versions/r0.11/api_docs/python/te...

[2]: https://github.com/pytorch/pytorch/blob/master/torch/autogra...




We spent a ton of time thinking about this. We have an "op executioner" in our tensor library that handles special cases like this. We call it "grid execution": we look for opportunities to group ops automatically. We will be combining that with our new computation graph to spot more optimization opportunities like that.

Right now we hand write all of our own gradients as well.

The overhead can come from a ton of different places. This is why we wrote workspaces: http://deeplearning4j.org/workspaces

Allocation reduction and op grouping are only two of the things you can do.
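
(Not DL4J's actual workspace API — see the link above for that — but a minimal NumPy sketch of the allocation-reduction idea: allocate the scratch buffer once and have every iteration write into it instead of creating fresh arrays.)

    import numpy as np

    x = np.random.randn(1024, 1024)
    b = np.random.randn(1024, 1024)
    out = np.empty_like(x)             # "workspace"-style buffer, allocated once

    for _ in range(100):               # e.g. one hundred training iterations
        np.add(x, b, out=out)          # writes into the reused buffer
        np.maximum(out, 0.0, out=out)  # ReLU in place, still no new allocations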


I did not know about gradcheck. Thanks for the pointer! I have some handwritten code that does some of this for me. But essentially, yes! I want better tooling to catch my mistakes.
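
For anyone else landing here later, torch.autograd.gradcheck compares your backward pass against a finite-difference Jacobian. A minimal usage sketch (my_op is just a stand-in; double precision matters because the check uses tiny perturbations):

    import torch
    from torch.autograd import gradcheck

    def my_op(x, w):                   # stand-in for a custom op
        return (x @ w).tanh()

    x = torch.randn(4, 3, dtype=torch.double, requires_grad=True)
    w = torch.randn(3, 2, dtype=torch.double, requires_grad=True)

    # Raises an error if analytical and numerical gradients disagree,
    # otherwise returns True.
    print(gradcheck(my_op, (x, w), eps=1e-6, atol=1e-4))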



