Automatic Differentiation in Machine Learning: A Survey [pdf] (jmlr.org)
127 points by sonabinu on Nov 20, 2018 | 20 comments



AD reminds me of another idea: in cryptography, you can determine, e.g., how bit dependencies propagate through a state vector by running the regular code for your compression function (or whatever), but on "numbers" that actually track the dependencies of bits. Details vary, but a simple idea is to use as a "number" a vector of symbolic equations, one per bit, expressed in terms of symbolic variables for the function's input bits. You can then easily answer questions of interest, e.g.: after this many rounds, has every input bit affected every output bit?
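A rough sketch of what I mean, with sets of input-bit indices standing in for the symbolic equations (my own toy, the details obviously vary by construction):

    # Each bit of a "number" carries the set of input-bit indices that can
    # influence it, rather than a 0/1 value.
    class DepWord:
        def __init__(self, deps):
            self.deps = deps                      # one frozenset of input-bit indices per bit

        @classmethod
        def input(cls, width, offset):            # fresh input word: bit i depends only on itself
            return cls([frozenset([offset + i]) for i in range(width)])

        def __xor__(self, other):                 # XOR: output bit depends on both operands' bits
            return DepWord([a | b for a, b in zip(self.deps, other.deps)])

        def rotl(self, n):                        # rotate left: the dependency sets just move
            w = len(self.deps)
            return DepWord([self.deps[(i - n) % w] for i in range(w)])

    def add_deps(a, b):                           # addition: carries make bit i depend on all lower bits
        acc, out = frozenset(), []
        for x, y in zip(a.deps, b.deps):
            acc |= x | y
            out.append(acc)
        return DepWord(out)

    # Toy ARX-style round on two 8-bit words; stop once every output bit
    # depends on every input bit.
    x, y = DepWord.input(8, 0), DepWord.input(8, 8)
    full = frozenset(range(16))
    for r in range(1, 17):
        x = add_deps(x, y)
        y = y.rotl(3) ^ x
        if all(d == full for d in x.deps + y.deps):
            print("full diffusion after", r, "rounds")
            break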

So, jets instead of numbers for differentiation, systems of boolean equations instead of numbers for answering some crypto questions... anyone know of other tricks in this vein?


Taint checking: https://en.m.wikipedia.org/wiki/Taint_checking

Every external input to a program is assumed “tainted”. There are certain operations that make them “safe” (e.g. escaping a string makes it safe against SQL injection, etc.).

You simply attach this “tainted/safe” bit of information to each piece of data and flow it through your program. There is a simple calculus for combining these tainted/safe values, i.e. appending a tainted string to a safe string produces a tainted string, etc.

The goal is to prove that tainted values can’t make it to the business logic, so that your program is safe against certain kinds of malicious input.
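The whole calculus fits in a few lines. A toy illustration (my own sketch, not any particular tool's API):

    # Every value carries a tainted flag; combining values ORs the flags.
    class TStr:
        def __init__(self, s, tainted):
            self.s, self.tainted = s, tainted

        def __add__(self, other):                 # appending: tainted if either part is
            return TStr(self.s + other.s, self.tainted or other.tainted)

    def user_input(s):                            # external input is assumed tainted
        return TStr(s, tainted=True)

    def escape_sql(t):                            # a sanitizing operation clears the taint
        return TStr(t.s.replace("'", "''"), tainted=False)

    def run_query(t):                             # business logic refuses tainted data
        assert not t.tainted, "tainted value reached the query layer"
        print("executing:", t.s)

    name = user_input("Robert'); DROP TABLE students;--")
    query = (TStr("SELECT * FROM users WHERE name = '", False)
             + escape_sql(name) + TStr("'", False))
    run_query(query)                              # ok: the taint was cleared by escaping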


What you’re describing sounds like Abstract Interpretation[1], a technique used in static analyzers and compilers (e.g. for bounds-check elimination). You can pick any abstract domain you want, but the more complex the domain — i.e., the more complex/precise your tracking of the possible values associated with each variable — the longer it takes for the analysis to run.

[1]: https://en.m.wikipedia.org/wiki/Abstract_interpretation
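The textbook example is the interval domain: track a range for each variable instead of a concrete value, and the analysis can discharge a bounds check. A toy sketch of my own:

    class Interval:                        # abstract value: a range instead of a concrete number
        def __init__(self, lo, hi):
            self.lo, self.hi = lo, hi

        def __add__(self, other):          # abstract addition: add the endpoints
            return Interval(self.lo + other.lo, self.hi + other.hi)

        def __mod__(self, n):              # x % n for n > 0 always lands in [0, n-1]
            return Interval(0, n - 1)

    unknown = Interval(0, 2**31 - 1)       # "some nonnegative int, value unknown"
    idx = unknown % 16                     # whatever the input was, idx is in [0, 15]
    print(idx.lo, idx.hi)                  # so arr[idx] on a 16-element array needs no bounds check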


Abstract interpretation would be the equivalent of symbolic differentiation in this analogy.


The KLEE project[1] does something similar for general code, with symbolic variables.

[1] http://klee.github.io


I would be surprised if there were any crypto where not every output bit depended on every input bit (of course it depends on what precisely one means by "depends upon"). Otherwise, couldn't the search space be partitioned? See also Shannon's confusion and diffusion principles [1], as well as the avalanche effect [2].

[1] https://en.wikipedia.org/wiki/Confusion_and_diffusion

[2] https://en.wikipedia.org/wiki/Avalanche_effect


Sounds like higher-order differential cryptanalysis:

https://en.wikipedia.org/wiki/Higher-order_differential_cryp...


There’s definitely some work on tracking activations in a network for a given set of inputs. It’s a good black-box way to prune your network into a lighter-weight model.
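Roughly something like this (a made-up numpy sketch, not any specific paper's method): run a calibration set through a layer, look at the average activation per unit, and drop the units that almost never fire.

    import numpy as np

    W1, b1 = np.random.randn(64, 16), np.zeros(16)    # toy layer, 64 inputs -> 16 units
    X = np.random.randn(1000, 64)                      # calibration inputs

    acts = np.maximum(X @ W1 + b1, 0)                  # ReLU activations, shape (1000, 16)
    mean_act = acts.mean(axis=0)                       # average activation per unit
    keep = mean_act > 0.05 * mean_act.max()            # prune units that rarely fire

    W1_pruned, b1_pruned = W1[:, keep], b1[keep]
    print(f"kept {keep.sum()} of {len(keep)} units")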


Do you have a reference for the crypto trick you described?


Would you mind providing a reference to this technique? I recently got into theoretical cryptography and am quite interested in any analytical tools.


Another paper related to AD was posted last month: The Simple Essence of Automatic Differentiation [0]. Video presentation [1]. I found it pretty awesome, just in case you hadn't seen it.

[0]: https://news.ycombinator.com/item?id=18306860

[1]: https://www.youtube.com/watch?v=ne99laPUxN4


Also not mentioned in the Baydin et al paper (which has been knocking about for a while: it was submitted in 2017, so not surprising):

* F. Wang, X. Wu, G. Essertel, J. Decker, T. Rompf, Demystifying Differentiable Programming: Shift/Reset the Penultimate Backpropagator, see [1], and [4] for a presentation based on this paper.

* S. Laue, M. Mitterreiter, J. Giesen, Computing Higher Order Derivatives of Matrix and Tensor Expressions, see [2], discussed in [3].

Quite a bit of exciting work in better understanding backpropagation going on right now.

[1] https://arxiv.org/abs/1803.10228

[2] http://www.matrixcalculus.org/matrixcalculus.pdf

[3] https://news.ycombinator.com/item?id=18464003

[4] https://www.youtube.com/watch?v=igRLKYgpHy0


I skimmed the article and I enjoyed it. Recently, I was looking for a good reference to cite that discussed the link between automatic differentiation and back propagation. From those I know in the field, they have known about the link for a very, very long time, but I had a hard time finding a paper to cite. The best I found was an article titled, "Backwards Differentiation in AD and Neural Nets: Past Links and New Opportunities" by Paul Werbos, which can be found in the book "Automatic Differentiation: Applications, Theory, and Implementations" published in 2006 by Springer. That article is preceded by one written by Louis Rall titled, "Perspectives on Automatic Differentiation: Past, Present, and Future?", which also provides some good history of AD. Anyway, if someone has a better paper to cite linking back propagation and AD, I'd be interested to hear.


I thought the article itself did a good job of explaining it. Backpropagation is apparently just a special case of one way to do AD (the reverse mode). AD prescribes a method of taking derivatives, and if you apply it to a neural network, there you go: backpropagation.
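To make that concrete, here's a bare-bones reverse-mode sketch on a single neuron (my own toy code, not the paper's): each value remembers its parents and the local derivatives, and pushing adjoints back through the graph is exactly backprop.

    import math

    class Var:
        def __init__(self, value, parents=()):
            self.value, self.parents, self.grad = value, parents, 0.0

        def __mul__(self, other):
            return Var(self.value * other.value,
                       [(self, other.value), (other, self.value)])

        def __add__(self, other):
            return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

        def sigmoid(self):
            s = 1.0 / (1.0 + math.exp(-self.value))
            return Var(s, [(self, s * (1.0 - s))])

        def backward(self, adjoint=1.0):          # push the adjoint back to the parents
            self.grad += adjoint
            for parent, local_grad in self.parents:
                parent.backward(adjoint * local_grad)

    # One "neuron": y = sigmoid(w*x + b); backward() gives dy/dw and dy/db.
    w, x, b = Var(0.5), Var(2.0), Var(-1.0)
    y = (w * x + b).sigmoid()
    y.backward()
    print(y.value, w.grad, b.grad)                # 0.5, 0.5, 0.25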


Dual numbers are a favorite idea of mine lately. Nilpotence seems underappreciated.


Can you elaborate? What makes you think that nilpotence is under appreciated?


Well, in this case, dual numbers have this nilpotent constant epsilon, analogous to i, such that epsilon^2 = 0. It's the basis of automatic differentiation. I don't fully understand it yet, but it seems that nilpotence is useful for smoothness and differentiability.


I think you are thinking about this the wrong way. Nilpotence doesn't relate to smoothness (of what?) or differentiability (of what?). Dual numbers happen to, the way I see it, model forward-mode differentiation. There are a couple of ways to see why eps^2 = 0 is a good model, but the easiest to get across is that in many calculus derivations you take dx^2 to be 0, since dx is "a very small number".
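To make the eps^2 = 0 point concrete, a minimal sketch of my own: the coefficient of eps carries the derivative along for the ride, and dropping the eps^2 term in products is exactly the product rule.

    class Dual:
        def __init__(self, real, eps=0.0):
            self.real, self.eps = real, eps

        def __add__(self, other):
            return Dual(self.real + other.real, self.eps + other.eps)

        def __mul__(self, other):
            # (a + b*eps)(c + d*eps) = ac + (ad + bc)*eps, since the eps^2 term vanishes
            return Dual(self.real * other.real,
                        self.real * other.eps + self.eps * other.real)

    def f(x):
        return x * x * x + x            # f(x) = x^3 + x, so f'(x) = 3x^2 + 1

    x = Dual(2.0, 1.0)                  # seed the derivative direction: eps coefficient 1
    y = f(x)
    print(y.real, y.eps)                # 10.0 13.0, i.e. f(2) and f'(2) in one pass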


They discuss this on page 11


Great, but machine learning algorithms are changing too quickly these days.



