Great post, as an AD tutorial and as a (an?) Haskell exercise. Having known nothing about AD before, I feel like I have a good understanding of what it is -- as he says, it's so simple -- but I don't understand why the algorithm is so much faster. Just looking at the differentiator function and the AD function, it actually appears that the AD should take longer because it does more computation per step (both the function and the derivative). But it seems like every article or paper is talking about how to implement AD, not why the algorithm is so efficient. Does anyone happen to know of a good article or paper about that? Ideally, one just as nice and comprehensible as this!
The first alternative builds a large tree structure, and then evaluates the whole tree structure afterwards.
So it first blows up the size of the expression to process and then it calculates the result. A lot of those calculations will be redundant.
The second one not only avoids evaluating the tree separately, but "prunes" a lot of the evaluation automatically by effectively short-circuiting the whole process. Consider (with the caveat that my Haskell understanding is rudimentary at best and it has been about 20 years since I last did anything involving symbolic differentiation):
Product e e' -> Product e (diff e') `Sum` Product (diff e) e'
Followed by a separate run with:
Product e e' -> ev e * ev e'
vs
Product e e' -> let (ex, ed)   = ev e
                    (ex', ed') = ev e'
                in (ex * ex', ex * ed' + ed * ex')
(I pick the "Product" rule as an example because it is one of the ones that blows up the size of the tree)
Let's say you do something simple like Product X X. You get Sum (Product X One) (Product One X) out, and then you have to evaluate each node.
In the second case, you match Product e e'. You process X and assign (x,1) to (ex,ed), and process the second X and assign (x,1) to (ex', ed'), and then return (ex * ex', ex * ed' + ed * ex').
In the first case, you've first differentiated 3 nodes (Product + 2x "X"), then evaluated the 7 nodes that were produced as output, for a total of ten nodes processed.
In the second case you've evaluated/differentiated 3 nodes in one go without the intermediate step of having to evaluate a much larger tree.
In a large example, the number of nodes in the differentiated output quickly explodes and subsequent evaluation would increase rapidly in cost.
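For concreteness, here's a rough, runnable transcription of the two approaches (the constructor and function names are my guesses and may not match the article exactly):

  data Expr = X | Zero | One | Sum Expr Expr | Product Expr Expr deriving Show

  -- First approach: differentiate symbolically into a (bigger) tree, then evaluate it.
  diff :: Expr -> Expr
  diff X              = One
  diff Zero           = Zero
  diff One            = Zero
  diff (Sum e e')     = Sum (diff e) (diff e')
  diff (Product e e') = Product e (diff e') `Sum` Product (diff e) e'

  ev :: Double -> Expr -> Double
  ev _ Zero           = 0
  ev _ One            = 1
  ev x X              = x
  ev x (Sum e e')     = ev x e + ev x e'
  ev x (Product e e') = ev x e * ev x e'

  -- Second approach: compute value and derivative together in a single pass.
  evAD :: Double -> Expr -> (Double, Double)
  evAD _ Zero           = (0, 0)
  evAD _ One            = (1, 0)
  evAD x X              = (x, 1)
  evAD x (Sum e e')     =
    let (a, a') = evAD x e
        (b, b') = evAD x e'
    in  (a + b, a' + b')
  evAD x (Product e e') =
    let (ex, ed)   = evAD x e
        (ex', ed') = evAD x e'
    in  (ex * ex', ex * ed' + ed * ex')

  -- d/dx (x*x) at x = 5 is 10 either way:
  viaTree, viaAD :: Double
  viaTree = ev 5 (diff (Product X X))
  viaAD   = snd (evAD 5 (Product X X))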
From an asymptotic complexity viewpoint, I don't see any difference between the two algorithms (AD versus building an expression tree and doing it symbolically, then evaluating). Both are linear in the "size" of the expression. So I don't understand what you mean by "quickly explodes".
In American English, this would be "a Haskell exercise," pronouncing "a" to rhyme with "hey" or "ah." In British English, it would be "an 'askell exercise."
I'm American, though. British people, please weigh in.
> For fans of Lisp, there is no question that one motivation is to show how easy it is to implement in Lisp. Lisp provides a natural representation for programs as data and a natural form for writing programs that write programs, which is what we do in ADIL. The code is short, and is in ANSI standard Common Lisp. Since it is not "naive" code written in just the obvious idioms of introductory Lisp, it illustrates, for those with only a cursory familiarity, that Lisp is more than CAR and CDR. In fact we did not use those routines at all.
Ah, but you did use A and D. cAr-tomatic cDr-ivation!
Algebraically, automatic differentiation is the same as adjoining a nilpotent element e, with e^2 = 0, to your algebra. You can continue this pattern to get higher-order derivatives. For example, if you instead use an element f where f^3 = 0, the coefficient of f^2 is proportional to the second derivative.
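A small sketch of that idea taken one step further (my own names, not from the article): arithmetic on triples that carry a value, first derivative and second derivative, i.e. working modulo e^3 = 0.

  data Jet = Jet Double Double Double deriving Show   -- value, f', f''

  instance Num Jet where
    Jet a a' a'' + Jet b b' b'' = Jet (a + b) (a' + b') (a'' + b'')
    Jet a a' a'' - Jet b b' b'' = Jet (a - b) (a' - b') (a'' - b'')
    -- Leibniz: (fg)'' = f'' g + 2 f' g' + f g''
    Jet a a' a'' * Jet b b' b'' =
      Jet (a * b) (a' * b + a * b') (a'' * b + 2 * a' * b' + a * b'')
    fromInteger n = Jet (fromInteger n) 0 0
    abs    = error "not needed for this sketch"
    signum = error "not needed for this sketch"

  -- Seed the variable with dx/dx = 1 and d2x/dx2 = 0.
  variable :: Double -> Jet
  variable x = Jet x 1 0

  -- x^3 + 2x at x = 3: value 33, first derivative 29, second derivative 18.
  example :: Jet
  example = let x = variable 3 in x * x * x + 2 * x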
This example makes it really clear what's going on. Could someone translate it to do reverse automatic differentiation? That's the one I never quite understand.
You mean integration? That's much harder, but has been done automatically. Symbolic differentiation is easy, because you can just keep blindly applying a set of rewrite rules until none of them apply. That process converges on a unique result. Symbolic integration doesn't converge in that way. More strategy is required, and you're not guaranteed a closed form solution. Mathematica has a good symbolic integrator.
The short answer is no, for a couple of reasons. First, you have to distinguish whether you want the antiderivative or the integral. In the first case, the problem with finding the antiderivative is that it's not unique. For example, the antiderivative of x can be 0.5 * x^2 or 0.5 * x^2 + 1 or 0.5 * x^2 + 2, etc. Basically, taking derivatives can lose information and we don't know how to put that information back. In the second case, we have the integral, but that creates problems because the integral really needs to have a domain defined to make sense. Basically, over what values are we integrating? In 1-D we can specify this with a bunch of intervals, but in 2 and higher dimensions, it becomes difficult. In fact, this is partially why we have meshing tools like Cubit or gmesh. Yes, we use them with finite element methods, but really, at its core, we need them to break up a domain because we want to integrate the pieces.
Anyway, with enough information or on a simple enough setup, both can actually be done, but the general case isn't super well defined, which is why the simple answer is no.
I read the parent (as does 'tome in a neighboring thread) to be asking about reverse-mode automatic differentiation (where what is explained in TFA is forward-mode), not about computing antiderivatives.
The thing that's great about the typeclass approach is that you can do anything you want behind the implementation. You can numerically evaluate the expression, but even cooler, you can recover the parse tree. I never could sort out how to deal with 'if', because it's not typeclassed. If it were, boy could you do some amazing stuff: partial differentiation, tree rewriting, and with the LLVM stuff you could runtime-compile arbitrary functions. Super neat trick.
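A rough sketch of the parse-tree trick (the type and names here are mine, not the article's): the same Num-polymorphic function can be run on Doubles to get a number, or on an expression type to get its syntax tree back.

  data Expr = Var String | Lit Integer | Add Expr Expr | Mul Expr Expr | Neg Expr
    deriving Show

  instance Num Expr where
    (+)         = Add
    (*)         = Mul
    negate      = Neg
    fromInteger = Lit
    abs    = error "not represented in this sketch"
    signum = error "not represented in this sketch"

  -- One polymorphic definition...
  f :: Num a => a -> a
  f x = x * x + 3 * x + 1

  -- ...two interpretations:
  asNumber :: Double
  asNumber = f 2.0        -- 11.0

  asTree :: Expr
  asTree = f (Var "x")
  -- Add (Add (Mul (Var "x") (Var "x")) (Mul (Lit 3) (Var "x"))) (Lit 1)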
That's also what's great about pure dynamic systems like Smalltalk. 'ifTrue:' is just a message-send, the class "False" doesn't evaluate the block parameter, the class "True" does. And yes, you can then recover the parse tree, for example the Gemstone OODB lets you use arbitrary blocks (closures) as queries, recovers the parse tree and then creates an optimized query from it. Quite neat.
Could automatic differentiation learn by applying learnable weights to the compiled S-expression atoms, i.e. backpropagating errors with stochastic gradient descent?
A program would constrain a task with functional statements, which is then compiled to weighted S-expressions that learn the specific task from training data.
Dual numbers are just a means of formalizing the properties of an epsilon (a tiny change in calculus), and are the means of preserving enough information to think of the function and its derivative at the same time. E.g. (x + ε)² = x² + 2xε + ε², but ε² = 0, so we get x² + 2xε (a tiny change squared becomes an insignificantly tiny change).
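A minimal rendering of that in code (names mine): pairs a + b*ε where multiplication simply drops the ε² term.

  data Dual = Dual Double Double deriving Show   -- value, coefficient of eps

  instance Num Dual where
    Dual a a' + Dual b b' = Dual (a + b) (a' + b')
    Dual a a' - Dual b b' = Dual (a - b) (a' - b')
    Dual a a' * Dual b b' = Dual (a * b) (a * b' + a' * b)  -- eps^2 term vanishes
    fromInteger n         = Dual (fromInteger n) 0
    abs    = error "not needed here"
    signum = error "not needed here"

  -- (x + eps)^2 at x = 5: Dual 25 10, i.e. x^2 and its derivative 2x.
  squareAt5 :: Dual
  squareAt5 = let x = Dual 5 1 in x * x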
I'm a little late to the party, but hopefully this'll explain.
Basically, you have to be careful about what it means to scale or not scale. If all you want is a derivative with respect to a single variable, forward mode scales just fine, great in fact. However, if you want the gradient, or the derivative with respect to every variable, then the forward mode does not scale well at all with respect to the number of variables. Specifically, assume we have m variables. Calculating the derivative of an expression with respect to one variable costs 2 times a function evaluation, 2 * eval. In order to see this, it's easiest to note that we don't need an expression tree for forward mode AD like the article uses. Really, we can get away with just a tuple that contains the function evaluation as the first element and the partial derivative as the second element. Then, all of the rules are basically the same as the article, but we're always doing one operation on the first element, whatever the function is, and a different operation on the second element for the partial derivative. This is twice the work, so 2 * eval. Since we have m variables, the full gradient becomes 2 * m * eval. And, yes, memory layouts, fewer optimizations for algebraic data types compared to floats, etc. mean that it's actually slower, but, honestly, it's pretty fast.
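Here's a sketch of that "just a tuple" forward mode and of where the factor of m comes from (names are mine, and this is just an illustration, not a serious implementation):

  data Fwd = Fwd Double Double deriving Show   -- (value, directional derivative)

  instance Num Fwd where
    Fwd a a' + Fwd b b' = Fwd (a + b) (a' + b')
    Fwd a a' - Fwd b b' = Fwd (a - b) (a' - b')
    Fwd a a' * Fwd b b' = Fwd (a * b) (a * b' + a' * b)
    fromInteger n       = Fwd (fromInteger n) 0
    abs    = error "not needed for this sketch"
    signum = error "not needed for this sketch"

  -- One forward pass per coordinate direction: this is the factor of m.
  grad :: ([Fwd] -> Fwd) -> [Double] -> [Double]
  grad f xs =
    [ d | i <- [0 .. length xs - 1]
        , let seeded = [ Fwd x (if j == i then 1 else 0) | (j, x) <- zip [0 ..] xs ]
        , let Fwd _ d = f seeded ]

  -- f(x,y) = x*y + x*x has gradient (y + 2x, x) = (10, 3) at (3, 4).
  example :: [Double]
  example = grad (\[x, y] -> x * y + x * x) [3, 4]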
The reverse mode is different because it turns out that it can calculate the entire gradient, or all m partial derivatives, with 4 * eval cost. Note, this is independent of the number of variables. Proving this is a pain, so I can't give a good explanation here. Realistically, source code transformation tools perform around 10-20 * eval. Operator overloading tools perform around 20-30 * eval, so it's slower in practice, but pretty damn good.
Now, unlike the forward mode, where we really only need a tuple to carry information, the reverse mode does require an expression tree. In order to understand why, it helps to note that the forward mode is really a directional (Gateaux) derivative and the reverse mode is the total (Fréchet) derivative. This affects how the chain rule manifests. Specifically, the forward mode repeatedly applies two rules:
(f o g)'(x) dx = f'(g(x)) g'(x) dx
(f o (g,h))'(x) dx = f'(g(x),h(x)) (g'(x)dx,h'(x)dx)
Basically, in the function evaluation, we do some operation g before f. In order to figure out the derivative, we also do the g derivative operation before the f derivative operation. The first rule is for unary operations like negation and the second rule is for binary operations like addition. Anyway, the reverse mode takes the Hilbert adjoint of this. Specifically:
(f o g)'(x)^* = g'(x)^* f'(g(x))^*
(f o (g,h))'(x)^* = [g'(x)^* h'(x)^* ]f'(g(x),h(x))^*
We care about the adjoint because of a trick from the Riesz representation theorem. Specifically,
f'(x)dx =
(f'(x)dx)1 =
<f'(x)dx,1> =
<dx,f'(x)^* 1> =
<dx,grad f(x)>
where <.,.> denotes the inner product. Anyway, basically the gradient of f is the adjoint of the total derivative of f applied to 1. Therefore, if we knew the adjoint of a computation applied to 1, we'd get the gradient. In other words, we can rewrite the chain rule above as
grad (f o g)(x) = g'(x)^* grad f(g(x))
grad (f o (g,h))(x) = [g'(x)^* h'(x)^* ]grad f(g(x),h(x))
That's the core of reverse mode AD. Note, many, if not most, descriptions of reverse mode AD talk about doing the chain rule in reverse and then they add dual variables, etc. That may be a description that's helpful for some, but not for me. In truth, it's just a bunch of adjoints applied to one and knowing the Riesz representation trick.
Now, reverse mode AD does require an expression tree to be kept. The reason for this is that the computation above did g before f. However, if we look at the chain rule we have
grad (f o g)(x) = g'(x)^* grad f(g(x))
This means that in order to calculate the gradient of the composition, we need to know the gradient of f first even though we did the evaluation of g first. However, we need to know the evaluation of g in order to calculate the gradient of f. The way we resolve this is that we evaluate the functions in order, but keep an expression tree of what we did. This gives all of the g(x), f(g(x)), etc. Then, we run over that expression tree backward to calculate all of the gradients. Because we run over the expression tree backwards, we call this the reverse mode.
How we run over the expression tree backwards is important and tricky to do right. The way to see that we can do everything at 4 * eval cost is to note that the trick is not to create multiple vectors to store the gradient when running over the tree, but to have one vector and to update it with the new derivative information when required. Basically, we're just inserting information in the right spots, which can be done efficiently. In practice, storing the expression tree in memory can be really expensive. For example, imagine a for-loop that had 10 billion iterations. That's a really long expression tree to hold in memory. Now, source code transformation tools are really clever and don't actually store all of those expressions in memory, but just run the for-loop backwards, which is why they're more efficient. Operator overloading techniques (algebraic data types) can technically optimize this as well by doing some interesting caching techniques. However, the overall idea is that it can be expensive and there are lots of ways to do this wrong, but also lots of places to do things right and be creative.
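For what it's worth, here's a deliberately naive sketch of a reverse sweep (names are mine; it re-evaluates subtrees during the backward pass instead of reusing values stored on a tape, so it does not achieve the 4 * eval bound, but it shows the adjoint being seeded with 1 at the root and pushed back to the variables):

  import qualified Data.Map as M

  data Expr = Var String | Const Double | Add Expr Expr | Mul Expr Expr
    deriving Show

  -- Forward sweep: plain evaluation.
  eval :: M.Map String Double -> Expr -> Double
  eval env (Var x)   = env M.! x
  eval _   (Const c) = c
  eval env (Add a b) = eval env a + eval env b
  eval env (Mul a b) = eval env a * eval env b

  -- Backward sweep: 'adj' is the sensitivity of the final result to this node.
  -- Add passes it through unchanged; Mul scales it by the other factor.
  backprop :: M.Map String Double -> Double -> Expr
           -> M.Map String Double -> M.Map String Double
  backprop _   adj (Var x)   grads = M.insertWith (+) x adj grads
  backprop _   _   (Const _) grads = grads
  backprop env adj (Add a b) grads = backprop env adj b (backprop env adj a grads)
  backprop env adj (Mul a b) grads =
    backprop env (adj * eval env a) b (backprop env (adj * eval env b) a grads)

  gradient :: M.Map String Double -> Expr -> M.Map String Double
  gradient env e = backprop env 1 e M.empty

  -- f(x,y) = x*y + x at (x=3, y=4): fromList [("x",5.0),("y",3.0)]
  example :: M.Map String Double
  example = gradient (M.fromList [("x", 3), ("y", 4)])
                     (Add (Mul (Var "x") (Var "y")) (Var "x"))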
As an aside to a comment left above, back propagation is indeed just reverse mode AD combined with a non-globally convergent version of steepest descent. I've never seen a paper that worked this out, but it's something that's known within the AD community. Someone, someday, should really write that down.
Anyway, that's probably a much too long response to your simple question. In short, forward mode doesn't scale when calculating gradients because the cost is 2 * m * eval whereas the reverse mode can do it in 4 * eval. For a single variable, or an entire directional derivative, the forward mode scales fine and in fact works better than the reverse mode for this case.
Edit: This formatting is killing me. Hopefully, it all looks fine now.
http://matt.might.net/articles/parsing-with-derivatives/