Hacker News
Automatic Differentiation in Machine Learning: A Survey (2018) [pdf] (jmlr.org)
83 points by lambdamore on March 7, 2019 | 48 comments



I'm trembling with excitement at the prospect that easy-to-use, high-performance automatic differentiation looks likely to become a "must have" capability for more and more computer languages. It's going to become easier and easier to specify objective functions and have the computer optimize programs for a wider range of applications.


I work on non-linear optimization problems for most of my work projects, and I know exactly what you mean. One of these years we'll be freed from remembering calculus I :)


Right after the year of the Linux desktop, eh?


The idea of duality is insane. It's present in linear logic, automatic diff, constructive mathematics, probability, quantum theory, game design, discrete optimization, etc.

I wonder if it's related to chirality.


I think characterizing duality in this way is kind of superfluous, because the only way all those meanings of duality are the same is in the most abstract sense of the word.

In other words, lots of things have duals. But the duality between any given pair of things doesn't necessarily expose any deep, fundamental connection to another pair of things which have duality. So it's not that duality features so heavily throughout mathematics as its own concept; rather, we frequently build new theories to tie these things together. It's helpful to be able to translate things from one context to another context.

We could just as easily say that isomorphisms are insane because they feature heavily throughout mathematics. But I don't think that provides a deep insight, because it's not like an isomorphism is a special property that ties a bunch of mathematics together in a grand way. Specific pairs of things can be isomorphic. Likewise specific pairs of things can be duals.

Any given pair of dual things is its own duality. It doesn't necessarily have anything to do with the way another pair of objects is in duality. The terminology here is semantically convenient for intuition, but it's definitely overloaded. I think the commonalities you're seeing here are simply due to the vast utility of linearity in all of those disciplines.


> I think characterizing duality in this way is kind of superfluous,

It's an analysis done out of necessity. These dualities might not match 100% in every case, but maybe I care about the ways in which they are similar.

> because the only way all those meanings of duality are the same is in the most abstract sense of the word.

So is a monad. Do you think that in the future, the level of abstraction in mathematics is going to increase or decrease?


It will increase, which I guess is sort of my point. We already know there's a lot of abstraction. If these things are only alike semantically (two pairs of dual things can be completely unrelated), what does it gain you to point out they're everywhere?

I don't mean to be obtuse, but it strikes me as saying that a city is full of concrete.


You can think of it as a really nice intermediate language.

It might be hard to build a computer system that lets you reason about both probability and, say, quantum mechanics.

However, it might be easier to build a system that phrases a probabilistic problem in terms of duality and then solves it.

Like, why should each of these have its own foundation when there's one that captures a lot of them?


Those are a bunch of different ideas. It's the word "duality" that is insane(ly overloaded).


They aren't; the underlying idea is the same, or very similar. Pick two of them and I'll provide a reference on the connection.


Okay, automatic differentiation and probability.


https://math.stackexchange.com/questions/2593296/why-is-prob...

Automatic differentiation is something that you get for free if you use dual numbers. Lie theory describes the relationship between the discrete and continuous spaces. Probability has this deep connection to Lie groups.

To give you some intuition (and I'm really rephrasing the stackexchange post above), the only way you can measure randomness (or generate randomness) is if each draw has a "reference" to some global object that has the global, normalized view of the discrete space.

Are you familiar with Vovk's foundations of probability? That brings you from probability to game semantics. Duality is right next to it.


Wait, what?

> Automatic differentiation is something that you get for free if you use dual numbers. Lie theory describes the relationship between the discrete and continuous spaces. Probability has this deep connection to Lie groups.

This...doesn't follow. Probability has a connection to Lie groups because it's fundamentally analytic ("continuous"). But you haven't explained how you make the connection to the dual numbers.

What you're showing here is that a lot of things in mathematics can be described analytically (and saying that would be likewise pretty superfluous). But just because you're working with continuous spaces doesn't mean you've engaged the duals. It generally means you're using the reals.

This gets to the heart of what I'm saying - if I wanted to be flippant I could have said the real numbers, or continuity, or analysis, etc. are at the heart of so many distinct subfields of mathematics. That wouldn't mean very much.

Duality features in a lot of different parts of mathematics, but that doesn't mean you can productively draw connections between dual things in one area and dual things in another. I'm not seeing how you get from dual numbers to Lie groups.


> Probability has a connection to Lie groups because it's fundamentally analytic ("continuous"). But you haven't explained how you make the connection to the dual numbers.

Are you familiar with Chu spaces?

> But just because you're working with continuous spaces doesn't mean you've engaged the duals. It generally means you're using the reals.

They are not just continuous spaces; it's a pair of a discrete space and a continuous space that are directly connected. You never work with only one of them at a time. You manipulate things in the smooth space to solve things in the discrete space and vice versa.

> This gets to the heart of what I'm saying - if I wanted to be flippant I could have said the real numbers, or continuity, or analysis, etc. are at the heart of so many distinct subfields of mathematics. That wouldn't mean very much.

Analysis is too general and sits much higher conceptually. You also need to be looking at constructive mathematics to really capture duality, and analysis is unusable for a lot of problems that duality is useful for.

For example the Rust borrow checker is based on linear logic, a logic that reifies the concept of duality. No one has ever used analysis to build a compiler.


Discrete optimization and automatic differentiation.


Forward AD is the pushforward of a tangent vector (an element of the tangent space); reverse AD is the pullback of a cotangent vector (an element of the cotangent space). The duality notion between tangent and cotangent spaces is the same as the duality notion of spaces in optimization. Unfortunately, I'm only passingly familiar with discrete optimization, but I would suspect the notion extends from continuous optimization. That's not to say that they are fundamentally the same or that writing this down helps anybody in any way, but a lot of these "dual" notions do have some sort of dual vector space under the hood.
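As a concrete sketch of that pairing (using JAX purely for illustration; any AD system with jvp/vjp primitives would do), forward mode pushes a tangent vector through the Jacobian, while reverse mode pulls a cotangent vector back:

    import jax
    import jax.numpy as jnp

    def f(x):
        # a small R^2 -> R^2 function to differentiate
        return jnp.array([x[0] * x[1], jnp.sin(x[0])])

    x = jnp.array([1.0, 2.0])

    # Forward mode: push the tangent vector v forward, computing J(x) @ v.
    v = jnp.array([1.0, 0.0])
    y, tangent_out = jax.jvp(f, (x,), (v,))

    # Reverse mode: pull the cotangent vector w back, computing w @ J(x).
    y2, vjp_fn = jax.vjp(f, x)
    (cotangent_out,) = vjp_fn(jnp.array([1.0, 0.0]))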


Yeah, but all you're really describing here is linear algebra. Vector spaces and linearity are a significant part of every single discipline the grandparent commenter mentioned, but they picked out duality.

I would agree with the critique: I don't think highlighting duality here is particularly useful. For example, the way dual numbers are used to extend the reals for automatic differentiation doesn't have a deep connection to duality in vector spaces. It's just a very general semantic concept that describes pairs of things. But it doesn't say that any given pair of dual things is related to another pair of dual things.


> For example, the way dual numbers are used to extend the reals for automatic differentiation doesn't have a deep connection to duality in vector spaces.

They don't, because certain operations, such as optimization, are hard to reason about in linear spaces.

Don't get me wrong, I'm not shitting on vector spaces. All I'm saying is that some problems that are hard to do in vector spaces are easy in the smooth spaces and vice versa. Having these two APIs to the same space is much more powerful, because again, you generalize over the conversions between the two spaces. You use whichever API is more appropriate in the particular context.

In some sense the linear spaces deal with things like infinity, while the smooth spaces deal with cyclical things (signals, wavelets, modular arithmetic).


Quite a bit of optimization is easy to reason about in linear algebra. Take linear and mixed integer programming, for example. And convex optimization subsumes linear optimization in general. There is a lot of nonlinear optimization, but I can assure you with extremely high confidence that the common thread you're seeing here isn't duality, but more abstractly linearity.

Likewise cyclic things show up all the time in purely algebraic (read: discrete, non-smooth) contexts. We have that in vector spaces, group theory, rings, modules, etc.


They show up separately but not in tandem.

The canonical example is robotic motion and the reason why Lie theory is used there. You have very discrete states (positions) that you want to interpolate between smoothly.


> For example, the way dual numbers are used to extend the reals for automatic differentiation doesn't have a deep connection to duality in vector spaces.

Yes, the right way to think about dual numbers (especially once you generalize them beyond just the single e^2 = 0) is to think of them as tangent vectors (sections of the tangent bundle). I've never really liked the "dual number" terminology here. That's why I deliberately chose to use the duality of forward- and reverse-mode AD, because that notion of duality agrees with the underlying linear algebra (or, in general, differential geometry). I do agree it's a mess of terminology.


Gimme five and I'll answer two. There are quite a few pairwise combinations, and some are easier to understand and more instructive than others.

Fundamentally, they are both connected via the idea of convex optimization. Automatic differentiation is a computational technique to solve optimization problems.

Yes, optimization is a very general class of problems, but calculus is a fundamental tool for it. Dual numbers are somewhat like Lie groups: very smooth and conducive to optimization.


Curious, can you expand on the connection to convex optimization? To my understanding, discrete optimization is nonconvex by nature due to discontinuities in the feasible space.


There are two types of spaces, discrete and continuous. These are in a dual relationship.

Duality is the isomorphism between these two. For example, for humans, it's easier to reason about discrete spaces. However a lot of things simply cannot be done that way.

Think of anything that is tangential (pun intended) to Lie theory. In the context of Lie theory, you have the discrete group and the continuous algebra (the group's tangent space). You go between these two using the logarithm (group -> algebra) and the exponential to go back (algebra -> group).
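A minimal numerical sketch of that round trip (my own illustration), using 2D rotations as the group and skew-symmetric matrices as its tangent-space algebra; expm/logm here are just scipy's matrix exponential and logarithm:

    import numpy as np
    from scipy.linalg import expm, logm

    G = np.array([[0.0, -1.0],
                  [1.0,  0.0]])   # generator of the algebra so(2)

    A = 0.7 * G                   # algebra element (a tangent vector)
    R = expm(A)                   # exp: algebra -> group (rotation by 0.7 rad)
    A_back = logm(R)              # log: group -> algebra

    assert np.allclose(A, np.real(A_back))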

It's the difference between an integral and a Riemann sum. It's the fundamental idea that underlies sampling (say, audio sampling or even statistical sampling). You capture some invariants and then you interpolate between these invariants to recreate some smooth curve (or distribution).

The nice thing about the smooth space is that optimization is easy. In the exponential space, addition becomes multiplication, and some expensive things become cheap (computationally speaking).


Unless I'm severely misunderstanding you, a discrete set (or function) cannot be a dual of a continuous set (or function). If nothing else, the former is countable and the latter is uncountable; there can be no isomorphism between the two.


> Unless I'm severely misunderstanding you, a discrete set (or function) cannot be a dual of a continuous set (or function). If nothing else, the former is countable and the latter is uncountable; there can be no isomorphism between the two.

There is. Discrete samples are samples of the continuous space. In order to capture the continuous space, you only really need to capture a particular set of samples that you then interpolate between.

The way I interpret isomorphism in this context is whether you can capture one space in the other and then convert back without a loss of information.

Imagine a polynomial (in the smooth space). You can capture a particular set of points that uniquely determines the polynomial. In some circumstances you can use these samples to reconstruct the original polynomial by interpolating between the points.
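For instance (a numpy sketch of this claim, with the caveat that you must know the degree in advance): a cubic is pinned down by any four distinct sample points, and interpolation recovers it exactly.

    import numpy as np

    coeffs = np.array([2.0, -1.0, 0.5, 3.0])   # 2x^3 - x^2 + 0.5x + 3
    xs = np.array([-1.0, 0.0, 1.0, 2.0])       # any 4 distinct points
    ys = np.polyval(coeffs, xs)                # the discrete samples

    recovered = np.polyfit(xs, ys, deg=3)      # interpolate back
    assert np.allclose(recovered, coeffs)      # the cubic is recovered exactly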


Okay, I think I understand what you're getting at. But if you've taken a set of points from a continuous set (like an interval on the reals) and you can put those in bijection with a discrete set, then by definition your subset of the continuous set isn't continuous. It must be discrete.

More succinctly, you actually can't draw an isomorphism between discrete and continuous spaces without losing information from the continuous space.


Here's the thing: your description of the continuous set is already discretized. If we say "the interval 4-6", we have captured the continuous space using only two numbers.

I know that this is a silly argument in some sense, but this is something you do so naturally that you don't even think about it.

Do you see what I'm getting at? You capture extrema in the discrete space and then interpolate in the smooth space to recreate the smooth curve.


> Imagine a polynomial (in the smooth space). You can capture a particular set of points that uniquely determines the polynomial. In some circumstances you can use these samples to reconstruct the original polynomial by interpolating between the points.

This is not duality though and you do lose information.

For instance, let's say one has a cubic polynomial and one samples 5 points from it and stores those points. If one didn't know the original order was cubic, and one tried to interpolate over the 5 points to fit a quintic, that would be an incorrect reconstruction (5 points don't even determine a quintic uniquely).


You don't lose information if you pick your points correctly (you store only the extrema). In the cubic case, you need the two extrema (one minimum and one maximum, and you need to know which is which) and then interpolate between them.


Unfortunately this is incorrect. Extrema do not always exist (consider y=x^3) and they do not uniquely define a polynomial (y=x^2 and y=x^4 both have minima at x=0).


Unfortunately this is incorrect. Those minima are not the same. Remember that dual points have a real part and a dual part that indicates the rate of change at that point. The real part is the same but the dual is different.


My point is it’s not possible to uniquely reconstruct an arbitrary polynomial by just knowing the extrema because there may be information loss in the general case. I will stop here.


It is possible if you know the rate of change, which you do with dual points. Like, you don't interpolate just the position but also the dual parts, i.e. the rate of change.
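For the narrow version of that claim, the math does work out: two sample points carrying both values and derivatives give four conditions, which pin down a cubic uniquely. A scipy sketch (my own illustration, not the parent's code):

    import numpy as np
    from scipy.interpolate import CubicHermiteSpline

    p = np.poly1d([2.0, -1.0, 0.5, 3.0])       # some cubic
    dp = p.deriv()

    xs = np.array([0.0, 1.0])                  # two points: values + derivatives
    spline = CubicHermiteSpline(xs, p(xs), dp(xs))

    grid = np.linspace(0.0, 1.0, 9)
    assert np.allclose(spline(grid), p(grid))  # exact reconstruction on [0, 1]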


I’m not sure. I’m not entirely convinced that discrete and continuous spaces are dual spaces. They are connected, but they are not duals.

Same with sampling vs continuous. One cannot interchange the order of the composing morphisms while preserving the properties of the original. The sampled object cannot reconstruct the continuous object in all situations due to effects like aliasing.

In optimization, the concept of duality is also a much stronger idea: the primal and the dual of a problem are opposing views of the same problem that correspond exactly (not approximately) in their dual properties.

Discrete optimization is nonconvex by nature (does not satisfy convexity definitions) so I’m not sure if it has any duality relations to convex optimization. There is a relationship but it is not a dual relationship.


Look into Chu spaces.

> One cannot interchange the order of the composing morphisms while preserving the properties of the original.

Good observation, one really can't, but that was never a hard requirement, right? Ordering actually becomes more interesting, because you can have properties like anti-commutativity (https://en.wikipedia.org/wiki/Anticommutativity), which is a lot more useful than commutativity. The Lie bracket is anti-commutative, btw.

> In optimization, the concept of duality is also a much stronger idea: the primal and the dual of a problem are opposing views of the same problem that correspond exactly (not approximately) in their dual properties.

My view is more general. The difference between these spaces lies in the idea of choice and the idea of adversarial choice. You are correct, they are opposing views, like two players playing a game.

I control my moves. I do not have control over my opponent's moves, but I do have knowledge of my opponent's potential moves. Therefore I can do some sort of min-max optimization to figure out my optimal play, given my situation and knowing my opponent's options.


But it is. Duality requires commutativity of composition.

It sounds like the ideas you've put forward confound duality with something else, perhaps transformations. You may be digging a hole here.


> Duality requires commutativity of composition.

Not necessarily.

No, I’m not confounding it.


I think the dualities in quantum field theory are different from the dualities in optimization, but maybe some category theorist can correct me if I'm wrong.


Category theory isn't quite the correct formalism.


What would be then?


I like linear logic a lot. I'm not saying it solves everything, but it's easier to understand and you don't really give up much of what category theory offers.

Also alternating graphs or game semantics.

These are all related.


For those of us who are not aware of duality in automatic differentiation (and haven't had a chance to read the above survey), could you introduce the idea? Are you talking about forward mode vs reverse mode -- since I haven't pondered it, what's interesting/deep about that?


Forward-mode AD relies on dual numbers, which are well suited to doing calculus.

Structurally, dual numbers are numbers of the form a + b * e, where e is an epsilon such that e^2 = 0 but e != 0 (think of it like the imaginary unit i, but instead of i^2 = -1 you have e^2 = 0).

For example, multiplication of two dual numbers: (a + b * e)(c + d * e) = ac + ad * e + bc * e + bd * e^2. Since e^2 = 0, you end up with ac + (ad + bc) * e.

Here comes the magic. Dual numbers let you evaluate a function and get its derivative at that point just by evaluating the function. In the above example, if that product were the output of a function, ac is the value of the function and ad + bc is its derivative at that point.

https://blog.demofox.org/2014/12/30/dual-numbers-automatic-d...
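Here is a minimal Python sketch of that (my own toy code, not from the linked post); seeding the dual part with 1 makes the derivative fall out of ordinary evaluation:

    class Dual:
        # a + b*e with e^2 = 0; val is the value, dot carries the derivative
        def __init__(self, val, dot=0.0):
            self.val, self.dot = val, dot

        def __add__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.val + other.val, self.dot + other.dot)

        def __mul__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            # (a + b e)(c + d e) = ac + (ad + bc) e, since e^2 = 0
            return Dual(self.val * other.val,
                        self.val * other.dot + self.dot * other.val)

    def f(x):
        return x * x * x + x * 2   # f(x) = x^3 + 2x, so f'(x) = 3x^2 + 2

    out = f(Dual(3.0, 1.0))        # seed the dual part with 1
    print(out.val, out.dot)        # 33.0 29.0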


Before looking at the wiki I thought, huh, Grassmann numbers on HN?


You are not wrong.


Edit: I think I answered a different question than I believed.

Been a while but my best crack at it: If you’re trying to minimize a function, you can call it the primal function. It will have n inputs and m constraints (like how many of each product should I buy constrained by budget and carrying capacity). You can flip the problem around into its dual formulation. This will be a function with m inputs and n constraints, and it will be a maximization problem.

If you can solve the dual formulation (global max), you know that the primal cannot possibly be lower than the dual. For some types of problems (convex), you can even guarantee that the global max of the dual is the global min of the primal.

The transformation between primal and dual formulations goes both ways, so the primal is the dual of the dual.
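A small scipy sketch of that correspondence (an LP I made up for illustration; linprog only minimizes, so the dual max is expressed as a negated min):

    import numpy as np
    from scipy.optimize import linprog

    # Primal: min c.x  s.t.  A x >= b, x >= 0   (2 inputs, 2 constraints)
    c = np.array([3.0, 2.0])
    A = np.array([[2.0, 1.0],
                  [1.0, 3.0]])
    b = np.array([4.0, 6.0])
    primal = linprog(c, A_ub=-A, b_ub=-b)   # >= constraints flipped to <=

    # Dual: max b.y  s.t.  A^T y <= c, y >= 0   (constraints and inputs swap)
    dual = linprog(-b, A_ub=A.T, b_ub=c)

    print(primal.fun, -dual.fun)            # both 6.8: strong LP duality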


I'd add a very important thing: with the primal problem you are trying to, e.g., minimize a function, and you proceed in such a way that all your intermediate solutions are feasible. With the dual approach you flip the direction (maximization in this case), but you start from completely infeasible solutions and hope to end up at the very first feasible solution, which should be your optimum.

Now, how that relates to automatic differentiation, I am not sure either.



