Forward AD is the pushforward of a tangent vector (an element of the tangent space); reverse AD is the pullback of a cotangent vector (an element of the cotangent space). The duality between tangent and cotangent spaces is the same duality notion as between spaces in optimization. Unfortunately, I'm only passingly familiar with discrete optimization, but I would suspect the notion extends there from continuous optimization. That's not to say they are fundamentally the same, or that writing this down helps anybody in any way, but a lot of these "dual" notions do have some sort of dual vector space under the hood.
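To make the pushforward/pullback pairing concrete, here's a small Python sketch; the function f(x, y) = (x·y, sin x) and all the numbers are made up for illustration. Forward mode applies the Jacobian to a tangent v, reverse mode applies the transposed Jacobian to a cotangent w, and the two agree under the pairing ⟨w, Jv⟩ = ⟨Jᵀw, v⟩:

```python
import math

# Toy function f(x, y) = (x*y, sin(x)); its Jacobian is
#   J = [[y, x], [cos(x), 0]].
# Forward mode pushes a tangent v forward (computes J @ v);
# reverse mode pulls a cotangent w back (computes J^T @ w).

def jvp(x, y, vx, vy):
    # pushforward: directional derivative of f along the tangent (vx, vy)
    return (y * vx + x * vy, math.cos(x) * vx)

def vjp(x, y, w1, w2):
    # pullback: the cotangent (w1, w2) mapped back through J^T
    return (y * w1 + math.cos(x) * w2, x * w1)

# Duality check: <w, J v> == <J^T w, v>
x, y = 1.5, -2.0
v = (0.3, 0.7)
w = (1.0, 2.0)
Jv = jvp(x, y, *v)
JTw = vjp(x, y, *w)
lhs = w[0] * Jv[0] + w[1] * Jv[1]
rhs = JTw[0] * v[0] + JTw[1] * v[1]
print(abs(lhs - rhs) < 1e-12)  # True
```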
Yeah, but all you're really describing here is linear algebra. Vector spaces and linearity are a significant part of every single discipline the grandparent commenter mentioned, but they picked out duality.
I would agree with the critique: I don't think highlighting duality here is particularly useful. For example, the way dual numbers extend the reals for automatic differentiation has no deep connection to duality in vector spaces. "Dual" is just a very general semantic label for pairs of things; it doesn't say that any given pair of dual things is related to any other pair of dual things.
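For reference, here's a minimal dual-number sketch in Python (the class and function names are my own). The "dual part" just carries a derivative along with the value — that's forward-mode AD, not vector-space duality:

```python
class Dual:
    """Dual number a + b*eps with eps**2 == 0; b carries the derivative."""
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a + other.a, self.b + other.b)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a + b eps)(c + d eps) = ac + (ad + bc) eps, since eps^2 = 0
        return Dual(self.a * other.a, self.a * other.b + self.b * other.a)
    __rmul__ = __mul__

def derivative(f, x):
    # seed the dual part with 1.0 to get f'(x) alongside f(x)
    return f(Dual(x, 1.0)).b

# d/dx (x^3 + 2x) at x = 2 is 3*4 + 2 = 14
print(derivative(lambda x: x * x * x + 2 * x, 2.0))  # 14.0
```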
> For example, the way dual numbers are used to extend the reals for automatic differentiation doesn't have a deep connection to duality in vector spaces.
It doesn't, because certain operations are hard to reason about in linear spaces. Such as optimization.
Don't get me wrong, I'm not shitting on vector spaces. All I'm saying is that some problems that are hard in vector spaces are easy in smooth spaces, and vice versa. Having these two APIs to the same space is much more powerful, because again, you generalize over the conversions between the two spaces. You use whichever API is more appropriate in the particular context.
In some sense, linear spaces deal with things like infinity, while smooth spaces deal with cyclical things (signals, wavelets, modular arithmetic).
Quite a bit of optimization is easy to reason about in linear algebra. Take linear and mixed integer programming, for example. And convex optimization subsumes linear optimization in general. There is a lot of nonlinear optimization, but I can assure you with extremely high confidence that the common thread you're seeing here isn't duality, but more abstractly linearity.
Likewise, cyclic things show up all the time in purely algebraic (read: discrete, non-smooth) contexts: vector spaces, group theory, rings, modules, etc.
The canonical example is robotic motion, which is the reason Lie theory is used there. You have very discrete states (positions) that you want to interpolate between smoothly.
> For example, the way dual numbers are used to extend the reals for automatic differentiation doesn't have a deep connection to duality in vector spaces.
Yes, the right way to think about dual numbers (esp once you generalize them beyond just the single e^2=0), is to think of them as tangent vectors (sections of the tangent bundle). I've never really liked the "dual number" terminology here. That's why I deliberately chose to use the duality of forward and reverse mode AD, because that notion of duality agrees with the underlying linear algebra (or in general differential geometry). I do agree it's a mess of terminology.
Gimme five and I'll answer two. There are quite a few pairwise combinations, and some are easier to understand and more instructive than others.
Fundamentally, they are both connected via the idea of convex optimization. Automatic differentiation is a computational technique for obtaining the derivatives used to solve optimization problems.
Yes, "optimization problems" is very general; however, calculus is a fundamental tool there. Dual numbers are somewhat like Lie groups: very smooth and conducive to optimization.
Curious, can you expand on the connection to convex optimization? To my understanding, discrete optimization is nonconvex by nature due to discontinuities in the feasible space.
There are two types of spaces, discrete and continuous. These are in a dual relationship.
Duality is the isomorphism between these two. For example, for humans it's easier to reason about discrete spaces; however, a lot of things simply cannot be done that way.
Think of anything that is tangential (pun intended) to Lie theory. In the context of Lie theory, you have the discrete group and the continuous algebra (the group's tangent space). You go between these two using the exponential map (algebra -> group) and the logarithm to go back (group -> algebra).
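A tiny sketch of that round trip, using SO(2) (plane rotations) modeled as unit complex numbers — the simplest Lie group I know how to write down in a few lines:

```python
import cmath

# SO(2) modeled as unit complex numbers. A Lie algebra element is i*theta
# (a tangent vector at the identity); exp maps algebra -> group, and
# log maps group -> algebra.
theta = 0.75
g = cmath.exp(1j * theta)    # group element: rotation by theta
back = cmath.log(g).imag     # log recovers the algebra coordinate

print(abs(back - theta) < 1e-12)  # True
```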
It's the difference between an integral and a Riemann sum. It's the fundamental idea that underlies sampling (say, audio sampling, or even statistical sampling). You capture some invariants and then interpolate between them to recreate some smooth curve (or distribution).
The nice thing about the smooth space is that optimization is easy. In the exponential space, addition corresponds to multiplication, so some expensive things become cheap (computationally speaking).
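A concrete instance of "expensive things become cheap": working with log-probabilities instead of raw probabilities (the numbers here are made up). The product of many tiny factors underflows, while the equivalent sum of logs is stable:

```python
import math

# Multiplying many tiny probabilities underflows to 0.0 in double precision;
# in log space the product becomes a cheap, stable sum of logs.
probs = [1e-5] * 100

direct = 1.0
for p in probs:
    direct *= p                              # underflows: (1e-5)**100 = 1e-500

log_total = sum(math.log(p) for p in probs)  # ~ -1151.29, no problem

print(direct)      # 0.0
print(log_total)
```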
Unless I'm severely misunderstanding you, a discrete set (or function) cannot be a dual of a continuous set (or function). If nothing else, the former is countable and the latter is uncountable; there can be no isomorphism between the two.
> Unless I'm severely misunderstanding you, a discrete set (or function) cannot be a dual of a continuous set (or function). If nothing else, the former is countable and the latter is uncountable; there can be no isomorphism between the two.
There is. Discrete samples are samples of the continuous space. In order to capture the continuous space, you only really need to capture a particular set of samples that you then interpolate between.
The way I interpret isomorphism in this context: you can capture one space in the other, then convert back without a loss of information.
Imagine a polynomial (in the smooth space). You can capture a particular set of points that uniquely determines the polynomial. In some circumstances you can use these samples to reconstruct the original polynomial by interpolating between them.
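As a sketch of that reconstruction (the cubic here is a toy example of my choosing): any four distinct samples of a cubic determine it uniquely, and Lagrange interpolation recovers it exactly:

```python
def lagrange_eval(points, x):
    # evaluate the unique interpolating polynomial through `points` at x
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if i != j:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

p = lambda x: 2 * x**3 - x + 5                        # toy cubic
samples = [(x, p(x)) for x in (-1.0, 0.0, 1.0, 2.0)]  # 4 points suffice

# the reconstruction matches the original polynomial away from the samples
print(abs(lagrange_eval(samples, 3.7) - p(3.7)) < 1e-9)  # True
```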
Okay, I think I understand what you're getting at. But if you've taken a set of points from a continuous set (like an interval on the reals) and you can put those in bijection with a discrete set, then by definition your subset of the continuous set isn't continuous. It must be discrete.
More succinctly, you actually can't draw an isomorphism between discrete and continuous spaces without losing information from the continuous space.
Here's the thing: your description of the continuous set is already discretized. If we say "the interval [4, 6]", we have captured a continuous space using only two numbers.
I know this is a silly argument in some sense, but it's something you do so naturally that you don't even think about it.
Do you see what I'm getting at? You capture extrema in the discrete space and then interpolate in the smooth space to recreate the smooth curve.
> Imagine a polynomial (in the smooth space). You can capture a particular set of points that uniquely determines the polynomial. In some circumstances you can use these samples to reconstruct the original polynomial by interpolating between any of the two points.
This is not duality, though, and you do lose information.
For instance, let's say one has a cubic polynomial and one samples 5 points from it and stores those points. If one didn't know the original degree was cubic and tried to fit, say, a quintic through those 5 points, the fit would be underdetermined and the reconstruction could be incorrect.
You don't lose information if you pick your points correctly (you store only the extrema). In the cubic case, you need the two extrema (one minimum and one maximum, and you need to know which is which) and then interpolate between them.
Unfortunately this is incorrect. Extrema do not always exist (consider y=x^3) and they do not uniquely define a polynomial (y=x^2 and y=x^4 both have minima at x=0).
Unfortunately this is incorrect. Those minima are not the same. Remember that dual points have a real part and a dual part that indicates the rate of change at that point. The real part is the same, but the dual part is different.
My point is it’s not possible to uniquely reconstruct an arbitrary polynomial by just knowing the extrema because there may be information loss in the general case. I will stop here.
It is possible if you know the rate of change, which you do with dual points. You don't interpolate just position but also the dual parts, i.e. the rate of change.
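For what it's worth, interpolating from values plus derivatives is a real technique (Hermite interpolation); here's a Python sketch with a toy cubic of my choosing. Four conditions (two values, two slopes) pin down the four coefficients of a cubic:

```python
def hermite_cubic(x0, y0, d0, x1, y1, d1):
    # unique cubic with value y_i and slope d_i at x_i (Hermite basis)
    h = x1 - x0
    def eval_at(x):
        t = (x - x0) / h
        h00 = 2*t**3 - 3*t**2 + 1
        h10 = t**3 - 2*t**2 + t
        h01 = -2*t**3 + 3*t**2
        h11 = t**3 - t**2
        return h00*y0 + h10*h*d0 + h01*y1 + h11*h*d1
    return eval_at

p  = lambda x: x**3 - 3*x    # toy cubic
dp = lambda x: 3*x**2 - 3    # its derivative (the "dual part")
f = hermite_cubic(0.0, p(0.0), dp(0.0), 2.0, p(2.0), dp(2.0))

# value + slope at just two points reproduces the cubic everywhere
print(abs(f(1.3) - p(1.3)) < 1e-9)  # True
```

Note this still requires knowing the degree is cubic; it doesn't resolve the information-loss objection for arbitrary polynomials.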
I’m not sure. I’m not entirely convinced that discrete and continuous spaces are dual spaces. They are connected, but they are not duals.
Same with sampling vs. continuous. One cannot interchange the order of the composed morphisms while preserving the properties of the original. The sampled object cannot reconstruct the continuous object in all situations, due to effects like aliasing.
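Aliasing is easy to demonstrate numerically (frequencies here chosen arbitrarily): a 5 Hz sine sampled at 4 Hz is indistinguishable from a 1 Hz sine at the sample instants, so there is no unique reconstruction:

```python
import math

# A 5 Hz sine sampled at 4 Hz hits exactly the same values as a 1 Hz sine,
# so the samples alone can't tell the two continuous signals apart.
fs = 4.0
ts = [k / fs for k in range(16)]
hi = [math.sin(2 * math.pi * 5 * t) for t in ts]
lo = [math.sin(2 * math.pi * 1 * t) for t in ts]

print(all(abs(a - b) < 1e-9 for a, b in zip(hi, lo)))  # True
```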
In optimization, the concept of duality is also a much stronger idea: the primal and the dual of a problem are opposing views of the same problem that correspond exactly (not approximately) in their dual properties.
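As an illustration of that exact correspondence, here's a classic textbook linear program and its dual, with the optima worked out by hand (the numbers are a standard toy example, not anything from this thread):

```python
# Primal: max 3*x1 + 5*x2  s.t. x1 <= 4, 2*x2 <= 12, 3*x1 + 2*x2 <= 18, x >= 0
# Dual:   min 4*y1 + 12*y2 + 18*y3  s.t. y1 + 3*y3 >= 3, 2*y2 + 2*y3 >= 5, y >= 0
x = (2.0, 6.0)        # primal optimum (worked out by hand)
y = (0.0, 1.5, 1.0)   # dual optimum

primal_value = 3 * x[0] + 5 * x[1]
dual_value = 4 * y[0] + 12 * y[1] + 18 * y[2]

# strong duality: the two objective values coincide exactly at the optimum
print(primal_value, dual_value)  # 36.0 36.0
```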
Discrete optimization is nonconvex by nature (it does not satisfy the definitions of convexity), so I'm not sure it has any duality relation to convex optimization. There is a relationship, but it is not a dual one.
> One cannot interchange the order of the composing morphisms while preserving the properties of the original.
Good observation, one really can't, but that was never a hard requirement, right? Ordering actually becomes more interesting, because you can have properties like anti-commutativity (https://en.wikipedia.org/wiki/Anticommutativity), which is a lot more useful than commutativity. The bracket on a Lie algebra is anti-commutative, btw.
> In optimization, the concept of duality is also a much stronger idea: the primal and the dual of a problem are opposing views of the same problem that correspond exactly (not approximately) in their dual properties.
My view is more general. The difference between these spaces lies in the idea of choice, and in the idea of adversarial choice. You are correct, they are opposing views, like two players playing a game.
I control my moves. I do not control my opponent's moves, but I do have knowledge of my opponent's potential moves. Therefore I can do some sort of min-max optimization to figure out my optimal play, given my situation and my opponent's options.
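That min-max reasoning fits in a few lines of Python (the payoff matrix is made up): pick the move whose worst-case outcome is best.

```python
# Rows are my moves, columns are the opponent's; entries are my payoff.
# The matrix is a made-up example. I pick the row whose worst-case
# (minimum over columns) payoff is largest.
payoff = [
    [3, -1],   # move A: great if they cooperate, bad otherwise
    [1,  2],   # move B: solid either way
]
best = max(range(len(payoff)), key=lambda i: min(payoff[i]))
print(best, min(payoff[best]))  # 1 1  -> move B guarantees payoff >= 1
```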