Computer scientists discover limits of gradient descent (2021) (quantamagazine.org)
161 points by Anon84 on July 30, 2023 | 121 comments



> Brouwer’s fixed point theorem. The theorem says that for any continuous function, there is guaranteed to be one point that the function leaves unchanged — a fixed point, as it’s known. This is true in daily life. If you stir a glass of water, the theorem guarantees that there absolutely must be one particle of water that will end up in the same place it started from.

Wait. What if by stirring I shook the glass and at the end set it back down a foot away from its initial location? Is the author claiming that Brouwer’s theorem guarantees a particle floating somewhere where the glass used to be? Unchanged?

I can imagine an unlimited number of scenarios where I can guarantee the water particles are displaced.

This seems like a bad application of the theorem, right?


A better example I've seen for the theorem is that if you take a paper map of a country, messily scrunch it up into a ball and drop it somewhere in the country, there will be at least one point on the map that is exactly above the corresponding real point in the country.

As others have said, it's meant for mappings of the space to itself. So stirring the water, but not moving the glass.

But anyway, the theorem works only for continuous mappings. The moment they started mentioning "water particles", instead of some hypothetical fluid that is a continuous block, the theorem no longer applied. You could break it by mirroring the position of every particle. There's still a fixed point (the line of mirroring), but there's no obligation that there's a particle on that line.


Another practical example is that you're guaranteed to have an annoying spot on your head where the hair sticks straight out and you can't brush it down.

That's actually the one that helped me visualize the theorem. If you look at your scalp from above, you can divide all the hairs into "points left" or "points right" and draw a boundary between them of hairs that point neither left or right. Then you can do the same thing with "points up" and "points down." Where the two kinds of boundaries cross, you have a hair that doesn't point up, down, left, or right - it points straight out of your scalp.


This still doesn't make sense to me. Imagine a continuous line whose positions map to the real numbers between 0 and 1. If I "move" the line over by 0.1, wrapping the end back to the beginning (i.e. x2 = (x1 + 0.1) % 1), there will be no points that are in the same position as they were before.

EDIT: If you need a continuous function, wouldn't expanding the space to a line from -Inf to +Inf and then using x2 = x1 + 0.1 do the trick?


Brouwer's fixed point theorem only applies to compact convex sets.

Infinite lines don't work, as they are not compact.

Similarly a circle would not work as it is not convex (you're close with your example, you just need to glue together the endpoints to turn it into a circle and make the map continuous).


The line example is what I thought and then I looked the theorem up in Wikipedia. Thanks for pointing this out.


The theorem only applies to continuous functions.


It only applies when you're mapping a space onto itself (plus a couple other qualifications), so you have to put the glass back in the same place.

I think of it as a generalization of the intermediate value theorem - some things are going left, some things are going right, one thing must be sitting still in between.


And: it has to be ‘continuous’. E.g. in ‘real’ water, the molecule of H2O ‘colloquially’ known as ‘number 12345678998776165441’, which sat next to the molecule known as ‘number 1’, still has to sit next to it. Albeit in a different location. I think something like mayonnaise would have been a better example, as water molecules that are ‘next’ to each other can in reality split up quite easily by themselves, while something like mayonnaise would not.

*Of course, quantum mechanics and all that suggest that it would be impossible to label the individual molecules.


Are you trying to say that for every point P there exists a neighborhood U around it such that the transformed point T(P) is contained in the transformed neighborhood T(U)?


QM doesn't say that it's impossible to label the individual molecules. However, hydrogens are labile, so it makes more sense to identify the unique oxygens.


What's the conceptual difference between this and the hairy ball theorem?


Loosely, the hairy ball theorem is about the derivative of a mapping at a single point in time. (Imagine slowly deforming/mixing the input configuration to obtain the output configuration.) Brouwer's theorem can be thought of as being about the integral of a continuous family of hairy balls over a time interval.

It's not exactly the same because the assumptions about differentiability are different.


Interesting, thanks!


They left off some qualifiers. According to Wikipedia, it's applicable to a continuous function mapping a nonempty compact convex set to itself. Which ends up making a lot more sense to me.


Isn't a container of water actually a good example? Maybe swirling gently (laminar flow) is a better mental image than shaking vigorously (possibly discontinuous).


It doesn't make sense to me because a glass of water is a discrete set. If we only had two molecules you could just interchange them (with a 180 degree rotation), without any fixed points.


Show me a mathematical continuum in the real world. And this is just the domain; nothing has been said about the mapping yet (the continuous function). Every analogy has its limits.


The space represented in a flat piece of paper.

The space is continuous even if we can only measure down to the Planck length.


Key here: "represented"

I'm only a mathematician, not a physicist, but I seem to remember that the concept of a continuous physical space becomes quite muddled at this scale.


"The space represented by the water in a convex container"


If you get to assume continuous paper, then I get to assume continuous water.

If you don't want to assume continuity, then you get it back by rephrasing your theorems as "within a margin of error equal to the distance between discrete objects".


I'd guess the example of the glass of water was made as an afterthought, or as a vivid example of the potential complexity of the theorem. As your example shows, it simply cannot be true. If you think of a vase with marbles, it is also clear. Or if you think of an (ideal) gas in a closed system: would there be at least one particle that always(!) stays in the same place?


Shaking a container is a good example I think, assuming the glass is convex.

Shaking has to be continuous: the particles move quickly and erratically, but they trace continuous paths.


The paths are continuous, but if they move two neighbouring molecules away from each other, the final transformation won't be continuous, will it?


It’s difficult to discuss this physical example because particles are discrete.

In an ideal system with points instead of particles, shaking would be continuous.


And then, we would not call it shaking but bending and twisting, would we?

Think of the 1D variant. If you shuffle a deck of cards, but require that to be ‘continuous’, few shuffles remain (I think only the identity mapping and ‘flipping the deck upside down’). If you restrict the possible permutations that much, I doubt anybody would still call it shuffling.


I don’t think that analogy works because an idealized fluid is point particles, but an idealized deck of cards isn’t.


They both have a concept of neighbors but the deck of cards is simpler, it’s a good illustration, no?


I think point particles are different because there are infinitely many of them. They are more continuous than cards. This means shaking isn't necessarily discontinuous: even if two particles that were “right next to each other” wind up far apart, there should be more particles in between that ended up closer.


Erratic movement only occurs if the container isn't filled completely. If it is filled completely and the shaking doesn't include any turning/twisting motion, not much happens.


Even then. The required mapping is 'onto' (otherwise, a discontinuous counterexample would be trivial). The air particles are equivalent to the water.


So is it just a fancy way of saying "every transformation has an axis or origin"?


One of the pre-conditions of the theorem is that the domain and codomain are the same. So that's one way to satisfy the theorem. But it's not really intuitive. If you gently swirl a bottle of water, somewhere in the bottle, some water molecule did not move. It's not necessarily in the center of anything. The fixed point could be anywhere. The theorem is about existence only, not construction.

(I am not a professional mathematician, I might also be wrong.)


No, you are on point.


Well, some continuous functions are simple and it's highly intuitive why this applies to them. Others are very weird and it's not intuitive at all that they should have such a point.

And of course, some functions/transformations are not continuous, and those may not have such an "axis of origin" at all.


I don't quite know what you mean by that, but consider that this is not true for e.g. a torus or a sphere (take the function mapping every point to its antipodal point).

The fact that the underlying space is contractible is very important here.


That's not the best way to view it. You can prove it by showing that for an n-dimensional disc, there is no retraction onto its boundary, which I think is a bit more illustrative of what this FPT is doing.


This is also how I interpreted it, which I take it is probably incorrect and would like to know why.


The theorem assumes the function has identical domain and codomain. E.g. f(x) = x^3 maps [0, 1] to itself.

So that's why they gave the glass-stirring example: the domain here is the whole volume of water. As long as the water ends up in the same volume and the stirring is continuous, the theorem applies. Your example changes where it ends up (i.e. not back to the same domain), so it cannot be applied.
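To make that concrete, here's a quick numerical sketch (my own toy, not from the thread): iterating f(x) = x^3 on [0, 1] happens to walk you straight into one of its fixed points (0 here; 1 is the other). Brouwer only guarantees a fixed point exists; that plain iteration converges to it is a bonus of this particular f:

    # fixed-point iteration for f(x) = x**3 on [0, 1]
    x = 0.9
    for _ in range(50):
        x = x ** 3
    print(x)  # 0.0 (to machine precision): the fixed point at 0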


I'm misunderstanding the theorem. Take f(x) = x + 1. That is a continuous function. But there does not exist an x where f(x) = x holds. What am I missing?


That's because there is also a precondition for the domain to be bounded, and since the real numbers are not bounded, the theorem does not apply here (or for any function R -> R).


You can just do the same thing in the integers modulo any value greater than 1, creating a bounded domain without creating a fixed point. I imagine this runs afoul of a different precondition which I haven't identified, though -- that's obviously far more likely than the entire field having missed something this obvious.

Edit: Actually it's obviously just that this set is not convex, or even connected.


It must be a compact convex set. The entire reals is not compact.


I do not understand this comment. In what world does "stirring" fluid in a cup imply any sort of displacement of the cup itself?


You are not just stirring, then


> If you stir a glass of water, the theorem guarantees that there absolutely must be one particle of water that will end up in the same place it started from.

Wait what? If you stir a glass long enough, any configuration of particles should be possible. For instance you can imagine "cutting" the water like a deck of cards.


The full theorem is about a continuous mapping of a space back onto itself. So: only the water in the glass, mapped back to the glass. Pour it out into another glass, and you are no longer mapping back to the same glass. Pour it back into the original glass, and you are back to having a fixed point. And, obviously, only with respect to the mapping in the glass.


I wonder how this really applies, since IEEE 754 floating points are not continuous.



That function is from a compact convex set to itself.


I remember the example of "you can't comb the hair on a coconut without a whorl".



Thanks



Thank you.


Tfp?


The fucking paper, although there’s no reason for anyone to understand that.


Going with this version.


The Fine Paper


the frickin paper?


Good to note that the article was published on "August 17, 2021"


Here's the original paper: https://arxiv.org/abs/2011.01929

And here are the complexity classes that are mentioned in the abstract: https://complexityzoo.net/Complexity_Zoo:P#ppad and https://complexityzoo.net/Complexity_Zoo:P#pls.


For problems that require recognition of patterns, whether in images or sounds, a non-linear function represented by a neural network and trained through gradient descent suffices to produce a smoothly distributed fuzzy classification that just works.

This setup is not a good prior for anything that requires a more precise solution.


Honest question: why not just use a binary/ternary search? Is it because it's not known a priori how many local maxima and/or minima there are?

Because that's just an iterative/recursive version of binary/ternary search.


How would you do a binary search on a continuous space without discretizing?


Replace A[x] with f(x) and done.

A modified binary search called ternary search is a well-known method to find where the derivative = 0, numerically.

https://en.m.wikipedia.org/wiki/Ternary_search
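For reference, a minimal sketch (names are mine). Note it only works when f is unimodal on the bracket, which is exactly what a general loss landscape doesn't give you:

    # ternary search: minimize a unimodal f on [lo, hi]
    def ternary_search(f, lo, hi, iters=200):
        for _ in range(iters):
            m1 = lo + (hi - lo) / 3
            m2 = hi - (hi - lo) / 3
            if f(m1) < f(m2):
                hi = m2  # minimum cannot lie in (m2, hi]
            else:
                lo = m1  # minimum cannot lie in [lo, m1)
        return (lo + hi) / 2

    print(ternary_search(lambda x: (x - 2) ** 2, -10, 10))  # ~2.0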


Try doing binary search on a space of nontrivial dimensionality (ie, not 1, 2, or 3) and you will have your answer.


What would you search for, the zero/root of the derivative? Imagine the derivative is a function like y = x^2 - eps with some small eps, say 0.000001. For a binary search like that to work, you need to start with two values x1 and x2 such that sign(y1) != sign(y2). But with the derivative above, those values are hard to find with random initialization.

More generally, for binary search to find a value x such that f(x)=0, we need to have f(any value less than x)<0 and f(any value greater than x)>0. We can't guarantee these properties for general derivatives.
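To make the precondition concrete, a toy bisection sketch (my own code, not the parent's exact setup): it can only start from a bracket with a sign change, and for y = x^2 - eps a random pair of points almost always gives two positive values:

    # bisection: find x with g(x) = 0, given g(x1), g(x2) of opposite sign
    def bisect(g, x1, x2, iters=100):
        assert g(x1) * g(x2) < 0, "need a sign change to start"
        for _ in range(iters):
            mid = (x1 + x2) / 2
            if g(x1) * g(mid) <= 0:
                x2 = mid
            else:
                x1 = mid
        return (x1 + x2) / 2

    g = lambda x: x ** 2 - 1e-6   # the derivative from the example above
    print(bisect(g, 0.0, 1.0))    # ~0.001, but only because [0, 1] happens
                                  # to straddle a root; [0.1, 1.0] would fail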


> What would you search for

I imagine you could search until the incremental change between your current and last result is below some threshold epsilon, as a kind of stopping criterion.


IEEE 754 is by default limited to 64 bits, so yes.


We do, it's called global optimization. It works for small problems with ~1000 variables or so. Check out BARON and the branch-and-reduce algorithm.


Thanks!


Binary search only works on sorted data.


This is probably what you're thinking of:

https://en.m.wikipedia.org/wiki/Bisection_method


Because of the dimensionality of the problem. Bracketing is easy to do in 1D space, not so in N-dimensional space, with large N.


Because the search space is tremendously huge.


Binary search only works in 1D parameter space


This site (Quanta Magazine) had really excellent content when it first launched, like the piece at the OP; now it has become a sort of IFLScience in terms of editorializing.


Their articles are often rehashes of press releases, but this one seems to be one of the better ones.


This title is still a bit clickbaity, because ultimately this problem has very little impact on any real-world usage of stochastic gradient descent - and especially not the more important ones, where getting infinitesimally close to any particular local minimum is specifically and totally not what people care about (and getting to a global minimum is something we've long known to give up on).

It is a theoretically interesting result about the PPAD complexity class itself. Not sure if it should be STOC best paper though; if it was nominated as best paper because of the weak link to machine learning, I'd be a bit disappointed.


I should’ve clarified I was referring to the old (2021) article at the OP when I was talking about the high quality.

The recent ones haven't kept to the same path, unfortunately.


Is this a quality article? It starts badly:

> ... elevation of the land is equal to the value of the function (the “profit”) at that particular spot. Gradient descent searches for the function’s local minimum ...


Ouch. Compared to current ones, I'd unfortunately have to say yes.


What are the alternatives to gradient descent?


If you solve f(x) = 0 where f: ℝ^k -> ℝ^k maps vectors to vectors, most schemes are based on a Taylor expansion around x_n:

    f(x) = f(x_n) + Df(x_n)(x - x_n) + O(||x - x_n||^2)
Drop the quadratic term, equate to 0, and you get an approximate solution for x for the next iteration x_{n+1}:

    x_{n+1} = x_n - Df(x_n)^{-1} * f(x_n)
Written like this, it's Newton's method. The problem is that the Jacobian Df(x_n) is a k x k matrix, and inverting it may require O(k^3) work.

So, pretty much all schemes are based on approximations of the Jacobian.

Gradient descent for example replaces Df(x_n)^{-1} with a scalar a, so it's O(1) work to "compute" it:

    x_{n+1} = x_n - a * f(x_n)
A method like L-BFGS tries to build a relatively cheap approximation to the inverse of the Jacobian during iteration, resulting in O(k) work to compute:

    x_{n+1} = x_n - P * f(x_n)
Other methods may exploit sparsity of the Jacobian, or solve the linear system Df(x_n)z = f(x_n) only approximately for z using e.g. iterative methods.

Note: in optimization problems, the function f is typically the gradient of a cost function, and the Jacobian of f is then the Hessian of that cost function.
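A tiny runnable sketch of the two extremes, on a toy cost of my own choosing (c(x) = x0^2 + 10*x1^2, so f is its gradient and Df its Hessian); this is an illustration, not a benchmark:

    import numpy as np

    def f(x):   # gradient of the cost c(x) = x[0]**2 + 10 * x[1]**2
        return np.array([2 * x[0], 20 * x[1]])

    def Df(x):  # Jacobian of f = Hessian of c (constant, since f is linear)
        return np.array([[2.0, 0.0], [0.0, 20.0]])

    # gradient descent: Df(x_n)^{-1} replaced by a scalar a
    x = np.array([1.0, 1.0])
    for _ in range(50):
        x = x - 0.04 * f(x)
    print(x)  # creeps toward [0, 0]; the low-curvature direction is slow

    # Newton: solve the linear system rather than inverting Df explicitly
    x = np.array([1.0, 1.0])
    x = x - np.linalg.solve(Df(x), f(x))
    print(x)  # [0, 0] in a single step, because f is linear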


Gradient descent also allows batching, which reduces the memory requirements. Can the other methods support that?


As far as I know, this is the #1 gradient descent superpower that makes it the preferred choice above all others. I don't think e.g. L-BFGS supports batching, I've certainly never seen it.
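For illustration, a minimal mini-batch SGD sketch on a toy least-squares problem (all names and sizes made up); the point is that only the 32 sampled rows are touched per step, which is the memory win:

    import numpy as np

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(10_000, 5)), rng.normal(size=10_000)
    w = np.zeros(5)
    for _ in range(500):
        idx = rng.integers(0, len(X), size=32)            # one mini-batch
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / 32  # batch gradient
        w -= 0.01 * grad
    print(w)  # drifts toward the least-squares solution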


There's also Coordinate Descent which was traditionally used for fitting LASSO regression.


There are probably a hundred. Many people have asked for research into this for, I don't know, 20 years. But the hype doesn't die down, and modern trends have mostly involved doing more GD with bigger models...

A colleague and myself experimented with some alternatives and passed notes once in a while... For certain modelling problems you can save literal GPU days by going against the grain and get better results.

Oh well...


> Many people have asked for research into this for I don't know 20 years

Sometimes I wonder if people in machine learning ever look at literature.

Basic iteration schemes like the secant method (ok, 1-dimensional) have been known well over 3000 years.

Newton's method is over 300 years old.

Quasi-Newton methods (the secant method being an example) became popular in the early 1960s.


Most ML papers I used to read fell into a couple of categories...

1. Baseless empirical result that was probably p-hacked.

2. Mostly untested result.

3. A rediscovery of something known for decades, without attribution to the source they probably read it from.

4. Incomplete description of the problem or resolution, or probably fraud/incompetence.

5. Useless in general. Unverifiable, etc.

6. Review article.

7. A derivation with way more omissions/flaws than, say, an engineering paper.

I'm somewhat seasoned in optimization methods personally, but yeah, it seems once people go ML they tend to stop studying the fundamental literature that ML came from. "Online masters program, learn AI in 12 weeks from nothing!" Oh okay, so calculus won't be included in that... Or statistics... Or... Yep, it's going to be scikit-learn notebooks...


It does feel like academic outsiders hacking/taking on the brand of serious academic research to give themselves authority.

> Baseless empirical result that probably was p hacked

This to me seems like the biggest regression in science. It's all hearsay which is very hard to reproduce or learn general lessons from. It feels like disparate social science methodologies are being used to study math.

Nobody is going to look back and benefit from these papers. I often bring up to ML folks limitations proven in the book Perceptrons and wonder how their models differ. I have never gotten a response.


I once tried talking to a colleague about how there is probably a fundamental limit to how deep a graph traversal chat bots can do. And he stared blankly at me and said "it'll work". As if chat bots can now solve NP-complete problems with ease because "AI"... I'm so glad I left the field; the level of snake oil relative to substance is pretty hard to stomach.


I have seen similar claims that computer science is wrong about complexity theory.

What field did you move to?


Maybe we are wrong about complexity theory. I know people who have dedicated good chunks of their lives to studying it. One thing's for sure: if we are wrong about it, and have no basis for studying any of its exceptions, it's hard for me to accept hot takes like this as worth considering. The general "we can do ...insert currently impossible thing... because AI!" gets very old. Once had a boss request that light travel faster than it does - literally... "no" wasn't an acceptable answer. Anyway...

I float between a few technical fields. Some in natural science, computer science, data science hybrid roles, databases/engineering, etc. Not a jack of all trades, nor a master of none. What I do have mastered isn't something people hire for, so basically I am an averagely smart person who will take any job and figure it out to pay the bills.

At home though I play with all of the areas of creation I can get my hands on. I guess I am just in the field of discovering new things and making things.


> if people in machine learning ever look at literature.

They don't.

A few other things they seem completely unaware of:

- other ways to represent functions besides neural networks (harmonic analysis, polynomials, etc)

- other models exist besides neural networks. ie. if you can model the problem with a simple equation you can just optimize that.

- polynomial regression.

Someone who has read "Numerical Recipes" is probably more capable of solving ML problems than an "ML software engineer".


None really.

Gradient descent is great because it is first order, thus cheap to compute since you don't need a Hessian; it points in a descent direction; it works with anything that is differentiable or piecewise differentiable without caveats; and given the millions of parameters in today's models, there is always a descent direction.

If you do population methods, you suffer in terms of memory because you need to keep all those candidates, evaluate them individually, and then update them. This bounds the memory you can use.

More memory means more parameters, which means you enter the interpolation regime, which means your model behaves well in the real world.

If you try to go with second order optimisation methods then you need to go for Hessian free methods as computing the Hessian is computationally intractable for large NNs.

You can attempt to build a local model of the loss landscape but that is expensive and has many caveats.

Nocedal and Wright is a recommended read for numerical optimisation theory.


> More memory means more parameters, which means you enter the interpolation regime, which means your model behaves well in the real world.

The surprising thing about large neural networks is that the difference in quality between the local minima goes down and makes it less and less relevant which one you end up in. The global minimum may also lead to overfitting so you probably don't even want to go there.


It is also the case that many such local minima are locally identical in structure, and you can jump between them via permutations of the parameters.


As a simple example, if you have a NN with one input and output neuron and a single two-neuron hidden layer (with the same activation function), you can swap the weights (and bias) of the two neurons in the hidden layer and the result will be the same. Right?
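A quick numpy check of that swap, with made-up weights (tanh standing in for the shared activation function):

    import numpy as np

    def mlp(x, w1, b1, w2):  # 1 input -> 2 hidden (tanh) -> 1 output
        return np.tanh(w1 * x + b1) @ w2

    w1 = np.array([0.3, -1.2])
    b1 = np.array([0.1, 0.7])
    w2 = np.array([2.0, -0.5])
    p = [1, 0]  # swap the two hidden neurons
    print(mlp(0.8, w1, b1, w2), mlp(0.8, w1[p], b1[p], w2[p]))  # identical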

Is there something to gain by trying to eliminate or exploit such symmetries?


These symmetries come up in the lottery ticket hypothesis, which says that within an overparameterized ANN there exists a smaller, sparser neural network that performs at least as well as the original one and learns faster.

Re the example: yes, that is correct. A permutation matrix is simply a row-wise shuffled identity matrix; it doesn't affect the gradients or performance.


If accuracy isn't as important as speed for you, you can sample 1000 points and pick the smallest one. This point will probably be smaller than 99% of the function's values, but I wouldn't recommend deploying this in production.
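In code, the whole "optimizer" (with a made-up objective):

    import numpy as np

    f = lambda x: np.sin(3 * x) + x ** 2        # any objective
    xs = np.random.uniform(-2, 2, size=1000)    # sample 1000 points
    best = xs[np.argmin(f(xs))]                 # keep the lowest one
    print(best, f(best))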


Not to sound snarky, but is there any example where theoretical computer science has actually been useful? The results are often like "Here is a proof this will be slow if you scale it up enough" -- "Yeah we knew that for 20 years, it doesn't matter"


Anything at scale. Anything that uses a database. Anything that uses anything smarter than bubblesort to organize results. Anything that uses a tree (or a trie) internally to represent the data and make some sort of order from it. CS isn't just about "hey this thing is slower than this thing", it's also about making things possible in the first place. If you've used Google (or Facebook or Microsoft or Netflix), they're only possible due to deep computer science work. Which videos get chosen to be stored on the edge cache? A computer science algorithm.


I'm not sure how true this is. A lot of "algorithms" can be tested by using them or by running simulations for reasonable input ranges.


Well first off, I think there is an argument that creating & simulating algorithms is CS stuff anyway.

Second, it's often (not always) a lot more work to test scalability with benchmarks than it is to analyze theoretically.

Especially because people are often worried about both average case and worst case. The input that takes down your site is not the one you expect, and simulating realistic load is difficult.

It can also be pretty hard to compare benchmarks. You probably don't want to test every algorithm known to man yourself, but two separate benchmarks on different hardware and under different conditions cannot be reasonably compared. More abstract analysis can (to an extent, anyway).


Don't a lot of these algorithms come from TCS researchers themselves? They may publish about complexity but if that gives some people a better understanding of algorithms and leads to better ones being invented, it's a win.


Now that I've had a chance to think about it, it's not even "at scale". Your smartphone, which makes liberal use of sqlite databases uses a tree data structure to organize and index the information in the db files. When you turn on the screen to your phone or laptop, and see a bunch of icons laid out, it's thanks to computer science algorithm work that you're not waiting minutes or days for all the icons to get loaded from storage organized with a filesystem smarter than FAT, arranged how you like them, and displayed to a screen with millions of pixels.



You wouldn’t come up with the high-performing algorithms without understanding why they would be high-performing in the first place.


Well, this did happen in machine learning. There was a theoretical argument that neural networks "should" terribly overfit according to statistical learning theory, since they have very large VC dimension, but they don't. Practice did race way ahead of theory in AI. Another example is the Chinchilla scaling law, which was discovered empirically instead of being derived from first principles.


That's different from complexity though. We actually understand why some algorithms have better complexity and results than others, while we mostly have no idea how intelligence works or even how to define it.


It tends to not matter until it suddenly does and bites your behind.

What was that game that took forever to start up because the code for parsing the config files was accidentally quadratic or exponential?

Edit: also, the standard library of any programming language tends to have quite a bit of theoretical computer science inside.


You're thinking of GTA Online, from a quick DDG search: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times...


If you haven't seen it, Accidentally Quadratic has a tumblr documenting those kinds of things. https://accidentallyquadratic.tumblr.com/
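The classic shape of those bugs, as a sketch (not the actual GTA code, which was repeated sscanf/strlen passes over one big JSON string):

    # accidentally quadratic: each front-insert shifts every element -> O(n^2)
    items = []
    for i in range(100_000):
        items.insert(0, i)

    # linear fix: build in append order and reverse once (or use a deque)
    items = list(range(100_000))[::-1]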


Hah, I just realized they host a text blog on ... tumblr. Thought that site was for random images.


I liked a few results. For example, smoothed analysis by Spielman showed why so many algorithms work very well in practice even though their worst-case complexity is terrible, e.g. the simplex method. Second, concave-convex optimization shows that you will definitely get closer to the global optimum at each iteration. I guess it's comforting to know a reason algorithms behave the way they do.

The sampling theorem is also a case in point. Also compressed sensing. Information theory will tell you there is no point trying any harder at compression once you've gotten within epsilon of the theoretical limit.


> is there any example where theoretical computer science has actually been useful?

Anything requiring extremely optimized algorithms will generally start in academia and work its way outward as the techniques are adopted in commercial software. Sorting and compression algorithms are two examples that come to mind. Even games development which has a history of non-academic roots now borrows heavily from academic research to refine things like culling algorithms or optimized computational methods for specialized physics calculations.


Ah yes, I heard somewhere that Brian Karis at Epic was partially relying on prior work by an academic when he developed virtualized geometry for Unreal Engine 5. (I can't remember the name of the researcher though.)


An obvious one that comes to mind: this problem is NP-complete, so pragmatically I can stop looking for a fast algo.


Or resort to a SAT solver right away.



