> Brouwer’s fixed point theorem. The theorem says that for any continuous function, there is guaranteed to be one point that the function leaves unchanged — a fixed point, as it’s known. This is true in daily life. If you stir a glass of water, the theorem guarantees that there absolutely must be one particle of water that will end up in the same place it started from.
Wait. What if, by stirring, I shook the glass and at the end set it back down a foot away from its initial location? Is this author claiming that Brouwer’s theorem guarantees a particle floating somewhere where the glass used to be? Unchanged?
I can imagine an unlimited number of scenarios where I can guarantee the water particles are displaced.
This seems like a bad application of the theorem, right?
A better example I've seen for the theorem is that if you take a paper map of a country, messily scrunch it up into a ball and drop it somewhere in the country, there will be at least one point on the map that is exactly above the corresponding real point in the country.
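A little numeric version of that analogy (my own sketch, not from the comment): model the scrunched map as a continuous map of the unit square into itself, here a shrink-plus-rotate-plus-shift with made-up numbers. Iterating happens to converge because this particular map shrinks distances (that's Banach's fixed point theorem, a stronger assumption than Brouwer needs), but Brouwer guarantees a fixed point exists even for wilder continuous self-maps.

```python
import numpy as np

# "Scrunched map" as a continuous map of the unit square into itself:
# shrink by 0.5, rotate by 30 degrees, then shift toward the middle.
theta = np.deg2rad(30)
R = 0.5 * np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
shift = np.array([0.4, 0.3])

def scrunch(p):
    return R @ p + shift

# Iterate from anywhere in the square; the sequence settles on the spot
# where a point of the map lies exactly over its real location.
p = np.array([0.9, 0.1])
for _ in range(60):
    p = scrunch(p)
print(p, scrunch(p))  # p and scrunch(p) agree: a fixed point
```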
As others have said, it's meant for mappings of the space to itself. So stirring the water, but not moving the glass.
But anyway, the theorem works only for continuous mappings. The moment they started mentioning "water particles", instead of some hypothetical fluid that is a continuous block, the theorem no longer applied. You could break it by mirroring the position of every particle. There's still a fixed point (the line of mirroring), but there's no obligation that there's a particle on that line.
Another practical example is that you're guaranteed to have an annoying spot on your head where the hair sticks straight out and you can't brush it down.
That's actually the one that helped me visualize the theorem. If you look at your scalp from above, you can divide all the hairs into "points left" or "points right" and draw a boundary between them of hairs that point neither left nor right. Then you can do the same thing with "points up" and "points down." Where the two kinds of boundaries cross, you have a hair that doesn't point up, down, left, or right - it points straight out of your scalp.
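A crude numeric version of that boundary-crossing picture (my own sketch, with a made-up continuous "hair direction" field): scan a grid for the cell where both the left/right and the up/down components are closest to zero.

```python
import numpy as np

# Hypothetical continuous "hair direction" field over a patch of scalp:
# u is the left/right component, v is the up/down component.
def hair_field(x, y):
    u = np.sin(3 * x) - 0.5 * y
    v = np.cos(2 * y) - x - 0.3
    return u, v

xs = np.linspace(-1, 1, 400)
ys = np.linspace(-1, 1, 400)
X, Y = np.meshgrid(xs, ys)
U, V = hair_field(X, Y)

# A hair "points straight out" where both components vanish, i.e. where the
# points-left/points-right boundary crosses the points-up/points-down one.
i, j = np.unravel_index(np.argmin(U**2 + V**2), U.shape)
print(f"straight-out hair near x={X[i, j]:.3f}, y={Y[i, j]:.3f} "
      f"(u={U[i, j]:.4f}, v={V[i, j]:.4f})")
```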
This still doesn't make sense to me. Imagine a continuous line whose positions map to the real numbers between 0 and 1. If I "move" the line over 0.1, wrapping the end back to the beginning (i.e. x2 = (x1 + 0.1) % 1), there will be no points that are in the same position as they were in before.
EDIT: If you need a continuous function, wouldn't expanding the space to a line from -Inf to +Inf and then using x2 = x1 + 0.1 do the trick?
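For what it's worth, a quick check (my own sketch) of why the wrap-around version falls outside the theorem: on [0, 1) the map jumps from nearly 1 back to 0 at x = 0.9, so it isn't continuous, and indeed it has no fixed point.

```python
# The wrap-around shift from the comment above: x -> (x + 0.1) mod 1.
def shift(x):
    return (x + 0.1) % 1.0

# Continuity fails at x = 0.9: values just below map to nearly 1.0,
# values at/above map to nearly 0.0 -- a jump discontinuity.
for x in (0.8999, 0.9, 0.9001):
    print(f"f({x}) = {shift(x):.4f}")

# And no point is fixed: f(x) - x is either +0.1 or -0.9, never 0.
print(all(abs(shift(k / 1000) - k / 1000) > 1e-9 for k in range(1000)))
```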
Brouwer's fixed point theorem only applies to compact convex sets.
Infinite lines don't work, as they are not compact.
Similarly a circle would not work as it is not convex (you're close with your example, you just need to glue together the endpoints to turn it into a circle and make the map continuous).
It only applies when you're mapping a space onto itself (plus a couple other qualifications), so you have to put the glass back in the same place.
I think of it as a generalization of the intermediate value theorem - some things are going left, some things are going right, one thing must be sitting still in between.
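In one dimension that intuition really is the proof (sketch of my own, not from the thread): for continuous f mapping [0, 1] into itself, g(x) = f(x) - x is non-negative at 0 and non-positive at 1, so the intermediate value theorem hands you a zero of g, i.e. a fixed point, and bisection will find one.

```python
import math

# 1D Brouwer via the intermediate value theorem: for continuous f with
# f([0, 1]) inside [0, 1], g(x) = f(x) - x changes sign, so bisect on g.
def fixed_point(f, lo=0.0, hi=1.0, tol=1e-12):
    g = lambda x: f(x) - x
    if g(lo) == 0:
        return lo
    if g(hi) == 0:
        return hi
    # g(lo) > 0 and g(hi) < 0 because f stays inside [lo, hi].
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Example: cos maps [0, 1] into itself; its fixed point is ~0.7391.
print(fixed_point(math.cos))
```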
And: it has to be ‘continuous’. E.g. in ‘real’ water, the molecule of H2O ‘colloquially’ known as ‘number 12345678998776165441’, which sat next to the molecule known as ‘number 1’, still has to sit next to it, albeit in a different location. I think something like mayonnaise would have been a better example, as water molecules that are ‘next’ to each other can in reality quite easily split up by themselves, while something like mayonnaise would not.
*of course, quantum mechanics and all that suggest that it would be impossible to label the individual molecules.
Are you trying to say that for every point P there exists a neighborhood U around it for which each transformed point T(P) is contained in the neighborhood T(U)?
QM doesn't say that it's impossible to label the individual molecules. However, since hydrogens are labile, it makes more sense to identify the unique oxygens.
Loosely,
Hairy ball theorem is about the derivative of a mapping at a single point in time. (Imagine slowly deforming/mixing the input configuration t to obtain the output configuration.)
Brouwer's theorem can be thought of as being about the integral of a continuous family of hairy balls over a time interval.
It's not exactly the same because the assumptions about differentiability are different.
They left off some qualifiers. According to wikipedia, it's applicable to a continuous function mapping a nonempty compact convex set to itself. Which ends up making a lot more sense to me.
Isn't a container of water actually a good example? Maybe swirling gently (laminar flow) is a better mental image than shaking vigorously (possibly discontinuous).
It doesn't make sense to me because a glass of water is a discrete set. If we only had two molecules you could just interchange them (with a 180 degree rotation), without any fixed points.
Show me a mathematical continuum in the real-world. And this is just the domain, nothing said about the mapping yet (the continuous function). Every analogy has its limits.
If you get to assume continuous paper, then I get assumed continuous paper.
If you don't want to assume continuity, then you get it back by rephrasing your theorems as "within a margin of error equal to the distance between discrete objects".
I'd guess the example of the glass of water was made as an afterthought, or as a vivid example of the potential complexity of the theorem. As your example shows, it simply cannot be true. If you think of a vase with marbles, it is also clear. Or if you think of an (ideal) gas in a closed system: would there be at least one particle that always(!) stays in the same place?
And then, we would not call it shaking but bending and twisting, would we?
Think of the 1D variant. If you shuffle a deck of cards but require that to be ‘continuous’, few shuffles remain (I think only the identity mapping and ‘flipping the deck upside down’). I doubt anybody who restricted the possible permutations that much would still call it shuffling.
I think point particles are different because there are infinitely many of them. They are more continuous than cards. This means shaking isn’t necessarily discontinuous: even if two particles that were “right next to each other” wind up far apart, there should be more particles in between that ended up closer.
Erratic movement only occurs if the container isn't filled completely. If it is filled completely and the shaking doesn't include any turning/twisting motion, not much happens.
One of the pre-conditions of the theorem is that the domain and codomain are the same. So that's one way to satisfy the theorem. But it's not really intuitive. If you gently swirl a bottle of water, somewhere in the bottle, some water molecule did not move. It's not necessarily in the center of anything. The fixed point could be anywhere. The theorem is about existence only, not construction.
(I am not a professional mathematician, I might also be wrong.)
Well, some continuous functions are simple and it's highly intuitive why this applies to them. Others are very weird and it's not intuitive at all that they should have such a point.
And of course, some functions/transformations are not continuous, and those may not have such an "axis of origin" at all.
I don't quite know what you mean by that, but consider that this is not true for e.g. a torus or a sphere (take the function mapping every point to its antipodal point).
The fact that the underlying space is contractible is very important here.
That's not the best way to view it. You can prove it by showing that for an n-dimensional disc there is no retraction onto its boundary, which I think is a bit more illustrative of what this FPT is doing.
The theorem assumes the function has identical domain and codomain. E.g. f(x) = x^3 maps [0, 1] to itself.
So that's why they gave the glass-stirring example: the domain here is the whole volume of water. As long as the water ends up in the same volume, and the process of stirring is assumed to be continuous, the theorem applies. Your example changes where it ends up (i.e. not back to the same domain), and so it cannot be applied.
I'm misunderstanding the theorem. Take f(x) = x + 1. That is a continuous function. But there does not exist an x where f(x) = x holds. What am I missing?
That's because there is also a precondition for the domain to be bounded, and since the real numbers are not bounded, the theorem does not apply here (or for any function R -> R).
You can just do the same thing in the integers modulo any value greater than 1, creating a bounded domain without creating a fixed point. I imagine this runs afoul of a different precondition which I haven't identified, though -- it's obviously far more likely that I'm missing something than that the entire field has managed to miss something this obvious.
Edit: Actually it's obviously just that this set is not convex, or even connected.
> If you stir a glass of water, the theorem guarantees that there absolutely must be one particle of water that will end up in the same place it started from.
Wait what? If you stir a glass long enough, any configuration of particles should be possible. For instance you can imagine "cutting" the water like a deck of cards.
The full theorem is about a continuous mapping of a space back onto itself. So only the water in the glass, mapped back to the glass, counts. Pour it out into another glass, and you are no longer mapping back onto the glass. Pour it back into the original glass, and you are back to having a fixed point possible. And, obviously, only with respect to the mapping in the glass.
For problems that require recognition of patterns, whether in images or sounds, a non-linear function represented by a neural network and trained through gradient descent suffices to produce smoothly distributed fuzzy classification that just works.
This setup is not a good prior for anything that requires a more precise solution.
What would you search for, the zero/root in the derivative? Imagine the derivative is a function like y=x^2-eps with some small eps, say 0.000001. For binary search like that to work, you need to start with two values x1 and x2 such that sign(y1) != sign(y2). But with the derivative above those values are hard to find with random initialization.
More generally, for binary search to find a value x such that f(x)=0, we need to have f(any value less than x)<0 and f(any value greater than x)>0. We can't guarantee these properties for general derivatives.
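A small sketch (my own, arbitrary numbers) of why that bracketing condition bites for y = x^2 - eps: bisection only gets going once you have two points with opposite signs, and for this function those points are confined to a tiny interval around zero, so random initialization almost never finds a valid bracket.

```python
import random

# Bisection needs a bracket [a, b] with sign(f(a)) != sign(f(b)).
# For f(x) = x**2 - eps with eps = 1e-6, f is negative only on
# (-1e-3, 1e-3), so random starting pairs almost never bracket a root.
eps = 1e-6
f = lambda x: x * x - eps

random.seed(0)
brackets = 0
trials = 100_000
for _ in range(trials):
    a, b = random.uniform(-1, 1), random.uniform(-1, 1)
    if f(a) * f(b) < 0:
        brackets += 1

print(f"valid brackets: {brackets}/{trials}")  # only a tiny fraction
```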
I imagine you could search until the incremental change between your current and last result is below some threshold epsilon, as a kind of stopping criterion.
This site (Quanta Magazine) had really excellent content when it first launched, like the article in the OP; now it has become a sort of IFLScience in terms of editorializing.
This title is still a bit clickbaity, because ultimately this problem has very little impact on any real-world usage of stochastic gradient descent -- especially the more important ones, where getting infinitesimally close to any particular local minimum is specifically and totally not what people care about (and getting to a global minimum is something we've long known to give up on).
It is a theoretically interesting result about the PPAD complexity class itself. Not sure if it should be STOC best paper though, if it was nominated as best paper because of the weak link to machine learning I'd be a bit disappointed.
> ... elevation of the land is equal to the value of the function (the “profit”) at that particular spot. Gradient descent searches for the function’s local minimum ...
Take the Taylor expansion of f around x_n, drop the quadratic (and higher-order) terms, equate the result to 0, and you get an approximate solution x for the next iteration x_{n+1}:
x_{n+1} = x_n - Df(x_n)^{-1} * f(x_n)
Like this it's the Newton method. The problem is that the Jacobian Df(x_n) is a matrix of size k x k, and inverting it may require O(k^3) work.
So, pretty much all schemes are based on approximations of the Jacobian.
Gradient descent for example replaces Df(x_n)^{-1} with a scalar a, so it's O(1) work to "compute" it:
x_{n+1} = x_n - a * f(x_n)
A method like L-BFGS tries to build a relatively cheap approximation to the inverse of the Jacobian during iteration, resulting in O(k) work to compute:
x_{n+1} = x_n - P * f(x_n)
Other methods may exploit sparsity of the Jacobian, or solve the linear system Df(x_n)z = f(x_n) only approximately for z using e.g. iterative methods.
Note: in optimization problems, the function f is typically the gradient of a cost function, and the Jacobian is then the Hessian of that cost function.
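A minimal numeric sketch of that hierarchy (my own, not from the comment): a toy linear residual f(x) = A x - b, whose Jacobian is just A, comparing a full Newton step against the scalar-step iteration that gradient descent corresponds to. L-BFGS and sparse solvers would sit between these two extremes.

```python
import numpy as np

# Toy root-finding problem: f(x) = A @ x - b, so Df(x) = A everywhere.
# A is symmetric positive definite, so f is the gradient of a quadratic cost.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
f = lambda x: A @ x - b

def newton(x, steps=5):
    # x_{n+1} = x_n - Df(x_n)^{-1} f(x_n): an O(k^3) linear solve per step.
    for _ in range(steps):
        x = x - np.linalg.solve(A, f(x))
    return x

def scalar_step(x, a=0.2, steps=200):
    # x_{n+1} = x_n - a f(x_n): the scalar a stands in for Df(x_n)^{-1}.
    for _ in range(steps):
        x = x - a * f(x)
    return x

x0 = np.zeros(2)
print("Newton:     ", newton(x0))
print("scalar step:", scalar_step(x0))
print("exact:      ", np.linalg.solve(A, b))
```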
As far as I know, this is the #1 gradient descent superpower that makes it the preferred choice above all others. I don't think e.g. L-BFGS supports batching, I've certainly never seen it.
There are probably a hundred. Many people have asked for research into this for, I don't know, 20 years. But the hype doesn't die down and modern trends have mostly involved doing more GD with bigger models...
A colleague and myself experimented with some alternatives and passed notes once in a while... For certain modelling problems you can save literal GPU days by going against the grain and get better results.
Most ML papers I used to read fell into a couple of categories...
1. Baseless empirical result that probably was p-hacked.
2. Mostly untested result.
3. A rediscovery of something known for decades, without attribution to the source they probably read it from.
4. Incomplete description of problem, resolution, or probably fraud/incompetency.
5. Useless in general. Unverifiable, etc.
6. Review article.
7. A derivation with way more omissions/flaws than say an engineering paper.
I'm somewhat seasoned on optimization methods personally, but yeah, it seems once people go ML they tend to stop studying the fundamental literature that ML came from. "Online masters program: learn AI in 12 weeks from nothing!". Oh okay, so calculus won't be included in that... Or statistics... Or... Yep, it's going to be scikit-learn notebooks
...
It does feel like academic outsiders hacking/taking on the brand of serious academic research to give themselves authority.
> Baseless empirical result that probably was p-hacked
This to me seems like the biggest regression in science. It's all hearsay, which is very hard to reproduce or learn general lessons from. It feels like disparate social science methodologies are being used to study math.
Nobody is going to look back and benefit from these papers. I often bring up to ML folks limitations proven in the book Perceptrons and wonder how their models differ. I have never gotten a response.
I once tried talking to a colleague about how there is probably a fundamental limit to how deep a graph traversal chat bots can do. And he stared blankly at me and said "it'll work". As if chat bots can now solve NP-complete problems with ease because "AI"...
I'm so glad I left the field; the level of snake oil relative to substance is pretty hard to stomach.
Maybe we are wrong about complexity theory. I know people who have dedicated good chunks of their lives to studying it. One thing's for sure: if we are wrong about it, and have no basis for studying any of its exceptions, it's hard for me to accept hot takes like this as worth considering. The general "we can do ...insert currently impossible thing... because AI!" gets very old. Once had a boss request that light travel faster than it does - literally... "no" wasn't an acceptable answer. Anyway...
I float between a few technical fields. Some in natural science, computer science, data science hybrid roles, data bases/engineering, etc. Not a jack of all trades, nor a master of none. What I do have mastered isn't something people hire for, so basically I am an averagely smart person who will take any job and figure it out to pay the bills.
At home though I play with all of the areas of creation I can get my hands on. I guess I am just in the field of discovering new things and making things.
Gradient descent is great because it is first order, thus cheap to compute since you don’t need a Hessian; it points in a descent direction; it works with anything that is differentiable or piece-wise differentiable without caveats; and given the millions of parameters in today’s models, there is always a descent direction.
If you do population methods, you suffer in terms of memory because you need to keep all those candidates, evaluate them individually, and then update them. This bounds the memory you can use.
More memory means more parameters, which means you enter the interpolation regime, which means your model behaves well in the real world.
If you try to go with second-order optimisation methods then you need to go for Hessian-free methods, as computing the Hessian is computationally intractable for large NNs.
You can attempt to build a local model of the loss landscape but that is expensive and has many caveats.
Nocedal et al is a recommended read for numerical optimisation theory.
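A rough sketch (mine, not from any particular reference) of the usual Hessian-free trick: never materialize the Hessian, only compute Hessian-vector products, here approximated by a finite difference of two gradient evaluations on a made-up Rosenbrock-style cost.

```python
import numpy as np

# Hand-written gradient of a Rosenbrock-style cost (1-x0)^2 + 100*(x1-x0^2)^2.
def grad(x):
    dx0 = -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2)
    dx1 = 200 * (x[1] - x[0]**2)
    return np.array([dx0, dx1])

def hessian_vector_product(x, v, eps=1e-6):
    # H(x) @ v ~= (grad(x + eps * v) - grad(x)) / eps
    # Two gradient calls; the k x k Hessian is never formed.
    return (grad(x + eps * v) - grad(x)) / eps

x = np.array([0.5, 0.5])
v = np.array([1.0, 0.0])
print(hessian_vector_product(x, v))  # ~[102, -200] for this cost at (0.5, 0.5)
```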
> More memory means more parameters, which means you enter the interpolation regime, which means your model behaves well in the real world.
The surprising thing about large neural networks is that the difference in quality between the local minima goes down and makes it less and less relevant which one you end up in. The global minimum may also lead to overfitting so you probably don't even want to go there.
It is also the case that many such local minima are locally identical in structure, and you can jump between them via permutations of the parameters.
As a simple example, if you have a NN with one input and output neuron and a single two-neuron hidden layer (with same activation function), you can swap the weights (and bias) of the two neurons in the hidden layer and the result will be the same. Right?
Is there something to gain by trying to eliminate or exploit such symmetries?
These symmetries come up in lotteries. The lottery ticket hypothesis says that within an overparameterized ANN there exists a smaller, sparser neural network that performs at least as well as the original one and learns faster.
Re the example: yes, that is correct; a permutation is simply a row-wise shuffled identity matrix, and it doesn't affect the gradients or performance.
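A quick numeric check of that (my own sketch, with made-up weights and a tanh hidden layer): permuting the two hidden neurons, along with the matching output weights, leaves the network's output unchanged.

```python
import numpy as np

# Tiny network: 1 input -> 2 hidden (tanh) -> 1 output.
W1 = np.array([[0.7], [-1.3]])   # hidden weights (2 x 1)
b1 = np.array([0.2, -0.5])       # hidden biases
W2 = np.array([[1.1, -0.4]])     # output weights (1 x 2)

def net(x, W1, b1, W2):
    return W2 @ np.tanh(W1 @ x + b1)

# Swap the two hidden neurons: permute rows of W1/b1 and columns of W2.
P = np.array([[0.0, 1.0], [1.0, 0.0]])
W1p, b1p, W2p = P @ W1, P @ b1, W2 @ P.T

x = np.array([0.37])
print(net(x, W1, b1, W2), net(x, W1p, b1p, W2p))  # same output either way
```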
If accuracy isn't as important as speed for you, you can sample 1000 points and pick the smallest one. This point will probably be smaller than 99% of the function, but I wouldn't recommend deploying this in production.
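A toy version of that (my sketch, arbitrary test function): pure random search, no gradients, no guarantees beyond "probably better than most of the domain".

```python
import math
import random

# Random search: sample 1000 points, keep the one with the smallest value.
f = lambda x: math.cos(3 * x) + 0.1 * x * x   # arbitrary bumpy test function

random.seed(42)
samples = [random.uniform(-10, 10) for _ in range(1000)]
best = min(samples, key=f)
print(f"best x = {best:.3f}, f(best) = {f(best):.3f}")
```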
Not to sound snarky, but is there any example where theoretical computer science has actually been useful? The results are often like "Here is a proof this will be slow if you scale it up enough" -- "Yeah we knew that for 20 years, it doesn't matter"
Anything at scale. Anything that uses a database. Anything that uses anything smarter than bubblesort to organize results. Anything that uses a tree (or a trie) internally to represent the data and make some sort of order from it. CS isn't just about "hey this thing is slower than this thing", it's also about making things possible in the first place. If you've used Google (or Facebook or Microsoft or Netflix), they're only possible due to deep computer science work. Which videos gets chosen to be stored on the edge cache? Computer science algorithm.
Well, first off, I think there is an argument that creating & simulating algorithms is CS stuff anyway.
Second, it's often (not always) a lot more work to test scalability with benchmarks than it is to analyze it theoretically.
Especially because people are often worried about both average case and worst case. The input that takes down your site is not the one you expect, and simulating realistic load is difficult.
It can also be pretty hard to compare benchmarks. You probably don't want to test every algorithm known to man yourself, but two separate benchmarks on different hardware under different conditions cannot be reasonably compared. More abstract analysis can (to an extent, anyway).
Don't a lot of these algorithms come from TCS researchers themselves? They may publish about complexity but if that gives some people a better understanding of algorithms and leads to better ones being invented, it's a win.
Now that I've had a chance to think about it, it's not even "at scale". Your smartphone, which makes liberal use of sqlite databases uses a tree data structure to organize and index the information in the db files. When you turn on the screen to your phone or laptop, and see a bunch of icons laid out, it's thanks to computer science algorithm work that you're not waiting minutes or days for all the icons to get loaded from storage organized with a filesystem smarter than FAT, arranged how you like them, and displayed to a screen with millions of pixels.
Well, this did happen in machine learning. There was a theoretical argument that neural networks "should" terribly overfit according to statistical learning theory, since they have very large VC dimension, but they don't. Practice did race way ahead of theory in AI. Another example is the Chinchilla scaling law, which was discovered empirically instead of being derived from first principles.
That's different from complexity though. We actually understand why some algorithms have better complexity and results than others, while we mostly have no idea how intelligence works or even how to define it.
I liked a few results. For example, smoothed analysis by Spielman showed why so many algorithms work very well in practice even though their worst-case complexity is terrible, e.g. the simplex method. Second, concave-convex optimization shows that you will definitely get closer to the global optimum at each iteration. I guess it's comforting to know a reason algorithms behave the way they do.
The sampling theorem is also a case in point. Also compressed sensing. Information theory will tell you there is no point trying any harder at compression once you've gotten epsilon close to the theoretical limit.
> is there any example where theoretical computer science has actually been useful?
Anything requiring extremely optimized algorithms will generally start in academia and work its way outward as the techniques are adopted in commercial software. Sorting and compression algorithms are two examples that come to mind. Even games development which has a history of non-academic roots now borrows heavily from academic research to refine things like culling algorithms or optimized computational methods for specialized physics calculations.
Ah yes, I heard somewhere that Brian Karis at Epic was partially relying on prior work by an academic when he developed virtualized geometry for Unreal Engine 5. (I can't remember the name of the researcher though.)