> High dimensional spaces are unlikely to have local optima, and probably don’t have any optima at all.
Can someone who knows more about DL than I do help me understand this a little better?
The article uses the analogy of walls:
> Just recall what is necessary for a set of parameters to be at a optimum. All the gradients need to be zero, and the hessian needs to be positive semidefinite. In other words, you need to be surrounded by walls. In 4 dimensions, you can walk through walls. GPT3 has 175 billion parameters. In 175 billion dimensions, walls are so far beneath your notice that if you observe them at all it is like God looking down upon individual protons.
I'm struggling to understand what this really means in 4+ dimensions. But when I try to envision it going from 1 or 2 to 3 dimensions, it doesn't seem obvious at all that a 3D space should have fewer local optima than a 2D space.
In fact, a "universal function" like a deep network seems like it should have more local optima. What am I missing?
The “physicist” explanation that I heard (meaning non-rigorous but good for building intuition) is that at every point where the derivative vanishes, for suitably random functions, each independent axis you can move along is about equally likely to curve upward or downward. In D dimensions there are D independent axes (e.g. in D=2 dimensions, there’s the north–south axis and the east–west axis), so there’s about a 1/2^D chance that any given critical point is actually a minimum, if those probabilities are independent. That gets small really fast at large D.
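To make that concrete, here's a quick Monte Carlo sketch of the same heuristic (my own toy model, not from the article): treat the Hessian at a random critical point as a random symmetric matrix and count how often every eigenvalue is positive, i.e. how often the critical point is a genuine local minimum.

```python
import numpy as np

rng = np.random.default_rng(0)

def fraction_minima(D, trials=20_000):
    """Estimate how often a random symmetric DxD 'Hessian' is positive definite."""
    hits = 0
    for _ in range(trials):
        A = rng.standard_normal((D, D))
        H = (A + A.T) / 2                      # random symmetric matrix
        if np.all(np.linalg.eigvalsh(H) > 0):  # all eigenvalues positive => minimum
            hits += 1
    return hits / trials

for D in (1, 2, 3, 4, 5):
    print(D, fraction_minima(D))
# The fraction starts at ~0.5 for D=1 and collapses toward zero as D grows,
# which is the "almost every critical point is a saddle" intuition.
```

For this particular ensemble the fraction actually falls even faster than 1/2^D, because the eigenvalues of a random symmetric matrix repel each other, but the qualitative picture is the same.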
But you have to be careful about that word "independent".
There's a reason things like 3D protein structure estimation, for example, are still very difficult problems: none of the coordinates are even approximately independent of the others.
So you're back to the standard "minimization is really difficult" situation, even in ultra-high-dimensional spaces.
Yeah, I was thinking about that as I was writing and trying to convey why I feel like deep models are different. I think one way of thinking about it is that protein structure, even though it has lots of parameters, is all happening within the confines of 3D space. A protein that could move in lots of dimensions at once could probably fold reliably much more easily, and it would be easy to find this structure.
That makes sense and explains why it's unlikely to have local minima, but I don't get the "no optima at all" argument. Why no global minima at all, just because of high dimensionality?
It’s a heuristic argument that critical points are extremely unlikely to be local minima (i.e. to have a positive definite second derivative). Loss surfaces of DNNs do typically have a global minimum (zero loss if they fit the training data exactly).
Arguably, a DNN seems likely to have many global minima: given the level of (over)parametrization commonly used, a set of parameters that gets the lowest possible loss won't be unique; there will be huge sets of parameters that give exactly identical results.
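One easy-to-check source of that non-uniqueness is permutation symmetry. Here's a tiny sketch (my own example, not from the thread): shuffle the hidden units of a two-layer ReLU net and you get a different parameter vector that computes exactly the same function, hence exactly the same loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random two-layer ReLU network: 8 inputs, 16 hidden units, 1 output.
W1, b1 = rng.standard_normal((16, 8)), rng.standard_normal(16)
W2, b2 = rng.standard_normal((1, 16)), rng.standard_normal(1)

def net(x, W1, b1, W2, b2):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

perm = rng.permutation(16)        # relabel the hidden units
x = rng.standard_normal(8)

y_original = net(x, W1, b1, W2, b2)
y_permuted = net(x, W1[perm], b1[perm], W2[:, perm], b2)
print(np.allclose(y_original, y_permuted))   # True: different weights, same outputs
```

With 16 hidden units that's already 16! equivalent parameter settings for one and the same function, before you even count the continuous symmetries that overparametrization adds.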
Doesn't the fact that the network is discrete (floats have finite precision) mean this isn't actually the case? There's a finite number of states the net can be in, and one (or more) is best.
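(For concreteness, here's what I mean by finite precision — a quick numpy check, nothing deep:)

```python
import numpy as np

# float32 has a fixed, finite set of representable values, so a net stored in
# float32 really does have only finitely many possible states.
print(np.finfo(np.float32))                              # precision, eps, range
print(np.nextafter(np.float32(1.0), np.float32(2.0)))    # the very next representable weight after 1.0
```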
Your intuition is off, but only slightly. As the dimensionality increases, the number of stationary points (zero gradient) also increases. But they become overwhelmingly likely to be saddle points rather than minima.
Think of the model as a point in high-dimensional space that you're trying to trap inside a cage (corresponding to it being at a local minimum, surrounded by higher points). So you're trying to build a cube in n dimensions, which has 2n sides. Now imagine that each of those sides will randomly be there or not, with probability p. To actually trap a model like GPT-3, p^(2n) with n = 175,000,000,000 has to be high enough that at some point during training you actually observe a case where every wall is present.
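To put some purely illustrative numbers on that cage picture (my own, not the article's): even if each wall is present with probability very close to 1, the chance that all 2n of them show up at once collapses for n = 175 billion. A quick sketch in log space so nothing underflows:

```python
import math

n = 175_000_000_000                      # GPT-3-sized parameter count
for p in (0.5, 0.999999, 1 - 1e-12):     # made-up per-wall probabilities
    log10_all_walls = 2 * n * math.log10(p)   # log10 of p**(2n)
    print(f"p = {p}:  P(all walls) ~ 10^{log10_all_walls:.3g}")

# p = 0.5       ->  ~10^(-1.05e11)  (absurdly small)
# p = 0.999999  ->  ~10^(-152000)   (still effectively zero)
# p = 1 - 1e-12 ->  ~10^(-0.152), about 0.7; only when p is *this* close to 1
#                   does trapping the model become plausible at all.
```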
My intuition here is that there is always a fair bit of noise in the (data -> label) pairs. Models tend to err on the side of being _too_ expressive to compensate (just add another layer, right?). Assuming our model is too expressive, we must stop training before we get too far down and start memorizing the training set. Put another way, one input pair will want to go down one path, another will want to go down a totally different path. Assuming the model is too expressive, we actually don't want to reach a minimum, because that essentially guarantees we've overfit.
What the author is saying is that optimization very quickly becomes a maze, and can turn into a combinatorial game. Each input starts down its own set of corridors (parameters at certain values) and it can take an _extremely_ long time for this maze to end. Any noise in the (data, label) pairs can make this maze have no end at all if the model is too small. If the model is too big, it's a moot point because it will have overfit by then.
Consider trying to classify shapes as either square or circle. The model outputs the probability of a circle. The absolute best you can do is to completely learn the training set: assign 1 to all the circles and 0 to all the squares.
It is typical to squish the set of all real numbers into the interval 0 to 1 (e.g. with a sigmoid). Any finite value will be squished to a value less than 1. So the model tries to make the number go higher and higher. No matter which model you have, you can always go a bit higher. Thus there is no optimal model.
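Here's that argument as a few lines of code (my sketch, using the sigmoid + cross-entropy setup described above): for a correctly classified circle, the loss keeps shrinking as the raw logit grows, but it never actually reaches zero, so no finite set of weights is "the" optimum.

```python
import math

# Loss of a single training circle (true label = 1) as a function of the raw
# logit the model produces before the sigmoid squish.
def loss(logit):
    return math.log1p(math.exp(-logit))   # = -log(sigmoid(logit)), computed stably

for logit in (1.0, 5.0, 10.0, 20.0, 40.0):
    print(f"logit {logit:5.1f}  ->  loss {loss(logit):.3e}")
# The loss strictly decreases as the logit grows and only approaches 0 in the
# limit, so you can always do "a bit better" with a bit more weight.
```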
The author mentions regularization, but strangely he then proceeds as if it didn't exist. Because with regularization, you can prove that there is a minimum. Basically (don't worry if you don't understand, I just include it for other readers): loss goes to infinity as the parameter norm goes to infinity, and loss has a lower bound, so it has a greatest lower bound. Take a sequence of points in parameter space with loss converging to this bound; the sequence is bounded (otherwise the regularization term would blow up), so by Bolzano–Weierstrass it has a convergent subsequence. Loss is a continuous function of the parameters, so the loss of the limit point is the limit of the losses, i.e. it's a minimum.
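Continuing the toy circle/square example from above (my numbers, just to illustrate the coercivity point): add an L2 penalty on the weight and the "just push the logit higher" strategy stops paying off, so a finite minimizer exists.

```python
import math

lam = 0.01   # regularization strength, chosen arbitrarily for illustration

def regularized_loss(w):
    # cross-entropy on a single correctly-labeled circle + L2 penalty on the weight
    return math.log1p(math.exp(-w)) + lam * w * w

for w in (0.0, 1.0, 3.0, 5.0, 10.0, 20.0):
    print(f"w = {w:5.1f}  ->  loss {regularized_loss(w):.4f}")
# The loss falls at first but turns back up for large w, so somewhere in between
# there is an honest minimum -- unlike the unregularized loss, which decreases forever.
```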
I think the point being made is that a deep network (GPT-3, with its 175B parameters, was the example) will, by virtue of its size, not have any 'local optima' in the sense that there is no traditional 'local' in these high-dimensional spaces. This is because as the number of dimensions increases, it becomes easier to move toward better parameter sets (or away from worse ones), so optimization algorithms don't have to be concerned about being trapped in a 'local optimum'. Also, because there are many good parameter sets, not all parameters even need to be used to get good results, which is why processes like distillation can work.
The concept of optima is heavily dependent upon constraints on dimensionality.
If you're surrounded by walls at a 3D coordinate (a "wall" in this sense is usually something akin to an a priori constraint on step size, i.e. the upper and lower bounds of the delta applied at the current step to try to find a new direction in numerical gradient descent), but you can arbitrarily "jump" along a fourth dimension, and there exists a point in that fourth dimension from which your position is no longer constrained from moving in the lower three dimensions, then you've essentially "avoided" a local optimum, because your optimization procedure can continue shifting to find something better.
At least this holds if we're talking gradient descent, where the definition of an optimum for a function is a point wherein any numerical deviation within your error tolerance always ends up converging back to the same point.
If you take that same technique and apply it to higher-dimensional spaces, the more dimensions you have, the less likely (in theory) your model is to get stuck.
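A tiny sketch of that "extra dimension as an escape hatch" idea (my own toy example, not the article's): on the 1D slice y = 0, the origin of f(x, y) = x^2 - y^2 looks like a minimum with walls on both sides, but once the y direction is available, plain gradient descent walks right out of it.

```python
import numpy as np

def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])   # gradient of f(x, y) = x**2 - y**2

p = np.array([0.5, 1e-6])   # start near the 1D "trap", with a tiny nudge in the new dimension
lr = 0.1
for _ in range(200):
    p = p - lr * grad(p)

print(p)   # x has collapsed toward 0, but y has escaped to a huge value:
           # the walls that existed in the x-only view never mattered.
```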
I know for a fact this doesn't always hold though, as with almost every GPT-2/3 model I come across I can still manage to get it snarled in predictive loops, where it does nothing but suggest the same thing over and over and over again, indicating a local optimum for the predictor.
One of my favorite ways to trip them up is generally some variant of "I once heard a story from a man who heard it from a man, who heard it from a man, ..." Usually that sets it up for the loop. Sometimes you need to massage it a bit, but it's generally pretty easy to lead the predictor into a loop.
If you really want to blow up the theory that there are no higher-dimensional optima, though, just look at other people. If there were no higher-dimensional optima, why would bad habits exist, and get converged on so readily that we actively have to discourage, label, or avoid them?
The fact is that for a general function simulator, the trick isn't never falling victim to higher-dimensional optima, but learning to recognize which ones you can safely tolerate and when you can't, because you really can't avoid the damn things in a resource-constrained or physically constrained problem space.
Autocomplete in GPT-3 isn't optimization at all, though? The optimization was in the training.
I've read that the loops you're talking about come from the decoder taking too many high-probability choices, whereas real-world text contains plenty of low-probability words.
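For what it's worth, here's a toy sketch of that decoding point (my made-up numbers, not GPT-3's actual sampler): always taking the single most likely token is exactly what produces loops, while sampling with a temperature occasionally picks the lower-probability words that real text is full of.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([3.0, 1.5, 1.0, 0.2])   # made-up scores for four candidate next tokens

def sample(logits, temperature=1.0):
    z = logits / temperature
    p = np.exp(z - z.max())   # softmax over the scaled scores
    p /= p.sum()
    return int(rng.choice(len(logits), p=p))

print([int(np.argmax(logits)) for _ in range(10)])   # greedy: token 0 forever -> loops
print([sample(logits) for _ in range(10)])           # sampled: mostly 0, but not always
```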