> High dimensional spaces are unlikely to have local optima, and probably don’t have any optima at all.
Can someone who knows more about DL than I do help me understand this a little better?
The article uses the analogy of walls:
> Just recall what is necessary for a set of parameters to be at a optimum. All the gradients need to be zero, and the hessian needs to be positive semidefinite. In other words, you need to be surrounded by walls. In 4 dimensions, you can walk through walls. GPT3 has 175 billion parameters. In 175 billion dimensions, walls are so far beneath your notice that if you observe them at all it is like God looking down upon individual protons.
I'm struggling to understand what this really means in 4+ dimensions. But when I try to envision it going from 1 or 2 to 3 dimensions, it doesn't seem obvious at all that a 3D space should have fewer local optima than a 2D space.
In fact, a "universal function" like a deep network seems like it should have more local optima. What am I missing?
The “physicist” explanation that I heard (meaning non-rigorous but good for building intuition) is that at every point where the derivative vanishes, for suitably random functions, each independent axis you can move along is about equally likely to curve upward or downward. In D dimensions there are D independent axes (e.g. in D=2 dimensions, there’s the north–south axis and the east–west axis), so there’s about a 1/2^D chance that any given critical point is actually a minimum, if those probabilities are independent. That gets small really fast at large D.
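To make that concrete, here's a quick Monte Carlo sketch of the same heuristic (my own toy model, not from the article): treat the Hessian at a random critical point as a random symmetric matrix and count how often every eigenvalue is positive, i.e. how often the critical point is a genuine local minimum.

```python
import numpy as np

rng = np.random.default_rng(0)

def fraction_minima(D, trials=20_000):
    """Estimate how often a random symmetric DxD 'Hessian' is positive definite."""
    hits = 0
    for _ in range(trials):
        A = rng.standard_normal((D, D))
        H = (A + A.T) / 2                      # random symmetric matrix
        if np.all(np.linalg.eigvalsh(H) > 0):  # all eigenvalues positive => minimum
            hits += 1
    return hits / trials

for D in (1, 2, 3, 4, 5):
    print(D, fraction_minima(D))
# The fraction starts at ~0.5 for D=1 and collapses toward zero as D grows,
# which is the "almost every critical point is a saddle" intuition.
```

For this particular ensemble the fraction actually falls even faster than 1/2^D, because the eigenvalues of a random symmetric matrix repel each other, but the qualitative picture is the same.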
But you have to be careful about that word "independent".
There's a reason things like 3D protein structure estimation, for example, are still very difficult problems: none of the coordinates are even approximately independent of the others.
So you're back to the standard "minimization is really difficult" situation, even in ultra-high-dimensional spaces.
Yeah, I was thinking about that as I was writing and trying to convey why I feel like deep models are different. I think one way of thinking about it is that protein structure, even though it has lots of parameters, is all happening within the confines of 3D space. A protein that could move in lots of dimensions at once could probably fold reliably much more easily, and it would be easy to find this structure.
That makes sense and explains why it's unlikely to have local minima, but I don't get the "no optima at all" argument. Why no global minima at all, just because of high dimensionality?
It’s a heuristic argument that critical points are extremely unlikely to be local minima (i.e. to have a positive definite second derivative). Loss surfaces of DNNs do typically have a global minimum (zero loss if they fit the training data exactly).
Arguably, a DNN seems likely to have many global minima: given the level of (over)parametrization commonly used, a set of parameters that gets the lowest possible loss won't be unique; there will be huge sets of parameters that give exactly identical results.
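One easy-to-check source of that non-uniqueness is permutation symmetry. Here's a tiny sketch (my own example, not from the thread): shuffle the hidden units of a two-layer ReLU net and you get a different parameter vector that computes exactly the same function, hence exactly the same loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random two-layer ReLU network: 8 inputs, 16 hidden units, 1 output.
W1, b1 = rng.standard_normal((16, 8)), rng.standard_normal(16)
W2, b2 = rng.standard_normal((1, 16)), rng.standard_normal(1)

def net(x, W1, b1, W2, b2):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

perm = rng.permutation(16)        # relabel the hidden units
x = rng.standard_normal(8)

y_original = net(x, W1, b1, W2, b2)
y_permuted = net(x, W1[perm], b1[perm], W2[:, perm], b2)
print(np.allclose(y_original, y_permuted))   # True: different weights, same outputs
```

With 16 hidden units that's already 16! equivalent parameter settings for one and the same function, before you even count the continuous symmetries that overparametrization adds.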
Doesn't the fact that the network is discrete (floats have finite precision) mean this isn't actually the case? There's a finite number of states the net can be in, and one (or more) is best.
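(For concreteness, here's what I mean by finite precision — a quick numpy check, nothing deep:)

```python
import numpy as np

# float32 has a fixed, finite set of representable values, so a net stored in
# float32 really does have only finitely many possible states.
print(np.finfo(np.float32))                              # precision, eps, range
print(np.nextafter(np.float32(1.0), np.float32(2.0)))    # the very next representable weight after 1.0
```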
Your intuition is off, but only slightly. As the dimensionality increases, the number of stationary points (zero gradient) also increases. But they become overwhelmingly likely to be saddle points rather than minima.
Think of the model as a point in high-dimensional space that you're trying to trap inside a cage (corresponding to it being at a local minimum, surrounded by higher points). So you're trying to build a cube in n dimensions, which has 2n sides. Now imagine that each of those sides will randomly be there or not, with probability p. To actually trap a model like GPT-3, p^(2n) with n = 175,000,000,000 has to be high enough that at some point during training you actually observe a case where every wall is present.
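To put some purely illustrative numbers on that cage picture (my own, not the article's): even if each wall is present with probability very close to 1, the chance that all 2n of them show up at once collapses for n = 175 billion. A quick sketch in log space so nothing underflows:

```python
import math

n = 175_000_000_000                      # GPT-3-sized parameter count
for p in (0.5, 0.999999, 1 - 1e-12):     # made-up per-wall probabilities
    log10_all_walls = 2 * n * math.log10(p)   # log10 of p**(2n)
    print(f"p = {p}:  P(all walls) ~ 10^{log10_all_walls:.3g}")

# p = 0.5       ->  ~10^(-1.05e11)  (absurdly small)
# p = 0.999999  ->  ~10^(-152000)   (still effectively zero)
# p = 1 - 1e-12 ->  ~10^(-0.152), about 0.7; only when p is *this* close to 1
#                   does trapping the model become plausible at all.
```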
My intuition here is that there is always a fair bit of noise in the (data -> label) pairs. Models tend to err on the side of being _too_ expressive to compensate (just add another layer, right?). Assuming our model is too expressive, we must stop training before we get too far down and start memorizing the training set. Put another way, one input pair will want to go down one path, another will want to go down a totally different path. Assuming the model is too expressive, we actually don't want to reach a minimum, because that essentially guarantees we've overfit.
What the author is saying is that optimization very quickly becomes a maze, and can turn into a combinatorial game. Each input starts down its own set of corridors (parameters at certain values) and it can take an _extremely_ long time for this maze to end. Any noise in the (data, label) pairs can make this maze have no end at all if the model is too small. If the model is too big, it's a moot point because it will have overfit by then.
Consider trying to classify shapes as either square or circle. The model outputs the probability of a circle. The absolute best you can do is to completely learn the training set: assign 1 to all the circles and 0 to all the squares.
It is typical to squish the set of all real numbers into the interval 0 to 1 (e.g. with a sigmoid). Any finite value will be squished to a value less than 1. So the model tries to make the number go higher and higher. No matter which model you have, you can always go a bit higher. Thus there is no optimal model.
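Here's that argument as a few lines of code (my sketch, using the sigmoid + cross-entropy setup described above): for a correctly classified circle, the loss keeps shrinking as the raw logit grows, but it never actually reaches zero, so no finite set of weights is "the" optimum.

```python
import math

# Loss of a single training circle (true label = 1) as a function of the raw
# logit the model produces before the sigmoid squish.
def loss(logit):
    return math.log1p(math.exp(-logit))   # = -log(sigmoid(logit)), computed stably

for logit in (1.0, 5.0, 10.0, 20.0, 40.0):
    print(f"logit {logit:5.1f}  ->  loss {loss(logit):.3e}")
# The loss strictly decreases as the logit grows and only approaches 0 in the
# limit, so you can always do "a bit better" with a bit more weight.
```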
The author mentions regularization, but strangely he then proceeds as if it didn't exist. Because with regularization, you can prove that there is a minimum. Basically (don't worry if you don't understand, I just include it for other readers): loss goes to infinity as the parameter norm goes to infinity, and loss has a lower bound, so it has a greatest lower bound. Take a sequence of points in parameter space with loss converging to this bound; the sequence is bounded (otherwise the regularization term would blow up), so by Bolzano–Weierstrass it has a convergent subsequence. Loss is a continuous function of the parameters, so the loss of the limit point is the limit of the losses, i.e. it's a minimum.
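Continuing the toy circle/square example from above (my numbers, just to illustrate the coercivity point): add an L2 penalty on the weight and the "just push the logit higher" strategy stops paying off, so a finite minimizer exists.

```python
import math

lam = 0.01   # regularization strength, chosen arbitrarily for illustration

def regularized_loss(w):
    # cross-entropy on a single correctly-labeled circle + L2 penalty on the weight
    return math.log1p(math.exp(-w)) + lam * w * w

for w in (0.0, 1.0, 3.0, 5.0, 10.0, 20.0):
    print(f"w = {w:5.1f}  ->  loss {regularized_loss(w):.4f}")
# The loss falls at first but turns back up for large w, so somewhere in between
# there is an honest minimum -- unlike the unregularized loss, which decreases forever.
```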
I think the point being made is that a deep network (GPT-3, with its 175B parameters, was the example) will, by virtue of its size, not have any 'local optima' in the sense that there is no traditional 'local' in these high-dimensional spaces. This is because as the number of dimensions increases, it becomes easier to move toward better parameter sets (or away from worse ones), so optimization algorithms don't have to be concerned about being trapped in a 'local optimum'. Also, because there are many good parameter sets, not all parameters even need to be used to get good results, which is why processes like distillation can work.
The concept of optima is heavily dependent upon constraints on dimensionality.
If you're surrounded by walls at a 3D coordinate (a "wall" in this sense is usually something akin to an a priori constraint on step size, i.e. the upper and lower bounds of the delta applied at the current step to try to find a new direction in numerical gradient descent), but you can arbitrarily "jump" along a fourth dimension, and there exists a point in that fourth dimension from which your position is no longer constrained from moving in the lower three dimensions, then you've essentially "avoided" a local optimum, because your optimization procedure can continue shifting to find something better.
At least this holds if we're talking gradient descent, where the definition of an optimum for a function is a point wherein any numerical deviation within your error tolerance always ends up converging back to the same point.
If you take that same technique and apply it to higher-dimensional spaces, the more dimensions you have, the less likely (in theory) your model is to get stuck.
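A tiny sketch of that "extra dimension as an escape hatch" idea (my own toy example, not the article's): on the 1D slice y = 0, the origin of f(x, y) = x^2 - y^2 looks like a minimum with walls on both sides, but once the y direction is available, plain gradient descent walks right out of it.

```python
import numpy as np

def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])   # gradient of f(x, y) = x**2 - y**2

p = np.array([0.5, 1e-6])   # start near the 1D "trap", with a tiny nudge in the new dimension
lr = 0.1
for _ in range(200):
    p = p - lr * grad(p)

print(p)   # x has collapsed toward 0, but y has escaped to a huge value:
           # the walls that existed in the x-only view never mattered.
```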
I know for a fact this doesn't always hold though, as with almost every GPT-2/3 model I come across I can still manage to get it snarled in predictive loops, where it does nothing but suggest the same thing over and over and over again, indicating a local optimum for the predictor.
One of my favorite ways to trip them up is generally some variant of "I once heard a story from a man who heard it from a man, who heard it from a man, ..." Usually that sets it up for the loop. Sometimes you need to massage it a bit, but it's generally pretty easy to lead the predictor into a loop.
If you really want to blow up the theory that there are no higher-dimensional optima, though, just look at other people. If there were no higher-dimensional optima, why would bad habits exist, and get converged on so readily that we actively have to discourage, label, or avoid them?
The fact is that for a general function simulator, the trick isn't never falling victim to higher-dimensional optima, but learning to recognize which ones you can safely tolerate and when you can't, because you really can't avoid the damn things in a resource-constrained or physically constrained problem space.
Autocomplete in GPT-3 isn't optimization at all, though? The optimization was in the training.
I've read that the loops you're talking about come from the decoder taking too many high-probability choices, whereas real-world text contains plenty of low-probability words.
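For what it's worth, here's a toy sketch of that decoding point (my made-up numbers, not GPT-3's actual sampler): always taking the single most likely token is exactly what produces loops, while sampling with a temperature occasionally picks the lower-probability words that real text is full of.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([3.0, 1.5, 1.0, 0.2])   # made-up scores for four candidate next tokens

def sample(logits, temperature=1.0):
    z = logits / temperature
    p = np.exp(z - z.max())   # softmax over the scaled scores
    p /= p.sum()
    return int(rng.choice(len(logits), p=p))

print([int(np.argmax(logits)) for _ in range(10)])   # greedy: token 0 forever -> loops
print([sample(logits) for _ in range(10)])           # sampled: mostly 0, but not always
```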