Gradient descent visualization (github.com/lilipads)
263 points by weinzierl 4 months ago | 36 comments



These are nice animations. However, I've always hesitated to get too enamored with these simple 2D visualizations of gradient descent, because one of the strange takeaways from deep learning is that behavior in high dimensions is very different from behavior in low dimensions.

In a 2D problem with many local minima, like the "eggholder function" [1], gradient descent will be hopeless. Yet neural net optimization in high dimensions really is a similar situation with many local minima, and there gradient descent does great.

Gradient descent in high dimensions also seems to have the ability to "step over" areas of high loss, which you can see by looking at the loss of a linear interpolation between weights at successive steps of gradient descent. This, again, seems like extremely strange behavior with no low-dimensional analogue.

[1] https://www.sfu.ca/~ssurjano/egg.html
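
To make the 2D failure mode concrete, here is a rough sketch (not from the linked repo) of plain gradient descent on the eggholder function, using finite differences for the gradient. From a generic starting point it usually ends up in a shallow local basin nowhere near the global minimum of about -959.6 at roughly (512, 404):

    import numpy as np

    def eggholder(p):
        x, y = p
        return (-(y + 47) * np.sin(np.sqrt(abs(x / 2 + y + 47)))
                - x * np.sin(np.sqrt(abs(x - (y + 47)))))

    def num_grad(f, p, eps=1e-5):
        # central finite differences, one coordinate at a time
        g = np.zeros_like(p)
        for i in range(len(p)):
            step = np.zeros_like(p)
            step[i] = eps
            g[i] = (f(p + step) - f(p - step)) / (2 * eps)
        return g

    p = np.array([0.0, 0.0])              # arbitrary starting point
    for _ in range(2000):
        p = np.clip(p - 0.5 * num_grad(eggholder, p), -512, 512)

    print(p, eggholder(p))                # typically stuck far above -959.6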


My understanding of this phenomenon in DL is that it's not due to anything intrinsic to gradient descent; the same principles and understanding apply.

Rather, it is that with very large dimensionality the probability that you spuriously get all derivatives to be zero is vanishingly small. That is, local minima are less likely because you need every one of a lot of dimensions to agree that df/dx_i = 0.

I may be wrong though!


If the probability that you get a derivative close to 0 is small, say only 10%, then you need just 3 dimensions to drive that multiplicative probability down to a tenth of a percent. Three is hardly “very large dimensionality”.

You can assign different numbers, but you will still find that you don’t need more than, say, 10 dimensions for this effect to kick in.
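
For what it's worth, the arithmetic in question (crudely assuming independence across coordinates):

    # If each partial derivative is "near zero" with probability p independently,
    # the chance that all d of them are near zero at the same point is p**d.
    p = 0.1
    for d in (1, 3, 10, 100):
        print(d, p ** d)    # roughly 0.1, 1e-3, 1e-10, 1e-100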


Behavior in high dimensions is even weirder than you might think, because you're nearly always close to a saddle point. There are way more saddle points than minima or maxima (consider modeling the signs of the eigenvalues of the Hessian as a binomial distribution -- how likely is it that they're all positive or all negative?).
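
To put a rough number on that coin-flip picture, here is a toy sketch that treats the Hessian at a critical point as a random symmetric Gaussian matrix (which a real loss landscape is not, but the qualitative point survives): by n = 10 you essentially never see all eigenvalues share a sign.

    import numpy as np

    rng = np.random.default_rng(0)

    def is_extremum(n):
        a = rng.standard_normal((n, n))
        h = (a + a.T) / 2                      # random symmetric "Hessian"
        eig = np.linalg.eigvalsh(h)
        return np.all(eig > 0) or np.all(eig < 0)

    for n in (2, 5, 10):
        frac = np.mean([is_extremum(n) for _ in range(20000)])
        print(n, frac)                         # fraction of minima/maxima drops fast;
                                               # nearly every critical point is a saddle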

However, usually only a relatively small subset of dimensions is important (read: represents a stable optimum); see the 'lottery ticket' hypothesis for more.

What's also weird is that "convergence" in these cases isn't really convergence, it's just slowly floating between saddle points as you begin to overfit. Hence early stopping.

> Gradient descent in high dimensions also seems to have the ability to "step over" areas of high loss

This has much less to do with gradients and much more to do with step size. It's an artifact of the discretization, not of the algorithm per se.


Might I suggest an insightful seminar for anyone curious about why gradient descent is even tractable in such high dimensions:

Stanford Seminar - A Picture of the Prediction Space of Deep Networks by Pratik Chaudhari:

https://youtu.be/ZD2cL-QoI5g?si=hVFU5_4CeoLYtyuB


Not super strange if you think of it going the other way. Think of a saddle-like construct: in one dimension you get stuck easily, while in another there is a way forward, and over time the previously 'stuck' dimension is freed up again. In higher dimensions you have a much higher chance of having such alternative pathways. During training this looks like brief plateaus in loss reduction followed by a sudden rapid decrease again.


Does gradient descent really do well for deep learning when the gradient is computed with respect to the whole dataset? I assumed that the noise in SGD played an important role for escaping local minima.


There aren't really local minima in most deep networks. When you get into millions/billions of parameters, there will essentially always be some directions that point downwards. You have to get really really close to the true minimum for there to be no direction to go that improves the loss.

Incidentally this same phenomenon is IMO how evolution is able to build things like the eye. Naively you'd think that since you need so many parts arranged so well, it's impossible to find a step by step path where fitness goes up at every step, i.e. if you just have a retina with no lens or vice-versa, it doesn't work. However, due to the high dimensionality of DNA, there is essentially guaranteed to be a path with monotonically increasing fitness just because there are so many different possible paths from A to B in the high dimensional space that at least one is bound to work.

Now this isn't strictly true for every high dimensional system. You need to have a degree of symmetry or redundancy in the encoding for it to work. For example, in convolutional neural networks, you see this phenomenon where some filters get "stuck" in training, and those are local minima for that subspace. What happens though is that if one filter gets stuck, the network will just use another one that had a better initialization. This is why pruning works, lottery tickets, etc. Things like residual connections enhance this effect since you can even be stuck in a whole layer and the training process can just bypass it.

You see the same thing with life, where you could put a sequence for the same protein in different parts of the genome and it could still be produced, regardless of the position. There are also many different ways to encode the exact same protein, and many different possible proteins that will have the same shape in the critical areas. Life finds a way.


If that were the case, we would be finding globally optimal solutions to complicated non-convex optimization problems.

The reality is different: you need to really explore the space to find the true global optimum.

A better explanation is that for ML you don't want a globally optimal solution that overindexes on your training set. You want a suboptimal solution that might also fit an unseen data set.


That is what people thought until around 2018, but it was wrong. It turns out that deep learning optimization problems have many global optima. In fact, when the number of parameters exceeds the number of data points, SGD reliably finds parameters that interpolate the training data with 0 loss. Surprisingly, most of these generalize well, and overfitting is not a big problem.

In other words, deep learning is a very special nonconvex optimization problem. A lot of our old intuition about optimization for ML is invalid in the overparameterized regime.


Why DL generalizes well is still an open research problem AFAIK. I've read numerous papers that try to argue one way or another why this works, and they are all interesting! One paper (that I found compelling, even though it didn't propose a thorough solution) showed that SGD successfully navigated around "bad" local minima (with bad generalization) and ended up in a "good" local minimum (that generalized well). Their interpretation was that, due to the S in SGD, it will only find wide loss basins, and thus the conclusion was that solutions that generalize well seem to have wider basins (for some reason).

Another paper showed that networks trained on roughly the same dataset, but starting from different random initializations, are related by a permutation symmetry in the loss landscape. You could always find a permutation of the weights that allowed you to then linearly interpolate between the two weight sets without climbing over a loss mountain. Also very interesting, even if it wasn't directly related to generalization performance. It has potential applications in network merging, I guess.
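
A sketch of the interpolation check from that second line of work (here `loss_on_dataset` and `permute_to_match` are hypothetical helpers standing in for the model evaluation and the paper's neuron-matching step, not real library calls):

    import numpy as np

    def interpolation_losses(w_a, w_b, loss_on_dataset, steps=21):
        # Loss along the straight line between two weight sets,
        # one value per interpolation coefficient t in [0, 1].
        losses = []
        for t in np.linspace(0.0, 1.0, steps):
            w_t = {k: (1 - t) * w_a[k] + t * w_b[k] for k in w_a}
            losses.append(loss_on_dataset(w_t))
        return losses

    # Without alignment there is usually a loss "mountain" in the middle;
    # after permuting w_b's neurons to match w_a the path can stay low:
    # interpolation_losses(w_a, permute_to_match(w_a, w_b), loss_on_dataset)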


I have read this in several places and want to learn more. Do you have a reference handy?


[1] was an empirical paper that inspired much theoretical follow-up.

[2] is one such follow-up, and the references therein should point to many of the other key works in the years between.

[3] introduces the neural tangent kernel (NTK), a theoretical tool used in much of this work. (Not everyone agrees that reliance on the NTK is the right way towards long-term theoretical progress.)

[4] is a more recent paper I haven't read yet that goes into more detail on interpolation. Its authors were well known in more "clean" parts of ML theory (e.g. bandits) and recently began studying deep learning.

---

[1] Understanding deep learning requires rethinking generalization. Zhang et al., arXiv, 2016. https://arxiv.org/abs/1611.03530

[2] Stochastic Mirror Descent on Overparameterized Nonlinear Models: Convergence, Implicit Regularization, and Generalization. Azizan et al., arXiv, 2019. https://arxiv.org/abs/1906.03830

[3] Neural Tangent Kernel: Convergence and Generalization in Neural Networks. Jacot et al., NeurIPS, 2018. https://proceedings.neurips.cc/paper/2018/hash/5a4be1fa34e62...

[4] A Universal Law of Robustness via Isoperimetry. Bubeck et al., NeurIPS, 2021. https://proceedings.neurips.cc/paper/2021/hash/f197002b9a085...


Awesome, thank you!


This is something I saw a talk about a while ago. There are probably more recent papers on this topic; you might want to browse the citations of this one:

https://arxiv.org/abs/2003.00307


> There aren't really local minima in most deep networks.

How so?

If there were no local minima other than the global one, we could use convex optimization methods that are far faster than SGD or Adam. The only reason these methods exist is that training deep networks is a non-convex optimization problem.


There are many nonconvex functions where every local minimum is global, and even many nonconvex functions with a unique local minimum (which is de facto global). Convex methods can fail on those.

The reason why GD and friends are a good choice in deep learning is that computing the gradient is cheap (and approximating it even more so). Every descent method relies on solving a subproblem of sorts, typically projecting the current iterate on a sublevel set of an approximation of the function, for some definition of projection. With GD, it's as cheap as it gets, just subtract a shrinked version of the gradient. Subproblems in other algorithms are a lot more expensive computationally, particularly at high dimensions. So more efficient as in requiring fewer function evaluations, yes, but at the cost of doing a lot more work at each step.


> behavior in high dimensions is very different from behavior in low dimensions

Keeping that in mind, it is still useful. Maybe one useful addition would be to reduce dimensionality: show that if, along some subset of the dimensions, we don't have a local minimum, then we don't have a local minimum in the higher-dimensional space we started with.


Note that such a 2D parameter space often gives the wrong intuition when thinking about applying gradient descent to a high-dimensional parameter space.

Also, mini-batch stochastic gradient descent behaves more stochastically than plain gradient descent.


> often gives the wrong intuition when thinking about applying gradient descent to a high-dimensional parameter space

Can you give some examples?


Not an expert in this field, but I'm guessing this is related to the unintuitive nature of geometry in high-dimensional spaces.

One rough example I can think of is the fact that the number of ways to move away from the origin increases exponentially (or even faster?) as the dimensionality goes up. There is _way_ more volume away from the origin than near it; I've seen this explained as something like "most of the volume of a high-dimensional orange is in the peel". One result of this is that samples of a standard Gaussian end up forming a "shell" rather than the "ball" you would expect (this phenomenon is called concentration of measure in general).
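
A quick numerical check of the "shell" claim:

    import numpy as np

    rng = np.random.default_rng(0)
    for d in (2, 100, 10000):
        x = rng.standard_normal((1000, d))
        r = np.linalg.norm(x, axis=1)
        # norms concentrate tightly around sqrt(d): a thin shell, not a filled ball
        print(d, r.mean() / np.sqrt(d), r.std() / r.mean())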

Also, very roughly, high-dimensional objects have lots of corners, and these corners are also very sharp. I would guess that gradient descent would get stuck in these corners and have a hard time getting out.

Some links related to this:

- Spikey Spheres: http://www.penzba.co.uk/cgi-bin/PvsNP.py?SpikeySpheres#HN2

- Thinking outside the 10-dimensional box - 3blue1brown: https://www.youtube.com/watch?v=zwAD6dRSVyI

- This is a fairly long talk about HMC, but it does talk about some problems that come up when sampling high-dimensional distributions: https://www.youtube.com/watch?v=pHsuIaPbNbY


Probably there are very few local minima, since the chance of all partial derivatives being 0 decreases with the number of dimensions.


Agreed, but I still think this is a useful tool for building intuition about a subset of problems gradient descent faces. The animation in the readme is similar to what I picture in my head for local minima for instance.


This visualization appears to be for batch gradient descent only, not stochastic or mini-batch.

That said, your critique would also be true of similar visualizations of gravity warping space-time, yet those visualizations are still widely used because they help with the initial understanding of the idea--why, for example, you hold one variable constant when taking the partial derivative in another dimension, and what effect changing the gradient descent step size has on the ability to find a local minimum.
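
For instance, "hold one variable constant" is exactly what a finite-difference partial derivative does (a toy example, not tied to this project):

    def f(x, y):
        return x**2 + 3 * x * y       # any toy surface

    def df_dx(x, y, eps=1e-6):
        # perturb x only; y is held constant, which is what "partial" means
        return (f(x + eps, y) - f(x - eps, y)) / (2 * eps)

    print(df_dx(1.0, 2.0))            # analytic answer: 2x + 3y = 8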


I vaguely remember something like this being explained in one of my math classes (calculus or linear equations). Not sure if the teacher was referring to the same problem:

If you were walking down a mountain following an algorithm that told you just to keep going down, you might get stuck in a local minimum, since you might reach a valley. Sometimes you need to go up to get out of the valley in order to keep going down.


In very high dimensional spaces (like trying to optimize a neural network with billions of parameters), to be "in a valley" you must be in a valley with respect to every one of the billions of dimensions. It turns out that in many practical settings, loss landscapes are pretty well-behaved in this regard, and you can almost always find some direction to keep going downward that lets you go around the next hill rather than over it.

This 2015 paper has some good examples (although it does sort of sweep some issues under the rug): https://arxiv.org/pdf/1412.6544


Is the claim that there aren't many local minima for high-dimensional problems, e.g. in neural network loss functions?


Yes. To be more specific, it's that nearly all points where the derivative is zero are saddle points rather than minima. Note that some portion of this nice behavior seems to be due to design choices in modern architectures, like residual connections, rather than being a general fact about all high dimensional problems.

https://arxiv.org/pdf/1712.09913 This paper has some nice visualizations.


When I took an ML class we solved that with a temperature parameter, to allow random deviations out of a local minimum. I wonder if there are novel methods now or if they're just improved versions of temperature?
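
The temperature trick we used was essentially simulated annealing; the usual acceptance rule looks roughly like this (a generic sketch, not from any specific course or library):

    import math, random

    def accept(delta_loss, temperature):
        # Always accept moves that improve the loss; accept uphill moves with
        # probability exp(-delta/T), so a higher temperature allows larger
        # random deviations out of a local minimum.
        if delta_loss <= 0:
            return True
        return random.random() < math.exp(-delta_loss / temperature)

In deep learning, the mini-batch noise in SGD and a decaying learning rate arguably play a similar annealing-like role, rather than an explicit temperature.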


Looks great!

Killer feature: allow arbitrary surfaces to be used (choice of 2D projection if in higher dimensions). And integrate in Python to allow use in production! Allow arbitrary zoom and level of detail of surface.


Here is a project I created for myself (some time ago) to help visualize the gradient as a vector field.

https://github.com/GistNoesis/VisualizeGradient

Probably best used as support material, with someone teaching the right mental picture one should have along the way.

A great exercise is to have the student (or yourself) draw this visualization with pen and paper (both in 2D and 3D) for various functions. And you can make the connection to the usual "tangent to the curve" picture of a derivative.


Ignoring the dimensionality issue for a moment, wouldn't this also represent neural network training on a single training example/batch rather than on the whole training set?

Still a beginner in this area, but my understanding is that the standard training procedure for neural networks used today involves computing the loss ("height" of the ball) and gradient ("slope" below the ball) for one instance, then moving the ball one step in the direction opposite the gradient (downhill), then repeating the process with a different instance.

(You will also typically group a small number of instances together in a batch and process them in the same iteration - but that's for efficiency reasons, not because it's essential to the training algorithm.)

So if you wanted to visualise the entire training procedure, I'd imagine an animation of a smoothly rolling ball but with a rapidly changing "landscape" underneath.

Would that be correct?

(Writing this out of genuine curiosity, not as a criticism of the project. I think a lot more work should be done on useful visualisations in this area - and this is an incredibly helpful start.)
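
Something like this toy least-squares loop is what I picture; the "landscape" the ball sees at each step is the loss of the current mini-batch only (a sketch with made-up data, not the repo's code):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 5))
    true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
    y = X @ true_w + 0.1 * rng.standard_normal(1000)

    w = np.zeros(5)
    lr, batch_size = 0.1, 32
    for step in range(500):
        idx = rng.integers(0, len(X), batch_size)     # a different batch each step
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size  # gradient of *this batch's* loss surface
        w -= lr * grad                                # one step downhill on that surface
    print(w)                                          # ends up close to true_w anyway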


I love creating things to solidify my intuition about topics. When I learned about gradient descent, I saw this repo and was inspired to create my own toy Python package for visualizing gradient descent.

My package, which uses PyVista for visualization:

https://github.com/JacobBumgarner/grad-descent-visualizer


I thought gradient descent always followed the steepest slope, but this looks like a physics simulation where marbles are rolling down hills. In mathematical gradient descent, do you really oscillate around the minimum like a marble rocking back and forth in a pit?

Edit: Oh, I see. The animations are "momentum" gradient descent
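
For reference, vanilla gradient descent on a bowl just decays monotonically toward the minimum, while the momentum variant builds up velocity, overshoots, and rocks back and forth, which matches the marble-like animation. A minimal 1D comparison (generic update rules, not the repo's exact code):

    def grad(x):                # f(x) = x^2, minimum at x = 0
        return 2 * x

    x_gd, x_mom, v = 5.0, 5.0, 0.0
    lr, beta = 0.1, 0.9
    for step in range(30):
        x_gd -= lr * grad(x_gd)             # vanilla: monotone decay toward 0
        v = beta * v - lr * grad(x_mom)     # momentum: velocity accumulates...
        x_mom += v                          # ...so it overshoots 0 and swings back
        print(step, round(x_gd, 3), round(x_mom, 3))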


This is great!

Which open source license is this under? (Absence of license by default implies 'copyrighted', which in this case could be in conflict with the Qt open source license. Note: I am not a lawyer.)


As an extremely visual thinker/learner, thank you for creating this!

Many things that were too abstract on paper and in formulas are MUCH easier to understand this way.



