
Calculus is all you need! Neural nets are trained to minimize their error: the discrepancy between what they actually output and what we want them to output. When we build a neural net we know the function that measures this output error, so training them (finding a minimum of the error function) is done just by following the gradient (the derivatives with respect to the parameters) of the error function.
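As a concrete illustration of that idea, here is a minimal sketch of gradient descent on a mean-squared error for a one-parameter linear model. The data, learning rate, and step count are all assumptions made up for the example, not anything from the comment:

    import numpy as np

    # Toy data from y = 3x plus a little noise (purely illustrative).
    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = 3.0 * x + 0.1 * rng.normal(size=100)

    w = 0.0    # the single parameter we are training
    lr = 0.1   # learning rate: how far to step along the negative gradient

    for step in range(200):
        err = w * x - y               # actual output vs desired output
        loss = np.mean(err ** 2)      # the error function being minimized
        grad = np.mean(2 * err * x)   # d(loss)/dw, the derivative we follow
        w -= lr * grad                # step downhill

    print(w)  # ends up close to 3.0, the minimum of the error function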



I think there are still open questions about this that are worth asking.

It is clear enough that following the gradient of a bounded differentiable function can bring you to a local minimum of the function (unless, I suppose, there is a path heading away from the starting location, off to infinity, along which the function keeps decreasing and only asymptotically approaches some value; but that situation can be prevented by adding loss terms that penalize parameters for growing too large).
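For the parenthetical point about penalizing large parameters, here is a hedged sketch of what such a penalty (L2 regularization / weight decay) looks like; the one-parameter model and the weight_decay value are assumptions for illustration:

    import numpy as np

    def penalized_loss(w, x, y, weight_decay=1e-2):
        data_loss = np.mean((w * x - y) ** 2)  # original error term
        penalty = weight_decay * w ** 2        # grows without bound as |w| grows
        return data_loss + penalty             # so the total can't keep decreasing as w runs off to infinity

    def penalized_grad(w, x, y, weight_decay=1e-2):
        return np.mean(2 * (w * x - y) * x) + 2 * weight_decay * w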

But, what determines whether it reaches a global minimum? Or, if it doesn’t reach a global minimum, what kinds of local minima are there, and what determines which kinds it is more likely to end up in? Does including momentum and stochastic stuff in the gradient descent influence the kinds of local minima that are likely to be approached? If so, in what way?


Local minima aren't normally a problem for neural nets since they usually have a very large number of parameters, meaning that the loss/error landscape has a correspondingly high number of dimensions. You might be at a local minimum along one of those dimensions, but the probability of simultaneously being at a local minimum along all of them is vanishingly small.
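A back-of-the-envelope version of that argument (my own rough heuristic, assuming, unrealistically, that the sign of the curvature along each dimension is an independent fair coin flip at a critical point of an n-dimensional loss landscape):

    P(\text{curvature positive along all } n \text{ directions, i.e. a local minimum}) \approx \left(\tfrac{1}{2}\right)^{n}

which is astronomically small when n is in the millions; under this heuristic almost every critical point is a saddle point rather than a local minimum.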

Different learning rate schedules, as well as momentum etc., can also help avoid getting stuck for too long in areas of the loss landscape that may not be local minima but are still slow to move out of. One more modern approach is to cycle between higher and lower learning rates rather than just using a monotonically decreasing schedule.
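A minimal sketch of both ideas together, SGD with momentum plus a simple triangular cyclical learning rate; the model, data, and every hyperparameter here are illustrative assumptions rather than anything recommended in the thread:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=200)
    y = 2.0 * x + 0.1 * rng.normal(size=200)

    def cyclical_lr(step, base_lr=0.01, max_lr=0.2, cycle_len=50):
        # Triangular cycle: ramp up for half the cycle, back down for the rest.
        pos = (step % cycle_len) / cycle_len
        scale = 2 * pos if pos < 0.5 else 2 * (1 - pos)
        return base_lr + (max_lr - base_lr) * scale

    w, velocity, momentum = 0.0, 0.0, 0.9
    for step in range(500):
        grad = np.mean(2 * (w * x - y) * x)   # full-batch gradient of the MSE
        velocity = momentum * velocity - cyclical_lr(step) * grad
        w += velocity                         # momentum carries the step through slow, flat regions

    print(w)  # settles near 2.0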

I'm not sure what the latest research is, but things like batch size and learning rate can certainly affect the minimum found, with some settings resulting in better generalization than others.
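For context on where batch size enters, a hedged sketch of minibatch SGD: the gradient is estimated from a random subset of the data each step, so smaller batches give noisier steps (one common intuition for why batch size changes which minimum you land in). Data and hyperparameters are, again, made up for the example:

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(size=1000)
    y = 1.5 * x + 0.1 * rng.normal(size=1000)

    w, lr, batch_size = 0.0, 0.05, 32
    for step in range(1000):
        idx = rng.choice(len(x), size=batch_size, replace=False)  # random minibatch
        grad = np.mean(2 * (w * x[idx] - y[idx]) * x[idx])        # noisy gradient estimate
        w -= lr * grad                                            # noise scale shrinks as batch_size grows

    print(w)  # close to 1.5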



