
None really.

Gradient descent is great because it is first order, and thus cheap to compute since you don't need a Hessian; it points in a descent direction; it works with anything that is differentiable or piece-wise differentiable without caveats; and given the millions of parameters in today's models, there is always a descent direction.
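As a concrete illustration of the first-order point, here is a minimal NumPy sketch of gradient descent on a toy least-squares problem (the matrix, targets, step size, and iteration count are made up for illustration): each update needs only a gradient, never a Hessian.

    import numpy as np

    # Toy least-squares problem: minimise 0.5 * mean((A w - b)^2).
    rng = np.random.default_rng(0)
    A = rng.normal(size=(100, 10))   # made-up design matrix
    b = rng.normal(size=100)         # made-up targets
    w = np.zeros(10)                 # parameters
    lr = 0.1                         # step size

    for step in range(500):
        residual = A @ w - b
        grad = A.T @ residual / len(b)   # first-order information only
        w -= lr * grad                   # step along a descent direction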

If you do population methods, you suffer in terms of memory because you need to keep all those candidates in memory, evaluate them individually, and then update them. This bounds the number of parameters you can afford.
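For contrast, a rough sketch of a simple population method (a cross-entropy-style evolution strategy; the objective, dimensions, and population size are made up for illustration), where the memory cost is dominated by holding the entire population of parameter vectors at once:

    import numpy as np

    def loss(w):                     # hypothetical black-box objective
        return np.sum((w - 3.0) ** 2)

    dim, pop_size, sigma = 10_000, 64, 0.1
    rng = np.random.default_rng(0)
    mean = np.zeros(dim)

    for gen in range(100):
        # A (pop_size, dim) array: pop_size full copies of the parameters
        # live in memory, versus one vector plus one gradient for GD.
        population = mean + sigma * rng.normal(size=(pop_size, dim))
        scores = np.array([loss(ind) for ind in population])
        elite = population[np.argsort(scores)[: pop_size // 4]]
        mean = elite.mean(axis=0)    # recentre on the best quarter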

More memory means more parameters, which means you enter the interpolation regime, which means your model behaves well in the real world.

If you try to go with second-order optimisation methods, then you need to go for Hessian-free methods, as computing the full Hessian is computationally intractable for large NNs.
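The usual trick behind Hessian-free methods is to work only with Hessian-vector products via double backprop (Pearlmutter's trick), never materialising the Hessian itself. A sketch, assuming PyTorch and a placeholder model and batch:

    import torch

    model = torch.nn.Linear(1000, 1)                  # placeholder model
    x, y = torch.randn(64, 1000), torch.randn(64, 1)  # placeholder batch
    loss = torch.nn.functional.mse_loss(model(x), y)

    params = list(model.parameters())
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]         # arbitrary direction
    # d/dp (grad . v) = H v, computed without ever forming H
    hv = torch.autograd.grad(
        sum((g * vi).sum() for g, vi in zip(grads, v)), params)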

You can attempt to build a local model of the loss landscape but that is expensive and has many caveats.

Nocedal and Wright's Numerical Optimization is a recommended read for numerical optimisation theory.




>More memory means more parameters, which means you enter the interpolation regime, which means your model behaves well in the real world.

The surprising thing about large neural networks is that the difference in quality between local minima shrinks, making it less and less relevant which one you end up in. The global minimum may also lead to overfitting, so you probably don't even want to go there.


It is also the case that many such local minima are locally identical in structure, and you can move between them via permutations of the parameters.


As a simple example, if you have an NN with one input and one output neuron and a single two-neuron hidden layer (with the same activation function), you can swap the weights (and biases) of the two neurons in the hidden layer and the result will be the same. Right?

Is there something to gain by trying to eliminate or exploit such symmetries?


These symmetries come up in the lottery ticket hypothesis (Frankle & Carbin), which says that within an overparameterised ANN there exists a smaller, sparser network that performs at least as well as the original one and learns faster.
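A rough one-shot sketch of the lottery-ticket recipe (assuming PyTorch; the layer, prune fraction, and training loop are placeholders): train, keep the largest-magnitude weights, rewind the survivors to their initialisation, and retrain with the mask held fixed.

    import torch

    model = torch.nn.Linear(100, 10)                       # placeholder layer
    init_state = {k: v.clone() for k, v in model.state_dict().items()}

    # ... train `model` with ordinary SGD here ...

    w = model.weight.detach()
    k = int(0.8 * w.numel())                               # made-up prune fraction (~80%)
    threshold = w.abs().flatten().kthvalue(k).values
    mask = (w.abs() > threshold).float()                   # keep the top ~20% by magnitude

    with torch.no_grad():
        model.load_state_dict(init_state)                  # rewind to the original init
        model.weight.mul_(mask)                            # retrain with this mask held fixed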

Re the example: yes, that is correct. A permutation matrix is simply a row-shuffled identity matrix; applying it doesn't affect the gradients or the performance.
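A quick NumPy check of the 1-2-1 example above (random weights and a tanh activation, purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(2, 1)), rng.normal(size=2)  # input -> hidden
    W2, b2 = rng.normal(size=(1, 2)), rng.normal(size=1)  # hidden -> output

    def forward(x, W1, b1, W2, b2):
        return W2 @ np.tanh(W1 @ x + b1) + b2

    x = rng.normal(size=1)
    swap = [1, 0]   # permute the two hidden neurons
    print(np.allclose(forward(x, W1, b1, W2, b2),
                      forward(x, W1[swap], b1[swap], W2[:, swap], b2)))
    # prints True: the swapped network computes the same function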



