
None really.

Gradient descent is great because it is first order, and thus cheap to compute since you don't need a Hessian; it points in a descent direction; it works with anything that is differentiable or piece-wise differentiable without caveats; and given the millions of parameters in today's models, there is always a descent direction.
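As a concrete illustration of the first-order point, here is a minimal NumPy sketch of gradient descent on a toy least-squares problem (the matrix, targets, step size, and iteration count are made up for illustration): each update needs only a gradient, never a Hessian.

    import numpy as np

    # Toy least-squares problem: minimise 0.5 * mean((A w - b)^2).
    rng = np.random.default_rng(0)
    A = rng.normal(size=(100, 10))   # made-up design matrix
    b = rng.normal(size=100)         # made-up targets
    w = np.zeros(10)                 # parameters
    lr = 0.1                         # step size

    for step in range(500):
        residual = A @ w - b
        grad = A.T @ residual / len(b)   # first-order information only
        w -= lr * grad                   # step along a descent direction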

If you do population methods, you suffer in terms of memory because you need to keep all those candidates in memory, evaluate them individually, and then update them. This bounds the number of parameters you can afford.
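For contrast, a rough sketch of a simple population method (a cross-entropy-style evolution strategy; the objective, dimensions, and population size are made up for illustration), where the memory cost is dominated by holding the entire population of parameter vectors at once:

    import numpy as np

    def loss(w):                     # hypothetical black-box objective
        return np.sum((w - 3.0) ** 2)

    dim, pop_size, sigma = 10_000, 64, 0.1
    rng = np.random.default_rng(0)
    mean = np.zeros(dim)

    for gen in range(100):
        # A (pop_size, dim) array: pop_size full copies of the parameters
        # live in memory, versus one vector plus one gradient for GD.
        population = mean + sigma * rng.normal(size=(pop_size, dim))
        scores = np.array([loss(ind) for ind in population])
        elite = population[np.argsort(scores)[: pop_size // 4]]
        mean = elite.mean(axis=0)    # recentre on the best quarter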

More memory means more parameters, which means you enter the interpolation regime, which means your model behaves well in the real world.

If you try to go with second-order optimisation methods, then you need to go for Hessian-free methods, as computing the full Hessian is computationally intractable for large NNs.
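The usual trick behind Hessian-free methods is to work only with Hessian-vector products via double backprop (Pearlmutter's trick), never materialising the Hessian itself. A sketch, assuming PyTorch and a placeholder model and batch:

    import torch

    model = torch.nn.Linear(1000, 1)                  # placeholder model
    x, y = torch.randn(64, 1000), torch.randn(64, 1)  # placeholder batch
    loss = torch.nn.functional.mse_loss(model(x), y)

    params = list(model.parameters())
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]         # arbitrary direction
    # d/dp (grad . v) = H v, computed without ever forming H
    hv = torch.autograd.grad(
        sum((g * vi).sum() for g, vi in zip(grads, v)), params)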

You can attempt to build a local model of the loss landscape but that is expensive and has many caveats.

Nocedal and Wright's Numerical Optimization is a recommended read for numerical optimisation theory.




>More memory means more parameters, which means you enter the interpolation regime, which means your model behaves well in the real world.

The surprising thing about large neural networks is that the difference in quality between local minima shrinks, making it less and less relevant which one you end up in. The global minimum may also lead to overfitting, so you probably don't even want to go there.


It is also the case that many such local minima are locally identical in structure, and you can move between them via permutations of the parameters.


As a simple example, if you have an NN with one input and one output neuron and a single two-neuron hidden layer (with the same activation function), you can swap the weights (and biases) of the two neurons in the hidden layer and the result will be the same. Right?

Is there something to gain by trying to eliminate or exploit such symmetries?


These symmetries come up in the lottery ticket hypothesis (Frankle & Carbin), which says that within an overparameterised ANN there exists a smaller, sparser network that performs at least as well as the original one and learns faster.
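A rough one-shot sketch of the lottery-ticket recipe (assuming PyTorch; the layer, prune fraction, and training loop are placeholders): train, keep the largest-magnitude weights, rewind the survivors to their initialisation, and retrain with the mask held fixed.

    import torch

    model = torch.nn.Linear(100, 10)                       # placeholder layer
    init_state = {k: v.clone() for k, v in model.state_dict().items()}

    # ... train `model` with ordinary SGD here ...

    w = model.weight.detach()
    k = int(0.8 * w.numel())                               # made-up prune fraction (~80%)
    threshold = w.abs().flatten().kthvalue(k).values
    mask = (w.abs() > threshold).float()                   # keep the top ~20% by magnitude

    with torch.no_grad():
        model.load_state_dict(init_state)                  # rewind to the original init
        model.weight.mul_(mask)                            # retrain with this mask held fixed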

Re the example: yes, that is correct. A permutation matrix is simply a row-shuffled identity matrix; applying it doesn't affect the gradients or the performance.
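A quick NumPy check of the 1-2-1 example above (random weights and a tanh activation, purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(2, 1)), rng.normal(size=2)  # input -> hidden
    W2, b2 = rng.normal(size=(1, 2)), rng.normal(size=1)  # hidden -> output

    def forward(x, W1, b1, W2, b2):
        return W2 @ np.tanh(W1 @ x + b1) + b2

    x = rng.normal(size=1)
    swap = [1, 0]   # permute the two hidden neurons
    print(np.allclose(forward(x, W1, b1, W2, b2),
                      forward(x, W1[swap], b1[swap], W2[:, swap], b2)))
    # prints True: the swapped network computes the same function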



