I wonder if this could be used as a step towards explaining innate abilities. You're born with a human brain capable of language. A giraffe is born with a brain capable of walking.
There's just some predefined wiring that is a good bet, encoded into the phenotype (or passed on through some other mechanism) so that this knowledge carries over to future generations.
Future AI can just start off with a good baseline and build from there.
Reading the paper, I was thinking about the following: given the weights of two models w1 and w2, at each neuron k compute some average of the absolute difference between the outputs of neuron k over the training set. Perhaps the neurons with low differences are the ones that capture general knowledge shared by w1 and w2. Just an idea.
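Something like this rough sketch (assuming PyTorch models, and a hypothetical get_hidden_activations helper, e.g. a forward hook, that returns the (batch, n_neurons) activations at some chosen layer):

    import torch

    def mean_abs_neuron_diff(model_a, model_b, data_loader, get_hidden_activations):
        # For each neuron k: mean over the training data of |output in w1 - output in w2|
        total, count = None, 0
        for x, _ in data_loader:
            acts_a = get_hidden_activations(model_a, x)   # shape (batch, n_neurons)
            acts_b = get_hidden_activations(model_b, x)
            diff = (acts_a - acts_b).abs().sum(dim=0)
            total = diff if total is None else total + diff
            count += x.shape[0]
        return total / count   # low values = neurons that behave the same in both nets

Neurons with the smallest scores would be the candidates for shared general knowledge.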
(1) in different models, even of the exact same type and training approach, neuron number k will typically have a completely different role in each model. This is because the neural net is built from layers of neurons, and permuting the order of neurons within a layer gives an exactly equivalent network (see the small sketch after point 2). Because of randomization in initialization and training, any permutation of the resulting weights is equally likely to be produced. So at the very least you'd have to look for corresponding neurons k1 in net 1 vs k2 in net 2.
(2) most properties you might try to look for will not correspond to a single neuron but rather to a relationship between many different neurons. And since neural nets are so flexible that there are many ways to encode approximately the same function, there is no reason to expect that, for your chosen "general knowledge" item, net 1 and net 2 will use the same number of neurons or even the same general approach to encode it.
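To make point (1) concrete, here is a minimal sketch (plain NumPy, a made-up 2-layer net): permuting the hidden neurons, together with the matching columns of the next layer's weights, leaves the function unchanged, so "neuron k" has no fixed meaning across training runs.

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)   # input -> hidden
    W2, b2 = rng.normal(size=(4, 16)), rng.normal(size=4)    # hidden -> output
    relu = lambda z: np.maximum(z, 0)

    def forward(x, W1, b1, W2, b2):
        return W2 @ relu(W1 @ x + b1) + b2

    perm = rng.permutation(16)            # reorder the hidden neurons
    W1p, b1p = W1[perm], b1[perm]         # permute the hidden rows
    W2p = W2[:, perm]                     # permute the columns that read them

    x = rng.normal(size=8)
    print(np.allclose(forward(x, W1, b1, W2, b2),
                      forward(x, W1p, b1p, W2p, b2)))  # True: identical function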
They started with a pre-trained model, so the symmetries were already broken; despite whatever randomness there is in the fine-tuning process, it may well have followed the same groove every time in training.
I think you are absolutely right, but those same problems apply when the paper claims that the average of two models gives a good model. So in that case the weight space could have additional properties that make the proposed approach a little more plausible with some modifications. As you suggest, features are encoded in many different ways and across many neurons, so the suggested approach could only be applied to features that are encoded using a single neuron. To reduce the number of ways a feature can be encoded, the proposal could be applied to an encoding of both models, looking for matching neurons using the L1-norm of the difference of their outputs as the distance.
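Roughly along these lines (my own sketch, not from the paper), assuming you've already collected activation matrices acts_a and acts_b of shape (n_samples, n_neurons) from the same layer of each model:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_neurons(acts_a, acts_b):
        # cost[i, j] = sum over samples of |acts_a[:, i] - acts_b[:, j]|  (L1 distance)
        cost = np.abs(acts_a[:, :, None] - acts_b[:, None, :]).sum(axis=0)
        rows, cols = linear_sum_assignment(cost)      # best one-to-one matching
        return list(zip(rows, cols)), cost[rows, cols]

Pairs with small distances would be the candidates for features that both models happen to encode with a single neuron in the same way.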
>Because of randomization in initialization and training
Randomization in initialization seems like a pragmatic thing to do from a 'make the math work' perspective, but a really counter-intuitive one when compared to the training process of our own wetware substrate. I know it's not a fair comparison, but it's just an interesting thought to me.
Related is this one: https://arxiv.org/abs/1912.02757 which, as I understand it, finds that there's little to be gained by moving around locally in weight space versus finding another mode. So it's not surprising that a model positioned between two modes would function somewhat like an ensemble model.
Which of course raises the question of quantifying when averaging models (or, more generally, staying within their convex hull) is likely to remain in the same low-cost (high-performing) basin.
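One crude way to probe that (a sketch, assuming two compatible PyTorch state dicts and a hypothetical eval_loss helper) is to sweep the straight line between the two weight vectors and watch the loss:

    import copy
    import torch

    def loss_along_interpolation(model, state_a, state_b, eval_loss, steps=11):
        # Evaluate loss at evenly spaced points on the segment between the two models.
        losses = []
        for t in torch.linspace(0.0, 1.0, steps):
            mixed = {k: (1 - t) * state_a[k] + t * state_b[k] for k in state_a}
            probe = copy.deepcopy(model)
            probe.load_state_dict(mixed)
            losses.append(eval_loss(probe))   # loss on held-out data at this point
        return losses   # a flat, low curve suggests both endpoints share one basin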
For what it’s worth, I usually default to swish activations, which seem to be popular in my corner of graph neural nets (materials and chemistry). Performance is about the same as with ReLU, and I like swish because it doesn’t have a hard discontinuity in its derivative.
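In case anyone is unfamiliar, swish is just x * sigmoid(x) (also called SiLU). A quick sketch in plain NumPy:

    import numpy as np

    def swish(x, beta=1.0):
        return x / (1.0 + np.exp(-beta * x))   # equivalent to x * sigmoid(beta * x)

    def relu(x):
        return np.maximum(x, 0.0)              # kinked at zero; swish is smooth there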
Forgive my naive question, but what if the neural network’s output range is -1 to +1? If the activation functions are ReLU, doesn’t that mean a negative value cannot be produced?
A negative value cannot be produced by the ReLU itself, but in the hidden layers the output of each neuron is multiplied by the next layer's weights, and those weights can be negative, so the sign can be encoded there.
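Tiny worked example of what I mean:

    import numpy as np

    hidden = np.maximum(0.0, 2.5)   # ReLU output, always non-negative: 2.5
    output = -0.8 * hidden          # next layer's weight is negative
    print(output)                   # -2.0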
The strategy for most of these things is that most of the network builds up a bunch of "shapes", and you have a final layer that projects those appropriately into the output space. The intermediate layers can use basically any activation with desirable convergence properties, and at the end you might have a linear projection (or full MLP layer) followed by a sigmoid or other reshaping. The GPT family uses "softmax" -- an exponentially weighted normalizing function that scales all the outputs so they sum to 1 (since they represent probabilities for each possible next token).
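A minimal sketch of that shape (a toy example of my own, not the GPT architecture):

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(32, 64), nn.ReLU(),   # hidden layers: any activation with good
        nn.Linear(64, 64), nn.ReLU(),   # convergence properties works here
        nn.Linear(64, 10),              # linear projection into the output space
    )

    x = torch.randn(1, 32)
    logits = model(x)
    probs = torch.softmax(logits, dim=-1)  # reshaping step: probabilities summing to 1
    print(probs.sum())                     # ~1.0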
If your output range is bounded like that, then you probably want a sigmoid or tanh activation (with some shifting and scaling, maybe) on the output layer. But the hidden layers can still use ReLU without issue.
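For example (a toy PyTorch sketch for a -1 to +1 output range):

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(8, 32), nn.ReLU(),   # hidden layers can stay ReLU
        nn.Linear(32, 1),
        nn.Tanh(),                     # squashes the final output into (-1, 1)
    )

    print(model(torch.randn(4, 8)))    # all values fall in (-1, 1)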
A 'leaky ReLU' is used to retain some of the negative signal and prevent neurons from dying young. I just googled around and now see "PReLU", which also seems to address negative values.
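For comparison, in PyTorch terms: leaky ReLU lets a small fixed fraction of negative inputs through, while PReLU learns that slope during training.

    import torch
    import torch.nn as nn

    x = torch.tensor([-2.0, -0.5, 0.0, 1.0])
    print(nn.LeakyReLU(negative_slope=0.01)(x))  # tensor([-0.0200, -0.0050, 0.0000, 1.0000])
    print(nn.PReLU(init=0.25)(x))                # slope 0.25 initially, learned thereafter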