I wonder if this could be used as a step towards explaining innate abilities. You're born with a human brain capable of language. A giraffe is born with a brain capable of walking.
There's just some predefined wiring that is a good bet, encoded into the phenotype (or passed on through some other mechanism) so that this knowledge carries over to future generations.
Future AI can just start off with a good baseline and build from there.
Reading the paper, I was thinking about the following: given the weights of two models w1 and w2, at each neuron k compute some average of the absolute difference between the outputs of neuron k over the training set. Perhaps the neurons with low differences are the ones that capture general knowledge shared by w1 and w2. Just an idea.
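Something like this rough sketch (assuming PyTorch models, and a hypothetical get_hidden_activations helper, e.g. a forward hook, that returns the (batch, n_neurons) activations at some chosen layer):

    import torch

    def mean_abs_neuron_diff(model_a, model_b, data_loader, get_hidden_activations):
        # For each neuron k: mean over the training data of |output in w1 - output in w2|
        total, count = None, 0
        for x, _ in data_loader:
            acts_a = get_hidden_activations(model_a, x)   # shape (batch, n_neurons)
            acts_b = get_hidden_activations(model_b, x)
            diff = (acts_a - acts_b).abs().sum(dim=0)
            total = diff if total is None else total + diff
            count += x.shape[0]
        return total / count   # low values = neurons that behave the same in both nets

Neurons with the smallest scores would be the candidates for shared general knowledge.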
(1) in different models, even of the exact same type and training approach, neuron number k will typically have a completely different role in each model. This is because the neural net is built from layers of neurons, and permuting the order of neurons within a layer gives an exactly equivalent network (see the small sketch after point 2). Because of randomization in initialization and training, any permutation of the resulting weights is equally likely to be produced. So at the very least you'd have to look for corresponding neurons k1 in net 1 vs k2 in net 2.
(2) most properties you might try to look for will not correspond to a single neuron but rather to a relationship between many different neurons. And since neural nets are so flexible that there are many ways to encode approximately the same function, there is no reason to expect that, for your chosen "general knowledge" item, net 1 and net 2 will use the same number of neurons or even the same general approach to encode it.
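To make point (1) concrete, here is a minimal sketch (plain NumPy, a made-up 2-layer net): permuting the hidden neurons, together with the matching columns of the next layer's weights, leaves the function unchanged, so "neuron k" has no fixed meaning across training runs.

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)   # input -> hidden
    W2, b2 = rng.normal(size=(4, 16)), rng.normal(size=4)    # hidden -> output
    relu = lambda z: np.maximum(z, 0)

    def forward(x, W1, b1, W2, b2):
        return W2 @ relu(W1 @ x + b1) + b2

    perm = rng.permutation(16)            # reorder the hidden neurons
    W1p, b1p = W1[perm], b1[perm]         # permute the hidden rows
    W2p = W2[:, perm]                     # permute the columns that read them

    x = rng.normal(size=8)
    print(np.allclose(forward(x, W1, b1, W2, b2),
                      forward(x, W1p, b1p, W2p, b2)))  # True: identical function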
They started with a pre-trained model, so the symmetries were already broken; despite whatever randomness there is in the fine-tuning process, it may well have followed the same groove every time in training.
I think you are absolutely right, but those same problems apply when the paper claims that the average of two models gives a good model. So in that case the weight space could have additional properties that make the proposed approach a little more plausible with some modifications. As you suggest, features are encoded in many different ways and across many neurons, so the suggested approach could only be applied to features that are encoded using a single neuron. To reduce the number of ways a feature can be encoded, the proposal could be applied to an encoding of both models, looking for matching neurons using the L1-norm of the difference of their outputs as the distance.
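Roughly along these lines (my own sketch, not from the paper), assuming you've already collected activation matrices acts_a and acts_b of shape (n_samples, n_neurons) from the same layer of each model:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_neurons(acts_a, acts_b):
        # cost[i, j] = sum over samples of |acts_a[:, i] - acts_b[:, j]|  (L1 distance)
        cost = np.abs(acts_a[:, :, None] - acts_b[:, None, :]).sum(axis=0)
        rows, cols = linear_sum_assignment(cost)      # best one-to-one matching
        return list(zip(rows, cols)), cost[rows, cols]

Pairs with small distances would be the candidates for features that both models happen to encode with a single neuron in the same way.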
>Because of randomization in initialization and training
Randomization in initialization seems like a pragmatic thing to do from a 'make the math work' perspective, but a really counter-intuitive one when compared to the training process of our own wetware substrate. I know it's not a fair comparison, but it's just an interesting thought to me.
Related is this one: https://arxiv.org/abs/1912.02757 which, as I understand it, finds that there's little to be gained by moving around locally in weight space versus finding another mode. So it's not surprising that a model positioned between two modes would function somewhat like an ensemble model.
Which of course raises the question of quantifying when averaging models (or, more generally, staying within their convex hull) is likely to remain in the same low-cost (high-performing) basin.
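One crude way to probe that (a sketch, assuming two compatible PyTorch state dicts and a hypothetical eval_loss helper) is to sweep the straight line between the two weight vectors and watch the loss:

    import copy
    import torch

    def loss_along_interpolation(model, state_a, state_b, eval_loss, steps=11):
        # Evaluate loss at evenly spaced points on the segment between the two models.
        losses = []
        for t in torch.linspace(0.0, 1.0, steps):
            mixed = {k: (1 - t) * state_a[k] + t * state_b[k] for k in state_a}
            probe = copy.deepcopy(model)
            probe.load_state_dict(mixed)
            losses.append(eval_loss(probe))   # loss on held-out data at this point
        return losses   # a flat, low curve suggests both endpoints share one basin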
For what it’s worth, I usually default to swish activations, which seem to be popular in my corner of graph neural nets (materials and chemistry). Performance is about the same as with ReLU, and I like swish because it doesn’t have a hard discontinuity in its derivative.
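In case anyone is unfamiliar, swish is just x * sigmoid(x) (also called SiLU). A quick sketch in plain NumPy:

    import numpy as np

    def swish(x, beta=1.0):
        return x / (1.0 + np.exp(-beta * x))   # equivalent to x * sigmoid(beta * x)

    def relu(x):
        return np.maximum(x, 0.0)              # kinked at zero; swish is smooth there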
Forgive my naive question, but what if the neural network’s output range is -1 to +1? If the activation functions are ReLU, doesn’t that mean a negative value cannot be produced?
A negative value cannot be produced by the ReLU itself, but in the hidden layers the output of each neuron is multiplied by the next layer's weights, and those weights can be negative, so the sign can be encoded there.
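Tiny worked example of what I mean:

    import numpy as np

    hidden = np.maximum(0.0, 2.5)   # ReLU output, always non-negative: 2.5
    output = -0.8 * hidden          # next layer's weight is negative
    print(output)                   # -2.0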
The strategy for most of these things is that most of the network builds up a bunch of "shapes", and you have a final layer that projects those appropriately into the output space. The intermediate layers can use basically any activation with desirable convergence properties, and at the end you might have a linear projection (or full MLP layer) followed by a sigmoid or other reshaping. The GPT family uses "softmax" -- an exponentially weighted normalizing function that scales all the outputs so they sum to 1 (since they represent probabilities for each possible next token).
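A minimal sketch of that shape (a toy example of my own, not the GPT architecture):

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(32, 64), nn.ReLU(),   # hidden layers: any activation with good
        nn.Linear(64, 64), nn.ReLU(),   # convergence properties works here
        nn.Linear(64, 10),              # linear projection into the output space
    )

    x = torch.randn(1, 32)
    logits = model(x)
    probs = torch.softmax(logits, dim=-1)  # reshaping step: probabilities summing to 1
    print(probs.sum())                     # ~1.0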
If your output range is bounded like that, then you probably want a sigmoid or tanh activation (with some shifting and scaling, maybe) on the output layer. But the hidden layers can still use ReLU without issue.
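For example (a toy PyTorch sketch for a -1 to +1 output range):

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(8, 32), nn.ReLU(),   # hidden layers can stay ReLU
        nn.Linear(32, 1),
        nn.Tanh(),                     # squashes the final output into (-1, 1)
    )

    print(model(torch.randn(4, 8)))    # all values fall in (-1, 1)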
A 'leaky ReLU' is used to retain some of the negative signal and prevent neurons from dying young. I just googled around and now see "PReLU", which also seems to address negative values.
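For comparison, in PyTorch terms: leaky ReLU lets a small fixed fraction of negative inputs through, while PReLU learns that slope during training.

    import torch
    import torch.nn as nn

    x = torch.tensor([-2.0, -0.5, 0.0, 1.0])
    print(nn.LeakyReLU(negative_slope=0.01)(x))  # tensor([-0.0200, -0.0050, 0.0000, 1.0000])
    print(nn.PReLU(init=0.25)(x))                # slope 0.25 initially, learned thereafter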