Co-author here! I'm kind of surprised that this made it to the top of HN! This was a project in which Joseph and I tried to reverse-engineer the mechanism by which GPT-2 predicts the word 'an'.
It's crazy that large language models work so well just by being trained as next-word-prediction models over a large amount of text data. We know how image models learn to extract the features of an image through convolution[1], but how and what LLMs learn exactly remains a black box. When we dig deeper into the mechanisms that drive LLMs, we might get closer to understanding why they work so well in some senses, and why they could be catastrophic in other cases (see: the past month of search-based developments).
I find trying to understand and reverse-engineer LLMs to be a personally exciting endeavour. As LLMs get better in the near future, I sure hope our understanding of them can keep up as well!
It seems hard to explain how Bing/GPT could have generated the Vonnegut-inspired cake story, having ingested the rules, without planning the whole thing before generating the first word.
It seems there's an awful lot more going on internally in these models than a mere word by word autoregressive generation. It seems the prompt (in this case including Vonnegut's rules) is ingested and creates a complex internal state that is then responsible for the coherency and content of the output. The fact that it necessarily has to generate the output one word at a time seems to be a bit misleading in terms of understanding when the actual "output prediction" takes place.
There is "long range" dependence, it's just only on the prompt: the conversation with the user and the hidden header (e.g. "Answer as ChatGPT, an intelligent AI, state your reasons, be succinct, etc."). That ends up being enough.
Sure, but the point being discussed is that despite the word by word output, the output does not appear to be "chosen" on a word by word basis. OP investigated the case where the word "an" anticipates the following word ("an apple" vs "a pear").
>so there's some kind of plan and execute going on. maybe it can do that in model some how
The simple answer is that the internal state that picks the next token is stable over iterations so that the model can follow a consistent plan over multiple token outputs. Then as the plan "unfolds" in the output tokens, these tokens help stabilize the plan further, thus creating consistency over long generations.
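To make the mechanics concrete, here's a minimal sketch of that loop with GPT-2 via Hugging Face transformers (the prompt is just an illustration; chat models work the same way, with the hidden header prepended). Every step re-reads the whole context - prompt plus everything generated so far - so whatever "plan" those tokens encode gets recomputed and reinforced on each iteration:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Greedy decoding, written out by hand to show the loop structure.
    tokens = tok("Here are some rules for writing a short story:", return_tensors="pt").input_ids
    with torch.no_grad():
        for _ in range(20):
            logits = model(tokens).logits[0, -1]   # forward pass over ALL tokens so far
            next_id = torch.argmax(logits)         # pick the next token (greedy, for simplicity)
            tokens = torch.cat([tokens, next_id.view(1, 1)], dim=-1)  # the choice joins the context
    print(tok.decode(tokens[0]))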
Did you check the Vonnegut writing rules example I posted at the top of this thread - in particular, look at Bing/GPT's explanation of how its cake story matches up to Vonnegut's rules? It's hard to imagine how it could have come up with such a coherent story, checking all the rules, if it was only conceiving of its continuing story on a word by word basis. It's not as if sentence #1 matches rule number 1, sentence 2 matches rule number 2, etc. It seems there had to be some holistic composition for it to do that.
Note too that despite the output being sampled from a distribution based on a "randomness" temperature, there are many cases where what it is trying to say constrains the output so much that certain words/synonyms/concepts are all but forced.
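For intuition, here's a toy illustration with made-up logits: when the model's score for one token is far above the rest, even a fairly high temperature rarely changes the pick.

    import numpy as np

    def probs(logits, temperature):
        z = np.array(logits, dtype=float) / temperature
        z -= z.max()                     # subtract max for numerical stability
        p = np.exp(z)
        return p / p.sum()

    scores = {" an": 9.0, " a": 4.0, " the": 2.0}   # hypothetical, hand-picked logits
    for t in (0.7, 1.0, 1.5):
        print(t, dict(zip(scores, probs(list(scores.values()), t).round(3))))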
It's easy to see that it's not just doing one token at a time but is anticipating future tokens. Consider the context of a Q&A. The response might start with any of a number of words, and exactly which word depends on what comes after. But if it randomly chooses the wrong word, it will either be forced to complete the wrong answer, or be backed into a corner and engage in circumlocutions to course-correct. This doesn't happen in practice for recent big models.
Convolution is part of the network design though.
Would a fully connected network learn to convolve? Or would it turn out that convolution is not necessary?
The interesting part here isn't the convolution itself, it's how convolutional layers turn out to act like "filters" or "detectors" for individual features. This is explained very well in the distill.pub article linked by GP.
We know the architecture of LLMs because we created it, but we don't yet have the same level of understanding about them, or the same quality of analytical tools for reasoning about them.
They do, and in fact it's relatively straightforward to show empirically on e.g. MNIST. The problem is that you need a much, much larger network in the FCN case and thus way more data and way more data augmentation to get a good result that isn't overfit to hell.
In the case of CNNs, the reason they work is that an image of an object X is still an image of object X if the X is shifted left or right - the task is translationally invariant. CNNs are basically the simplest way to encode translational invariance.
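A rough way to make the "much larger network" point concrete (PyTorch, sizes chosen just for illustration): a 3x3 conv with 16 filters reuses the same 9 weights at every position of a 28x28 MNIST image, while a fully connected layer producing an output of the same size has to learn every position separately.

    import torch.nn as nn

    conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)   # 1x28x28 -> 16x28x28, weights shared across positions
    fc = nn.Linear(28 * 28, 16 * 28 * 28)               # same output size, one weight per (input, output) pair

    n_params = lambda m: sum(p.numel() for p in m.parameters())
    print(n_params(conv))   # 160 parameters (16 * 3 * 3 weights + 16 biases)
    print(n_params(fc))     # roughly 9.8 million parameters

The FCN can still learn something translation-tolerant from enough (augmented) data, but nothing in its structure pushes it that way.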
> CNNs are basically the simplest way to encode translational invariance
That's the geometric deep learning theory, isn't it? Do you know if there's a list somewhere of exactly which invariances can be encoded in which ways? Like an overview?
The point of using a CNN instead of an FCN is that you force it to learn in a certain way that prevents overfitting. But given a sufficient dataset and proper data augmentation, you would expect an FCN to be able to identify objects regardless of translation. It's just that a CNN would train more easily and better, with a smaller network (an FCN doing convolutions would be very wasteful).
That's why traditionally you would pick your architecture to help it learn in a certain way (images=CNN, text=RNN/LSTM/GRU). But the nice thing about transformers is that they are more general.
Could a "type system" for neural weights be developed? Given a self-driving system, to be able to statically check that the neurons have the "Person" type, the "Don't Run Over Person" type, and so forth. What happens if you "transplant" the weights for ' an' to another network, some kind of transfer learning but componentized, does it still predict as accurately? If neural networks could be assembled from "types" it would be much easier to trust them.
The way an LLM decides which word to use next is by evaluating the weightings of all the preceding words against every candidate word to calculate a probability for each of them. So if it selects 'an' as the next word, it's because the weightings connecting 'an' to all the preceding words, their order in the text, and their relationships with each other predicted it should have a high probability of occurring.
So you can’t extract the weightings for ‘an’ discretely because those weightings encode its connection with all the other words and combinations and sequences or clusters of words it might ever be used with, including their weightings with other preceding words, and their relationships, etc, etc.
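You can't pull the 'an' weights out in isolation, but you can at least read the resulting probabilities off the final layer. A rough sketch with GPT-2 via Hugging Face transformers (the prompt is just an illustrative one in the spirit of the article):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ids = tok("I climbed up the apple tree and picked", return_tensors="pt").input_ids
    logits = model(ids).logits[0, -1]         # a score for every token in the vocabulary
    probs = torch.softmax(logits, dim=-1)     # normalized into next-token probabilities

    for word in (" a", " an", " the"):
        wid = tok.encode(word)[0]             # each of these is a single GPT-2 token
        print(repr(word), round(probs[wid].item(), 4))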
Right, but if there is such a thing as the very plastically named "Jennifer Aniston neuron" [1], and furthermore, group equivariant deep learning [2], maybe there is a way in which you can isolate a certain concept/"type", such as Person, Car, and so forth; perhaps not even isolate, but rehydrate the context in which the concept takes place: as a brain does in various word plays, as in Who's on First [3], etc.
Come to think of it, when someone teaches me a new concept, the principle of mass conservation, for instance, in some sense they are transferring their embedding into my brain, further on I will relate to mass conservation through what that person taught me. The transfer is a very lossy process, sure, but a transfer with reintegration nonetheless. Perhaps "mortal computation" [4] is a requirement.
> Right, but if there is such a thing as the very plastically named "Jennifer Aniston neuron"
Firstly, even if there is such a cell that only fires for one face, or perhaps also the person's name, it doesn't mean there aren't other cells that fire for that person, or for people in general including that person. Without those as well, that neuron's responses might not mean anything to the rest of the brain. It's a thought experiment, but it has never really been demonstrated.
Also, even if this is true in the very strongest sense - say there is one neuron that uniquely and discretely fires in response to thinking about that one person - what defines a neuron isn't just its internal behaviour. It's also the pattern of inputs that influence it, and the pattern of outputs it sends out. It's the connections and dependencies on the weightings and signals and responses from all the cells it's connected to, including the specific unique ways all those neurons are connected, or not connected, to all the other cells in the brain. It's all the specifics of that connectedness that make the behaviour of that neuron meaningful.
If you took that neuron and implanted it into another brain, you'd need to hook it up to the neurons in that brain such that it gets exactly the same stimuli, in the same order, with the same strength, every time it needs to fire. The same applies to its output: all the neurons it's connected to would have to interpret its firing behaviour in exactly the same way the neurons in the original brain did. But there's no guarantee any of those connected mechanisms work or are physically connected in the same way, or even in a vaguely similar or compatible way, in the new brain.
Well, given the more organic nature of machine learning and what it's trying to achieve I wouldn't be surprised if that same neuron also triggered to some degree for "Jennifer and Stefan" ahaha.
Do you think it would ever be possible to "maximize" a neuron with certain sentences? What's so different from the gradient ascent techniques used with convolutions?
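You can run the same kind of gradient ascent, but only in embedding space - the tokens themselves are discrete, so unlike pixels you can't nudge them continuously, and you have to project back to real tokens at the end. A rough sketch with GPT-2 (the layer and neuron indices here are arbitrary, purely for illustration):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    for p in model.parameters():
        p.requires_grad_(False)                   # only the input embeddings get optimized

    LAYER, NEURON = 6, 123                        # arbitrary hidden unit to maximize

    ids = tok("I picked an", return_tensors="pt").input_ids
    emb = model.transformer.wte(ids).detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([emb], lr=0.1)

    for _ in range(50):
        out = model(inputs_embeds=emb, output_hidden_states=True)
        act = out.hidden_states[LAYER][0, -1, NEURON]   # activation of the chosen unit
        (-act).backward()                               # gradient *ascent* on the activation
        opt.step()
        opt.zero_grad()

    # Crude projection back to a "sentence": nearest real token to each optimized embedding.
    nearest = torch.cdist(emb.detach()[0], model.transformer.wte.weight).argmin(dim=-1)
    print(tok.decode(nearest))

In practice the embedding-space optimum usually doesn't correspond to real text, which is why people more often just search a corpus for maximally activating examples instead.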
[1] https://distill.pub/2020/circuits/zoom-in/