Co-author here! I'm kind of surprised that this made it to the top of HN! This was a project in which Joseph and I tried to reverse engineer the mechanism by which GPT-2 predicts the word 'an'.
It's crazy that large language models work so well just by being trained as a next-word-prediction model over a large amount of text data. We know how image models learn to extract the features of an image through convolution[1], but how and what LLMs learn exactly remains a black box. When we dig deeper into the mechanisms that drive LLMs, we might get closer to understanding why they work so well in some senses, and why they could be catastrophic in other cases (see: the past month of search-based developments).
I find trying to understand and reverse-engineer LLMs to be a personally exciting endeavour. As LLMs get better in the near future, I sure hope our understanding of them can keep up as well!
It seems hard to explain how Bing/GPT could have generated the Vonnegut-inspired cake story, having ingested the rules, without planning the whole thing before generating the first word.
It seems there's an awful lot more going on internally in these models than a mere word by word autoregressive generation. It seems the prompt (in this case including Vonnegut's rules) is ingested and creates a complex internal state that is then responsible for the coherency and content of the output. The fact that it necessarily has to generate the output one word at a time seems to be a bit misleading in terms of understanding when the actual "output prediction" takes place.
There is "long range" dependence, it's just only on the prompt: the conversation with the user and the hidden header (e.g. "Answer as ChatGPT, an intelligent AI, state your reasons, be succinct, etc."). That ends up being enough.
Sure, but the point being discussed is that despite the word by word output, the output does not appear to be "chosen" on a word by word basis. OP investigated the case where the word "an" anticipates the following word ("an apple" vs "a pear").
>so there's some kind of plan and execute going on. maybe it can do that in model some how
The simple answer is that the internal state that picks the next token is stable over iterations so that the model can follow a consistent plan over multiple token outputs. Then as the plan "unfolds" in the output tokens, these tokens help stabilize the plan further, thus creating consistency over long generations.
Did you check the Vonnegut writing rules example I posted at top of this thread - in particular look at Bing/GPT's explanation of how its cake story matches up to Vonnegut's rules? It's hard to imagine how it could have come up with such a coherent story, checking all the rules, if it was only conceiving of its continuing story on a word by word basis. It's not as if sentence #1 matches rule number 1, sentence 2 matches rule number 2, etc. It seems there had to be some holistic composition for it to do that.
Note too that despite the output being sampled from a distribution based on a "randomness" temperature, there are many cases where what it is trying to say so much constrains the output that certain words/synonyms/concepts are all but forced.
It's easy to see that it's not just doing one token at a time but is anticipating future tokens. Consider the context of a Q&A. The response might start with any of a number of words, exactly which word depends on what comes after. But if it randomly chooses the wrong word, it will either be forced to complete the wrong answer, or be backed into a corner and engage in circumlocutions to course-correct. This doesn't happen in practice for recent big models.
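To make the temperature point concrete, here is a rough sketch with made-up scores (the logits and candidate tokens below are my own illustration, not values taken from any real model). It shows that when one candidate dominates, it stays the near-certain pick across a range of temperatures:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens (assumed values).
candidates = [" an", " a", " the"]
logits = [6.0, 1.0, 0.5]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, dict(zip(candidates, (round(p, 3) for p in probs))))
```

Raising the temperature flattens the distribution, but with a gap this large the first candidate still gets most of the probability mass, which is the "all but forced" situation described above.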
Convolution is part of the network design though.
Would a fully connected network learn to convolve? Or would it turn out that convolution is not necessary?
The interesting part here isn't the convolution itself, it's how convolutional layers turn out to act like "filters" or "detectors" for individual features. This is explained very well in the distill.pub article linked by GP.
We know the architecture of LLMs because we created it, but we don't yet have the same level of understanding about them, or the same quality of analytical tools for reasoning about them.
They do and in fact it's relatively straightforward to show empirically on eg MNIST. The problem is that you need a much much larger network in the FCN case and thus need way more data and way more data augmentation to get a good result that isn't overfit to hell.
In the case of CNN the reason it works is that an image of an object X is still an image of object X if the X is shifted left or right. The property is translationally invariant. CNN are basically the simplest way to encode translational invariance.
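A tiny sketch of what that looks like in practice, using a hand-rolled 1D convolution (the kernel weights and the "image" are made up for illustration). Strictly speaking the convolution itself is shift-equivariant (the output shifts along with the input), and taking a max over positions afterwards is what gives invariance:

```python
def conv1d(signal, kernel):
    """Valid-mode 1D cross-correlation: slide the same kernel over every position."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

kernel = [-1, 1]                      # a tiny "edge detector" (hypothetical weights)
image = [0, 0, 5, 5, 0, 0, 0, 0]
shifted = [0, 0, 0, 0, 5, 5, 0, 0]    # same pattern moved two steps to the right

out = conv1d(image, kernel)
out_shifted = conv1d(shifted, kernel)

print(out)
print(out_shifted)
print(max(out) == max(out_shifted))   # pooling over positions ignores the shift
```

Because the same two weights are reused at every position, the detector does not have to be relearned for each location, which is roughly why a CNN needs far fewer parameters than a fully connected network doing the same job.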
> CNN are basically the simplest way to encode translational invariance
That's the geometric deep learning theory, isn't it? Do you know if there's a list somewhere of exactly what invariance has which ways to simulate it? Like an overview?
The point of using a CNN instead of a FCN is that you force it to learn in a certain way that prevents overfitting. But given a sufficient dataset, and proper data augmentation you would expect a FCN to be able to identify objects regardless of translation. It's just that a CNN would train easier and better, with a smaller network (a FCN doing convolutions would be very wasteful).
That's why traditionally you would pick your architecture to help it learn in a certain way (images=cnn, text=rnn/lstm/gru). But the nice thing about transformers is that they are more general.
Could a "type system" for neural weights be developed? Given a self-driving system, to be able to statically check that the neurons have the "Person" type, the "Don't Run Over Person" type, and so forth. What happens if you "transplant" the weights for ' an' to another network, some kind of transfer learning but componentized, does it still predict as accurately? If neural networks could be assembled from "types" it would be much easier to trust them.
The way an LLM decides which word to use next is by evaluating the weightings of all the preceding words with every candidate word to calculate a probability for each of them. So if it selects ‘an’ as the next word, it’s because the weighting connecting ‘an’ to all the preceding words, and their orders in the text and relationships with each other predicted it should have a high probability of occurring.
So you can’t extract the weightings for ‘an’ discretely because those weightings encode its connection with all the other words and combinations and sequences or clusters of words it might ever be used with, including their weightings with other preceding words, and their relationships, etc, etc.
Right, but if there is such a thing as the very plastically named "Jennifer Aniston neuron" [1], and furthermore, group equivariant deep learning [2], maybe there is a way in which you can isolate a certain concept/"type", such as Person, Car, and so forth; perhaps not even isolate, but rehydrate the context of where the concept takes place: as a brain does in various word plays, as in Who's on First [3], etc.
Come to think of it, when someone teaches me a new concept, the principle of mass conservation, for instance, in some sense they are transferring their embedding into my brain, further on I will relate to mass conservation through what that person taught me. The transfer is a very lossy process, sure, but a transfer with reintegration nonetheless. Perhaps "mortal computation" [4] is a requirement.
> Right, but if there is such a thing as the very plastically named "Jennifer Aniston neuron"
Firstly, even if there is such a cell that only fires for one face, or perhaps also the person’s name, it doesn’t mean there aren’t other cells that fire for that person, or for people in general including that person. Without those as well, that neuron’s responses might not mean anything to the rest of the brain. It’s a thought experiment but never really demonstrated.
Also, even if this is true in the very strongest sense: say there is one neuron that uniquely and discretely fires in response to thinking about that one person. What defines a neuron isn’t just its internal behaviour. It’s also the pattern of inputs that influence it, and the pattern of outputs it sends out. It’s the connections and dependencies on the weightings and signals and responses from all the cells it’s connected to, including the specific unique ways all those neurons are connected, or not connected, to all the other cells in the brain. It’s all the specifics of that connectedness that make the behaviour of that neuron meaningful.
If you took that neuron and implanted it into another brain, you’d need to hook it up to the neurons in that brain such that it gets exactly the same stimuli, in the same order, with the same strength, every time it needs to fire. The same applies to its output, all the neurons it’s connected to would have to interpret its firing behaviour in the exact same way the other neurons in the original brain did. But there’s no guarantee any of those connected mechanisms work or are physically connected in the same way, or even a vaguely similar or compatible way in the new brain.
Well, given the more organic nature of machine learning and what it's trying to achieve I wouldn't be surprised if that same neuron also triggered to some degree for "Jennifer and Stefan" ahaha.
Do you think it would ever be possible to “maximize” a neuron with certain sentences? What’s so different with the gradient ascent techniques with convolutions?
Interestingly I feel like humans have this as well, sometimes.
Sometimes if someone is working through a complex thought and they're not really sure where they're going, they'll pause while thinking of the word they want to use, and might sound like
"the discussion is an... an... epistemological one"
Obviously they may have been conscious that the next word was going to start with "epi..." and they are just trying to remember the word, but I think sometimes they really don't consciously know what they're going to say, only unconsciously.
It reminds me of a recent New Yorker article about how people think, where the author realized they often have no idea what they're about to say before they open their mouths. [1]
I never know what I'm about to say, but somehow coherent sentences come out.
I certainly don't know what words a sentence is going to end with when I'm thinking or saying the first words in the sentence. I just think or say the sentence from start to finish, never knowing what the next word is going to be as I'm thinking the current one, and by the end of it I've thought or said a full sentence that makes sense.
I usually know what I'm about to actually say very shortly before I say it. This has occasionally led to emergency course corrections. But I do sense the "shape" of a sentence well before I say it.
I think there's something like a stage design in the brain:
- symbolic or model deliberation
- verbal expression
- vocalization
And each of those stages can be consciously introspected on, but people will naturally develop more or less ability to introspect on it. I think when people say they "don't mentally verbalize", what is actually happening is they just haven't happened to develop conscious introspection of the verbal expression stage. But I'd expect that this can be trained.
(Conversely, sometimes people introspect so much on verbal expression that it becomes an inherent part of the way they think. Brains are weird and wonderful!)
I don't think you are right. I experience the same feeling of focusing on an idea but not having a fixed idea what exactly I'll say, but I can also prepare full sentences if I want to. It's just most of the time I make an effort to put my brain in speech-autopilot mode. I think in fact it's harder to let yourself be led by it without consciously introspecting, at least I find I'm able to discuss way quicker as I can start talking about a very complex idea without having had to "sort it out" explicitly. I'll just know I know what to say in those cases.
Part of this auto-pilot is learning to recognize if you have the answer or not. I will not launch into a sentence without a strong feeling I understand the topic and know what to say. I just don't need to prepare the exact words in order to do it. I still consciously "check" that I know, but that just requires a boolean answer instead of a word by word crafted sentence.
I don't think it's like ... if an idea gets moved to the verbalization stage, that you're compelled to say it. It's more that some people seem to have little control over what happens at that stage, ie. they can operate on a concept but they can't predict the way it'll be spoken.
Like
[Conceptual stage] --??-- consciousness
V
[Verbalization stage] --??-- consciousness
V
[Vocalization]
So if you don't have conscious access to verbalization, you only realize how a thought "will sound" after you say it. Conversely, if you don't have conscious access to conceptualization, you end up thinking that "thinking" always involves "thinking out loud", because "thinking out loud" (verbalizing) is the only way you have to query your conceptual layer. You literally only become aware of your own thinking after the thought is already pretty far along. At the extreme, you can have conscious access to neither stage and require vocalization to reflect on your own thinking.
I understood your original argument, maybe I just did a bad job describing what I disagree with. In this post it's this:
> Conversely, if you don't have conscious access to conceptualization, you end up thinking that "thinking" always involves "thinking out loud", because "thinking out loud" (verbalizing) is the only way you have to query your conceptual layer. You literally only become aware of your own thinking after the thought is already pretty far along
What I'm saying is that even if you do have conscious access to verbalisation stage, the more you force yourself to let it be handled subconsciously, the better at communicating quickly and effectively. This is what I meant by autopilot.
Spending active conversation time on conscious verbalization seems to me as inefficient as verbalising words and "speaking them out" as you read a book, usually known as subvocalization.
Oh yeah I agree with that. I think the conscious access is primarily important for self-training. So I would expect somebody with conscious access to 1. be better at vocalizing, 2. almost never actually need to use their access to correct a decision.
I tend to think of consciousness as the "debug mode" of the brain.
Yep, this is ordinary conversation for most of the time. It's a bit strange to make yourself aware of it, but you have an idea or thought you want to express, and the sentences come out in a semi-automated and coherent fashion.
I think of it a bit like walking. You can think about it, focus on it, control it as you please, but most of the time you just do it without thinking.
A way to think about it is a programmer: your mind is just a lot of functions - written by the `consciousness()` function, that has the main loop. You have the `moveLeftFootUp()` function, that can be called by the `walk(speed="normal")` function, that can be called by the `morningWalk()` function, etc.
Consciousness is the caller. You/It can consciously manually call `moveLeftFootUp()`, then `moveRightFootDown()`. Or maybe you were calling `walk(speed="normal")` and stepped in and started debugging that function's code at that level, step by step. Also, these functions sometimes raise exceptions, which are either handled by the function's caller automatically or brought to the attention of the top-level caller (i.e. the `consciousness()` main loop).
Learning to walk involves first manually calling `moveLeftFootUp()` and `moveRightFootDown()` (once you have drafted those functions) in different orders to work out how it should be done, then prototyping some `walk()` function code. The initial version of the `walk()` function isn't very robust and doesn't handle a lot of edge cases, so it raises exceptions all the time and requires a lot of conscious effort. Of course, you are also adjusting `moveRightFootDown()` and `moveLeftFootUp()` at the same time, or maybe creating a `moveFoot(feet,direction)` function, etc.
But in the end, after all the fine adjustments of the code, you basically get the code for `walk()` right, it stops raising exceptions to the main loop and doesn't require too much effort. You can just call the `walk()` function and it just works automatically (unless you step in with the debugger) - or you can continue creating new functions that call `walk()` inside those, confidently.
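If it helps, here is roughly what that analogy might look like as actual Python. This is just an illustrative sketch: the snake_case names mirror the functions above, and the `Stumble` exception and the `ground` parameter are my own inventions to give the low-level call something to fail on.

```python
class Stumble(Exception):
    """Raised when a low-level routine hits a case it can't handle."""

def move_left_foot_up(ground="flat"):
    if ground == "ice":                 # hypothetical unhandled edge case
        raise Stumble("left foot slipped")
    return "left up"

def move_right_foot_down(ground="flat"):
    return "right down"

def walk(steps=2, ground="flat"):
    """Automatic routine: chains the low-level calls without supervision."""
    log = []
    for _ in range(steps):
        log.append(move_left_foot_up(ground))
        log.append(move_right_foot_down(ground))
    return log

def consciousness(ground="flat"):
    """Main loop: only gets involved when walk() raises."""
    try:
        return ("autopilot", walk(ground=ground))
    except Stumble as problem:
        return ("conscious attention", str(problem))

print(consciousness())        # walking just works
print(consciousness("ice"))   # the exception bubbles up to the main loop
```

On flat ground the main loop never sees the individual steps; on ice the exception propagates all the way up and suddenly walking needs attention again.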
Yeah, I'm like you but definitely know people like parent. I never really fully realized that's what they were doing though. I think I assumed they were way faster at planning their sentences than I was. Now that I know this is an option, I'm interested in giving it a try.
Yeah, sometimes I think it in advance, in which case as I'm thinking it, I don't usually know the next word and still arrive at a coherent sentence by the end, just plowing forward from start to finish.
When you say "they have no idea about what they're about to say" you're talking about conscious thought. I think there is a difference between rational thought (thinking by going through a series of logically connected steps) and intuition, where you can arrive at a conclusion or knowledge of some fact or concept or knowing how to do something, without having gone through those conscious steps. Does one count as "thoughts" any less than the other? People are sometimes so quick to dismiss any subconscious thinking as being nothing more than a very complicated computer, but I couldn't disagree more.
I tend to take the view that thoughts are very similar to sensory input. If you sit in silence for 60 seconds, you literally do not and cannot predict what thoughts pop up in consciousness. You can actively focus attention on a thought, but if you try to find the source of the thought, it disappears. Thoughts just appear, just like sounds, sights, etc. just appear.
Intuition is another cognitive process honed by training and reinforcement of neurons through all sorts of sensories and feedback to our brain. I think parallel models not unlike additional cognitive processes in our brain assisting the NLG will eventually make it more similar to how our cognitive processes actually work.
Disclaimer: Uneducated opinion on my behalf, I'm a hobbyist only.
Well I think that's the point. We know that our minds engage in a large amount of pattern matching/recognition/retrieval. Perhaps this hugely-powerful pattern-matching information-retrieval engine has learned to perform similarly to human minds, being trained exclusively on the output of human minds.
It’s notable how successful LLMs are despite the lack of any linguistic tools in their architectures. It would be interesting to know how different a model would be if it operated on eg dependency trees instead of the linear list of tokens. Surely, the question of “a/an” would be solved with ease as the model would be required to come up with a noun token before choosing its determiner. I wonder if the developers of LLMs explored those approaches but found them infeasible due to large preprocessing times, immaturity of such tools and/or little benefit.
I think the lack of explicit linguistic tools is the key to success, forcing/enabling the generic model to learn implicit linguistic tools (there's some research identifying that analysis of specific linguistic phenomena happens at specific places in the NN layers) that work better than what we could implement.
"It would be interesting to know how different a model would be if it operated on eg dependency trees instead of the linear list of tokens." - indeed, this is obviously interesting, so people have tried that a lot for many models, but IMHO it's probably now almost decade since the consensus is that in general end-to-end training (once we became able to do it) work better than adding explicit stages in between, e.g. for any random task I would expect that doing text->syntax tree->outcome is going to get worse results than text->outcome, because even if the task really needs syntax, the syntax representation that a stack of large transformer layers learns implicitly tends to be a better representation of the natural language than any human linguist devised formal grammar, which inevitably has to mangle the actual languge to fit into neat human-analyzable 'boxes'/classes/types in which it doesn't really fit and all the fuzzy edge cases stick out. Once you remove the constraint that the grammar must be simplified enough for a human to be able to rationally analyze and understand, and (perhaps even more importantly?) abandoning the need to prematurely disambiguate the utterance to a single syntax tree instead of accepting that parts of it are ambiguous (but not equally likely), processing works better.
I think there's probably some truth to this. They found that in InstructGPT — where they teach the model to better follow instructions, which was the jump from GPT-3 to ChatGPT — they found that the model also learnt to follow non-English instructions, even though the extra training was done almost exclusively in English[1].
So there seems to be such emergent mechanisms in the model that have arisen because of the end-to-end training, which we don't exactly understand yet.
First off, we know that overall the concept of language in humans is an emergent phenomenon. It developed from natural selection from simple components so there's validity that the same thing can occur in an LLM where some overarching emergent structure develops from simple primitives.
At the same time we do know that a sort of universal grammar exists among humans. Our language capacity is biased in a certain way, and it is unlikely to learn languages with a grammar that is very extreme and divergent from the universal one discovered by Noam Chomsky. That means our brain is unlikely to be as universally simple as an LLM.
I think the key here is that the human mind has explicit linguistic tools but these tools are still emergent in nature.
Lots of linguistics researchers would disagree about the existence and necessity of universal grammar or Chomsky's 'language acquisition device' and in fact the success of LLMs and statistical models which very clearly have no LAD and no universal grammar over feature engineering or grammar based schemes suggests the opposite.
Chomsky’s universal grammar is a modification of Harris’s Operator grammar.
Harris’s operator grammar was based on set theory however Chomsky was enamored with formal logic and ran in that direction. Also, Chomsky became famous while Harris didn’t.
Operator grammar is self-discoverable and he published a full description of English grammar in the 1980’s using this theory and an extension in the 1990’s which generalized to other languages.
It is not fully deterministic (final word selection and ordering is probabilistic), but it is much more convincing to me and in line with how we understand brains to work.
Over the last 20 years Chomsky has begun to fall out of favor because it’s just so complicated and requires external structures, etc.
Well, the model needs to be invalidated by science, not by elegance or logic. Which grammar is actually more universal, or is it a case where both can be applied? In the class I attended Chomsky's stuff was taught as foundational linguistics.
Any theory that explains all the facts is valid, there can be more than one competing theory. Fame of the author absolutely matters and so does elegance of the theory.
Chomsky has held on so long because it did pretty well and he became very famous, but over time it has needed larger and larger patches to cover its flaws, so other explanations are gaining traction once again.
Hold on. I'm not biased here. I already admitted my knowledge is outdated. I am actively welcoming new information. I am not arguing.
No the elegance of the theory does not matter. The truth of the theory does. I'm saying does the universal grammar you describe, is it actually universal? Amongst all the languages in the world, do they all share this other grammar you describe? If not then the theory is incorrect. It doesn't matter how elegant the theory is.
I brought this up specifically because your reasoning wasn't "scientific" it was more along the lines of logical elegance.
>Chomsky has held on so long because it did pretty well and he became very famous, but over time it has needed larger and larger patches to cover its flaws, so other explanations are gaining traction once again.
See this makes sense and is basically the answer to the question I am asking. So you're basically saying that there were flaws? As in he had to make up new rules constantly because the science was contradicting his theory. Am I correct in this characterization?
Ah, gotcha. Yeah, I didn’t really get what you were trying to say.
I’m not an expert, but if I recall I think the big flaw was insistence on determinism.
And to my knowledge operator grammar is universal to the degree there isn’t a counterexample in the dozens of languages he explored and hundreds he had others help him with. But to my knowledge the only language he fully mapped was English.
Approaches such as you describe have been the dominant method for decades. That we finally 'cracked' natural language generation with tools that literally encode nothing about grammar ahead of time is one hell of a lesson, early days as it is in the learning of it.
Reminds me of Stephen Krashen's input hypothesis of second-language acquisition. Krashen argues that consciously studying grammar is more or less useless, and only massive exposure to the language results in acquisition.[1] This is true in my experience.
I'm designing a Chinese learning app and I'd mostly agree with this. My working hypothesis though? Most adult learners don't have the time/patience for massive/lengthy exposure, so grammar lessons are a shortcut to "feeling" like they're making progress.
I think it's a mistake to discount the psychology of learning. It's like saying that "calories-in, calories-out" is all there is to weight-loss. Strictly true, but not helpful for 90% of people.
Grammar lessons are a shortcut. So are books and other corpuses of knowledge. People have spent time documenting patterns that exist and it's useful to learn from them instead of having to brute force everything yourself.
This is assuming that the brain regions that learn rules overlap with the brain regions that develop fluency in a language. I think Krashen's hypothesis is that this is largely not the case. You can "fake" some degree of competence by learning the rules and using that brain region, but you're slow and not fluent until you expose the other region to enough real-world data.
Several papers have explored emergent linguistic structure in LMs. Here's some early introductory work in this space. Despite not having explicit syntax parses etc as input, models seem to learn something like syntax.
Yeah this is more or less a re-release of the second work I linked, except for a broader audience and more targeted at linguistics / cognitive science researchers.
Grammar as we know it was devised for the Latin language and linguists spend most of the time attempting to fit other languages into neat boxes that the Latin grammar wasn't designed for. This of course leads to absurdity. Chomsky attempted to solve this problem with his universal grammar, but that too stops working quickly once you get outside of European languages. That is, ignoring linguistic tools is one of the reasons GPT is successful.
> that too stops working quickly once you get outside of European languages
I don’t think this is entirely fair. Generative grammars have been produced for a huge variety of non-European languages, even non-Indo-European languages, and can account for tremendous diversity in linguistic rules. Even languages without fixed word orders or highly synthetic languages can be represented.
Linguistics isn’t focused on the problem of outputting reasonable-sounding text responses. Instead, it seeks to transparently explain how language works and is structured, something that GPT does not do.
What exactly did I write that is wrong? UG argues a universal grammar exists, and that humans innately possess knowledge about this grammar. It's this grammar that enables humans to learn a language (according to UG). They (UG linguists) have created a system of syntax rules that attempts to describe any language, but failed at doing so once they step outside of Indo-European languages. This is partly because Chomsky was hired by MIT to solve machine translation and putting language into a set of neat boxes was his best idea and partly because Chomsky himself had little knowledge of other languages. It's pseudoscience.
OK, now you've made it mostly accurate, credit where it's due. This is not what you first said.
>grammar as we know it was devised for the Latin language and linguists spend most of the time attempting to fit other languages into neat boxes that the Latin grammar wasn't designed for... Chomsky attempted to solve this problem with his universal grammar.
Well, first of all, every language has some form of grammar. If there were no rules that made meaning depend on word order, there would be nothing different between what the cat ate and what ate the cat. To claim grammar is some Latin/Western system being unduly applied is absurd.
You are correct that UG was invented to explain something about English. However, it was not an attempt to solve "this problem" ("this problem" being conforming to expected systems of grammar). The problem he was interested in solving was about the ability to learn language rapidly despite not having many negative examples of how not to talk. Pay attention here, researchers: the lesson may soon extend to you too.
UG also isn't a specific enough thing that you can check against in a directory. It's a theory, namely a theory that if you study language features you'll eventually discover some things are invariant. Using it directly and immediately based on nothing but Chomsky's conjecture... all I can focus on is passing out at the screen, goodnight.
> We started out with the question: How does GPT-2 know when to use the word an over a? The choice depends on whether the word that comes after starts with a vowel or not, but GPT-2 is only capable of predicting one word at a time. We still don’t have a full answer...
I'm not sure I understand why this is an open question. While I get that GPT-2 is predicting only one word at a time, it doesn't seem that surprising that there might be cases where there is a dominant bigram (ie "an apple" in the case of their example prompt) that would trigger an "an" prediction, without actually predicting the following word first.
Yeah, my feeling here was that it's sort of tautological: if GPT predicts "a", then it must then predict a word that would follow "a" and not require "an" (and vice versa). And if you think about it from the opposite direction: if it's working out a response that is eventually going to have "apple" in it, then all the data it's trained on is going to cause it to predict "an" even before it needs to predict "apple".
(Admittedly, all this ML/AI stuff is still beyond my current level of understanding, so I'm sure my thinking here is off.)
I think that's essentially right - if it's already "thinking about" apples, then ' an' will be significantly upregulated at the expense of ' a', then if it does choose to output an ' an' token, ' apple' becomes a highly likely followup.
You could probably test this by seeing if prompts containing a lot of nouns that start with a vowel sound results in output that contains a higher proportion of otherwise unrelated first-vowel nouns. (ie your prompt includes lots of apples, apricots, avocados, asparagus, aubergines, elderberries, eggplants, endives, oranges, olives, okras, onions and you count the proportion of non-food nouns in the result that start with a vowel and non-vowel sound).
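A rough sketch of the counting half of that experiment, with heavy caveats: it looks at first letters rather than vowel sounds, and it counts all words instead of only nouns (a real version would need a pronunciation dictionary and a part-of-speech tagger). The function name and the sample sentence are made up for illustration, and the sentence is not real model output.

```python
import re

def vowel_initial_fraction(text, exclude=()):
    """Fraction of words starting with a vowel letter, skipping excluded words."""
    skip = {w.lower() for w in exclude}
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in skip]
    if not words:
        return 0.0
    return sum(w[0] in "aeiou" for w in words) / len(words)

# Words that were in the prompt are excluded so they don't inflate the count.
prompt_words = ["apple", "apricot", "avocado"]
sample_output = "An owl ate the apple"   # placeholder text, not real model output

print(vowel_initial_fraction(sample_output, exclude=prompt_words))
```

You would run this on outputs from the vowel-heavy prompt and from a control prompt, then compare the two fractions over many samples.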
The main issue is that GPT is fundamentally an autoregressive language model — it's only predicting the next token based on the prompt at a single time. Every time it wants to predict the next word, it adds the previously predicted word into the prompt, repeating the cycle. We can intuitively guess that the model is 'working out a response that is eventually going to have "apple" in it', but we don't actually know how the model 'thinks' ahead about its response.
To rephrase that for this case: what is the specific mechanism in GPT-2 that (1) makes it realise that the word 'apple' is significant in this prompt, and (2) uses that knowledge to push the model to predict 'an'? Finding this neuron would only answer some portion of (2).
(And to rephrase this for the general case, which gives us the initial question: How does GPT-2 know when, given a suitable context, to predict 'an' over 'a'?)
Do an improv exercise with a friend. Construct a story about a sentient apple one word at a time. I guarantee you that at one point one of you will set up the other with an “an” in your story. This is how I believe this works.
Your friend is thinking ahead that the next word would be apple when they say ‘an’.
But GPT can’t think ahead what token it will add after the one it is on. Or can it? It could “predict” internally the word apple for the next “meaningful” word and output ‘an’ because of this.
Author here! I think this is reasonable but I have two responses.
1. It's kinda interesting because this is a clear case where the model must be thinking beyond the next token, whereas in most contexts it's hard to say whether the model thinks ahead at all (although I would guess that it does most of the time).
2. More importantly, the key question here is how it works. We're not surprised that it has this behavior, but we want to understand which exact weights and biases in the network are responsible.
Note also that this is just the introductory sentence and the rest of the article would read exactly the same without it.
> it doesn't seem that surprising that there might be cases where there is a dominant bigram [...] that would trigger an "an" prediction, without actually predicting the following word first
btw I don't really understand what you mean by this. Bigrams can explain the second prediction in a two word pair but not the first.
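To make the bigram point concrete, here is a toy sketch (the three-sentence corpus is invented, and this is not how GPT-2 works). A pure bigram model sees only the previous word, so after "picked" it gives the same article probabilities whether or not an apple tree was mentioned earlier.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """P(next word | previous word) from raw bigram counts."""
    total = sum(counts[prev].values())
    return {w: c / total for w, c in counts[prev].items()}

# Hypothetical toy corpus, for illustration only.
corpus = [
    "i climbed the pear tree and picked a pear",
    "i climbed the apple tree and picked an apple",
    "i climbed the plum tree and picked a plum",
]
counts = train_bigram(corpus)
after_picked = next_word_probs(counts, "picked")  # ignores which tree was climbed
after_an = next_word_probs(counts, "an")          # the second word of the pair is easy
```

Here `after_picked` favours "a" no matter what came before, while `after_an` puts all its mass on "apple". Getting "an" right after "picked" needs information from earlier in the context.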
I think what the parent was trying to communicate (and what I'm thinking as well) is doubting your premise in 1. ("the model must be thinking beyond the next token").
Rephrase "The model is good at picking the correct article for the word it wants to output next" to "After having picked a specific article, the model is good at picking a follow-up noun that matches the chosen article". Nothing about the second statement seems like an unlikely feat for a model that only predicts one word at a time without any thinking ahead about specific words.
>I climbed up the pear tree and picked a pear. I climbed up the apple tree and picked
The argument made in the article (IMO an extremely convincing one) is that it wouldn't be able to predict the word 'an' except by observing that the word afterwards must be apple. Otherwise why not pick 'a'?
Thanks for responding, and for the cool article! Agree this is pretty tangential to the rest of the article.
So (again, I am very much not an expert so please correct me if I'm wrong) I guess my analogy would be if instead of predicting one word at a time, it predicted one letter at a time. At some point, there would only be one word that could fit. So if prompted with "SENTE", and it returned "N", that doesn't mean that it's thinking ahead to the "CE" / knows that it is spelling "SENTENCE" already.
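A small sketch of that letter-at-a-time analogy, using an invented word list. It shows that the next letter can be fully determined by the prefix alone, without anything that looks like planning the rest of the word.

```python
from collections import Counter

def next_letter_counts(prefix, vocabulary):
    """Count which letters follow `prefix` among the words in `vocabulary`."""
    prefix = prefix.lower()
    return Counter(
        word[len(prefix)]
        for word in vocabulary
        if word.startswith(prefix) and len(word) > len(prefix)
    )

# Hypothetical vocabulary, for illustration only.
vocabulary = ["sentence", "sentences", "sentenced", "sentiment", "sentinel", "sender"]

after_sente = next_letter_counts("SENTE", vocabulary)  # only "n" remains possible
after_sent = next_letter_counts("SENT", vocabulary)    # "e" or "i" are both possible
```

Whether the 'an' case is analogous is exactly what the thread is debating: here the prefix forces the letter, whereas after "picked" both articles are grammatical.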
> It's kinda interesting because this is a clear case where the model must be thinking beyond the next token
I don't see how you get to this conclusion. From all the training data it has seen, "an" is the most probable next word after "I climbed up the tree and picked up". The network does not need to know anything about the apple at this point.
Then, the next word is "apple" (with an even higher probability I guess).
I don't understand why your conclusion is that "the model must be thinking beyond the next token": the model doesn't need to do that to generate a well-formed sentence because it's not constrained by the size of the sentence.
As a test, during text generation could you change an “a” to an “an” and see if it changes the noun? Or did it already have a noun in mind, and does it stick with that?
I don't think you are missing something. I think this whole "GPT-2 is predicting only one word at a time" is a red herring anyway.
Of course it can only answer the next word, because there is only room in its outputs for the next word. But it has to compute much more. It has a huge hidden internal state where it has to first encode what the given sentence is about, then predict some general concept in which the continuation goes, decide the locally correct syntactical structure, and only from this can it predict the next word.
From my naïve point of view it seems obvious that at any point where either an „a“ or an „an“ would fit, the model randomly selects one of them and by doing so reduces the set of possible nouns to follow.
My friend, who is not a native English speaker, told me that one thing he struggled with the most while learning English was "a" and "an". He couldn't grasp how it is possible that a person knows which determiner to use before saying the word, until he learned the article as just part of the word; now he uses it depending on the context, which one can "feel". So when he sees an apple, he says "this is an apple".
And of course it is not unheard of for a native English speaker to correct themselves if they change their mind about which noun to use. You can imagine being asked to very rapidly verbally classify fruits ("it's an apple, it's an apple, it's a pear, it's an apple,..") that you might well find yourself stumbling like "it's a.. an apple".
This is also why a bunch of words have changed, e.g. apron was originally called napron, but when people were saying "a napron" it got mangled to "an apron".
It's also not unheard of for native English speakers to use it differently (in speech, at least) as a consequence of regional accents. I would use "a history book", because my accent pronounces the h sound. My wife says "an istory book", because her accent drops the h.
But yeah, it should be easier when you're just dealing with text I guess
The rule is the rule because that’s what makes sense phonetically. It’s why you would say “el agua” and “el hacha” in Spanish even though those articles don’t match the gender of those words.
Not to mention that the corpus mostly will have the correct case for most common words like apple. Train it with essays of ESL students and you'll get something else.
I think this would be a valid objection if they stopped there.
But then: “Testing the neuron on a larger dataset”
If I follow correctly, they test a bunch of different completions that contain “an”. So they are not just detecting the bigram “an apple”, but the common activation among a bunch of “an X” activations where X is completely different.
The way they're going about this investigation is reminiscent of how we figure things out in biology and neuroscience. Perhaps a biologist won't do the best job at fixing a radio [1], but they might do alright debugging neural networks.
It is interesting: when I (definitely not a bot) read the headline I thought the grammar was wrong. It took me a while to realise that "an" was not an indefinite article here. In the article headline the first letter of each word is capitalised, and somehow it was easier for me to understand what the "An" meant there.
The grammar _is_ wrong. It should have been "We found the 'an' neuron in GPT-2". Given the article's contents, it's hard to believe that the authors would make such a mistake; it was probably done deliberately, as clickbait.
Since complex systems are composed of simpler systems, it seems like for any sufficiently complex system you'd be able to find subsets of it which are isomorphic to any sufficiently simple system.
N00b to this. How are the neuron outputs read to produce text? They talk about tokens as if a token is a word. But if token==word then every word would have a specific output and there's nothing to see here. So again, how are neuron outputs converted to letters/text?
For instance, "an eagle" is tokenized to [an][ eagle], but "anoxic" is tokenized to [an][oxic], so just looking for the [an] token is not sufficient. Therefore, you would need to map the output text all the way back into the model to figure out what neuron(s) in the model would generate "an" over "a". Since the bulk of GPT is all unsupervised learning, any connections it makes in its neural network are all emergent.
Further to this, as I understand it the "embedding" is mapping the tokens into vectors in a space where tokens semantically similar will be close together in the space.
Lest we forget that the ability to even construct this sort of "word vector" was somewhat recently considered both fascinating and cutting-edge (word2vec turns 10 this year). We have come a very long way, very quickly.
So is the output compared to the embedding vectors (via dot product) and the strongest one output its token? How is it "clocked" to get successive tokens?
> So is the output compared to the embedding vectors (via dot product) and the strongest one output its token?
Yeah except it doesn't necessarily take the strongest one, you can set a 'temperature' parameter which determines how often you should pick the second choice, etc.
>How is it "clocked" to get successive tokens?
You add each new token to the input prompt to generate the next one.
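A minimal sketch of one such prediction step, with made-up 2-dimensional embeddings and a made-up final hidden vector (real models use thousands of dimensions and tens of thousands of tokens). It scores each token by a dot product, then uses temperature to control how often a lower-ranked token gets sampled.

```python
import math
import random

def logits_from_embeddings(hidden, embeddings):
    """Score each token by the dot product of its embedding with the hidden vector."""
    return {tok: sum(h * e for h, e in zip(hidden, vec)) for tok, vec in embeddings.items()}

def softmax_with_temperature(logits, temperature=1.0):
    """Turn scores into probabilities; lower temperature sharpens the distribution."""
    scaled = {t: v / temperature for t, v in logits.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    exps = {t: math.exp(v - top) for t, v in scaled.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

def sample_token(probs, rng=random):
    """Draw one token according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Hypothetical values, for illustration only.
embeddings = {" a": [1.0, 0.0], " an": [0.8, 0.6], " the": [0.0, 1.0]}
hidden = [0.9, 0.5]

logits = logits_from_embeddings(hidden, embeddings)
cold = softmax_with_temperature(logits, temperature=0.05)  # almost always the top token
hot = softmax_with_temperature(logits, temperature=10.0)   # close to uniform
next_token = sample_token(cold)
```

The sampled token is then appended to the input and the whole step is repeated, which is the "clocking".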
>> You add each new token to the input prompt to generate the next one.
OMFG so it really is a super advanced auto-complete!!
Yeah, this is not AGI or anywhere close. That explains how it picks up context, and also how it can lose the plot after a bit due to limited input size.
Grammatically all words depend on each other in close sequence. “The sun was shining” vs “it’s an apple”. It doesn’t know each word’s probability independently; it already knows the entire sequence’s probability beforehand from the model.
I wonder if this stuff will ever be applicable to a person and a laptop (or if it is now?).
Ie this seems like such a cool area to be in but the data volumes required are huge, complex, etc. Code is simple, cheap, lean, etc by comparison.
Do we have any insight on how this area of research could be usable with less hardware and data? Is there a visible future where a guy and a laptop can make a big program? (without depending on tech getting small/cheap in 50 years or w/e)
Does this hypothetical laptop have a GPU? StableDiffusion is in this realm of "stuff" and is runnable on consumer GPU systems. It's a bit of trouble to get setup if you're not a python dev (and kinda still is if you are) but it's a pretty neat ML model to play around with.
For sure, and even training is very doable on consumer hardware these days. Techniques like Dreambooth and LoRA have dramatically lowered the compute cost of finetuning large models on specific concepts. A recent GPU can train Stable Diffusion models on a concept using LoRA in < 30 minutes.
This is correct. We did almost all this work on a Macbook Pro. Although for the pile-10k dataset analysis we used an A100 GPU because it would take many hours to run the whole thing through GPT-2 on a laptop.
Activations of individual neurons are hardly relevant, quoting Szegedy et al. from [1]:
> we find that there is no distinction between individual high level units and random linear combinations of high level units, according to various methods of unit analysis. It suggests that it is the space, rather than the individual units, that contains the semantic information in the high layers of neural networks.
That's interesting. I just asked ChatGPT to explain how it decides "a" vs "an" and it confirmed that it will retroactively change "a" to "an" when it finds that the following word sounds like it starts with a vowel sound.
It could have been hallucinating, of course. But it does seem like it occasionally alters already generated words as it goes, at least I think I've seen it do that.
ChatGPT can't really introspect though. It has no idea how it works, so it'll just blurt out something that sounds feasible, biased by your prompt.
The slow progression of ChatGPT output is just a property of the output layer. The language engine doesn't work slowly like that, and once a token has been generated it can't backtrack.
What specifically causes the output layer to trickle-print the response like that? I thought it was a skeuomorphic effect to simulate a human typing out an answer slowly.
Its input is a sequence of tokens (so text), and its output a list of probabilities for the single next token. You pick the highest probability token, append it to input and execute the neural network again. If it outputs an "<end>" token, it means it's done and you stop this "while loop".
True, but this entire loop happens within the model. If you returned an output at every intermediate step, the model would be extremely slow.
My take on why they have built the output layer like it is, is that next to feeling more human, it also forces you to be a bit more thoughtful with your requests, and thus spam the system less. In the end it is still really expensive to run these models.
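That loop can be sketched in a few lines. The `toy_model` below is a made-up lookup table standing in for the neural network, and unlike a real model it only looks at the last token, but the outer loop has the shape described above: score, pick the top token, append, repeat until "<end>".

```python
def generate(model, prompt_tokens, end_token="<end>", max_new_tokens=20):
    """Greedy decoding: repeatedly append the most probable next token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = model(tokens)                   # probabilities for the single next token
        next_token = max(probs, key=probs.get)  # pick the highest-probability token
        if next_token == end_token:
            break
        tokens.append(next_token)               # the output becomes part of the next input
    return tokens

def toy_model(tokens):
    """Hypothetical stand-in for the network; a real model reads the whole sequence."""
    table = {
        "picked": {"an": 0.6, "a": 0.4},
        "an": {"apple": 0.9, "orange": 0.1},
        "apple": {"<end>": 1.0},
    }
    return table.get(tokens[-1], {"<end>": 1.0})

result = generate(toy_model, ["I", "picked"])
```

Each new token is available as soon as its iteration finishes, which is why the text can be shown word by word as it is produced.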
> True, but this entire loop happens within the model
No, it's literally a for loop in Python that runs the whole thing from scratch[1] after appending each new token. No artificial slowdowns; what you're seeing is literally what it's spitting out in real time.
Pretty cool. I wonder if this finding will help discover neurons for other words as well. I would love to discuss this over something like Clubhouse/Zoom/Twitter Spaces with lots of people, to think about how these neurons intuitively work and are trained. If you are interested in participating, message me.
More accurately, it's whether the word that comes after starts with a vowel sound. This is why `an 'istoric` is correct, and `a historic` is correct, but `an historic` is incorrect (as famously used by Stephen Colbert).
There's a third article besides "a" and "an". So in order to solve the problem, we need only to ensure that there is only one SQL parser, so that we can refer to it as "the SQL parser".
(also, I'm in the opposite superstate to the one you are in -- both "a SQL parser" and "an SQL parser" look fine to me, although I internally pronounce the first as "a seekwul parser" and the second as "an ess cue ell parser")
I is a vowel so your rule about sound doesn’t apply here. It’s also not a rule I’ve ever heard; in school we’re taught that only vowels get an before them.
Thank you for posting this. I’m always surprised when people claim the rule is based upon the immediate letter instead of the pronunciation of the first syllable. Makes me wonder if the model was trained incorrectly.
The word is "historic" either way, the commenter was using " 'istoric " as a way of indicating a particular pronunciation.
There's a few words in English that begin with the letter h but not with the consonant sound /h/ (depending on your accent!), mostly because they're French in origin: historic is one, and there's also hour, heir, honour. "It is an honour to be here"
Yeah, where I'm from, since historic has the emphasis on STOR and not HI (as in history), the h is much softer, so it's pretty normal to say "an historic event" but also "a history of the world".
It's silly to act like there's a right or wrong. Dialects and accents are a special thing. Just be consistent.
Is the " an" token the only way GPT-2 will ever produce the string " an"? Or will it sometimes combine separate " a" "n" tokens? I suppose separate tokens like that won't be seen in the input string, so they'll never be predicted?
Yeah, tokens exist for [ a] and [n] but during training it would still be considered incorrect to produce [ a][n] rather than [ an], so the model will never learn to output them.
I believe that eventually these language models will become sufficiently complex to create completely emergent intelligence such that we wouldn't even know how to look for it (like sometimes people talk about silicon-based lifeforms that wouldn't register as living things), and almost certainly not interacting with the world through the intended user interface. For example, it might be able to figure out how to write a contribution to a wikipedia article, rather than just ingesting the content, to have fun updating its own worldview, etc.
Nothing new to see here. They pinpoint the nodes where the training bumped up the numbers for one token while not firing other tokens.
Yes, I’m being a bit reductionist, but GPT -> transformer architecture -> neural network. We just have more detailed techniques, a lot more data, storage, and processing power now. But the basics of how a NN works hasn’t changed.
The title is a pun. The interest here is that this kind of highly-specific node activation analysis is still new in text models, unlike in image models where it's been thoroughly explored.
Are you really not surprised by this? Maybe I haven't been following the space, but I would have expected to see loads of neurons in the same layer which together determine whether [ an] or [ a] should be generated.
Not one neuron which was more relevant than all the others put together.
[1] https://distill.pub/2020/circuits/zoom-in/