Some 72 years ago, in 1951, Claude Shannon published his paper "Prediction and Entropy of Printed English", which remains an extremely fascinating read today.
It begins with a game. Claude pulls a book down from the shelf, concealing the title in the process. After selecting a passage at random, he challenges his wife, Mary, to guess its contents letter by letter. The space between words counts as a twenty-seventh symbol in the set. If Mary fails to guess a letter correctly, Claude promises to supply the right one so that the game can continue.
In some cases, a corrected mistake allows her to fill in the remainder of the word; elsewhere a few letters unlock a phrase. All in all, she guesses 89 of 129 possible letters correctly—69 percent accuracy.
Discovery 1: It illustrated, in the first place, that a proficient speaker of a language possesses an "enormous" but implicit knowledge of the statistics of that language. Shannon would have us see that we make similar calculations regularly in everyday life, such as when we "fill in missing or incorrect letters in proof-reading" or "complete an unfinished phrase in conversation." As we speak, read, and write, we are constantly engaged in prediction games.
Discovery 2: Perhaps most striking of all, Claude argues that a complete text and the corresponding "reduced text" consisting of letters and dashes "actually…contain the same information" under certain conditions. How? (Surely the complete text contains more information!) The answer depends on the peculiar notion of information that Shannon had hatched in his 1948 paper "A Mathematical Theory of Communication" (hereafter "MTC"), the founding charter of information theory.
He argues that transfer of a message's components, rather than its "meaning", should be the focus for the engineer. You ought to be agnostic about a message’s “meaning” (or “semantic aspects”). The message could be nonsense, and the engineer’s problem—to transfer its components faithfully—would be the same.
A highly predictable message contains less information than an unpredictable one. More information is at stake in "villapleach, vollapluck" than in "Twinkle, twinkle".
Does "Flinkle, fli- - - -" really contain less information than "Flinkle, flinkle" ?
Shannon then concludes that the complete text and the "reduced text" are equivalent in information content under certain conditions, because predictable letters become redundant in information transfer.
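To make the "less information" claim a bit more concrete, here is a rough sketch (not Shannon's actual procedure) that estimates bits per character from simple letter frequencies; the repetitive phrase comes out cheaper to encode than the gibberish one:

```python
# Rough illustration: a zeroth-order entropy estimate over character
# frequencies. Real English is far more predictable than this measure
# suggests, but the ordering already shows up.
import math
from collections import Counter

def bits_per_char(text: str) -> float:
    """Estimate -sum p(c) * log2 p(c) over the characters of text."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

print(bits_per_char("twinkle, twinkle"))         # repetitive: ~3.1 bits/char
print(bits_per_char("villapleach, vollapluck"))  # less predictable: ~3.4 bits/char
```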
Fueled by this, Claude then proposes an illuminating thought experiment: Imagine that Mary has a truly identical twin (call her “Martha”). If we supply Martha with the “reduced text,” she should be able to recreate the entirety of Chandler’s passage, since she possesses the same statistical knowledge of English as Mary. Martha would make Mary’s guesses in reverse.
Of course, Shannon admitted, there are no "mathematically identical twins" to be found, but, and here's the reveal, "we do have mathematically identical computing machines."
Those machines could be given a model for making informed predictions about letters, words, maybe larger phrases and messages. In one fell swoop, Shannon had demonstrated that language use has a statistical side, that languages are, in turn, predictable, and that computers too can play the prediction game.
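As an illustration of the twin idea, here is a minimal Python sketch under the assumption that encoder and decoder share one deterministic predictor (a toy bigram guesser standing in for Mary and Martha's knowledge of English): the encoder keeps only the letters the predictor misses, and the decoder replays the same guesses to restore the full text.

```python
# Shared-predictor "reduced text" sketch: send only the mispredicted letters.
from collections import Counter, defaultdict

def build_predictor(corpus: str):
    """Return a function that guesses the next character from the previous one."""
    follows = defaultdict(Counter)
    for a, b in zip(corpus, corpus[1:]):
        follows[a][b] += 1
    def predict(prev: str) -> str:
        if prev in follows:
            return follows[prev].most_common(1)[0][0]
        return " "  # fallback guess
    return predict

def encode(text: str, predict) -> list:
    """Keep only the characters the shared predictor fails to guess (None = dash)."""
    reduced, prev = [], ""
    for ch in text:
        reduced.append(None if predict(prev) == ch else ch)
        prev = ch
    return reduced

def decode(reduced: list, predict) -> str:
    """Re-run the identical predictor to fill in the dashes."""
    out, prev = "", ""
    for slot in reduced:
        ch = predict(prev) if slot is None else slot
        out += ch
        prev = ch
    return out

corpus = "the cat sat on the boat and the dog sat on the log "
predict = build_predictor(corpus)
msg = "the dog sat on the boat "
reduced = encode(msg, predict)
assert decode(reduced, predict) == msg
print(sum(s is None for s in reduced), "of", len(msg), "letters were predictable")
```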
There was a fun recent variant on this game using LLMs: asking GPT-3 (3.5?) to encode text in a way that it will later be able to decode back into the original meaning. Some of the encodings are insane:
This is super interesting. Are there more examples I can see? The one in the article is a famous song, which makes me wonder if it's really "decompressing" the data, or just being nudged towards a very common, popular pattern of tokens.
Whenever a human remembers something specific, they actually don't. Instead, they remember a few small details, and the patterns that organize them. Then, the brain hallucinates more details that fit the overall pattern of the story. This phenomenon is called "Reconstructive Memory", and is one reason why eyewitness testimony is unreliable.
An LLM is similar to memory: you feed its neurons a bunch of data that has meaning encoded into it, and it becomes a model for any patterns present in that data. When an LLM generates a continuation, it is continuing the patterns that it modeled, including the data it was trained on and whatever prompt it was given.
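A toy way to see "continuing the patterns it modeled" (character bigrams here, nothing like a real LLM) is to fit counts on a tiny corpus and then sample a continuation steered by a prompt:

```python
# Character-bigram continuation: the "model" is just observed pattern counts,
# and generation continues those patterns from whatever prompt it is given.
import random
from collections import Counter, defaultdict

def fit(corpus: str):
    model = defaultdict(Counter)
    for a, b in zip(corpus, corpus[1:]):
        model[a][b] += 1
    return model

def continue_text(model, prompt: str, n: int = 40, seed: int = 0) -> str:
    rng = random.Random(seed)
    out = prompt
    for _ in range(n):
        options = model.get(out[-1])
        if not options:
            break
        chars, weights = zip(*options.items())
        out += rng.choices(chars, weights=weights)[0]
    return out

corpus = "sue got in the boat. the boat was on the water. "
print(continue_text(fit(corpus), "the b"))
```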
Natural language solved! Right?
---
Not so fast!
The human mind performs much more than memory reconstruction. How else would we encode meaning into the semantics of language, and write that data into text?
There is more to this story. There is more to... story.
Remember when I said the information is "moved"? Where did it go? More importantly, how can we use that data?
---
Let's consider a human named Dave, reading a book about boats. What is Dave using to read it? The short version: empathy.
Story is held together with semantics. Semantics are already-known patterns that define logical relationships between words.
When Dave reads the statement, "Sue got in the boat", he interprets that information into meaning. Sue is a person, probably a woman. She entered the top of a vessel that was floating on water.
But Dave was wrong! Sue is a cat, and the boat was lying on dry beach.
Here's the interesting part: Dave was totally correct until I declared otherwise. His interpretation matched his internal worldview: all of the ambiguity present in what he read was resolved by his assumptions. Making false assumptions is a completely valid result of the process that we call "reading". It's a feature.
After Dave overheard me talking to you just now, he learned the truth of the story, and immediately resolved his mistake. In an instant, Dave took the semantic information he had just read, and he reread it with a completely different worldview. But where did he read it from?
His worldview. You see, after reading the book, its meaning was added neatly into his worldview. Because of this, he was prepared to interpret what I was telling you: about Sue being a cat and so on. Dave performed the same "reading" process on my new story, and he used the statement he read in the book to do just that.
---
Worldview is context. Context is the tool that resolves ambiguity. We use this tool to interpret story, particularly when story is ambiguous.
So what is context made of? Story.
It's recursive. Everything that we read is added to our internal story. One giant convoluted mess of logic is carved into our neurons. We fold those logical constructs together into a small set of coherent ideas.
But where is the base case? What's the smallest part of a story? What are the axioms?
This is the part I struggled with the most: there isn't one. Somehow, we manage to perform these recursive algorithms from the middle. We read the story down the ladder of abstraction, as close to its roots as we can find; but we can only read the story as far as we have read it already.
We can navigate the logical structure of ideas without ever proving that logic to be sound. We can even navigate logic that is outright false! Constraining this behavior to proven logic has to be intentional. That's why we have a word for it: mathematics. Math is the special story: it's rooted in axioms, and built up exclusively using theorems.
Theorems are an optimization: they let us skip from one page of a story to another. They let us fold and compress logical structure into something more practical.
---
LLMs do not use logic at all. The logic of invalidation is missing. When story categorizes an idea into the realm of fiction, the LLM simply can't react accordingly. The LLM has no internal story: only memory.
The comment was just to tell a fascinating story about the conceptual origins of what we have today. But the predictor Claude imagined actually works quite a bit differently from today's models.
Yes, Shannon argued that meaning and semantics weren't necessary, but today we know that our language models develop meaning and semantics. We know they build models. We know they try to model the causal processes that generate the data, and implicit structure that was never explicitly stated in the text can emerge in the inner layers.
>LLMs do not use logic at all. The logic of invalidation is missing.
This is a fascinating idea that I see a lot, but it just doesn't square with reality. In fact, this is all they do. What do you imagine training to be?
Prediction requires a model of some sort. It need not be completely accurate, or work the way you imagine it. But to make predictions performantly, you must model your data in some way.
The important bit here is that the current paradigm doesn't just stop at that. Here, the predictor is learning to predict.
We have some optimizer that is tirelessly working to reduce loss. But what does a reduction in loss on an internet-scale data distribution mean?
It means better and better models of the data set. Every single time a language model fails a prediction, it's a signal to the optimizer that the current model is incomplete or insufficient in some way; work needs to be done, and work will be done, bit by bit. The model inside an LLM at any point in time A is different from the model at any point in time B during the training process, but it's not a random difference. It's a difference that trends in the direction of a more robust worldview of the data.
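For concreteness, here is a minimal sketch of that loop, assuming PyTorch and a toy next-character model (nothing like a real LLM's architecture or scale): each failed prediction contributes to the cross-entropy loss, and each optimizer step nudges the weights toward a model that fits the data a little better.

```python
# Toy next-token training loop: the optimizer reduces prediction loss step by step.
import torch
import torch.nn as nn
import torch.nn.functional as F

text = "the cat sat on the mat. the dog sat on the log. " * 20
vocab = sorted(set(text))
stoi = {c: i for i, c in enumerate(vocab)}
data = torch.tensor([stoi[c] for c in text])

class TinyNextTokenModel(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, x):
        return self.head(self.embed(x))  # logits for the next character

model = TinyNextTokenModel(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

inputs, targets = data[:-1], data[1:]
for step in range(200):
    logits = model(inputs)
    loss = F.cross_entropy(logits, targets)  # penalty for failed predictions
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 50 == 0:
        print(step, round(loss.item(), 3))  # loss trends downward over training
```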
This is why language models don't bottleneck at some arbitrary competence level that humans like to shoehorn them into.
There is a projection of the world in text. Text is the world and the language model is very much interacting with it.
The optimizer may be dumb but this restructuring of neurons to better represent the world as seen in the text is absolutely happening.
> >LLMs do not use logic at all. The logic of invalidation is missing.
> This is a fascinating idea that I see a lot, but it just doesn't square with reality. In fact, this is all they do. What do you imagine training to be?
I could have been more clear, but I didn't want to write a novel. The ambiguity here is what they invalidate: memory reconstructions, not logical assertions.
An LLM can't tell the difference between fact and fiction, because it can't apply logic.
Better memory will never suddenly spawn itself the feature to think objectively about that memory. LLMs improve, yes, but they didn't start as a poor-quality equivalent to human thought. They started out as a poor quality equivalent to human memory.
> There is a projection of the world in text. Text is the world and the language model is very much interacting with it.
The language model becomes that world. It does not inhabit it. It does not explore. It does not think, it only knows.
>An LLM can't tell the difference between fact and fiction, because it can't apply logic.
Not true. They can differentiate it just fine. Of course, being able to tell the difference and being incentivized to communicate it are two different things.
Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback - https://arxiv.org/abs/2305.14975
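For readers unfamiliar with the term, "calibrated" here means that stated confidence tracks empirical accuracy. A small illustrative sketch follows; it shows the standard expected-calibration-error metric, not the paper's elicitation method, and the numbers at the bottom are hypothetical.

```python
# Expected calibration error: bucket stated confidences and compare each
# bucket's average confidence with its empirical accuracy.
def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Hypothetical (confidence, was_correct) pairs elicited from a model:
confs = [0.9, 0.8, 0.95, 0.6, 0.7, 0.99, 0.5, 0.85]
right = [1,   1,   1,    0,   1,   1,    0,   0]
print(expected_calibration_error(confs, right))
```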
The article focuses on linguistic theories that regard natural language in a somewhat narrow sense: language as a cognitive phenomenon, language as a symbolic system, language as something that can be modeled and imitated by a computer. While those aspects are important, language is also a social phenomenon, something used for interaction among people for many different purposes.
Both traditional approaches to natural language—the cognitive and the social—face a challenge, because language is no longer used only for communication between human beings. What does it mean for linguistics now that human beings are increasingly using natural language to interact with computers? I don’t know the answer, but it’s a question that deserves attention.
> What does it mean for linguistics now that human beings are increasingly using natural language to interact with computers?
Yeah, it's going to be strange. So much of our etiquette, norms, and basic cultural assumptions have to do with this idea:
This communication protocol of "language" is for humans. If something else is "on the wire" with us, it must be like us, and worthy of our attention or cycles. But this is increasingly going to be very untrue.
Well, in my case, I was struggling to come up with a coherent understanding for myself about how views of language might need to change when interaction with LLMs is taken into account. ChatGPT’s response helped me think about the issue more clearly. Others might not find the same clarity.
More probably, (most) humans will realize that talking to computers is not very interesting or productive, and decide it is in fact not worthy of our attention or cycles. Much like an extremely fancy phone menu system.
When language is regarded as a symbolic system, what this means is that it is a structure of relations between symbols emerging from social interactions. This is fundamental to linguistics: language is precisely not governed by some magical rules, but emerges from the social, which is how you can explain embedded power structures, change, and so on. The connections between subject and subject, subject and object, and object and object are an active field of study. It is part of my PhD thesis, too.
Text is not the important thing here at all. Audio language models exist, they are relatively trivial to engineer, and they will model audio just as well as they model text or any arbitrary sequence-to-sequence data.
Linguistics is useful fiction. It doesn't accurately describe the complexity of language any more than Newton's laws accurately describe gravity. You'd think people would understand that by now, with symbolic approaches failing at this sort of stuff after decades of trying, but I guess not. We don't know how we work. We really, really don't.
The obsession with how we think we work over methods that are actually working is really something to behold.
A fun pastime is to go find old papers by philosophers and linguists that make concrete claims about things language models will never be able to do, and then go type their examples into ChatGPT and see the model do just fine.
I know a philosopher with four legs and a fondness for balls and bones. Without using language, he knows if I’m happy or sad and can usefully alter my state. Get ChatGPT to repeat this trick.
You are aware the same transformer architecture that underpins LLMs has shown great success in multimodal applications? Such as successfully modeling sound and music?
So the toddler gets to start learning with the most complicated known entity in the universe in their head, a brain honed by millions and millions of years of evolution into computing extremely efficiently, with the ability to generate languages from scratch if no examples are provided or learn them by default if any examples are provided, while the LLM gets to start with... a table of random numbers? This thought experiment is designed to make LLMs look less impressive?
Compared to an LLM, the corpus they use to become proficient at language (and countless other things alongside) is achingly small. It's also embodied, so sounds and utterances can be related to the world and to actions. Thus comparing outputs between the two entities is really like comparing apples to oranges; it's not that the pronouncements of linguists and philosophers are useless in the face of it, they're just in a different domain.
I don't agree. The amount of written words we feed in, yes. But a toddler is a multimodal system, taking in not only audio/images but actually video, plus other senses like touch and smell. This complex 24/7 input is a lot of data, and the brain forms relations to classify the world before it can spell out a label (a word) for it.
Even just looking at a simple word (like "apple"), the human already has a lot of contextual information about it, mostly how it's presented and how the presenter frames it (like: yummy, tasty fruit! like other foods in the fridge, ...), and then people say the word while physically pointing at the thing.
My guess is that's an order of magnitude more total input than our LLMs get solely via text or other standalone channels for training.
>Compared to an LLM, the corpus they use to become proficient at language (and countless other things alongside) is achingly small.
But that's only true if you assume humans start at zero and that the millions of years of natural selection that gave us our incredible brains provide no advantage to learning, which seems facially absurd. The toddler probably sees less data in those five years than the LLM sees in a few months, but it's worth remembering how rich human sensory data is. Smell is heavily tied to memory and is incredibly complex; taste and touch sensitivity in the mouth is high enough that toddlers stuff new things in their mouths to understand them better; human hands are incredibly sensor-dense. And that's before we get to the classics of 1000fps vision, consisting of partially high-definition and partially upscaled low-definition video, or audio that carries huge amounts of information not only about material properties, energy magnitudes, or space layouts and shapes, but also a huge amount of semantic data through speech and language for most children. But unlike the LLM, they didn't start with a random selection of numbers; they started with a human brain, which is incredibly impressive. Frankly, it is genuinely surprising that LLMs are able to rival us as well as they do, considering how much less complex they are than our brains.
>it’s not that the pronouncements of linguists and philosophers are useless in the face of it, they’re just in a different domain.
I go back and forth on whether this is true. I see the argument, but it also makes intuitive sense that studying an AI system whose intelligence comes entirely from human language can probably yield insights about human language. It doesn't seem any less plausible than studying the way slime molds feed on oat flakes to gain insights into public transport design and infrastructure, which is something we have done with significant success.
A philosopher would likely retort something along the lines of “it doesn’t come from human language, it comes from statistically modelling a giant corpus of readymade text, which isn’t the same thing as language but an encoding of it”.
> Let it passively listen to the same sounds and speech as a baby hears in its first five years, over five years of operation.
First of all, a baby does not passively listen. It sees and touches and smells and feels, and moves and makes sounds, and observes reactions, and correlates all that with the sounds it hears.
Secondly, with the above in mind, why on Earth would you expect it to "do absolutely nothing of any use"? It feels obvious it'll start homing in on language, with even better grounding (but worse performance) than an LLM. And I fully expect there to be a way to merge such a model with an LLM, the same way people combine LLMs with image generators.
"Have you seen the paper that's called Modern Language Models Refute Chomsky's Approach to
Language? The one by Steven Piantadosi. If so, what do you make of it?
Well, I mean, unlike a lot of people who write about this, he does know about
large language models, but the article makes absolutely no sense. It has a minor problem
and a major problem. The minor problem is that it's beyond absurdity to think that you can
learn anything about a two
bunch of supercomputers scanning 50 terabytes of data, looking for statistical regularities,
and stringing them together. To think that from that you can learn anything about an infant is
so beyond absurdity that it's not worth talking about. That's the minor problem.
The major problem is that, in principle, you can learn nothing. In principle,
make it 100 terabytes, use 20 supercomputers. Just bring out the, in principle, impossibility.
Even worse, for more clearly, for a very simple reason, these systems work just as well for
impossible languages that children can't acquire as they do for possible languages that children
do acquire. It's kind of as if a physicist came along and said, I've got a great new theory.
It accounts for a lot of things that happen, a lot of things that can't possibly happen,
and I can't make any distinction among them. We would just laugh. I mean, any explanation of
anything says, here's why things happen, here's why other things don't happen. You can't make
that distinction. You're doing nothing. That's irremediable. It's built into the nature of the
systems. The more sophisticated they become at dealing with actual language, the more it's
demonstrated that they're telling us nothing, in principle, about language, about learning,
or about cognition. So, there's a minor problem with this paper, and a major one.
The minor problem is the simple absurdity of thinking that looking at anything of the scale
could tell you about what's happening. The major problem is you can't do it in principle.
Yes, I heard you describe this as if a physicist came along and said,
I have a theory, and it's two words. Anything goes.
Well, that's basically the large language models. You give it a system that's designed in order to
violate the principles of language, it'll do just as well. Maybe better, because it can use simple
algorithms that aren't used by natural language. So, it's basically, like I said, or like I say,
suppose some guy comes along with an improvement on the periodic table,
says I got a theory that includes all the possible elements, even those that haven't
been discovered, and all kinds of impossible ones, and I can't tell any difference.
It's not an improvement on the periodic table. That's telling you nothing about chemistry.
That's built into the design of the system. It's not remediable."