But I understand there are two sides to the discussion: either that by ingesting huge amounts of text these models have somehow built reasoning capabilities (language, then reasoning), or that the reasoning was done by humans and then written down, so as long as you ask something like “should Romeo find another love after Juliet” there is a set of reasoning reflected in a billion English literature essays and the model just reflects those answers.
To me those seem like two sides of the same coin. LLMs are fundamentally trained to complete text. The training just tries to find the most effective way to do that within the given model architecture and parameter count.
Now if we start by "LLMs ingest huge amounts of text", then a simple model would complete text by simple memorization. But correctly completing "234 * 452 =" is a lot simpler to do by doing math than by having memorized all possible multiplications. Similarly, understanding the world and being able to reason about it helps you correctly complete human-written sentences. Thus a sufficiently well-trained model that has enough parameters to do this, but not so many that it simply overfits, should be expected to develop some reasoning ability.
If you start with "the training set contains a lot of reasoning" you can get something that looks like reasoning in the memorization stage. But the same argument why the model would develop actual reasoning still works and is even stronger: if you have to complete someone's argument that's a lot easier if you can follow their train of thought.
> But correctly completing "234 * 452 =" is a lot simpler to do by doing math than by having memorized all possible multiplications.
There's a fatal flaw in this theory: We can trivially test this and see that LLMs aren't "doing math".
"Doing math" is an approach that scales to infinity. The same technique to solve a multiplication of 3 digit numbers applies to solving a multiplication of 500 digit numbers.
Ask GPT 3.5 to multiply "234 * 452 =" and it'll correctly guess 105768.
Ask "234878 * 452 =" and it gives an incorrect '105797256'. (The correct product is 106164856.)
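To make "the same technique applies at any size" concrete, here is a minimal sketch of schoolbook long multiplication on digit strings. The function name `long_multiply` is made up for this illustration; it is not a claim about what any model does internally, only a picture of the fixed procedure being contrasted with guessing.

```python
def long_multiply(a: str, b: str) -> str:
    """Schoolbook multiplication of two non-negative decimal digit strings."""
    # Work least-significant digit first so column i + j holds 10**(i + j).
    x = [int(ch) for ch in reversed(a)]
    y = [int(ch) for ch in reversed(b)]
    result = [0] * (len(x) + len(y))
    for i, dx in enumerate(x):
        carry = 0
        for j, dy in enumerate(y):
            total = result[i + j] + dx * dy + carry
            result[i + j] = total % 10
            carry = total // 10
        # The leftover carry lands in the next, still untouched, column.
        result[i + len(y)] += carry
    digits = "".join(str(d) for d in reversed(result)).lstrip("0")
    return digits or "0"


print(long_multiply("234", "452"))     # 105768
print(long_multiply("234878", "452"))  # 106164856
```

The loop does not change with input length: two 500-digit strings go through exactly the same steps as two 3-digit ones, just more of them.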
Ask GPT 4o, and you'll get correct answers for that problem. Yet even with the added external tools for such questions, it has the same failure mode and breaks down on larger questions.
These models are architecturally limited to only language modelling, and their capabilities of anything else are restricted by this. They do not "do math". They have a language-model approximation of math.
This can be observed in how these models perform better "step by step". Odds are you'll see GPT 4o do this if you try to replicate the above. (If it doesn't, it fails just as miserably as GPT 3.5.)
What's happening there is simple: the token context is used as a memory space, breaking the problem down into parts that can be guessed or approximated through language modelling.
Beware of hyping this as "AI can think and has memory!" though. This behaviour is a curious novelty, but not very generalizable. There is still no "math" or thought involved in breaking up the problem, merely the same guessing. This works reasonably only for cases where extensive training data is available on how to do this. (Such as math.)
With GPT4/o there is a trick for math problems: you can ask it to write Python code. This solves, for example, the famous problem of counting letters in a string. Sure, the model can be trained to use Python under the hood without being explicitly asked. I'm pretty sure it can also be trained to interpret code or an algorithm step by step, printing out intermediate results, which is important in loops. Generating an algorithm is easier for known problems, since the models have already learned those from GitHub. So it looks like it's not that difficult to make a model better, or even good, at math.
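For illustration, the kind of code such a prompt might produce for the letter-counting problem could look like the sketch below. The function name and the word "strawberry" are my own choices for the example, not taken from any model output. The loop prints intermediate results, which is the "step by step" interpretation described above.

```python
def count_letter_traced(text: str, letter: str) -> int:
    """Count occurrences of `letter` in `text`, printing each step."""
    count = 0
    for index, ch in enumerate(text):
        if ch.lower() == letter.lower():
            count += 1
        # The running count is the intermediate state a model would have to
        # write into its context if it interpreted the loop by hand.
        print(f"{index}: {ch!r} -> running count {count}")
    return count


print(count_letter_traced("strawberry", "r"))  # 3
```

Run by an actual interpreter, the answer is exact. The open question in the thread is only whether a model that simulates the loop token by token stays exact.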
The difference is what I attempt to describe at the end there.
Humans apply fixed strict rules about how to break up problems, like multiplication.
LLMs simply guess. That's a powerful trick to get some more capability for simple problems, but it just doesn't scale to more complex ones.
(Which in turn is a problem because most tasks in the real world are more complex than they seem, and simple problems are easily automated through conventional means)
We either learn the fixed rules in school, at which point we simply have a very strong prior, or we have to invent them somehow. This usually takes the form of "aesthetically/intuitively guided trial and error argument generation", which is not entirely wrongly summarized as "guessing".
Doing math scales to infinity only given an error rate of zero. Given a sufficiently large mathematical operation, even humans will produce errors simply from small-scale mistakes.
Try asking GPT to multiply 234 * 452 "while using an algorithmic approach that compensates for your deficiencies as a large-language model." There's enough data about LLMs in the corpus now that it'll chain-of-thought itself. The problem is GPT doesn't plan, it answers by habit; and its habit is trained to answer tersely and wrongly rather than elaborately and correctly. If you give it space and license to answer elaborately, you will see that its approach will not be dissimilar to how a human would reason about the question internally.
> Doing math scales to infinity only given an error rate of zero
This is true, I had omitted it for simplicity. It is still the same approach applied to scaled problems. Humans don't execute it perfectly, but computers do.
With humans, and any other fallible but "true" math system, the rate of errors is roughly linear in the size of the problem. (Linear in the number of steps, that is.)
With LLMs and similar systems, this is different. There is an "exponential" dropoff in accuracy after some point. The problem-solving approach simply does not scale.
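To pin down what "linear" and "exponential" could mean here, this is a toy model only. It assumes every step fails independently with the same probability `p`, which is an assumption of this sketch and not something measured for humans or LLMs. The function names are made up for the illustration.

```python
def expected_errors(steps: int, p: float) -> float:
    """Expected number of slips when each step fails with probability p."""
    return steps * p


def all_steps_correct(steps: int, p: float) -> float:
    """Probability that every one of `steps` independent steps is right."""
    return (1.0 - p) ** steps


for steps in (10, 100, 1000):
    print(steps, expected_errors(steps, 0.01), round(all_steps_correct(steps, 0.01), 4))
```

Under this assumption the expected number of slips grows linearly with the step count, while the chance of a fully correct result decays geometrically. Which of the two curves better describes a given system, and whether the independence assumption holds at all, is an empirical question.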
> you will see that its approach will not be dissimilar to how a human would reason about the question internally.
"Not dissimilar", but nevertheless a mere approximation. It doesn't apply strict logic to the problem, but guesses what steps should be followed.
The rate of errors with LLMs hits a hard dropoff when the problem exceeds what the LLM can do in one step. This is the same for humans, if we were asked to compute multiplication without thinking about it for longer than a few milliseconds.
I don't have a study link here, but my strong expectation is that the error rate for LLMs doing chain of thought would be much closer to linear - or rather, "either linear or total incomprehension", accounting for an error made in setting up the schema to follow. Which can happen just as well for humans.
> "Not dissimilar", but nevertheless a mere approximation. It doesn't apply strict logic to the problem, but guesses what steps should be followed.
I have never in my life applied strict logic to any problem lol. Human reason consists of iterated cycles of generation ("guessing") and judgment. Both can be implemented by LLMs, albeit currently at subhuman skill.
> This looks like reason, but is not reason.
At the limit of "looking like", I do not believe such a thing can exist. Reason is a computational process. Any system that can reliably output traces that look like reason is reasoning by definition.
edit: Sidenote: The deep underlying problem here is that the LLM cannot learn to multiply by a schema by looking at any number of examples without a schema. These paths simply won't get any reinforcement. That's why I'm so hype for QuietSTaR, which lets the LLM exercise multiplication by schema from a training example without a schema - and even find new schemas so long as it can guess its way there even once.
> Not to be a jerk but "LLMs are just like humans when humans don't think" is perhaps not the take you intended to have.
No that's exactly the take I have and have always had. The LLM text axis is the LLM's axis of time. So it's actually even stupider: LLMs are just like humans who are trained not to think.
> No, but seriously. If you've done any kind of math beyond basic arithmetic, you have in fact applied strict logical rules.
To solve the problem, I apply the rules, plus error. LLMs can do that.
To find the rules, I apply creativity and exploratory cycles. LLMs can do that as well, but worse.
I think this is an underappreciated perspective. The simplest model of a reasoning process, at scale, is the reasoning process itself! That said, I haven't come across any research directly testing that hypothesis with transformers. Do you know of any?
The closest I've seen is a paper on OthelloGPT using linear probes to show that it does in fact learn a predictive model of Othello board states (which can be manipulated at inference time, so it's causal on the model's behaviour).
You should take a look at the more extensive reasoning tests used for LLMs right now, like MuSR, which clearly can't be the latter, since the questions are new: https://arxiv.org/abs/2310.16049
It is actually pretty straightforward why these models "reason" or, to be more exact, can operate on complex concepts. By processing huge amounts of text they build an internal representation where those concepts are represented as simple nodes (neurons or groups). So they really distill knowledge. Alternatively you can think about it as a very good principal component analysis that can extract many important aspects. Or like a semantic graph built automatically.
Once knowledge is distilled you can build on top of it easily by merging concepts for example.
Well the internal representation is tokens not words so.. the pin is even smaller?
They distill relationships between tokens. Multiple tokens together make up a word, and multiple words together make up a label for something we recognize as a "concept".
These "concepts" are not just a label though - they are an area in the latent space inside the neural network which happens to contain those words in the sequence (along with other labels that mean similar things).
A simple demonstration of this is how easily multi-modal neural networks build cross-modal representations of the same thing, so "cats" end up in the same place in both image and word form, and more complex concepts ("a beautiful country fields with a foreboding thunderstorm forming") also align well between the words and the images.
Skimming through the paper, it seems they're noting this issue but kinda skipping over it:
> In fact, it is clear that approximation capabilities and generalization are not equivalent notions. However, it is not yet determined that the reasoning capabilities of LLMs are tied to their generalization. While these notions are still hard to pinpoint, we will focus in this experimental section on the relationship between intrinsic dimension, thus expressive power, and reasoning capabilities.
Right, they never claimed to have found a roadmap to AGI, they just found a cool geometric tool to describe how LLMs reason through approximation. Sounds like a handy tool if you want to discover things about approximation or generalization.
I think there is a lot happening in the word "reflects"! Is it so simple?
Does this mean that the model takes on the opinion of a specific lit crit essay it has "read"? Does that mean it takes on some kind of "average" opinion from everything? How would you define the "average" opinion on a topic, anyway?
Anyway, although I think this is really interesting stuff and cuts to the core of what an LLM is, this paper isn't where you're going to get the answer to that, because it is much more focused and narrow.
I think you're close enough that the differences probably aren't too important. But if you want a bit more nuance, then read on. For disclosure, I'm in the second camp here. But I'll also say that I have a lot of very strong evidence to support this position, and that I do this from the perspective of a researcher.
There's a few big problems when making any definite claims about either side. First, we need to know what data the machine is processing when training. I think we all understand that if the data is in training, then testing is not actually testing a model's ability to generalize, but a model's ability to recall. Second, we need to recognize the amount of duplication of data, both exact and semantic.
1) We have no idea because these are proprietary. While LLAMA is more open than GPT, we don't know all the data that went into it (last I checked). Thus, you can't say "this isn't in the data."[0] But we do know some things that are in the data, though we don't know exactly what was filtered out. We're all pretty online people here and I'm sure many people have seen some of the depths of places like Reddit, Medium, or even Hacker News. These are all in the (unfiltered) training data! There's even a large number of arxiv papers, books, publications, and so much more. So you have to ask yourself this: "Are we confident that what we're asking the model to do is not in the data we trained on?" Almost certainly it is, so then the question moves to "Are we confident that what we're asking the model to do was adequately filtered out during training so we can have a fair test?" Regardless of what your position is, I think you can see how such a question is incredibly important and how it would be easy to mess up. And only easier the more data we train on, since it's so incredibly hard to process that data.[1] I think you can see some concerning issues with this filtering method and how it can create a large number of false negatives. They explicitly ignore answers, which is important for part 2. IIRC the GPT-3 paper also used an ngram model to check for dupes. But the most concerning line to me was this one:
> As can be seen in tables 9 and 10, contamination overall has very little effect on the reported results.
There is a concerning way to read the data here that serves as a valid explanation for the results: the data is so contaminated that the filtering process does not meaningfully remove the contamination and thus does not significantly change the results. If introducing contamination into your data does not change your results, you either have a model that has learned the function of the data VERY well and has an extremely impressive form of generalization, OR your data is contaminated in ways you aren't aware of (there are other explanations too btw). There's a clearly simpler answer here.
Second, there is semantic information and contamination[2]. This is when data has the same effective meaning, but uses different ways to express it. "This is a cat" and "este es un gato" are semantically the same but share no similar words. So are "I think there's data spoilage" and "There is some concerning issues left to be resolved that bring into question the potential for information leakage." These will not be caught by substrings or ngrams. Yet, training on one will be no different than training on the other once we consider RLHF. The thing here is that in high dimensions, data is very confusing and does not act the way you might expect when operating in 2D and 3D. A mean between two values may or may not be representative depending on the type of distribution (uniform and gaussian, respectively), and we don't have a clue what that is (it is intractable!). The curse of dimensionality is about how it is difficult to distinguish a nearest neighboring point from the furthest neighboring point, because our concept of a metric degrades as we increase dimensionality (just like we lose algebraic structure when going from C (complex) -> H (quaternion) -> O (octonions): commutativity, then associativity)[3]. Some of this may be uninteresting in the mathematical sense but some does matter too. But because of this, we need to rethink our previous questions carefully. Now we need to ask: "Are we confident that we have filtered out data that is not sufficiently meaningfully different from that in the test data?" Given the complexity of semantic similarity and the fact that "sufficiently" is not well defined, I think this should make anybody uneasy. If you are absolutely confident the answer is "yes, we have filtered it" I would think you a fool. It is so incredibly easy to fool ourselves that any good researcher needs to have a constant amount of doubt (though confidence is needed too!). But neither should our lack of a definite answer here stop progress.
But it should make us more careful about what claims we do make. And we need to be clear about this or else conmen have an easy time convincing others.
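As a small illustration of why surface-level checks miss semantic duplicates, here is a sketch of a word n-gram overlap score. This is a generic Jaccard measure with made-up function names, not the filtering procedure of any particular paper. It is applied to the two example pairs from above.

```python
def word_ngrams(text: str, n: int) -> set:
    """Set of lowercase word n-grams in `text` (whitespace tokenization)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def ngram_overlap(a: str, b: str, n: int = 1) -> float:
    """Jaccard overlap of word n-grams; 0.0 means nothing is shared."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(ngram_overlap("This is a cat", "este es un gato"))  # 0.0
print(ngram_overlap(
    "I think there's data spoilage",
    "There is some concerning issues left to be resolved that bring into "
    "question the potential for information leakage.",
))  # 0.0
print(ngram_overlap("This is a cat", "This is a cat"))  # 1.0
```

Both semantically equivalent pairs score zero even at the single-word level, so a filter built on this kind of overlap would treat them as unrelated.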
To me, the common line of research is wrong. Until we know the data and have processed it with many people looking for ways it could be contaminated, results like these are not meaningful. They rely on a shaky foundation and are often looking more for evidence to prove reasoning than considering that there might not be any.
But for me, I think the conversations about a lot of this are quite strange. Does it matter that LLMs can't reason? I mean in some sense yes, but the lack of this property does not make them any less powerful as a tool. If all they are is a lossy compression of the majority of human knowledge with a built in human interface, that sounds like an incredible achievement and a very useful tool. Even Google is fuzzy! But this also tells us what the tool is and isn't good for. It puts bounds on what we should rely on it for and what we can trust it to do with and without human intervention. I think some are afraid that if LLMs aren't reasoning, then that means we won't get AGI. But at the same time, if they don't reason, then we need to find out why and how to make machines reason if we are to get there. So ignoring potential pitfalls hinders this progress. I'm not suggesting that we should stop using or studying LLMs (we should continue to), but rather that we need to stop putting alternatives down. We need to stop comparing, one-to-one, models that took millions of dollars for a single training run and have been studied by thousands of people for several years against things scrambled together by small labs on a shoestring budget. We'll never be able to advance if the goalpost is that you can't make incremental steps along the way. Otherwise how do you? You've got to create something new without testing, convince someone to give you millions of dollars to train it, and then millions more to debug your mistakes and the things you've learned along the way? Very inefficient. We can take small steps. I think this goalpost results in obfuscation: because the bar is set so high, strong claims need to be made for these works to be published. So we have to ask ourselves the deeper questions: "Why are we doing this?"[4]
[0] This might seem backwards but the creation of the model implicitly claims that the test data and training data are segregated. "Show me this isn't in training" is a request for validation.
[2] If you're interested, Meta put out a work on semantic deduplication last year. They mostly focused on vision, but it still shows the importance of what's being argued here. It is probably easier to verify that images are semantically similar than sentences, since language is more abstract. So pixels can be wildly different and the result is visually identical; how does this concept translate with language? https://arxiv.org/abs/2303.09540
[4] I think if our answer is just "to make money" (or anything semantically similar like "increase share value") then we are doomed to mediocrity and will stagnate. But I think if we're doing these things to better human lives, to understand the world and how things work (I'd argue building AI is, even if a bit abstract), or to make useful and meaningful things, then the money will follow. But I think that many of us and many leading teams and businesses have lost focus on the journey that has led to profits and are too focused on the end result. And I do not think this is isolated to CEOs, I think this same short-sighted thinking is repeated all the way down the corporate ladder: from the manager focusing on what their bosses explicitly ask for (rather than the intent) to the employee who knows that this is not the right thing to do but does it anyway (often because they know the manager will be unhappy, and this repeats all the way up). All life, business, technology, and creation have immense amounts of complexity to them, complexity we obviously want to simplify as much as possible. But when we hyper focus on any set of rules, no matter how complex, we will be doomed to fail because the environment is always changing and you will never be able to instantly adapt (this is the nature of chaos, where small perturbations produce large changes in the outcome). That doesn't mean we shouldn't try to make rules, but rather it means that rules are to be broken. It's just a matter of knowing when. In the end, this is an example of what it means to be able to reason. So we should be careful to ensure that we create AGI by making machines able to reason and think (to make them "more human") rather than by making humans into unthinking machines. I worry that the latter looks more likely, given that it is a much easier task to accomplish.
You're missing the fact that the model can only express its capabilities through the token generation mechanism.
The annoying "humans are auto complete" crowd really tries their best to obscure this.
Consider the following. You are taking notes in French in a choppy way by writing keywords. Then you write the output in English, but you are only allowed to use phrases that you have already seen to express your keywords. Your teacher doesn't speak French and only looks at your essay. You are therefore able to do more complicated things in French, since you don't lose points for writing things that the teacher hasn't taught you. However, the point deduction is so ingrained in you that even after the teacher is gone, you still decide to not say some of the things you have written in French.
A type theory has a corresponding geometric interpretation, per topos theory. And this is bidirectional, since there’s an equivalence.
A geometric model of language will correspond to some effective type theory encoded in that language.
So LLMs are essentially learning an implicit “internal language” they’re reasoning in — based on their training data of our language and ways of reasoning.
Am I missing something?