Wow, a lot of grumpiness in here. If it's true that adding like 20 or so tokens to encode column location / decimal spot triples math performance in out of band tasks, that's a big deal. It's a simple fix, it improves performance A LOT, and they even indicate it's not just a party trick, in that the LLM can use the information to do better on related tasks like sorting and list making.
This is basically free to add, and there's no reason it shouldn't be made part of standard tokenization.
I'm more interested in the question of how we can find other useful concepts for data -> embedding space like this; can we incept our tokenization inception so it has more inception?
This is cool, but special casing digits is unsatisfying.
It makes me think that the authors have correctly identified an issue (positional embeddings) but don't propose a general solution.
I'm not sure if such a thing is possible, but if it is, it would feel more complete. (Fwiw, positional embeddings have had issues for a long time! So a general solution to this would benefit more than just arithmetic. Helpfully, we now have a really good specific example to serve as a baseline for any generalization we seek)
but it makes sense to have a different encoding. Mathematics is a completely different language. Maybe we should have more than one class of encodings.
There were some recent posts (either here or reddit) supporting the claim that different regions activate when reading programs vs when reading text. If we take that to be true, and squint just enough, one could claim that arithmetic and mathematics should be treated differently from language.
I would only find that satisfying (from a snobbish and impractical perspective) if we were able to have the model decide: 1) what encoding should this section use? 2) how should I train this encoding?
A mixture of experts but for encodings is interesting, though!
For arbitrary documents and queries, how do we reliably segment the text between those two different languages? And if we can do that, why can't the model do it implicitly?
I'm with you. I get that this is akin to asking a human, because we're trying to reason, so we will bring along (assumedly) unavoidable deficiencies of human reasoning. But if I were to ask a human genius this question, ne would grab a calculator and employ it as ne did the rest of ner reasoning.
So it seems like we should probably teach LLMs to "use a calculator", rather than try to get them to be more right when doing math 'in their head'.
Solving that will be a much bigger deal, but it's at odds with producing a highly accurate emulation of human thought and language. Language models can serve as tools to understand and experiment with logic formulated as natural language, but that isn't their primary purpose.

What you're asking for is equivalent to creating an auditable trace of everything that goes into making a statement, which is pretty much impossible even for the person making the statement. We can get close by limiting ourselves to narrow domains like mathematics, but even then someone can come along and question the premises on which we construct such a system. I'm not saying it isn't worth pursuing, it just isn't the standard that we should hold a model to when we ourselves are incapable of it.

The goal here is to create a system capable of doing the things that a human can do. If you prefer to have a system that behaves within the confines of a mathematical formalism with well-defined rules, then build that model instead.
It's entirely possible. Don't use LLMs for math. Use the computers we already have that have been capable of doing math accurately for a century. Right tool, right job.
An LLM in isolation is not a general purpose system, but with ChatGPT at least, most of the time you don't need to ask. In fact, it's increasingly difficult to force it to do "manual" maths, as it's strongly predisposed to do things like write and evaluate Python code instead of doing it "manually".
E.g. I verified just now and asked it to multiply two huge numbers, and it immediately spat out Python code, then evaluated it and gave me the result, rather than trying to multiply the numbers itself.
A basic transformer architecture performs only a bounded amount of computation per generated token, so it can never emulate a machine computing sufficiently hard problems.
> This is basically free to add, and there's no reason it shouldn't be made part of standard tokenization.
This is muchhhhh different from how tokenization works today. Adding tokens to the vocabulary is free, everything outside that (i.e. string -> tokens) is going to be a major pain in the ass. Doable but annoying and error prone
Good old software development. :( Recent case studies:
- llama.cpp wasn't tokenizing properly, and it came to a head with llama3. Essentially every local model before May 2024 is soft-deprecated, new ones have to indicate the proper tokenizer, and that currently only covers a small subset of popular models
- I recently had to review 41 Phi-3 and Llama 3 models, only 3 had the right tokenizer set
Not saying it's impossible, and we definitely should, and I bet it 100% happens, but...*shudders*
Meanwhile, I just wrote a custom tokeniser for my fan control experiment.
It features such amusements as:
- Tokens representing the current time of day and day of week, with half-hour granularity. [14:30][Monday], as the debugger reports.
- An entirely separate set of numeric tokens for CPU usage and such, on a logarithmic scale. Also features tokens for digit position, measured from the right.
- A hardcoded text tokeniser for executable paths. [/nix/store](..cut..)/bin/executable name. I didn't feel like using the usual approach, so I built a huffman compressor to generate the tokens for arbitrary text, because why not.
- Tokens representing program state - "just started", "long-running", etc.
- Tokens representing the fact that the following text is from `tail -f ~/.bash_history`.
- Start-of-segment tokens for each of the above, and also for GPU and CPU core complex power usage.
It's not that many tokens in total, and the input is structured data, so why not represent it as such? I still had sixty-five thousand tokens for the text tokeniser.
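To make the above concrete, here's a rough Python sketch of what a typed, structured tokeniser like that can look like. All of the token ranges, bucket counts and helper names below are invented for illustration; the real thing also has the Huffman-coded path tokeniser and the history-segment tokens described above.

    import math

    # Hypothetical token-ID ranges; the real layout would differ.
    TIME_BASE = 0       # 336 tokens: 48 half-hours x 7 weekdays
    CPU_BASE = 400      # log-scale CPU-usage buckets
    STATE_BASE = 500    # program-state tokens ("just started", "long-running", ...)
    SEG_BASE = 600      # start-of-segment markers

    def time_token(hour, minute, weekday):
        # [14:30][Monday] -> one token per (half-hour, weekday) pair
        half_hour = hour * 2 + (1 if minute >= 30 else 0)
        return TIME_BASE + weekday * 48 + half_hour

    def cpu_token(cpu_percent, n_buckets=32):
        # Logarithmic buckets: 1% vs 2% matters more than 90% vs 91%
        frac = math.log1p(cpu_percent) / math.log1p(100)
        return CPU_BASE + min(n_buckets - 1, int(frac * n_buckets))

    def encode_sample(hour, minute, weekday, cpu_percent, state_id):
        return [
            SEG_BASE + 0, time_token(hour, minute, weekday),
            SEG_BASE + 1, cpu_token(cpu_percent),
            SEG_BASE + 2, STATE_BASE + state_id,
        ]

    print(encode_sample(14, 30, weekday=0, cpu_percent=37.5, state_id=1))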
And when engineers accumulate enough related hacks, scientist-types may discover a pattern and find a proper, general solution. But they wouldn't get there without the pile of hacks that are effectively meta-level empirical evidence.
AI research has mostly progressed when there’s been enough processing power to avoid needing to use the old style of hacks rather than any sort of generalization going on.
AlphaZero vs Stockfish wasn’t some outgrowth of existing methods. They basically threw the old style away and started over.
Object recognition, LLMs, etc. all involved throwing what used to be unimaginable levels of data and compute at a problem that “suddenly” worked. Not saying the people at OpenAI aren’t clever, but instead that it wouldn’t have worked in 2000.
It's also obvious and it's hacky. Frankly I'm stunned this hasn't been tried yet. The people thinking this is a stepping stone to More Intelligence are missing the forest for the trees.
Deep learning is always and only ever about representing data abstractly. The more abstractions you can make irrelevant (why would you have to learn how to do math when the base-10 perspective on ASCII-digits is already provided for you?) the more you've biased your architecture to readily learn and understand the problem space.
Intelligence doesn't exist where Divine Creator gave you access to this or that faculty. It's developing those faculties yourself by reasoning through the process of composing your own mental model about the problem.
ASCII digits do not always imply base-10 numbers, they can also be identifiers (e.g. phone numbers), parts of words (IPv6, Log4j), and used in various 'written slang' such as g2g, 4ever, m8 for mate, etc, etc.
And, crucially, I'd argue that in "chatbot" tasks those other uses are more common than arithmetic, so arbitrary focus to specifically optimize arithmetic doesn't really make sense - the bitter lesson is that we don't want to bias our architecture according to our understanding of a specific problem space but rather enable the models to learn the problem space directly from data.
Stepping one level out in the metacognition hierarchy is the key. "Learning to learn" as it were. It is only the relative ease of implementation and deployment of feedforward models like Transformers that makes it seem like we have reached an optimum but we desperately need to move beyond it before it's entrenched too thoroughly.
Okay, but it does seem that this hack is in the entirely opposite direction; a pure transformer is more towards "learning to learn" than any special preprocessing to explicitly encode a different representation of numbers.
We probably do have to move beyond transformers, but not in the direction of such hacks, but rather towards even more general representations that could encode the whole class of all such alternate representations and then learn from data which of them work best.
You seemingly missed the part where the next model could learn how to generate its own hierarchical position embeddings. The problem here is obviously that you want the model to look at position i in object a and object b where the position i was chosen by a previous layer. If anything, the answer is probably to just have a dynamic position input from the model into the RoPE embedding, then it can learn the ideal position encoding on its own.
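For what it's worth, a minimal numpy sketch of that last idea (split-half RoPE convention): the `positions` argument is just a float array, so nothing stops an earlier layer from producing it instead of using 0..n-1. The `learned_positions` values below are made up.

    import numpy as np

    def rope(x, positions, base=10000.0):
        # x: (seq_len, dim), positions: (seq_len,) floats -- not necessarily 0..n-1
        seq_len, dim = x.shape
        half = dim // 2
        freqs = base ** (-np.arange(half) / half)       # per-pair rotation frequencies
        angles = positions[:, None] * freqs[None, :]    # (seq_len, half)
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[:, :half], x[:, half:]
        return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

    x = np.random.randn(6, 8)
    # Standard RoPE would use np.arange(6); here a (hypothetical) earlier layer
    # emits digit positions instead, so the rotation encodes structure, not raw index.
    learned_positions = np.array([0.0, 1.0, 2.0, 0.0, 1.0, 2.0])
    print(rope(x, learned_positions).shape)   # (6, 8)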
I think the problem here is that 'understanding' is not the same as curve fitting.
If all one is doing is giving a model lots of data and fitting curves, it's not really 'understanding' but brute forcing its way (with gradient descent), then storing the weights, and finally approximating the solution when a query is passed in.
This is not the same as understanding. Human intelligence can operate deterministically as well as non-deterministically. We can listen to language, which is by its nature non-deterministic, and convert that into deterministic operations and vice versa, i.e. we can operate on some logic and explain it in multiple ways to other people.
Understanding requires much less data than brute forcing your way into pattern recognition.
When you see a simple expression like 2 * 4 you are able to understand that it's equivalent to 2 + 2 + 2 + 2, and that in turn means 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 <- count that and you've got your answer.
Because you 'understand' this basic concept and all the operations in between, you are able to compute more examples. But you only need to understand it once. Once you understand multiplication and addition and all the tricks in between, you are able to compute 23 * 10 without being fed 23 * 10 as prior data. Understanding is very different from fitting a curve. You can reach conclusions and understanding through pattern recognition, but it's important to differentiate 'approximation' from 'calculation'. If you understand something in its entirety you should be able to calculate an outcome deterministically.
Right now LLMs lack 'understanding', and seem to only 'approximate', which may seem like 'understanding' but is actually not.
I think you are mixing layers of abstraction. To make a crude but I think not unhelpful analogy: 'understanding' is a natural-language concept that is our way to describe what's happening in our heads, and like most other such concepts it is resistant to any clear definition and will exhibit sorites-type paradoxes when one is attempted. It belongs to the presentation layer of the stack. Meanwhile the process of curve fitting, however it is implemented, with whatever NN structure (like transformers) or maybe something else entirely, belongs to the physical layer of the stack -- akin to frequency modulation.
While I am unsure whether LLMs are really understanding, whatever that means, I think it is not difficult to believe that any form of understanding we implement will involve 'curve fitting' as a central part.
This seems like it's confusing how we conceptualize the training/learning process with what the system is actually doing. We conceptualize tuning parameters as curve fitting, and we conceptualize predicting the next token as maximizing probability. But that doesn't mean there is anything like curve fitting or probability maxxing happening as the system's parameters converge.
The core feature of curve fitting is learning explicit examples and then interpolating (in an uninformative manner) between unlearned examples. But there's no reason to think this completely describes what the system is doing, in the sense that there are no more informative descriptions of its behavior. Take an example that LLMs are surprisingly good at, creating poetry given arbitrary constraints. Imagine the ratio of the poems it has seen during its training over the number of unique poems it could create in principle. This number would be vanishingly small. Interpolating between two strings representing well-formed poems in an uninformative manner (i.e. some finite polynomial) will not generate well-formed poems. The only way you could move between two examples of well-formed poems while staying on the manifold of well-formed poems is if you captured all relevant features of the manifold. But I fail to see a difference between capturing all relevant features of the poetry-manifold and understanding poetry.
What LLMs do can be described as curve fitting only in the most uninformative description possible. What they do is discover features of the structures referred to by the training text and competently deploy these features in predicting the next token. A human that could do this would be considered to understand said structure.
It seems like a hack to be honest. Problem at hand is not to make transformers do addition of 100 digit numbers. Problem is the current systems can’t reason about things, math included.
Optimizing for a certain use case is not gonna take us where we wanna be. We want to have a system that can learn to reason.
> Problem is the current systems can’t reason about things
Sounds like the AGI argument trap: They're not able to reason, but we can't succinctly define what it is.
I don't come with a reasoning chip. Whatever I call reasoning happens as a byproduct of my neural process.
I do think that the combination of a transformer network and calls to customized reasoning chips (systems that search and deduce answers, like Wolfram Alpha or logic/proof systems) may be a short-stop to something that can perform reason and execution of actions better than humans, but is not AGI.
> They're not able to reason, but we can't succinctly define what it is.
For transformer-based LLMs, and most LLMs there's an obvious class of problems that they cannot solve. LLMs generally perform bounded computation per token, so they cannot reason about computational problems that are more than linearly complex, for a sufficiently large input instance. If you have a back-and-forth (many shot) your LLM can possibly utilize the context as state to solve harder problems, up to the context window, of course.
Humans can realise they don’t understand something and seek more knowledge to learn to understand it. But also humans can build complex structures out of simple fundamentals: The same logic of counting up beans on a table can be extrapolated to multiplying that table of beans. And then counting horses the same way you count beans but give them a value of multiple beans. And then simplify that by trading in promises of beans in trade of horses.
The fact that so many people can’t see the fundamental differences of an LLM and human intelligence reminds me of back when the very early computer scientists thought they could model the entirety of nature by reducing every “component” to a numeric value and compute it as “transfer of energy”.
Quite literally they did the same thing: They had a new toy (very advanced computation machines) and forced all of nature to “fit” within it. It also ended in failure, obviously. Not because nature or ecosystems (as it was coined) are “magic” but because grossly oversimplifying reality to fit desired models is a fool’s errand.
We’ll have to wait and see how far multi-modal training takes us. Text-only models are extremely limited by the kind of information we can encode as text and the loss of detail, e.g. the word “cat” vs an image of a cat vs video of a cat vs direct physical interaction with a cat vs being a mammal that shares a great deal of biology with a cat. You need a table and beans before you can invent a method for counting them.
> LLMs generally perform bounded computation per token, so they cannot reason about computational problems that are more than linearly complex, for a sufficiently large input instance.
I can’t judge if this is true, because I don’t know transformers well, but if it is, it unravels an intuitive thought I’ve never been able to articulate about not only LLMs, but possibly all pattern matching and the human analog of System 1 thinking.
Another fuzzy way of saying this is there’s something irreducible about complexity that can’t be pattern matched by any bounded heuristic – that it’s wishful thinking to assume historical data contains hidden higher-level patterns that unlock magical shortcuts to novel problems.
There is a distinction. Humans with the use of an unbounded scratchpad can emulate a general-purpose Turing machine and perform general computation given unbounded time. A LLM is still restricted to its context window which is a comparatively extreme limitation of memory. In comparison, our general-purpose computers have so much memory this isn't something we care about for most practical instances of hard problems that we solve with a classical CS algorithm. You can obviously modify LLMs to perform unbounded computation per token (and furnish it with a scratchpad) but afaict commercial LLMs today don't offer that.
>They're not able to reason, but we can't succinctly define what it is.
People also routinely fail to reason; even programmers often write "obvious" logic bugs they don't notice until they produce an unexpected result, at which point it's obvious to them. So both humans and AI don't always reason. But humans reason much better.
I myself have observed ChatGPT 4 solving novel problems I invented, well enough to my personal satisfaction to say that it seems to have a rudimentary ability to sometimes do things we would typically call reasoning, but only at the level of a child. The issue isn't that it is supposed to reason perfectly or that humans reason perfectly, the issue is that it doesn't reason well enough to succeed at completing many kinds of tasks we would like it to succeed at. Please note that nobody expects it to reason perfectly. "Prove Fermat's last theorem in a rigorous way. Produce a proof that can be checked by Coq, Isabelle, Mizar, or HOL in a format supported directly by any of them" is arguably a request that includes nothing but reasoning and writing code. But we would not expect even Wiles to be able to complete it, and Wiles has actually proved Fermat's last theorem.
So we have an idea of reasoning as completing certain types of tasks successfully, and today humans can do it and AI can't.
The issue is that humans can see its answer is wrong and its "reasoning" is wrong.
The issue isn't that it never reasons correctly. It's that it doesn't do so often enough or well enough, and it doesn't complete tasks we expect humans to complete, and it doesn't always notice when it is printing something outrageously wrong and illogical.
It notices sometimes, it engages in elementary rudimentary guesswork sometimes, but just not often enough or well enough.
> The issue is that humans can see its answer is wrong and its "reasoning" is wrong.
I've noticed with LLMs that they're more likely to come to the wrong conclusion if you prime them in that manner. In this case, you posed the follow-up question as "Will <incorrect conclusion> always be true?" As a result, it's primed to try to prove that incorrect conclusion.
(That said, ChatGPT further did not answer the posed question, as it also changed "difference" -> "absolute difference"; in fact, the difference will alternate between increasing and decreasing, while the absolute difference is strictly increasing.)
I suppose it's a question whether what we call "reasoning" is an emergent phenomenon from having enough connections in a graph, or whether it's some other special sauce which we simply don't have in our current models yet. E.g. humans follow a deductive process to answer questions which they haven't encountered yet. Do we gain this ability purely from a denser/larger graph of knowledge, or from a completely different architecture?
I think until we know the answer to this, we can't make predictions about how to build true AGI.
> E.g. humans follow a deductive process to answer questions which they haven't encountered yet.
Rarely, actually.
More generally, humans use all kinds of inferences, where the problem at hand is intertwined with all the other attention points occupying the mental load of the person. Giving a topic full mental attention and finding a path through pure deduction about a circumscribed subject is a rarity, even if you consider only those situations that require any conscious attention at all to perform some action before moving on.
If there is one space where it shines, surely it's mathematics. But even there, the most notable mathematicians rely heavily on intuitions long before they manage to prove anything, as well as while selecting/creating the conceptual tools with which to attempt to build the proof, and rarely go to the point of formalizing their points through Coq/Isabelle or even with meticulous paper craft à la Principia Mathematica by Russell and Whitehead.
All of our deductive reasoning is founded in induction. For example, the basis of all arithmetic is physics analogies regarding things that exist and the understanding that a thing implies another thing is not based in deduction. Similarly, I suspect from my own experience that general reasoning requires a basic understanding of physics if its origin isn't something ineffable. The ability to connect and find implications cannot itself be purely deductive and it would seem to me that an understanding of physical reality would have to be the origin for that ability.
I must be in the minority here, but I don't think most people exercise any reason. I'd even venture that the vast majority of people haven't reasoned recently at all. In my mind, reasoning is an ability... a willful act to engage in thinking through an abstract problem. Most people don't do this and just use rationalization and learned behavior, which our brains are good at.
Well, 99% of day-to-day life is mundane for most living beings on Earth. A bee is able to get through its entire life without showing signs that it deeply ponders anything.
However, humans have the ability to reason about things (whether most people use this ability is a different question). So then we must ask the question: is this ability just a more advanced form of probabilistic pattern matching, or is it a different architecture altogether? Will current AI models be able to develop this ability, or will we need new models?
I think for the most part that's true, but obviously there are things people want to use LLMs for that do require planning/reasoning, and it makes for unexpected failure modes if LLMs don't have this ability.
> humans follow a deductive process to answer questions which they haven't encountered yet
Nope. Most humans fall into various traps such as pattern recognition, confirmation bias, and many others instead of relying on deductive analysis. Even scientists fail at being rigorous.
Of course there are cases like this, nobody is perfect. But we are talking about mathematics here, not everyday subconscious decision making. I agree that 99% of daily life is trivial pattern recognition. That's not what distinguishes humans though is it? Because animals, down to single celled organisms do just fine without higher order mental capabilities. But we are talking about reasoning here - and specifically about structured one like math.
I disagree that daily life is "trivial pattern recognition".
Just our visual object recognition is immensely powerful and far beyond any current AI. A simple task like walking to the fridge requires a ton of pattern recognition and spatial reasoning. Recognizing people's moods/predicting behaviors is also incredibly involved imo.
I've said this many times, but perhaps we should focus on achieving dog-level intelligence first before we start worrying about human-level AGI.
Oh I'm very much with you. In fact I get irked by people here breathlessly parroting that human level AGI is upon us any day now. I'd be impressed if an AI had mouse level capabilities any time soon. I think the current models are very impressive, but they are parlor tricks compared to what a true AGI should be capable of.
This is such a strawman. Do you really have to stoop to this level? There are a billion useless things people pay for; is that a measure of the intelligence behind it? People routinely pay $1,000 for a dog, does that mean a dog is 50x more intelligent than ChatGPT? All I'm saying is that we should be a bit more humble about intelligence when we understand so little about it.
Just because LLMs are useful, it doesn't mean they exhibit more intelligence than a mouse. A mouse probably also doesn't reason about anything, but it is an agent capable of independent behavior, something that is still very far removed from current AI models.
>All I'm saying is that we should be a bit more humble about intelligence when we understand so little about it.
OK, as long as we're being humble, how about we refrain from confidently proclaiming that there is a mouse level and a dog level that AI hasn't reached yet, and that researchers will have to spend a long time getting past, so there's plenty of time before we have to worry about the possibility of AIs becoming dangerous or transformative to society?
> Just our visual object recognition is immensely powerful and far beyond any current AI.
That's a point you'll likely have to revisit pretty soon. Radiology, for instance, probably won't exist as a profession 20-30 years from now. Captchas are already pretty much done for.
Well, 1. radiology is an insanely niche subject not indicative of general intelligence, and 2. AI being good at radiology isn't about object recognition or spatial reasoning, it's data analysis connecting features to outcomes.
Lastly, check out the ARC challenge or any other spatial reasoning tests for AI. Humans get ~80% on these challenges whereas the best AI is still at 25%
It seems there are multiple things by the name ARC. There is one by AI2, which is a set of text-based science questions/word problems. The one I'm referring to is this: https://lab42.global/arc/
As to the study, I have the same objection as the radiology one. This isn't about object recognition, and certainly not spatial reasoning; it's the ability to predict cancer based on the presence of visual features.
The "object recognition" part of this is super simple. Its a single, mostly 2D object in more or less the same angle, and the AI is trained on detecting just this.
The "object recognition" part of this is super simple. Its a single, mostly 2D object in more or less the same angle, and the AI is trained on detecting just this.
As I understand it, conceptually they just changed 346 + 23 = ? to (1: 3, 2: 4, 3: 6) + (1: 2, 2: 3) = ?
So it is not that much of a specific hack. There could be a broader principle here where something is holding transformers back in a general fashion, and we might be able to improve on the architecture!
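A tiny sketch of that reading, purely for illustration (the indexing here counts from the left to match the (1: 3, 2: 4, 3: 6) example above; the actual Abacus embeddings in the paper handle the digit indexing and how it is injected into the model differently):

    def annotate_digits(s):
        # "346 + 23" -> each digit paired with its position inside its own number
        out, i = [], 0
        while i < len(s):
            if s[i].isdigit():
                j = i
                while j < len(s) and s[j].isdigit():
                    j += 1
                out.extend((k - i + 1, s[k]) for k in range(i, j))
                i = j
            else:
                out.append(s[i])
                i += 1
        return out

    print(annotate_digits("346 + 23"))
    # [(1, '3'), (2, '4'), (3, '6'), ' ', '+', ' ', (1, '2'), (2, '3')]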
how do you argue that these models are not able to reason?
deductive reasoning is just drawing specific conclusions from general patterns, something I would argue these models can do (of course not always, and they are still pretty bad in most cases)
the point I'm trying to make is that sometimes reasoning is overrated and put at the top of the cognitive ladder; sometimes I have seen it compared to self-awareness or stuff like that. I know that you are probably not saying it in this way, just wanted to let it out.
I believe there is fundamental work still to be done, maybe models that are able to draw patterns by comparing experiences, but this kind of work can be useful as it makes us reflect on every step of what these models do, and on how much the learned internal representation can be optimized
We have no definition of reasoning that is sufficiently precise to be useful.
But we do have a bunch of benchmark tasks/datasets that test what we intuitively understand to be aspects of reasoning.
For AI models, "being able to reason" means "performing well on these benchmarks tasks/datasets".
Over time, we'll add more benchmarking tasks and datasets that ostensibly test aspects of "reasoning", and people will develop models that succeed on more and more of these simultaneously.
And these models will become more and more useful. And people will still argue over whether they are truly "reasoning".
The fundamental argument of "Artificial Intelligence, Natural Stupidity" is that AI researchers constantly abuse terms like "reasoning," "deduction," "understanding," and so on, deluding others and themselves that their machine is almost as intelligent as a human when it's clearly dumber than a dog. My cats don't need "general patterns" to form deductions, they deduce many sophisticated things (on their terms) with n=1 data points.
In the 80s the computers were indisputably dumber than ants. That's probably not true these days. But the decades-long refusal of most AI researchers to accept humility about the limitations of their knowledge (now they describe multiple-choice science trivia as "graduate level reasoning") suggests to me that none of us will live to see an AI that's smarter than a mouse. There's just too much money and ideology, and too little falsifiability.
Drew McDermott's warning is well-heeded, but there are established and well-understood definitions of deductive, inductive and abductive reasoning that go back to at least Charles Sanders Peirce (philosopher and pioneer of predicate logic, contemporary of Gottlob Frege), that are widely accepted in AI research, and that even McDermott would have accepted. See sig for intro.
This is completely irrelevant. McDermott's point was that scientifically plausible definitions of reasoning were not actually being used in practice by AI researchers when they made claims about their systems. That is just as true today.
I've read McDermott's paper a few times (it's a favourite of mine) and I don't remember that angle. Can you please clarify why you say that's his point?
Ants behave in ways that a modern computer still can't imitate. I don't think that generalized intelligence is possible but if it is it would need a different starting point than our current computing hardware. Even insects are flexible in ways that computers aren't.
> Deductive reasoning is the process of drawing valid inferences. An inference is valid if its conclusion follows logically from its premises, meaning that it is impossible for the premises to be true and the conclusion to be false.
> how do you argue that these models are not able to reason?
They just don't have the right architecture to support it.
An LLM is just a fixed size stack of N transformer layers, and has no working memory other than the temporary activations between layers. There are always exactly N steps of "logic" (embedding transformation) put into each word output.
You can use prompts like "think step by step" to try to work around these limitations so that a complex problem can (with good planning by the model) be broken down into M steps of N layers, and the model's own output in early steps acts as pseudo-memory for later steps, but this only gets you so far. It provides a workaround for the fixed N layers and memory, but creates critical dependency on ability to plan and maintain coherency while manipulating long contexts, which are both observed weaknesses of LLMs.
Human reasoning/planning isn't a linear process of N steps - in the general case it's more like an iterative/explorative process of what-if prediction/deduction, backtracking etc, requiring working memory and focus on the task. There's a lot more to the architecture of our brain than a stack of layers - a transformer is just not up to the job, nor was built for it.
It is not «deductive reasoning»: it is just "reasoning". That is, revising a body of ideas for qualities pertinent to truthfulness (alethic value) and understanding (completeness).
It is critical thinking, continuous cycles of reprocessing.
And this cannot be overrated: it is the core activity.
There is a difference between poor reasoning and no reasoning. SOTA LLMs answer a significant number of these questions correctly. The likelihood of doing so without reasoning is astronomically small.
Reasoning in general is not a binary or global property. You aren't surprised when high-schoolers don't, after having learned how to draw 2D shapes, immediately go on to draw 200D hypercubes.
Granting that, the original point was that they're not excited about this particular paper unless (for example) it improves the networks' general reasoning abilities.
The problem was never "my llm can't do addition" - it can write python code!
The problem is "my llm can't solve hard problems that require reasoning"
>deductive reasoning is just drawing specific conclusion from general patterns. something I would argue this models can do
That the models can't see a corpus of 1-5 digit addition then generalise that out to n-digit addition is an indicator that their reasoning capacities are very poor and inefficient.
Young children take a single textbook and a couple of days' worth of tuition to achieve a generalised understanding of addition. Models train for the equivalent of hundreds of years, across (nearly) the totality of human achievement in mathematics, and struggle with 10-digit addition.
This is not suggestive of an underlying capacity to draw conclusions from general patterns.
I think the “train for hundreds of years” argument is misleading. It's based off of parallel compute time and how long it would take to run the same training sequentially on a single GPU. This assumes an equivalence with human thought based on the tokens-per-second rate of the model, which is a bad measurement because it varies depending on hardware. The closest comparison you could draw to what a human brain is doing would be either the act of writing or speaking, but we obviously process a lot more information, and produce a higher volume of information, at a much higher rate than we can speak or write. Imagine if you had to verbally direct each motion of your body; it would take an absurd amount of time to do anything, depending on the specificity you had to work with.
The work done in this paper is very interesting, and your dismissal of “it can’t see a corpus and then generalize to n digits” is not called for. They are training models from scratch in 24 hours per model using only 20 million samples. It’s hard to equate that to an activity a single human could do. It’s as though you had piles of accounting ledgers filled with sums and no other information or knowledge of mathematics, numbers or the world, and you discovered how to do addition based on that information alone. There is no textbook or tutor helping them do this either, it should be noted.
There is a form of generalization if it can derive an algorithm based on a maximum length of 20 digit operands that also works for 120 digits. Is it the same algorithm we use by limiting ourselves to adding two digits at a time? Probably not but it may emulate some of what we are doing.
>There is no textbook or tutor helping them do this either it should be noted.
For this particular paper there isn't, but all of the large frontier models do have textbooks (we can assume they have almost all modern textbooks). They also have formal proofs of addition in Principia Mathematica, alongside nearly every math paper ever produced. And still, they demonstrate an incapacity to deal with relatively trivial addition - even though they can give you a step-by-step breakdown of how to correctly perform that addition with the columnar-addition approach. This juxtaposition seems transparently at odds with the idea of an underlying understanding & deductive reasoning in this context.
>There is a form of generalization if it can derive an algorithm based on a maximum length of 20 digit operands that also works for 120 digits. Is it the same algorithm we use by limiting ourselves to adding two digits at a time? Probably not but it may emulate some of what we are doing.
The paper is technically interesting, but I think it's reasonable to definitively conclude the model has not learned an algorithm that is remotely as effective as columnar addition. If it had, it would be able to perform addition on n-digit integers. Instead it has produced the relatively predictable result that, when given lots of domain-specific problems, transformers get better at approximating the results of those domain-specific problems, and that when faced with problems significantly beyond their training data, their accuracy degrades.
That's not a useless result. But it's not the deductive reasoning that was being discussed in the thread - at least if you add the (relatively uncontroversial) caveat that deductive reasoning should lead to correct conclusion.
As humanity, we're building a reasoning machine from the bottom up. It can't reason... yet. Expecting a magical switch that will make it reason about anything and everything is unreasonable. Starting with arithmetic makes perfect sense.
I didn’t test with all the LLMs out there, but all of those I tested failed on something as basic as "What is the number of words in the sentence coming before the next one? Please answer."
In my experience, LLMs tend to perform better if you give them instructions before the data to be operated on. At least for the ~13b size models.
So, something like: Please count the number of words in the following sentence. "What is the number of words in the sentence coming before the next one?"
edit: Which might be an artifact of the training data always being in that kind of format.
For things like this where we have computationally cheap, well understood, reliable tools available (aka calculator) it seems better to train the model in tool use.
I guess perhaps the techniques could be generalized though?
Generalizable techniques are mostly the point of papers like this one, yes. What they show here is that apparently fundamental problems with transformer reasoning can be fixed by encoding data in a more sophisticated manner. This is exciting. I've been thinking for a long time that the tokenization schemes are a low-hanging fruit for improving coding LLM performance; this isn't exactly the same thing, but it's in the same general area. Smartness and reasoning ability with the current set of algorithmic techniques seems to have topped out around GPT-4 level, which implies that further leaps in mental abilities must come from improving other things beyond training set size.
For example, whilst replacing the need for a calculator isn't very important, one obvious research direction would be to explore adding extra embeddings to code inputs, perhaps that are being computed by an IDE.
It seems sub-word tokenization vs using character inputs is just a trade off to gain computational efficiency, and obviously isn't how our brain works. We're not born with a fixed visual tokenization scheme - we learn to create our own groupings and object representations.
However, transformers seem to struggle a bit with accurately manipulating sequences, so going to character inputs and hoping for those to be aggregated into words/numbers/etc might cause more problems than it solves?
I have to wonder if these models would not be better off learning whole-word embeddings rather than tokens. You'd have thought they would learn embeddings that encode any useful relatedness (e.g. corresponding to common prefixes) between words. Perhaps numbers would be better off input as a sequence of individual digit embeddings.
Yeah a tiny vocab of characters doesn't work that well, it was tried very early on and creating large vocabs of tokens was a big improvement. Which makes sense. A lot of tokens are full words and so the token->embedding phase can quickly look up an embedding in vector space that contains a lot of meaning, whereas an embedding of 'z' or whatever is going to be meaningless.
I guess this extends to numbers split across multiple tokens too (especially in the somewhat odd way the OpenAI tokenizer does it). The model is having to work really hard to learn what a given sequence of number chunks means (e.g. chunks '123' '45' vs '123' '4'). It somehow needs to realize that the embedding for '4' represents a single-digit number, but the embedding for '45' represents a two-digit number, and this then correspondingly changes the meaning of the preceding '123' token!
It would have made it easier for the model to grok numbers if, similar to the proposed alternative, 1234 was tokenized as '1000' '200' '30' '4' for powers of 10 up to some reasonable limit (then maybe '1^' '2^' after this reasonable limit). This would let the model easily grok human-sized numbers and need to work harder to grok, say, 20-digit ones, just the same as we do. Some early curriculum training, while not necessary, could then help it to quickly learn which embeddings represent numbers which are d * 10^1 vs d * 10^2, etc.
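A quick sketch of that place-value tokenisation idea (the '^' fallback markers and the zero-dropping are my own guesses at the details, not anything from the paper):

    def place_value_tokens(n, max_digits=6):
        # 1234 -> ['1000', '200', '30', '4']; digits above the limit become '<d>^<power>'
        digits = str(n)
        tokens = []
        for i, d in enumerate(digits):
            power = len(digits) - 1 - i
            if d == "0":
                continue                      # zero place values add nothing to the sum
            if power < max_digits:
                tokens.append(d + "0" * power)
            else:
                tokens.append(f"{d}^{power}")
        return tokens

    print(place_value_tokens(1234))       # ['1000', '200', '30', '4']
    print(place_value_tokens(70000123))   # ['7^7', '100', '20', '3']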
That's sort of what this paper is doing. They add positional embeddings so the model can understand the positions of the digits inside the numbers better.
I think this is more a matter of how numbers are input and lack of specific training, including visual training.
For example, the number 12,345,678 is input to ChatGPT as the three tokens "123" "456" "78", which isn't the best place to start to learn that this is an 8 digit number with specific digit positions!
As a human child you learn about numbers largely visually by pointing to units, tens, hundreds etc, visually aligning them to add, etc. Maybe a multi-modal model, if it was visually trained on chalkboard primary school math, would do better in learning the concept of position based powers of 10, etc.
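If anyone wants to check how the chunking actually comes out for a given string, the tiktoken package makes that a one-liner (the exact split depends on the encoding and on whether the commas are present, so treat the three-token example above as roughly right rather than gospel):

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for s in ["12345678", "12,345,678"]:
        ids = enc.encode(s)
        print(s, "->", [enc.decode([i]) for i in ids])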
I'd say the key point here isn't that they "need" specialised embeddings, but rather that it improves things and they can somewhat manage without.
That's a far more surmountable problem. Maybe you need one model for biology and another for coding, etc., i.e. a broad split by domain. Still weak AI, not truly general in the AGI sense, but it still seems like a good next step.
I think understanding mathematics is what LLMs really need at the moment, far more important than video generation, which is just another form of CGI [1]. After deep learning and transformers, understanding mathematics and its proofs, not just arithmetic, will be the next game changer for LLMs and a turning point for humanity.
[1] Why LLMs like ChatGPT and Google Bard are bad at math:
> understanding mathematics and its proofs not just arithmetic will be the next game changer for LLM
Why?
I definitely agree that such capabilities would represent a major advance (and very likely go together with game changing increases of capabilities in other areas). I also think using AI to write formal math proofs in e.g. Lean is very cool.
However, by itself, it seems like this capability wouldn't be very useful, commercially for example. Do you think this capability is exceptionally informative merely because it has to go together with other capabilities? It's not impossible to have a (maybe somewhat limited) formal math AI that will remain mostly irrelevant to the everyday world (like FormalGeo).
Understanding mathematics basically means understanding higher-level reasoning. If an AI were able to actually do this, plus the ability to generate and interpret language that LLMs already show, it would seem to be 90% or more of the way to AGI.
> However, by itself, it seems like this capability wouldn't be very useful, commercially for example.
Quite the opposite, it's the holy grail of all AI.
Consider the various work that isn't (and can't be) done by computers/robots/etc. right now.
The constraint is universally intelligence: a required amount of problem solving. Even "low skill" labour requires it.
And to perform such problem solving, you need advanced logic and reasoning capabilities, which is the same thing as novel mathematics, just applied to a different end.
Let's be a little more concrete: do you think FormalGeo [1] is a big deal? I think it's very cool but ultimately not useful in and of itself. It's only useful insofar as it shows AI capabilities advancing in general.
Let's suppose we had an AI that works roughly like [1] but for the kind of mathematics done in Lean's Mathlib, and that was on par or better than humans working on it. Would that AI by itself be commercially useful?
Again, of course having such an AI implies a major jump in capabilities and it would most likely mean useful AI can be trained with similar techniques. But that's not what I mean by the system itself being useful. If all you're saying is that such an AI demonstates we can now probably build AIs that do things which we usually say require "logic and reasoning abilities", I completely agree.
Maybe I'm splitting hairs too much here. However, it could well be that such an AI would be useful by itself. I just can't think of much besides a major advance in the formal software verification niche, which still almost nobody would use...
> I just can't think of much besides a major advance in the formal software verification niche, which still almost nobody would use...
The reason is slightly different here.
What's so desirable here is an AI system with such general intelligence that it is capable of such mathematics by itself as a consequence. Not because the mathematics is so useful, but because the required reasoning capabilities are at such a level that, we could speak of an artificial intelligence that is meaningfully "general" about any problem.
It's a decent approximation of "able to solve any problem" that we can still reasonably test.
> Let's be a little more concrete: do you think FormalGeo [1] is a big deal?
It looks to be an interesting approach in modelling mathematics, and their use of machine learning is an interesting novelty that may pave the way to more useful general mathematics systems, but I can't find much about how these systems might interop with current/'generative' AI systems.
And that last bit is one of the big roadblocks for current AI. They're very weak at reasoning, but we can't directly interop to (e.g.) LLMs, so we can't compensate for that weakness.
Something I've been thinking about is how the Minds -- the super-human AI hyper-computers that fly the ships in the Culture series of novels -- are described. The image built up in my head[1] is that they're hybrids blending neural networks and regular compute substrates. They can calculate, simulate, and reason in combination.
There have been crude attempts at this already, hooking Mathematica and Python into ChatGPT. I say crude, because these add-ons are controlled via output tokens.
What I would like to see is a GPT-style AI that also has compute blocks, not just transformer blocks. I don't mean compute in the sense of "matrix multiply for weights and biases", but literally an ALU-style block of basic maths operations available for use by the neurons.
One thought that I had was that this could be via activations that have both a floating-point activation value and "baggage" such as a numerical value from the input. Like a token in a traditional parser, that can represent a constant string or an integer with its decoded value.
The newer, truly multi-modal models gave me a related idea: Just like how they can have "image" tokens and "audio" tokens, I wonder if they could be given "numeric data" tokens or "math symbol" tokens. Not in the same way that they're given mixed-language text tokens, but dedicated tokens that are fed into both the transformer blocks and also into ALU blocks.
Just an idle thought...
[1] Every reader reads into a story something unique, which may or may not align with what the author intended. This is my understanding, coloured by my own knowledge, etc, etc...
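To make the idle thought slightly more concrete, here is a toy forward-pass sketch of an "ALU block" that operates on a numeric side channel carried alongside the hidden states. Everything here is invented -- the gating, the projection, and especially how such a block would be trained are exactly the open questions.

    import numpy as np

    class ToyALUBlock:
        # Each token position carries an optional numeric value ("baggage") next to
        # its hidden vector. The block applies a few hard-wired ops to the last two
        # values and mixes the result back into the hidden states via learned gates.
        def __init__(self, d_model, seed=0):
            rng = np.random.default_rng(seed)
            self.ops = [lambda a, b: a + b,
                        lambda a, b: a - b,
                        lambda a, b: a * b,
                        lambda a, b: a / b if b else 0.0]
            self.gate = rng.normal(size=len(self.ops))        # would be learned
            self.proj = rng.normal(size=(1, d_model)) * 0.01  # scalar -> hidden space

        def __call__(self, hidden, values):
            nums = [v for v in values if v is not None]
            if len(nums) < 2:
                return hidden
            a, b = nums[-2], nums[-1]
            candidates = np.array([op(a, b) for op in self.ops])
            weights = np.exp(self.gate) / np.exp(self.gate).sum()  # softmax over ops
            result = float(weights @ candidates)
            return hidden + result * self.proj   # broadcast into every position

    hidden = np.zeros((5, 16))
    values = [None, 123.0, None, 456.0, None]      # e.g. decoded from digit tokens
    print(ToyALUBlock(16)(hidden, values).shape)   # (5, 16)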
The problem, if you embed an ALU like that, is how to train it to use them properly. And then it's not clear if they actually need to be able to do that in the middle of a pass that, at the end, is going to produce a single token anyway.
Controlling that stuff via output tokens actually kinda makes sense by analogy, since that is how we use calculators etc. But I do agree that specialized tokens that are used specifically to activate tools like that might be a better idea than just using plain text to signal in-band. And production of such specialized tokens can be easily trained.
I like this idea a lot. Right now we are going the long/hard way round: post-training, asking an LLM to recognise that it needs compute, then write a compute request, then feed the compute answer back into the tokenization loop.
It probably does make sense to add a mini CPU as a layer / tool / math primitive. I wonder how you'd train it to use such a thing? In my mind it's not really a layer per se, but a set of function calls a layer could route to when it wants, weighting the response appropriately.
I just wonder: if numbers were written right to left, would LLMs be much better at arithmetic? You can 'predict' the least significant digit by reusing the already written digits in the computation, but to generate the most significant ones, you generally need to do the entire computation in one go.
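A minimal sketch of why that ordering helps: emitting the least significant digit first, each output digit depends only on one column of inputs plus a carry bit, so the per-token "work" stays bounded, whereas emitting the most significant digit first requires knowing essentially the whole sum before the first digit.

    def add_lsd_first(a, b):
        # Emit the digits of a + b least-significant-first, one per step.
        a, b = a[::-1], b[::-1]          # read the inputs from the right
        carry, out = 0, []
        for i in range(max(len(a), len(b))):
            s = carry
            s += int(a[i]) if i < len(a) else 0
            s += int(b[i]) if i < len(b) else 0
            out.append(str(s % 10))      # this column's digit, available immediately
            carry = s // 10
        if carry:
            out.append(str(carry))
        return "".join(out)              # digits in reversed order

    print(add_lsd_first("346", "23"))    # '963', i.e. 369 read back the usual way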
Yes. This has already been demonstrated by "Teaching Arithmetic to Small Transformers" https://arxiv.org/abs/2307.03381 , I'm not sure what OP adds except demonstrating that you can do that via the embedding itself rather than the tokenization.
> We first demonstrate that conventional training data is not the most effective for arithmetic learning, and simple formatting changes can significantly improve accuracy. This leads to sharp phase transitions as a function of training data scale, which, in some cases, can be explained through connections to low-rank matrix completion. Building on prior work, we then train on chain-of-thought style data that includes intermediate step results. Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed. We also study the interplay between arithmetic and text data during training and examine the effects of few-shot prompting, pretraining, and model scale. Additionally, we discuss length generalization challenges.
This is an interesting idea but probably hard to verify.
A tangent is that positional systems were originally invented with the least significant digit first, I believe.
The Babylonian sexagesimal system was like that as was the Arabic one (where first is on the right).
The most-significant-digit-first convention came about when right-to-left numbers were used in left-to-right writing systems without being reversed. To this day we read the more common smaller numbers least significant digit first, to varying degrees.
16 = six teen, sech zehn
98 = acht und neunzig, achten negentig, ثمانية وتسعون
I'm curious about the framing of research like this.. "The poor performance of transformers on arithmetic tasks" (relative to what?) and how that informs the adjacent conversation on progress towards AGI.
Some say AGI has already been achieved, others that it's years or decades away. When I dig into the disagreement, it often partially depends on the perspective of how competent humans are on the tasks in question, with the optimists being, I think, more realistic about variance in human intelligence and the pessimists seeming to reserve the term "general intelligence" for possessing a nearly perfect suite of capabilities that many otherwise intelligent people practically don't have.
For example with arithmetic, this study cites another [Dziri et al. 2023], that says:
"For instance, humans can solve 3-digit by 3-digit multiplication arithmetic after learning basic calculation rules. Yet, off-the-shelf ChatGPT and GPT4 achieve only 55% and 59% accuracies on this task, respectively."
I still see value in normative statements about human capability in AI & AGI research, but I think we'll need to move towards explicit statistical framing.
DeepMind's Position paper "Levels of AGI for Operationalizing Progress on the Path to AGI" has a schema like this, where AGI capabilities are defined across 2 axes of Performance level X Generality (narrow vs general), and the Performance levels are measured by comparison with "Percentile of skilled adults" able to perform the task.. https://arxiv.org/pdf/2311.02462#page=3.40
Within that framing, this paper's title or result might be "Achieving AGI Competency in Arithmetic", or "Expertise", or "Virtuosity", i.e. on par respectively with 50th, 90th or 99th percentile of skilled adults.
Exactly, we need a much more granular approach to evaluating intelligence and generality. Our current conception of intelligence largely works because humans share evolutionary history and partake in the same 10+ years of standardized training. As such, many dimensions of our intelligence correlate quite a bit, and you can likely infer a person's "general" proficiency or education by checking only a subset of those dimensions. If someone can't do arithmetic then it's very unlikely that they'll be able to compute integrals.
LLMs don't share that property, though. Their distribution of proficiency over various dimensions and subfields is highly variable and only slightly correlated. Therefore, it makes no sense to infer the ability or inability to perform some magically global type of reasoning or generalization from just a subset of tasks, the way we do for humans.
Agreed on the first part, but as for LLMs not having correlated capabilities, I think we've seen they do. As the GPTs progress, mainly by model size, their scores across a battery of tests go up, e.g. OpenAI's GPT-4 paper, showing a leap in performance across a couple dozen tests.
Also found this, a Mensa test across the top dozen frontier models.
AGI is like consciousness, 75% of the people in any given conversation are talking about different things.
Truthfully, we're going to see that improving language models towards AGI works out the same way self-driving cars did - we're going to feel like we're 85% of the way there out of the gate, then we're going to keep tripping over things for the next 15 years.
At least with AGI, we can just throw up our hands, use an easier definition and take the W.
I don't understand the framing of your comment. You act like the LLM's feelings are going to be hurt if you say it isn't a real AGI. "Well, you can't do basic math expected of fifth graders, but there are dumb fifth graders too, so here's the 'human-level intelligence' participation trophy anyway."
The issue that separates "AGI" from current AI systems is the lack of generality. (Humour me.)
In particular, the lack of reasoning capability. And what the pessimists argue here is that there is no road to get there for current systems. Transformers are approximation machines, and are generalized for that specific task. But that's also where it stops, they can't do things that aren't such pattern-approximation.
Optimizing a transformer for arithmetic isn't a step towards AGI, because it is not generalizing. You'd need to do this for every conceivable task and subtask. This is the exact reason why imperative-programmed AI architectures were discarded.
Put bluntly, this approach will never get you a transformer that won't shit itself when asked to do novel reasoning tasks, such as novel mathematics. (Which, I will remind the reader, anything but the most basic programming work counts as.)
And critically, the fundamental architecture of these transformer systems doesn't allow combining them with other AI systems to acquire generalized capabilities. There's no way to make an LLM hook into a computer algebra system; you can only feed the 'finished' output of one system into another.
The other day I was wondering if LLMs are bad at maths because they don't have readily apparent access to the concept of "columns". Apparently the answer is yes.
Vertical alignment across lines is pretty important for humans to learn operations on digits, but the way we encode lines with a \n separator doesn't really help. In a recent codebullet video gpt really struggled with any kind of vertical alignment task. I wonder if it would do better on a fixed 80 column width...
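To make the "no access to columns" point concrete, here's a toy sketch (mine, not from the paper): vertically aligned digits are nowhere near each other in the flat character stream a model actually sees, and how far apart they are depends on the line width, which is not something a standard positional index exposes.

```python
# Toy illustration (not from the paper): vertically aligned digits end up far
# apart in the flat character stream, and the gap between them depends on the
# line width rather than on any notion of "same column".
problem = " 123\n+456\n----\n 579"

units_top = problem.index("3")      # units digit of 123
units_bottom = problem.index("6")   # units digit of 456
print(units_top, units_bottom, units_bottom - units_top)  # 3 8 5: "same column" = "offset by line width + 1"
```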
Isn't it more that they don't have ready access to the much-more-fundamental concept of decimal numbers?
My understanding was that they tokenized them into chunks and tried to learn associations between the chunks, the same as if one was breaking apart English words.
So "2+2=4" isn't being treated that differently from "all's well that ends well." This might lead to a kind of Benny's Rules [0] situation, where sufficient brute-force can make a collection of overfitted non-arithmetic rules appear to work.
I went through the paper and thought immediately about how did they implement it; I missed they published their code as well. Here is the link for everyone who skimmed past it: https://github.com/mcleish7/arithmetic/tree/main
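For anyone who wants the gist without opening the repo: as I understand the paper, the trick is to give every digit an extra learned embedding indexed by its offset within the number it belongs to (so, with the least-significant-digit-first formatting, digits of the same significance share a signal), added on top of the usual token embedding. A rough sketch of that idea, not the authors' actual implementation:

```python
import torch
import torch.nn as nn

class DigitPositionEmbedding(nn.Module):
    """Rough sketch of the abacus-style idea: each digit token gets an extra
    embedding indexed by its offset inside its own number, added to the usual
    token embedding. This is a paraphrase, not the repo's actual code."""

    def __init__(self, d_model: int, max_digits: int = 128):
        super().__init__()
        self.pos = nn.Embedding(max_digits + 1, d_model)  # index 0 = "not a digit"

    def forward(self, token_emb: torch.Tensor, is_digit: torch.Tensor) -> torch.Tensor:
        # is_digit: (batch, seq) bool mask marking digit tokens.
        offsets = torch.zeros_like(is_digit, dtype=torch.long)
        for b in range(is_digit.shape[0]):          # slow Python loop, fine for a sketch
            run = 0
            for t in range(is_digit.shape[1]):
                run = run + 1 if is_digit[b, t] else 0
                offsets[b, t] = run                  # 1, 2, 3, ... within each number
        offsets = offsets.clamp(max=self.pos.num_embeddings - 1)
        return token_emb + self.pos(offsets)
```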
It's basically the same as feature engineering in pre-deep machine learning: constructing features with high information content can significantly reduce the amount of data and computation needed to fit a useful model. And sometimes it's impossible to fit a useful model without careful feature engineering, either because the model itself is constrained in some way or because there isn't enough data or both.
It's analogous to making a choice of inductive bias within the model itself. We literally could not do LLMs without the carefully-constructed transformer architecture. Why should we expect to make further progress without paying more attention to the embeddings?
Since models are very good at writing very short computer programs, and computer programs are very good at mathematical calculations, would it not be considerably more efficient to train them to recognise a "what is x + y" type problem, and respond with the answer to "write and execute a small javascript program to calculate x + y, then share the result"?
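A toy version of that routing, with a hypothetical `ask_llm` callable standing in for the model (real systems do this via tool use / function calling, with the model deciding when to reach for the calculator):

```python
import re

def answer(question: str, ask_llm) -> str:
    """Route obvious arithmetic to exact evaluation; everything else to the model.
    `ask_llm` is a hypothetical callable standing in for an LLM API call."""
    m = re.fullmatch(r"\s*what is\s+(-?\d+)\s*([+\-*])\s*(-?\d+)\s*\??\s*", question, re.I)
    if m:
        a, op, b = int(m.group(1)), m.group(2), int(m.group(3))
        return str({"+": a + b, "-": a - b, "*": a * b}[op])
    return ask_llm(question)

print(answer("What is 123456789 + 987654321?", ask_llm=lambda q: "(model answer)"))  # 1111111110
```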
From a getting-answers perspective yes, from an understanding-LLMs perspective no. If you read the abstract you can see how this goes beyond arithmetic and helps with long-form reasoning.
But that's not all that relevant to the question "can LLMs do math". People don't really need ChatGPT to replace a calculator. They are interested in whether the LLM has learned higher reasoning skills from its training on language (especially since we know it has "read" more math books than any human could in a lifetime). Responding with a program that reuses the + primitive in JS proves no such thing. Even responding with a description of the addition algorithm doesn't prove that it has "understood" maths, if it can't actually run that algorithm itself - it's essentially looking up a memorized definition. The only real proof is actually having the LLM itself perform the addition (without any special-case logic).
This question is of course relevant only in a research sense, in seeking to understand to what extent and in what ways the LLM is acting as a stochastic parrot vs gaining a type of "understanding", for lack of a better word.
This is a cromulent approach, though it would be far more effective to have the LLM generate computer-algebra-system instructions.
The problem is that it's not particularly useful: As the problem complexity increases, the user will need to be increasingly specific in the prompt, rapidly approaching being fully exact. There's simply no point to it if your prompt has to (basically) spell out the entire program.
And at that point, the user might as well use the backing system directly, and we should just write a convenient input DSL for that.
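To be concrete about what "generate computer-algebra-system instructions" would look like (my own illustration, not from the thread): the model emits an expression string and a CAS such as SymPy evaluates it exactly, so the arithmetic never touches the token stream.

```python
import sympy

# Hypothetical model output: an expression for the CAS, not a computed answer.
model_output = "2**127 - 1"

exact = sympy.sympify(model_output)  # exact integer arithmetic, no rounding
print(exact)                         # 170141183460469231731687303715884105727
print(sympy.isprime(exact))          # True (it's a Mersenne prime)
```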
"Syntax-Aware Transformer Models for Neural Machine Translation" by Yang et al. (2019). This model enhances the transformer architecture with syntax-aware attention mechanisms that consider dependency parse trees.
"Context-Aware Neural Machine Translation Learns Anaphora Resolution" by Bawden et al. (2018). This paper explores integrating context and syntax into neural machine translation models.
I think the main problem is the way we turn raw mathematical symbols or equations into tokens; this suboptimal tokenization may decrease performance.
I think that's far from the only problem.
To me the most obvious problem is that we use right-to-left numbers (think about the order you're writing digits when doing long addition) in a left-to-right language.
Without a special number-flipping step, the transformer is forced to produce the output token by token from left to right, i.e. most significant digit first. Without the ability to store additional internal state, this turns addition into an O(N²) problem purely due to the suboptimal output ordering!
The paper discusses this, and the approach taken in the paper implements a number-flip stage, so numbers are formatted with their least significant figure first.
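Concretely, the reversal is just a data-formatting step; something like this sketch (my approximation, not the paper's exact format):

```python
def format_reversed(a: int, b: int) -> str:
    """Format an addition example least-significant-digit first, so the model
    can emit the answer in the same order it would do the carries.
    A sketch of the formatting idea, not the paper's exact data format."""
    rev = lambda n: str(n)[::-1]
    return f"{rev(a)}+{rev(b)}={rev(a + b)}"

print(format_reversed(867, 255))  # 768+552=2211  (867 + 255 = 1122, written reversed)
```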
What is the point of this work? 99% on 100-digit arithmetic means there's a 0% chance anyone will ever use a Transformer as an ALU or anything of the kind. We already know how to hard-code a (literally) infinitely more accurate addition machine.
And not only addition: all four arithmetic operations. The technique proposed in the article (imposing a strong inductive bias for addition) kind of works for multiplication, but not for subtraction or division (clearly; I can't even find the words in the paper). As a practical way to build a machine to do arithmetic, this is out of the question.
We've known how to mechanise arithmetic since the 1640s, with Blaise Pascal and his Pascaline. What is the point in demonstrating it's possible to reinvent a broken, partial, buggy version of an arithmetic machine if one tries really hard and shoehorns the necessary patterns into a neural net? We've known that for a long time, too (every proof that a neural net can simulate this or that Turing machine if you design the network diagram and set the weights by hand, ever).
So what is the point of this? Transformers are supposed to be the "sparks of AGI" and they can almost do arithmetic if we try very hard to shove it down their heads? Who cares?
Ok, you want the general answer? Consider a discrete-time Markov process with memory length N on a finite state space. Train a transformer with context length N on sample trajectories with SGD. Can you expect the transformer to become a good approximation for the dynamics of the Markov process? More specifically, suppose your Markov process is generated by some algorithm/Turing machine coupled with some random data. Then, can you expect the transformer to learn to emulate the behavior of the underlying Turing machine, even when run on data which was not in the initial distribution?
Another way to phrase it: Given a physical process that generates discrete time series trajectories, can our current transformer + SGD method learn to emulate the underlying physical processes by observing sample trajectories?
This question can be somewhat mathematically stated but it is quite difficult because there are still some words in there where I used common sense. For example mathematically there will always exist weird counterexamples, so you would have to quantify things very carefully. That's very difficult, so experiments are the best we can do right now.
Hence any instance where transformers fail to learn a Markov process is very interesting. Example: addition of random numbers.
Is addition a Markov process? I really don't think so. You can certainly model e.g. integer addition by a Markov process, up to some integer k but addition itself is usually formalised by the Peano axioms, that are not quite Markovian. I guess you can see the relation between S(n) and S(S(n)) as some kind of Markov chain. That's really not a standard view though.
In any case, a complete theory of addition must be correct up to infinity, so you won't get that with any Markov process we can train from data. Although you can learn addition with a simple linear regression, by setting the weights appropriately. That's because the equation of a line already encodes addition and multiplication, and that's basically not very different from what the team in the paper above is trying to do. Meaning: they're trying to hand-code the concept of addition in embeddings. It's not 100% because they're also at the same time trying not to 100% encode it, but it's a hard balance to strike.
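For what it's worth, the linear-regression point is easy to check numerically; this quick sketch (mine) fits ordinary least squares on (a, b) pairs with target a + b and recovers weights of 1 up to floating point, because the target really is linear in the inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 10**6, size=1000)
b = rng.integers(0, 10**6, size=1000)
X = np.stack([a, b], axis=1).astype(float)
y = (a + b).astype(float)

# Ordinary least squares: the fitted weights come out as [1, 1] (up to float error),
# because the target is exactly a linear function of the inputs.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # ~[1. 1.]
```

The catch, of course, is that this only works because the inputs are handed to the model as magnitudes rather than digit strings.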
> With positions resolved, we can study the logical extrapolation ability of transformers
They are interested in how well they can make a neural net logically extrapolate outside its training set, once encoding barriers are removed. They show that in fact even quite small language models can do this successfully once we're not confusing them with bad encodings anymore.
This seems like fundamental work. It was only a few years ago that Google employees were arguing LLMs were nothing more than "stochastic parrots". Well, that take will go down in history as one of the worst takes on AI ever. I don't think anyone seriously believed it by 2024, but the huge and opaque datasets meant people could always argue that maybe a given success wasn't an example of logical reasoning or extrapolation, maybe the model had just seen that specific question before. But this work shows in a controlled environment that the model can learn the principles of addition and extrapolate to much larger numbers. It's not just repeating answers it's seen in its dataset. It should kill off the parrot meme for good.
>> But this work shows in a controlled environment that the model can learn the principles of addition and extrapolate to much larger numbers.
No, because it's given hand-engineered embeddings that act as a strong inductive bias that is specific to addition. It's like addition is programmed right in.
It’s not about arithmetic but about embeddings. The positional embeddings used in transformers are rather simplistic. If they can add this one new capability to transformers by using different embeddings then maybe there are other capabilities that are within reach.
No, because those embeddings only work for addition (very weakly for multiplication and sorting). Imagine needing a specially-crafted bias for every single task. The Deep Learning revolution brought on by Convolutional Neural Nets was supposed to do away with the need to do exactly that.
I think there is a good reason to find low-hanging fruits that pay dividends on these types of tasks, not because solving addition with a transformer is a good idea, but because it could improve performance in other parts of the network. Maybe there are other subsequences that could be annotated in this way? Per paragraph, tokens per word, who knows.
Obviously, the "best" way to do addition on a computer is by doing it exactly.
>> I think there is a good reason to find low-hanging fruits that pay dividends on these types of tasks, not because solving addition with a transformer is a good idea, but because it could improve performance in other parts of the network.
The paper makes this claim but if they could do that, they'd have showed it already: instead their hand-crafted, artisanal embeddings only work well for addition and only weakly for multiplication and sorting, and not at all for other arithmetic operations.
One is that research into the limits of the architecture is useful. Maths has the nice property of being very easy to verify, and you can construct logical processes with it. It's a useful testbed.
Second, there are a lot more places where understanding how to do arithmetic helps, outside of just doing sums on their own.
>What is the point of this work? 99% on 100-digit arithmetic means there's a 0% chance anyone will ever use a Transformer as an ALU or anything of the kind. We already know how to hard-code a (literally) infinitely more accurate addition machine.
Nobody's going to be replacing calculators with transformers sure but many are and will be using transformers to solve problems arithmetic is a necessary component of.
>So what is the point of this? Transformers are supposed to be the "sparks of AGI" and they can almost do arithmetic if we try very hard to shove it down their heads? Who cares?
You don't need to shove anything down for transformers to get arithmetic. Just changing how numbers are tokenized works. But that requires an entire retrain so why not explore other techniques?
And what does any of this have to do with AGI? You know how terrible humans are at arithmetic, right?
Yes, but humans invented arithmetic. And then we invented computers that are much better than us at arithmetic calculations. That's a pattern we can observe all over the place: we're pretty damn good at inventing rich models of complex environments and processes but we're not very good at calculating the results of such models when that requires a lot of computation.
E.g., take chess. Modelling a game of chess as a game tree and searching the game tree by adversarial search is a human invention. Humans are pretty crap at searching a game tree beyond a handful of ply, but we can program a computer to go dozens of ply deep across thousands of branches, and beat any human.
So the challenge for AI is not to get computers to calculate when we know how the calculation is to be performed. The challenge is to get computers to create their own models. And that's a grand, open challenge that is not even close to be solved, certainly not by LLMs. Yann LeCun and Yoshua Bengio have said similar things.
The linked work doesn't move the needle any closer to that; it just shows progress in calculating arithmetic using a transformer, which we already know how to do in a myriad different ways and much more accurately. Hence my criticism of it.
I think most would argue Mathematics is a discipline that is discovered more than invented. That said, this isn't really the point I think.
A few humans invented/discovered arithmetic. Most humans will be born, live and die inventing absolutely nothing, even those with the opportunity and resources to do so.
It doesn't make sense to me that a bar most humans can't reach is the bar for General Intelligence of the Artificial kind. You can't eat your cake and have it.
Don't get me wrong. It's a fine goal to have. Of course we want machines that can invent things and push the frontier of science! It is still, however, a logical fallacy that an inability to do so would disqualify machines from general intelligence when it does not do so for humans.
>The challenge is to get computers to create their own models. And that's a grand, open challenge that is not even close to be solved, certainly not by LLMs.
LLMs have fairly complex models of the world made manifest by the data they're trained on.
>> Most humans will be born, live and die inventing absolutely nothing, even those with the opportunity and resources to do so.
I don't think that's right at all. I like to visit museums. You really get hit in the face with the unending creativity of the human mind and the variety of all that human hands have crafted over thousands of years across hundreds of cultures. I would go as far as to say that the natural state of the human mind is to create new things all the time. And mathematics itself was not created (invented or discovered) by one person, but by many thousands.
In any case, it doesn't matter if one instance of the class of human minds hasn't invented anything, in the same way that it doesn't matter if one car can't do 80mph. It's indisputable that we have the capacity for some novelty, and generality, in our thinking. Maybe not every member of the species will achieve the same things, but the fact is that the species, as a species, has the ability to come up with never-before seen things: art, maths, tech, bad poetry, you name it.
>> Lecun may disagree but some others like Hinton, Ilya and Norvig don't.
I'm with LeCun and Bengio. There's a fair amount of confusion about what a "model" is in that sense: a theory of the world. There's no reason why LLMs should have that. Maybe a transformer architecture could develop a model of the world- but it would have to be trained on, well, the world, first. Sutskever's bet is that a model can be learned from text generated by entities that already have a world model, i.e. us, but LeCun is right in pointing out that a lot of what we know about the world is never transmitted by text or language.
I can see that in my work: I work with planning, right now, where the standard thing is to create a model in some mathematical logic notation, that is at once as powerful as human language and much more precise, and then let a planning agent make decisions according to that model. It's obvious that despite having rich and powerful notations available there is information about the world that we simply don't know how to encode. That information will not be found in text, either.
Sutskever again seems to think that, that kind of information, can somehow be guessed from the text, but that seems like a very tall order, and Transformers don't look like the right architecture. You need something that can learn hidden (latent) variables. Transformers can't do that.
>In any case, it doesn't matter if one instance of the class of human minds hasn't invented anything, in the same way that it doesn't matter if one car can't do 80mph.
It does matter, depending on what claim you're making. We've not reached the upper bound of transformer ability. Until we clearly do, then it very much does matter.
>I'm with LeCun and Bengio. There's a fair amount of confusion about what a "model" is in that sense: a theory of the world. There's no reason why LLMs should have that.
See this is my problem with Lecun's arguments. He usually starts with the premise that it's not possible and works his way from there. If you disagree with the premise then there's very little left. "Well it shouldn't be possible" is not a convincing argument, especially when we really have very little clue on the nature of intelligence.
>Sutskever's bet is that a model can be learned from text generated by entities that already have a world model, i.e. us, but LeCun is right in pointing out that a lot of what we know about the world is never transmitted by text or language.
A lot of the world is transmitted by things humans don't have access to. Wouldn't birds that can naturally sense magnetic fields to intuit direction say humans have no model of the world? Would they be right? Nobody is trained on the world. Everything that exists is trained on small slices of it. A lot of the world is transmitted by text and language. And if push comes to shove, text and language are not the only things you can train a transformer on.
>Sutskever again seems to think that, that kind of information, can somehow be guessed from the text, but that seems like a very tall order,
I don't think this is as tall an order as you believe
>and Transformers don't look like the right architecture. You need something that can learn hidden (latent) variables. Transformers can't do that.
Unless I'm misunderstanding what you mean by hidden variables, it's very clear a transformer is regularly learning not just the sequences themselves but what might produce them.
>> Unless I'm misunderstanding what you mean by hidden variables, it's very clear a transformer is regularly learning not just the sequences themselves but what might produce them.
That's what I mean, but I don't think that's happening regularly, or at all. I don't see where the transformer architecture allows for this. Of course we can claim that any model of a process learned from examples is implicitly modelling the underlying sub-processes; for example, we can claim that a multivariate regression that predicts age at death from demographic data is somehow learning to represent human behaviour. But that's one of those big claims that need big evidence.
On the two works you link to, I know the one on mechanistic interpretability. As the author says:
Epistemic status: I feel pretty confident that I have fully reverse engineered this network, and have enough different lines of evidence that I am confident in how it works.
But I don't feel that confident at all that the author's confidence should instill confidence in myself. A clear, direct proof is needed, although of course we can discuss what a proof even means and how much it is a social construct etc.
The other paper I haven't read. I'm going to bet it's basically data leakage, which is a pervasive problem in deep learning work and is enough to invalidate many big claims about big results. I'll have to read the paper a bit more carefully.
But, again, what is in the transformer architecture that can predict hidden variables?
> What is the point of this work? [...] We already know how to hard-code a (literally) infinitely more accurate addition machine.
There are many situations where it is useful for the LLM to get basic arithmetic right.
For example, if someone asks your LLM to explain this line of code [1], which takes a 28x28 px input image, is the right explanation that 28×28÷4×64=9216? Or is that the wrong explanation?
And being able to get 100-digit arithmetic right 99% of the time might make us feel reassured that the 4-digit arithmetic we need from the model will be right an even higher percentage of the time.
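If the line behind [1] is something like the flatten-then-Linear step in the standard PyTorch MNIST example (an assumption on my part; the link isn't reproduced here), then the right explanation involves the conv and pooling output shapes, and 28×28÷4×64 = 12544 would be the wrong one. It's cheap to check:

```python
import torch
import torch.nn as nn

# Hypothetical architecture in the spirit of the standard PyTorch MNIST example;
# the actual code behind [1] may differ.
x = torch.zeros(1, 1, 28, 28)            # one 28x28 grayscale image
x = nn.Conv2d(1, 32, kernel_size=3)(x)   # 28 -> 26 (3x3 conv, no padding)
x = nn.Conv2d(32, 64, kernel_size=3)(x)  # 26 -> 24
x = nn.MaxPool2d(2)(x)                   # 24 -> 12
print(x.shape)                           # torch.Size([1, 64, 12, 12])
print(64 * 12 * 12)                      # 9216: what the following Linear layer expects
```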
Seriously? They say it right in the introduction. The goal is to learn how to infer algorithmic processes directly from data. Much like how MNIST was used in the early days of NNs, you have to start with small toy problems that are representative of the problem domain. Once you have success with that, you can scale up problem complexity.
General algorithmic capability is one of the key traits that we think AGI should have, and it’s currently missing. If you have a better approach for getting there quicker than everyone else in the field, please share it.
I would even appreciate seeing more papers on approaches that didn’t work very well so it saves other researchers from going in the wrong direction. That alone would be enough justification for publishing an article.
>> The goal is to learn how to infer algorithmic processes directly from data.
And they demonstrated nothing like that. An "algorithmic process" is not finding the weights for a function given some carefully designed bias. An algorithm is a sequence of operations that calculates the result of a function. Nothing like that has been demonstrated in the linked paper at all.
>> General algorithmic capability is one of the key traits that we think AGI should have, and it’s currently missing. If you have a better approach for getting there quicker than everyone else in the field, please share it.
It's not missing at all, you just won't find it in neural nets. And my PhD and post-doc research is exactly on that sort of thing: learning programs, algorithms and, currently, solvers for general planning problems.
Meanwhile I'm over here using Claude 3 Opus to do trig and calculus problems as well as generate the LaTeX representation of the equations. It doesn't need to be 100% accurate in my case (purely for fun), but I follow its reasoning and it's pretty consistent, at least enough for orders of magnitude and first-order effects. I was gonna post some of the chats about physics but probably nobody cares.
I did do some follow-up research. The math in its complex reasoning "tracks", but when I asked it to do 4-digit x 4-digit multiplication, it got most of it right except for a weird random digit error in the middle (?!) of the correct answer, lol. Now I want to run CLUTRR against Claude since it seems nobody has published that yet.
It's probably on par with or better than what humans get unaided. Hell, I'd bet due to transcription errors it's better than what humans get in a lot of settings, even when aided by a calculator.
I guarantee you professionals using math at work (for example in finance) do not have a 1% error quota. They use tools. We have tools. Nobody in any serious role (money, etc.) works unaided.
Math inference is a parlor trick, as is the whole "world model" bullshit: physics doesn't work with 99% accuracy.
It's the same reason agents are bullshit right now: error compounding at 95% reliability per step murders them, and currently there is no path to triple-nine reliability.