What is the point of this work? 99% on 100-digit arithmetic means there's a 0% chance anyone will ever use a Transformer as an ALU or anything of the kind. We already know how to hard-code a (literally) infinitely more accurate addition machine.
And not only addition: all four arithmetic operations. The technique proposed in the article (imposing a strong inductive bias for addition) kind of works for multiplication, but not for subtraction or division (clearly; I can't even find those words in the paper). As a practical way to build a machine to do arithmetic, this is out of the question.
We've known how to mechanise arithmetic since the 1640s, with Blaise Pascal and his Pascaline. What is the point of demonstrating that it's possible to reinvent a broken, partial, buggy version of an arithmetic machine if one tries really hard and shoehorns the necessary patterns into a neural net? We've known that for a long time, too (every proof that a neural net can simulate this or that Turing machine if you design the network diagram and set the weights by hand, ever).
So what is the point of this? Transformers are supposed to be the "sparks of AGI" and they can almost do arithmetic if we try very hard to shove it down their heads? Who cares?
OK, you want the general answer? Consider a discrete-time Markov process with memory length N on a finite state space. Train a transformer with context length N on sample trajectories with SGD. Can you expect the transformer to become a good approximation of the dynamics of the Markov process? More specifically, suppose your Markov process is generated by some algorithm/Turing machine coupled with some random data. Then, can you expect the transformer to learn to emulate the behavior of the underlying Turing machine, even when run on data which was not in the initial distribution?
Another way to phrase it: given a physical process that generates discrete time series trajectories, can our current transformer + SGD method learn to emulate the underlying physical process by observing sample trajectories?
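To make that concrete, here is a minimal sketch (numpy only, with a made-up transition table) of what "sample trajectories from a memory-length-N Markov process, framed as next-token prediction" looks like; the transformer itself is left to whatever library you prefer.

    import numpy as np

    rng = np.random.default_rng(0)

    S, N = 4, 2                                   # 4 states, memory length 2
    # random transition table: P[s_prev2, s_prev1] is a distribution over the next state
    P = rng.dirichlet(np.ones(S), size=(S, S))

    def sample_trajectory(length=64):
        traj = [int(s) for s in rng.integers(0, S, size=N)]   # random initial context
        for _ in range(length - N):
            traj.append(int(rng.choice(S, p=P[traj[-2], traj[-1]])))
        return traj

    # frame as next-token prediction: (context of length N, next state) pairs,
    # which is exactly what a transformer with context length N gets trained on
    pairs = []
    for traj in (sample_trajectory() for _ in range(100)):
        for t in range(N, len(traj)):
            pairs.append((traj[t - N:t], traj[t]))

    print(pairs[0])   # e.g. ([2, 1], 3)

The interesting question is then whether the trained model has merely fit the visited corner of the table P, or has actually recovered the rule that generated it.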
This question can be stated somewhat mathematically, but that is quite difficult, because there are still some words in there where I relied on common sense. For example, mathematically there will always exist weird counterexamples, so you would have to quantify things very carefully. That's very difficult, so experiments are the best we can do right now.
Hence any instance where transformers fail to learn a Markov process is very interesting. Example: addition of random numbers.
Is addition a Markov process? I really don't think so. You can certainly model, e.g., integer addition up to some integer k by a Markov process, but addition itself is usually formalised by the Peano axioms, which are not quite Markovian. I guess you can see the relation between S(n) and S(S(n)) as some kind of Markov chain, but that's really not a standard view.
In any case, a complete theory of addition must be correct up to infinity, so you won't get that with any Markov process we can train from data. You can, though, learn addition with a simple linear regression by setting the weights appropriately. That's because a linear function is already built out of addition and multiplication, and it's basically not very different from what the team in the paper above is trying to do. Meaning: they're trying to hand-code the concept of addition into the embeddings. It's not 100%, because they're also trying at the same time not to 100% encode it, but it's a hard balance to strike.
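A toy numpy illustration of the linear-regression point: fit y = w1*x1 + w2*x2 + b on a handful of random pairs and you recover w ≈ [1, 1], b ≈ 0, which then "adds" numbers far outside the training range, because addition is baked into the model class itself.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 100, size=(50, 2))      # pairs of numbers in [0, 100)
    y = X.sum(axis=1)                          # their sums

    # least-squares fit of y = w1*x1 + w2*x2 + b
    A = np.hstack([X, np.ones((50, 1))])
    w1, w2, b = np.linalg.lstsq(A, y, rcond=None)[0]
    print(w1, w2, b)                           # ~1.0, ~1.0, ~0.0

    # "extrapolates" to numbers vastly larger than anything seen in training
    print(w1 * 1e30 + w2 * 2e30 + b)           # ~3e30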
> With positions resolved, we can study the logical extrapolation ability of transformers
They are interested in how well they can make a neural net logically extrapolate outside its training set, once encoding barriers are removed. They show that in fact even quite small language models can do this successfully once we're not confusing them with bad encodings anymore.
This seems like fundamental work. It was only a few years ago that Google employees were arguing LLMs were nothing more than "stochastic parrots". Well, that take will go down in history as one of the worst takes on AI ever. I don't think anyone really doubted by 2024 that it was wrong, but the huge and opaque datasets meant people could always argue that maybe a given answer wasn't an example of logical reasoning or extrapolation, maybe the model had just seen that specific question before. But this work shows in a controlled environment that the model can learn the principles of addition and extrapolate to much larger numbers. It's not just repeating answers it has seen in its dataset. It should kill off the parrot meme for good.
>> But this work shows in a controlled environment that the model can learn the principles of addition and extrapolate to much larger numbers.
No, because it's given hand-engineered embeddings that act as a strong inductive bias that is specific to addition. It's like addition is programmed right in.
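To spell out what "programmed right in" means here, a rough sketch of the kind of digit-position annotation at issue (my own simplified reconstruction, not the paper's exact recipe): tag every digit with its place inside its own number, counted from the least significant digit, so digits of equal significance share an index no matter where they sit in the sequence.

    def digit_place_indices(tokens):
        # Assign each digit its place value within its own number (1 = ones digit);
        # non-digit tokens get 0. Hypothetical illustration only.
        indices, run = [], []
        def flush():
            for k in range(len(run)):
                indices[len(indices) - len(run) + k] = len(run) - k
        for tok in tokens:
            if tok.isdigit():
                run.append(tok)
                indices.append(None)       # filled in once the number ends
            else:
                flush()
                run = []
                indices.append(0)
        flush()
        return indices

    tokens = list("123+4567=")
    print(list(zip(tokens, digit_place_indices(tokens))))
    # [('1', 3), ('2', 2), ('3', 1), ('+', 0), ('4', 4), ('5', 3), ('6', 2), ('7', 1), ('=', 0)]

That is the sense in which the bias is specific to addition: the annotation itself hands the model the digit alignment.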
It’s not about arithmetic but about embeddings. The positional embeddings used in transformers are rather simplistic. If they can add this one new capability to transformers by using different embeddings then maybe there are other capabilities that are within reach.
No, because those embeddings only work for addition (very weakly for multiplication and sorting). Imagine needing a specially-crafted bias for every single task. The Deep Learning revolution brought on by Convolutional Neural Nets was supposed to do away with the need to do exactly that.
I think there is a good reason to find low-hanging fruits that pay dividends on these types of tasks, not because solving addition with a transformer is a good idea, but because it could improve performance in other parts of the network. Maybe there are other subsequences that could be annotated in this way? Per paragraph, tokens per word, who knows.
Obviously, the "best" way to do addition on a computer is by doing it exactly.
>> I think there is a good reason to find low-hanging fruits that pay dividends on these types of tasks, not because solving addition with a transformer is a good idea, but because it could improve performance in other parts of the network.
The paper makes this claim, but if they could do that, they'd have shown it already: instead, their hand-crafted, artisanal embeddings only work well for addition, only weakly for multiplication and sorting, and not at all for the other arithmetic operations.
One is that research into what the limits of the architecture are is useful. Maths has a nice property of being very easy to verify and you can construct logical processes with it. It's a useful testbed.
Second is that there are a lot more places where understanding how to do arithmetic helps, outside of just doing sums on their own.
>What is the point of this work? 99% on 100-digit arithmetic means there's a 0% chance anyone will ever use a Transformer as an ALU or anything of the kind. We already know how to hard-code a (literally) infinitely more accurate addition machine.
Nobody's going to be replacing calculators with transformers sure but many are and will be using transformers to solve problems arithmetic is a necessary component of.
>So what is the point of this? Transformers are supposed to be the "sparks of AGI" and they can almost do arithmetic if we try very hard to shove it down their heads? Who cares?
You don't need to shove anything down for transformers to get arithmetic. Just changing how numbers are tokenized works. But that requires an entire retrain so why not explore other techniques?
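For instance, a minimal sketch of that kind of preprocessing (one hypothetical way to force digit-level tokenization, not any particular model's actual tokenizer): put a space between adjacent digits before the tokenizer ever sees the text, so a BPE vocabulary can't chunk "1234" into arbitrary pieces.

    import re

    def split_digits(text):
        # insert a space between adjacent digits so "1234" tokenizes as "1", "2", "3", "4"
        return re.sub(r"(?<=\d)(?=\d)", " ", text)

    print(split_digits("What is 1234 + 567?"))
    # What is 1 2 3 4 + 5 6 7?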
And what does any of this have to do with AGI? You know how terrible humans are at arithmetic, right?
Yes, but humans invented arithmetic. And then we invented computers that are much better than us at arithmetic calculations. That's a pattern we can observe all over the place: we're pretty damn good at inventing rich models of complex environments and processes but we're not very good at calculating the results of such models when that requires a lot of computation.
E.g., take chess. Modelling a game of chess as a game tree and searching the game tree by adversarial search is a human invention. Humans are pretty crap at searching a game tree beyond a handful of ply, but we can program a computer to go dozens of ply deep across thousands of branches, and beat any human.
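That invention fits in a couple of dozen lines of code. A toy negamax sketch on a made-up take-away game (deliberately not chess, just to keep it self-contained): the adversarial-search idea is exactly the same, and a machine can push it to depths no human can.

    # Toy game-tree search (negamax) on a tiny made-up game: a pile of stones,
    # each player removes 1-3, and whoever takes the last stone wins. The same
    # search scheme, plus pruning and a chess move generator, is what beats humans.

    def negamax(stones):
        """Return (value, best_move) for the player to move; +1 = win, -1 = loss."""
        if stones == 0:
            return -1, None                          # opponent took the last stone
        best_value, best_move = -2, None
        for take in (1, 2, 3):
            if take <= stones:
                value = -negamax(stones - take)[0]   # opponent's best reply, negated
                if value > best_value:
                    best_value, best_move = value, take
        return best_value, best_move

    for pile in range(1, 9):
        value, move = negamax(pile)
        print(pile, "win" if value == 1 else "loss", "take", move)
    # piles that are multiples of 4 are losses for the player to move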
So the challenge for AI is not to get computers to calculate when we know how the calculation is to be performed. The challenge is to get computers to create their own models. And that's a grand, open challenge that is not even close to being solved, certainly not by LLMs. Yann LeCun and Yoshua Bengio have said similar things.
The linked work doesn't move the needle on that at all; it just shows progress in calculating arithmetic using a transformer, which we already know how to do in a myriad of different ways and much more accurately. Hence my criticism of it.
I think most would argue Mathematics is a discipline that is discovered more than invented. That said, this isn't really the point I think.
A few humans invented/discovered arithmetic. Most humans will be born, live and die inventing absolutely nothing, even those with the opportunity and resources to do so.
It doesn't make sense to me that a bar most humans can't reach is the bar for General Intelligence of the Artificial kind. You can't eat your cake and have it.
Don't get me wrong. It's a fine goal to have. Of course we want machines that can invent things and push the frontier of science! It is still, however, a logical fallacy that an inability to do so would disqualify machines from general intelligence when it does not do so for humans.
>The challenge is to get computers to create their own models. And that's a grand, open challenge that is not even close to be solved, certainly not by LLMs.
LLMs have fairly complex models of the world made manifest by the data they're trained on.
>> Most humans will be born, live and die inventing absolutely nothing, even those with the opportunity and resources to do so.
I don't think that's right at all. I like to visit museums. You really get hit in the face with the unending creativity of the human mind and the variety of all that human hands have crafted over thousands of years across hundreds of cultures. I would go as far as to say that the natural state of the human mind is to create new things all the time. And mathematics itself was not created (invented or discovered) by one person, but by many thousands.
In any case, it doesn't matter if one instance of the class of human minds hasn't invented anything, in the same way that it doesn't matter if one car can't do 80mph. It's indisputable that we have the capacity for some novelty, and generality, in our thinking. Maybe not every member of the species will achieve the same things, but the fact is that the species, as a species, has the ability to come up with never-before-seen things: art, maths, tech, bad poetry, you name it.
>> Lecun may disagree but some others like Hinton, Ilya and Norvig don't.
I'm with LeCun and Bengio. There's a fair amount of confusion about what a "model" is in that sense: a theory of the world. There's no reason why LLMs should have that. Maybe a transformer architecture could develop a model of the world- but it would have to be trained on, well, the world, first. Sutskever's bet is that a model can be learned from text generated by entities that already have a world model, i.e. us, but LeCun is right in pointing out that a lot of what we know about the world is never transmitted by text or language.
I can see that in my work: I work with planning right now, where the standard thing is to create a model in some mathematical logic notation that is at once as powerful as human language and much more precise, and then let a planning agent make decisions according to that model. It's obvious that, despite having rich and powerful notations available, there is information about the world that we simply don't know how to encode. That information will not be found in text, either.
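For a flavour of what that looks like in practice, here is a minimal STRIPS-style sketch (a made-up toy domain, not a real planning notation and not my actual system): actions are preconditions plus add and delete lists, and the planner searches for an action sequence that reaches the goal. Everything the agent "knows" about the world has to be written into those sets by hand, which is exactly where the encoding problem bites.

    from collections import deque

    # action name -> (preconditions, add list, delete list); hypothetical toy domain
    ACTIONS = {
        "pick_up_key": ({"at_door", "key_at_door"}, {"has_key"},  {"key_at_door"}),
        "unlock_door": ({"at_door", "has_key"},     {"door_open"}, set()),
        "enter_room":  ({"at_door", "door_open"},   {"in_room"},   {"at_door"}),
    }

    def plan(initial, goal):
        """Breadth-first search over world states; returns a list of action names."""
        frontier = deque([(frozenset(initial), [])])
        seen = {frozenset(initial)}
        while frontier:
            state, actions_so_far = frontier.popleft()
            if goal <= state:
                return actions_so_far
            for name, (pre, add, delete) in ACTIONS.items():
                if pre <= state:
                    nxt = frozenset((state - delete) | add)
                    if nxt not in seen:
                        seen.add(nxt)
                        frontier.append((nxt, actions_so_far + [name]))
        return None

    print(plan({"at_door", "key_at_door"}, {"in_room"}))
    # ['pick_up_key', 'unlock_door', 'enter_room']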
Sutskever again seems to think that that kind of information can somehow be guessed from the text, but that seems like a very tall order, and transformers don't look like the right architecture. You need something that can learn hidden (latent) variables. Transformers can't do that.
>In any case, it doesn't matter if one instance of the class of human minds hasn't invented anything, in the same way that it doesn't matter if one car can't do 80mph.
It does matter, depending on what claim you're making. We've not reached the upper bound of transformer ability. Until we clearly do, it very much does matter.
>I'm with LeCun and Bengio. There's a fair amount of confusion about what a "model" is in that sense: a theory of the world. There's no reason why LLMs should have that.
See, this is my problem with LeCun's arguments. He usually starts with the premise that it's not possible and works his way from there. If you disagree with the premise then there's very little left. "Well it shouldn't be possible" is not a convincing argument, especially when we really have very little clue about the nature of intelligence.
>Sutskever's bet is that a model can be learned from text generated by entities that already have a world model, i.e. us, but LeCun is right in pointing out that a lot of what we know about the world is never transmitted by text or language.
A lot of the world is transmitted by things humans don't have access to. Wouldn't birds that can naturally sense magnetic fields to intuit direction say humans have no model of the world? Would they be right? Nobody is trained on the world. Everything that exists is trained on small slices of it. A lot of the world is transmitted by text and language. And if push comes to shove, text and language are not the only things you can train a transformer on.
>Sutskever again seems to think that, that kind of information, can somehow be guessed from the text, but that seems like a very tall order,
I don't think this is as tall an order as you believe.
>and Transformers don't look like the right architecture. You need something that can learn hidden (latent) variables. Transformers can't do that.
Unless I'm misunderstanding what you mean by hidden variables, it's very clear a transformer is regularly learning not just the sequences themselves but what might produce them.
>> Unless I'm misunderstanding what you mean by hidden variables, it's very clear a transformer is regularly learning not just the sequences themselves but what might produce them.
That's what I mean, but I don't think that's happening regularly, or at all. I don't see where the transformer architecture allows for this. Of course we can claim that any model of a process learned from examples is implicitly modelling the underlying sub-processes; for example, we can claim that a multivariate regression that predicts the age at death from demographic data is somehow learning to represent human behaviour. But that's one of those big claims that need big evidence.
On the two works you link to, I know the one on mechanistic interpretability. As the author says:
Epistemic status: I feel pretty confident that I have fully reverse engineered this network, and have enough different lines of evidence that I am confident in how it works.
But I don't feel that confident at all that the author's confidence should instill confidence in myself. A clear, direct proof is needed, although of course we can discuss what a proof even means and how much it is a social construct etc.
The other paper I haven't read. I'm going to bet it's basically data leakage, which is a pervasive problem with most deep learning work and suffices to invalidate many big claims about big results. I'll have to read the paper a bit more carefully.
But, again, what is in the transformer architecture that can predict hidden variables?
> What is the point of this work? [...] We already know how to hard-code a (literally) infinitely more accurate addition machine.
There are many situations where it is useful for the LLM to get basic arithmetic right.
For example, if someone asks your LLM to explain this line of code [1], which takes a 28x28 px input image, is the right explanation that 28×28÷4×64 = 9216? Or is that the wrong explanation?
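The [1] line isn't reproduced here, but assuming it's something like the nn.Linear(9216, 128) layer in the standard PyTorch MNIST example (two 3x3 convolutions with no padding, then a 2x2 max-pool), the honest arithmetic is:

    side = 28
    side = side - 2           # first 3x3 conv, no padding: 28 -> 26
    side = side - 2           # second 3x3 conv:            26 -> 24
    side = side // 2          # 2x2 max-pool:               24 -> 12
    print(side * side * 64)   # 12 * 12 * 64 = 9216
    print(28 * 28 // 4 * 64)  # the "28x28 / 4 x 64" story gives 12544, not 9216

So the quoted explanation is the wrong one, and a model that can't get even this much arithmetic right will explain the line wrongly.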
And being able to get 100-digit arithmetic right 99% of the time might make us feel reassured that the 4-digit arithmetic we need from the model will be right an even higher % of the time.
Seriously? They say it right in the introduction. The goal is to learn how to infer algorithmic processes directly from data. Much like how MNIST was used in the early days of NNs, you have to start with small toy problems that are representative of the problem domain. Once you have success with that, you can scale up problem complexity.
General algorithmic capability is one of the key traits that we think AGI should have, and it’s currently missing. If you have a better approach for getting there quicker than everyone else in the field, please share it.
I would even appreciate seeing more papers on approaches that didn’t work very well so it saves other researchers from going in the wrong direction. That alone would be enough justification for publishing an article.
>> The goal is to learn how to infer algorithmic processes directly from data.
And they demonstrated nothing like that. An "algorithmic process" is not finding the weights for a function given some carefully designed bias. An algorithm is a sequence of operations that calculates the result of a function. Nothing like that has been demonstrated in the linked paper at all.
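To make the distinction concrete: the grade-school procedure below is what "a sequence of operations that calculates addition" means; fitting weights under an addition-shaped bias is not the same as having induced this procedure.

    def add_digit_strings(a, b):
        """Grade-school addition: walk the digits right to left, propagating a carry."""
        a, b = a[::-1], b[::-1]
        digits, carry = [], 0
        for i in range(max(len(a), len(b))):
            da = int(a[i]) if i < len(a) else 0
            db = int(b[i]) if i < len(b) else 0
            carry, d = divmod(da + db + carry, 10)
            digits.append(str(d))
        if carry:
            digits.append(str(carry))
        return "".join(reversed(digits))

    print(add_digit_strings("987654321", "123456789"))   # 1111111110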
>> General algorithmic capability is one of the key traits that we think AGI should have, and it’s currently missing. If you have a better approach for getting there quicker than everyone else in the field, please share it.
It's not missing at all, you just won't find it in neural nets. And my PhD and post-doc research is exactly on that sort of thing: learning programs, algorithms and, currently, solvers for general planning problems.