Wow, a lot of grumpiness in here. If it's true that adding like 20 or so tokens to encode column location / decimal spot triples math performance in out of band tasks, that's a big deal. It's a simple fix, it improves performance A LOT, and they even indicate it's not just a party trick, in that the LLM can use the information to do better on related tasks like sorting and list making.
This is basically free to add, and there's no reason it shouldn't be made part of standard tokenization.
I'm more interested in the question of how we can find other useful concepts for data -> embedding space like this; can we incept our tokenization inception so it has more inception?
This is cool, but special casing digits is unsatisfying.
It makes me think that the authors have correctly identified an issue (positional embeddings) but don't propose a general solution.
I'm not sure if such a thing is possible, but if it is, it would feel more complete. (Fwiw, positional embeddings have had issues for a long time! So a general solution to this would benefit more than just arithmetic. Helpfully, we now have a really good specific example to serve as a baseline for any generalization we seek)
but it makes sense to have a different encoding. Mathematics is a completely different language. Maybe we should have more than one class of encodings.
There were some recent posts (either here or reddit) supporting the claim that different regions activate when reading programs vs when reading text. If we take that to be true, and squint just enough, one could claim that arithmetic and mathematics should be treated differently from language.
I would only find that satisfying (from a snobbish and impractical perspective) if we were able to have the model decide: 1) what encoding should this section use? 2) how should I train this encoding?
A mixture of experts but for encodings is interesting, though!
For arbitrary documents and queries, how do we reliably segment the text between those two different languages? And if we can do that, why can't the model do it implicitly?
I'm with you. I get that this is akin to asking a human, because we're trying to reason, so we will bring along (assumedly) unavoidable deficiencies of human reasoning. But if I were to ask a human genius this question, ne would grab a calculator and employ it as ne did the rest of ner reasoning.
So it seems like we should probably teach LLMs to "use a calculator", rather than try to get them to be more right when doing math 'in their head'.
Solving that will be a much bigger deal, but it's at odds with producing a highly accurate emulation of human thought and language. Language models can serve as tools to understand and experiment with logic formulated as natural language, but that isn't their primary purpose.

What you're asking for is equivalent to creating an auditable trace of everything that goes into making a statement, which is pretty much impossible even for the person making the statement. We can get close by limiting ourselves to narrow domains like mathematics, but even then someone can come along and question the premises on which we construct such a system. I'm not saying it isn't worth pursuing, it just isn't the standard that we should hold a model to when we ourselves are incapable of it.

The goal here is to create a system capable of doing the things that a human can do. If you prefer to have a system that behaves within the confines of a mathematical formalism with well-defined rules, then build that model instead.
It's entirely possible. Don't use LLMs for math. Use the computers we already have that have been capable of doing math accurately for a century. Right tool, right job.
An LLM in isolation is not a general purpose system, but with ChatGPT at least, most of the time you don't need to ask. In fact, it's increasingly difficult to force it to do "manual" maths, as it's strongly predisposed to do things like write and evaluate Python code instead of doing it "manually".
E.g. I verified just now and asked it to multiply two huge numbers, and it immediately spat out Python code, then evaluated it and gave me the result, rather than trying to multiply the numbers itself.
A basic transformer architecture performs only a bounded amount of computation per generated token, so it can never emulate a machine computing sufficiently hard problems.
> This is basically free to add, and there's no reason it shouldn't be made part of standard tokenization.
This is muchhhhh different from how tokenization works today. Adding tokens to the vocabulary is free, everything outside that (i.e. string -> tokens) is going to be a major pain in the ass. Doable but annoying and error prone
Good old software development. :( Recent case studies:
- llama.cpp wasn't tokenizing properly, and it came to a head with llama3. Essentially every local model before May 2024 is soft-deprecated, new ones have to indicate the proper tokenizer, and that currently only covers a small subset of popular models
- I recently had to review 41 Phi-3 and Llama 3 models, only 3 had the right tokenizer set
Not saying it's impossible, and we definitely should, and I bet it 100% happens, but...*shudders*
Meanwhile, I just wrote a custom tokeniser for my fan control experiment.
It features such amusements as:
- Tokens representing the current time of day and day of week, with half-hour granularity. [14:30][Monday], as the debugger reports.
- An entirely separate set of numeric tokens for CPU usage and such, on a logarithmic scale. Also features tokens for digit position, measured from the right.
- A hardcoded text tokeniser for executable paths. [/nix/store](..cut..)/bin/executable name. I didn't feel like using the usual approach, so I built a huffman compressor to generate the tokens for arbitrary text, because why not.
- Tokens representing program state - "just started", "long-running", etc.
- Tokens representing the fact that the following text is from `tail -f ~/.bash_history`.
- Start-of-segment tokens for each of the above, and also for GPU and CPU core complex power usage.
It's not that many tokens in total, and the input is structured data, so why not represent it as such? I still had sixty-five thousand tokens for the text tokeniser.
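To make the above concrete, here's a rough Python sketch of what a typed, structured tokeniser like that can look like. All of the token ranges, bucket counts and helper names below are invented for illustration; the real thing also has the Huffman-coded path tokeniser and the history-segment tokens described above.

    import math

    # Hypothetical token-ID ranges; the real layout would differ.
    TIME_BASE = 0       # 336 tokens: 48 half-hours x 7 weekdays
    CPU_BASE = 400      # log-scale CPU-usage buckets
    STATE_BASE = 500    # program-state tokens ("just started", "long-running", ...)
    SEG_BASE = 600      # start-of-segment markers

    def time_token(hour, minute, weekday):
        # [14:30][Monday] -> one token per (half-hour, weekday) pair
        half_hour = hour * 2 + (1 if minute >= 30 else 0)
        return TIME_BASE + weekday * 48 + half_hour

    def cpu_token(cpu_percent, n_buckets=32):
        # Logarithmic buckets: 1% vs 2% matters more than 90% vs 91%
        frac = math.log1p(cpu_percent) / math.log1p(100)
        return CPU_BASE + min(n_buckets - 1, int(frac * n_buckets))

    def encode_sample(hour, minute, weekday, cpu_percent, state_id):
        return [
            SEG_BASE + 0, time_token(hour, minute, weekday),
            SEG_BASE + 1, cpu_token(cpu_percent),
            SEG_BASE + 2, STATE_BASE + state_id,
        ]

    print(encode_sample(14, 30, weekday=0, cpu_percent=37.5, state_id=1))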
And when engineers accumulate enough related hacks, scientist-types may discover a pattern and find a proper, general solution. But they wouldn't get there without the pile of hacks that are effectively meta-level empirical evidence.
AI research has mostly progressed when there’s been enough processing power to avoid needing to use the old style of hacks rather than any sort of generalization going on.
AlphaZero vs Stockfish wasn’t some outgrowth of existing methods. They basically threw the old style away and started over.
Object recognition, LLMs, etc. all involved throwing what used to be unimaginable levels of data and compute at a problem that “suddenly” worked. Not saying the people at OpenAI aren’t clever, but instead that it wouldn’t have worked in 2000.
It's also obvious and it's hacky. Frankly I'm stunned this hasn't been tried yet. The people thinking this is a stepping stone to More Intelligence are missing the forest for the trees.
Deep learning is always and only ever about representing data abstractly. The more abstractions you can make irrelevant (why would you have to learn how to do math when the base-10 perspective on ASCII-digits is already provided for you?) the more you've biased your architecture to readily learn and understand the problem space.
Intelligence doesn't exist where Divine Creator gave you access to this or that faculty. It's developing those faculties yourself by reasoning through the process of composing your own mental model about the problem.
ASCII digits do not always imply base-10 numbers, they can also be identifiers (e.g. phone numbers), parts of words (IPv6, Log4j), and used in various 'written slang' such as g2g, 4ever, m8 for mate, etc, etc.
And, crucially, I'd argue that in "chatbot" tasks those other uses are more common than arithmetic, so arbitrary focus to specifically optimize arithmetic doesn't really make sense - the bitter lesson is that we don't want to bias our architecture according to our understanding of a specific problem space but rather enable the models to learn the problem space directly from data.
Stepping one level out in the metacognition hierarchy is the key. "Learning to learn" as it were. It is only the relative ease of implementation and deployment of feedforward models like Transformers that makes it seem like we have reached an optimum but we desperately need to move beyond it before it's entrenched too thoroughly.
Okay, but it does seem that this hack is in the entirely opposite direction; a pure transformer is more towards "learning to learn" than any special preprocessing to explicitly encode a different representation of numbers.
We probably do have to move beyond transformers, but not in the direction of such hacks, but rather towards even more general representations that could encode the whole class of all such alternate representations and then learn from data which of them work best.
You seemingly missed the part where the next model could learn how to generate its own hierarchical position embeddings. The problem here is obviously that you want the model to look at position i in object a and object b where the position i was chosen by a previous layer. If anything, the answer is probably to just have a dynamic position input from the model into the RoPE embedding, then it can learn the ideal position encoding on its own.
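For what it's worth, a minimal numpy sketch of that last idea (split-half RoPE convention): the `positions` argument is just a float array, so nothing stops an earlier layer from producing it instead of using 0..n-1. The `learned_positions` values below are made up.

    import numpy as np

    def rope(x, positions, base=10000.0):
        # x: (seq_len, dim), positions: (seq_len,) floats -- not necessarily 0..n-1
        seq_len, dim = x.shape
        half = dim // 2
        freqs = base ** (-np.arange(half) / half)       # per-pair rotation frequencies
        angles = positions[:, None] * freqs[None, :]    # (seq_len, half)
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[:, :half], x[:, half:]
        return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

    x = np.random.randn(6, 8)
    # Standard RoPE would use np.arange(6); here a (hypothetical) earlier layer
    # emits digit positions instead, so the rotation encodes structure, not raw index.
    learned_positions = np.array([0.0, 1.0, 2.0, 0.0, 1.0, 2.0])
    print(rope(x, learned_positions).shape)   # (6, 8)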
I think the problem here is that 'understanding' is not the same as curve fitting.
If all one is doing is giving a model lots of data and fitting curves, it's not really 'understanding' but brute forcing its way (with gradient descent), then storing the weights, and finally approximating the solution when a query is passed in.
This is not the same as understanding. Human intelligence can operate deterministically as well as non-deterministically. We can listen to language, which is by its nature non-deterministic, and convert that into deterministic operations and vice versa, i.e. we can operate on some logic and explain it in multiple ways to other people.
Understanding requires much less data than brute forcing your way into pattern recognition.
When you see a simple expression like 2 * 4 you are able to understand that it's equivalent to 2 + 2 + 2 + 2, and that in turn means 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 <- count that and you've got your answer.
Because you 'understand' this basic concept and all the operations in between, you are able to compute more examples. But you only need to understand it once. Once you understand multiplication and addition and all the tricks in between, you are able to compute 23 * 10 without being fed 23 * 10 as prior data. Understanding is very different from fitting a curve. You can reach conclusions and understanding through pattern recognition, but it's important to differentiate 'approximation' from 'calculation'. If you understand something in its entirety you should be able to calculate an outcome deterministically.
Right now LLMs lack 'understanding', and seem to only 'approximate', which may seem like 'understanding' but is actually not.
I think you are mixing layers of abstraction. To make a crude but I think not unhelpful analogy: 'understanding' is a natural-language concept that is our way to describe what's happening in our heads, and like most other such concepts it is resistant to any clear definition and will exhibit sorites-type paradoxes when one is attempted. It belongs to the presentation layer of the stack. Meanwhile the process of curve fitting, however it is implemented, with whatever NN structure (like transformers) or maybe something else entirely, belongs to the physical layer of the stack -- akin to frequency modulation.
While I am unsure whether LLMs are really understanding, whatever that means, I think it is not difficult to believe that any form of understanding we implement will involve 'curve fitting' as a central part.
This seems like it's confusing how we conceptualize the training/learning process with what the system is actually doing. We conceptualize tuning parameters as curve fitting, and we conceptualize predicting the next token as maximizing probability. But that doesn't mean there is anything like curve fitting or probability maxxing happening as the system's parameters converge.
The core feature of curve fitting is learning explicit examples and then interpolating (in an uninformative manner) between unlearned examples. But there's no reason to think this completely describes what the system is doing, in the sense that there are no more informative descriptions of its behavior. Take an example that LLMs are surprisingly good at, creating poetry given arbitrary constraints. Imagine the ratio of the poems it has seen during its training over the number of unique poems it could create in principle. This number would be vanishingly small. Interpolating between two strings representing well-formed poems in an uninformative manner (i.e. some finite polynomial) will not generate well-formed poems. The only way you could move between two examples of well-formed poems while staying on the manifold of well-formed poems is if you captured all relevant features of the manifold. But I fail to see a difference between capturing all relevant features of the poetry-manifold and understanding poetry.
What LLMs do can be described as curve fitting only in the most uninformative description possible. What they do is discover features of the structures referred to by the training text and competently deploy these features in predicting the next token. A human that could do this would be considered to understand said structure.
It seems like a hack to be honest. Problem at hand is not to make transformers do addition of 100 digit numbers. Problem is the current systems can’t reason about things, math included.
Optimizing for a certain use case is not gonna take us where we wanna be. We want to have a system that can learn to reason.
> Problem is the current systems can’t reason about things
Sounds like the AGI argument trap: They're not able to reason, but we can't succinctly define what it is.
I don't come with a reasoning chip. Whatever I call reasoning happens as a byproduct of my neural process.
I do think that the combination of a transformer network and calls to customized reasoning chips (systems that search and deduce answers, like Wolfram Alpha or logic/proof systems) may be a short-stop to something that can perform reason and execution of actions better than humans, but is not AGI.
> They're not able to reason, but we can't succinctly define what it is.
For transformer-based LLMs, and most LLMs there's an obvious class of problems that they cannot solve. LLMs generally perform bounded computation per token, so they cannot reason about computational problems that are more than linearly complex, for a sufficiently large input instance. If you have a back-and-forth (many shot) your LLM can possibly utilize the context as state to solve harder problems, up to the context window, of course.
Humans can realise they don’t understand something and seek more knowledge to learn to understand it. But also humans can build complex structures out of simple fundamentals: The same logic of counting up beans on a table can be extrapolated to multiplying that table of beans. And then counting horses the same way you count beans but give them a value of multiple beans. And then simplify that by trading in promises of beans in trade of horses.
The fact that so many people can’t see the fundamental differences of an LLM and human intelligence reminds me of back when the very early computer scientists thought they could model the entirety of nature by reducing every “component” to a numeric value and compute it as “transfer of energy”.
Quite literally they did the same thing: They had a new toy (very advanced computation machines) and forced all of nature to “fit” within it. It also ended in failure, obviously. Not because nature or ecosystems (as it was coined) are “magic” but because grossly oversimplifying reality to fit desired models is a fool’s errand.
We’ll have to wait and see how far multi-modal training takes us. Text-only models are extremely limited by the kind of information we can encode as text and the loss of detail, e.g. the word “cat” vs an image of a cat vs video of a cat vs direct physical interaction with a cat vs being a mammal that shares a great deal of biology with a cat. You need a table and beans before you can invent a method for counting them.
> LLMs generally perform bounded computation per token, so they cannot reason about computational problems that are more than linearly complex, for a sufficiently large input instance.
I can’t judge if this is true, because I don’t know transformers well, but if it is, it unravels an intuitive thought I’ve never been able to articulate about not only LLMs, but possibly all pattern matching and the human analog of System 1 thinking.
Another fuzzy way of saying this is there’s something irreducible about complexity that can’t be pattern matched by any bounded heuristic – that it’s wishful thinking to assume historical data contains hidden higher-level patterns that unlock magical shortcuts to novel problems.
There is a distinction. Humans with the use of an unbounded scratchpad can emulate a general-purpose Turing machine and perform general computation given unbounded time. A LLM is still restricted to its context window which is a comparatively extreme limitation of memory. In comparison, our general-purpose computers have so much memory this isn't something we care about for most practical instances of hard problems that we solve with a classical CS algorithm. You can obviously modify LLMs to perform unbounded computation per token (and furnish it with a scratchpad) but afaict commercial LLMs today don't offer that.
>They're not able to reason, but we can't succinctly define what it is.
People also routinely fail to reason; even programmers often write "obvious" logic bugs they don't notice until they produce an unexpected result, at which point it's obvious to them. So both humans and AI don't always reason. But humans reason much better.
I myself have observed ChatGPT 4 solving novel problems I invented, well enough to my personal satisfaction to say that it seems to have a rudimentary ability to sometimes do things we would typically call reasoning, but only at the level of a child. The issue isn't that it is supposed to reason perfectly or that humans reason perfectly, the issue is that it doesn't reason well enough to succeed at completing many kinds of tasks we would like it to succeed at. Please note that nobody expects it to reason perfectly. "Prove Fermat's last theorem in a rigorous way. Produce a proof that can be checked by Coq, Isabelle, Mizar, or HOL in a format supported directly by any of them" is arguably a request that includes nothing but reasoning and writing code. But we would not expect even Wiles to be able to complete it, and Wiles has actually proved Fermat's last theorem.
So we have an idea of reasoning as completing certain types of tasks successfully, and today humans can do it and AI can't.
The issue is that humans can see its answer is wrong and its "reasoning" is wrong.
The issue isn't that it never reasons correctly. It's that it doesn't do so often enough or well enough, and it doesn't complete tasks we expect humans to complete, and it doesn't always notice when it is printing something outrageously wrong and illogical.
It notices sometimes, it engages in elementary rudimentary guesswork sometimes, but just not often enough or well enough.
> The issue is that humans can see its answer is wrong and its "reasoning" is wrong.
I've noticed with LLMs that they're more likely to come to the wrong conclusion if you prime them in that manner. In this case, you posed the follow-up question as "Will <incorrect conclusion> always be true?" As a result, it's primed to try to prove that incorrect conclusion.
(That said, ChatGPT further did not answer the posed question, as it also changed "difference" -> "absolute difference"; in fact, the difference will alternate between increasing and decreasing, while the absolute difference is strictly increasing.)
I suppose it's a question whether what we call "reasoning" is an emergent phenomenon from having enough connections in a graph, or whether it's some other special sauce which we simply don't have in our current models yet. E.g. humans follow a deductive process to answer questions which they haven't encountered yet. Do we gain this ability purely from a denser/larger graph of knowledge, or from a completely different architecture?
I think until we know the answer to this, we can't make predictions about how to build true AGI.
> E.g. humans follow a deductive process to answer questions which they haven't encountered yet.
Rarely, actually.
More generally, humans use all kinds of inferences, where the problem at hand is intertwined with all the other attention points occupying the mental load of the person. Giving a topic full mental attention and finding a path through pure deduction about a circumscribed subject is a rarity, even if you consider only those situations that require any conscious attention at all to perform some action before moving on.
If there is one space where it shines, surely it's mathematics. But even there, the most notable mathematicians rely heavily on intuitions long before they manage to prove anything, as well as while selecting/creating the conceptual tools with which to attempt to build the proof, and rarely go to the point of formalizing their points through Coq/Isabelle or even with meticulous paper craft à la Principia Mathematica by Russell and Whitehead.
All of our deductive reasoning is founded in induction. For example, the basis of all arithmetic is physics analogies regarding things that exist and the understanding that a thing implies another thing is not based in deduction. Similarly, I suspect from my own experience that general reasoning requires a basic understanding of physics if its origin isn't something ineffable. The ability to connect and find implications cannot itself be purely deductive and it would seem to me that an understanding of physical reality would have to be the origin for that ability.
I must be in the minority here, but I don't think most people exercise any reason. I'd even venture that the vast majority of people haven't reasoned recently at all. In my mind, reasoning is an ability... a willful act to engage in thinking through an abstract problem. Most people don't do this and just use rationalization and learned behavior, which our brains are good at.
Well, 99% of day-to-day life is mundane for most living beings on Earth. A bee is able to get through its entire life without showing signs that it deeply ponders anything.
However, humans have the ability to reason about things (whether most people use this ability is a different question). So then we must ask the question: is this ability just a more advanced form of probabilistic pattern matching, or is it a different architecture altogether? Will current AI models be able to develop this ability, or will we need new models?
I think for the most part that's true, but obviously there are things people want to use LLMs for that do require planning/reasoning, and it makes for unexpected failure modes if LLMs don't have this ability.
> humans follow a deductive process to answer questions which they haven't encountered yet
Nope. Most humans fall into various traps such as pattern recognition, confirmation bias, and many others instead of relying on deductive analysis. Even scientists fail at being rigorous.
Of course there are cases like this, nobody is perfect. But we are talking about mathematics here, not everyday subconscious decision making. I agree that 99% of daily life is trivial pattern recognition. That's not what distinguishes humans though is it? Because animals, down to single celled organisms do just fine without higher order mental capabilities. But we are talking about reasoning here - and specifically about structured one like math.
I disagree that daily life is "trivial pattern recognition".
Just our visual object recognition is immensely powerful and far beyond any current AI. A simple task like walking to the fridge requires a ton of pattern recognition and spatial reasoning. Recognizing people's moods/predicting behaviors is also incredibly involved imo.
I've said this many times, but perhaps we should focus on achieving dog-level intelligence first before we start worrying about human-level AGI.
Oh I'm very much with you. In fact I get irked by people here breathlessly parroting that human level AGI is upon us any day now. I'd be impressed if an AI had mouse level capabilities any time soon. I think the current models are very impressive, but they are parlor tricks compared to what a true AGI should be capable of.
This is such a strawman. Do you really have to stoop to this level? There are a billion useless things people pay for; is that a measure of the intelligence behind it? People routinely pay $1,000 for a dog, does that mean a dog is 50x more intelligent than ChatGPT? All I'm saying is that we should be a bit more humble about intelligence when we understand so little about it.
Just because LLMs are useful, it doesn't mean they exhibit more intelligence than a mouse. A mouse probably also doesn't reason about anything, but it is an agent capable of independent behavior, something that is still very far removed from current AI models.
>All I'm saying is that we should be a bit more humble about intelligence when we understand so little about it.
OK, as long as we're being humble, how about we refrain from confidently proclaiming that there is a mouse level and a dog level that AI hasn't reached yet, and that researchers will have to spend a long time getting past, so there's plenty of time before we have to worry about the possibility of AIs becoming dangerous or transformative to society?
> Just our visual object recognition is immensely powerful and far beyond any current AI.
That's a point you'll likely have to revisit pretty soon. Radiology, for instance, probably won't exist as a profession 20-30 years from now. Captchas are already pretty much done for.
Well, 1. radiology is an insanely niche subject not indicative of general intelligence, and 2. AI being good at radiology isn't about object recognition or spatial reasoning, it's data analysis connecting features to outcomes.
Lastly, check out the ARC challenge or any other spatial reasoning tests for AI. Humans get ~80% on these challenges whereas the best AI is still at 25%
It seems there are multiple things by the name ARC. There is one by AI2, which is a set of text-based science questions/word problems. The one I'm referring to is this: https://lab42.global/arc/
As to the study, I have the same objection as the radiology one. This isn't about object recognition, and certainly not spatial reasoning; it's the ability to predict cancer based on the presence of visual features.
The "object recognition" part of this is super simple. Its a single, mostly 2D object in more or less the same angle, and the AI is trained on detecting just this.
The "object recognition" part of this is super simple. Its a single, mostly 2D object in more or less the same angle, and the AI is trained on detecting just this.
As I understand it, conceptually they just changed 346 + 23 = ? to (1: 3, 2: 4, 3: 6) + (1: 2, 2: 3) = ?
So it is not that much of a specific hack. There could be a broader principle here where something is holding transformers back in a general fashion, and we might be able to improve on the architecture!
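A tiny sketch of that reading, purely for illustration (the indexing here counts from the left to match the (1: 3, 2: 4, 3: 6) example above; the actual Abacus embeddings in the paper handle the digit indexing and how it is injected into the model differently):

    def annotate_digits(s):
        # "346 + 23" -> each digit paired with its position inside its own number
        out, i = [], 0
        while i < len(s):
            if s[i].isdigit():
                j = i
                while j < len(s) and s[j].isdigit():
                    j += 1
                out.extend((k - i + 1, s[k]) for k in range(i, j))
                i = j
            else:
                out.append(s[i])
                i += 1
        return out

    print(annotate_digits("346 + 23"))
    # [(1, '3'), (2, '4'), (3, '6'), ' ', '+', ' ', (1, '2'), (2, '3')]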
how do you argue that these models are not able to reason?
deductive reasoning is just drawing specific conclusions from general patterns, something I would argue these models can do (of course not always, and they are still pretty bad in most cases)
the point I'm trying to make is that sometimes reasoning is overrated and put at the top of the cognitive ladder; sometimes I have seen it compared to self-awareness or stuff like that. I know that you are probably not saying it in this way, just wanted to let it out.
I believe there is fundamental work still to be done, maybe models that are able to draw patterns by comparing experiences, but this kind of work can be useful as it makes us reflect on every step of what these models do, and on how much the learned internal representation can be optimized
We have no definition of reasoning that is sufficiently precise to be useful.
But we do have a bunch of benchmark tasks/datasets that test what we intuitively understand to be aspects of reasoning.
For AI models, "being able to reason" means "performing well on these benchmarks tasks/datasets".
Over time, we'll add more benchmarking tasks and datasets that ostensibly test aspects of "reasoning", and people will develop models that succeed on more and more of these simultaneously.
And these models will become more and more useful. And people will still argue over whether they are truly "reasoning".
The fundamental argument of "Artificial Intelligence, Natural Stupidity" is that AI researchers constantly abuse terms like "reasoning," "deduction," "understanding," and so on, deluding others and themselves that their machine is almost as intelligent as a human when it's clearly dumber than a dog. My cats don't need "general patterns" to form deductions, they deduce many sophisticated things (on their terms) with n=1 data points.
In the 80s the computers were indisputably dumber than ants. That's probably not true these days. But the decades-long refusal of most AI researchers to accept humility about the limitations of their knowledge (now they describe multiple-choice science trivia as "graduate level reasoning") suggests to me that none of us will live to see an AI that's smarter than a mouse. There's just too much money and ideology, and too little falsifiability.
Drew McDermott's warning is well-heeded, but there are established and well-understood definitions of deductive, inductive and abductive reasoning that go back to at least Charles Sanders Peirce (philosopher and pioneer of predicate logic, contemporary of Gottlob Frege), that are widely accepted in AI research, and that even McDermott would have accepted. See sig for intro.
This is completely irrelevant. McDermott's point was that scientifically plausible definitions of reasoning were not actually being used in practice by AI researchers when they made claims about their systems. That is just as true today.
I've read McDermott's paper a few times (it's a favourite of mine) and I don't remember that angle. Can you please clarify why you say that's his point?
Ants behave in ways that a modern computer still can't imitate. I don't think that generalized intelligence is possible but if it is it would need a different starting point than our current computing hardware. Even insects are flexible in ways that computers aren't.
> Deductive reasoning is the process of drawing valid inferences. An inference is valid if its conclusion follows logically from its premises, meaning that it is impossible for the premises to be true and the conclusion to be false.
> how do you argue that these models are not able to reason?
They just don't have the right architecture to support it.
An LLM is just a fixed size stack of N transformer layers, and has no working memory other than the temporary activations between layers. There are always exactly N steps of "logic" (embedding transformation) put into each word output.
You can use prompts like "think step by step" to try to work around these limitations so that a complex problem can (with good planning by the model) be broken down into M steps of N layers, and the model's own output in early steps acts as pseudo-memory for later steps, but this only gets you so far. It provides a workaround for the fixed N layers and memory, but creates critical dependency on ability to plan and maintain coherency while manipulating long contexts, which are both observed weaknesses of LLMs.
Human reasoning/planning isn't a linear process of N steps - in the general case it's more like an iterative/explorative process of what-if prediction/deduction, backtracking etc, requiring working memory and focus on the task. There's a lot more to the architecture of our brain than a stack of layers - a transformer is just not up to the job, nor was built for it.
It is not «deductive reasoning»: it is just "reasoning". That is, revising a body of ideas for qualities pertinent to truthfulness (alethic value) and understanding (completeness).
It is critical thinking, continuous cycles of reprocessing.
And this cannot be overrated: it is the core activity.
There is a difference between poor reasoning and no reasoning. SOTA LLMs answer a significant number of these questions correctly. The likelihood of doing so without reasoning is astronomically small.
Reasoning in general is not a binary or global property. You aren't surprised when high-schoolers don't, after having learned how to draw 2D shapes, immediately go on to draw 200D hypercubes.
Granting that, the original point was that they're not excited about this particular paper unless (for example) it improves the networks' general reasoning abilities.
The problem was never "my llm can't do addition" - it can write python code!
The problem is "my llm can't solve hard problems that require reasoning"
>deductive reasoning is just drawing specific conclusion from general patterns. something I would argue this models can do
That the models can't see a corpus of 1-5 digit addition then generalise that out to n-digit addition is an indicator that their reasoning capacities are very poor and inefficient.
Young children take a single textbook and a couple of days' worth of tuition to achieve a generalised understanding of addition. Models train for the equivalent of hundreds of years, across (nearly) the totality of human achievement in mathematics, and struggle with 10-digit addition.
This is not suggestive of an underlying capacity to draw conclusions from general patterns.
I think the “train for hundreds of years” argument is misleading. It's based off of parallel compute time and how long it would take to run the same training sequentially on a single GPU. This assumes an equivalence with human thought based on the tokens-per-second rate of the model, which is a bad measurement because it varies depending on hardware. The closest comparison you could draw to what a human brain is doing would be either the act of writing or speaking, but we obviously process a lot more information, and produce a higher volume of information, at a much higher rate than we can speak or write. Imagine if you had to verbally direct each motion of your body; it would take an absurd amount of time to do anything, depending on the specificity you had to work with.
The work done in this paper is very interesting, and your dismissal of “it can’t see a corpus and then generalize to n digits” is not called for. They are training models from scratch in 24 hours per model using only 20 million samples. It’s hard to equate that to an activity a single human could do. It’s as though you had piles of accounting ledgers filled with sums and no other information or knowledge of mathematics, numbers or the world, and you discovered how to do addition based on that information alone. There is no textbook or tutor helping them do this either, it should be noted.
There is a form of generalization if it can derive an algorithm based on a maximum length of 20 digit operands that also works for 120 digits. Is it the same algorithm we use by limiting ourselves to adding two digits at a time? Probably not but it may emulate some of what we are doing.
>There is no textbook or tutor helping them do this either it should be noted.
For this particular paper there isn't, but all of the large frontier models do have textbooks (we can assume they have almost all modern textbooks). They also have formal proofs of addition in Principia Mathematica, alongside nearly every math paper ever produced. And still, they demonstrate an incapacity to deal with relatively trivial addition - even though they can give you a step-by-step breakdown of how to correctly perform that addition with the columnar-addition approach. This juxtaposition seems transparently at odds with the idea of an underlying understanding & deductive reasoning in this context.
>There is a form of generalization if it can derive an algorithm based on a maximum length of 20 digit operands that also works for 120 digits. Is it the same algorithm we use by limiting ourselves to adding two digits at a time? Probably not but it may emulate some of what we are doing.
The paper is technically interesting, but I think it's reasonable to definitively conclude the model has not learned an algorithm that is remotely as effective as columnar addition. If it had, it would be able to perform addition on n-digit integers. Instead it has produced the relatively predictable result that, when given lots of domain-specific problems, transformers get better at approximating the results of those domain-specific problems, and that when faced with problems significantly beyond their training data, their accuracy degrades.
That's not a useless result. But it's not the deductive reasoning that was being discussed in the thread - at least if you add the (relatively uncontroversial) caveat that deductive reasoning should lead to correct conclusion.
As humanity, we're building a reasoning machine from the bottom up. It can't reason... yet. Expecting a magical switch that will make it reason about anything and everything is unreasonable. Starting with arithmetic makes perfect sense.
I didn’t test with all the LLMs out there, but all of those I tested failed on something as basic as "What is the number of words in the sentence coming before the next one? Please answer."
In my experience, LLMs tend to perform better if you give them instructions before the data to be operated on. At least for the ~13b size models.
So, something like: Please count the number of words in the following sentence. "What is the number of words in the sentence coming before the next one?"
edit: Which might be an artifact of the training data always being in that kind of format.
For things like this where we have computationally cheap, well understood, reliable tools available (aka calculator) it seems better to train the model in tool use.
I guess perhaps the techniques could be generalized though?
Generalizable techniques are mostly the point of papers like this one, yes. What they show here is that apparently fundamental problems with transformer reasoning can be fixed by encoding data in a more sophisticated manner. This is exciting. I've been thinking for a long time that the tokenization schemes are a low-hanging fruit for improving coding LLM performance; this isn't exactly the same thing, but it's in the same general area. Smartness and reasoning ability with the current set of algorithmic techniques seems to have topped out around GPT-4 level, which implies that further leaps in mental abilities must come from improving other things beyond training set size.
For example, whilst replacing the need for a calculator isn't very important, one obvious research direction would be to explore adding extra embeddings to code inputs, perhaps that are being computed by an IDE.
It seems sub-word tokenization vs using character inputs is just a trade off to gain computational efficiency, and obviously isn't how our brain works. We're not born with a fixed visual tokenization scheme - we learn to create our own groupings and object representations.
However, transformers seem to struggle a bit with accurately manipulating sequences, so going to character inputs and hoping for those to be aggregated into words/numbers/etc might cause more problems than it solves?
I have to wonder if these models would not be better off learning whole-word embeddings rather than tokens. You'd have thought they would learn embeddings that encode any useful relatedness (e.g. corresponding to common prefixes) between words. Perhaps numbers would be better off input as a sequence of individual digit embeddings.
Yeah a tiny vocab of characters doesn't work that well, it was tried very early on and creating large vocabs of tokens was a big improvement. Which makes sense. A lot of tokens are full words and so the token->embedding phase can quickly look up an embedding in vector space that contains a lot of meaning, whereas an embedding of 'z' or whatever is going to be meaningless.
I guess this extends to numbers split across multiple tokens too (especially in the somewhat odd way the OpenAI tokenizer does it). The model is having to work really hard to learn what a given sequence of number chunks means (e.g. chunks '123' '45' vs '123' '4'). It somehow needs to realize that the embedding for '4' represents a single-digit number, but the embedding for '45' represents a two-digit number, and this then correspondingly changes the meaning of the preceding '123' token!
It would have made it easier for the model to grok numbers if, similar to the proposed alternative, 1234 was tokenized as '1000' '200' '30' '4' for powers of 10 up to some reasonable limit (then maybe '1^' '2^' after this reasonable limit). This would let the model easily grok human-sized numbers and need to work harder to grok, say, 20-digit ones, just the same as we do. Some early curriculum training, while not necessary, could then help it to quickly learn which embeddings represent numbers which are d * 10^1 vs d * 10^2, etc.
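A quick sketch of that place-value tokenisation idea (the '^' fallback markers and the zero-dropping are my own guesses at the details, not anything from the paper):

    def place_value_tokens(n, max_digits=6):
        # 1234 -> ['1000', '200', '30', '4']; digits above the limit become '<d>^<power>'
        digits = str(n)
        tokens = []
        for i, d in enumerate(digits):
            power = len(digits) - 1 - i
            if d == "0":
                continue                      # zero place values add nothing to the sum
            if power < max_digits:
                tokens.append(d + "0" * power)
            else:
                tokens.append(f"{d}^{power}")
        return tokens

    print(place_value_tokens(1234))       # ['1000', '200', '30', '4']
    print(place_value_tokens(70000123))   # ['7^7', '100', '20', '3']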
That's sort of what this paper is doing. They add positional embeddings so the model can understand the positions of the digits inside the numbers better.
I think this is more a matter of how numbers are input and lack of specific training, including visual training.
For example, the number 12,345,678 is input to ChatGPT as the three tokens "123" "456" "78", which isn't the best place to start to learn that this is an 8 digit number with specific digit positions!
As a human child you learn about numbers largely visually by pointing to units, tens, hundreds etc, visually aligning them to add, etc. Maybe a multi-modal model, if it was visually trained on chalkboard primary school math, would do better in learning the concept of position based powers of 10, etc.
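If anyone wants to check how the chunking actually comes out for a given string, the tiktoken package makes that a one-liner (the exact split depends on the encoding and on whether the commas are present, so treat the three-token example above as roughly right rather than gospel):

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for s in ["12345678", "12,345,678"]:
        ids = enc.encode(s)
        print(s, "->", [enc.decode([i]) for i in ids])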
I'd say the key point here isn't that they "need" specialised embeddings, but rather that it improves things and they can somewhat manage without.
That's a far more surmountable problem. Maybe you need one model for biology and another for coding, etc., i.e. a broad split by domain. Still weak AI, not truly general in the AGI sense, but it still seems like a good next step.
I think understanding mathematics is what LLMs really need at the moment, far more important than video generation, which is just another form of CGI [1]. After deep learning and transformers, understanding mathematics and its proofs, not just arithmetic, will be the next game changer for LLMs and a turning point for humanity.
[1] Why LLMs like ChatGPT and Google Bard are bad at math:
> understanding mathematics and its proofs not just arithmetic will be the next game changer for LLM
Why?
I definitely agree that such capabilities would represent a major advance (and very likely go together with game changing increases of capabilities in other areas). I also think using AI to write formal math proofs in e.g. Lean is very cool.
However, by itself, it seems like this capability wouldn't be very useful, commercially for example. Do you think this capability is exceptionally informative merely because it has to go together with other capabilities? It's not impossible to have a (maybe somewhat limited) formal math AI that will remain mostly irrelevant to the everyday world (like FormalGeo).
Understanding mathematics basically means understanding higher-level reasoning. If an AI were able to actually do this, plus the ability to generate and interpret language that LLMs already show, it would seem to be 90% or more of the way to AGI.
> However, by itself, it seems like this capability wouldn't be very useful, commercially for example.
Quite the opposite, it's the holy grail of all AI.
Consider the various work that isn't (and can't be) done by computers/robots/etc. right now.
The constraint is universally intelligence: a required amount of problem solving. Even "low skill" labour requires it.
And to perform such problem solving, you need advanced logic and reasoning capabilities, which is the same thing as novel mathematics, just applied to a different end.
Let's be a little more concrete: do you think FormalGeo [1] is a big deal? I think it's very cool but ultimately not useful in and of itself. It's only useful insofar as it shows AI capabilities advancing in general.
Let's suppose we had an AI that works roughly like [1] but for the kind of mathematics done in Lean's Mathlib, and that was on par or better than humans working on it. Would that AI by itself be commercially useful?
Again, of course having such an AI implies a major jump in capabilities and it would most likely mean useful AI can be trained with similar techniques. But that's not what I mean by the system itself being useful. If all you're saying is that such an AI demonstates we can now probably build AIs that do things which we usually say require "logic and reasoning abilities", I completely agree.
Maybe I'm splitting hairs too much here. However, it could well be that such an AI would be useful by itself. I just can't think of much besides a major advance in the formal software verification niche, which still almost nobody would use...
> I just can't think of much besides a major advance in the formal software verification niche, which still almost nobody would use...
The reason is slightly different here.
What's so desirable here is an AI system with such general intelligence that it is capable of such mathematics by itself as a consequence. Not because the mathematics is so useful, but because the required reasoning capabilities are at such a level that, we could speak of an artificial intelligence that is meaningfully "general" about any problem.
It's a decent approximation of "able to solve any problem" that we can still reasonably test.
> Let's be a little more concrete: do you think FormalGeo [1] is a big deal?
It looks to be an interesting approach in modelling mathematics, and their use of machine learning is an interesting novelty that may pave the way to more useful general mathematics systems, but I can't find much about how these systems might interop with current/'generative' AI systems.
And that last bit is one of the big roadblocks for current AI. They're very weak at reasoning, but we can't directly interop to (e.g.) LLMs, so we can't compensate for that weakness.
Something I've been thinking about is how the Minds -- the super-human AI hyper-computers that fly the ships in the Culture series of novels -- are described. The image built up in my head[1] is that they're hybrids blending neural networks and regular compute substrates. They can calculate, simulate, and reason in combination.
There have been crude attempts at this already, hooking Mathematica and Python into ChatGPT. I say crude, because these add-ons are controlled via output tokens.
What I would like to see is a GPT-style AI that also has compute blocks, not just transformer blocks. I don't mean compute in the sense of "matrix multiply for weights and biases", but literally an ALU-style block of basic maths operations available for use by the neurons.
One thought that I had was that this could be via activations that have both a floating-point activation value and "baggage" such as a numerical value from the input. Like a token in a traditional parser, that can represent a constant string or an integer with its decoded value.
The newer, truly multi-modal models gave me a related idea: Just like how they can have "image" tokens and "audio" tokens, I wonder if they could be given "numeric data" tokens or "math symbol" tokens. Not in the same way that they're given mixed-language text tokens, but dedicated tokens that are fed into both the transformer blocks and also into ALU blocks.
Just an idle thought...
[1] Every reader reads into a story something unique, which may or may not align with what the author intended. This is my understanding, coloured by my own knowledge, etc, etc...
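To make the idle thought slightly more concrete, here is a toy forward-pass sketch of an "ALU block" that operates on a numeric side channel carried alongside the hidden states. Everything here is invented -- the gating, the projection, and especially how such a block would be trained are exactly the open questions.

    import numpy as np

    class ToyALUBlock:
        # Each token position carries an optional numeric value ("baggage") next to
        # its hidden vector. The block applies a few hard-wired ops to the last two
        # values and mixes the result back into the hidden states via learned gates.
        def __init__(self, d_model, seed=0):
            rng = np.random.default_rng(seed)
            self.ops = [lambda a, b: a + b,
                        lambda a, b: a - b,
                        lambda a, b: a * b,
                        lambda a, b: a / b if b else 0.0]
            self.gate = rng.normal(size=len(self.ops))        # would be learned
            self.proj = rng.normal(size=(1, d_model)) * 0.01  # scalar -> hidden space

        def __call__(self, hidden, values):
            nums = [v for v in values if v is not None]
            if len(nums) < 2:
                return hidden
            a, b = nums[-2], nums[-1]
            candidates = np.array([op(a, b) for op in self.ops])
            weights = np.exp(self.gate) / np.exp(self.gate).sum()  # softmax over ops
            result = float(weights @ candidates)
            return hidden + result * self.proj   # broadcast into every position

    hidden = np.zeros((5, 16))
    values = [None, 123.0, None, 456.0, None]      # e.g. decoded from digit tokens
    print(ToyALUBlock(16)(hidden, values).shape)   # (5, 16)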
The problem, if you embed an ALU like that, is how to train it to use them properly. And then it's not clear if they actually need to be able to do that in the middle of a pass that, at the end, is going to produce a single token anyway.
Controlling that stuff via output tokens actually kinda makes sense by analogy, since that is how we use calculators etc. But I do agree that specialized tokens that are used specifically to activate tools like that might be a better idea than just using plain text to signal in-band. And production of such specialized tokens can be easily trained.
I like this idea a lot. Right now we are going the long/hard way round: post-training, asking an LLM to recognise that it needs compute, then write a compute request, then feed the compute answer back into the tokenization loop.
It probably does make sense to add a mini CPU as a layer / tool / math primitive. I wonder how you'd train it to use such a thing? In my mind it's not really a layer per se, but a set of function calls a layer could route to when it wants, weighting the response appropriately.
I just wonder: if numbers were written right to left, would LLMs be much better at arithmetic? You can 'predict' the least significant digit by reusing the already written digits in the computation, but to generate the most significant ones, you generally need to do the entire computation in one go.
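A minimal sketch of why that ordering helps: emitting the least significant digit first, each output digit depends only on one column of inputs plus a carry bit, so the per-token "work" stays bounded, whereas emitting the most significant digit first requires knowing essentially the whole sum before the first digit.

    def add_lsd_first(a, b):
        # Emit the digits of a + b least-significant-first, one per step.
        a, b = a[::-1], b[::-1]          # read the inputs from the right
        carry, out = 0, []
        for i in range(max(len(a), len(b))):
            s = carry
            s += int(a[i]) if i < len(a) else 0
            s += int(b[i]) if i < len(b) else 0
            out.append(str(s % 10))      # this column's digit, available immediately
            carry = s // 10
        if carry:
            out.append(str(carry))
        return "".join(out)              # digits in reversed order

    print(add_lsd_first("346", "23"))    # '963', i.e. 369 read back the usual way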
Yes. This has already been demonstrated by "Teaching Arithmetic to Small Transformers" https://arxiv.org/abs/2307.03381 , I'm not sure what OP adds except demonstrating that you can do that via the embedding itself rather than the tokenization.
> We first demonstrate that conventional training data is not the most effective for arithmetic learning, and simple formatting changes can significantly improve accuracy. This leads to sharp phase transitions as a function of training data scale, which, in some cases, can be explained through connections to low-rank matrix completion. Building on prior work, we then train on chain-of-thought style data that includes intermediate step results. Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed. We also study the interplay between arithmetic and text data during training and examine the effects of few-shot prompting, pretraining, and model scale. Additionally, we discuss length generalization challenges.
This is an interesting idea but probably hard to verify.
A tangent is that positional systems were originally invented with the least significant digit first, I believe.
The Babylonian sexagesimal system was like that as was the Arabic one (where first is on the right).
The most-significant-digit-first convention came about when right-to-left numbers were used in left-to-right writing systems without being reversed. To this day we read the more common smaller numbers least significant digit first, to varying degrees.
16 = six teen, sech zehn
98 = acht und neunzig, achten negentig, ثمانية وتسعون
I'm curious about the framing of research like this.. "The poor performance of transformers on arithmetic tasks" (relative to what?) and how that informs the adjacent conversation on progress towards AGI.
Some say AGI has already been achieved, others that it's years or decades away. When I dig into the disagreement, it often partially depends on the perspective of how competent humans are on the tasks in question, with the optimists being, I think, more realistic about variance in human intelligence and the pessimists seeming to reserve the term "general intelligence" for possessing a nearly perfect suite of capabilities that many otherwise intelligent people practically don't have.
For example with arithmetic, this study cites another [Dziri et al. 2023], that says:
"For instance, humans can solve 3-digit by 3-digit multiplication arithmetic after learning basic calculation rules. Yet, off-the-shelf ChatGPT and GPT4 achieve only 55% and 59% accuracies on this task, respectively."
I still see value in normative statements about human capability in AI & AGI research, but I think we'll need to move towards explicit statistical framing.
DeepMind's Position paper "Levels of AGI for Operationalizing Progress on the Path to AGI" has a schema like this, where AGI capabilities are defined across 2 axes of Performance level X Generality (narrow vs general), and the Performance levels are measured by comparison with "Percentile of skilled adults" able to perform the task.. https://arxiv.org/pdf/2311.02462#page=3.40
Within that framing, this paper's title or result might be "Achieving AGI Competency in Arithmetic", or "Expertise", or "Virtuosity", i.e. on par respectively with 50th, 90th or 99th percentile of skilled adults.
Exactly, we need a much more granular approach to evaluating intelligence and generality. Our current conception of intelligence largely works because humans share evolutionary history and partake in the same 10+ years of standardized training. As such, many dimensions of our intelligence correlate quite a bit, and you can likely infer a person's "general" proficiency or education by checking only a subset of those dimensions. If someone can't do arithmetic then it's very unlikely that they'll be able to compute integrals.
LLMs don't share that property, though. Their distribution of proficiency over various dimensions and subfields is highly variable and only slightly correlated. Therefore, it makes no sense to infer the ability or inability to perform some magically global type of reasoning or generalization from just a subset of tasks, the way we do for humans.
Agreed on the first part, but as for LLMs not having correlated capabilities, I think we've seen they do. As the GPTs progress, mainly by model size, their scores across a battery of tests go up, e.g. OpenAI's GPT-4 paper, showing a leap in performance across a couple dozen tests.
Also found this, a Mensa test across the top dozen frontier models.
AGI is like consciousness, 75% of the people in any given conversation are talking about different things.
Truthfully, we're going to see that improving language models towards AGI works out the same way self-driving cars did - we're going to feel like we're 85% of the way there out of the gate, then we're going to keep tripping over things for the next 15 years.
At least with AGI, we can just throw up our hands, use an easier definition and take the W.
I don't understand the framing of your comment. You act like the LLM's feelings are going to be hurt if you say it isn't a real AGI. "Well, you can't do basic math expected of fifth graders, but there are dumb fifth graders too, so here's the 'human-level intelligence' participation trophy anyway."
The issue that separates "AGI" from current AI systems is the lack of generality. (Humour me.)
In particular, the lack of reasoning capability. And what the pessimists argue here is that there is no road to get there for current systems. Transformers are approximation machines, and are generalized for that specific task. But that's also where it stops, they can't do things that aren't such pattern-approximation.
Optimizing a transformer for arithmetic isn't a step towards AGI, because it is not generalizing. You'd need to do this for every conceivable task and subtask. This is the exact reason why imperative-programmed AI architectures were discarded.
Put bluntly, this approach will never get you a transformer that won't shit itself when asked to do novel reasoning tasks, such as novel mathematics. (Which, I will remind the reader, anything but the most basic programming work counts as.)
And critically, the fundamental architecture of these transformer systems doesn't allow combining them with other AI systems to acquire generalized capabilities. There's no way to make an LLM hook into a computer algebra system; you can only feed the 'finished' output of one system into another.
The other day I was wondering if LLMs are bad at maths because they don't have readily apparent access to the concept of "columns". Apparently the answer is yes.
Vertical alignment across lines is pretty important for humans to learn operations on digits, but the way we encode lines with a \n separator doesn't really help. In a recent codebullet video gpt really struggled with any kind of vertical alignment task. I wonder if it would do better on a fixed 80 column width...
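To make the "no access to columns" point concrete, here's a toy sketch (mine, not from the paper): vertically aligned digits are nowhere near each other in the flat character stream a model actually sees, and how far apart they are depends on the line width, which is not something a standard positional index exposes.

```python
# Toy illustration (not from the paper): vertically aligned digits end up far
# apart in the flat character stream, and the gap between them depends on the
# line width rather than on any notion of "same column".
problem = " 123\n+456\n----\n 579"

units_top = problem.index("3")      # units digit of 123
units_bottom = problem.index("6")   # units digit of 456
print(units_top, units_bottom, units_bottom - units_top)  # 3 8 5: "same column" = "offset by line width + 1"
```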
Isn't it more that they don't have ready access to the much-more-fundamental concept of decimal numbers?
My understanding was that they tokenized them into chunks and tried to learn associations between the chunks, the same as if one was breaking apart English words.
So "2+2=4" isn't being treated that differently from "all's well that ends well." This might lead to a kind of Benny's Rules [0] situation, where sufficient brute-force can make a collection of overfitted non-arithmetic rules appear to work.
I went through the paper and thought immediately about how did they implement it; I missed they published their code as well. Here is the link for everyone who skimmed past it: https://github.com/mcleish7/arithmetic/tree/main
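For anyone who wants the gist without opening the repo: as I understand the paper, the trick is to give every digit an extra learned embedding indexed by its offset within the number it belongs to (so, with the least-significant-digit-first formatting, digits of the same significance share a signal), added on top of the usual token embedding. A rough sketch of that idea, not the authors' actual implementation:

```python
import torch
import torch.nn as nn

class DigitPositionEmbedding(nn.Module):
    """Rough sketch of the abacus-style idea: each digit token gets an extra
    embedding indexed by its offset inside its own number, added to the usual
    token embedding. This is a paraphrase, not the repo's actual code."""

    def __init__(self, d_model: int, max_digits: int = 128):
        super().__init__()
        self.pos = nn.Embedding(max_digits + 1, d_model)  # index 0 = "not a digit"

    def forward(self, token_emb: torch.Tensor, is_digit: torch.Tensor) -> torch.Tensor:
        # is_digit: (batch, seq) bool mask marking digit tokens.
        offsets = torch.zeros_like(is_digit, dtype=torch.long)
        for b in range(is_digit.shape[0]):          # slow Python loop, fine for a sketch
            run = 0
            for t in range(is_digit.shape[1]):
                run = run + 1 if is_digit[b, t] else 0
                offsets[b, t] = run                  # 1, 2, 3, ... within each number
        offsets = offsets.clamp(max=self.pos.num_embeddings - 1)
        return token_emb + self.pos(offsets)
```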
It's basically the same as feature engineering in pre-deep machine learning: constructing features with high information content can significantly reduce the amount of data and computation needed to fit a useful model. And sometimes it's impossible to fit a useful model without careful feature engineering, either because the model itself is constrained in some way or because there isn't enough data or both.
It's analogous to making a choice of inductive bias within the model itself. We literally could not do LLMs without the carefully-constructed transformer architecture. Why should we expect to make further progress without paying more attention to the embeddings?
Since models are very good at writing very short computer programs, and computer programs are very good at mathematical calculations, would it not be considerably more efficient to train them to recognise a "what is x + y" type problem, and respond with the answer to "write and execute a small javascript program to calculate x + y, then share the result"?
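A toy version of that routing, with a hypothetical `ask_llm` callable standing in for the model (real systems do this via tool use / function calling, with the model deciding when to reach for the calculator):

```python
import re

def answer(question: str, ask_llm) -> str:
    """Route obvious arithmetic to exact evaluation; everything else to the model.
    `ask_llm` is a hypothetical callable standing in for an LLM API call."""
    m = re.fullmatch(r"\s*what is\s+(-?\d+)\s*([+\-*])\s*(-?\d+)\s*\??\s*", question, re.I)
    if m:
        a, op, b = int(m.group(1)), m.group(2), int(m.group(3))
        return str({"+": a + b, "-": a - b, "*": a * b}[op])
    return ask_llm(question)

print(answer("What is 123456789 + 987654321?", ask_llm=lambda q: "(model answer)"))  # 1111111110
```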
From a getting-answers perspective yes, from an understanding-LLMs perspective no. If you read the abstract you can see how this goes beyond arithmetic and helps with long-form reasoning.
But that's not all that relevant to the question "can LLMs do math". People don't really need ChatGPT to replace a calculator. They are interested in whether the LLM has learned higher reasoning skills from its training on language (especially since we know it has "read" more math books than any human could in a lifetime). Responding with a program that reuses the + primitive in JS proves no such thing. Even responding with a description of the addition algorithm doesn't prove that it has "understood" maths, if it can't actually run that algorithm itself - it's essentially looking up a memorized definition. The only real proof is actually having the LLM itself perform the addition (without any special-case logic).
This question is of course relevant only in a research sense, in seeking to understand to what extent and in what ways the LLM is acting as a stochastic parrot vs gaining a type of "understanding", for lack of a better word.
This is a cromulent approach, though it would be far more effective to have the LLM generate computer-algebra-system instructions.
The problem is that it's not particularly useful: As the problem complexity increases, the user will need to be increasingly specific in the prompt, rapidly approaching being fully exact. There's simply no point to it if your prompt has to (basically) spell out the entire program.
And at that point, the user might as well use the backing system directly, and we should just write a convenient input DSL for that.
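To be concrete about what "generate computer-algebra-system instructions" would look like (my own illustration, not from the thread): the model emits an expression string and a CAS such as SymPy evaluates it exactly, so the arithmetic never touches the token stream.

```python
import sympy

# Hypothetical model output: an expression for the CAS, not a computed answer.
model_output = "2**127 - 1"

exact = sympy.sympify(model_output)  # exact integer arithmetic, no rounding
print(exact)                         # 170141183460469231731687303715884105727
print(sympy.isprime(exact))          # True (it's a Mersenne prime)
```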
"Syntax-Aware Transformer Models for Neural Machine Translation" by Yang et al. (2019). This model enhances the transformer architecture with syntax-aware attention mechanisms that consider dependency parse trees.
"Context-Aware Neural Machine Translation Learns Anaphora Resolution" by Bawden et al. (2018). This paper explores integrating context and syntax into neural machine translation models.
I think the main problem is the way we turn raw mathematical symbols or equations into tokens; this suboptimal tokenization may decrease performance.
I think that's far from the only problem.
To me the most obvious problem is that we use right-to-left numbers (think about the order you're writing digits when doing long addition) in a left-to-right language.
Without a special number-flipping step, the transformer is forced to produce the output token by token from left to right, i.e. most significant digit first. Without the ability to store additional internal state, this turns addition into an O(N²) problem purely due to the suboptimal output ordering!
The paper discusses this, and the approach taken in the paper implements a number-flip stage, so numbers are formatted with their least significant figure first.
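Concretely, the reversal is just a data-formatting step; something like this sketch (my approximation, not the paper's exact format):

```python
def format_reversed(a: int, b: int) -> str:
    """Format an addition example least-significant-digit first, so the model
    can emit the answer in the same order it would do the carries.
    A sketch of the formatting idea, not the paper's exact data format."""
    rev = lambda n: str(n)[::-1]
    return f"{rev(a)}+{rev(b)}={rev(a + b)}"

print(format_reversed(867, 255))  # 768+552=2211  (867 + 255 = 1122, written reversed)
```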
What is the point of this work? 99% on 100-digit arithmetic means there's a 0% chance anyone will ever use a Transformer as an ALU or anything of the kind. We already know how to hard-code a (literally) infinitely more accurate addition machine.
And not only addition: all four arithmetic operations. The technique proposed in the article (imposing a strong inductive bias for addition) kind of works for multiplication, but not for subtraction or division (clearly; I can't even find the words in the paper). As a practical way to build a machine to do arithmetic, this is out of the question.
We've known how to mechanise arithmetic since the 1640s, with Blaise Pascal and his Pascaline. What is the point in demonstrating it's possible to reinvent a broken, partial, buggy version of an arithmetic machine if one tries really hard and shoehorns the necessary patterns into a neural net? We've known that for a long time, too (every proof that a neural net can simulate this or that Turing machine if you design the network diagram and set the weights by hand, ever).
So what is the point of this? Transformers are supposed to be the "sparks of AGI" and they can almost do arithmetic if we try very hard to shove it down their heads? Who cares?
Ok, you want the general answer? Consider a discrete-time Markov process with memory length N on a finite state space. Train a transformer with context length N on sample trajectories with SGD. Can you expect the transformer to become a good approximation for the dynamics of the Markov process? More specifically, suppose your Markov process is generated by some algorithm/Turing machine coupled with some random data. Then, can you expect the transformer to learn to emulate the behavior of the underlying Turing machine, even when run on data which was not in the initial distribution?
Another way to phrase it: Given a physical process that generates discrete time series trajectories, can our current transformer + SGD method learn to emulate the underlying physical processes by observing sample trajectories?
This question can be somewhat mathematically stated but it is quite difficult because there are still some words in there where I used common sense. For example mathematically there will always exist weird counterexamples, so you would have to quantify things very carefully. That's very difficult, so experiments are the best we can do right now.
Hence any instance where transformers fail to learn a Markov process is very interesting. Example: addition of random numbers.
Is addition a Markov process? I really don't think so. You can certainly model e.g. integer addition by a Markov process, up to some integer k but addition itself is usually formalised by the Peano axioms, that are not quite Markovian. I guess you can see the relation between S(n) and S(S(n)) as some kind of Markov chain. That's really not a standard view though.
In any case, a complete theory of addition must be correct up to infinity, so you won't get that with any Markov process we can train from data. Although you can learn addition with a simple linear regression, by setting the weights appropriately. That's because the equation of a line already encodes addition and multiplication, and that's basically not very different from what the team in the paper above is trying to do. Meaning: they're trying to hand-code the concept of addition in embeddings. It's not 100% because they're also at the same time trying not to 100% encode it, but it's a hard balance to strike.
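For what it's worth, the linear-regression point is easy to check numerically; this quick sketch (mine) fits ordinary least squares on (a, b) pairs with target a + b and recovers weights of 1 up to floating point, because the target really is linear in the inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 10**6, size=1000)
b = rng.integers(0, 10**6, size=1000)
X = np.stack([a, b], axis=1).astype(float)
y = (a + b).astype(float)

# Ordinary least squares: the fitted weights come out as [1, 1] (up to float error),
# because the target is exactly a linear function of the inputs.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # ~[1. 1.]
```

The catch, of course, is that this only works because the inputs are handed to the model as magnitudes rather than digit strings.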
> With positions resolved, we can study the logical extrapolation ability of transformers
They are interested in how well they can make a neural net logically extrapolate outside its training set, once encoding barriers are removed. They show that in fact even quite small language models can do this successfully once we're not confusing them with bad encodings anymore.
This seems like fundamental work. It was only a few years ago that Google employees were arguing LLMs were nothing more than "stochastic parrots". Well, that take will go down in history as one of the worst takes on AI ever. I don't think anyone seriously believed it by 2024, but the huge and opaque datasets meant people could always argue that maybe a given success wasn't an example of logical reasoning or extrapolation, maybe the model had just seen that specific question before. But this work shows in a controlled environment that the model can learn the principles of addition and extrapolate to much larger numbers. It's not just repeating answers it's seen in its dataset. It should kill off the parrot meme for good.
>> But this work shows in a controlled environment that the model can learn the principles of addition and extrapolate to much larger numbers.
No, because it's given hand-engineered embeddings that act as a strong inductive bias that is specific to addition. It's like addition is programmed right in.
It’s not about arithmetic but about embeddings. The positional embeddings used in transformers are rather simplistic. If they can add this one new capability to transformers by using different embeddings then maybe there are other capabilities that are within reach.
No, because those embeddings only work for addition (very weakly for multiplication and sorting). Imagine needing a specially-crafted bias for every single task. The Deep Learning revolution brought on by Convolutional Neural Nets was supposed to do away with the need to do exactly that.
I think there is a good reason to find low-hanging fruits that pay dividends on these types of tasks, not because solving addition with a transformer is a good idea, but because it could improve performance in other parts of the network. Maybe there are other subsequences that could be annotated in this way? Per paragraph, tokens per word, who knows.
Obviously, the "best" way to do addition on a computer is by doing it exactly.
>> I think there is a good reason to find low-hanging fruits that pay dividends on these types of tasks, not because solving addition with a transformer is a good idea, but because it could improve performance in other parts of the network.
The paper makes this claim but if they could do that, they'd have showed it already: instead their hand-crafted, artisanal embeddings only work well for addition and only weakly for multiplication and sorting, and not at all for other arithmetic operations.
One is that research into the limits of the architecture is useful. Maths has the nice property of being very easy to verify, and you can construct logical processes with it. It's a useful testbed.
Second, there are a lot more places where understanding how to do arithmetic helps, outside of just doing sums on their own.
>What is the point of this work? 99% on 100-digit arithmetic means there's a 0% chance anyone will ever use a Transformer as an ALU or anything of the kind. We already know how to hard-code a (literally) infinitely more accurate addition machine.
Nobody's going to be replacing calculators with transformers sure but many are and will be using transformers to solve problems arithmetic is a necessary component of.
>So what is the point of this? Transformers are supposed to be the "sparks of AGI" and they can almost do arithmetic if we try very hard to shove it down their heads? Who cares?
You don't need to shove anything down for transformers to get arithmetic. Just changing how numbers are tokenized works. But that requires an entire retrain so why not explore other techniques?
And what does any of this have to do with AGI? You know how terrible humans are at arithmetic, right?
Yes, but humans invented arithmetic. And then we invented computers that are much better than us at arithmetic calculations. That's a pattern we can observe all over the place: we're pretty damn good at inventing rich models of complex environments and processes but we're not very good at calculating the results of such models when that requires a lot of computation.
E.g., take chess. Modelling a game of chess as a game tree and searching the game tree by adversarial search is a human invention. Humans are pretty crap at searching a game tree beyond a handful of ply, but we can program a computer to go dozens of ply deep across thousands of branches, and beat any human.
So the challenge for AI is not to get computers to calculate when we know how the calculation is to be performed. The challenge is to get computers to create their own models. And that's a grand, open challenge that is not even close to be solved, certainly not by LLMs. Yann LeCun and Yoshua Bengio have said similar things.
The linked work doesn't move the needle any closer to that; it just shows progress in calculating arithmetic using a transformer, which we already know how to do in a myriad different ways and much more accurately. Hence my criticism of it.
I think most would argue Mathematics is a discipline that is discovered more than invented. That said, this isn't really the point I think.
A few humans invented/discovered arithmetic. Most humans will be born, live and die inventing absolutely nothing, even those with the opportunity and resources to do so.
It doesn't make sense to me that a bar most humans can't reach is the bar for General Intelligence of the Artificial kind. You can't eat your cake and have it.
Don't get me wrong. It's a fine goal to have. Of course we want machines that can invent things and push the frontier of science! It is still, however, a logical fallacy that an inability to do so would disqualify machines from general intelligence when it does not do so for humans.
>The challenge is to get computers to create their own models. And that's a grand, open challenge that is not even close to be solved, certainly not by LLMs.
LLMs have fairly complex models of the world made manifest by the data they're trained on.
>> Most humans will be born, live and die inventing absolutely nothing, even those with the opportunity and resources to do so.
I don't think that's right at all. I like to visit museums. You really get hit in the face with the unending creativity of the human mind and the variety of all that human hands have crafted over thousands of years across hundreds of cultures. I would go as far as to say that the natural state of the human mind is to create new things all the time. And mathematics itself was not created (invented or discovered) by one person, but by many thousands.
In any case, it doesn't matter if one instance of the class of human minds hasn't invented anything, in the same way that it doesn't matter if one car can't do 80mph. It's indisputable that we have the capacity for some novelty, and generality, in our thinking. Maybe not every member of the species will achieve the same things, but the fact is that the species, as a species, has the ability to come up with never-before seen things: art, maths, tech, bad poetry, you name it.
>> Lecun may disagree but some others like Hinton, Ilya and Norvig don't.
I'm with LeCun and Bengio. There's a fair amount of confusion about what a "model" is in that sense: a theory of the world. There's no reason why LLMs should have that. Maybe a transformer architecture could develop a model of the world- but it would have to be trained on, well, the world, first. Sutskever's bet is that a model can be learned from text generated by entities that already have a world model, i.e. us, but LeCun is right in pointing out that a lot of what we know about the world is never transmitted by text or language.
I can see that in my work: I work with planning, right now, where the standard thing is to create a model in some mathematical logic notation, that is at once as powerful as human language and much more precise, and then let a planning agent make decisions according to that model. It's obvious that despite having rich and powerful notations available there is information about the world that we simply don't know how to encode. That information will not be found in text, either.
Sutskever again seems to think that, that kind of information, can somehow be guessed from the text, but that seems like a very tall order, and Transformers don't look like the right architecture. You need something that can learn hidden (latent) variables. Transformers can't do that.
>In any case, it doesn't matter if one instance of the class of human minds hasn't invented anything, in the same way that it doesn't matter if one car can't do 80mph.
It does matter, depending on what claim you're making. We've not reached the upper bound of transformer ability. Until we clearly do, then it very much does matter.
>I'm with LeCun and Bengio. There's a fair amount of confusion about what a "model" is in that sense: a theory of the world. There's no reason why LLMs should have that.
See this is my problem with Lecun's arguments. He usually starts with the premise that it's not possible and works his way from there. If you disagree with the premise then there's very little left. "Well it shouldn't be possible" is not a convincing argument, especially when we really have very little clue on the nature of intelligence.
>Sutskever's bet is that a model can be learned from text generated by entities that already have a world model, i.e. us, but LeCun is right in pointing out that a lot of what we know about the world is never transmitted by text or language.
A lot of the world is transmitted by things humans don't have access to. Wouldn't birds that can naturally sense magnetic fields to intuit direction say humans have no model of the world? Would they be right? Nobody is trained on the world. Everything that exists is trained on small slices of it. A lot of the world is transmitted by text and language. And if push comes to shove, text and language are not the only things you can train a transformer on.
>Sutskever again seems to think that, that kind of information, can somehow be guessed from the text, but that seems like a very tall order,
I don't think this is as tall an order as you believe
>and Transformers don't look like the right architecture. You need something that can learn hidden (latent) variables. Transformers can't do that.
Unless I'm misunderstanding what you mean by hidden variables, it's very clear a transformer is regularly learning not just the sequences themselves but what might produce them.
>> Unless I'm misunderstanding what you mean by hidden variables, it's very clear a transformer is regularly learning not just the sequences themselves but what might produce them.
That's what I mean, but I don't think that's happening regularly, or at all. I don't see where the transformer architecture allows for this. Of course we can claim that any model of a process learned from examples is implicitly modelling the underlying sub-processes; for example, we can claim that a multivariate regression that predicts age at death from demographic data is somehow learning to represent human behaviour. But that's one of those big claims that need big evidence.
On the two works you link to, I know the one on mechanistic interpretability. As the author says:
Epistemic status: I feel pretty confident that I have fully reverse engineered this network, and have enough different lines of evidence that I am confident in how it works.
But I don't feel that confident at all that the author's confidence should instill confidence in myself. A clear, direct proof is needed, although of course we can discuss what a proof even means and how much it is a social construct etc.
The other paper I haven't read. I'm going to bet it's basically data leakage, which is a pervasive problem in deep learning work and is enough to invalidate many big claims about big results. I'll have to read the paper a bit more carefully.
But, again, what is in the transformer architecture that can predict hidden variables?
> What is the point of this work? [...] We already know how to hard-code a (literally) infinitely more accurate addition machine.
There are many situations where it is useful for the LLM to get basic arithmetic right.
For example, if someone asks your LLM to explain this line of code [1], which takes a 28x28 px input image, is the right explanation that 28×28÷4×64=9216? Or is that the wrong explanation?
And being able to get 100-digit arithmetic right 99% of the time might make us feel reassured that the 4-digit arithmetic we need from the model will be right an even higher percentage of the time.
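If the line behind [1] is something like the flatten-then-Linear step in the standard PyTorch MNIST example (an assumption on my part; the link isn't reproduced here), then the right explanation involves the conv and pooling output shapes, and 28×28÷4×64 = 12544 would be the wrong one. It's cheap to check:

```python
import torch
import torch.nn as nn

# Hypothetical architecture in the spirit of the standard PyTorch MNIST example;
# the actual code behind [1] may differ.
x = torch.zeros(1, 1, 28, 28)            # one 28x28 grayscale image
x = nn.Conv2d(1, 32, kernel_size=3)(x)   # 28 -> 26 (3x3 conv, no padding)
x = nn.Conv2d(32, 64, kernel_size=3)(x)  # 26 -> 24
x = nn.MaxPool2d(2)(x)                   # 24 -> 12
print(x.shape)                           # torch.Size([1, 64, 12, 12])
print(64 * 12 * 12)                      # 9216: what the following Linear layer expects
```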
Seriously? They say it right in the introduction. The goal is to learn how to infer algorithmic processes directly from data. Much like how MNIST was used in the early days of NNs, you have to start with small toy problems that are representative of the problem domain. Once you have success with that, you can scale up problem complexity.
General algorithmic capability is one of the key traits that we think AGI should have, and it’s currently missing. If you have a better approach for getting there quicker than everyone else in the field, please share it.
I would even appreciate seeing more papers on approaches that didn’t work very well so it saves other researchers from going in the wrong direction. That alone would be enough justification for publishing an article.
>> The goal is to learn how to infer algorithmic processes directly from data.
And they demonstrated nothing like that. An "algorithmic process" is not finding the weights for a function given some carefully designed bias. An algorithm is a sequence of operations that calculates the result of a function. Nothing like that has been demonstrated in the linked paper at all.
>> General algorithmic capability is one of the key traits that we think AGI should have, and it’s currently missing. If you have a better approach for getting there quicker than everyone else in the field, please share it.
It's not missing at all, you just won't find it in neural nets. And my PhD and post-doc research is exactly on that sort of thing: learning programs, algorithms and, currently, solvers for general planning problems.
Meanwhile I'm over here using Claude 3 Opus to do trig and calculus problems as well as generate the LaTeX representation of the equations. It doesn't need to be 100% accurate in my case (purely for fun), but I follow its reasoning and it's pretty consistent, at least enough for orders of magnitude and first-order effects. I was gonna post some of the chats about physics but probably nobody cares.
I did do some follow-up research. The math in its complex reasoning "tracks", but when I asked it to do 4-digit x 4-digit multiplication, it got most of it right except for a weird random digit error in the middle (?!) of the correct answer, lol. Now I want to run CLUTRR against Claude since it seems nobody has published that yet.
It's probably on par with or better than what humans get unaided. Hell, I'd bet due to transcription errors it's better than what humans get in a lot of settings, even when aided by a calculator.
I guarantee you professionals using math at work (for example in finance) do not have a 1% error quota. They use tools. We have tools. Nobody in any serious role (money, etc.) works unaided.
Math inference is a parlor trick, as is the whole "world model" bullshit: physics doesn't work with 99% accuracy.
It's the same reason agents are bullshit right now: error compounding at 95% reliability per step murders them, and currently there is no path to triple-nine reliability.