“Emergent” abilities in LLMs actually develop gradually and predictably – study (quantamagazine.org)



There are several issues with the study:

1. Replacing pass/fail accuracy with smoother alternatives (e.g., token edit distance) could be a terrible proxy for skill, depending on the task (see the sketch after this list).

2. Even by the authors' metrics, they _still_ find a few potentially emergent abilities.

3. Hindsight is 20/20. Yes, we can revisit the data and fiddle until we find transforms that erase emergence from aptitude plots. The fact is, folks used commonplace test-accuracy measurements, and the results were unpredictable and surprising. That's the truly notable phenomenon.
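
A toy illustration of point 1 (my own sketch, not from the paper; the target and the "model outputs" are made up): a partial-credit score based on edit distance climbs smoothly while exact-match accuracy sits at zero and then jumps, which is precisely why it can flatter or mislead depending on whether only the final answer matters.

    # Toy sketch: pass/fail vs. edit-distance scoring for addition answers.
    def levenshtein(a: str, b: str) -> int:
        """Classic dynamic-programming edit distance between two strings."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def edit_score(pred: str, target: str) -> float:
        """1.0 for a perfect answer, decreasing smoothly with edit distance."""
        return 1.0 - levenshtein(pred, target) / max(len(pred), len(target))

    # Hypothetical outputs from successively larger models, all for 123456 + 382:
    target = "123838"
    for pred in ["950000", "120000", "123000", "123800", "123838"]:
        print(pred, "exact:", int(pred == target), "edit-score:", round(edit_score(pred, target), 2))
    # exact-match reads 0, 0, 0, 0, 1 (an apparent jump); the edit score climbs gradually.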

I think there's value in the paper. Just...don't take its conclusions too far.


As mentioned later in the article: it doesn't really matter if you get an addition mostly right. You either get it right or you don't. I still appreciate their effort, though, because even after altering the grading system, there were still some emergent abilities.


Assume we have a child, and we test him regularly:

- Test 1: First he can just draw squiggles on the math test

- Test 2: Then he can do arithmetic correctly

- Test 3: He fails only on the last details of the algebraic calculation.

Now, even though he fails all the tests, any reasonable parent would see that he is improving nicely and will be able to work in his chosen field in a year or so.

Or, alternatively, if we talk about AI: we can set the test as a threshold, see that the results are continuously trending upwards, and expect the curve to breach the threshold in the future.

That is: measuring improvement, instead of pass/fail, allows one to predict when we might be able to use the AI for something.


With AI you can do millions of tests. Some tests are easy by chance (e.g., "Please multiply this list of numbers by zero"), and some are answered correctly by chance alone, whether easy or hard.

When you actually do these millions of tests, I don't think it really matters what the exact success metric is - an AI which is 'closer to correct, but still wrong' on one test will still get more tests correct overall on the dataset of millions of tests.


Human beings do arithmetic problems wrong all the time so I'm not sure "doing addition 100% right" is a merit of intelligence.

I'm not saying LLMs will achieve AGI (I don't know if they will, or whether we'll even know when they do). But somehow people seem to be judging AI's intelligence with this simple procedure:

1. Find a task that AI can't do perfectly. 2. Gotcha! AI isn't intelligent.

It just makes me question humans' intelligence if anything.


Arithmetic is extremely easy for a neural network to perform and learn perfectly. That LLMs fail to learn it, even though it is so easy, is strong evidence that LLMs have very limited capability to learn logical structures that can't be represented as grammar.

> Human beings do arithmetic problems wrong all the time

Humans built cars and planes and massive ships before we had calculators; that requires a massive number of calculations, all of which have to be correct. Humans aren't bad at getting calculations right, they are just a bit slow. Today humans are bad at it because we don't practice, not because we can't. LLMs can't do that today, and "can learn" versus "can't" is a massive difference.


My intuition is that a significant challenge for LLMs' ability to do arithmetic has to do with tokenization. For instance, `1654+73225` as per the OpenAI tokenizer tool breaks down into `165•4•+•732•25`, meaning the LLM is incapable of considering digits individually; that is, "165" is a single "word" and its relationship to "4", and in fact to each other token representing a numerical value, has to be learned. It can't do simple carry operations (or other arithmetic abstractions humans have access to) in the vast majority of cases because its internal representation of text is not designed for this. Arithmetic is easy to do in base 10 or 2 or 16, but it's a whole lot harder in base ~100k where 99% of the "digits" are words like "cat" or "///////////".
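
You can check this yourself; a small sketch using the tiktoken library (the exact splits depend on the tokenizer, so treat the cl100k_base choice here as an assumption):

    # Requires `pip install tiktoken`. Shows how an arithmetic expression is
    # split into multi-digit chunks rather than individual digits.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")   # encoding used by recent OpenAI models
    token_ids = enc.encode("1654+73225")
    pieces = [enc.decode([t]) for t in token_ids]

    print(token_ids)   # the integer IDs the model actually sees
    print(pieces)      # e.g. something like ['165', '4', '+', '732', '25']
    # Any carrying has to be learned across these opaque chunks, not digit by digit.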

Compare that to understanding arbitrary base64-encoded strings; that's much harder for humans to do without tools. Tokenization still isn't _the_ greatest fit for it, but it's a lot more tractable, and LLMs can do it no problem. Even understanding ASCII art is impressive, given they have no innate idea of what any letter looks like, and they "see" fragments of each letter on each line.

So I'm not sure if I agree or disagree with you here. I'd say LLMs in fact have very impressive capabilities to learn logical structures. Whether grammar is the problem isn't clear to me, but their internal representation format obviously and enormously influences how much harder seemingly trivial tasks become. Perhaps some efforts in hand-tuning vocabularies could improve performance in some tasks, perhaps something different altogether is necessary, but I don't think it's an impossible hurdle to overcome.


I don't think that's really how it works - sure, this is true at the first layer of a neural network, but in deep neural networks, after the first few layers, the LLM shouldn't be 'thinking' in tokens anymore.

The tokens are just the input - the internal representation can be totally different (and that format isn't tokens).


Please don't act like you "know how it works" when you obviously don't.

The issue is not the fact that the model "thinks or doesn't think in tokens". The model is forced at the final sampling/decoding step to convert its latent back into tokens, one token at a time.

The models are fully capable of understanding the premise that they should "output a 5-7-5 syllable haiku", but from the perspective of a model trying to count its own syllables, this is not possible: its vocabulary is tokenized in such a way that not only does the model lack direct phonetic information within the dataset, it literally has no analogue for how humans count syllables (counting jaw drops). Models can't reason about the number of characters or even tokens used in a reply for the same exact reason.

The person you're replying to broadly is right, and you are broadly wrong. The internal format does not matter when the final decoding step forces a return of tokenization. Please actually use these systems rather than pontificating about them online.


Thank god we aren’t talking about a model counting syllables then.


That requires converting from a weird, unhelpful form into a more helpful form first. So yes, but the tokenisation makes things harder, as it adds an extra step: they need to learn how these things relate while having significant amounts of the structure hidden from them.


This conversion is inherent in the problem of language and maths though - Two, too (misspelt), 2, duo, dos, $0.02, and one apple next to another apple, 0b10 and 二 can all represent the (fairly abstract) concept of two.

The conversion to a helpful form is required anyway (also, let's remember that computers don't work in base 10, and there isn't really a reason to believe that base 10 is inherently great for LLMs either).


It is, but there's a reason I teach my son addition like this:

    hundreds | tens | ones

        1        2      3
    +   2        1      5
    -----------------------
        3        3      8
Rather than

unoDOOOOS(third) {}{}{} [512354]_ = three"ate

* replace {}{}{} with addition; {}{} is subtraction unless followed by three spaces, in which case it's also addition
* translate and correct any misspellings
* [512354]: look up in your tables
* _ is 15
* dotted lines indicate repeated numbers

Technically they're doing the same thing. One we would assume is harder to learn the fundamental concepts from.


Right, which is why testing arithmetic is a good way to test how well LLMs generalize their capabilities to non-text tasks. LLMs can in theory be excellent at it, but they aren't, due to how they are trained.


The tokens are the structure over which the attention mechanism is permutation equivariant. This structure permeates the forward pass; it's important at every layer and will be until we find something better than attention.


> Arithmetic is extremely easy for a neural network to perform and learn perfectly

That'd depend on the design of the neural net and training objective.

It's certainly not something that comes naturally to an LLM, which has neither numbers as inputs nor as outputs, and is not trained with an arithmetic objective.

Consider inputting "12345 * 10" into GPT-4. First thing it is going to do is tokenize the input, then embed these tokens, and these embedding vectors are then the starting point of what the transformer has to work with...

https://platform.openai.com/tokenizer

You can use OpenAI's tokenizer tool (above) to see how it represents the "12345 * 10" character sequence as tokens, and the answer is that it breaks it down into the token ID sequence [4513, 1774, 353, 220, 605]. [4513, 1774] represent the character sequence "12345", and "605" represents the character sequence "10".

These token IDs will then be "embedded", which means mapping them to points in a very high-dimensional space (e.g. 4096-D for LLaMA 7B), so each of those token IDs becomes a dense vector of 4096 real numbers, and these vectors are what the model itself actually sees as input.
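
A minimal sketch of that embedding lookup (PyTorch; the sizes mirror the numbers above, and the weights here are random rather than the model's learned ones):

    # Sketch of the token-ID -> embedding-vector step (illustrative sizes only).
    import torch

    vocab_size, d_model = 5_000, 4096                 # a real table is ~100k rows; shrunk so the sketch stays light
    embed = torch.nn.Embedding(vocab_size, d_model)   # lookup table of dense float vectors

    token_ids = torch.tensor([4513, 1774, 353, 220, 605])   # "12345 * 10" per the tool above
    vectors = embed(token_ids)

    print(vectors.shape)   # torch.Size([5, 4096])
    print(vectors.dtype)   # torch.float32 -- real-valued, not bits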

So, for "12345 * 10", what the model sees during training is that whenever it sees V1 V2 V3 V4 it should predict V5, where V1-5 are those 4096-D input token embeddings. The model has no idea what any of these mean - they might represent "the cat sat on the mat" for all it knows. They are just a bunch of token representations, and the LLM is just trying to find patterns in the examples it is given to figure out what the preferred "next token" output is.

So, could you build (and train) a neural net to multiply, or add, two numbers together? Yes you could, if that is all you want to do. Is that what an LLM is? No, an LLM is a sequence predictor, not an NN designed and trained to do arithmetic, and all that is inside an LLM is a transformer (sequence-to-sequence predictor).


I know why it is hard for LLMs to learn this; that was the whole point. The way we make LLMs today means they can't identify such structures, and that is strong evidence they won't become smart just by scaling, since all the things you brought up will still be true as we scale up.

To solve this you would need some sub-networks that are pretrained to handle numbers and math and other domains, and then you start training the giant LLM so it can find and connect those things. But we don't know how to do that well yet, AFAIK, and I bet all the big players have already tested things like that. As you say, adding capabilities to the same model is hard.


An LLM can learn to identify math easily enough; it's just that performing calculations using language alone isn't very efficient, even if it's basically what we do ourselves. If you want an LLM to do it like us, then give it a pencil and paper ("think step by step").

If you want the LLM to be better than a human at math, then give it a calculator, or access to something like Wolfram Alpha for harder problems. Your proposed solution of "give it a specialized NN for math" is basically the same, but if you are going to give it a tool, then why not give it a more powerful one like a calculator?!


Humans were terrible at getting calculations right - that's why we invented abacuses, slide-rules, books of mathematical tables and tabulation machines.


Humans invented those since we are slow and have limited working memory. But we managed to invent those since we understand how to perform reliable calculations.


Yes, but that acknowledges that there is a difference between understanding how to perform reliable calculations, and actually being able to perform reliable calculations.

Humans are good at the former, but not the latter.


Humans are good at performing reliable calculations with pen and paper. That is the same kind of tool that LLMs work with. I'm not sure why humans can do that but not LLMs; the task should be way easier for an LLM.


> Humans are good at performing reliable calculations with pen and paper.

Speak for yourself. Even though I've always been strong at my conceptual understanding and problem solving in math, I always found it difficult to avoid arithmetic mistakes on pen and paper and could never understand why I was assessed on that. I could have done so much better in high-school math if I was allowed to use a programmable computer for the calculations.

And I think it's the same for LLMs: we shouldn't assess them on doing the arithmetic in a single pass, but rather on writing the code to perform the calculation, and responding based on that.


Maybe a lot of people suffer from a degree of dyscalculia, but in my experience, if you do it a lot you just stop making mistakes. Not just me; I've seen many others reliably do calculations pretty quickly without making errors. You just do everything twice as you go, and then arithmetic errors go to basically zero.

But I do acknowledge that there are probably some or many humans who can't reach that level of reliability with arithmetic.


LLMs (internally) don't have a pen-and-paper equivalent. Their output is the output of their neurons. It's like if I were a head on a table with a screen on my forehead that printed out my thoughts as they appeared in my head. Ask (prompt) me my favorite color and "green" would show up on the screen.

This is why prompting LLMs to show their steps works so well: it makes them work through the problem "in their head" more efficiently, rather than just spitting out an answer.

However, you can give LLMs external access to tools. Ask GPT4 a particularly challenging math problem, and it will write a Python script and run it to get a solution. That is an LLM's "pen and paper".


> That is an LLM's "pen and paper".

No, that is an LLM's calculator or programming; it doesn't actually do the steps when it does that. When I use pen and paper to solve a problem I do all the steps on my own; when I use a calculator or a programming language, the tool does a lot of the work.

That difference is massive, since using a calculator doesn't help me learn numbers, how they interact, and how algorithms work, while doing the steps myself does. So getting an LLM that can reliably execute algorithms like humans can is probably a critical step towards making them as reliable and smart as humans.

I do agree, though, that if LLMs could keep a hidden voice they used to reason before writing, they could do better. But that voice being shown to the end user shouldn't make the model dumber; you would just see more spam.


You are splitting hairs on technicalities here. You need to do a lot of "steps" to write a program that solves your question. Debatably even more steps and more complexity than using pen and paper.

Maybe we should be giving the LLMs MS Paint instead of Python to work out problems? There is nothing unique or "human" about running through a long division problem, it is ultimately just an algorithm that is followed to arrive at a solution.


> There is nothing unique or "human" about running through a long division problem, it is ultimately just an algorithm that is followed to arrive at a solution.

Yes, which is why we should try to make LLMs do them, and that way open them up to learning a much more complex understanding of algorithms and instructions that humans have yet to build tools for.

> You need to do a lot of "steps" to write a program that solves your question. Debatably even more steps and more complexity than using pen and paper.

What does this have to do with anything? I am highlighting a core deficiency in how LLMs are able to reason; you saying that what they currently do is harder doesn't change the fact that they are bad at this sort of reasoning.

And no, making such a program doesn't require more steps or understanding. You Google for a solution and then paste in your values; that is much easier to teach a kid than teaching them math. I am sure I could teach almost any 7-year-old kid to add two numbers by changing values in a Python program in about an hour, much faster than they could learn math the normal way. Working with such templates is the easiest task for an LLM; what we want is to get the LLM to do things that are harder for it.


Here is a prompt you can plug into GPT4:

"I have a problem for you to solve. Muffins sell for $3/each. rick bakes 30 muffins a day. Tom bakes 2 muffins monday, 4 tuesday, 6 wednsdays, up to 14 on sunday. On days which tom and jerry combined bake more than 41 muffins, the price of the muffins drops to $2.50. How much total revenue do rick and tom take in during a full week, combined."

Please tell me how ChatGPT4 writing a script to solve that is not logical reasoning, while a human pulling out pen and paper to do it is...


> Please tell me how ChatGPT4 writing a script to solve that is not logical reasoning, while a human pulling out pen and paper to do it is...

I changed the prompt a bit (made all the numbers 3-4 digits) and GPT-4 answered with this; it just made up numbers for the days that you didn't give numbers for, so it failed before it even came to arithmetic. Here is what it said after I said this about Tom: "Tom bakes 2911 muffins monday, 491 tuesday, 699 wednsdays, up to 149 on sunday." It just assumed Sunday's number applied to all the other weekdays not given; a human wouldn't do that, and it missed the "up to" statement. Maybe the large numbers I gave threw it off, but if that is enough to throw it off, that just shows that it can't really reason.

So thanks for that, more evidence these models are bad at reasoning.

Here is the first part of what it responded with; it is wrong already here:

   First, let's calculate the number of muffins baked by Tom during the week:

   Monday: 2911
   Tuesday: 491
   Wednesday: 699
   Thursday: 149
   Friday: 149
   Saturday: 149
   Sunday: 149
Edit: Here it made an arithmetic error just below; the error is that 4062 is not greater than 4199. So, two critical errors. I taught math at college for years and you wouldn't find many students making mistakes like this:

   Let's determine the days when Tom and Rick combined bake more than 4199 muffins:

   Monday: 2911 (Tom) + 3571 (Rick) = 6482
   Tuesday: 491 (Tom) + 3571 (Rick) = 4062
   Wednesday: 699 (Tom) + 3571 (Rick) = 4270

   On Monday, Tuesday, and Wednesday, they bake more than 4199 muffins combined, so the price of the muffins drops to $2851.50 on those days.


Just so we have this straight, you completely changed the nature of the problem (by turning a perfect information problem into an imperfect information problem) and then are looking at me with a straight face to make your point? Please...

Unless of course you didn't realize that Tom has a pattern to his baking, at which point the irony becomes palpable.

And on top of that, I am willing to bet if you give me your prompt, I would be able to restructure it in such a way that GPT4 would be able to answer it correctly. More often than not, people are just really bad at properly asking it questions.


> Just so we have this straight, you completely changed the nature of the problem (by turning a perfect information problem into an imperfect information problem) and then are looking at me with a straight face to make your point? Please...

I used your exact quote and just changed the numbers; it is still a perfect information problem.

Or, ah, right: you mean you gave me an imperfect information problem, since you assumed the reader would guess those values. Yeah, I read it as a perfect information problem where all values were given, and then you would give the income as a range of possible values based on how many muffins were baked on Sunday. None of the LLMs I sent it to managed to solve it entirely; it is a pretty easy problem.

A reasonable way to parse your sentence is:

   Monday: 2, Tuesday: 4, Wednesday: 6, Sunday: 0-14, rest: doesn't work so 0
> Unless of course you didn't realize that Tom has a pattern to his baking, at which point the irony becomes palpable.

If you didn't say he baked on those days then he didn't bake on those days. The specification is clear. If I say "I will bake 2 muffins on Tuesday and 6 muffins on Sunday", the reasonable interpretation is that I won't bake anything the rest of the days. Why would you assume he baked anything at all on those days?

Or if I say "Emily will work Mondays and Thursdays", do you just guess the rest of the days she will work? No, you assume she just works those days.

Is that a standard problem you wrote from memory? Not sure why you would assume there were muffins baked on the days you didn't list.

For example, if I say Tom bakes up to 14 muffins on Sunday, then the reasonable interpretation is that Tom will bake 0-14 muffins on Sunday. Maybe you should write the prompt more clearly if you mean something else? Because as written, anyone would assume that he didn't bake on the other days, and on Sundays he baked up to 14 muffins.

Anyway, it failed even with your "up to" interpretation, where the reader is meant to fill in the values; it still made that math error. But it using your "up to" interpretation there is a huge red flag, since in a real environment nobody would give that kind of information as a riddle with hidden values; you would specify all the values for each day each person worked, and for the rest you would assume the person just isn't working and baked 0 muffins. If the LLM starts to guess values for patterns and words where it doesn't make sense, then it is really unreliable.


I can see why some humans would struggle with the phrasing.

Thankfully GPT4 has strong reasoning skills and knew exactly what I meant.

https://chat.openai.com/c/b0ed06f1-c0d3-46a6-b07c-289b328417...

I encourage you to see the chat yourself, and would love to hear how it's not reasoning.

Edit: Fixed Link: https://chat.openai.com/share/991ca8af-f735-436f-bfc2-5df929...

You can click the [>_] at the end for the code generated.

Seems I have hit the reply cut-off.


I just get this from your link

   Unable to load conversation b0ed06f1-c0d3-46a6-b07c-289b328417bb


> For example, if I say Tom bakes up to 14 muffins on Sunday, then the reasonable interpretation is that Tom will bake 0-14 muffins on Sunday.

I don't have a stake in this muffin game, but that's indeed how I interpreted the instructions when reading them.

Had it said "and so on up tp 14 on Sunday" I would assume he baked each day.


> That is an LLM's "pen and paper".

No, that's an LLM's Python playground.

An LLM's "pen and paper" is "think step by step" where it gets to see it's own output to keep track of what it is doing.

I'd expect that with appropriate prompting one could get a good model to one/few-shot learn how to do addition this way.


LLMs do that OK too... just not for crazy complex equations that would be tough for most humans with pen and paper.

See below which I have just run on GPT4: https://chat.openai.com/share/3adb3aa2-8aec-474f-bdb0-4d761d...


I know they can do that, but not as reliably as I can, for example, or as typical engineers from 80 years ago could. I did engineering exams without a calculator: I just did all the calculations with pen and paper and didn't make mistakes. It just takes a bit longer, since calculating trigonometric functions takes a while, but still not a lot of time compared to how much you have.

That was how everyone did it back then; it really isn't that hard to do. Most people today have never tried to do it, so they think it is much harder than it actually is.


> strong evidence that LLMs have very limited capability to learn logical structures that can't be represented as grammar.

To add multi-digit numbers requires short term memory (are we on the units, or tens? was there a carry?), which LLMs don't have, so that's really the issue.

The normal workaround for lack of memory in LLMs is "think step-by-step", using its own output (which gets fed back in as an input) as memory, and I'd assume that with appropriate training data and prompting an LLM could learn to do it in this fashion - not just giving the answer, but giving all the steps.

I suppose in theory LLMs could do limited-precision math even without memory, if they did it in a single pass through their stack of transformer layers (96 for GPT-3) - use the first few layers to add units and generate the carry, the next few layers to add tens, etc. I'm not sure how, or even if, one could train them to do it this way though - perhaps via a curriculum training agenda of first adding single-digit numbers, then two-digit ones, etc.?
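
For reference, the per-digit procedure being described - the only "memory" needed at each step is the current column and the carry (a plain Python sketch of the grade-school algorithm, not of anything an LLM actually does internally):

    def add_step_by_step(a: int, b: int) -> int:
        """Grade-school addition (non-negative integers only): walk the digits
        right to left, carrying as we go."""
        xs, ys = str(a)[::-1], str(b)[::-1]    # least-significant digit first
        carry, digits = 0, []
        for i in range(max(len(xs), len(ys))):
            x = int(xs[i]) if i < len(xs) else 0
            y = int(ys[i]) if i < len(ys) else 0
            total = x + y + carry              # the running "state" is just carry + position
            digits.append(str(total % 10))
            carry = total // 10
        if carry:
            digits.append(str(carry))
        return int("".join(reversed(digits)))

    assert add_step_by_step(1654, 73225) == 1654 + 73225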


> Arithmetic is extremely easy for a neural network to perform and learn perfectly. That LLMs fail to learn it, even though it is so easy, is strong evidence that LLMs have very limited capability to learn logical structures that can't be represented as grammar.

IDK, there was an article posted here about yet another LLM that performed very badly on the math tests because they mistakenly left out all the math training data.

What impressed me was that it could learn any math at all from just 'reading' books or whatever. Though, perhaps, any correct answer could be attributed to pure luck, dunno.


> Arithmetic is extremely easy for a neural network to perform and learn perfectly

Is it?


Yes, loss minimization quickly gets to the correct implementation of arithmetic, since the primitives of neural networks are just math operations, so training one to add or multiply two inputs into an output is very easy. This is so easy and obvious that you run it to test that your neural network implementation works; if it can't figure out arithmetic, then you have done something wrong.

LLMs fail to figure out that this is what they have to do; instead they seem to have a ton of specialized rules to handle arithmetic, which produce a lot of errors in the output and are extremely expensive to run.
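
For what it's worth, the kind of sanity check being described looks roughly like this (my sketch, with arbitrary hyperparameters, not anyone's official test): in this toy setup, a single linear layer fed the two numbers directly recovers addition almost exactly, including on inputs far outside the training range.

    # Sanity-check sketch: a tiny linear model trained on (a, b) -> a + b recovers
    # weights of almost exactly [1, 1], i.e. it "learns addition".
    import torch

    torch.manual_seed(0)
    model = torch.nn.Linear(2, 1, bias=False)             # no bias term, for simplicity
    opt = torch.optim.SGD(model.parameters(), lr=1e-5)

    for step in range(5000):
        ab = torch.rand(256, 2) * 100                     # random pairs in [0, 100)
        target = ab.sum(dim=1, keepdim=True)              # ground truth: a + b
        loss = torch.nn.functional.mse_loss(model(ab), target)
        opt.zero_grad()
        loss.backward()
        opt.step()

    print(model.weight.data)                                 # approximately [[1.0, 1.0]]
    print(model(torch.tensor([[1654.0, 73225.0]])).item())   # approximately 74879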


So the networks you mentioned aren’t LLMs? Why is that a correct comparison then? Like blaming a human that they can’t jump like a cat or multiply like an arbitrary-precision library.


> So the networks you mentioned aren’t LLMs? Why is that a correct comparison then

Because an LLM is a neural network, and neural networks contain neural networks. There is nothing stopping it from having an embedded neural network that learned how to do computations well, except an inability to identify such structures and patterns well enough to train for it.


Tokenizing ‘1735’ as a value of 1735 because you’ve seen a lot of math is probably the most difficult part.


> It just makes me question humans' intelligence if anything.

A more serious byproduct of the tendency to talk about machines in anthropomorphic terms is the companion phenomenon of talking about people in mechanistic terminology. The critical reading of articles about computer-assisted learning —excuse me: CAL for the intimi— leaves you no option: in the eyes of their authors, the educational process is simply reduced to a caricature, something like the building up of conditional reflexes. For those educationists, Pavlov’s dog adequately captures the essence of Mankind —while I can assure you, from intimate observations, that it only captures a minute fraction of what is involved in being a dog.

https://www.cs.utexas.edu/users/EWD/transcriptions/EWD09xx/E...


I mean I could equally say that the opposing bias is

1. Choose a few good blunt instruments we use to gatekeep students on the premise that it tests their "intelligence" (or wait, do we mean subject matter comprehension with this one?)

2. Apply a big ol' machine learning model to those tests

3. Whoa, it's smarter than a third grader! OMG it's smarter than a lawyer! You guys, this must be ASI already!

Rhetoric and selective rigor can justify any perspective. Smart and stupid arguments can be made for any position. Water is wet

I also can't claim to know with certainty whether transformers are going to end up being AGI in some meaningful sense, but I will definitely say that we've created a lot of rubrics for assessing human intelligence that mostly exist for expediency, and a cursory glance at education should tell you there's a lot of Goodhart's Law going on with all of 'em. I know for a fact I can do a damn good job on your average multiple choice test on knowing some etymology and being good at logical elimination, and I can bullshit my way through an essay, both without taking the class, and I view this more as a flaw in the instrument than evidence that I'm a godlike superintelligence that can just know anything without studying it. Humans make a lot of tests that are soft to bullshitting with a little pattern-recognition thrown in


> it doesn't really matter if you get an addition mostly right.

This is definitely not true in the real world. Approximate solutions are often good enough to answer a question.


But then it isn't "addition" but approximating "addition".


There's always some uncertainty to any kind of answer or computation; it's just that at some threshold we unconsciously decide to take it for a fact.


LLMs aren't wrong by a small percentage, they are wrong by a small number of tokens. They can miss a zero or be off by 100% and it's just a token difference; to the LLM that is a minor mistake, since everything else was right, but it is a massive mistake in practice.


I watch math classes on YouTube and some lecturers make symbolic mistakes all the time. A minus instead of a plus, missing exponents, saying x but writing y, etc. They only notice it when something unexpected contradicts them down the line.


They got it right, as you said; it just took a bit longer. That doesn't contradict what I said: humans can get things right very reliably by looking over the answers, especially if you have another human to help look at them. An AI isn't comparable to a human, it is comparable to a team of humans, and two ChatGPTs can't get more accurate by correcting each other's answers, but two humans can.


A professor might be able to iterate to a correct answer but a student might not.

And ChatGPT is definitely able to improve its answer by iterating; it just depends on the toughness of the problem. If it's too difficult, no amount of iteration will get it much closer to the correct answer. If it's closer to its reasoning limits, then iterating will help.


But if you stop them just there, an error persists. A professor is “multi-modal” and in a constant stream of events, including their lecture plan and premeditated key results. Are you sure that at some level of LLM “intelligence”, putting it into the same shoes wouldn’t improve the whole setting enough? I mean sure, they make mistakes. But if you stop-frame a professor, they make mistakes too. They don’t correct immediately, only after a contradiction gets presented. Reminds me of how LLMs behave. Am I wrong here?

Edit: I was replying to the GP, no idea how my post got here


Asking the LLM to correct itself doesn't improve answers, since it will happily add errors to correct answers when asked to correct them. That makes it different from humans: humans can iterate and get better, our current LLMs can't.

> But if you stop them just there, an error persists

But humans don't stop there when they are making things that need to be reliably correct. When errors aren't a big deal humans make a lot of errors, but when errors cost lives humans become very reliable by taking more time and looking things over. They still sometimes make mistakes that kill people, but very rarely.


So many things contribute to human error that it is probably impossible to make a 1-to-1 parallel with LLMs. For instance, the fact that you are being recorded causes a significant performance drop in many cases.


What uncertainty and threshold is there in the addition of integers, for example (within mathematics and the usual definitions)? Or in Boolean logic with the "and" operation?

I don't think everything has uncertainty and thresholds to it, especially when it actually resides outside of a technical implementation.


To verify the answer you'll always need to trust the technical implementation that's doing the computation. Doesn't matter if it's our brains or a calculator.

Somewhere between "it's always wrong" and "it's always right unless the bits got flipped by cosmic rays" we deem the accuracy to be good enough.


Disagree: the theory exists outside of any specific technical implementation (every single one of those could be wrong, for example). You might not be able to verify something without being subject to random errors, but that doesn't mean the theory itself is subject to random errors.

Any implementation (or write-up, etc.) of something can have errors, but the errors are in the implementation and do not give rise to uncertainty outside of the implementation. There is no uncertainty as to what the sum of two integers should be (within the usual mathematics).


> it doesn't really matter if you get an addition mostly right

Back-of-the-envelope / mental math often works like that, and it's something that humans regularly use, so clearly it has some use.


Tell it to floating point numbers.


> They find that with a different measuring stick, emergence vanishes.

Isn’t that the case for most/all emergent behavior? If you change the scale and watch individual water molecules, you’d see them snapping into a crystal structure one by one instead of a sudden emergent block of ice.


Not quite. The problem is that the term is especially poorly defined in ML. I elaborate more here[0]. You are describing emergence, but not what was claimed when LLMs were said to have emergent abilities (the distinction is explained in the article, FWIW).

[0] https://news.ycombinator.com/item?id=39812315


But knowing the molecular structure at 50C and 75C tells you very little about the freezing point.

Different example: if you measure the number of cases of a given virus, it will either spread across the entire globe (R0 > 1, e.g. COVID-19) or fail to spread widely (R0 < 1, e.g. Ebola). Even though it's not completely binary, it's emergent behaviour because it looks binary. But if you were to measure R0 directly, you'd see a gradual increase, and could predict future variants/vaccine efficiencies/etc. much more easily.

"Emergent" refers to eg. sigmoids, while "gradual" refers to eg. linear or logarithmic functions.


Nobody mistakes ice for something suddenly emergent though; it's obvious to the naked eye that it appears gradually.


I don't know what exactly "emergent" means in this context, but doesn't ice appear pretty rapidly? It appears between -1 and 1 degrees, and when cooling water down from 100 to 75 to 50 to 25 you would learn nothing about the freezing point.


Not to mention, if water is pure and still enough, you can cool it below the freezing point and it still looks like a liquid, until it is disturbed and then it suddenly doesn't. Same with heating it above the boiling point.


Also with heating? So all the bubbles rising up way before boiling, and the visible evaporation, do not happen with distilled water? Hard to believe, I have to say. You would have to heat all the water perfectly evenly, to stop the movement between areas of different temperature. Microwaving it?



Thx. And maybe not the best idea, but now I want to try this ..


> maybe not the best idea, but now I want to try this ..

I'm going to steal this for my epitaph, thanks


Have fun with it.

Mine would currently say:

Por tu culpa!

(this is all your fault)

But depending on the circumstances, I would be also fine with the one above. But very likely I would not care that much anymore ..


Every volume of water IRL will disagree about the rapidity though: go below 0 and first it's water, then it has some ice, then more ice, and only after a while is it eventually all ice (and even then maybe it has some water underneath, who knows). I can't see how that is not common sense. But with LLMs the common sense until now was "they can't, then they suddenly can".


The water usually isn't below 0 in those cases.


My thermometer says it's -5, yet the water is liquid. To be more precise, covered with thin ice. And covered with (thicker) ice it shall probably stay, all the way to -20. Whatever technical correctness you throw at it doesn't matter to common-sense logic.

And that is even if there is a technical possibility to lower the temperature of every water molecule to -0.0001 simultaneously and demonstrate that the entire volume freezes in an instant. It'd be irrelevant, since none of us live in that lab.


Your water is -5C? Is it salt water?


Thanks for another example!


For what - that different liquids have different freezing points? Not sure anyone is doubting that.


What are you arguing for exactly? That there may in theory exist some abstract spherical volume of perfect water in a vacuum that freezes entirely in an instant, which none of us will ever encounter in our lives? And that this is exactly what we should imagine when we think of how ice forms -- humanity's mental image should be of water instantly transforming into ice? Please clarify :)


Clear water freezes exactly at 0 degrees. I guess the graduality depends on whether you think about this as interpolating the water temperature, or interpolating the energy given to / taken from a body of water.


> Clear water freezes exactly at 0 degrees.

Water never arrives at 0 degrees all at once, so for practical purposes no. Even if it did, you cannot measure the temperature of every water molecule simultaneously to prove it.


In a way, yes. Ice in nature usually forms by water cooling heterogeneously, not by the water being below 0 all the time and everywhere, with the ice then forming slowly in some areas.


Not just in nature: your glass of water in the freezer also has an uneven temperature throughout its volume and becomes ice gradually (though faster than a lake).


The paper: Are Emergent Abilities of Large Language Models a Mirage? https://arxiv.org/abs/2304.15004


It's always easier to predict the future after it's arrived.

There may be some emergent features that represent phase changes that are truly difficult, even in retrospect, to predict, but I doubt these are common (does anyone even have a single convincing LLM example?).

I think the more normal case is that higher level capabilities are dependent on multiple lower level ones in ways that may be hard to predict. These smooth improvements in the constituent building blocks may often be there, but you'd have to know where to look (i.e. what the critical building blocks are).

In order to predict emergent capabilities you'd need to have identified ahead of time what the necessary building blocks are, and maybe done some simulation of what level each of those would need to be at to support the anticipated/hoped-for behavior.

Of course it's not just about scaling model size and/or data, but also about type and quality of data, where there may be abrupt changes between model versions. It's going to be very difficult ahead of time to analyze what new patterns/manipulations (i.e. building-block capabilities) the model will be able to learn from an updated training set, and therefore what emergent abilities can be built out of those new blocks.

I wonder how often the opposite happens, and model designers have been able to successfully identify that "in order to do X, we need capabilities A, B and C, and in order to get capabilities A, B and C, we need new datasets P & Q"? In a hypothetical case like that they'd have been able to measure progress towards capability X.


> I wonder how often the opposite happens, and model designers have been able to successfully identify that "in order to do X, we need capabilities A, B and C, and in order to get capabilities A, B and C, we need new datasets P & Q"? In a hypothetical case like that they'd have been able to measure progress towards capability X.

As far as I know, before transformers that just didn't happen at all, since models couldn't contain that many separate skills without them interfering with each other. This whole thing of models having many high-quality capabilities is still pretty new.

But yeah, I think we need something like that, and all the best LLMs today probably already do it, but none have said what they do, so that is just speculation.


I'm sure that there are plenty of cases where model designers are at least attempting to pursue new capabilities in a targeted approach (but how often to this degree of complexity?), while at the same time realizing that new models+datasets will also have unanticipated new capabilities.

Before LLMs (whether transformer-based or not) most NNs were built to perform single tasks - a single objective, so having multiple higher-level capabilities was essentially out of the question. Of course LLMs nominally only have a single objective too, predict next word, but really they are targeting language.

In the GOFAI era of rule-based symbolic AI there were also some systems/approaches that had multiple skills (e.g. expert systems like CYC, or cognitive architectures like SOAR), so maybe there are forgotten lessons there on decomposability of skills.


> But with other tasks, the jump in ability wasn’t smooth. The performance remained near zero for a while, then performance jumped. Other studies found similar leaps in ability.

Wow, is the title of the submission inaccurate or what!


That paragraph outlines the findings of previous research, findings that this paper challenges.


I believe in this context they mean these abilities are purposely being worked on and brought into existence; they didn't just pop up out of nowhere.

A significant portion of the world was blindsided by the sudden emergence of "AI", but there were people that knew these things were coming.


The partial credit approach is fine, but if we want to train a model to get the right answer, then it does matter.

What I've noticed is that my loss curves on small models for arithmetic will reach a steady state where they're still getting the answer wrong even if some digits are right. Yes, you can keep training, but the number of training epochs necessary seems to grow exponentially as the model size shrinks.

So a model with x parameters is going to take something like n^2 times as long as a model with 2x parameters.

At a certain parameter count this basically means it is nigh impossible to get the right answer via training using gradient descent

The more parameters, the easier it is to drive to convergence, which is a real, important metric

At some point the estimated time for the ability to emerge spontaneously becomes greater than a human lifetime, or even the lifetime of all humanity. So in the sense that increasing model size makes it tractable... I think it's fair to say that the ability emerges suddenly enough.


I think the point is that if you want a model that gives the right answer, you should still use partial credit to figure out how far away you are from that goal instead of binary correct-or-not accuracy.

If you use a metric where improvement happens suddenly and unpredictably, you can't even estimate how much longer you need to train, because the ability might emerge spontaneously. But if the partial credit metric improves smoothly and predictably, you have a chance of extrapolating your training progress to see when you might hit your accuracy target, better than if you extrapolated accuracy directly.

And if you find that the estimated time is too long and you decide to train a bigger model, you can try to extrapolate across model sizes to estimate how big of a model you might need.


To those still missing the point: it's easy to come up with a metric that's guaranteed to show a huge jump.

E.g., imagine a big corpus of arithmetic operations (say 2,500 additions, 2,500 subtractions, 2,500 multiplications and 2,500 divisions): you can test when a model passes computing all of those correctly, and it's probably obvious that you'll have a long list of "didn't pass" until you suddenly get it passing.
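
A back-of-the-envelope sketch of why (the per-question accuracies here are invented for illustration): if per-question accuracy improves smoothly, the "got the whole corpus right" metric stays at essentially zero and then shoots up.

    # Toy model: per-question accuracy p rises smoothly, but the probability of
    # getting all N questions right is p**N, which behaves like a step.
    N = 10_000   # 4 x 2,500 arithmetic questions

    for p in [0.990, 0.995, 0.999, 0.9995, 0.9999, 0.99999]:
        print(f"per-question accuracy {p:.5f} -> P(all {N} correct) = {p ** N:.3g}")
    # 0.999 still gives only ~4.5e-05 chance of a perfect corpus;
    # 0.99999 gives ~0.90 -- a near-step change on the aggregate metric.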


It seems like model learning is too optimized for continuity - e.g., continuous variables can be subdivided infinitely - whereas logic and algorithms are firm rather than fuzzy like this. Learning agents arguably need to be able to generalize from fuzzy to firm concepts to properly learn logic and algorithms. Whether that can happen automatically by scaling or whether it requires a fundamental shift is unclear.


Except, if we take humans as an example, we do not "generalize from fuzzy to firm". We don't handle firm; we approximate it with fuzzy. Which is why formal logic makes for a fine family of puzzle games, but fares extremely badly at describing reality and reasoning about reality. Better-suited mathematical frameworks are probabilistic in nature, and so happen to be fundamentally continuous.


> We don't handle firm

We absolutely do handle firm. Mathematics, logic, morality are all examples that have plenty of firm classifications, as is typical, everyday reasoning, eg. I would state that I am naasking in this thread and I am not TeMPOraL, and I would not state that I am 99.999% probably naasking but 0.001% possibly TeMPOraL.

> Which is why formal logic makes for a fine family of puzzle games, but fares extremely badly at describing reality and reasoning about reality.

The device you're reading this on would suggest quite strongly that this conclusion is wrong.


> Mathematics, logic, morality are all examples that have plenty of firm classifications

Which tend to fail to apply to reality, unless relaxed or turned into continuities.

> as is typical, everyday reasoning, eg. I would state that I am naasking in this thread and I am not TeMPOraL, and I would not state that I am 99.999% probably naasking but 0.001% possibly TeMPOraL.

Everyday reasoning is a shorthand, not formal logic. You stating that you're naasking and not TeMPOraL is not a formal logic statement, nor a fully discrete one - not unless you really believe this to be 100% likely, and do not admit the possibility of any of the following:

- You dreaming or imagining it;

- You're being a brain in a vat;

- You being delirious or otherwise mentally compromised;

- Your senses not actually working correctly, and making shit up;

All of which, in reality, have non-zero likelihood. Applying discrete logic to real life forces you to accept things as certain even if evidence for them is less likely than our understanding of reality being correct - or conversely, rounding possible states of the universe down to impossible.


Did you see this submission? https://news.ycombinator.com/item?id=39575264 They sound linked.


This is a good paper. Though emergence doesn’t necessarily require a sudden jump in metrics, or unpredictability. New abilities can emerge gradually.


When we talk about "emergence" in machine learning, we talk about those metrics with a sudden jump, as explained in the paper that introduced the term: https://arxiv.org/abs/2206.07682


Just because somebody overloaded a word doesn't mean we can't correct people who use it wrong.

The correct phrase should be "sudden/unexpected emergence". Otherwise the phrase "gradual emergence" would be an oxymoron.


As another commenter explained, different domains use words differently. The ML paper I linked has been influential enough that, within machine learning, "emergent" is used to describe this behaviour, so you're right that, within the context of ML and AI, "gradual emergence" is an oxymoron.

Again, I understand this might not align with what you think "emergence" means in the English language, in the same way in which "metal" in physics has a completely different meaning from "metal" in chemistry, "metal" in astrophysics, and "metal" in the English language.


To be fair, I don't think anybody in chemistry or astrophysics is very proud of this kind of overloading. It was done by some Philistines and the rest of us have to live with their defacement of the dictionary.


They didn't introduce the term. It's a standard English word:

https://en.wiktionary.org/wiki/emergent#English

> (philosophy, sciences) Having properties as a whole that are more complex than the properties contributed by each of the components individually.

If they've said that it requires a sudden jump then they're trying to redefine the word to mean something it doesn't. But I haven't heard anyone else use that definition.

The emergent properties of AI are like "how can it write an entire Python program when we've only ever asked it to predict single words". It doesn't require any sudden jumps.


Indeed, I was surprised by the article's assumption that "emergence" is about a sudden jump in ability.

I work in AI and ML but my academic background is in the philosophy of mind. When people talk about "emergence" w/r/t AI I assume they're speculating about how consciousness might be an emergent property of a neural network, in the same way some people speculate that our minds are emergent properties of our brains. I think that speculation is foolish, but that is beside the point. It has nothing to do with a sudden jump in abilities, per se.


Polysemous words, which sound the same and are spelled the same but have two or more different meanings, are common.

Each subfield defines rigorous terms, and some may be overloaded.

It is impossible to have universal, exact definitions of words; definitions have to be domain-specific.


Polysemous words exist of course, but that doesn't mean they're always unproblematic. The best examples, like bank (where the money is) versus bank (sloped piece of land) are unlikely to be confused because there is no domain where they're often used in conjunction.

In this context, it's charitable to note that some polysemy is at work within the domain (because the application of terms is so new), and so it's useful to have these kinds of distinction-making conversations.

Many confusions arise when people have different senses of a term but insist there is no polysemy and they're all talking about the same thing.


Profound, but the AI field has not redefined this term, even if one paper tried to.


So the models might be getting smarter without discontinuous jumps. And it could just be that we're measuring them in a way that gives no credit for partial answers, thus we end up missing the signs that they've been getting continuously sharper all along.

This sounds kind of in line with what I gather Sam Altman's thinking is. They feel they can predict a model's reasoning abilities quite well based just on the size of the training compute and data.


Sam Altman is the salesman; it's not his thinking, it's the thinking of the many, many experts who work at OpenAI that he parrots.


Wow, now you're really sticking it to the man!


If you are a fan of Sam Altman, can't you say what technical problems he solved and in what way he is a technical leader here? That would be much more effective than just an insult.


VC is very often a cult of personality, let's face it. So is most of capitalism when you think about it - Elon Musk, Jeff Bezos, Bill Gates, etc. They're where they are because of vast quantities of unattributed work by other people.


He still didn’t solve anything. Rocking up and shoving money in your pockets is hardly a skill.


Sam’s a snake. This should be well known. He’s charismatic. He’s well liked, and he gets things done. He also craves power and will lie and destroy you if you stand in his way. Never trust Sam.


Could be worse - I could be the one white-knighting for the billionaire I think is my BFF for some inexplicable reason.


Ground-breaking revelation! I should let all my friends at the forum know that CEO man is, indeed, bad.

Thank you!


Even a stopped clock gives the right time, twice a day.

LLMs are plausibility engines. The fundamental hypothesis on trial here is that increasing plausibility corresponds to increasing correctness. This is easily rejected for the human-sourced content that trains them, implying an upper bound on any dependent phenomenon. It follows that simply scaling LLMs will not produce AGI.


> LLMs are plausibility engines.

This is one way to look at LLMs. But it doesn't automatically impose an upper bound on their capabilities.

Humans are reproductive organisms. This is also true. At a glance, it would seem that humans could not evolve to be intelligent. They are only selected on their reproductive abilities. But obviously the categorization, while true, doesn't impose an upper bound on human abilities.

LLMs are evolved to be as efficient as possible at retaining knowledge.

A simple strategy for retaining knowledge is memorization. Neural networks definitely can do that.

A different strategy for retaining knowledge is using algorithms. Neural networks can evolve to retain knowledge using algorithms too. For instance, it has been shown that a small neural network evolved FFT-like structures to perform addition. It started out with memorization, which was far from perfect, but later in the training it switched to an algorithm using FFT for addition which produced perfect results.

I think the better LLMs retain knowledge through sophisticated compression. It involves building a world model and ways to relate the input text to that model.

I see it as the building blocks of a reasoning machine. It is incomplete and buggy, and the current architectures might hit a ceiling soon, but nonetheless it is a completely different thing than pure memorization.


FFT is fairly easy for transformers to find as a pattern.

Note that OpenAI admits that pre-training (pattern finding and matching) is where the abilities on professional tests come from.

Even proofs in Presburger arithmetic have an upper bound on the second level of the polynomial hierarchy, which is way beyond the computational power of any LLM.

https://arxiv.org/abs/2207.00729

>The model’s capabilities on exams appear to stem primarily from the pre-training process and are not significantly affected by RLHF. On multiple choice questions, both the base GPT-4 model and the RLHF model perform equally well on average across the exams we tested

https://arxiv.org/abs/2303.08774

To practically scale, parallelism is required, and the ability to find algorithms within the L or TC0 complexity classes is limited.

Remember that the individual ANNs used in transformers are just binary linear classifiers. You can do a lot with that, but it is probably limited to P, assuming we don't find out that L or TC0 are larger than we think now.

Presburger arithmetic, or first-order logic with (+,=) or (*,=), is the strongest FOL that is both complete and consistent. Move up to Peano arithmetic and you start to hit the limits from Gödel, Church, and Turing.

There will be instances that an LLM can learn with CoT, zero-shot, etc...

But most of those will be due to luck or parallel patterns in the training set.

LLMs' simplicity bias is great for plausible answers to out-of-distribution questions, but it puts limits on what can be learned as far as 'algorithms' go.

They will work for 'there exists' problems far better than for any 'for all' or 'for most' problems unless they have learnable patterns in the training set.


You're assuming that humans reliably pursue correctness rather than plausibility.

The whole scientific system specifically being designed to suppress plausible-sounding-but-incorrect claims says otherwise.


> The whole scientific system specifically being designed to suppress plausible-sounding-but-incorrect claims says otherwise.

But the fact that we came up with a scientific system is proof that humans actually can tell the difference; we might be bad at it, but we still manage to do it. Humans who can't tell the difference are very unreliable and usually shunned in engineering and other STEM fields.


Doesn’t a claim become instantly implausible when we see how it is incorrect? It's only plausible as long as it might still turn out to be correct.

Furthermore, sometimes mistakes are made when assessing correctness, even in mathematical proofs. A correctness judgement thus still has a degree of plausibility associated with it.

I'm thus not sure if the distinction being made here isn't just one of degree.


> I'm thus not sure if the distinction being made here isn't just one of degree.

I think humans have two levels here: "that sounds plausible" and "I can't imagine how this could be wrong, so I am willing to risk my life on it". The second can still be wrong, but usually it isn't. For example, when you walk on a bridge, you assume the bridge is there and confidently put down your foot, even though you would die if that part of the bridge happened to be a hallucination.

So science is a set of statements and experiments that sums up to facts that humans can't imagine are wrong. Sometimes they are still a bit wrong, but then only slightly like newtons equations or theories that aren't backed up by experiments. But in general science is full of facts that humans stake their life on every day.

LLMs doesn't seem to have that certainty layer, something so certain that the LLM would stake its life on it. I'd for sure never stake my life on anything an LLM says as it is today, but I happily do it for humans every time I take the elevator etc.


I would count your bridge example firmly in the plausibility category, in that I would find the bridge implausible, but not impossible, to collapse. In Bayesian speak, it’s a matter of credence. We usually don’t consciously think about it when our brain decides it's good enough with sufficient certainty, but in the end it's still based on a likelihood judgement.


Right, but the point is that the scientific system is a cultural accomplishment; it didn't get hardcoded into our brains. We should hence not expect "only believe things that are actually true" as a standard for AGI either, so LLMs are not necessarily disqualified by being bad at it.


For living beings there is a class of errors that kills you; there is no coming back from that. Our brains were trained to figure out which facts they are willing to stake our lives on, and I am sure all bigger living beings have such a part of their brain, so I'd say that yes, we have that programmed into us.

The scientific method is just a way to make humans use that reliable part of their brain instead of the plausible part, even in situations that aren't life or death. I don't think it is possible to come up with something like the scientific method without having a built-in sense of what facts are.


>without having a built-in sense of what facts are.

There is no such thing in 'higher intelligence' beyond protoception. Pretty much every mammal has the same built in sense that you do, and yet none of them are carting around human intelligence. More so, when human children are left feral and not raised around other humans, we typically consider them mentally disabled. Again, they learn the protoception stuff just fine, but all the stuff that "puts us above" the mammals is missing; it is not innate. Your counting, reading, and tool use are all learned behaviors after birth and form a continuous chain of learned behaviors taught by people (and more recently books and computers).

There is a reason that, after the development of the scientific method, most human endeavors experienced an exponential growth curve. It's not that we recognise fact from fiction at a much higher rate than other animals; it's that we teach each other fact from fiction better than other animals do.


> Pretty much every mammal has the same built in sense that you do, and yet none of them are carting around human intelligence

Having a sense of what is fact and what is assumption doesn't mean you are smart, but not having it means you are pretty dumb. LLMs don't have that.

Natural language is more powerful than programming languages since it lets us communicate assumptions, guesses, uncertainty, etc. to other intelligent beings. So a good LLM needs to be able to consider the certainty level of statements and act accordingly. Today's LLMs don't do that, so they generate dumber responses than a human would, since they lack a sanity checker.


That’s an astute observation.


That is true, but only in a philosophical sense.

AI is on a path to outperform humans at a large number of tasks and jobs that previously would have been described as requiring intelligence.


> implying an upper bound on any dependent phenomenon

The implication sounds good, but is easily rejected with a counterexample: a good student can surpass his master. A mediocre student with multiple masters can surpass all of them (tangent: this is why the medieval master-journeyman system was so efficient, I think).

Or to make the argument more abstract: your implication seems to assume that transfer learning does not exist.


LLMs do not study.

I’ll repeat an assertion I’ve made in this forum before: that the benchmark for AGI will be when two AIs play one another at an incomprehensible game of their own devising.

However, there is a hidden corollary in my earlier statement: if there is an upper bound on the correctness of their answers, we can conclude that LLMs will not themselves achieve sentience and become a civilisational threat. That actually increases their potential utility in many applications, much as you wouldn't, for safety's sake, put a jet engine in a conventional car (occasional mad-scientist projects notwithstanding).


The metric the authors use confuses me.

Edit distance seems like a strange way to test if the model understands arithmetic ([1], Figure 3). I think `1+3=3` would be scored as equally correct as `1+1=9`?

Why not consider how far off the model is, e.g. `abs(actual-expected)`? I wonder if there is an inflection point with that metric.
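For illustration, here is a toy sketch (mine, not from the paper) contrasting character-level edit distance with numeric error; the paper's metric is computed over tokens, but the contrast is the same:

  # Levenshtein edit distance vs. absolute numeric error for answers to "1+3=?"
  def edit_distance(a: str, b: str) -> int:
      dp = list(range(len(b) + 1))
      for i, ca in enumerate(a, 1):
          prev, dp[0] = dp[0], i
          for j, cb in enumerate(b, 1):
              prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
      return dp[-1]

  expected = "4"
  for guess in ["3", "9", "40"]:
      print(guess, edit_distance(guess, expected), abs(int(guess) - int(expected)))
  # "3", "9", and "40" are all equally wrong by edit distance (1 each),
  # but numerically they are off by 1, 5, and 36 respectively.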

https://arxiv.org/abs/2206.07682


It depends on how you do arithmetic. If you're a human and you do column addition, 12345+35791=58136 is just as big of a mistake as 48146 (the actual result is 48136). It's just one mistaken column in both. Binary half-adders work the same way.

We don't really know how LLMs do arithmetic. Maybe token edit distance would be interesting, but either way it doesn't really change the claim of the paper.

Unrelated: The link is incorrect, the one you're referring to is here: https://arxiv.org/pdf/2304.15004.pdf


Yeah, and as an aside I wonder how hard it would be to train an LLM to do addition, multiplication, etc, human style? Presumably it should be possible at least in step-by-step style (as substitute for short-term memory), the same way that we do it.

Without using an algorithmic approach, it seems an LLM can only learn a bunch of partially correct heuristics, and attempt to generalize over examples.

I've played with this a bit in the past, and came to the conclusion that GPT-3 seems to have learnt to compare the size of numbers (whether accurately or via heuristics), and would get the approximate size of an answer right (depending on the task), even if not the actual value right. I seem to recall it also doing this for tasks like asking for a prime number greater than a particular value.


I mean, is it efficient to teach them addition human style instead of heuristics of when to call the right function?

Imagine you could say 'calc' in your brain, and some separate subcomponent of your brain that is far more power efficient could return an answer almost instantly. You would not focus on understanding addition/subtraction; instead you would focus on when to use addition/subtraction/multiplication or whatever.
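A minimal, purely hypothetical sketch of that routing idea (the `dispatch` function and the `calc:` prefix are made up for illustration; real systems use structured function calling):

  # Hypothetical router: the "smart" part only decides that a calculation is
  # needed; the exact arithmetic is delegated to a cheap, reliable subcomponent.
  def dispatch(request: str):
      if request.startswith("calc:"):
          expr = request.removeprefix("calc:")
          # evaluate the arithmetic expression with no builtins available
          return eval(expr, {"__builtins__": {}})
      return "handle with the language model instead"

  print(dispatch("calc: 12345 + 35791"))  # 48136, exact every time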


> is it efficient to teach them addition human style

Not if math is your only goal, but there'd be value in making the models more powerful so that they could learn to do simple things like this (and not so simple things) by themselves. You can't have a tool for everything, and hopefully future AGI can itself do things that are more than just a mashup of existing tool capabilities.


It's a really interesting question how we can measure emergent abilities like arithmetic operations. We cannot test every operation on every possible combination of numbers. Instead we must somehow make sure that the LLM performs arithmetic operations using the corresponding rules and axioms.


> We cannot test every operation on every possible combination of numbers.

You can spot-check with a survey of random samples. That’s also how we often test humans in their abilities.

What's interesting is that the quality of the answers when asking an LLM to explain arithmetic and when asking it to perform arithmetic don't seem to be necessarily correlated. I.e. an LLM might be able to perfectly explain arithmetic but completely fail at performing it.

In humans, we don’t expect this to be the case (although there are examples of the opposite case with idiots savants, or, to a lesser degree, with children who might be able to perform some task but not explain it).

This disconnect in LLMs is one of the most important differences to human intelligence, it would seem.


> Instead we must somehow make sure that the LLM performs arithmetic operations using the corresponding rules and axioms.

It isn't. It's stringing together likely tokens that approximate (often very effectively!) what its data corpus has done in the past. And, relatedly, the best way I've found GPT4 to solve a word problem is to tell it to write some Python code to spit out an answer; the actual computation part is an easier thing to figure out when it's just running code.


A very simple example: “list the Presidents in the order they were born”.

It gets the order wrong unless you tell it to “use Python”

https://chat.openai.com/share/4a673ea0-67d3-4256-b57d-dc1cf8...


My favorite example is telling it to reverse some longer text character-by-character. Trivial for any human to perform perfectly, but all models I've tested struggle with it and make mistakes all over. It's really hard for them because they lack hidden state to perform algorithms - or what you would call thought in a human. Instead (essentially) for each step they have to re-consider the entire thing and their past output, figure out what they already did, and what they have to do next.

On the other hand they'll spit out python code that'll get you the reversed text just fine.
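(The code they write for it really is a one-liner:)

  text = "reverse me character-by-character"
  print(text[::-1])  # "retcarahc-yb-retcarahc em esrever"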

It's also one of their greatest shortcomings when it comes to coding: They lack the ability to do any meaningful symbolical execution.

https://chat.openai.com/share/9faaae97-e20f-454e-b245-3e4c19...


I'm pretty sure this is a problem relating to tokenization. There are hardly any tokens that can be used for reversing text.

I simply took the "interesting fact" and added this to the prompt: "Reverse the sentence. Here is the beginning: ".noitatipicerp laminim seviecer"

Here is the result I got:

".noitatipicerp laminim seviecer" (prompt) - "taht ecalp a sa treseted a fo noitinifed eht yb htraE no treseted tsegral eht sa ti seifilaqulc esnapxe tsa v s'acitr nA.serutarepmet gnizeerf sti etipseD.dnali ni sse l ne ve na d t so ac eht g n o l l a ff all wons ) mc 5 . 61( seh cni . 5 6 tuoba fo noitatipicerp launn a egareva htiw, noitatipicerp eltt i l y r e v s eveice r ti esuaceb treseted a sa deifissalc si acitr an t A. e cap s d n a l y ci sti ot eu d dekool revo net fo tneu qer itnA" .a n fe ts erita n o t se r tse degral s'd lro w e ht taht stcaf gnitanicsaf enO".

Manually reversed back into original order:

"One fascinating facts that th e w orl d's larged est r es t o n atire st ef n a. Anti req uent of ten over looked d ue to its ic y l a n d s pac e .A t na rtica is classified as a detesert because it r ecieve s v e r y l i ttle precipitation ,with average a nnual precipitation of about 6 5 . inc hes (16 . 5 cm ) snow lla ff a l l o n g the ca os t d an ev en l ess in iland.Despite its freezing temperatures.An rtica's v ast expanse cluqalifies it as the largest detesert on Earth by the definition of a detesert as a place that" - (prompt) "receives minimal precipitation."


> I'm pretty sure this is a problem relating to tokenization.

I don't think so - because they seem to be able to repeat back any short sequence of characters without issue. If I pick anything from that text they struggled with, manually reverse it, and tell them to repeat the reversed word back to me, that works fine.

It's also not just an issue with reversing something character-by-character. You can ask them to reverse numbers or re-arrange words and they'll faceplant in the same way as soon as the input gets beyond a small threshold. Here surely there wouldn't be an issue with tokenization.

Of course if you trained a network specifically on the task of reversing text it would do quite well, but not because it's doing it using any straightforward algorithm. Nothing like what a human would be doing in that situation can be represented within their network - because they're directed graphs and there's no hidden state available to them.

The point is simply to demonstrate their inability to perform any novel task that requires even a tiny bit of what I dub "thought". By their very implementation they cannot.


> You can ask them to reverse numbers or re-arrange words and they'll faceplant in the same way as soon as the input gets beyond a small threshold. Here surely there wouldn't be an issue with tokenization.

My guess is the training data contains many short pairs of forward and backward sequences, but none after a certain threshold length (due to how quickly the number of possible sequences grows with length). This would imply there's no actual reversing going on, and the LLM is instead using the training data as a lookup table.


Apparently Claude-3 Opus can do reversal tasks pretty well, even without a code interpreter (or does it use one internally?).

https://twitter.com/AlexTamkin/status/1767248600919355670


Pretty much all of them will be able to fake it on short sentences. All break down eventually (and soon).

Also that's not a reversal task because there was no input. It was free to make up anything that fits.


It's horrible at relative times too. If you just give times, it can puzzle it out, but if you add something happening, it struggles:

https://chat.openai.com/share/5f558fc4-a0d0-494d-a3d7-ad78f5...

More: https://chat.openai.com/share/11c45192-6153-44b4-bb97-024e8d...

“The event at 3pm doesn’t fall within the 2.1-hour window around 5pm because this time window spans from 2:54 pm to 7:06 pm. The 3pm event occurred before the start of this window. Since 3pm is earlier than 2:54 pm, it’s outside the range we’re considering.”

Trillions of tokens!
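(For what it's worth, the check it fumbles is a couple of lines of the Python it will happily write; a quick sketch:)

  from datetime import datetime, timedelta

  center = datetime(2024, 1, 1, 17, 0)   # 5:00 pm (date is arbitrary)
  event  = datetime(2024, 1, 1, 15, 0)   # 3:00 pm
  delta  = timedelta(hours=2.1)

  start, end = center - delta, center + delta
  print(start.time(), end.time())        # 14:54:00 19:06:00
  print(start <= event <= end)           # True: 3 pm is inside the window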


The first example with ChatGPT 4

https://chat.openai.com/share/32335834-9d12-421e-96b2-9aa6f1...

For the second example, I had to tell it to use Python

https://chat.openai.com/share/76e6cd67-ad49-4508-b05d-3d26a3...


Does python involve calling "get_us_presidents()"?


I couldn’t see how to get the code to show in the shared link myself.

But I did look at the code during the session when I was creating the link. It's just what you would expect: a dictionary of US Presidents and the year they were born, and a one-line call to a built-in Python function to sort the list.
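Presumably something along these lines; this is my reconstruction truncated to a few presidents, not the actual generated code:

  # Birth-year lookup plus a one-line sort, roughly what the session produced
  birth_years = {
      "George Washington": 1732,
      "Abraham Lincoln": 1809,
      "Theodore Roosevelt": 1858,
      "Barack Obama": 1961,
      # ...the remaining presidents omitted here
  }

  in_birth_order = sorted(birth_years, key=birth_years.get)
  print(in_birth_order)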


You can check the code it generated in the link OP provided (this button is not very visible, so I understand if you missed it).


Why would it do that? Rules and axioms scale (slowly) with the number of layers. The model can heuristically approximate more easily and more incrementally.


But in this case why would you prefer an approximation over the answer?


I would prefer my car to fly through the air, but that's not what it does.

My point is that LLMs are not magical, they're limited by their architecture and reality. They are not symbolic rule processors, even though they can fake it somewhat convincingly. In order for a symbolic rule processor to produce accurate answers, it must have some form of iteration (or fixed point computation, if you prefer). A finite number of layers imposes a fundamental limit on how far the rules' effects can be propagated, without feeding some state back in and iterating. You can augment or modify an LLM to internally do just that, but then it's a different architecture and most likely no longer trainable in a massively parallel fashion. Asking for a chain of thought gives a weak form of iteration restricted to passing state via the response so far, and apparently that chain of thought is compatible enough with the way the LLM works that it doesn't matter that the training did not explicitly involve that iteration.

In short, demanding accurate answers means moving back in the direction of traditional AI. Which has its own strengths and weaknesses, but has never achieved the level of apparent magic we're seeing from these relatively dumb collections of weights extracted from enormously massive piles of data.

The Secret Formula turned out to be "feed a huge amount of data to a big but dumb model", because the not-so-dumb (more complex) models would take too long to feed the huge amount of data to, and the benefits of model complexity are massively outweighed by the competing benefits of learning big sets of weights from massive data. The trick was to find just the right form of "dumb" (though now it sounds like multiple forms of dumb work ok as long as you have the massive pile of data to feed it, and you don't go so dumb as to lose the attention mechanism).



So an instance of epistemological emergence instead of ontological emergence.


> But with other tasks, the jump in ability wasn’t smooth. The performance remained near zero for a while, then performance jumped.

Second paragraph completely contradicts the title given.


Continue reading.


This is very insightful. Another one of those "obvious in retrospect" ideas. We tend to think of addition as only correct or not but has anyone ever claimed that LLMs have a "discrete" output (might be using the wrong terminology)? It would then make sense that you need to measure performance in a continuous, not discrete way. Otherwise you'll end up with a sort of "aliasing"-type error.


Not quite in retrospect -- the earlier paper on emergence addresses this

> It is also important to consider the evaluation metrics used to measure emergent abilities. For instance, using exact string match as the evaluation metric for long-sequence targets may disguise compounding incremental improvements as emergence. Similar logic may apply for multi-step or arithmetic reasoning problems, where models are only scored on whether they get the final answer to a multi-step problem correct, without any credit given to partially correct solutions.

- Page 7, https://arxiv.org/pdf/2206.07682.pdf

They say that cross entropy loss in these cases goes down with model size incrementally, well before "emergent" capabilities appear. If so the model is improving (in a sense) even though these capabilities aren't observable below some critical size.
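As a toy illustration of why that happens (my own sketch, assuming, unrealistically, that each output token is independently correct with probability p):

  # Per-token accuracy grows smoothly with p, but exact-match accuracy on a
  # 10-token answer sits near zero and then shoots up: "emergence" induced
  # purely by the choice of metric.
  for p in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99):
      print(f"p={p:.2f}  per-token={p:.2f}  exact-match(10 tokens)={p**10:.3f}")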


Okay, neat. Using a binary metric means that you observe sudden transitions between failure and success. Using a more granular metric means you observe smoother improvement. Logical and meaningful.

It does make sense to evaluate things like a 3-digit sum by "how close textually?" and "how close numerically?", and the phase change is an artifact of the actual question being "is it fully correct?"


Too bad the article only mentions addition – which shows gradual improvement.

Would have loved to see the claim of emergence supported by an example as well. And more importantly, if the measure is boolean, completely wrong or completely right, then, well, I do expect to see a "sudden" jump in ability.


The original emergence paper has multiple "emergent" metrics. [1]

Though, the paper this article is about [2] mentions that among the "emergent" tasks they tested, 92% were measuring accuracy on either exact-string-match or multiple choice, so probably just bad metrics.

[1] https://arxiv.org/pdf/2206.07682.pdf

[2] https://arxiv.org/pdf/2304.15004.pdf


I think the graph is especially interesting. If the progress of LLMs is less radical when measured correctly, that could be another factor driving a bubble and its later popping.


The updated "partial credit" metric proposed by the researchers suggests that LLMs will continue to improve as parameter count increases (assuming you have the data to make use of XX trillion parameters). It's just showing that it isn't as "step-wise" as other rankings may have indicated.


It still is stepwise for the kinds of use cases being envisioned.

For example, it's useful to know your 1 trillion parameter self driving car model is actually 90 percent of the way there, but it's also useful to know that the threshold you need to meet to implement the technology will likely be met at some point when your parameter count increases.


Transformers have in-context meta learning. If you give them examples they can perform new tasks.

When you run AutoGPT, the outcome is not always predictable.


Phase changes are a natural pattern in complex systems. See state transition in fluids.

Depending on how you measure, they look instant or gradual


I think the importance of this study is being downplayed by many. That same spontaneous, "breakthrough" nature of LLM abilities that is now nullified was a major component of many arguments against LLMs and could possibly have stopped us from achieving greater intelligence.

I see this as the removal of a possible great filter event in the evolution of LLMs. This is a pretty big concern being nullified.


I think it only feels emergent, just like how some people think LLMs are conscious.

Is it emergent because you don't really understand what it was capable of? Or is it confirmation bias, because you don't call it an emergent flaw when LLMs get some stuff so crazy wrong?


Study: scientists prefer metrics which create an illusion that every time-consuming process is gradual and predictable


If a process can be measured by a linearly growing metric, I'd say it's predictable.

And of course scientists try to understand complex processes and make them more predictable. That distinguishes science from magic. You said that like it's a bad thing.


Emergence is a weird topic. If we go with the physicist's common definition [0,1] we'd need to differentiate into weak and strong.

As for weak emergence[2] that is de facto.

As for strong emergence[3] well... do we have an example of the phenomenon anywhere?

To be clear, ML people have a different definition which I do not think is entirely useful. Let's look at the wording used in [4]

> Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence raises the question of whether additional scaling could potentially further expand the range of capabilities of language models.

> Emergence is when quantitative changes in a system result in qualitative changes in behavior.

> An ability is emergent if it is not present in smaller models but is present in larger models.

I'm not sure anything here is actually meaningful. We have very little knowledge about the interpretation of ML models, so it seems to be jumping the gun to say that phenomena cannot be predicted. The aspects are highly coupled with things like training techniques, optimization objectives (or loss functions), optimization methods, not just architectural designs.

We've also, after more research, found smaller models capable of performing things that large models can do. So the goalposts move, and the conditions for "emergent abilities" shift with them. [RTFA]

So how do we define emergent abilities? Everything static except for the number of parameters? Does this make sense? Changing the number of parameters is an abstracted way of changing the optimization technique or training method. Clearly these are also different by nature of different batch sizes and/or an LR scheduler. So it is hard to be consistent.

Not to mention that we are not great at predicting much of what networks can do in general. There's plenty of people working on the subject (though this is a small proportion of total ML researchers, even if we exclude those that are highly application focused). So are we gonna call something emergent if it is just something we don't know about? And who gets to be the one to decide? I've seen abilities called emergent that are clearly a result of training methods like KL Divergence but surprising to individuals who do not have a rigorous statistics and/or metric theory background.

It just all seems to be jumping the gun. It's a new field and a new science. I think it is okay if we're willing to admit that our understanding just isn't great yet. There's no need to embellish or over attribute. These models are without a doubt powerful and useful, but at the same time I feel we are happy to greatly exaggerate all aspects about them.

And for the life of me, I can't figure out why critiques are interpreted as dismissive. You can call an LLM a stochastic parrot and still think it is a great achievement and useful tool. Moreso, criticism is essential. Criticism gives us direction in research. Hype gives us motivation. But the two have to be in balance. Motivation without direction is an undirected Monte Carlo search (can still work, but we can do better than a drunken man). Overly criticizing (turning into dismissing, such as calling LLMs useless) is just pulling wool over one's eyes. It is the equivalent of lights being turned off and deciding to sit down and give up. When the lights get turned off you should find how to turn them back on. And we have more clues than a pure random process.

After all, aren't we trying to make these better? I know I am. It is why I got interested in researching these things in the first place. I just wish we could have moderate hype rather than this ridiculously excessive one (which I think operates in a feedback loop with excessive criticism as it drives polarization).

[0] Sabine Hossenfelder (I know HN loves her): https://www.youtube.com/watch?v=bJE6-VTdbjw

[1] Sean Carrol: https://www.youtube.com/watch?v=0_PdLja-eGQ

Yes, Sabine and Sean are both controversial characters but they are well known.

[2] Weak emergence is where some larger phenomenon forms out of smaller phenomena and results in something that the smaller thing can't do. The result requires the interactions of parts of a system. Temperature is a common example because it is generally easier to discuss the aggregated value of all the particles' "jigglyness" rather than each individual particle. The resultant property can be derived from the individual components. Conway's game of life is another common example. But in game theory we'd call a coalition an emergent property since utility is higher in the collective than the sum of each individual. Clearly this is true for ANNs since they are composed of neurons and a single neuron cannot perform these tasks.

[3] Strong emergence is about behavior that CANNOT be derived from the individual constituents.

[4] https://arxiv.org/abs/2206.07682


I think you're making it much more complicated than it is.

The fact is, nobody expected that speaking natural language, let alone reasoning capabilities, would emerge at the scale of GPT-3.

Even the people that suspected that scaling would improve GPT-2 (i.e. OpenAI), were surprised by the results.

Clearly this is "strong emergence" according to your definition, behavior that cannot be derived or extrapolated from individual constituents (or smaller scale).


> nobody expected that speaking natural language, let alone reasoning capabilities, would emerge at the scale of GPT-3

Tons of people expected it to "speak" natural language. I mean that's why we built it.

Jury is still out on reasoning.

I say this as a ML researcher btw

> Even the people that suspected that scaling would improve GPT-2 (i.e. OpenAI), were surprised by the results.

I'm sure some were, I'm sure some weren't. I mean, as we've learned more about how it was trained and what it was trained on, I think fewer people are surprised. But fair to say that that's biased. But so is your version.

The problem is that __who__ is being surprised matters. A caveman would be surprised by a computer, because they have no reference. But we aren't. Perspective doesn't make the computer magic. Nor does it make it strong emergence. It just makes the caveman less knowledgeable. It's okay if we're the caveman.

> Clearly this is "strong emergence" according to your definition

LOL

> behavior that cannot be derived or extrapolated from individual constituents (or smaller scale).

Jury's still out. Don't count your chickens before they hatch. Currently this is an unanswerable question. Just because __WE__ can't derive it, doesn't mean it can't be derived. That's the difference in the definition.

It may very well be (I highly doubt it), but we have so little understanding of these networks that you can't make a strong claim either way. But remember that strong emergence is a VERY bold claim, considering that we do not know of it anywhere in physics. This includes thermodynamics and quantum mechanics, mind you. But also remember that these fields took centuries to get to the point we're at now, and that still isn't a complete understanding (but it has accelerated for sure).

Strong claims require strong evidence.

Spend less time on twitter and more time reading papers and math books if you want to understand ML.


Small pet-peeve with title: Everything is math/predictable. We just may or may not be able to predict it with current knowledge.


OP should link to the original article, which is listed in the first line:

https://www.quantamagazine.org/how-quickly-do-large-language...


How does this work? Why is the whole story just copied over?


Wired has a deal with Quanta to republish some of their stories. There are other stories from Quanta in Wired; I no longer read Wired, but I believe that this is how I first learned of Quanta.


Ahh, thanks, I didn't realize this was a thing.


I don't know but would assume it's marketing-related, the story seen as meeting the demographic requirements of both publications, with an established mutually-beneficial relationship satisfied in the transaction, eg. maybe cheaper content for Wired, bumped readership figures for Quanta.


That is a terrible headline. It implies that the _abilities_ are a mirage, but it's actually the "emergent" part that might be a mirage -- which is to say it's not an unpredictable "phase transition" but gradual and predictable improvement in ability as the models scale.


FWIW, I read the headline as saying the abilities of LLMs are not emergent.


Emergent and breakthrough are not the same quality. Something can be emergent and develop gradually


> Something can be emergent and develop gradually

Emergent can be defined as the sudden appearance of novel behavior.


Emergence simply means the model produces something it wasn't trained or expected to produce.


Not only that, but what's the point in evaluating a language model on its ability to do arithmetic? It's like criticizing a talking dog because the C++ code it wrote was full of buffer-overflow bugs.

In reality, if you ask a good LLM to solve a math-related problem, it will write and run a Python program to return the answer. Sometimes this even works. Sometimes it returns garbage. And sometimes it realizes that its own answer isn't realistic and tries a different approach. Sometimes people claim this isn't a valid manifestation of intelligence.

Completely pointless study, unworthy of Wired or HN.


This is a study of "emergent" properties of LLMs: whether something unexpected shows up (e.g. imagine that talking dog suddenly becoming great at pointer arithmetic and never committing a use-after-free).

It was noticed that LLMs can do some arithmetic, but we are still uncertain how much and how exactly it happens.


Arithmetic is treated as a proxy for general reasoning abilities.


> Arithmetic is treated as a proxy for general reasoning abilities.

Which is stupid, because it's not. A pocket calculator can perform arithmetic, as can a Python program, or for that matter a Japanese cormorant, who counts the fish it helps you catch. None of those are considered capable of "reasoning" on their own.

Meanwhile, GPT4 will cheerfully write a program to carry out the required arithmetic operations (and then some.) A study that doesn't acknowledge that is worthless at best.


> Meanwhile, GPT4 will cheerfully write a program to carry out the required arithmetic operations (and then some.) A study that doesn't acknowledge that is worthless at best.

We don't want to see if the LLM can be used as a tool to do arithmetic, but whether it can learn complex data relationships like arithmetic. Arithmetic is a stepping stone, not a goal, so the fact that the model can solve it by invoking a calculator isn't relevant. The problems we want it to solve don't have tools like calculators, so that doesn't help get us there.


Exactly, because as Gödel showed, arithmetic can in principle model many other formal systems, so if you can see arithmetic examples, generalize to the full set of rules, and learn to apply them consistently, then you have a powerful reasoning tool.


These models are clearly capable of doing this. There is no theoretical reason why you should expect them to fail at this. One day they will be able to do this perfectly and nobody will get the silly idea of generating a program to do it anymore. There is no need for another bitter lesson where "clever" AI researchers and engineers waste their careers adding a hundred different workarounds to these minor problems.


I don't know... the ability to write code to solve an otherwise ill-suited problem seems pretty general to me. It seems like a big step in a concrete direction, as opposed to a lot of Goedelian navel-gazing about arithmetic and Peano axioms and whatnot.

Agreed that generalized architectures will ultimately win out over hand-tweaked ones. But the patent wars that will eventually be fought over this stuff are where the real bitter lessons will come into play. At some point, we'll be forced back into the hand-optimization business because someone like OpenAI (or another Microsoft proxy) will have locked down a lot of powerful basic techniques.


And “emergent” in this context usually means “behaviour resulting from the interaction of a large number of individually simple units”, not “suddenly appearing.”


This interpretation is most definitely right. As with all things, it's coming down to semantics.


There seems to be some sort of flaw or misalignment in the hypothesis. By my reading, it seems this trio is saying LLMs do arithmetic by predicting the digits, and at intermediate stages there are actually partial predictions, and so there isn't a huge jump, which would then be classifiable as emergent. But I'm thinking that in order for an LLM to generally predict arithmetic results in the way they are designed, there would be an infinite memory requirement as it'd have to "store" an infinite set of completions.

So the key may be to find sets of operands that GPT4/Claude Opus are unable to solve, even with hints similar to what a teacher may give a student learning the topic. Until such a set is found, I'd say this capability continues to meet the "emergent" definition, as the only other explanation - that I can think of - is that the models "discovered" how to do arithmetic in a general way.


>that I can think of - is that the models "discovered" how to do arithmetic in a general way.

I mean... yeah. Maybe I've simply misunderstood, but it seems like you're presenting this as an outlandish idea, when something like this is what neural networks regularly end up doing, especially in the limit (as loss trends toward zero).

https://cprimozic.net/blog/reverse-engineering-a-small-neura...

https://www.alignmentforum.org/posts/N6WM6hs7RQMKDhYjB/a-mec...

Nobody is going to find any such set because time and time again, neural networks show us complete memorization is actually more difficult than just learning the distribution (even if imperfectly).



