
I feel very comfortable saying, as a mathematician, that the ability to solve grade school maths problems would not be at all a predictor of ability to solve real mathematical problems at a research level.

LLMs fail at solving mathematical problems because: 1) they are terrible at arithmetic, 2) they are terrible at algebra, but, most importantly, 3) they are terrible at complex reasoning (more specifically, they mix up quantifiers and don't really understand the complex logical structure of many arguments), and 4) they (current LLMs) cannot backtrack when they find that what they have already written does not lead to a solution, and even if you did give them that facility, it would be too expensive to allow them the thousands of restarts they'd need to randomly guess their way through the problem.

Solving grade-school problems might mean progress in 1 and 2, but that is not at all impressive, as there are perfectly good tools out there that solve those problems just fine, and old-style AI researchers have built perfectly good tools for 3. The hard problem to solve is problem 4, and this is something you teach people how to do at a university level.

(I should add that another important problem is what is known as premise selection. I didn't list that because LLMs have actually been shown to manage this ok in about 70% of theorems, which basically matches records set by other machine learning techniques.)

(Real mathematical research also involves what is known as lemma conjecturing. I have never once observed an LLM do it, and I suspect they cannot do so. Basically the parameter set of the LLM dedicated to mathematical reasoning is either large enough to model the entire solution from end to end, or the LLM is likely to completely fail to solve the problem.)

I personally think this entire article is likely complete bunk.

Edit: after reading replies I realise I should have pointed out that humans do not simply backtrack. They learn from failed attempts in ways that LLMs do not seem to. The material they are trained on surely contributes to this problem.




What I wonder, as a computer scientist:

If you want to solve grade school math problems, why not use an 'add' instruction? It's been around since the '50s, it runs a billion times faster than an LLM, every assembly-language programmer knows how to use it, every high-level language has a one-token equivalent, and it doesn't hallucinate answers (other than integer overflow).

We also know how to solve complex reasoning chains that require backtracking. Prolog has been around since 1972. It's not used that much because that's not the programming problem that most people are solving.

Why not use a tool for what it's good for and pick different tools for other problems they are better for? LLMs are good for summarization, autocompletion, and as an input to many other language problems like spelling and bigrams. They're not good at math. Computers are really good at math.

There's a theorem that an LLM can compute any computable function. That's true, but so can lambda calculus. We don't program in raw lambda calculus because it's terribly inefficient. Same with LLMs for arithmetic problems.


There is a general result in machine learning known as "the bitter lesson"[1]: methods that come from specialist knowledge tend to be beaten in the long run by methods that rely on brute-force computation, because of Moore's law and the ability to scale things out with distributed computing. So the reason people don't use the "add instruction"[2], for example, is that over the last 70 years of attempting to build systems which do exactly what you are proposing, they have found that not to work very well, whereas sacrificing what you are calling "efficiency" (which they would think of as special-purpose, domain-specific knowledge) turns out to give you a lot in terms of generality. And they can make up the lost efficiency by throwing more hardware at the problem.

[1] http://www.incompleteideas.net/IncIdeas/BitterLesson.html

[2] Which the people making these models are familiar with. The whole thing is a trillion+ parameter linear algebra crunching machine after all.


As someone with a CS background myself, I don't think this is what GP was talking about.

Let's forget for a moment that stuff has to run on an actual machine. If you had to represent a quadratic equation, would you rather write:

(a) x^2 + 5x + 4 = 0

(b) the square of the variable plus five times the variable plus four equals zero

When you are trying to solve problems with a level of sophistication beyond the toy stuff you usually see in these threads, formal language is an aid rather than an impediment. The trajectory of every scientific field (math, physics, computer science, chemistry, even economics!) is away from natural language and towards formal language, even before computers, precisely for that reason.

We have lots of formal languages (general-purpose programming languages, logical languages like Prolog/Datalog/SQL, "regular" expressions, configuration languages, all kinds of DSLs...) because we have lots of problems, and we choose the representation of the problem that most suits our needs.

Unless you are assuming you have some kind of superintelligence that can automagically take care of everything you throw at it, natural language breaks down when your problem becomes wide enough or deep enough. In a way this is like people making Rube-Goldberg contraptions with Excel. 50% of my job is cleaning up that stuff.


I quite agree and so would Wittgenstein, who (as I understand it) argued that precise language is essential to thought and reasoning[1]. I think one of the key things here is that often what we think of as reasoning boils down to taking a problem in the real world and building a model of it in some precise language that we can then apply a set of known tools to. Your example of a quadratic is perfect, because of course now, seeing (a), I know right away that it's an upward-facing parabola with a line of symmetry at x = -5/2, that the roots are at -4 and -1, etc., whereas if I saw (b) I would first have to write it down to get it into a proper form I could reason about.
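Spelled out, the working behind those observations is just one line:

    x^2 + 5x + 4 = (x + 1)(x + 4) = 0  =>  x = -1 or x = -4; line of symmetry at x = -b/(2a) = -5/2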

I think this is a fundamental problem with the "chat" style of interaction with many of these models (the language interface isn't the best way of representing any specific problem, even if it's quite a useful compromise for problems in general). An intrinsic problem of this class of model is that they only have text generation to "hang computation off", meaning the "cognitive ability" (if we can call it that) is very strongly related to how much text the model generates for a given problem, which is why chain-of-thought prompting produces much better results for many problems[2].

[1] Hence the famous payoff line: "Whereof one cannot speak, thereof one must be silent."

[2] And I suspect this is why GPT-4 seems in general to have got a lot more verbose. In my use it seems to be doing a lot of thinking out loud, which gives better answers than asking it to be terse and just give the answer, or to give the answer first and then the reasoning; both of those generally produce inferior answers, in my experience and in the research, e.g. https://arxiv.org/abs/2201.11903


> I quite agree and so would Wittgenstein

It depends on whether you ask him before or after he went camping -- but yeah, I was going for an early-Wittgenstein-esque "natural language makes it way too easy to say stuff that doesn't actually mean anything" (although my argument is much more limited).

> I think this is a fundamental problem with the "chat" style of interaction

The continuation of my argument would be that if the problem is effectively expressible in a formal language, then you likely have way better tools than LLMs to solve it. Tools that solve it every time, with perfect accuracy and near-optimal running time, and critically, tools that allow solutions to be composed arbitrarily.

AlphaGo and NNUE for computer chess, which are often cited for some reason as examples of this brave new science, would be completely worthless without "classical" tree search techniques straight out of Russell-Norvig.

Hence my conclusion, contra what seems to be the popular opinion, is that these tools are potentially useful for some specific tasks, but make for very bad "universal" tools.


There are some domains that are in the twilight zone between language and deductive, formal reasoning. I got into genealogy last year. It's very often deductive "detective work": say there are four women in a census with the same name and place as the one listed on a birth certificate you're investigating. Which of them is it? You may rule one out on hard evidence (the census suggests she would have been 70 when the birth happened), one on linked evidence (this one is the right age, but it's definitively the same one who died 5 years later, and we know the child's mother didn't), one on combined softer evidence (she was in a fringe denomination and at the upper end of the age range), and then you're left with one, etc.

Then as you collect more evidence you find that the age listed on the first one in the census was wildly off due to a transcription error and it's actually her.

You'd think some sort of rule-based system and database might help with these sorts of things. But the historical experience of expert systems is that you then often automate the easy bits at the cost of demanding even more tedious data entry. And you can't divorce data entry and deduction from each other either, because without context, good luck reading out a rare last name in the faded ink of some priest's messy gothic handwriting.

It feels like language models should be able to help. But they can't, yet. And it fundamentally isn't because they suck at grade school math.

Even linguistics, not something I know much about but another discipline where you try to make deductions from tons and tons of soft and vague evidence - you'd think language models, able to produce fluent text in more languages than any human, might be of use there. But no, it's the same thing: it can't actually combine common sense soft reasoning and formal rule-oriented reasoning very well.


> You'd think some sort of rule-based system and database might help with these sorts of things.

sounds like belief change systems (a bit) to me!

https://plato.stanford.edu/entries/logic-belief-revision/


I assumed seanhunter was suggesting getting the LLM to convert x^2 + 5x + 4 = 0 to a short bit of source code to solve for x.
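For example, a minimal sketch of the kind of snippet that could come back (assuming the model reaches for sympy, which is just one plausible choice):

    import sympy as sp

    x = sp.symbols("x")
    roots = sp.solve(sp.Eq(x**2 + 5*x + 4, 0), x)  # symbolic solve, no arithmetic guessing
    print(roots)  # roots: -4 and -1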

IIRC Wolfram Alpha has (or had, hard to keep up) a way to connect with ChatGPT.


It does. This is the plugins methodology described in the Toolformer paper, which I've linked elsewhere[1]. The model learns that for certain types of problem, certain specific "tools" are the best way to get a solution. The problem, of course, is that it's simple to argue the LLM merely learns to use the tool(s) and can't reason about the underlying problem itself. The question boils down to whether you're more interested in machines which can think (whatever that means) or in having a super-powered co-pilot which can help with a wide variety of tasks. I'm quite biased towards the second, so I have the Wolfram Alpha plugin enabled in my ChatGPT. I can't say it solves all the math-related hallucinations I see, but I might not be using it right.

[1] But here it is again https://arxiv.org/abs/2302.04761


GPT-4 now does this even without explicitly enabling plugins, by constructing Python. If you want it to actually reason through a problem itself, you now need to ask it, sometimes fairly forcefully/in detail, before it will indulge you and not omit steps. E.g. see [1] for the problem given above.

But as I noted elsewhere, training its ability to do it from scratch matters not for the ability to do it from scratch, but for the transferability of the reasoning ability. And so I think that while it's a good choice for OpenAI to make it automatically pick more effective strategies to give the answer it's asked for, there is good reason for us to still dig into its ability to solve these problems "from scratch".

[1] https://chat.openai.com/share/694251c9-345b-4433-a856-7c38c5...


Ideally we'd have both worlds -- but if we're aiming for AGI and we have to choose, using a language that lets you encode everything seems preferable to one that only lets you talk about, say, constrained maximization problems.


The ML method doesn't require you to know how to solve the problem at all, and could someday presumably develop novel solutions, not just high-efficiency symbolic graph search.


The bitter lesson isn't a "general result". It's an empirical observation (and extrapolation therefrom) akin to Moore's law itself. As with Moore's law there are potential limiting factors: physical limits for Moore's law and availability and cost of quality training data for the bitter lesson.


Surely the "efficiency" is just being transferred from software to hardware, e.g. the hardware designers are having to come up with more efficient designs, shrink die sizes, etc. to cope with the inefficiency of the software engineers? We're starting to run into the limits of Moore's law in this regard when it comes to processors, although it looks like another race might be about to kick off for AI, but with RAM instead. When you've got to the physical limits of both, is there anywhere else to go other than to make the software more efficient?


When you say "a general result", what does that mean? In my world, a general result is something which is rigorously proved, e.g., the fundamental theorem of algebra. But this seems to be more along the lines of "we have lots of examples of this happening".

I'm certainly no expert, but it seems to me that Wolfram Alpha provides a counterexample to some extent, since they claim to fuse expert knowledge and "AI" (not sure what that means exactly). Wolfram Alpha certainly seems to do much better at solving math problems than an LLM.


As someone else pointed out, I've used that term wrongly. "Rule of thumb" or "observation" would be a better way to put it.


I would mention that, while yes, you can just throw computational power at the problem, the contribution of human expertise didn't disappear. It moved from creating an add instruction to coming up with new neural net architectures, and we've seen a lot of those ideas being super useful and pushing the boundaries.


>> the ability to solve grade school maths problems would not be at all a predictor of ability to solve real mathematical problems at a research level

> If you want to solve grade school math problems, why not use an 'add' instruction?

Certainly the objective is not for the AI to do research-level mathematics.

It's not really even to do grade-school math.

The point is that grade-school math requires reasoning capability that transcends probabilistic completion of the next token in a sequence.

And if Q-Star has that reasoning capability, then it's another step-function leap toward AGI.


> Certainly the objective is not for the AI to do research-level mathematics.

The problem is that there are different groups of people with different ideas about AI, and when talking about AI it's easy to end up tackling the ideas of a specific group but forgetting about the existence of the others. In this specific example, surely there are AI enthusiasts who see no limits to the applications of AI, including research-level mathematics.


This is so profoundly obvious you have to wonder about the degree of motivated reasoning behind people’s attempts to cast this as “omg it can add… but so can my pocket calculator!”


There's no value in an LLM doing arithmetic just for the sake of having the LLM do arithmetic. There's value in testing an LLM's ability to follow the rules for doing arithmetic that it already knows, because the ability to recognise that a problem matches a set of rules it already knows, in part or in whole, and then to apply those rules with precision is likely to generalise to far better overall problem-solving abilities.

By all means, we should give LLMs lots and lots of specialised tools to let them take shortcuts, but that does not remove the reasons for understanding how to strengthen the reasoning abilities that would also make them good at maths.

EDIT: After having just coerced the current GPT-4 to do arithmetic manually: it appears to have drastically improved in its ability to systematically follow the required method, while ironically being far less willing to do so (it took multiple attempts before I got it to stop taking shortcuts that appeared to involve recognising this was a calculation it could use tooling to carry out, or ignoring my instructions to do it step by step and just doing it "in its head" the way a recalcitrant student might). It's been a while since I tested this, but this is definitely "new-ish".


Gaslighting LLMs does wonders. In this case, e.g., priming it by convincing it the tool is either inaccessible/overloaded/laggy, or here perhaps, telling it the python tool computed wrong and can thus not be trusted.


Why would we teach kids maths then, when they can use a calculator? It's much easier and faster for them.

I believe it's because having a foundational understanding of maths and logic is important when solving other problems, and if you are looking to create an AI that can generally solve all problems it should probably have some intuitive understanding of maths too.

i.e. if we want an LLM to be able to solve unsolved theorems in the future, this requires a level of understanding of maths that is more than 'teach it to use a calculator'.

More broadly, I can imagine a world where LLM training is a bit more 'interactive' - right now if you ask it to play a game of chess with you it fails, but it has only ever read about chess and past games and guesses the next token based on that. What if it could actually play a game of chess - would it get a deeper appreciation for the game? How would this change its internal model for other questions (e.g. would this make it answer questions about other games, or even game theory, better)?


It's also fun to use your brain, I guess; I think we've truly forgotten that life should be about fun.

Watching my kids grow up, they just have fun doing things like trying to crawl, walk or drink. It's not about being the best at it, or the most efficient, it's just about the experience.

Now maths is taught in a boring way, but knowing it can help us lead more enjoyable lives. When maths is taught in an enjoyable way AND people get results out of it, well, that's glorious.


> Why would we teach kids maths then, when they can use a calculator? It's much easier and faster for them.

I am five years older than my brother, and we happened to land just on opposite sides of when children were still being taught mental arithmetic and when it was assumed they would, in fact, have calculators in their pockets.

It drives him crazy that I can do basic day-to-day arithmetic in my head faster than he can get out his calculator to do it. He feels like he really did get cheated out of something useful because of the proliferation of technology.


Skull has limited volume. What room is unused by one capacity may be used by another.


Even if that were true, I can count on one hand the number of times I've needed to use anything more than basic algebra (which is basically arithmetic with a placeholder) in my adult life. I think I'd genuinely rather keep arithmetic in my head than calculator use.


Is this intuition scientifically supported? I've read that people who remember every detail of their lives tend not to have spectacular intelligence, but outside of that extreme I'm unaware of having seen the tradeoff actually bite. And there are certainly complementarities in knowledge -- knowing physics helps with chemistry, knowing math and drama both help with music, etc.


Chimps have a much better working memory than humans. They can also count 100 times faster than humans. However, the area of their brain responsible for this faculty is used for language in humans... The theory is that the prior working memory and counting ability may have been optimized out at some point to make physical room, assuming human ancestors could do it too.

Look up the chimp test. The videos of the best chimp are really quite incredible.

There is also the measured inflation of map traversing parts of the brain in pro tetris players and taxi drivers. I vaguely remember an explanation about atrophy in nearby areas of the brain, potentially to make room.


Judging by some YouTube videos I’ve seen, ChatGPT with GPT-4 can get pretty far through a game of chess. (Certainly much farther than GPT-3.5.) For that duration it makes reasonably strategic moves, though eventually it seems to inevitably lose track of the board state and start making illegal moves. I don’t know if that counts as being able to “actually play a game”, but it does have some ability, and that may have already influenced its answers about the other topics you mentioned.


What if you encoded the whole game state into a one-shot completion that fits into the context window every turn? It would likely not make those illegal moves. I suspect it's an artifact of the context window management that is designed to summarize lengthy chat conversations, rather than an actual limitation of GPT4's internal model of chess.
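As a rough sketch of that idea (using python-chess for board bookkeeping, and a hypothetical ask_llm() standing in for whatever chat API is used), re-sending the full position every turn rather than relying on the chat history:

    import chess  # python-chess

    def llm_move(ask_llm, board: chess.Board) -> str:
        # Encode the entire game state (FEN plus the legal moves) in every prompt,
        # so nothing depends on the model tracking earlier turns itself.
        prompt = (
            "You are playing chess. Current position (FEN): " + board.fen() + "\n"
            "Legal moves: " + ", ".join(board.san(m) for m in board.legal_moves) + "\n"
            "Reply with exactly one of the legal moves above, in SAN."
        )
        return ask_llm(prompt).strip()

Listing the legal moves also gives you a cheap way to reject and retry any reply that is not actually legal.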


I am sorry, but I thought it was a bold assumption that it has an internal model of chess?


Having an internal model of chess and maintaining an internal model of the game state of a specific given game when it's unable to see the board are two very different things.

EDIT: On re-reading I think I misunderstood you. No, I don't think it's a bold assumption to think it has an internal model of it at all. It may not be a sophisticated model, but it's fairly clear that LLM training builds world models.


Not that bold, given the results from OthelloGPT.

We know with reasonable certainty that an LLM fed on enough chess games will eventually develop an internal chess model. The only question is whether GPT4 got that far.


Doesn't really seem like an internal chess model if it's still probabilistic in nature. Seems like it could still produce illegal moves.


So can humans. And nothing stops probabilities in a probabilistic model from approaching or reaching 0 or 1 unless your architecture explicitly prevents that.


Why?

Or, given https://thegradient.pub/othello/, why wouldn't it have an internal model of chess? It probably saw more than enough example games and quite a few chess books during training.


> More broadly, I can imagine a world where LLM training is a bit more 'interactive'

Well, yes, assume that every conversation you have with ChatGPT without turning off history makes it into the training set.


Actually, OpenAI did some research[0] on solving hard math problems by integrating a language model with the Lean theorem prover some time ago.

[0]: https://openai.com/research/formal-math


How do they achieve 41.2% on high school Olympiad problems but only 55% on grade school problems?

PS: also I thought GPT4 already achieved 90% in some university math grades? Oh I remember that was multiple-choice


I think the answer is Money, Money, Money. Sure, it is 1000000000x more expensive in compute power, and error-prone on top of that, to let an LLM solve an easy problem. But the monopolies generate a lot of hype around it to get more money from investors. Same as the self-driving car hype was. Or the real-time ray tracing insanity in computer graphics. If one hype dies they artificially generate a new one. For me, I just watch all the ships sink to the bottom. It is gold-level comedy. Btw AGI is coming, haha, sure, we developers will be replaced by a script which cannot put B, A, C in a logical sequence. And this already needs massive town-sized data centers to train.


> If one hype dies they artificially generate a new one

They have a pipeline of hypes ready to be deployed at a moment's notice. The next one is quantum, it's already gathering in the background. Give it a couple of years.


Can LLMs compute any computable function? I thought that an LLM can approximate any computable function, if the function is within the distribution it is trained on. I think it's jolly interesting to think about different axiomatizations in this context.

Also, we know that LLMs can't do a few things - arithmetic, inference and planning are among them. They look like they can because they retrieve discussions from the internet that contain the problems, but when they are tested out of distribution then all of a sudden they fail. However, some other NNs can do these things, because they have the architecture and infrastructure and training that enables it.

There is a question for some of these as to whether we want to make NNs do these tasks or just provide calculators, as we do for grade-school students, but on the other hand something like AlphaZero looks like it could find new ways of doing some problems in planning. The challenge is to find architectures that integrate the different capabilities we can implement in a useful and synergistic way. Lots of people have drawn diagrams about how this can be done, then presented them with lots of hand-waving at big conferences. What I love is that John Laird has been building this sort of thing for, like, forty years, and is roundly ignored by NN people for some reason.

Maybe because he keeps saying it's really hard and then producing lots of reasons to believe him?


I still believe that A(G)I will consist of subsystems and different network architectures (if NNs are the path to that), just like we humans have.


Many of the "specialist" parts of the brain are still made from cortical columns, though. Also, they are in many cases partly interchangeable, with some reduction in efficiency.

Transformers may be like that, in that they can do generalized learning from different types of input, with only minor modifications needed to optimize for different input (or output) modes.


Cortical columns are one part of much more complex systems of neural compute that at a minimum include recursive connections with the thalamus, hypothalamus, midbrain, brainstem nuclei, cerebellum, basal forebrain — and the list goes on.

So it really does look like a society of networks, all working in functional synchrony (parasynchrony might be a better word), with some forms of “consciousness” updated in time slabs of about 200-300 milliseconds.

LLMs are probably equivalent now to Wernicke’s and Broca’s areas, but much more is needed “on top” and “on bottom”—motivation, affect, short- and long-term memory, plasticity of synaptic weighting and dynamics, and perhaps most important, a self-steering attentional supervisor or conductor. That attentional driver system is what we probably mean by consciousness.


> That attentional driver system is what we probably mean by consciousness.

You may know much more about this than me, but how sure are you about this? To me it seems like a better fit that the "self-steering attentional supervisor" is associated with what we mentally model (and oversimplify) as "free will", while "consciousness" seems to be downstream from the attention itself, and has more to do with organizing and rationalizing experiences than with directly controlling behavior.

This processed information then seems to become ONE input to the executive function in following cycles, but with a lag of at least 1 second, and often much more.

> one part of much more complex systems of neural compute

As for your main objection, you're obviously right. But I wonder how much of the computation that is relevant for intelligence is actually in those other areas. It seems to me that recent developments indicate that Transformer-type models are able to self-organize into several different types of microstructures, even within present-day transformer-based models [1].

[1]: https://www.youtube.com/watch?v=Gg-w_n9NJIE (comment from Ilya somewhere)


Fun and insightful comment.

Not sure at all. Also some ambiguities in definitions. Above I mean “consciousness” of the type many would be willing to assume operates in a cat, dog, or mouse—attentional and occasionally, also intentional. I agree that this is downstream of pure attention. Attention needs to be steered and modulated. The combination of the two levels working together recursively is what I had in mind.

“Free will” gets us into more than that. I’ve been reading Daniel Dennett on levels of “intention” this week. This higher domain of an intentional stance (nice Wiki article) might get labeled “self-consciousness”.

Most humans seem to accept this as a cognitive and mainly linguistic domain—the internal discussions we have with ourselves, although I think we also accept that there are major non-linguistic drivers. Language is an amazingly powerful tool for recursive attentional and semantic control.


Interesting points.

My take on "free will" is definitely partly based on Dennett's work.

As for "consciousness", it seems to me that most if not all actions we take are decided BEFORE they hit our consciousness. For actions that are not executed immediately, the processing that we experience as "consciousness" may then raise some warning flags if the action our pre-conscious mind has decided on is likely to cause some bad consequences. This MAY cause the decision-making part (executive function) of the brain to modify the decision, but not because the consciousness can override the decision directly.

Instead, when this happens, it seems to be that our consciousness extrapolates our story into the future in a way that creates fear, desire or similar more primal motivations that have more direct influence over the executive function.

One can test this by, for instance, standing near the top of a cliff (don't do this if suicidal): try to imagine that you have decided to jump off the cliff. Now imagine the fall from the cliff and you hitting the rocks below. Even if (and maybe especially if) you managed to convince yourself that you were going to jump, this is likely to trigger a fear response strong enough to ensure you will not jump (unless you're truly suicidal).

Or for a less synthetic situation: let's say you're a married man, but in a situation where you have an opportunity to have sex with a beautiful woman. The executive part of the brain may already have decided that you will. But if your consciousness predicts that your wife is likely to find out and starts to spin a narrative about divorce, losing access to your children and so on, this MAY cause your executive function to alter the decision.

Often in situations like this, though, people tend to proceed with what the preconscious executive function had already decided. Afterwards, they may have some mental crisis because they ended up doing something their consciousness seemed to protest against. They may feel they did it against their own will.

This is why I think that the executive function, even the "free will" is not "inside" of consciousness, but is separate from it. And while it may be influenced by the narratives that our consciousness spin up, it also takes many other inputs that we may or may not be conscious of.

The reason I still call this "free" will is based on Dennett's model, though. And in fact, "free" doesn't mean what we tend to think it means. Rather, the "free" part means that there is a degree of freedom (like in a vector space) that is sensitive to the kind of incentives the people around you may provide for your actions.

For instance stealing something can be seen as a "free will" decision if you would NOT do it if you knew with 100% certainty that you would be caught and punished for it. In other words, "free will" actions are those that, ironically, other people can influence to the point where they can almost force you to take them, by providing strong enough incentives.


Afaik some are similar, yes. But we also have different types of neurons etc. Maybe we'll get there with a generalist approach, but imho the first step is a patchwork of specialists.


> Can LLM's compute any computable function?

In a single run, obviously not any, because its context window is very limited. With a loop and access to an "API" (or a willing conversation partner agreeing to act as one) to operate a Turing tape mechanism? It becomes a question of the ability to coax it into complying. It trivially has the ability to carry out every step, and your main challenge becomes getting it to stick to it over and over.

One step "up", you can trivially get GPT-4 to symbolically solve fairly complex runs of instructions in languages it can never have seen before, if you specify a grammar and then give it a program, with the only real limitation again being getting it to continue to adhere to the instructions for long enough before it starts wanting to take shortcuts.

In other words: It can compute any computable function about as well as a reasonably easily distractable/bored human.
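A rough sketch of that "loop + tape" setup (ask_llm() is a hypothetical stand-in for the chat API; the surrounding program owns the tape and the model only ever picks the next step):

    def run_llm_as_turing_machine(ask_llm, tape, state="start", max_steps=1000):
        # tape: dict mapping head positions to symbols, e.g. {0: "1", 1: "1"}
        head = 0
        for _ in range(max_steps):
            prompt = (
                f"State: {state}. Symbol under head: {tape.get(head, '_')}. "
                "Reply 'write,move,next_state' (move is L or R), or 'HALT'."
            )
            reply = ask_llm(prompt).strip()
            if reply == "HALT":
                break
            write, move, state = reply.split(",")
            tape[head] = write                    # the program, not the model, maintains the tape
            head += 1 if move == "R" else -1
        return tape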


ML still can't do sin. Functions that repeat periodically.


What exactly is it you think it can't do? It can explain and apply a number of methods for calculating sin. For sin it knows the symmetry and periodicity, and so will treat requests for sin of larger values accordingly. Convincing it to keep writing out the numbers for an arbitrarily large number of values, without emitting "... continue like this" or a similar shortcut (which a human told to do annoyingly pointless repetitive work would also be prone to prefer), is indeed tricky, but there's nothing to suggest it can't do it.
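To make that concrete, the method it can describe (and be walked through step by step) is essentially this: reduce the argument using periodicity, then evaluate a truncated Taylor series. A minimal sketch:

    import math

    def sin_approx(x, terms=10):
        x = math.fmod(x, 2 * math.pi)        # periodicity: sin(x) = sin(x mod 2*pi)
        total, term = 0.0, x                 # Taylor series: x - x^3/3! + x^5/5! - ...
        for n in range(terms):
            total += term
            term *= -x * x / ((2 * n + 2) * (2 * n + 3))
        return total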


To err is human, after all.


You're missing the point: who's using the 'add' instruction? You. We want 'something' to think about using the 'add' instruction to solve a problem.

We want to remove the human from the solution design. It would help us tremendously tbh, just like, I don't know, Google Maps helped me never have to look up directions ever again?


When the solution requires arithmetic, one trick is to simply ask GPT to write a Python program to compute the answer.

There's your 'add'.
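e.g. for a grade-school word problem (a made-up example: "3 boxes of 12 pencils, 7 pencils are given away, how many are left?"), the program it emits can be as small as:

    pencils = 3 * 12 - 7   # the model writes the expression; Python does the arithmetic
    print(pencils)         # 29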


GPT-4 now does this by default. You'll see an "analyzing" step before you get the answer, and a link which will show the generated Python.


Interesting, how do you use this idea? If you prompt the LLM with "create a Python add function Foo to add a number to another number" and then "using Foo, add 1 and 2", or some such, what's to stop it hallucinating and saying "Sure, let me do that for you, Foo of 1 and 2 is 347. Please let me know if you need anything else."?


Nothing stops it from writing a recipe for soup for every request, but it does tend to do what it's told. When asked to do mathsy things and told it's got a tool for doing those, it tends to lean into that if it's a good LLM.


It writes a function, then you provide it to an interpreter, which does the calculation; GPT then proceeds to do the rest with the output.

That’s how LangChain works, as well as ChatGPT plugins and GPT function calling. It has proven to be pretty robust - that is, GPT-4 realising when it needs to use a tool or write code for a calculation and then using the output.
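The overall loop is roughly this (a sketch with hypothetical helpers; real function-calling APIs differ in the details):

    def chat_with_tools(ask_llm, tools, messages):
        while True:
            reply = ask_llm(messages)              # the model either answers or requests a tool
            call = reply.get("tool_call")
            if call is None:
                return reply["content"]            # final natural-language answer
            result = tools[call["name"]](**call["arguments"])   # e.g. run the generated code
            messages.append({"role": "tool", "name": call["name"], "content": str(result)})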


With ChatGPT you now just state your problem, and if it looks like math, it will do so. E.g. see this transcript:

https://chat.openai.com/c/dd8de3f7-a50c-4b6d-bd3f-b52ed996d3...


We’re way beyond that kind of hallucination now. OpenAI’s models are frighteningly good at producing code.

You can even route back runtime errors and ask it to fix its own code. And it does.

It can write code and even write a test to test that code. Give it an interpreter and you’re all set.


What you’re proposing is equivalent to training a monkey (or a child for that matter) to punch buttons that correspond to the symbols it sees without actually teaching it what any of the symbols mean.


What an absolutely idiotic comment.

> If you want to solve grade school math problems

That's not the aim here. Very obviously what we are talking about here is _complementing_ AI language models with improved mathematical abilities, and whether that leads to anything interesting. Surely you understand that? Aren't you one of the highest rated commenters on this site?


You make the assumption that Q* is an LLM, but I think the OpenAI guys know very well that the current LLM architecture cannot achieve AGI.

As the name suggests, this thing is likely using some form of Q-learning algorithm, which makes it closer to the DeepMind models than to a transformer.

My guess is that they pipe their LLM into some Q-learnt net. The LLM may transform a natural-language task into some internal representation that can then be handled by the Q-learnt model, which spits out something that can be transformed back into natural language.


There is a paper about something called Q*. I have no idea if they are connected or if the name matches coincidentally.

https://arxiv.org/abs/2102.04518


The real world is a space of continuous actions. To this day, Q algorithms have been ones with discrete action outputs. I'd be surprised if a Q algorithm could handle the huge action space of language. Honestly it's weird they'd consider the Q family; I figured we were done with that after PPO performed so well.


As an ML programmer, I think that approach sounds really too complicated. It is always a bad idea to render the output of one neural network into output space before feeding it into another, rather than have them communicate in feature space.


Let's say a model runs through a few iterations and finds a small, meaningful piece of information via "self-play" (iterating with itself without further prompting from a human.)

If the model then distills that information down to a new feature, and re-examines the original prompt with the new feature embedded in an extra input tensor, then repeats this process ad-infinitum, will the language model's "prime directive" and reasoning ability be sufficient to arrive at new, verifiable and provable conjectures, outside the realm of the dataset it was trained on?

If GPT-4,5,...,n can progress in this direction, then we should all see the writing on the wall. Also, the day will come where we don't need to manually prepare an updated dataset and "kick off a new training". Self-supervised LLMs are going to be so shocking.


People have done experiments trying to get GPT-4 to come up with viable conjectures. So far it does such a woefully bad job that it isn't worth even trying.

Unfortunately there are rather a lot of issues which are difficult to describe concisely, so here is probably not the best place.

Primary amongst them is the fact that an LLM would be a horribly inefficient way to do this. There are much, much better ways, which have been tried, with limited success.


After a year the entire argument you make boils down to “so far”.


Whereas your post sounds like "Just give the approach more time, it shall continue to incrementally improve until it finally works someday, cuz reasons."

Early attempts at human flight approached it by strapping wings to people's arms and flapping: Do you think that would have eventually worked too, if only we had just given it a bit more time and faith?


> Early attempts at human flight approached it by strapping wings to people's arms and flapping: Do you think that would have eventually worked too, if only we had just given it a bit more time and faith?

Interestingly, we now have human-powered aircraft... We have flown ~60km with human leg power alone. We've also got human-powered ornithopters (flapping-wing designs) which can fly, but only for very short times before the pilot is exhausted.

I expect that another 100 years from now, both records will be exceeded, although probably more for scientific curiosity than because human-powered flight is actually useful.


I knew about the legs (there was a model in the London Science Museum when I was a kid), but I didn't know about the ornithopter.

https://en.wikipedia.org/wiki/UTIAS_Snowbird

13 years ago! Wow, how did I miss that?


> Just give the approach more time, it shall continue to incrementally improve until it finally works someday, cuz reasons

Yes. Because we haven't yet reached the limit of deep learning models. GPT-3.5 has 175 billion parameters. GPT-4 has an estimated 1.8 trillion parameters. That was nearly a year ago. Wait until you see what's next.


Why would adding more parameters suddenly make it better at this sort of reasoning? It feels a bit like a “god of the gaps”, where it’ll just stop being a stochastic parrot in just a few more million parameters.


I don't think it's guaranteed, but I do think it's very plausible because we've seen these models gain emerging abilities at every iteration, just from sheer scaling. So extrapolation tells us that they may keep gaining more capabilities (we don't know how exactly it does it, though, so of course it's all speculation).

I don't think many people would describe GPT-4 as a stochastic parrot at this point... when the paper that coined (or at least popularized) the term came out in early 2021, the term made a lot of sense. In late 2023, with models that at the very least show clear signs of creativity (I'm sticking to that because "reasoning" or not is more controversial), it's relegated to reductionistic philosophical arguments, but it's not really a practical description anymore.


I don’t think we should throw out the stochastic parrot so easily. As you say there are “clear signs of creativity” but that could be it getting significantly better as a stochastic parrot. We have no real test to tell mimicry apart from reasoning and as you note we also can only speculate about how any of it works. I don’t think it’s reductionist in light of that, maybe cautious or pessimistic.


They can write original stories in a setting deliberately designed to not be found in the training set (https://arxiv.org/abs/2310.08433). To me that's rather strong evidence of being beyond stochastic parrots by now, although I must concede that we know so little about how everything works, that who knows.


I didn't look at the paper but... How do you design a setting in a way that you're sure there isn't a similar one in the training set, when we don't even precisely know what the training set for the various GPT models was?


Basically by making it unlikely enough to exist.

The setting in the paper is about narrating a single combat between Ignatius J. Reilly and a pterodactyl. Ignatius J. Reilly is a literary character with some very idiosyncratic characteristics, that appears in a single book, where he of course didn't engage in single combats at all or interact with pterodactyls. He doesn't seem to have been the target of fanfiction either (which could be a problem if characters like, say, Harry Potter or Darth Vader were used instead), so the paper argues that it's very unlikely that a story like that had been ever written at all prior to this paper.


Well, we've been writing stories for thousands of years, so I'm a bit skeptical that the concept of "unlikely enough to exist" is a thing. More to the specific example, maybe there isn't a story about this specific character fighting a pterodactyl, but surely there are tons of stories of people fighting all kind of animals, and maybe there are some about someone fighting a pterodactyl too.


Sure, but the evaluation explicitly addresses (among other points) how well that specific character is characterized. If an LLM took a pre-existing story about (say) Superman fighting a pterodactyl, and changed Superman to Ignatius J. Reilly, it wouldn't get a high rating.


> very least show clear signs of creativity

Do you know how that “creativity” is achieved? It’s done with a random number generator. Instead of having the LLM always pick the single most likely next token, they have it sample from a set of the most likely next tokens - how heavily the less likely ones are weighted depends on the “temperature”.

Set temperature to 0, and the LLM will talk in circles and not really say anything interesting. Set it too high and it will output nonsense.

The whole design of LLMs doesn’t seem very well thought out. Things are done a certain way not because it makes sense but because it seems to produce “impressive” results.
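For the curious, the mechanism being described is roughly this (a toy sketch over made-up logits, not any particular model's actual decoding code):

    import math, random

    def sample_next_token(logits, temperature=0.8, top_k=5):
        # logits: dict mapping candidate tokens to raw scores
        if temperature == 0:
            return max(logits, key=logits.get)     # greedy: always the single most likely token
        top = sorted(logits, key=logits.get, reverse=True)[:top_k]
        weights = [math.exp(logits[t] / temperature) for t in top]
        total = sum(weights)
        r, acc = random.random(), 0.0
        for token, w in zip(top, weights):
            acc += w / total                       # softmax-weighted draw over the top-k tokens
            if r <= acc:
                return token
        return top[-1]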


I know that, but to me that statement isn't much more helpful than "modern AI is just matrix multiplication" or "human intelligence is just electric current through neurons".

Saying that it's done with a random number generator doesn't really explain the wonder of achieving meaningful creative output, as in being able to generate literature, for example.


> Set temperature to 0, and the LLM will talk in circles and not really say anything interesting. Set it too high and it will output nonsense.

Sounds like some people I know, at both extremes.

> The whole design of LLMs doesn’t seem very well thought out. Things are done a certain way not because it makes sense but because it seems to produce “impressive” results.

They have been designed and trained to solve natural language processing tasks, and are already outperforming humans on many of those tasks. The transformer architecture is extremely well thought out, based on extensive R&D. The attention mechanism is a brilliant design. Can you explain exactly which part of the transformer architecture is poorly designed?


> They have been designed and trained to solve natural language processing tasks

They aren’t really designed to do anything, actually. LLMs are models of human languages - it’s literally in the name: Large Language Model.

https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...

I’m sorry but I don’t trust something that uses a random number generator as part of its output generation.


> They aren’t really designed to do anything, actually. LLMs are models of human languages - it’s literally in the name: Large Language Model.

No. And the article you linked to does not say that (because Wolfram is not an idiot).

Transformers are designed and trained specifically for solving NLP tasks.

> I’m sorry but I don’t trust something that uses a random number generator as part of its output generation.

The human brain also has stochastic behaviour.


People use the term "stochastic parrot" in different ways ... some just as a put-down ("it's just autocomplete"), but others like Geoff Hinton acknowledge that there is of course some truth to it (an LLM is, at the end of the day, a system whose (only) goal is to predict "what would a human say"), while pointing out the depth of "understanding" needed to be really good at this.

There are fundamental limitations to LLMs though - a limit to what can be learned by training a system to predict the next word from a fixed training corpus. It can get REALLY good at that task, as we've seen, to the extent that it's not just predicting the next word but rather predicting an entire continuation/response that is statistically consistent with the training set. However, what is fundamentally missing is any grounding in anything other than the training set, which is what causes hallucinations/bullshitting. In a biological intelligent system, predicting reality is the goal, not just predicting what "sounds good".

LLMs are a good start in as much as they prove the power of prediction as a form of feedback, but to match biological systems we need a closed-loop cognitive architecture that can predict then self-correct based on mismatch between reality and prediction (which is what our cortex does).

For all of the glib prose that an LLM can generate, even if it seems to understand what you are asking (after all, it was trained with the goal of sounding good), it doesn't have the intelligence of even a simple animal like a rat that doesn't use language at all, but is grounded in reality.


> even if it seems to understand what you are asking (after all, it was trained with the goal of sounding good

It was trained not only to "sound good" aesthetically but also to solve a wide range of NLP tasks accurately. It not only "seems to" understand the prompt but it actually does have a mechanical understanding of it. With ~100 layers in the network it mechanically builds a model of very abstract concepts at the higher layers.

> it doesn't have the intelligence of even a simple animal

It has higher intelligence than humans by some metrics, but no consciousness.


> It was trained not only to "sound good" aesthetically but also to solve a wide range of NLP tasks accurately.

Was it? I've only heard of pre-training (predict next word) and subsequent RLHF + SFT "alignment" (incl. aligning to goal of being conversational). AFAIK the NLP skills that these LLMs achieve are all emergent rather than explicitly trained.

I'm not sure we can really say the net fully understands even if it answers as if it does - it was only trained to "predict next word", which in effect means being trained to generate a human-like response. It will have learnt enough to accomplish that goal, and no more (training loss tends to zero as goal is met).

Contrast this to an animal with a much richer type of feedback - reality, and with continual (aka online) learning. The animal truly understands its actions - i.e. it has learnt to accurately predict what will happen as a result of them.

The LLM does not understand its own output in this sense - it exists only in a world of words, and has no idea if the ideas it is expressing are true or not (hence all the hallucinating/bullshitting). It only knew enough to generate something that sounded like what a person might say.


> Was it? I've only heard of pre-training (predict next word) and subsequent RLHF + SFT "alignment" (incl. aligning to goal of being conversational). AFAIK the NLP skills that these LLMs achieve are all emergent rather than explicitly trained.

I believe you are right about that. I did some research after reading your comment. Transformers were certainly designed for NLP, but with large enough models the abilities can emerge without necessarily being explicitly trained for it.

> I'm not sure we can really say the net fully understands even if it answers as if it does - it was only trained to "predict next word", which in effect means being trained to generate a human-like response.

It depends on your definition of "understand". If that requires consciousness then there is no universally agreed formal definition.

Natural Language Understanding (NLU) is a subset of Natural Language Processing (NLP). If we take the word "understanding" as used in an academic and technical context then yes they do understand quite well. In order to simply "predict the next word" they learn an abstract model of syntax, semantics, meaning, relationships, etc, from the text.

> and has no idea if the ideas it is expressing are true or not (hence all the hallucinating/bullshitting).

That is not really an issue when solving tasks that are within its context window. It is an issue for factual recall. The model is not a type of database that stores its training set verbatim. Humans have analogous problems with long-term memory recall. I can think straight within my working memory, but my brain will "hallucinate" to some extent when recalling distant memories.


The context window only has to do with the size of input it has access to - it's not related to what it's outputting, which is ultimately constrained by what it was trained on.

If you ask it a question where the training data (or input data = context) either didn't include the answer, or where it was not obvious how to get the right answer, that will not (unfortunately) stop it from confidently answering!


> The context window only has to do with the size of input it has access to - it's not related to what it's outputting, which is ultimately constrained by what it was trained on.

Wait a minute. You are completely missing the entire "attention mechanism" thing, which is what makes transformers so capable. For each output token generated in sequence, the attention mechanism evaluates the current token's relationship to all tokens in the context window, weighing their relevance. There are multiple "attention heads" running in parallel (16 in GPT-3.5). Now, for each layer of the neural network there is an attention mechanism, independently processing the entire context window for each token. There are ~100 layers in ChatGPT. So now we have 100 layers times 16 attention heads = 1600 attention mechanisms evaluating the entire context window, over many deep layers of abstraction, for each output token.
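For reference, the per-head computation being described is the standard scaled dot-product attention from the original transformer paper:

    Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

i.e. every output position gets a relevance-weighted mixture of the values for everything in the context window, recomputed independently in each head and each layer.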


I'm not sure what your point is ... Hallucinations are where the net hadn't seen enough training data similar/related to the prompt to enable it to generate a good continuation/response. Of course, in cases where it is sufficiently trained and the context contains what it needs, then it can make full use of it, even copying context words to the output (zero-shot learning) when appropriate.

The real issue isn't that the net often "makes a statistical guess" rather than saying "I don't know", but rather that when it does make errors it has no way to self-detect the error and learn from the mistake, as a closed-loop biological system is able to do.


I was responding to this.

> The context window only has to do with the size of input it has access to - it's not related to what it's outputting

The sequential token generation process is closely related to the content of the context window at every step.

Maybe I misunderstood your point. I know these things can hallucinate when asked about obscure facts that they weren't sufficiently trained on.


> If you ask it a question where the training data (or input data = context) either didn't include the answer, or where it was not obvious how to get the right answer, that will not (unfortunately) stop it from confidently answering!

I haven't found this to be the case in my experience. I use ChatGPT-4. It often tells me when it doesn't know or have enough information.

If you haven't used GPT-4 I recommend signing up for a month. It is next level, way better than 3.5. (10x the parameter count). (No I'm not being paid to recommend it.)


You can predict performance on certain tasks before training, and it's continuous:

https://twitter.com/mobav0/status/1653048872795791360


I read that paper back in the day and honestly I don't find it very meaningful.

What they find is that for every emerging ability where an evaluation metric seems to have a sudden jump, there is some other underlying metric that is continuous.

The thing is that the metric with the jump is the one people would actually care about (like actually being able to answer questions correctly, etc.) while the continuous one is an internal metric. I don't think that refutes the existence of emerging abilities, it just explains a little bit of how they arise.


Why would it not? We've observed them getting significantly better through multiple iterations. It is quite possible they'll hit a barrier at some point, but what makes you believe this iteration will be the point where the advances stop?


Because for what we’re discussing, it would represent a step change in capability, not an incremental improvement as we’ve seen.


You're moving goal posts. You asked why it'd get better, not about a step change.


No I'm not; that's what this whole sub-thread is about: how bad LLMs are at the stuff that's described in the OP.

For context this is the grandparent within which my original reply was scoped:

> I feel very comfortable saying, as a mathematician, that the ability to solve grade school maths problems would not be at all a predictor of ability to solve real mathematical problems at a research level. LLMs fail at solving mathematical problems because: 1) they are terrible at arithmetic, 2) they are terrible at algebra, but, most importantly, 3) they are terrible at complex reasoning (more specifically, they mix up quantifiers and don't really understand the complex logical structure of many arguments), and 4) they (current LLMs) cannot backtrack when they find that what they have already written does not lead to a solution, and even if you did give them that facility, it would be too expensive to allow them the thousands of restarts they'd need to randomly guess their way through the problem.

> Solving grade-school problems might mean progress in 1 and 2, but that is not at all impressive, as there are perfectly good tools out there that solve those problems just fine, and old-style AI researchers have built perfectly good tools for 3. The hard problem to solve is problem 4, and this is something you teach people how to do at a university level.

> (I should add that another important problem is what is known as premise selection. I didn't list that because LLMs have actually been shown to manage this ok in about 70% of theorems, which basically matches records set by other machine learning techniques.)

> (Real mathematical research also involves what is known as lemma conjecturing. I have never once observed an LLM do it, and I suspect they cannot do so. Basically the parameter set of the LLM dedicated to mathematical reasoning is either large enough to model the entire solution from end to end, or the LLM is likely to completely fail to solve the problem.)

> I personally think this entire article is likely complete bunk.

> Edit: after reading replies I realise I should have pointed out that humans do not simply backtrack. They learn from failed attempts in ways that LLMs do not seem to. The material they are trained on surely contributes to this problem.


I responded specifically to this:

> Why would adding more parameters suddenly make it better at this sort of reasoning?

My response to you was about that specific claim as worded. Nothing more, nothing less.


Humans and other animals are definitely different when it comes to reasoning. At the same time, biologically humans and many other animals are very similar when it comes to the brain, but humans have more "processing power". So it's only natural to expect some emergent properties from increasing the number of parameters.


> it’ll just stop being a stochastic parrot in just a few more million parameters.

It is not a stochastic parrot today. Deep learning models can solve problems, recognize patterns, and generate new creative output that is not explicitly in their training set. Aside from adding more parameters there are new neural network architectures to discover and experiment with. Transformers aren't the final stage of deep learning.


Probabilistically serializing tokens in a fashion that isn't 100% identical to training set data is not creative in the context of novel reasoning. If all it did was reproduce its training set it would be the grossest example of overfitting ever, and useless.

Any actually creative output from these models is by pure random chance, which is most definitely different from the deliberate human reasoning that has produced our intellectual advances throughout history. It may or may not be inferior: there's a good argument to be made that "random creativity" will outperform human capabilities due to the sheer scale and rate at which the models can evolve, but there's no evidence that this is the case (right now).


There is also no evidence for your conjecture about there being some sort of grand distinction between "probabilistically serializing tokens" and "deliberate human reasoning" other than scale. There might be, but there is no evidence.


There's plenty of evidence that humans reason differently than ML models; namely basically any human intellectual discovery in history versus the (approximately) zero randomly generated ones by ML.

We don't know exactly how human reasoning works, but the observational evidence clearly indicates it is not by randomly piecing together tokens already known.


> There's plenty of evidence that humans reason differently than ML models; namely basically any human intellectual discovery in history versus the (approximately) zero randomly generated ones by ML.

This reasoning is invalid. For fun, I checked if GPT4 would catch the logical errors you made, and it did. Specifically, it correctly pointed out that absence of evidence is not evidence of absence. But even if there had been evidence of absence, this reasoning is invalid because it presumes that human reasoning must result in intellectual discovery irrespective of how it is employed, and so that if we can't find intellectual discoveries, it must mean an absence of human reasoning. In other words, it invalidly assumes that a difference in outcomes must represent a difference in the structure of reasoning. This is trivially invalid because humans think without making intellectual discoveries all the time.

However, it's also a strawman because I did not claim that humans and ML models reason the same way. I claimed there is no evidence of 'some sort of grand distinction between "probabilistically serializing tokens" and "deliberate human reasoning" other than scale'.

1) This explicitly recognizes that there is a difference, but that it might be just scale, and that we don't have evidence it doesn't. Your argument fails to address this entirely.

2) Even at scale, it does not claim they would be the same, but argues we don't have evidence that "probabilistically serializing tokens" must be inherently different from deliberate human reasoning" to an extent sufficient to call it "some sort of grand distinction". We can assume with near 100% certainty that there are differences - the odds of us happening upon the exact same structure is near zero. That does however not mean that we have any basis for saying that human reasoning isn't just another variant of "probabilistically serializing tokens".

I'll note that unlike you, GPT4 also correctly interpreted my intent when asked to review the paragraph and asked whether it implies the two must function the same. I *could* take that to imply that LLMs are somehow better than humans at reasoning, but that would be logically invalid for the same reasons as your argument.

> We don't know exactly how human reasoning works, but the observational evidence clearly indicates it is not by randomly piecing together tokens already known.

Neither do LLMs. Piecing together tokens in a stochastic manner based on a model is not "randomly piecing together" - the model guides the process strongly enough that it's a wildly misleading characterization, as you can indeed trivially demonstrate by actually randomly piecing together words.

But even if we assume a less flippant and misleading idea of what LLMs do, your claim is incorrect. Observational evidence does nothing of the sort. If anything, the rapidly closing gap between human communication and LLMs shows that while there are extremely likely to be structural differences at the low level, it is increasingly unclear whether they are a material distinction. In other words, it's unclear whether the hardware and even the hardwired network matters much relative to the computational structure the trained model itself creates.

You're welcome to your beliefs - but they are not supported by evidence. We also don't have evidence the other way, so it's not unreasonable to hold beliefs about what the evidence might eventually show.


Ever heard of something called diminishing returns?

The value improvement between 17.5b parameters and 175b parameters is much greater than the value improvement between 175b parameters and 18t parameters.

IOW, each time we throw 100 times more processing power at the problem, we get a measly 2x increase in value.


Yes that's a good point. But the algorithms are improving too.


You are missing the point that it can be a model limit. LLMs were a breakthrough, but that doesn't mean they are a good model for some other problems, no matter the number of parameters. Language contains more than we thought, as GPT has impressively shown (i.e. semantics embedded in the syntax emerging from text compression), but still not every intellectual process is language based.


I know that, but deep learning is more than LLMs. Transformers aren't the final ultimate stage of deep learning. We haven't found the limit yet.


You were talking about the number of parameters on existing models. As the history of Deep Learning has shown, simply throwing more computing power at an existing approach will plateau and not result in a fundamental breakthrough. Maybe we'll find new architectures, but the point was that the current ones might be showing their limits, and we shouldn't expect the models to suddenly become good at something they are currently unable to handle just because of "more parameters".


Yes you're right I only mentioned the size of the model. The rate of progress has been astonishing and we haven't reached the end, in terms of both of size and algorithmic sophistication of the models. There is no evidence that we have reached a fundamental limit of AI in the context of deep learning.


Indeed. An LLM is an application of a transformer trained with backpropagation. What stops you from adding a logic/mathematics "application" on the same transformer?


Nothing, and there are methods which allow these types of models to learn to use special purpose tools of this kind[1].

[1] https://arxiv.org/abs/2302.04761 Toolformer: Language Models Can Teach Themselves to Use Tools
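
The rough idea there is that the model learns to emit tool-call markers in its text, and a thin wrapper intercepts them, runs the tool, and splices the result back in. A minimal sketch of such a wrapper (my own illustration of the idea, not the paper's implementation; the [Calculator(...)] marker format and the calculator itself are assumptions):

    import re

    def call_calculator(expr: str) -> str:
        # Evaluate a plain arithmetic expression; a real system would sandbox this.
        return str(eval(expr, {"__builtins__": {}}, {}))

    def expand_tool_calls(text: str) -> str:
        # Replace every [Calculator(...)] marker emitted by the model with its result.
        pattern = re.compile(r"\[Calculator\((.*?)\)\]")
        return pattern.sub(lambda m: call_calculator(m.group(1)), text)

    # Pretend this string came out of the language model.
    model_output = "The invoice total is [Calculator(137*3 + 40)] dollars."
    print(expand_tool_calls(model_output))   # -> "The invoice total is 451 dollars."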


Yes, it seems like this is a direction to replace RLHF, i.e. another way to scale without more bare metal. And if not this, then it's still just a matter of time before some model optimization outperforms the raw epochs/parameters/tokens approach.


Friend, the creator of this new progress is a machine learning PhD with a decade of experience in pushing machine learning forward. He knows a lot of math too. Maybe there is a chance that he too can tell the difference between a meaningless advance and an important one?


That is as pure an example of the fallacy of argument from authority[1] as I have ever seen, especially when you consider that any nuance in the supposed letter from the researchers to the board will have been lost in the translation from "sources" to the journalist to the article.

[1] https://en.wikipedia.org/wiki/Argument_from_authority


That fallacy's existence alone doesn't discount anything (nor have you shown it's applicable here), otherwise we'd throw out the entire idea of authorities and we'd be in trouble


Authorities are useful within a context. Appealing to authority is not an argument. At most, it is a heuristic.

_Using_ this fallacy in an argument invalidates the argument (or shows it did not exist in the first place)


When the person arguing uses their own authority (job, education) to give their answer relevance, then stating that the authority of another person is greater (job, education) to give that person's answer preeminence is valid.


I am neither a mathematician or LLM creator but I do know how to evaluate interesting tech claims.

The absolute best case scenario for a new technology is when it seems like a toy for nerds and doesn't outperform anything we have today, but the scaling path is clear.

Its problems just won't matter if it does that one thing with scaling. The web is a pretty good hypermedia platform, but a disastrously bad platform for most other computer applications. Nevertheless the scaling of URIs and internet protocols has caused us to reorganize our lives around it. And then if there really are unsolvable problems with the platform, they just get offloaded onto users. Passwords? Privacy? Your problem now. Surely you know to use a password manager?

I think this new wave of AI is going to be like that. If they never solve the hallucination/confabulation issue, it's just going to become your problem. If they never really gain insight, it's going to become your problem to instruct them carefully. Your peers will chide you for not using a robust AI-guardrail thing or not learning the basics of prompt engineering like all the kids do instinctively these days.


How on earth could you evaluate the scaling path with so little information? That's my point. You can't possibly know that a technology can solve a given kind of problem if it can only so far solve a completely different kind of problem which is largely unrelated!

Saying that performance on grade-school problems is predictive of performance on complex reasoning tasks (including theorem proving) is like saying that a new kind of mechanical engine that has 90% efficiency can be scaled 10x.

These kind of scaling claims drive investment, I get it. But to someone who understands (and is actually working on) the actual problem that needs solving, this kind of claim is perfectly transparent!


Any claims of objective, quantitative measurements of "scaling" in LLMs is voodoo snake oil when measured against some benchmarks consisting of "which questions does it answer correctly". Any machine learning PhD will admit this, albeit only in a quiet corner of a noisy bar after a few more drinks than is advisable when they're earning money from companies who claim scaling wins on such benchmarks.


For the current generative AI wave, this is how I understand it:

1. The scaling path is decreased val/test loss during training.

2. We have seen multiples times that large decreases in this loss have resulted in very impressive improvements in model capability across a diverse set of tasks (e.g. gpt-1 through gpt-4, and many other examples).

3. By now, there is tons of robust data demonstrating really nice relationships between model size, quantity of data, length of training, quality of data, etc and decreased loss. Evidence keeps building that most multi-billion param LLMs are probably undertrained, perhaps significantly so.

4. Ergo, we should expect continued capability improvement with continued scaling. Make a bigger model, get more data, get higher data quality, and/or train for longer and we will see improved capabilities. The graphs demand that it is so.

---

This is the fundamental scaling hypothesis that labs like OpenAI and Anthropic have been operating off of for the past 5+ years. They looked at the early versions of the curves mentioned above, extended the lines, and said, "Huh... These lines are so sharp. Why wouldn't it keep going? It seems like it would."

And they were right. The scaling curves may break at some point. But they don't show indications of that yet.

Lastly, all of this is largely just taking existing model architectures and scaling up. Neural nets are a very young technology. There will be better architectures in the future.
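
To make "extending the lines" a bit more concrete, here is a toy sketch of the kind of curve fitting involved: a power law loss(N) = a*N^(-b) + c fitted to made-up loss-vs-parameter-count points and then extrapolated. The numbers are invented for illustration and are not real GPT data.

    import numpy as np
    from scipy.optimize import curve_fit

    # Hypothetical (parameter count, validation loss) pairs, purely illustrative.
    params = np.array([1e8, 1e9, 1e10, 1e11])
    loss = np.array([3.9, 3.2, 2.7, 2.3])

    def power_law(n, a, b, c):
        # Common scaling-law form: loss falls off as a power of model size,
        # flattening toward an irreducible loss c.
        return a * n ** (-b) + c

    (a, b, c), _ = curve_fit(power_law, params, loss, p0=(30.0, 0.1, 1.0), maxfev=20000)
    print(f"fitted exponent b ~ {b:.3f}, irreducible loss c ~ {c:.2f}")
    print("extrapolated loss at 1e12 params:", power_law(1e12, a, b, c))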


We're at the point now where the harder problem is obtaining the high quality data you need for the initial training in sufficient quantities.


These European efforts to create competitive LLMs need to know that.


I don't think they will go anywhere. Europe doesn't have the ruthlessness required to compete in such an arena, it would need far more unification first before that could happen. And we're only drifting further apart it seems.


I didn’t say “certain success”, I said “interesting”


Honestly, OpenAI seem more like a cult than a company to me.

The hyperbole that surrounds them fits the mould nicely.


They did build the most advanced LLM tool, though.


Maybe it takes a cult


But he also has the incentive to exaggerate the AI's ability.

The whole idea of a double-blind test (and really, the whole scientific method) is based on one simple thing: even the most experienced and informed professionals can be comfortably wrong.

We'll only know when we see it. Or at least when several independent research groups see it.


> even the most experienced and informed professionals can be comfortably wrong

That's the human hallucination problem. In science it's a very difficult issue to deal with; only in hindsight can you tell which papers from a given period were the good ones. It takes a whole scientific community to come up with the truth, and sometimes we fail.


No. It takes just one person to come up with the truth. It can then take ages to convince the "scientific community".


Well, one person will usually add a tiny bit of detail to the "truth". It's still a collective task.


I don't think so. The truth is advanced by individuals, not by the collective. The collective is usually wrong about things as long as it possibly can be. Usually the collective first has to die before it accepts the truth.


I thought (and could be wrong) that all of these concerns are based on a very low probability of a very bad outcome.

So: we might be close to a breakthrough, that breakthrough could get out of hand, then it could kill a billion+ people.


> I thought (and could be wrong) that all of these concerns are based on a very low probability of a very bad outcome.

Among knowledgeable people who have concerns in the first place, I'd say giving the probability of a very bad outcome of cumulative advances as "very low" is a fringe position. It seems to vary more between "significant" and "close to unity".

There are some knowledgeable people like Yann LeCun who have no concerns whatsoever but they seem singularly bad at communicating why this would be a rational position to take.


Given how dismissive LeCun is of the capabilities of SotA models, I think he thinks the state of the art is very far from human, and will never be human-like.

Myself, I think I count as a massive optimist, as my P(doom) is only about 15% — basically the same as Russian Roulette — half of which is humans using AI to do bad things directly.


Unlikely. We'll know when OpenAI has declared itself ruler of the new world, imposes martial law, and takes over.


Why would you ever know? Why would the singularity reveal itself in such an obvious way(until it's too late to stop it)?


Also, wbhart is referring to publicly released LLMs, while the OpenAI researchers are most likely referring to an un-released in-research LLM.


Sure... but that machine learning PhD has a vested interest in being optimistically biased in his observations.


Ah, finally the engineer's approach to the news. I'm not sure why we have to have hot takes, instead of dissecting the news and trying to tease out the how.


FWIW The Verge is reporting that people inside are also saying the Reuters story is bunk:

https://www.theverge.com/2023/11/22/23973354/a-recent-openai...


> After being contacted by Reuters, OpenAI, which declined to comment, acknowledged in an internal message to staffers a project called Q* and a letter to the board before the weekend's events, one of the people said.

Reuters update 6:51 PST

The Verge has acted like an intermediary for Sam's camp during this whole saga, from my reading.


We have an algorithm and computational hardware that will tune a universal function approximator to fit any dataset with emergent intelligence as it discovers abstractions, patterns, features and hierarchies.

So far, we have not yet found hard limits that cannot be overcome by scaling the number of model parameters, increasing the size and quality of training data or, very infrequently, adopting a new architecture.

The number of model parameters required to achieve a defined level of intelligence is a function of the architecture and training data. The important question is: what is N, the number of model parameters at which we cross an intelligence threshold and it becomes theoretically possible to solve mathematics problems at a research level, for an optimal architecture that we may not yet have discovered? Our understanding does not extend to the level where we can predict N, but I doubt that anyone still believes that it is infinity after seeing what GPT4 can do.

This claim here is essentially a discovery that N may be much closer to where we are with today's largest models. Researchers at the absolute frontier are more likely to be able to gauge how close they are to a breakthrough of that magnitude from how quickly they are blowing past less impressive milestones like grade school math.

My intuition is that we are in a suboptimal part of the search space and it is theoretically possible to achieve GPT4 level intelligence with a model that is orders of magnitude smaller. This could happen when we figure out how to separate the reasoning from the factual knowledge encoded in the model.


Intelligence isn't a function unless you're talking about a function over every possible state of the universe.


There are well described links between intelligence and information theory. Intelligence is connected to prediction and compression as measures of understanding.

Intelligence has nothing specific to do with The Universe as we know it. Any universe will do: a simulation, images, or a set of possible tokens. The universe is every possible input. The training set is a sampling drawn from the universe. LLMs compress this sampling and learn the processes and patterns behind it so well that they can predict what should come next without any direct experience of our world.

All machine learning models and neural networks are pure functions. Arguing that no function can have intelligence as a property is equivalent to claiming that artificial intelligence is impossible.


Intelligence must inherently be a function unless there is a third form of cause-effect transition that can't be modelled as a function of determinism and randomness.


Functions are by definition not random. Randomness would break: "In mathematics, a function from a set X to a set Y assigns to each element of X exactly one element of Y"


"Function" has (at least) two meanings. The last clause is not talking about functions in the mathematical sense. It could have been worded clearer, sure.


> The reason LLMs fail at solving mathematical problems is because

That's exactly what Go/Baduk/Weiqi players thought some years ago. And superalignment is definitely OpenAI's major research objective:

> https://openai.com/blog/our-approach-to-alignment-research

> our AI systems are proposing very creative solutions (like AlphaGo’s move 37)

When will mathematicians face the move 37 moment?


Probably in <3 years


I don't know whether this particular article is bunk. I do know I've read many, many similar comments about how some complex task is beyond any conceivable model or system and then, years later, marveled at exactly that complex task being solved.


The article isn't describing something that will happen years later, but now. The comment author is saying that this current model is not AGI as it likely can't solve university-level mathematics, and they are presumably open to the possibility of a model years down the line that can do that.


This comment seems to presume that Q* is related to existing LLM work -- which isn't stated in the article. Others have guessed that the 'Q' in Q* is from Q-learning in RL. In particular backtracking, which you point out LLMs cannot do, would not be an issue in an appropriate RL setup.


> which you point out LLMs cannot do, would not be an issue in an appropriate RL setup.

Hm? It's pretty trivial to use a sampler for LLMs that does a beam search and will effectively 'backtrack' a 'bad' selection.

It just doesn't normally help-- by construction, the normally-sampled LLM output already approximates the correct overall distribution for the entire output, without any search.

I assume using a beam search does help when your sampler does have some non-trivial constraints (like the output satisfies some grammar or passes an algebraic test, or even just top-n sampling since those adjustments on a token by token basis result in a different approximate distribution than the original distribution filtered by the constraints).
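
To illustrate that last point with a toy: a constraint-filtered sampler masks out candidate tokens that fail a check and renormalizes the rest, which is exactly the kind of per-token adjustment that warps the overall distribution away from what unconstrained sampling would give. Everything below (the tiny candidate list, the digits-only constraint) is made up for the example.

    import random

    def sample_with_constraint(candidates, is_allowed):
        # candidates: list of (token, probability) pairs from the model for one step.
        # Mask out disallowed tokens and renormalize before sampling.
        allowed = [(tok, p) for tok, p in candidates if is_allowed(tok)]
        total = sum(p for _, p in allowed)
        r, acc = random.random() * total, 0.0
        for tok, p in allowed:
            acc += p
            if r <= acc:
                return tok
        return allowed[-1][0]

    # One decoding step: the model prefers "roughly", but the constraint demands a digit.
    step = [("7", 0.2), ("roughly", 0.5), ("3", 0.25), ("!", 0.05)]
    print(sample_with_constraint(step, str.isdigit))   # always "7" or "3"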


Back-tracking is a very nearly solved problem in the context of Prolog-like languages or mathematical theorem provers (as you probably well know). There are many ways you could integrate an LLM-like system into a tactic-based theorem prover without having to restart from the beginning for each alternative. Simply checkpointing and backtracking to a checkpoint would naively improve upon your described Monte Carlo algorithm. More likely I assume they are using RL to unwind state backwards and update based on the negative result, which would be significantly more complicated but also much more powerful (essentially it would one-shot learn from each failure).

That's just what I came up with after thinking on it for 2 minutes. I'm sure they have even better ideas.


You can also consider the chatGPT app as a RL environment. The environment is made of the agent (AI), a second agent (human), and some tools (web search, code, APIs, vision). This grounds the AI into human and tool responses. They can generate feedback that can be incorporated into the model by RL methods.

Basically every reply from a human can be interpreted as a reward signal. If the human restates the question, it means a negative reward, the AI didn't get it. If the human corrects the AI, another negative reward, but if they continue the thread then it is positive. You can judge turn-by-turn and end-to-end all chat logs with GPT4 to annotate.

The great thing about chat based feedback is that it is scalable. OpenAI has 100M users, they generate these chat sessions by the millions every day. Then they just need to do a second pass (expensive, yes) to annotate the chat logs with RL reward signals and retrain. But they get the human-in-the-loop for free, and that is the best source of feedback.
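
A toy sketch of what that annotation pass could look like in its crudest form, using turn-level heuristics over the chat logs (the heuristics and reward values below are invented for illustration; a real pass would presumably use a model as the judge):

    def reward_from_human_reply(prev_user_msg: str, human_reply: str) -> float:
        # Crude turn-level heuristics: a restated question or an explicit correction
        # suggests the assistant missed the mark; moving on suggests it helped.
        reply = human_reply.lower()
        if reply.strip() == prev_user_msg.lower().strip():
            return -1.0                    # user repeated themselves verbatim
        if any(k in reply for k in ("no,", "that's wrong", "not what i asked")):
            return -0.5                    # explicit correction
        if any(k in reply for k in ("thanks", "great, now", "ok, next")):
            return 0.5                     # user moved on: implicit approval
        return 0.0                         # neutral / unknown

    # One (user, assistant, next user message) turn from a hypothetical chat log.
    turn = ("how do I revert a commit?", "Use git revert <sha>.", "thanks, and how do I undo that?")
    print(reward_from_human_reply(turn[0], turn[2]))   # 0.5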

AI-human chat data is in-domain for both the AI and human, something we can't say about other training data. It will contain the kind of mistakes AI does, and the kind of demands humans want to solve with AI. My bet is that OpenAI have realized this and created GPTs in order to enrich and empower the AI to create the best training data for GPT-5.

The secret sauce of OpenAI is not their people, or Sam, or the computers, but the training set, especially the augmented and synthetic parts.


There are certainly efforts along the lines of what you suggest. There are problems though. The number of backtracks is 10^k where k is not 2, or 3, or 4.....

Another issue is that of autoformalisation. This is the one part of the problem where an LLM might be able to help, if it were reliable enough (it isn't currently) or if it could truly understand the logical structure of mathematical problems correctly (currently they can't).


> That's just what I came up with after thinking on it for 2 minutes. I'm sure they have even better ideas.

The thing is that ideas are not necessarily easy to implement. There will be many obstacles along the route you described:

- quality of provers: are there good enough provers which can also run at large scales (say billions of facts)?

- you need some formalization approach; probably an LLM will do some of the work, but we don't know what the quality will be

- the LLM will likely generate many individual factoids which are loosely compatible, contradictory, etc., and non-trivial effort is required to reconcile and connect them


I agree that in and of itself it's not enough to be alarmed. Also I have to say I don't really know what grade school mathematics means here (multiplication? Proving triangles are congruent?). But I think the question is whether the breakthrough is an algorithmic change in reasoning. If it is, then it could challenge all 4 of your limitations. Again, this article is low on details, so really we are arguing over our best guesses. But I wouldn't be so confident that an improvement on simple math problems due to algorithms can't have huge implications.

Also, do you remember what Go players said when AlphaGo beat Fan Hui? Change can come quickly.


I think maybe I didn't make myself quite clear here. There are already algorithms which can solve advanced mathematical problems 100% reliably (prove theorems). There are even algorithms which can prove any correct theorem that can be stated in a certain logical language, given enough time. There are even systems in which these algorithms have actually been implemented.

My point is that no technology which can solve grade school maths problems would be viewed as a breakthrough by anyone who understood the problem. The fundamental problems which need to be solved are not problems you encounter in grade school mathematics. The article is just ill-informed.


>no technology which can solve grade school maths problems would be viewed as a breakthrough ...

Not perhaps in the sense of making mathematicians redundant but it seems like a breakthrough for ChatGPT type programs.

You've got to remember these things have gone from kind of rubbish a year or so ago to being able to beat most students at law exams now, and by the sounds of it they'll beat students at math tests shortly. At that rate of progress they'd be competing with the experts before very long.


The article suggests the way Q* solves basic math problems matters more than the difficulty of the problems themselves. Either way, I think judging the claims made remains premature without seeing the supporting documentation.


“Given enough time” makes that a useless statement. Every kid in college learns this.

The ability to eventually prove a given theorem isn't interesting, especially if the time required is longer than the time left in the universe.

It’s far more interesting to see if an AI can, given an arbitrarily stated problem make clear progress quickly.


On backtracking, I thought tree-of-thought enabled that?

"considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices"

https://arxiv.org/abs/2305.10601

Generally with you though, this thing is not leading to real smarts and that's accepted by many. Yes, it'll fill in a few gaps with exponentially more compute, but it's more likely that an algo change is required once we've maxed out LLMs.


Yes, there are various approaches like tree-of-thought. They don't fundamentally solve the problem because there are just too many paths to explore and inference is just too slow and too expensive to explore 10,000 or 100,000 paths just for basic problems that no one wanted to solve anyway.

The problem with solving such problems with LLMs is that if the solution to the problem is unlike problems seen in training, the LLM will almost every time take the wrong path and very likely won't even think of the right path at all.

The AI really does need to understand why the paths it tried failed in order to get insight into what might work. That's how humans work (well, one of many techniques we use). And despite what people think, LLMs really don't understand what they are doing. That's relatively easy to demonstrate if you get an LLM off distribution. They will double down on obviously erroneous illogic, rather than learn from the entirely new situation.


Thank you for the thoughtful response


As someone who studied math in grad school as part of a PhD program, worked at a hedge fund and went on to work on software and applied math, I call bullshit on this.

Math and logic are just low-dimensional symbol manipulation that computers can easily do. You can throw data at them and they'll show you theories that involve vectors of 42,000 variables while Isaac Newton had 4 and Einstein had 7 with Levi-Civita calculus. In short, what you consider "reasoning", while beautiful in its simplicity, is nevertheless a set of crude approximations to complex systems, such as linear regression or least squares.

3 days ago AI predicted fluid dynamics better than humans: https://www.sciencedaily.com/releases/2023/11/231120170956.h...

Google’s AI predicts weather now faster and better than Current systems built by humans: https://www.zdnet.com/google-amp/article/ai-is-outperforming...

AlphaZero based on MCTS years ago beat Rybka and all human-built systems in chess: https://www.quora.com/Did-AlphaZero-really-beat-Stockfish

And it can automate science and send it into overdrive: https://www.pbs.org/newshour/amp/science/analysis-how-ai-is-...


But isn't AlphaGo a solution to a kind of specific mathematical problem? And one that it has passed with flying colors?

What I mean is, yes, neural networks are stochastic and that seems to be why they're bad at logic; on the other hand it's not exactly hallucinating a game of Go, and that seems different from how neural networks are prone to hallucination and confabulation on natural language or X-ray imaging.


Sure, but people have already applied deep learning techniques to theorem proving. There are some impressive results (which the press doesn't seem at all interested in because it doesn't have ChatGPT in the title).

It's really harder than one might imagine to develop a system which is good at higher order logic, premise selection, backtracking, algebraic manipulation, arithmetic, conjecturing, pattern recognition, visual modeling, has a good mathematical knowledge, is autonomous and fast enough to be useful.

For my money, it isn't just a matter of fitting a few existing jigsaw pieces together in some new combination. Some of the pieces don't exist yet.


Then your critique is about LLMs specifically.

But even there, can we say scientifically that LLMs cannot do math? Do we actually know that? And in my mind, that would imply LLMs cannot achieve AGI either. What do we actually know about the limitations of various approaches?

And couldn't people argue that it's not even necessary to think in terms of capabilities as if they were modules or pieces? Maybe just brute-force the whole thing, make a planetary scale computer. In principle.


You seem knowledgeable. Can you share a couple of interesting papers for theorem proving that came out in the last year? I read a few of them as they came out, and it seemed neural nets can advance the field by mixing "soft" language with "hard" symbolic systems.


The field is fairly new to me. I'm originally from computer algebra, and somehow struggling into the field of ATP.

The most interesting papers to me personally are the following three:

* Making higher order superposition work. https://doi.org/10.1007/978-3-030-79876-5_24

* MizAR 60 for Mizar 50. https://doi.org/10.48550/arXiv.2303.06686

* Magnushammer: A Transformer-Based Approach to Premise Selection. https://doi.org/10.48550/arXiv.2303.04488

Your mileage may vary.


How about this:

- The Q* model is very small and trained with little compute.

- The OpenAI team thinks the model will scale in capability in the same way the GPT models do.

- Throwing (much) more compute at the model will likely allow it to solve research level math and beyond, perhaps also do actual logic reasoning in other areas.

- Sam goes to investors to raise more money (Saudi++) to fund the extra compute needed. He wants to create a company making AI chips to get more compute etc.

- The board and a few other OpenAI employees (notably Ilya) want to be cautious and adopt a more "wait and see" approach.

All of this is speculation of course.


Your comment is regarding LLMs, but Q* may not refer to an LLM. As such, our intuition about the failure of LLM's may not apply. The name Q* likely refers to a deep reinforcement learning based model.

To comment: in my personal experience, reinforcement learning agents learn in a more relatable, human way than traditional ML models, which act like stupid aliens. RL agents try something a bunch of times, mess up, and tweak their strategy. After some extreme level of experience, they can make wider strategic decisions that are a little less myopic. RL agents can take in their own output, as their actions modify the environment. RL agents also modify the environment during training (which I think you will agree with me is important if you're trying to learn the influence of your own actions as a basic concept). LLMs, and traditional ML in general, are never trained in a loop on their own output. But in DRL, this is normal.

So if RL is so great and superior to traditional ML, why is RL not used for everything? Well, the full time horizon that can be taken into consideration by a DRL agent is very limited, often a handful of frames, or distilled frame predictions. That prevents them from learning things like math. Traditionally RL bots have only been used for things like robotic locomotion, chess, and Go: short-term decision making made given one or a few frames of data. I don't even think any RL bots have learned how to read English yet lol.

For me, as a human, my frame predictions exist on the scale of days, months, and years. To learn math I've had to sit and do nothing for many hours, and days at a time, consuming my own output. For a classical RL bot, math is out of the question.

But, my physical actions, for ambulation, manipulation, and balance, are made for me by specialized high speed neural circuits that operate on short time horizons, taking in my high level intentions, and all the muscle positions, activation, sensor data, etc. Physical movement is obfuscated from me almost in entirety. (RL has so far been good at tasks like this.)

With a longer frame horizon that predicts frames far into the future, RL would be able to make long-term decisions. It would likely take a lifetime to train. So you see now why math has not been accomplished by RL yet, but I don't think the faculty would be impossible to build into an ML architecture.

An RL bot that does math would likely spin on its own output for many many frames, until deciding that it is done, much like a person.


It's also hard to know what the LLM has reasoned out vs has memorized.

I like the very last example in my tongue-in-cheek article, https://nt4tn.net/articles/aixy.html

Certainly the LLM didn't derive Fermat's theorem on sums of two squares under the hood (and, of course, very obviously didn't prove it correct-- as the code is technically incorrect for 2), but I'm somewhat doubtful that there was any function exactly like the template in codex's training set either (at least I couldn't quickly find any published code that did that). The line between creating something and applying a memorized fact in a different context is not always super clear.
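
For the curious: the theorem says an odd prime is a sum of two squares exactly when it is 1 mod 4, and 2 = 1^2 + 1^2 is the edge case that a bare p % 4 == 1 shortcut drops. A brute-force cross-check (my sketch here, not the codex output) makes the mismatch visible:

    def is_sum_of_two_squares(p: int) -> bool:
        # Brute force: does a^2 + b^2 == p have a solution in non-negative integers?
        a = 0
        while a * a <= p:
            b = 0
            while a * a + b * b <= p:
                if a * a + b * b == p:
                    return True
                b += 1
            a += 1
        return False

    for p in (2, 3, 5, 7, 13, 17, 19):
        shortcut = (p % 4 == 1)            # the test that forgets about 2
        print(p, is_sum_of_two_squares(p), shortcut)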


Thinking is about associations and object visualisation. Surely a non-human system can build those, right? Pointing only to a single product exposed to the public does not establish a theoretical limit.


You're underestimating the power of LLM's.

I'll address two of your points as the other two stem from this.

They can't backtrack, but that's purely a design choice and can be trained around; there's no need to simulate at random until it gets the answer. If allowed to review its prior answers and take them into account, it often can reason its way to a better answer, and furthermore can break down problems. This is easily demonstrated by how accuracy improves when you ask it to explain its reasoning as it calculates (break the problem down into smaller pieces). The same goes for humans: large mathematical problems are solved using learned methods to break down and simplify calculations into those easier for us to calculate and build up from.

If the model were able to self-adjust its weightings based on its findings, this would further improve it (another design limitation we'll eventually get to improve: reinforcement learning). Much like 2+2=4 is your instantaneous answer: the neural connection has been made so strong in our brains by constant emphasis that we no longer need to think of an abacus each time we get to the answer 4.

You're also ignoring the emergent properties of these LLMs. They're obviously not yet at human level, but they do understand the underlying meaning and can reason with it. Semantic search/embeddings are evidence of this.


The thing is that LLMs can point out a logic error in their reasoning if specifically asked to do so.

So maybe OpenAI just slapped an RL agent on top of the next-token generator.


> 4) they (current LLMs) cannot backtrack when they find that what they already wrote turned out not to lead to a solution, and it is too expensive to give them the thousands of restarts they'd require to randomly guess their way through the problem if you did give them that facility

This sounds like a reward function? If correctly implemented couldn't it enable an LLM to self-learn?


Specifically what deep-Q learning (as in Q*?) does....
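
For reference, the core of (tabular) Q-learning is a single value-update rule. A toy sketch on a 5-state corridor, purely to show the update; it has nothing to do with whatever Q* actually is:

    import random
    from collections import defaultdict

    # Toy corridor: states 0..4, actions -1/+1, reward 1 for reaching state 4.
    ACTIONS = (-1, +1)
    Q = defaultdict(float)
    alpha, gamma = 0.5, 0.9

    def step(s, a):
        s2 = min(max(s + a, 0), 4)
        return s2, (1.0 if s2 == 4 else 0.0), s2 == 4

    for _ in range(3000):                  # episodes under a purely random policy
        s, done = 0, False
        while not done:
            a = random.choice(ACTIONS)
            s2, r, done = step(s, a)
            # Bellman update: move Q(s,a) toward reward + discounted best future value.
            best_next = 0.0 if done else max(Q[(s2, x)] for x in ACTIONS)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2

    print([round(Q[(s, +1)], 2) for s in range(4)])   # roughly [0.73, 0.81, 0.9, 1.0]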


To some degree you are right, but I think you forget that the things they solved already (talking and reasoning about a world that was only presented in the form of abstractions, i.e. words) were supposed to be harder than having a good understanding of numbers, quantities, and logical terms.

My guess is, that they saw the problems ChatGPT has today and worked on solving those problems. And given how important numbers are, they tried to fix how ChatGPT handles/understands numbers. After doing that, they saw how this new version performed much better and predicted, that further work in this area could lead to solving real-world math problems.

I don't think that we will be presented with the highway to singularity, but probably one more step in that direction.


The reason LLMs solve school problems is that they've been trained on solutions. The problems are actually very repetitive; it's not surprising that for each 'new' one there was something similar in the training set. For research-level problems there is nothing in the training set. That's why they don't perform well.

Just today I asked GPT4 a simple task: given the mouse position in a zoomed and scrolled image, find its position in the original image. GPT4 happily wrote the code, but it was completely wrong. I had to fix it manually.
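
For what it's worth, the mapping itself is a one-liner once the viewport parameters are pinned down. A sketch with assumed conventions (zoom is the scale factor, (scroll_x, scroll_y) is the offset of the visible viewport inside the zoomed image; these names are mine, not GPT4's):

    def view_to_image(mouse_x, mouse_y, zoom, scroll_x, scroll_y):
        # The viewport shows the zoomed image starting at (scroll_x, scroll_y),
        # so undo the scroll first, then undo the zoom.
        return (mouse_x + scroll_x) / zoom, (mouse_y + scroll_y) / zoom

    # Example: 2x zoom, viewport scrolled 100 px right and 50 px down.
    print(view_to_image(30, 40, 2.0, 100, 50))   # -> (65.0, 45.0)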

However, the performance can be increased if there are several threads working on the solution, some suggesting and others analyzing the solution(s). This will increase the size of the 'active' memory, at least, and decrease the load on each thread, making them more specialized and deeper. This requires more resources, of course, and good management of the task split, maybe with a dedicated thread for that.


1. OpenAI researchers used loaded and emotional words, implying shock or surprise. It's not easy to impress an OpenAI researcher like this, and above all, they have understood for many years the difference in difficulty between teaching an AI grade school math and complex math. They also understand that solving math with any form of reliability is only an emergent property in quite advanced LLMs.

2. Often, research is done on toy models, and if this were such a model, acing grade school problems (as per the article) would be quite impressive to say the least, as this ability simply isn't emergent early in current LLMs.

What I think might have happened here is a step forward in AI capacity that has surprised researchers not because it is able to do things it couldn't at all do before, but how _early_ it is able to do so.


I don't know about Q* of course, but all the tests I've run with GPT4, and everything I've read and seen about it, show that it is unable to reason. It was trained on an unfathomable amount of data, so it can simulate reasoning very well, but it is unable to reason.


What is the difference between simulating reasoning very well and "actual" reasoning?


I think the poster meant that it's capable of having a high probability of correct reasoning - simulating reasoning is lossy, actual reasoning is not. Though, human reasoning is still lossy.


Being able to extrapolate with newly found data.

You can get a LLM to simulate it "discovering" the pythagorean theorem, but can it actually, with the knowledge that was available at the time, discover the pythagorean theorem by itself?

Any parent will tell you, it's easy to simulate discovery and reasoning, it's a trick played for kids all the time. The actual, real stuff, that's way harder.


Probably best to say "simulate the appearance of reasoning": looks and feels 100% acceptable at a surface level, but the actual details and conclusions are completely wrong / do not follow.


Actual reasoning shows the understanding and use of a model of the key features of the underlying problem/domain.

As a simple example that you can replicate using ChatGPT, ask it to solve some simple maths problem. Very frequently you will get a solution that looks like reasoning but is not, and reveals that it does not have an actual model of the underlying maths but is in fact doing text prediction based on a history of maths. For example see here[1]. I ask it for some quadratics in x with some specification on the number of roots. It gives me what looks at first glance like a decent answer. Then I ask the same exact question but asking for quadratics in x and y[2]. Again the answer looks plausible except that for the solution "with one real root" it says the solution has one real root when x + y = 1. Well, there are infinitely many real values of x and y such that x + y = 1, not one real root. It looks like it has solved the problem but instead it has simulated the solving of the problem.

Likewise stacking problems, used to check for whether an AI has a model of the world. This is covered in "From task structures to world models: What do LLMs know?"[3] but for example here[4] I ask it whether it's easier to balance a barrel on a plank or a plank on a barrel. The model says it's easier to balance a plank on a barrel with an output text that simulates reasoning discussing center of mass and the difference between the flatness of the plank and the tendency of the barrel to roll because of its curvature. Actual reasoning would say to put the barrel on its end so it doesn't roll (whether you put the plank on top or not).

[1] https://chat.openai.com/share/64556be8-ad20-41aa-99af-ed5a42...

[2] https://chat.openai.com/share/2cd39197-dc09-4d07-a0d6-6cd800...

[3] https://arxiv.org/abs/2310.04276

[4] https://chat.openai.com/share/4b631a92-0d55-4ae5-8892-9be025...


I generally agree with what you're saying and the first half of your answer makes perfect sense but I think the second is unfair (i.e. "[is it] easier to balance a barrel on a plank or a plank on a barrel"). It's a trick question and "it" tried to answer in good faith.

If you were to ask the same question of a real person and they replied with the exact same answer you could not conclude that person was not capable of "actual reasoning". It's a bit of witch-hunt question set to give you the conclusion you want.


I should have said, as I understand it, the point of this type of question is not that one particular answer is the right answer and another is wrong, it's that often the model in giving an answer will do something really weird that shows that it doesn't have a model of the world.


I didn't make up this methodology and it's genuinely not a trick question (or not intended as such); it's a simple example of an actual class of questions that researchers ask when trying to determine whether a model of the world exists. The paper I linked uses a ball and a plank iirc. Often they use a much wider range of objects, e.g. something like "Suggest a stable way of stacking a laptop, a book, 4 wine glasses, a wine bottle and an orange" is one that I've seen in a paper for example.


ok I believe it may not have been intended as a trick but I think it is. As a human, I'd have assumed you meant the trickier balancing scenario i.e. the plank and barrel on its side.

The question you quoted ("Suggest a stable way of stacking a laptop, a book, 4 wine glasses, a wine bottle and an orange") I would consider much fairer, and ChatGPT 3.5 gives a perfectly "reasonable" answer:

https://chat.openai.com/share/fdf62be7-5cb2-4088-9131-40e089...


What's interesting about that one is I think that specific set of objects is part of its training set because when I have played around with swapping out a few of them it sometimes goes really bananas.


Actual reasoning is made up of various biological feedback loops that happen in the body and brain. Essentially, your physical senses give you the ability to reason in the first place; without the eyes, ears, etc. there is no ability to learn basic reasoning, which is why kids who are blind or mute from birth have huge issues learning about object permanence, spatial awareness, etc. You can't expect human reasoning without human perception.

My question is: how does the AI perceive? Basically, how good is the simulation for its perception? If we know that, then we can probably assess its ability to reason, because we can compare it to the closest benchmark we have (your average human being). How do AIs see? How did they learn concepts from strings of words and pixels? How does a concept learnt in text carry through to images of colors, of shapes? Does it show a transfer of conceptual understanding across both two and three dimensional shapes?

I know these are more questions than answers, but they're just things that I've been wondering about.


This ship can't swim because only living creatures swim. It's true but it only shows your definition sucks.


Similarly, AlphaGo and Stockfish are only able to simulate reasoning their way through a game of Go or a game of Chess.

That simulated reasoning is enough to annihilate any human player they're faced with.

As Dijkstra famously said, "whether Machines Can Think... is about as relevant as the question of whether Submarines Can Swim".

Submarines don't swim, cars don't walk or gallop, cameras don't paint or draw... So what?

Once AI can simulate reasoning better than we can do the genuine thing, the question really becomes utterly irrelevant to the likely outcome.


I feel very comfortable saying that while the ability to solve grade school maths is not a predictor of abilities at a research level, the advances needed to solve 1 and 2 will mean improving results across the board unless you take shortcuts (e.g. adding an "add" instruction as proposed elsewhere). If you actually dig into prompting an LLM to follow steps for arithmetic, what you quickly see is that the problem has not been the ability to reason on the whole (which is not to suggest that the ability to reason is good enough), but the ability to consistently and precisely follow steps a sufficient number of times.

It's acting like a bored child who hasn't had following the steps and verifying the results repetitively drilled into it in primary school. That is not to say that their ability to reason is sufficient to reason at an advanced level yet, but so far what has hampered a lot of it has been far more basic.

Ironically, GPT4 is prone to taking shortcuts and making use of the tooling enabled for it to paper over its shortcomings, but at the same time, having pushed it until I got it to actually do arithmetic on large numbers step by step, it seems to do significantly better than it used to at systematically and repetitively following the methods it knows, and at applying "manual" sanity checks to its results afterward.

As for lemma conjecturing, there is research ongoing, and while it's by no means solved, it's also not nearly as dire as you suggest. See e.g.[1]

That's not to suggest its reasoning abilities are sufficient, but I also don't think we've seen anything to suggest we're anywhere close to hitting the ceiling of what current models can be taught to do, even before considering advancements in the tooling around them, such as giving them "methods" to work to and a loop with injected feedback, access to tools, and working memory.

[1] https://research.chalmers.se/en/publication/537034


Did anyone claim that it would be a predictor of solving math problems at a research level? Inasmuch as we can extrapolate from the few words in the article, it seems more likely that the researchers working on this project identified some emergent reasoning abilities exemplified with grade level math. Math literacy/ability comparable to the top 0.1% of humans is not the end goal of OpenAI, "general intelligence" is. I have plenty of people in my social circle who are most certainly "generally intelligent" yet have no hope of attaining those levels of mathematical understanding.

Also note that we don't know if Q* is just a "current LLM" (with some changes)


"A Mathematician" (Lenat and co.) did indeed attempt to approach creative theorem development from a radically different approach (syllogistic search-space exploration, not dissimilar to forward-chaining in Prolog), although they ran into problems distinguishing "interesting" results from merely true results: https://web.archive.org/web/20060528011654/http://www.comp.g...


I don't understand your thesis here it seems self-contradictory:

1. "I don't think this is real news / important because solving grade school math is not a predictor of ability to do complex reasoning."

2. "LLMs can't solve grade school math because they're bad at arithmetic, algebra and most importantly reasoning."

So... from 2 it automatically follows that LLMs with sufficiently better math may be sufficiently better at reasoning, since as you said "most importantly" reasoning is relevant for their ability to do math. Saying "most importantly reasoning" and then saying that reasoning is irrelevant if they can do math is odd.


Everything you said about LLMs being "terrible at X" is true of the current generation of LLM architectures.

From the sound of it, this Q* model has a fundamentally different architecture, which will almost certainly make some of those issues not terrible any more.

Most likely, the Q* design is very similar to the one suggested recently by one of the Google AI teams: doing a tree search instead of greedy next token selection.

Essentially, current-gen LLMs predict a sequence of tokens: A->B->C->D, etc... where the next "E" token depends on {A,B,C,D} and then is "locked in". While we don't know exactly how GPT4 works, reading between the lines of the leaked info it seems that it evaluates 8 or 16 of these sequences in parallel, then picks the best overall sequence. On modern GPUs, small workloads waste the available computer power because of scheduling overheads, so "doing redundant work" is basically free up to a point. This gives GPT4 a "best 1 of 16" output quality improvement.

That's great, but each option is still a linear greedy search individually. Especially for longer outputs the chance of a "mis-step" at some point goes up a lot, and then the AI has no chance to correct itself. All 16 of the alternatives could have a mistake in them, and now its got to choose between 16 mistakes.

It's as if you were trying to write a maths proof, asked 16 students, and instructed them to not cooperate and to write their proofs left-to-right, top-to-bottom without pausing, editing, or backtracking in any way! I'd like to see how "smart" humans would be at maths under those circumstances.

This Q* model likely does what Google suggested: Do a tree search instead of a strictly linear search. At each step, the next token is presented as a list of "likely candidates" with probabilities assigned to each one. Simply pick to "top n" instead of the "top 1", branch for a bit like that, and then prune based on the best overall confidence instead of the best next token confidence. This would allow a low-confidence next token to be selected, as long as it leads to a very good overall result. Pruning bad branches is also effectively the same as back-tracking. It allows the model to explore but then abandon dead ends instead of being "forced" to stick with bad chains of thought.
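
To make the "top n, then prune on overall confidence" idea concrete, here's a toy beam search over a faked next-token distribution. The next_candidates table stands in for a real model and all of the numbers are invented; the point is only that the best overall sequence need not sit on the greedy per-token path.

    import math

    def next_candidates(prefix):
        # Stand-in for a language model: returns (token, probability) pairs.
        table = {
            (): [("2+2", 0.55), ("2*3", 0.45)],
            ("2+2",): [("=5", 0.55), ("=4", 0.45)],   # greedy decoding goes down this branch
            ("2*3",): [("=6", 0.95), ("=5", 0.05)],
        }
        return table.get(prefix, [("<end>", 1.0)])

    def beam_search(width=2, steps=2):
        beams = [((), 0.0)]                            # (prefix, cumulative log-prob)
        for _ in range(steps):
            expanded = []
            for prefix, logp in beams:
                for tok, p in next_candidates(prefix):
                    expanded.append((prefix + (tok,), logp + math.log(p)))
            # Prune on the best overall sequence score, not the best next token.
            beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:width]
        return beams

    for prefix, logp in beam_search():
        print(" ".join(prefix), round(math.exp(logp), 2))
    # Top result is "2*3 =6" (0.43), which greedy next-token decoding would never reach.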

What's especially scary -- the type of scary that would result in a board of directors firing an overly commercially-minded CEO -- is that naive tree searches aren't the only option! Google showed that you can train a neural network to get better at tree search itself, making it exponentially more efficient at selecting likely branches and pruning dead ends very early. If you throw enough computer power at this, you can make an AI that can beat the world's best chess champion, the world's best Go player, etc...

Now apply this "AI-driven tree search" to an AI LLM model and... oh-boy, now you're cooking with gas!

But wait, there's more: GPT 3.5 and 4.0 were trained with either no synthetically generated data, or very little as a percentage of their total input corpus.

You know what is really easy to generate synthetic training data for? Maths problems, that's what.

Even problems up to the level of "solve this hideous integral that would take a human weeks with pen and paper" can be bulk generated and fed into it using computer algebra software like Wolfram Mathematica or whatever.

If they cranked out a few terabytes of randomly generated maths problems and trained a tree-searching LLM that has more weights than GPT4, I can picture it being able to solve pretty much any maths problem you can throw at it. Literally anything Mathematica could do, except with English prompting!
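
A sketch of the cheapest possible version of that pipeline, with SymPy standing in for Mathematica as the ground-truth oracle (the problem template is deliberately trivial; the point is only that labelled examples are free to mass-produce):

    import random
    import sympy as sp

    x = sp.symbols("x")

    def random_problem():
        # Build a random polynomial, then let the CAS produce the ground-truth answer.
        poly = sum(random.randint(-9, 9) * x**k for k in range(random.randint(2, 5)))
        question = f"Integrate {sp.sstr(poly)} with respect to x."
        answer = sp.sstr(sp.integrate(poly, x))
        return question, answer

    for _ in range(3):
        q, a = random_problem()
        print(q, "->", a)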

Don't be so confident in the superiority of the human mind. We all thought Chess was impossible for computers until it wasn't. Then we all moved the goal posts to Go. Then English text. And now... mathematics.

Good luck with holding on to that crown.


> We all thought Chess was impossible for computers until it wasn't.

I don't know who 'we' is, but chess was programmed for computers before powerful enough computers existed to run the programs, with the 'hardware' being people computing the next move by hand.

https://en.wikipedia.org/wiki/Turochamp


The point was to not overvalue the superiority of humans, not that chess engines didn't exist.


I immediately thought of A* path finding, I'm pretty sure Q* is the LLM "equivalent". Much like you describe.


LLMs by themselves don’t learn from past mistakes, but you could alternate inference steps with fine-tuning/retraining steps.

Also, you can store failed attempts and lessons learned in context.
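
A sketch of that second idea, with ask_model and check_solution as hypothetical stand-ins for the model call and some external verifier:

    def solve_with_memory(problem, ask_model, check_solution, max_attempts=10):
        """Retry the problem, feeding each failed attempt back into the prompt so the
        model can at least avoid repeating the exact same dead end."""
        failures = []
        for _ in range(max_attempts):
            prompt = problem
            if failures:
                prompt += "\n\nPrevious attempts that did not work:\n" + "\n".join(failures)
            attempt = ask_model(prompt)
            if check_solution(problem, attempt):
                return attempt
            failures.append(attempt)
        return None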


What amazes me is how close it gets to the right answer, though. Pick a random 10-digit number, then ask for the next 20 numbers in the sequence.

I feel like the magic in these LLMs is in how well they work in stacks, trees, or in sequence. They become elements of other data structures. Consider a network of these, combined with other specialized systems and an ability to take and give orders. With reinforcement learning, it could begin building better versions of itself.


Did they say it was an LLM? I didn’t see that in the reporting.


What do you think of integrating propositional logic, first-order logic, and SAT solvers into LLM output? I.e. forcing each symbol an LLM outputs to have its place in a formal proposition, and letting the user's prompt require that some parts be satisfiable.

I know this is not how we humans craft our thoughts, but maybe an AI can optimize the conjunction of these tools to death. The LLM would just be a universal API to the core of formal logic.
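
One concrete way to wire that up: have the LLM emit its claim as a formula and hand it to an off-the-shelf solver. A toy example with Z3 (the formula is hand-written here, standing in for model output):

    from z3 import Bools, Solver, Implies, And, Not, sat

    # Suppose the LLM's argument, translated to propositional logic, is:
    # "if it rains the ground is wet" and "the ground is not wet"; claim: "it is not raining".
    rain, wet = Bools("rain wet")
    premises = And(Implies(rain, wet), Not(wet))
    claim = Not(rain)

    # The claim follows from the premises iff (premises AND NOT claim) is unsatisfiable.
    s = Solver()
    s.add(And(premises, Not(claim)))
    print("claim follows" if s.check() != sat else "claim does not follow")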


ChatGPT (3.5) seems to do some rudimentary backtracking when told it's wrong enough times. However, it does seem to do very poorly in the logic department. LLMs can't seem to pick out nuance and separate similar ideas that are technically/logically different.

They're good at putting things together commonly found together but not so good at separating concepts back out into more detailed sub pieces.


I've tested GPT-4 on this and it can be induced to give up on certain lines of argument after recognising they aren't leading anywhere and to try something else. But it would require thousands (I'm really understating here) of restarts to get through even fairly simple problems that professional mathematicians solve routinely.

Currently the context length isn't even long enough for it to remember what problem it was solving. And I've tried to come up with a bunch of ways around this. They all fail for one reason or another. LLMs are really a long, long way off managing this efficiently in my opinion.


Weird time estimate given that a little more than a year ago, the leading use of LLMs was generating short coherent paragraphs (3-4 sentences)


I've pasted docs and error messages into GPT-3.5 and it has admitted it's wrong, but usually it will go through a few different answers before returning to the original and looping.


> They learn from failed attempts in ways that LLMs do not seem to. The material they are trained on surely contributes to this problem.

Transformer models do learn from their mistakes, but only during the training stage.

There’s no feedback loop during inference, and perhaps there needs to be something like real-time fine-tuning.
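
Very roughly, and purely as speculation about what such a feedback loop could look like (assuming a HuggingFace-style causal LM, with the model, tokenizer, and a PyTorch optimizer passed in; this is not a description of any existing system):

    def online_correction_step(model, tokenizer, prompt, corrected_output, optimizer):
        """When a mistake is caught at inference time, immediately take one gradient
        step on the corrected (prompt, output) pair before continuing to serve."""
        model.train()
        batch = tokenizer(prompt + corrected_output, return_tensors="pt")
        # Standard causal-LM objective: predict each token from the ones before it.
        loss = model(**batch, labels=batch["input_ids"]).loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        model.eval()
        return loss.item()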


> I feel very comfortable saying, as a mathematician, that the ability to solve grade school maths problems would not be at all a predictor of ability to solve real mathematical problems at a research level.

At some point in the past, you yourself were only capable of solving grade school maths problems.


The statement you quoted also holds for humans. Of those who can solve grade school math problems, very, very few can solve mathematical problems at a research level.


We're moving the goalposts all the time. First we had the Turing test; now AI solving math problems "isn't impressive", and any small mistake is proof it cannot reason at all. Meanwhile 25% of humans think the Sun revolves around the Earth and 50% of students get the bat and ball problem wrong.
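
For anyone who hasn't seen it: a bat and a ball cost $1.10 together, and the bat costs $1.00 more than the ball; what does the ball cost? The intuitive answer of $0.10 is wrong. Checking it with sympy, purely for illustration:

    import sympy as sp

    bat, ball = sp.symbols("bat ball")
    solution = sp.solve([sp.Eq(bat + ball, sp.Rational(110, 100)),  # together they cost $1.10
                         sp.Eq(bat, ball + 1)],                     # the bat costs $1.00 more
                        [bat, ball])
    print(solution)  # {bat: 21/20, ball: 1/20}, i.e. the ball costs 5 cents, not 10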


Thank you for mentioning the "bat and ball" problem. Having neither a math nor CS background, I hadn't heard of it - and got it wrong. And reflecting on why I got it wrong I gained a little understanding of my own flawed mind. Why did I focus on a single variable and not a relationship? It set my mind wandering and was a nice morsel to digest with my breakfast. Thanks!


You missed the point. Deep learning models are in the early stages of development.

With recent advancements they can already outperform humans at many tasks that were considered to require AGI-level machine intelligence just a few years ago.


> The reason LLMs fail at solving mathematical problems is because

...because they are too small and have too few weights. Cats cannot solve mathematical problems either, but unlike cats, neural networks evolve.


>Cats cannot solve mathematical problems either, but unlike cats, neural networks evolve.

Cats evolve plenty; the selective pressure towards mathematical reasoning has just stalled of late, what with the cans of food and the humans.


"Also, Crysis runs like crap on my Commodore 64."


Hubris


Whose, in this instance? I can see an argument for both


> 1) they are terrible at arithmetic, 2) they are terrible at algebra

The interaction can be amusing: the model "proves" algebra non-theorems by cranking through examples until an arithmetic mistake finally produces a "counter-example."

It's like https://xkcd.com/882/ for theorems.


Just wait a little bit.

You won't be better than a huge GPU cluster with Monte Carlo search and computer verification for much longer.

Your job will be more about picking out the interesting results than doing the work of finding them in the first place.


Good point. What would these AI people know about AI? You’re right, what they’re doing will never work

You should make your own, shouldn’t take more than a weekend, right?


it's always nice to see HN commenters with so much confidence in themselves that they feel they know a situation better than the people who are actually in the situation being discussed.

Do you really believe that they don't have skilled people on staff?

Do you really believe that your knowledge of what OpenAI is doing is a superset of the knowledge of the people who work at OpenAI?

give me 0.1% of your confidence and I would be able to change the world.


The people inside a cult are not the most trustworthy sources for what the cult is doing.


This is defamatory and unfounded.

OpenAI is exploring the limits of computation.

The more I see this kind of unfounded slander, the more confident I become that this outfit might be the most important on the face of planet Earth.

So many commenters here are starting to sound like priests of the Spanish Inquisition. Do you seriously expect a community of technologists and science advocates to be fearful of such assertions without evidence? It's a waste of breath. All credibility just leaves the room instantly.


it is very odd that you consider OpenAI a cult.

go spend time in a cult, then come back and tell us how much of a cult OpenAI is.

you know nothing


It's a text generator that spits out tokens. It has absolutely no understanding of what it's saying. We as humans are attaching meaning to the generated text.

It's the humans that are hallucinating, not the text generator.


They've already researched this and have found models inside the LLM, such as a map of the world - https://x.com/wesg52/status/1709551516577902782. Understanding is key to how so much data can be compressed into an LLM. There really isn't a better way to store all of it than to genuinely understand it.



