OpenAI researchers warned board of AI breakthrough ahead of CEO ouster

wbhart · on Nov 23, 2023

I feel very comfortable saying, as a mathematician, that the ability to solve grade school maths problems would not be at all a predictor of ability to solve real mathematical problems at a research level.

The reason LLMs fail at solving mathematical problems is because: 1) they are terrible at arithmetic, 2) they are terrible at algebra, but most importantly, 3) they are terrible at complex reasoning (more specifically they mix up quantifiers and don't really understand the complex logical structure of many arguments) 4) they (current LLMs) cannot backtrack when they find that what they already wrote turned out not to lead to a solution, and it is too expensive to give them the thousands of restarts they'd require to randomly guess their way through the problem if you did give them that facility

Solving grade-school problems might mean progress in 1 and 2, but that is not at all impressive, as there are perfectly good tools out there that solve those problems just fine, and old-style AI researchers have built perfectly good tools for 3. The hard problem to solve is problem 4, and this is something you teach people how to do at a university level.

(I should add that another important problem is what is known as premise selection. I didn't list that because LLMs have actually been shown to manage this ok in about 70% of theorems, which basically matches records set by other machine learning techniques.)

(Real mathematical research also involves what is known as lemma conjecturing. I have never once observed an LLM do it, and I suspect they cannot do so. Basically the parameter set of the LLM dedicated to mathematical reasoning is either large enough to model the entire solution from end to end, or the LLM is likely to completely fail to solve the problem.)

I personally think this entire article is likely complete bunk.

Edit: after reading replies I realise I should have pointed out that humans do not simply backtrack. They learn from failed attempts in ways that LLMs do not seem to. The material they are trained on surely contributes to this problem.

nostrademons · on Nov 23, 2023

What I wonder, as a computer scientist:

If you want to solve grade school math problems, why not use an 'add' instruction? It's been around since the 50s, runs a billion times faster than an LLM, every assembly-language programmer knows how to use it, every high-level language has a one-token equivalent, and doesn't hallucinate answers (other than integer overflow).

We also know how to solve complex reasoning chains that require backtracking. Prolog has been around since 1972. It's not used that much because that's not the programming problem that most people are solving.

Why not use a tool for what it's good for and pick different tools for other problems they are better for? LLMs are good for summarization, autocompletion, and as an input to many other language problems like spelling and bigrams. They're not good at math. Computers are really good at math.

There's a theorem that an LLM can compute any computable function. That's true, but so can lambda calculus. We don't program in raw lambda calculus because it's terribly inefficient. Same with LLMs for arithmetic problems.

seanhunter · on Nov 23, 2023

There is a general result in machine learning known as "the bitter lesson"[1], which is that methods which come from specialist knowledge tend to be beaten by methods which rely on brute force computation in the long run because of Moore's law and the ability to scale things by distributed computing. So the reason people don't use the "add instruction"[2] for example is that over the last 70 years of attempting to build out systems which do exactly what you are proposing, they have found that not to work very well whereas sacrificing what you are calling "efficiency" (which they would think of as special purpose domain-specific knowledge) turns out to give you a lot in terms of generality. And they can make up the lost efficiency by throwing more hardware at the problem.

[1] http://www.incompleteideas.net/IncIdeas/BitterLesson.html

[2] Which the people making these models are familiar with. The whole thing is a trillion+ parameter linear algebra crunching machine after all.

qsort · on Nov 23, 2023

As someone with a CS background myself, I don't think this is what GP was talking about.

Let's forget for a moment that stuff has to run on an actual machine. If you had to represent a quadratic equation, would you rather write:

(a) x^2 + 5x + 4 = 0

(b) the square of the variable plus five times the variable plus four equals zero

When you are trying to solve problems with a level of sophistication beyond the toy stuff you usually see in these threads, formal language is an aid rather than an impediment. The trajectory of every scientific field (math, physics, computer science, chemistry, even economics!) is away from natural language and towards formal language, even before computers, precisely for that reason.

We have lots of formal languages (general-purpose programming languages, logical languages like Prolog/Datalog/SQL, "regular" expressions, configuration languages, all kinds of DSLs...) because we have lots of problems, and we choose the representation of the problem that most suits our needs.

Unless you are assuming you have some kind of superintelligence that can automagically take care of everything you throw at it, natural language breaks down when your problem becomes wide enough or deep enough. In a way this is like people making Rube-Goldberg contraptions with Excel. 50% of my job is cleaning up that stuff.

seanhunter · on Nov 23, 2023

I quite agree and so would Wittgenstein, who (as I understand it) argued that precise language is essential to thought and reasoning[1]. I think one of the key things here is often what we think of as reasoning boils down to taking a problem in the real world and building a model of it using some precise language that we can then apply some set of known tools to deal with. Your example of a quadratic is perfect, because of course now I see (a) I know right away that it's an upwards-facing parabola with a line of symmetry at -5/2, that the roots are at -4 and -1 etc whereas if I saw (b) I would first have to write it down to get it in a proper form I could reason about.

I think this is a fundamental problem with the "chat" style of interaction with many of these models (that the language interface isn't the best way of representing any specific problem even if it's quite a useful compromise for problems in general). I think an intrinsic problem of this class of model is that they only have text generation to "hang computation off" meaning the "cognative ability" (if we can call it that) is very strongly related to how much text it's generating for a given problem which is why that technique of prompting using chain of thought generates much better results for many problems[2].

[1] Hence the famous payoff line "whereof we cannot speak, thereof we must remain silent"

[2] And I suspect why in general GPT-4 seems to have got a lot more verbose. It seems to be doing a lot of thinking out loud in my use, which gives better answers than if you ask it to be terse and just give the answer or to give the answer first and then the reasoning, both of which generally generate inferior answers in my experience and in the research eg https://arxiv.org/abs/2201.11903

qsort · on Nov 23, 2023

> I quite agree and so would Wittgenstein

It depends on whether you ask him before or after he went camping -- but yeah, I was going for an early-Wittgenstein-esque "natural language makes it way too easy to say stuff that doesn't actually mean anything" (although my argument is much more limited).

> I think this is a fundamental problem with the "chat" style of interaction

The continuation of my argument would be that if the problem is effectively expressible in a formal language, then you likely have way better tools than LLMs to solve it. Tools that solve it every time, with perfect accuracy and near-optimal running time, and critically, tools that allow solutions to be composed arbitrarily.

Alpha Go and NNUE for computer chess, which are often cited for some reason as examples of this brave new science, would be completely worthless without "classical" tree search techniques straight out of the Russel-Norvig.

Hence my conclusion, contra what seems to be the popular opinion, is that these tools are potentially useful for some specific tasks, but make for very bad "universal" tools.

vintermann · on Nov 23, 2023

There are some domains that are in the twilight zone between language and deductive, formal reasoning. I've been into genealogy last year. It's very often deductive "detective work": say there are four women in a census with the same name and place that are listed on a birth certificate you're investigating. Which of them is it? You may rule one out on hard evidence (census suggests she would have been 70 when the birth would have happened), one on linked evidence (this one is the right age, but it's definitively the same one who died 5 years later and we know the child's mother didn't), one on combined softer evidence (she was in a fringe denomination and at the upper end of the age range) then you're left with one, etc.

Then as you collect more evidence you find that the age listed on the first one in the census was wildly off due to a transcription error and it's actually her.

You'd think some sort of rule-based system and database might help with these sorts of things. But the historical experience of expert system is that you then often automate the easy bits at the cost of demanding even more tedious data-entry. And you can't divorce data entry and deduction from each other either, because without context, good luck reading out a rare last name in the faded ink of some priest's messy gothic handwriting.

It feels like language models should be able to help. But they can't, yet. And it fundamentally isn't because they suck at grade school math.

Even linguistics, not something I know much about but another discipline where you try to make deductions from tons and tons of soft and vague evidence - you'd think language models, able to produce fluent text in more languages than any human, might be of use there. But no, it's the same thing: it can't actually combine common sense soft reasoning and formal rule-oriented reasoning very well.

igleria · on Nov 23, 2023

> You'd think some sort of rule-based system and database might help with these sorts of things.

sounds like belief change systems (a bit) to me!

https://plato.stanford.edu/entries/logic-belief-revision/

ben_w · on Nov 23, 2023

I assumed seanhunter was suggesting getting the LLM to convert x^2 + 5x + 4 = 0 to a short bit of source code to solve for x.

IIRC Wolfram Alpha has (or had, hard to keep up) a way to connect with ChatGPT.

seanhunter · on Nov 23, 2023

It does. This is the plugins methodology described in the toolformers paper which I've linked elsewhere[1]. The model learns that for certain types of problems certain specific "tools" are the best way to solve the problem. The problem is of course it's simple to argue that the LLM learns to use the tool(s) and can't reason itself about the underlying problem. The question boils down to whether you're more interested in machines which can think (whatever that means) or having a super-powered co-pilot which can help with a wide variety of tasks. I'm quite biased towards the second so I have the wolfram alpha plugin enabled in my chat gpt. I can't say it solves all the math-related hallucinations I see but I might not be using it right.

[1] But here it is again https://arxiv.org/abs/2302.04761

vidarh · on Nov 23, 2023

GPT4 does even without explicitly enabling plugins now, by constructing Python. If you want it to actually reason through it, you now need to ask it, sometimes fairly forcefully/in detail, before it will indulge you and not omit steps. E.g. see [1] for the problem given above.

But as I noted elsewhere, training its ability to do it from scratch matters not for the ability to do it from scratch, but for the transferability of the reasoning ability. And so I think that while it's a good choice for OpenAI to make it automatically pick more effective strategies to give the answer it's asked for, there is good reason for us to still dig into its ability to solve these problems "from scratch".

[1] https://chat.openai.com/share/694251c9-345b-4433-a856-7c38c5...

Jeff_Brown · on Nov 23, 2023

Ideally we'd have both worlds -- but if we're aiming for AGI and we have to choose, using a language that lets you encode everything seems preferable to one that only lets you talk about, say, constrained maximization problems.

wegfawefgawefg · on Nov 23, 2023

the ml method doesnt require you to know how to solve the problem at all, and could someday presumably develop novel solutions. not just high efficiency symbolic graph search.

omnicognate · on Nov 23, 2023

The bitter lesson isn't a "general result". It's an empirical observation (and extrapolation therefrom) akin to Moore's law itself. As with Moore's law there are potential limiting factors: physical limits for Moore's law and availability and cost of quality training data for the bitter lesson.

rcarr · on Nov 24, 2023

Surely the "efficiency" is just being transferred from software to hardware e.g the hardware designers are having to come up with more efficient designs, shrink die sizes etc to cope with the inefficiency of the software engineers? We're starting to run into the limits of Moore's law in this regard when it comes to processors, although it looks like another race might be about to kick off for AI but with RAM instead. When you've got to the physical limits of both, is there anywhere else to go other than to make the software more efficient?

patrick451 · on Nov 23, 2023

When you say "a general result", what does that mean? In my world, a general result is something which is rigorously proved, e.g., the fundamental theorem of algebra. But this seems to be more along the lines of "we have lots of examples of this happening".

I'm certainly no expert, but it seems to me that Wolfram Alpha provides a counterexample to some extent, since they claim to fuse expert knowledge and "AI" (not sure what that means exactly). Wolfram Alpha certainly seems to do much better at solving math problems than an LLM.

seanhunter · on Nov 24, 2023

As someone else pointed out I've used that term wrong. Rule of thumb/observation you might better say.

rgavuliak · on Nov 23, 2023

I would mention, that while yes, you can just throw computational power at the problem, the addition of human expertise didn't disappear. It moved from creating an add instruction, to coming up with a new Neural Net Architecture, and we've seen a lot of the ideas being super useful and pushing the boundaries.

panarky · on Nov 23, 2023

>> the ability to solve grade school maths problems would not be at all a predictor of ability to solve real mathematical problems at a research level

> If you want to solve grade school math problems, why not use an 'add' instruction?

Certainly the objective is not for the AI to do research-level mathematics.

It's not really even to do grade-school math.

The point is that grade-school math requires reasoning capability that transcends probabilistic completion of the next token in a sequence.

And if Q-Star has that reasoning capability, then it's another step-function leap toward AGI.

GTP · on Nov 23, 2023

> Certainly the objective is not for the AI to do research-level mathematics.

The problem is that there are different groups of people with different ideas about AI, and when talking about AI it's easy to end up tackling the ideas of a specific group but forgetting about the existence of the others. In this specific example, surely there are AI enthusiasts who see no limits to the applications of AI, including research-level mathematics.

llamaimperative · on Nov 23, 2023

This is so profoundly obvious you have to wonder the degree of motivated reasoning behind people’s attempt to cast this as “omg it can add… but so can my pocket calculator!”

vidarh · on Nov 23, 2023

There's no value in an LLM doing arithmetic for the sake of doing arithmetic with the LLM. There's value in testing an LLMs ability to follow the rules for doing arithmetic that it already knows, because the ability to recognise that a problem matches a set of rules it already knows in part or whole and then applying those rules with precision is likely to generalise to overall far better problem solving abilities.

By all means, we should give LLMs lots and lots of specialised tools to let them take shortcuts, but that does not remove the reasons for understanding how to strengthen the reasoning abilities that would also make them good at maths.

EDIT: After having just coerced the current GPT4 to do arithmetic manually: It appears to have drastically improved in its ability to systematically following the required method, while ironically being far less willing to do so (it took multiple attempts before I got it to stop taking shortcuts that appeared to involve recognising this was a calculation it could use tooling to carry out, or ignoring my instructions to do it step by step and just doing it "in its head" the way a recalcitrant student might. It's been a while since I tested this, but this is definitely "new-ish".

namibj · on Nov 23, 2023

Gaslighting LLMs does wonders. In this case, e.g., priming it by convincing it the tool is either inaccessible/overloaded/laggy, or here perhaps, telling it the python tool computed wrong and can thus not be trusted.

Closi · on Nov 23, 2023

Why would we teach kids maths then, when they can use a calculator? It's much easier and faster for them.

I believe it's because having a foundational understanding of maths and logic is important when solving other problems, and if you are looking to create an AI that can generally solve all problems it should probably have some intuitive understanding of maths too.

i.e. if we want an LLM to be able to solve unsolved theorems in the future, this requires a level of understanding of maths that is more than 'teach it to use a calculator'.

More broadly, I can imagine a world where LLM training is a bit more 'interactive' - right now if you ask it to play a game of chess with you it fails, but it has only ever read about chess and past games and guesses the next token based on that. What if it could actually play a game of chess - would it get a deeper appreciation for the game? How would this change it's internal model for other questions (e.g. would this make it answer better at questions about other games, or even game theory?)

ChatGTP · on Nov 23, 2023

It's also fun to use your brain I guess, I think we've truly forgotten that life should be about fun.

Watching my kids grow up, they just have fun doing things like trying to crawl, walk or drink. It's not about being the best at it, or the most efficient, it's just about the experience.

Now maths is taught in a boring way, but knowing it can help us lead more enjoable lives. When math is taught in an enjoyable way AND people get results out of it. Well that's glorious.

smeej · on Nov 23, 2023

> Why would we teach kids maths then, when they can use a calculator? It's much easier and faster for them.

I am five years older than my brother, and we happened to land just on opposite sides of when children were still being taught mental arithmetic and when it was assumed they would, in fact, have calculators in their pockets.

It drives him crazy that I can do basic day-to-day arithmetic in my head faster than he can get out his calculator to do it. He feels like he really did get cheated out of something useful because of the proliferation of technology.

wegfawefgawefg · on Nov 23, 2023

Skull has limited volume. What room is unused by one capacity may be used by another.

smeej · on Nov 27, 2023

Even if that were true, I can count on one hand the number of times I've needed to use anything more than basic algebra (which is basically arithmetic with a placeholder) in my adult life. I think I'd genuinely rather keep arithmetic in my head than calculator use.

Jeff_Brown · on Nov 23, 2023

Is this intuition scientifically supported? I've read that people who remember every detail of their lives tend not to have spectacular intelligence, but outside of that extreme I'm unaware of having seen the tradeoff actually bite. And there are certainly complementarities in knowledge -- knowing physics helps with chemistry, knowing math and drama both help with music, etc.

wegfawefgawefg · on Nov 23, 2023

Chimps have a much better working memory than humans. They can also count 100 times faster than humans. However, the area of their brain responsible for this faculty is used for language in humans... The theory is that the prior working memory and counting ability may have been optimized out at some point to make physical room, assuming human ancestors could do it too.

Lookup the chimp test. the videos of the best chimp are really quite incredible.

There is also the measured inflation of map traversing parts of the brain in pro tetris players and taxi drivers. I vaguely remember an explanation about atrophy in nearby areas of the brain, potentially to make room.

comex · on Nov 23, 2023

Judging by some YouTube videos I’ve seen, ChatGPT with GPT-4 can get pretty far through a game of chess. (Certainly much farther than GPT-3.5.) For that duration it makes reasonably strategic moves, though eventually it seems to inevitably lose track of the board state and start making illegal moves. I don’t know if that counts as being able to “actually play a game”, but it does have some ability, and that may have already influenced its answers about the other topics you mentioned.

vczf · on Nov 23, 2023

What if you encoded the whole game state into a one-shot completion that fits into the context window every turn? It would likely not make those illegal moves. I suspect it's an artifact of the context window management that is designed to summarize lengthy chat conversations, rather than an actual limitation of GPT4's internal model of chess.

actionfromafar · on Nov 23, 2023

I am sorry, but I thought it was a bold assumption it has an internal model of chess?

vidarh · on Nov 23, 2023

Having an internal model of chess and maintaining an internal model of the game state of a specific given game when it's unable to see the board are two very different things.

EDIT: On re-reading I think I misunderstood you. No, I don't think it's a bold assumption to think it has an internal model of it at all. It may not be a sophisticated model, but it's fairly clear that LLM training builds world models.

PoignardAzur · on Nov 23, 2023

Not that bold, given the results from OthelloGPT.

We know with reasonable certainty that an LLM fed on enough chess games will eventually develop an internal chess model. The only question is whether GPT4 got that far.

tedajax · on Nov 23, 2023

Doesn't really seem like an internal chess model if it's still probabalistic in nature. Seems like it could still produce illegal moves.

vidarh · on Nov 24, 2023

So can humans. And nothing stops probabilities in a probabilistic model from approaching or reaching 0 or 1 unless your architecture explicitly prevents that.

baq · on Nov 23, 2023

Why?

Or, given https://thegradient.pub/othello/, why wouldn't it have an internal model of chess? It probably saw more than enough example games and quite a few chess books during training.

vidarh · on Nov 23, 2023

> More broadly, I can imagine a world where LLM training is a bit more 'interactive'

Well, yes, assume that every conversation you have with ChatGPT without turning off history makes it into the training set.

curling_grad · on Nov 23, 2023

Actually, OpenAI did a research[0] on solving some hard math problems by integrating language model and Lean theorem prover some time ago.

[0]: https://openai.com/research/formal-math

singularity2001 · on Nov 23, 2023

how do they achieve 41.2% in high school Olympiads but only 55% for grade school problems?

PS: also I thought GPT4 already achieved 90% in some university math grades? Oh I remember that was multiple-choice

setuid9002 · on Nov 23, 2023

I think the answer is Money, Money, Money. Sure it is 1000000000x more expensive in compute power, and error prown on top as well, to let a LLM solve an easy Problem. But the Monopolies generate a lot of hype around it to get more money from investors. Same as the self driving car hype was. Or the real time raytracing insanity in computer graphics. If one hype dies they artificially generate a new one. For me, I just watch all the ships sink to the ground. It is gold level comedy. Btw AGI is coming, haha, sure, we developers will be replaced by an script which can not bring B, A, C in a logical sequence. And this already needs massive town size data centers to train.

resource0x · on Nov 23, 2023

> If one hype dies they artificially generate a new one

They have a pipeline of hypes ready to be deployed at a moment's notice. The next one is quantum, it's already gathering in the background. Give it a couple of years.

sgt101 · on Nov 23, 2023

Can LLM's compute any computable function? I thought that an LLM can approximate any computable function, if the function is within the distribution that it is are trained on. I think it's jolly interesting to think about different axiomizations in this context.

Also we know that LLM's can't do a few things - arithmetic, inference & planning are in there. They look like they can because they retrieve discussions from the internet that contain the problems, but when they are tested out of distribution then all of a sudden they fail. However, some other nn's can do these things because they have the architecture and infrastructure and training that enables it.

There is a question for some of these as to whether we want to make NN's do these tasks or just provide calculators, like for grade students, but on the other hand something like Alphazero looks like it could find new ways of doing some problems in planning. The challenge is to find architectures that integrate the different capabilities we can implement in a useful and synergistic way. Lots of people have drawn diagrams about how this can be done, then presented them with lots of hand waving at big conferences. What I love is that John Laird has been building this sort of thing for like, forty years, and is roundly ignored by NN people for some reason.

Maybe because he keeps saying it's really hard and then producing lots of reasons to believe him?

RamblingCTO · on Nov 23, 2023

I still believe that A(G)I will consist of subsystems and different network architectures (if NN's are the path to that), just like we humans have.

trashtester · on Nov 23, 2023

Many of the "specialist" parts of the brain are still made from cortical columns, though. Also, they are in many cases partly interchangeable, with some reduction in efficiency.

Transformers may be like that, in that they can do generalized learning from different types of input, with only minor modifications needed to optimize for different input (or output) modes.

robwwilliams · on Nov 23, 2023

Cortical columns are one part of much more complex systems of neural compute that at a minimum includes recursive connections with thalamus, hypothalamus, midbrain, brainstem nuclei, cerebellum, basal forebrain, — and the list goes on.

So it really does look like a society of networks, all working in functional synchrony (parasynchrony might be a better word) with some firms of “consciousness” updated in time slabs of about 200-300 milliseconds.

LLMs are probably equivalent now to Wernicke’s and Broca’s areas, but much more is needed “on top” and “on bottom”—-motivation, affect, short and longterm memory, plasticity of synaptic weighting and dynamics, and perhaps most important, a self-steering attentional supervisor or conductor. That attentional driver system is what we probably mean by consciousness.

trashtester · on Nov 24, 2023

> That attentional driver system is what we probably mean by consciousness.

You may know much more about this than me, but how sure are you about this? To me it seems like a better fit that the "self-steering attentional supervisor" is associated with what we mentally model (and oversimplify) as "free will", while "consciousness" seems to be downstream from the attention itself, and has more to do with organizing and rationalizing experiences than with than with the directly controlling behavior.

This processed information then seems to become ONE input to the executive function in following cycles, but with a lag of at least 1 second, and often much more.

> one part of much more complex systems of neural compute

As for your main objection, you're obviously right. But I wonder how much of the computation that is relevant for intelligence is actually in those other areas. It seems to me that recent developments indicate that Transformer type models are able to self-organize into several different type of microstructures, even within present day transformer based models [1].

[1]: https://www.youtube.com/watch?v=Gg-w_n9NJIE (comment from Ilya somewhere)

robwwilliams · on Nov 24, 2023

Fun and insightful comment.

Not sure at all. Also some ambiguities in definitions. Above I mean “consciousness” of the type many would be willing to assume operates in a cat, dog, or mouse—attentional and occasionally, also intentional. I agree that this is downstream of pure attention. Attention needs to be steered and modulated. The combination of the two levels working together recursively is what I had in mind.

“Free will” gets us into more than that. I’ve been reading Daniel Dennett on levels of “intention” this week. This higher domain of an intentional stance (nice Wiki article) might get labeled “self-consciousness”.

Most humans seem to accept this as a cognitive and mainly linguistic domain—the internal discussions we have with ourselves, although I think we also accept that there is are major non-linguistic drivers. Language is an amazingly powerful tool for recursive attentional and semantic control.

trashtester · on Nov 25, 2023

Interesting points.

My take on "free will" is definitely partly based on Dennett's work.

As for "consciousness", it seems to me that most of not all actions we do are decided BEFORE they hit our consciousness. For actions that are not executed immediately, the processing that we experience as "consciousness" may then raise some warning flags if the action our pre-conscious mind has decided on is likely to cause som bad consequences. This MAY cause the decision-making part (executive function) of the brain to modify the decision, but not because the consciousness can override the decision directly.

Instead, when this happens, it seems to be that our consciousness extrapolates our story into the future in a way that creates fear, desire or similar more primal motivations that have more direct influence over the executive function.

One can test this by for instance standing at near the top of a cliff (don't du this if suicidal): Try to imagine that you have decided to jump of the cliff. Now imagine the fall from the cliff and you hitting the rocks below. Even if (and maybe especially if) you managed to convince yourself that you were going to jump, this is likely to trigger a fear response strong enough to ensure you will not jump (unless you're truely suicidal).

Or for a less synthetic situation. Let's say you're a married man, but in a situation where you have an opportunity to have sex with a beautiful woman. The executive part of the brain may already have decided that you will. But if your consciousness predicts that your wife is likely to find out and starts to spin a narrative about divorce, loosing access to your children and so on, this MAY cause your executive function to alter the decision.

Often in situations like this, though, people tend to proceed with what the preconcious executive function had already decided. Afterwards, they may have some mental crisis because they ended up doing something their consciousness seemed to protest against. They may feel they did it against their own will.

This is why I think that the executive function, even the "free will" is not "inside" of consciousness, but is separate from it. And while it may be influenced by the narratives that our consciousness spin up, it also takes many other inputs that we may or may not be conscious of.

The reason I still call this "free" will, is based on Dennett's model, though. And in fact, "free" doesn't mean what we tend to think it means. Rather, the "free" part means that there is a degree of freedom (like in a vector space) that is sensitive to the kind of incentives the poeple around you may provide for your actions.

For instance stealing something can be seen as a "free will" decision if you would NOT do it if you knew with 100% certainty that you would be caught and punished for it. In other words, "free will" actions are those that, ironically, other people can influence to the point where they can almost force you to take them, by providing strong enough incentives.

RamblingCTO · on Nov 23, 2023

Afaik some are similar, yes. But we also have different types of neurons etc. Maybe we'll get there with a generalist approach, but imho the first step is a patchwork of specialists.

vidarh · on Nov 23, 2023

> Can LLM's compute any computable function?

In a single run, obviously not any, because it's context window is very limited. With a loop and access to an "API" (or willing conversation partner agreeing to act as one) to operate a Turing tape mechanism? It becomes a question of ability to coax it into complying. It trivially has the ability to carry out every step, and your main challenge becomes to get it to stick to it over and over.

One step "up", you can trivially get GPT4 to symbolically solve fairly complex runs of instructions of languages it can never have seen before if you specify a grammar and then give it a program, with the only real limitation again being getting it to continue to adhere to the instructions for long enough before it starts wanting to take shortcuts.

In other words: It can compute any computable function about as well as a reasonably easily distractable/bored human.

wegfawefgawefg · on Nov 23, 2023

ML still cant do sin. Functions that repeat periodically.

vidarh · on Nov 24, 2023

What exactly is it you think it can't do? It can explain and apply a number of methods for calculating sin. For sin it knows the symmetry and periodicity, and so will treat requests for sin of larger values accordingly. To convince it to continue to write out the numbers for an arbitrary large number of values without emitting "... continue like this" or similar shortcut a human told to do annoyingly pointless repetitive work would also be prone to prefer is indeed tricky, but there's nothing to suggest it can't do it.

Jeff_Brown · on Nov 23, 2023

To err is human, after all.

xwolfi · on Nov 23, 2023

You're missing the point: who's using the 'add' instruction ? You. We want 'something' to think about using the 'add' instruction to solve a problem.

We want to remove the human from the solution design. It would help us tremendously tbh, just like I don't know, Google map helped me never to have to look for direction ever again ?

marshray · on Nov 23, 2023

When the solution requires arithmetic, one trick is to simply ask GPT to write a Python program to compute the answer.

There's your 'add'.

vidarh · on Nov 23, 2023

GPT4 now does this by default. You'll see a "analyzing" step before you get the answer, and a link which will show the generated python.

davidwritesbugs · on Nov 23, 2023

Interesting, how do you use this idea? If you prompt the LLM "create a python Add function Foo to add a number to another number", "using Foo add 1 and 2", or somesuch, but what's to stop it hallucinating and saying "Sure, let me do that for you, foo 1 and 2 is 347. Please let me know if you need anything else."

IanCal · on Nov 23, 2023

Nothing stops it from writing a recipe for soup for every request, but it does tend to do what it's told. When asked to do mathsy things and told it's got a tool for doing those it tends to lean into that if it's a good llm.

kolinko · on Nov 23, 2023

It writes a function and then you provide it to an interpreter which does the calculation output on which gpt proceeds to do the rest.

That’s how langchain works, chatgpt plugins and gpt function calling. It has proven to be pretty robust - that is, gpt4 realising when it needs to use a tool/write code for calculations when needed and then using the output.

vidarh · on Nov 23, 2023

With ChatGPT you now just state your problem, and if it looks like math, it will do so. E.g. see this transcript:

https://chat.openai.com/c/dd8de3f7-a50c-4b6d-bd3f-b52ed996d3...

LASR · on Nov 23, 2023

We’re way beyond this kind of hallucinations now. OpenAI’s models are frighteningly good at producing code.

You can even route back runtime errors and ask it to fix its own code. And it does.

It can write code and even write a test to test that code. Give it an interpreter and you’re all set.

throwuwu · on Nov 23, 2023

What you’re proposing is equivalent to training a monkey (or a child for that matter) to punch buttons that correspond to the symbols it sees without actually teaching it what any of the symbols mean.

da39a3ee · on Nov 23, 2023

What an absolutely idiotic comment.

> If you want to solve grade school math problems

That's not the aim here. Very obviously what we are talking about here is _complementing_ AI language models with improved mathematical abilities, and whether that leads to anything interesting. Surely you understand that? Aren't you one of the highest rated commenters on this site?

d--b · on Nov 23, 2023

You make the asumption that Q* is a LLM, but I think OpenAI guys know very well that the current LLM architecture cannot achieve AGI.

As the name suggests, this things is likely using some form of Q learning algorithm, which makes it closer to the DeepMind models than a transformer.

My guess is that they pipe their LLM into some Q learnt net. The LLM may transform a natural language task into some internal representation that can then be handled by the Q-learnt model, which spits out something that can be transformed back again into natural language.

jansan · on Nov 23, 2023

There is a paper about something called Q*. I have no idea if they are connected or if the name matched coincidentially.

https://arxiv.org/abs/2102.04518

wegfawefgawefg · on Nov 23, 2023

The real world is a space of continuous actions. To this day Q algorithms have been ones of discrete action outputs. I'd be surprised if a Q algorithm could handle the huge action space of language. Honestly its weird they'd consider the Q family. I figured we were done with that after PPO performed so well.

wegfawefgawefg · on Nov 23, 2023

As an ML programmer, i think that approach sounds really too complicated. It is always a bad idea to render the output of one neural network into output space before feeding it into another, rather than have them communicate in feature space.

insomagent · on Nov 23, 2023

Let's say a model runs through a few iterations and finds a small, meaningful piece of information via "self-play" (iterating with itself without further prompting from a human.)

If the model then distills that information down to a new feature, and re-examines the original prompt with the new feature embedded in an extra input tensor, then repeats this process ad-infinitum, will the language model's "prime directive" and reasoning ability be sufficient to arrive at new, verifiable and provable conjectures, outside the realm of the dataset it was trained on?

If GPT-4,5,...,n can progress in this direction, then we should all see the writing on the wall. Also, the day will come where we don't need to manually prepare an updated dataset and "kick off a new training". Self-supervised LLMs are going to be so shocking.

wbhart · on Nov 23, 2023

People have done experiments trying to get GPT-4 to come up with viable conjectures. So far it does such a woefully bad job that it isn't worth even trying.

Unfortunately there are rather a lot of issues which are difficult to describe concisely, so here is probably not the best place.

Primary amongst them is the fact that an LLM would be a horribly inefficient way to do this. There are much, much better ways, which have been tried, with limited success.

gmerc · on Nov 23, 2023

After a year the entire argument you make boils down to “so far”.

Terr_ · on Nov 23, 2023

Whereas your post sounds like "Just give the approach more time, it shall continue to incrementally improve until it finally works someday, cuz reasons."

Early attempts at human flight approached it by strapping wings to people's arms and flapping: Do you think that would have eventually worked too, if only we had just given it a bit more time and faith?

londons_explore · on Nov 23, 2023

> Early attempts at human flight approached it by strapping wings to people's arms and flapping: Do you think that would have eventually worked too, if only we had just given it a bit more time and faith?

Interestingly, we how have human powered aircraft... We have flown ~60km with human leg power alone. We've also got human powered ornithopters (flapping wing designs) which can fly but only for very short times before the pilot is exhausted.

I expect that another 100 years from now, both records will be exceeded, altough probably for scientific curiosity more than because human powered flight is actually useful.

ben_w · on Nov 23, 2023

I knew about the legs (there was a model in the London Science Museum when I was a kid), but I didn't know about the ornithopter.

https://en.wikipedia.org/wiki/UTIAS_Snowbird

13 years ago! Wow, how did I miss that?

xcv123 · on Nov 23, 2023

> Just give the approach more time, it shall continue to incrementally improve until it finally works someday, cuz reasons

Yes. Because we haven't yet reached the limit of deep learning models. GPT-3.5 has 175 billion parameters. GPT-4 has an estimated 1.8 trillion parameters. That was nearly a year ago. Wait until you see what's next.

meheleventyone · on Nov 23, 2023

Why would adding more parameters suddenly make it better at this sort of reasoning? It feels a bit of a “god of the gaps” where it’ll just stop being a stochastic parrot in just a few more million parameters.

Al-Khwarizmi · on Nov 23, 2023

I don't think it's guaranteed, but I do think it's very plausible because we've seen these models gain emerging abilities at every iteration, just from sheer scaling. So extrapolation tells us that they may keep gaining more capabilities (we don't know how exactly it does it, though, so of course it's all speculation).

I don't think many people would describe GPT-4 as a stochastic parrot already... when the paper that coined (or at least popularized) the term came up in early 2021, the term made a lot of sense. In late 2023, with models that at the very least show clear signs of creativity (I'm sticking to that because "reasoning" or not is more controversial), it's relegated to reductionistic philosophical arguments, but not really a practical description anymore.

meheleventyone · on Nov 23, 2023

I don’t think we should throw out the stochastic parrot so easily. As you say there are “clear signs of creativity” but that could be it getting significantly better as a stochastic parrot. We have no real test to tell mimicry apart from reasoning and as you note we also can only speculate about how any of it works. I don’t think it’s reductionist in light of that, maybe cautious or pessimistic.

Al-Khwarizmi · on Nov 23, 2023

They can write original stories in a setting deliberately designed to not be found in the training set (https://arxiv.org/abs/2310.08433). To me that's rather strong evidence of being beyond stochastic parrots by now, although I must concede that we know so little about how everything works, that who knows.

GTP · on Nov 23, 2023

I didn't look at the paper but... How do you design a setting in a way that you're sure there isn't a similar one in the training set, when we don't even precisely know what the training set for the various GPT models was?

Al-Khwarizmi · on Nov 23, 2023

Basically by making it unlikely enough to exist.

The setting in the paper is about narrating a single combat between Ignatius J. Reilly and a pterodactyl. Ignatius J. Reilly is a literary character with some very idiosyncratic characteristics, that appears in a single book, where he of course didn't engage in single combats at all or interact with pterodactyls. He doesn't seem to have been the target of fanfiction either (which could be a problem if characters like, say, Harry Potter or Darth Vader were used instead), so the paper argues that it's very unlikely that a story like that had been ever written at all prior to this paper.

GTP · on Nov 23, 2023

Well, we've been writing stories for thousands of years, so I'm a bit skeptical that the concept of "unlikely enough to exist" is a thing. More to the specific example, maybe there isn't a story about this specific character fighting a pterodactyl, but surely there are tons of stories of people fighting all kind of animals, and maybe there are some about someone fighting a pterodactyl too.

Al-Khwarizmi · on Nov 23, 2023

Sure, but the evaluation explicitly addresses (among other points) how well that specific character is characterized. If an LLM took a pre-existing story about (say) Superman fighting a pterodactyl, and changed Superman to Ignatius J. Reilly, it wouldn't get a high rating.

TerrifiedMouse · on Nov 23, 2023

> very least show clear signs of creativity

Do you know how that “creativity” is achieved? It’s done with a random number generator. Instead of having the LLM pick the absolute most likely next token, they have it select from a set of most likely next tokens - size of the set depends on “temperature”.

Set temperature to 0, and the LLM will talk in circles and not really say anything interesting. Set it too high and it will output nonsense.

The whole design of LLMs don’t seem very well thought out. Things are done a certain way not because it makes sense but because it seems to produce “impressive” results.

Al-Khwarizmi · on Nov 23, 2023

I know that, but to me that statement isn't much more helpful than "modern AI is just matrix multiplication" or "human intelligence is just electric current through neurons".

Saying that it's done with a random number generator doesn't really explain the wonder of achieving meaningful creative output, as in being able to generate literature, for example.

xcv123 · on Nov 23, 2023

> Set temperature to 0, and the LLM will talk in circles and not really say anything interesting. Set it too high and it will output nonsense.

Sounds like some people I know, at both extremes.

> The whole design of LLMs don’t seem very well thought out. Things are done a certain way not because it makes sense but because it seems to produce “impressive” results.

They have been designed and trained to solve natural language processing tasks, and are already outperforming humans on many of those tasks. The transformer architecture is extremely well thought out, based on extensive R&D. The attention mechanism is a brilliant design. Can you explain exactly which part of the transformer architecture is poorly designed?

TerrifiedMouse · on Nov 23, 2023

> They have been designed and trained to solve natural language processing tasks

They aren’t really designed to do anything actually. LLMs are models of human languages - it’s literally in the name, Large Language Model .

https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...

I’m sorry but I don’t trust something that uses a random number generator as part of its output generation.

xcv123 · on Nov 23, 2023

> They aren’t really designed to do anything actually. LLMs are models of human languages - it’s literally in the name, Large Language Model .

No. And the article you linked to does not say that (because Wolfram is not an idiot).

Transformers are designed and trained specifically for solving NLP tasks.

> I’m sorry but I don’t trust something that uses a random number generator as part of its output generation.

The human brain also has stochastic behaviour.

HarHarVeryFunny · on Nov 23, 2023

People use the term "stochastic parrot" in different ways ... some just as a put-down ("it's just autocomplete"), but others like Geoff Hinton acknowledging that there is of course some truth to it (an LLM is, at the end of the day, a system who's (only) goal is to predict "what would a human say"), while pointing out the depth of "understanding" needed to be a really good at this.

There are fundamental limitations to LLMs though - a limit to what can be learned by training a system to predict next word form a fixed training corpus. It can get REALLY good at that task, as we've seen, to extent that it's not just predicting next word but rather predicting an entire continuation/response that is statistically consistent with the training set. However, what is fundamentally missing is any grounding in anything other than the training set, which is the what causes hallucinations/bullshitting. In a biological intelligent system predicting reality is the goal, not just predicting what "sounds good".

LLMs are a good start in as much as they prove the power of prediction as a form of feedback, but to match biological systems we need a closed-loop cognitive architecture that can predict then self-correct based on mismatch between reality and prediction (which is what our cortex does).

For all of the glib prose that an LLM can generate, even if it seems to understand what you are asking (after all, it was trained with the goal of sounding good), it doesn't have the intelligence of even a simple animal like a rat that doesn't use language at all, but is grounded in reality.

xcv123 · on Nov 23, 2023

> even if it seems to understand what you are asking (after all, it was trained with the goal of sounding good

It was trained not only to "sound good" aesthetically but also to solve a wide range of NLP tasks accurately. It not only "seems to" understand the prompt but it actually does have a mechanical understanding of it. With ~100 layers in the network it mechanically builds a model of very abstract concepts at the higher layers.

> it doesn't have the intelligence of even a simple animal

It has higher intelligence than humans by some metrics, but no consciousness.

HarHarVeryFunny · on Nov 23, 2023

> It was trained not only to "sound good" aesthetically but also to solve a wide range of NLP tasks accurately.

Was it? I've only heard of pre-training (predict next word) and subsequent RLHF + SFT "alignment" (incl. aligning to goal of being conversational). AFAIK the NLP skills that these LLMs achieve are all emergent rather than explicitly trained.

I'm not sure we can really say the net fully understands even if it answers as if it does - it was only trained to "predict next word", which in effect means being trained to generate a human-like response. It will have learnt enough to accomplish that goal, and no more (training loss tends to zero as goal is met).

Contrast this to an animal with a much richer type of feedback - reality, and with continual (aka online) learning. The animal truly understands it's actions - i.e. has learnt to accurately predict what will happen as a result of them.

The LLM does not understand it's own output in this sense - it exists only in a world of words, and has no idea if the ideas it is expressing are true or not (hence all the hallucinating/bullshitting). It only knew enough to generate something that sounded like what a person might say.

xcv123 · on Nov 23, 2023

> Was it? I've only heard of pre-training (predict next word) and subsequent RLHF + SFT "alignment" (incl. aligning to goal of being conversational). AFAIK the NLP skills that these LLMs achieve are all emergent rather than explicitly trained.

I believe you are right about that. I did some research after reading your comment. Transformers were certainly designed for NLP, but with large enough models the abilities can emerge without necessarily being explicitly trained for it.

> I'm not sure we can really say the net fully understands even if it answers as if it does - it was only trained to "predict next word", which in effect means being trained to generate a human-like response.

It depends on your definition of "understand". If that requires consciousness then there is no universally agreed formal definition.

Natural Language Understanding (NLU) is a subset of Natural Language Processing (NLP). If we take the word "understanding" as used in an academic and technical context then yes they do understand quite well. In order to simply "predict the next word" they learn an abstract model of syntax, semantics, meaning, relationships, etc, from the text.

> and has no idea if the ideas it is expressing are true or not (hence all the hallucinating/bullshitting).

That is not really an issue when solving tasks that are within it's context window. It is an issue for factual recall. The model is not a type of database that stores its training set verbatim. Humans have analogous problems with long term memory recall. I can think straight within my working memory but my brain will "hallucinate" to some extent when recalling distant memories.

HarHarVeryFunny · on Nov 23, 2023

The context window only has to do with the size of input it has access to - its not related to what it's outputting, which is ultimately constrained by what it was trained on.

If you ask it a question where the training data (or input data = context) either didn't include the answer, or where it was not obvious how to get the right answer, that will not (unfortunately) stop it from confidently answering!

xcv123 · on Nov 23, 2023

> The context window only has to do with the size of input it has access to - its not related to what it's outputting, which is ultimately constrained by what it was trained on.

Wait a minute. You are completely missing the entire "attention mechanism" thing which is what makes transformers so capable. For each output token generated in sequence, the attention mechanism evaluates the current tokens relationship to all tokens in the context window, weighing their relevance. There are multiple "attention heads" running in parallel (16 in GPT-3.5). Now for each layer of the neural network there is an attention mechanism, independently processing the entire context window for each token. There are ~100 layers in ChatGPT. So now we have 100 layers times 16 attention heads = 1600 attention mechanisms evaluating the entire context window over many deep layers of abstraction for each output token.

HarHarVeryFunny · on Nov 24, 2023

I'm not sure what your point is ... Hallucinations are where the net hadn't seen enough training data similar/related to the prompt to enable it to generate a good continuation/response. Of course in cases where it is sufficiently trained and the context contained what it needs then in can make full use of it, even copying context words to the output (zero shot learning) when appropriate.

The real issue isn't that the net often "makes a statistical guess" rather than saying "I don't know", but rather that when it does make errors it has no way to self-detect the error and learn from the mistake, as a closed-loop biological system is able to do.

xcv123 · on Nov 24, 2023

I was responding to this.

> The context window only has to do with the size of input it has access to - its not related to what it's outputting

The sequential token generation process is closely related to the content of the context window at every step.

Maybe I misunderstood your point. I know these things can hallucinate when asked about obscure facts that they weren't sufficiently trained on.

xcv123 · on Nov 23, 2023

> If you ask it a question where the training data (or input data = context) either didn't include the answer, or where it was not obvious how to get the right answer, that will not (unfortunately) stop it from confidently answering!

I haven't found this to be the case in my experience. I use ChatGPT-4. It often tells me when it doesn't know or have enough information.

If you haven't used GPT-4 I recommend signing up for a month. It is next level, way better than 3.5. (10x the parameter count). (No I'm not being paid to recommend it.)

lossolo · on Nov 23, 2023

You can predict performance of certain tasks before training and it's continuous:

https://twitter.com/mobav0/status/1653048872795791360

Al-Khwarizmi · on Nov 23, 2023

I read that paper back in the day and honestly I don't find it very meaningful.

What they find is that for every emerging ability where an evaluation metric seems to have a sudden jump, there is some other underlying metric that is continuous.

The thing is that the metric with the jump is the one people would actually care about (like actually being able to answer questions correctly, etc.) while the continuous one is an internal metric. I don't think that refutes the existence of emerging abilities, it just explains a little bit of how they arise.

vidarh · on Nov 23, 2023

Why would it not? We've observed them getting significantly better through multiple iterations. It is quite possible they'll hit a barrier at some point, but what makes you believe this iteration will be the point where the advanced stop?

meheleventyone · on Nov 23, 2023

Because for what we’re discussing it would represent a step change in capability not an incremental improvement as we’ve seen.

vidarh · on Nov 23, 2023

You're moving goal posts. You asked why it'd get better, not about a step change.

meheleventyone · on Nov 23, 2023

No I'm not that's what this whole sub-thread is about how bad LLMs are at the stuff thats described in the OP.

For context this is the grandparent within which my original reply was scoped:

I feel very comfortable saying, as a mathematician, that the ability to solve grade school maths problems would not be at all a predictor of ability to solve real mathematical problems at a research level. The reason LLMs fail at solving mathematical problems is because: 1) they are terrible at arithmetic, 2) they are terrible at algebra, but most importantly, 3) they are terrible at complex reasoning (more specifically they mix up quantifiers and don't really understand the complex logical structure of many arguments) 4) they (current LLMs) cannot backtrack when they find that what they already wrote turned out not to lead to a solution, and it is too expensive to give them the thousands of restarts they'd require to randomly guess their way through the problem if you did give them that facility

Solving grade-school problems might mean progress in 1 and 2, but that is not at all impressive, as there are perfectly good tools out there that solve those problems just fine, and old-style AI researchers have built perfectly good tools for 3. The hard problem to solve is problem 4, and this is something you teach people how to do at a university level.

(I should add that another important problem is what is known as premise selection. I didn't list that because LLMs have actually been shown to manage this ok in about 70% of theorems, which basically matches records set by other machine learning techniques.)

(Real mathematical research also involves what is known as lemma conjecturing. I have never once observed an LLM do it, and I suspect they cannot do so. Basically the parameter set of the LLM dedicated to mathematical reasoning is either large enough to model the entire solution from end to end, or the LLM is likely to completely fail to solve the problem.)

I personally think this entire article is likely complete bunk.

Edit: after reading replies I realise I should have pointed out that humans do not simply backtrack. They learn from failed attempts in ways that LLMs do not seem to. The material they are trained on surely contributes to this problem.

vidarh · on Nov 23, 2023

I responded specificially to this:

> Why would adding more parameters suddenly make it better at this sort of reasoning?

My response to you was about that specific claim as worded. Nothing more, nothing less.

vbezhenar · on Nov 23, 2023

Humans and other animals definitely different when it comes to reasoning. At the same time, biologically humans and many other animals are very similar, when it comes to brain, but humans have more "processing power". So it's only natural to expect some emergent properties from increasing number of parameters.

xcv123 · on Nov 23, 2023

> it’ll just stop being a stochastic parrot in just a few more million parameters.

Is is not a stochastic parrot today. Deep learning models can solve problems, recognize patterns, and generate new creative output that is not explicitly in their training set. Aside from adding more parameters there are new neural network architectures to discover and experiment with. Transformers aren't the final stage of deep learning.

sidlls · on Nov 23, 2023

Probabilistically serializing tokens in a fashion that isn't 100% identical to training set data is not creative in the context of novel reasoning. If all it did was reproduce its training set it would be the grossest example of overfitting ever, and useless.

Any actually creative output from these models is by pure random chance, which is most definitely different from the deliberate human reasoning that has produced our intellectual advances throughout history. It may or may not be inferior: there's a good argument to be made that "random creativity" will outperform human capabilities due to the sheer scale and rate at which the models can evolve, but there's no evidence that this is the case (right now).

vidarh · on Nov 23, 2023

There is also no evidence for your conjecture about there being some sort of grand distinction between "probabilistically serializing tokens" and "deliberate human reasoning" other than scale. There might be, but there is no evidence.

sidlls · on Nov 24, 2023

There's plenty of evidence that humans reason differently than ML models; namely basically any human intellectual discovery in history versus the (approximately) zero randomly generated ones by ML.

We don't know exactly how human reasoning works, but the observational evidence clearly indicates it is not by randomly piecing together tokens already known.

vidarh · on Nov 24, 2023

> There's plenty of evidence that humans reason differently than ML models; namely basically any human intellectual discovery in history versus the (approximately) zero randomly generated ones by ML.

This reasoning is invalid. For fun, I checked if GPT4 would catch the logical errors you made, and it did. Specifically, it correctly pointed out that absence of evidence is not evidence of absence. But even if there had been evidence of absence, this reasoning is invalid because it presumes that human reasoning must result in intellectual discovery irrespective of how it is employed, and so that if we can't find intellectual discoveries, it must mean an absence of human reasoning. In other words, it invalidly assumes that a difference in outcomes must represent a difference in the structure of reasoning. This is trivially invalid because humans think without making intellectual discoveries all the time.

However, it's also a strawman because I did not claim that humans and ML models reason the same way. I claimed there is no evidence of 'some sort of grand distinction between "probabilistically serializing tokens" and "deliberate human reasoning" other than scale'.

1) This explicitly recognizes that there is a difference, but that it might be just scale, and that we don't have evidence it doesn't. Your argument fails to address this entirely.

2) Even at scale, it does not claim they would be the same, but argues we don't have evidence that "probabilistically serializing tokens" must be inherently different from deliberate human reasoning" to an extent sufficient to call it "some sort of grand distinction". We can assume with near 100% certainty that there are differences - the odds of us happening upon the exact same structure is near zero. That does however not mean that we have any basis for saying that human reasoning isn't just another variant of "probabilistically serializing tokens".

I'll note that unlike you, GPT4 also correctly interpreted my intent when asked to review the paragraph and asked whether it implies the two must function the same. I could* take that to imply that LLMs are somehow better at humans at reasoning, but that would be logically invalid for the same reasons as your argument.

> We don't know exactly how human reasoning works, but the observational evidence clearly indicates it is not by randomly piecing together tokens already known.

Neither does LLMs. Piecing together tokens in a stochastic manner based on a model is not "randomly piecing together" - the model guides the process strongly enough that it's a wildly misleading characterization, as you can indeed trivially demonstrate by actually randomly piecing together words.

But even if we assume a less flippant and misleading idea of what LLMs do, your claim is incorrect. Observational evidence does nothing of the sort. If anything, the rapidly closing gap between human communication and LLMs shows that while it is extremely likely to be structural differences at the low level, it is increasingly unclear whether they are a material distinction. In other words, it's unclear whether the hardware and even hardwired network matters much relative to the computational structure the trained model itself creates.

You're welcome to your beliefs - but they are not supported by evidence. We also don't have evidence the other way, so it's not unreasonable to hold beliefs about what the evidence might eventually show.

lelanthran · on Nov 23, 2023

Ever heard of something called diminishingly returns?

The value improvement between 17.5b parameters and 175b parameters is much greater than the value improvement between 175b parameters and 18t parameters.

IOW, each time we throw 100 times more processing power at the problem, we get a measly 2 time increase in value.

xcv123 · on Nov 23, 2023

Yes that's a good point. But the algorithms are improving too.

tarsinge · on Nov 23, 2023

You are missing the point that it can be a model limit. LLMs were a breakthrough but that doesn’t mean they are a good model for some other problems, no matter the number of parameters. Language contains more than we thought, as GPT has impressively showed (ie semantics embedded in the syntax emerging from text compression), but still not every intellectual process is language based.

xcv123 · on Nov 23, 2023

I know that, but deep learning is more than LLMs. Transformers aren't the final ultimate stage of deep learning. We haven't found the limit yet.

tarsinge · on Nov 23, 2023

You were talking about the number of parameters on existing models. Like the history of Deep Learning has shown, simply throwing more computing power at an existing approach will plateau and not result in a fundamental breakthrough. Maybe we'll find new architectures, but the point was that the current ones might be showing their limits, and we shouldn't expect the model suddenly become good at something they are currently unable to handle because "more parameters".

xcv123 · on Nov 23, 2023

Yes you're right I only mentioned the size of the model. The rate of progress has been astonishing and we haven't reached the end, in terms of both of size and algorithmic sophistication of the models. There is no evidence that we have reached a fundamental limit of AI in the context of deep learning.

ra · on Nov 23, 2023

Indeed. LLM is an application on a transformer trained with backpropagation. What stops you from adding a logic/mathematic "application" on the same transformer?

seanhunter · on Nov 23, 2023

Nothing, and there are methods which allow these types of models to learn to use special purpose tools of this kind[1].

[1] https://arxiv.org/abs/2302.04761 Toolformer: Language Models Can Teach Themselves to Use Tools

jimmySixDOF · on Nov 23, 2023

Yes, it seems like this is a direction to replace RLHF so another way to scale without baremetal and if not this then still just a matter of time before some model optimization outperforms the raw epoch/parameters/token approach.

muskmusk · on Nov 23, 2023

Friend, the creator of this new progress is a machine learning PhD with a decade of experience in pushing machine learning forward. He knows a lot of math too. Maybe there is a chance that he too can tell the difference between a meaningless advance and an important one?

seanhunter · on Nov 23, 2023

That is as pure an example of the fallacy of argument from authority[1] as I have ever seen especially when you consider that any nuance in the supposed letter from the researchers to the board will have been lost in the translation from "sources" to the journalist to the article.

[1] https://en.wikipedia.org/wiki/Argument_from_authority

abhpro · on Nov 23, 2023

That fallacy's existence alone doesn't discount anything (nor have you shown it's applicable here), otherwise we'd throw out the entire idea of authorities and we'd be in trouble

mejutoco · on Nov 23, 2023

Authorities are useful within a context. Appealing to authority is not an argument. At most, it is an heuristic.

_Using_ this fallacy in an argument invalidates the argument (or shows it did not exist in the first place)

Eisenstein · on Nov 23, 2023

When the person arguing uses their own authority (job, education) to give their answer relevance, then stating that the authority of another person is greater (job, education) to give that person's answer preeminence is valid.

neilk · on Nov 23, 2023

I am neither a mathematician or LLM creator but I do know how to evaluate interesting tech claims.

The absolute best case scenario for a new technology is that it when it seems like a toy for nerds, and doesn't outperform anything we have today, but the scaling path is clear.

Its problems just won't matter if it does that one thing with scaling. The web is a pretty good hypermedia platform, but a disastrously bad platform for most other computer applications. Nevertheless the scaling of URIs and internet protocols have caused us to reorganize our lives around it. And then if there really are unsolvable problems with the platform they just get offloaded onto users. Passwords? Privacy? Your problem now. Surely you know to use a password manager?

I think this new wave of AI is going to be like that. If they never solve the hallucination/confabulation issue, it's just going to become your problem. If they never really gain insight, it's going to become your problem to instruct them carefully. Your peers will chide for not using a robust AI-guardrail thing or not learning the basics of prompt engineering like all the kids do instinctively these days.

wbhart · on Nov 23, 2023

How on earth could you evaluate the scaling path with too little information. That's my point. You can't possibly know that a technology can solve a given kind of problem if it can only so far solve a completely different kind of problem which is largely unrelated!

Saying that performance on grade-school problems is predictive of performance on complex reasoning tasks (including theorem proving) is like saying that a new kind of mechanical engine that has 90% efficiency can be scaled 10x.

These kind of scaling claims drive investment, I get it. But to someone who understands (and is actually working on) the actual problem that needs solving, this kind of claim is perfectly transparent!

uoaei · on Nov 23, 2023

Any claims of objective, quantitative measurements of "scaling" in LLMs is voodoo snake oil when measured against some benchmarks consisting of "which questions does it answer correctly". Any machine learning PhD will admit this, albeit only in a quiet corner of a noisy bar after a few more drinks than is advisable when they're earning money from companies who claim scaling wins on such benchmarks.

dwaltrip · on Nov 23, 2023

For the current generative AI wave, this is how I understand it:

1. The scaling path is decreased val/test loss during training.

2. We have seen multiples times that large decreases in this loss have resulted in very impressive improvements in model capability across a diverse set of tasks (e.g. gpt-1 through gpt-4, and many other examples).

3. By now, there is tons of robust data demonstrating really nice relationships between model size, quantity of data, length of training, quality of data, etc and decreased loss. Evidence keeps building that most multi-billion param LLMs are probably undertrained, perhaps significantly so.

4. Ergo, we should expect continued capability improvement with continued scaling. Make a bigger model, get more data, get higher data quality, and/or train for longer and we will see improved capabilities. The graphs demand that it is so.

---

This is the fundamental scaling hypothesis that labs like OpenAI and Anthropic have been operating off of for the past 5+ years. They looked at the early versions of the curves mentioned above, extended the lines, and said, "Huh... These lines are so sharp. Why wouldn't it keep going? It seems like it would."

And they were right. The scaling curves may break at some point. But they don't show indications of that yet.

Lastly, all of this is largely just taking existing model architectures and scaling up. Neural nets are a very young technology. There will be better architectures in the future.

jacquesm · on Nov 23, 2023

We're at the point now where the harder problem is obtaining the high quality data you need for the initial training in sufficient quantities.

dr_dshiv · on Nov 23, 2023

These European efforts to create competitive LLMs need to know that.

jacquesm · on Nov 23, 2023

I don't think they will go anywhere. Europe doesn't have the ruthlessness required to compete in such an arena, it would need far more unification first before that could happen. And we're only drifting further apart it seems.

neilk · on Nov 23, 2023

I didn’t say “certain success”, I said “interesting”

OOPMan · on Nov 23, 2023

Honestly, OpenAI seem more like a cult that a company to me.

The hyperbole that surrounds them fits the mould nicely.

hutzlibu · on Nov 23, 2023

They did build the most advanced LLM tool, though.

dr_dshiv · on Nov 23, 2023

Maybe it takes a cult

raincole · on Nov 23, 2023

But he also has the incentive to exaggerate the AI's ability.

The whole idea of double-blind test (and really, the whole scientific methodology) is based on one simple thing: even the most experienced and informed professionals can be comfortably wrong.

We'll only know when we see it. Or at least when several independent research groups see it.

visarga · on Nov 23, 2023

> even the most experienced and informed professionals can be comfortably wrong

That's the human hallucination problem. In science it's a very difficult issue to deal with, only in hindsight you can tell which papers from a given period were the good ones. It takes a whole scientific community to come up with the truth, and sometimes we fail.

auggierose · on Nov 23, 2023

No. It takes just one person to come up with the truth. It then can takes ages to convince the "scientific community".

visarga · on Nov 23, 2023

Well, one person will usually add a tiny bit of detail to the "truth". It's still a collective task.

auggierose · on Nov 27, 2023

I don't think so. The truth is advanced by individuals, not by the collective. The collective is usually wrong about things as long as they possibly can be. Usually the collective first has too die before it accepts the truth.

lokar · on Nov 23, 2023

I thought (and could be wrong) that all of these concerns are based on a very low probability of a very bad outcome.

So: we might be close to a breakthrough, that breakthrough could get out of hand, then it could kill a billion+ people.

patrec · on Nov 23, 2023

> I thought (and could be wrong) that all of these concerns are based on a very low probability of a very bad outcome.

Among knowledgeable people who have concerns in the first place, I'd say giving the probability of a very bad outcome of cumulative advances as "very low" is a fringe position. It seems to vary more between "significant" and "close to unity".

There are some knowledgeable people like Yann LeCun who have no concerns whatsoever but they seem singularly bad at communicating why this would be a rational position to take.

ben_w · on Nov 23, 2023

Given how dismissive LeCun is of the capabilities of SotA models, I think he thinks the state of the art is very far from human, and will never be human-like.

Myself, I think I count as a massive optimist, as my P(doom) is only about 15% — basically the same as Russian Roulette — half of which is humans using AI to do bad things directly.

aidaman · on Nov 23, 2023

Unlikely. We'll know when OpenAI has declared itself ruler of the new world, imposes martial law, and takes over.

Gud · on Nov 23, 2023

Why would you ever know? Why would the singularity reveal itself in such an obvious way(until it's too late to stop it)?

nobrains · on Nov 23, 2023

Also, wbhart is referring to publicly released LLMs, while the OpenAI researchers are most likely referring to an un-released in-research LLM.

las_balas_tres · on Nov 23, 2023

Sure... but that machine learning PhD has a vested interest in being optimistically biased in his observations.

smrtinsert · on Nov 23, 2023

Ah finally the engineers approach to the news. I'm not sure why we have to have hot takes, instead of dissecting the news and trying to tease out the how.

caesil · on Nov 23, 2023

FWIW The Verge is reporting that people inside are also saying the Reuters story is bunk:

https://www.theverge.com/2023/11/22/23973354/a-recent-openai...

himaraya · on Nov 23, 2023

> After being contacted by Reuters, OpenAI, which declined to comment, acknowledged in an internal message to staffers a project called Q* and a letter to the board before the weekend's events, one of the people said.

Reuters update 6:51 PST

The Verge has acted like an intermediary for Sam's camp during this whole saga, from my reading.

gmt2027 · on Nov 23, 2023

We have an algorithm and computational hardware that will tune a universal function approximator to fit any dataset with emergent intelligence as it discovers abstractions, patterns, features and hierarchies.

So far, we have not yet found hard limits that cannot be overcome by scaling the number of model parameters, increasing the size and quality of training data or, very infrequently, adopting a new architecture.

The number of model parameters required to achieve a defined level of intelligence is a function of the architecture and training data. The important question is, what is N, the number of model parameters at which we cross an intelligence threshold and it becomes theoretically possible to solve mathematics problems at a research level for an optimal architecture that we may not yet have discovered. Our understanding does not extend to the level where we can predict N but I doubt that anyone still believes that it is infinity after seeing what GPT4 can do.

This claim here is essentially a discovery that N may be much closer to where we are with today's largest models. Researchers at the absolute frontier are more likely to be able to gauge how close they are to a breakthrough of that magnitude from how quickly they are blowing past less impressive milestones like grade school math.

My intuition is that we are in a suboptimal part of the search space and it is theoretically possible to achieve GPT4 level intelligence with a model that is orders of magnitude smaller. This could happen when we figure out how to separate the reasoning from the factual knowledge encoded in the model.

waveBidder · on Nov 23, 2023

intelligence isn't a function unless you're talking about over every possible state of the universe.

gmt2027 · on Nov 23, 2023

There are well described links between intelligence and information theory. Intelligence is connected to prediction and compression as measures of understanding.

Intelligence has nothing specific to do with The Universe as we known it. Any universe will do, a simulation, images or a set of possible tokens. The universe is every possible input. The training set is a sampling drawn from the universe. LLMs compress this sampling and learn the processes and patterns behind it so well that they can predict what should come next without any direct experience of our world.

All machine learning models and neural networks are pure functions. Arguing that no function can have intelligence as a property is equivalent to claiming that artificial intelligence is impossible.

vidarh · on Nov 23, 2023

Intelligence must inherently be a function unless there is a third form of cause-effect transition that can't be modelled as a function of determinism and randomness.

mk67 · on Nov 23, 2023

Functions are by definition not random. Randomness would break: "In mathematics, a function from a set X to a set Y assigns to each element of X exactly one element of Y"

vidarh · on Nov 23, 2023

"Function" has (at least) two meanings. The last clause is not talking about functions in the mathematical sense. It could have been worded clearer, sure.

est · on Nov 23, 2023

> The reason LLMs fail at solving mathematical problems is because

That's exactly what Go/Baduk/Weiqi players think some years ago. And superalignment is defintely OpenAI's major research objective:

> https://openai.com/blog/our-approach-to-alignment-research

> our AI systems are proposing very creative solutions (like AlphaGo’s move 37)

When will mathematicians face the move 37 moment?

Davidzheng · on Nov 23, 2023

Probably in <3 years

topspin · on Nov 23, 2023

I don't know whether this particular article is bunk. I do know I've read many, many similar comments about how some complex task is beyond an conceivable model or system and then, years later, marveled at exactly that complex task being solved.

jhanschoo · on Nov 23, 2023

The article isn't describing something that will happen years later, but now. The comment author is saying that this current model is not AGI as it likely can't solve university-level mathematics, and they are presumably open to the possibility of a model years down the line that can do that.

abeppu · on Nov 23, 2023

This comment seems to presume that Q* is related to existing LLM work -- which isn't stated in the article. Others have guessed that the 'Q' in Q* is from Q-learning in RL. In particular backtracking, which you point out LLMs cannot do, would not be an issue in an appropriate RL setup.

nullc · on Nov 23, 2023

> which you point out LLMs cannot do, would not be an issue in an appropriate RL setup.

Hm? it's pretty trivial to use a sampler for LLMs that has a beam search and will effectively 'backtrack' a 'bad' selection.

It just doesn't normally help-- by construction the LLM sampled normally already approximates the correct overall distribution for the entire output, without any search.

I assume using a beam search does help when your sampler does have some non-trivial constraints (like the output satisfies some grammar or passes an algebraic test, or even just top-n sampling since those adjustments on a token by token basis result in a different approximate distribution than the original distribution filtered by the constraints).

adastra22 · on Nov 23, 2023

Back-tracking is a very nearly solved problem in the context of Prolog-like languages or mathematical theorem provers (as you probably well know). There are many ways you could integrate an LLM-like system into a tactic-based theorem prover without having to restart from the beginning for each alternative. Simply checkpointing and backtracking to a checkpoint would naively improve upon your described Monte Carlo algorithm. More likely I assume they are using RL to unwind state backwards and update based on the negative result, which would be significantly more complicated but also much more powerful (essentially it would one-shot learn from each failure).

That's just what I came up with after thinking on it for 2 minutes. I'm sure they have even better ideas.

visarga · on Nov 23, 2023

You can also consider the chatGPT app as a RL environment. The environment is made of the agent (AI), a second agent (human), and some tools (web search, code, APIs, vision). This grounds the AI into human and tool responses. They can generate feedback that can be incorporated into the model by RL methods.

Basically every reply from a human can be interpreted as a reward signal. If the human restates the question, it means a negative reward, the AI didn't get it. If the human corrects the AI, another negative reward, but if they continue the thread then it is positive. You can judge turn-by-turn and end-to-end all chat logs with GPT4 to annotate.

The great thing about chat based feedback is that it is scalable. OpenAI has 100M users, they generate these chat sessions by the millions every day. Then they just need to do a second pass (expensive, yes) to annotate the chat logs with RL reward signals and retrain. But they get the human-in-the-loop for free, and that is the best source of feedback.

AI-human chat data is in-domain for both the AI and human, something we can't say about other training data. It will contain the kind of mistakes AI does, and the kind of demands humans want to solve with AI. My bet is that OpenAI have realized this and created GPTs in order to enrich and empower the AI to create the best training data for GPT-5.

The secret sauce of OpenAI is not their people, or Sam, or the computers, but the training set, especially the augmented and synthetic parts.

wbhart · on Nov 23, 2023

There are certainly efforts along the lines of what you suggest. There are problems though. The number of backtracks is 10^k where k is not 2, or 3, or 4.....

Another issue is that of autoformalisation. This is the one part of the problem where an LLM might be able to help, if it were reliable enough (it isn't currently) or if it could truly understand the logical structure of mathematical problems correctly (currently they can't).

riku_iki · on Nov 23, 2023

> That's just what I came up with after thinking on it for 2 minutes. I'm sure they have even better ideas.

the thing is that ideas not necessary easy to implement. There will be many obstacles on route you described:

- quality of provers, is there good ergo provers which also can run at large scales (say billions of facts)

- you need some formalization approach, probably LLM will do some work, but we don't know what will be quality

- LLM likely will generate many individual factoids, which are losely compatible, contradicting, etc, and untrivial effort is required to reconcile and connect them

Davidzheng · on Nov 23, 2023

I agree that in and of itself it's not enough to be alarmed. Also i have to say i don't really know what grade school mathematics means here(multiplication? Proving triangles are congruent?). But I think the question is, whether the breakthrough is an algorithmic change in reasoning. If it is, then it could challenge all 4 of your limitations. Again this article is low on details so really we are arguing over our best guesses. But I wouldn't be so confident that an improvement on simple math problems due to algorithms can have huge implications.

Also, do you remember what go players said when they beat Fan Hui? Change can come quick

wbhart · on Nov 23, 2023

I think maybe I didn't make myself quite clear here. There are already algorithms which can solve advanced mathematical problems 100% reliably (prove theorems). There are even algorithms which can prove any correct theorem that can be stated in a certain logical language, given enough time. There are even systems in which these algorithms have actually been implemented.

My point is that no technology which can solve grade school maths problems would be viewed as a breakthrough by anyone who understood the problem. The fundamental problems which need to be solved are not problems you encounter in grade school mathematics. The article is just ill-informed.

tim333 · on Nov 23, 2023

>no technology which can solve grade school maths problems would be viewed as a breakthrough ...

Not perhaps in the sense of making mathematicians redundant but it seems like a breakthrough for ChatGPT type programs.

You've got to remember these things have gone from kind of rubbish a year or so ago to being able to beat most students at law exams now and by the sounds of it beat students at math tests shortly. At that rate or progress they'd be competing with the experts before very long.

himaraya · on Nov 23, 2023

The article suggests the way Q* solves basic math problems matters more than the difficulty of the problems themselves. Either way, I think judging the claims made remains premature without seeing the supporting documentation.

kenjackson · on Nov 23, 2023

“Given enough time” makes that a useless statement. Every kid in college learns this.

The ability to eventually solve a given theorem isn’t interesting — especially if the time is longer than the time left in the universe.

It’s far more interesting to see if an AI can, given an arbitrarily stated problem make clear progress quickly.

richardw · on Nov 23, 2023

On backtracking, I thought tree-of-thought enabled that?

"considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices"

https://arxiv.org/abs/2305.10601

Generally with you though, this thing is not leading to real smarts and that's accepted by many. Yes, it'll fill in a few gaps with exponentially more compute but it's more likely that an algo change is required once we've maxed out LLM's.

wbhart · on Nov 23, 2023

Yes, there are various approaches like tree-of-thought. They don't fundamentally solve the problem because there are just too many paths to explore and inference is just too slow and too expensive to explore 10,000 or 100,000 paths just for basic problems that no one wanted to solve anyway.

The problem with solving such problems with LLMs is that if the solution to the problem is unlike problems seen in training, the LLM will almost every time take the wrong path and very likely won't even think of the right path at all.

The AI really does need to understand why the paths it tried failed in order to get insight into what might work. That's how humans work (well, one of many techniques we use). And despite what people think, LLMs really don't understand what they are doing. That's relatively easy to demonstrate if you get an LLM off distribution. They will double down on obviously erroneous illogic, rather than learn from the entirely new situation.

richardw · on Nov 23, 2023

Thank you for the thoughtful response

EGreg · on Nov 23, 2023

As someone who studied math in grad school as part of a PhD program, worked at a hedge fund and went on to work on software and applied math, I call bullshit on this.

Math and Logic is just low-dimensional symbol manipulation that computers can easily do. You can throw data at them and they’ll show you theories that involve vectors of 42,000 variables while Isaac Newton had 4 and Einstein had 7 with Levi-Civita calculus. In short, what you consider “reasoning”, while beautiful in its simplicity, is nevertheless crude approximations to complex systems, such as linear regression or least squares.

3 days ago AI predicted fluid dynamics better than humans: https://www.sciencedaily.com/releases/2023/11/231120170956.h...

Google’s AI predicts weather now faster and better than Current systems built by humans: https://www.zdnet.com/google-amp/article/ai-is-outperforming...

AlphaZero based on MCTS years ago beat Rybka and all human-built systems in chess: https://www.quora.com/Did-AlphaZero-really-beat-Stockfish

And it can automate science and send it into overdrive: https://www.pbs.org/newshour/amp/science/analysis-how-ai-is-...

calf · on Nov 23, 2023

But, isn't AlphaGo a solution to kind of specific mathematical problem? And that it has passed with flying colors?

What I mean is, yes, neural networks are stochastic and that seems to be why they're bad at logic; on the other hand it' not exactly hallucinating a game of Go, and that seems different to how neural networks are prone to hallucination and confabulation on natural language or X-ray imaging.

wbhart · on Nov 23, 2023

Sure, but people have already applied deep learning techniques to theorem proving. There are some impressive results (which the press doesn't seem at all interested in because it doesn't have ChatGPT in the title).

It's really harder than one might imagine to develop a system which is good at higher order logic, premise selection, backtracking, algebraic manipulation, arithmetic, conjecturing, pattern recognition, visual modeling, has a good mathematical knowledge, is autonomous and fast enough to be useful.

For my money, it isn't just a matter of fitting a few existing jigsaw pieces together in some new combination. Some of the pieces don't exist yet.

calf · on Nov 23, 2023

Then your critique is about LLMs specifically.

But even there, can we say scientifically that LLMs cannot do math? Do we actually know that? And in my mind, that would imply LLMs cannot achieve AGI either. What do we actually know about the limitations of various approaches?

And couldn't people argue that it's not even necessary to think in terms of capabilities as if they were modules or pieces? Maybe just brute-force the whole thing, make a planetary scale computer. In principle.

visarga · on Nov 23, 2023

You seem knowledgeable. Can you share a couple of interesting papers for theorem proving that came out in the last year? I read a few of them as they came out, and it seemed neural nets can advance the field by mixing "soft" language with "hard" symbolic systems.

wbhart · on Nov 23, 2023

The field is fairly new to me. I'm originally from computer algebra, and somehow struggling into the field of ATP.

The most interesting papers to me personally are the following three:

* Making higher order superposition work. https://doi.org/10.1007/978-3-030-79876-5_24

* MizAR 60 for Mizar 50. https://doi.org/10.48550/arXiv.2303.06686

* Magnus Hammer, a Transformer Based Approach to Premise Selection. https://doi.org/10.48550/arXiv.2303.04488

Your mileage may vary.

Tangokat · on Nov 23, 2023

How about this:

- The Q* model is very small and trained with little compute.

- The OpenAI team thinks the model will scale in capability in the same way the GPT models do.

- Throwing (much) more compute at the model will likely allow it to solve research level math and beyond, perhaps also do actual logic reasoning in other areas.

- Sam goes to investors to raise more money (Saudi++) to fund the extra compute needed. He wants to create a company making AI chips to get more compute etc.

- The board and a few other OpenAI employees (notably Ilya) wants to be cautious and adopt a more "wait and see" approach.

All of this is speculation of course.

wegfawefgawefg · on Nov 23, 2023

Your comment is regarding LLMs, but Q* may not refer to an LLM. As such, our intuition about the failure of LLM's may not apply. The name Q* likely refers to a deep reinforcement learning based model.

To comment, in my personal experience, reinforcement learning agents learn in a more relatable human way than traditional ml, which act like stupid aliens. RL Agents try something a bunch of times, mess up, and tweak their strategy. After some extreme level of experience, they can make wider strategic decisions that are a little less myopic. RL agents can take in their own output, as their actions modify the environment. RL Agents also modify the environment during training, (which I think you will agree with me is important if you're trying to learn the influence of your own actions as a basic concept). LLM's, and traditional ml in general, are never trained in a loop on their own output. But in DRL, this is normal.

So if RL is so great and superior to traditional ml why is RL not used for everything? Well the full time horizon that can be taken into consideration in a DRL Agent is very limited, often a handful of frames, or distilled frame predictions. That prevents them from learning things like math. Traditionally RL bots have been only used for things like robotic locomotion, chess, go. Short term decision making that is made given one or some frames of data. I don't even think any RL bots have learned how to read english yet lol.

For me, as a human, my frame predictions exist on the scale of days, months, and years. To learn math I've had to sit and do nothing for many hours, and days at a time, consuming my own output. For a classical RL bot, math is out of the question.

But, my physical actions, for ambulation, manipulation, and balance, are made for me by specialized high speed neural circuits that operate on short time horizons, taking in my high level intentions, and all the muscle positions, activation, sensor data, etc. Physical movement is obfuscated from me almost in entirety. (RL has so far been good at tasks like this.)

With a longer frame horizon, that predicts frames far into the future, RL can be able to make long term decisions. It would likely take a lifetime to train. So you see now why math has not been accomplished by RL yet, but I don't think the faculty would be impossible to build into an ml architecture.

An RL bot that does math would likely spin on its own output for many many frames, until deciding that it is done, much like a person.

nullc · on Nov 23, 2023

It's also hard to know what the LLM has reasoned out vs has memorized.

I like the very last example in my tongue-in-cheek article, https://nt4tn.net/articles/aixy.html

Certainly the LLM didn't derive Fermat's theorem on sums of two squares under the hood (and, of course, very obviously didn't prove it correct-- as the code is technically incorrect for 2), but I'm somewhat doubtful that there was any function exactly like the template in codex's training set either (at least I couldn't quickly find any published code that did that). The line between creating something and applying a memorized fact in a different context is not always super clear.

greendesk · on Nov 23, 2023

Thinking is about associations and object visualisation. Surely a non-human system can build those, right? Pointing out only to a single product exposed to the public does not prove limitations for a theoretical limit.

KoolKat23 · on Nov 23, 2023

You're underestimating the power of LLM's.

I'll address two of your points as the other two stem from this.

They can't backtrack that's purely just design and can be easily trained there's no need to simulate at random until it gets the answer, if allowed to review it's prior answers and consider this, if often can reason a better answer. Further more breaking down problems. This is easily demonstrated when looking at how accuracy improves when you ask it to explain it's reasoning as it calculates (break it down into smaller problems). The same for humans, large mathematical problems are solved using learned methods to breakdown and simplify calculations into those easier for us to calculate and build up.

If the model was able to self adjust weightings based on it's finding this would further improve it (another design limitation we'll eventually get to improve, reinforcement learning). Much like 2+2=4 is your instantaneous answer, the neural connection has been made so strong in our brains by constant emphasis we no longer need to think of an abacus each time we get to the answer 4.

You're also ignoring the emergent properties of these LLMs, theyre obviously not yet at human level but they do understand the underlying values and can reason using this value. Semantic search/embeddings is evidence of this.

whatever1 · on Nov 23, 2023

The thing is that a LLMs can point out a logic error in their reasoning if specifically asked to do so.

So maybe OpenAI just slapped an RL agent on top of the next-token generator.

bambax · on Nov 23, 2023

> 4) they (current LLMs) cannot backtrack when they find that what they already wrote turned out not to lead to a solution, and it is too expensive to give them the thousands of restarts they'd require to randomly guess their way through the problem if you did give them that facility

This sounds like a reward function? If correctly implemented couldn't it enable an LLM to self-learn?

ddalex · on Nov 23, 2023

Specifically what deep-Q learning (as in Q*?) does....

arendtio · on Nov 25, 2023

To some degree you are right, but I think you forget, that the things they solved already (talking and reasoning about a world that was only presented in the form of abstractions (words)) were supposed to be harder than having a good understanding of numbers, quantities, and logical terms.

My guess is, that they saw the problems ChatGPT has today and worked on solving those problems. And given how important numbers are, they tried to fix how ChatGPT handles/understands numbers. After doing that, they saw how this new version performed much better and predicted, that further work in this area could lead to solving real-world math problems.

I don't think that we will be presented with the highway to singularity, but probably one more step in that direction.

two_in_one · on Nov 23, 2023

The reason LLMs solve school problems is because they've been trained on solutions. The problems are actually very repetitive. Not surprising for each 'new' of them there was something similar in training set. For research level problems there is nothing in training set. That's why they don't perform well.

Just today I asked GPT4 a simple task. Having mouse position in zoomed and scrolled image find it's position in the original image. GPT4 happily wrote the code, but it was completely wrong. I had to fix it manually.

However, the performance can be increased if there are several threads working on solution. Some suggesting and others analyzing the solution(s). This will increase the size of 'active' memory, at least. And decrease the load on threads, making them more specialized and deeper. This requires more resources, of course. And good management with task split. May be a dedicated thread for that.

jug · on Nov 23, 2023

1. OpenAI researchers used loaded and emotional words, implying shock or surprise. It's not easy to impress an OpenAI researcher like this, and above all, they understand the difficulty difference between teaching AI grade school and complex math since many years. They also understand that solving math with any form of reliability is only an emergent property in quite advanced LLM's.

2. Often, research is made on toy models and if this would be such a model, acing grade school problems (as per the article) would be quite impressive to say the least as this ability simply isn't emergent early in current LLM's.

What I think might have happened here is a step forward in AI capacity that has surprised researchers not because it is able to do things it couldn't at all do before, but how _early_ it is able to do so.

poulpy123 · on Nov 23, 2023

I don't know for Q* of course, but all the tests I made with GPT4, and all what I've read and seen about it, show that it is unable to reason. It was trained with an unfathomable amount of data, so it can simulate reasoning very well, but it is unable to reason

oezi · on Nov 23, 2023

What is the difference between simulating reasoning very well and "actual" reasoning?

parentheses · on Nov 23, 2023

I think the poster meant that it's capable of having a high probability of correct reasoning - simulating reasoning is lossy, actual reasoning is not. Though, human reasoning is still lossy.

Levitz · on Nov 23, 2023

Being able to extrapolate with newly found data.

You can get a LLM to simulate it "discovering" the pythagorean theorem, but can it actually, with the knowledge that was available at the time, discover the pythagorean theorem by itself?

Any parent will tell you, it's easy to simulate discovery and reasoning, it's a trick played for kids all the time. The actual, real stuff, that's way harder.

Jare · on Nov 23, 2023

Probably best to say "simulate the appearance of reasoning": looks and feels 100% acceptable at a surface level, but the actual details and conclusions are completely wrong / do not follow.

seanhunter · on Nov 23, 2023

Actual reasoning shows the understanding and use of a model of the key features of the underlying problem/domain.

As a simple example that you can replicate using chatgpt, ask it to solve some simple maths problem. Very frequently you will get a solution that looks like reasoning but is not, and reveals that it does not have an actual model of the underlying maths but is in fact doing text prediction based on a history of maths. For example see here[1]. I ask it for some quadratics in x with some specification on the number of roots. It gives me what looks at first glance like a decent answer. Then I ask the same exact question but asking for quadratics in x and y[2]. Again the answer looks plausible except that for the solution "with one real root" it says the solution has one real root when x + y =1. Well there are infinite real values for x and y such that x + y =1, not one real root. It looks like it has solved the problem but instead it has simulated the solving of the problem.

Likewise stacking problems, used to check for whether an AI has a model of the world. This is covered in "From task structures to world models: What do LLMs know?"[3] but for example here[4] I ask it whether it's easier to balance a barrel on a plank or a plank on a barrel. The model says it's easier to balance a plank on a barrel with an output text that simulates reasoning discussing center of mass and the difference between the flatness of the plank and the tendency of the barrel to roll because of its curvature. Actual reasoning would say to put the barrel on its end so it doesn't roll (whether you put the plank on top or not).

[1] https://chat.openai.com/share/64556be8-ad20-41aa-99af-ed5a42...

[2] https://chat.openai.com/share/2cd39197-dc09-4d07-a0d6-6cd800...

[3] https://arxiv.org/abs/2310.04276

[4] https://chat.openai.com/share/4b631a92-0d55-4ae5-8892-9be025...

chipsambos · on Nov 23, 2023

I generally agree with what you're saying and the first half of your answer makes perfect sense but I think the second is unfair (i.e. "[is it] easier to balance a barrel on a plank or a plank on a barrel"). It's a trick question and "it" tried to answer in good faith.

If you were to ask the same question of a real person and they replied with the exact same answer you could not conclude that person was not capable of "actual reasoning". It's a bit of witch-hunt question set to give you the conclusion you want.

seanhunter · on Nov 23, 2023

I should have said, as I understand it, the point of this type of question is not that one particular answer is the right answer and another is wrong, it's that often the model in giving an answer will do something really weird that shows that it doesn't have a model of the world.