AI has a "bathtub curve" of value. At the low level, it's a super-autocomplete, able to write 1-3 lines of code that works good enough. At the high level, it's great for explaining high-level concepts that are relevant to a task at hand.
In the middle... AI doesn't work very well.
If an AI writes a multi-step plan, where the pieces have to fit together, I've found it goes off the rails. Parts 1 and 3 of a 4-part plan are fine. So is part 2. However, they don't fit together! AI has no concept of "these four parts have to be closely connected, building a whole". It just builds from A to B in four steps... but takes two different paths and stitches the pieces together poorly.
It's not a bathtub curve. Your low-level and "high"-level tasks are the same thing: Probabilistic text generation.
It's not reasoning about your code, nor about the explanation it gives you.
> AI has no concept of "these four parts have to be closely connected, building a whole".
AI can't think. It doesn't create an internal model of the problem given, it just guesses. It fails at all these "middle" tasks because they require abstract reasoning to be correct.
It's not "whether or not it thinks" its "whether n-dimensional vector multiplication in an intricate embedding space is thinking or not".
Which on the surface is easy to knee-jerk a "no" too, but with a bit more pondering you realize that however the brain thinks must be describable by math, and now you need to carve out what math is "thinking" and what math is "computation".
Or just be a duelist and attribute it to a soul or whatever.
> however the brain thinks must be describable by math
Roger Penrose believes that some portion of the work brains are doing is making use of quantum processes. The claim isn't too far-fetched - similar claims have been made about photosynthesis.
That doesn't mean it's not possible for a classical computer, running a neural network, to get the same outcome (any more than the observation that birds have feathers means feathers are necessary to flight).
But it does mean that it could be that, yes you can describe what the brain is doing with math ... but you can't copy it with computation.
It feels self-evident that computation can mimic the brain. As a result, it's difficult to argue this line much further. To say the brain is non-computable is to assert the existence of a soul, in my opinion.
A lot of things feel self-evident then turn out to be completely wrong.
We don't understand the processes in the brain well enough to assert that they are doing computation. Or to assert that they aren't!
> say the brain is non-computable is to assert the existence of a soul, in my opinion
I don't believe in souls, but the brain might still be non-computable. There are more than two possibilities.
If it is the case that brains are doing something computable that is compatible with our Turing machines, we still have no idea what that is or how to recreate it, simulate it, or approximate it. So it's not a very helpful axiom.
> We don't understand the processes in the brain well enough to assert that they are doing computation. Or to assert that they aren't!
We absolutely do know enough about neurons to know that neural networks are doing computation. Individual neurons integrate multiple inputs and produce an output based on those inputs, which is fundamentally a computational process. They also use a binary signaling system based on threshold potentials, analogous to digital computation.
With the right experimental setup, that computation can be quantified and predicted down to the microvolt. The only reason we can't do that with a full brain is the size of the electrodes.
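To illustrate the "integrate inputs, fire past a threshold" picture, here is a toy leaky integrate-and-fire neuron in Python; the constants are illustrative, not physiological measurements.

    # Toy leaky integrate-and-fire neuron: inputs are summed into a membrane
    # potential that decays over time and triggers a spike at a threshold.
    def lif(inputs, threshold=1.0, leak=0.9):
        v, spikes = 0.0, []
        for i in inputs:
            v = leak * v + i          # integrate input with leaky decay
            if v >= threshold:
                spikes.append(1)
                v = 0.0               # reset after firing
            else:
                spikes.append(0)
        return spikes

    print(lif([0.3, 0.4, 0.5, 0.2, 0.9]))  # [0, 0, 1, 0, 1]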
> I don't believe in souls, but the brain might still be non-computable. There are more than two possibilities.
The real issue is neuroplasticity which is almost certainly critical to brain development. The physical hardware the computations are running on adapts and optimizes itself to the computations, for which I'm not sure we have an equivalent.
Dendrocentric compartmentalization, spike timing, bandpass in the dendrites, spike retiming, etc. aren't covered in the above.
But it is probably important to define 'computable'.
Typically that means there is an algorithm that can take a digit position as input and output the digit at that location.
So for f representing pi, f(3) would return 4.
Even the real numbers are uncomputable 'almost everywhere', meaning choose almost any real number, and no algorithm exists to produce its digits.
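As a concrete example of that digit-extraction sense of 'computable', here is a small Python sketch using mpmath (counting the leading 3 as position 1, per the convention above):

    import mpmath

    def pi_digit(n: int) -> int:
        """Return the nth digit of pi, counting the leading 3 as position 1."""
        mpmath.mp.dps = n + 10                      # working precision plus guard digits
        digits = str(mpmath.mp.pi).replace(".", "")
        return int(digits[n - 1])

    print(pi_digit(3))  # 4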
Add in ion channels and neurotransmitters and continuous input and you run into indeterminate features like riddled basins, where even with perfect information and precision you can't predict which exit basin it ends up in.
Basically look at the counterexamples to Laplace's demon.
MLPs with at least one hidden layer can approximate a function to within an error bound, given potentially infinitely many neurons, but they can only produce a countable infinity of outputs, while biological neurons, taking continuous inputs, potentially have an uncountable infinity.
Riddled basins, being sets with no open subsets, are another way to think about it.
We can write code that writes code. Hell, even current LLM tech can write code. It's at least conceivable that an artificial neural network could be self-modifying, if it hasn't been done already.
(a) human thinking (mathematical insight in particular) is not computable,
(b) all of classical physics is computable, therefore
(c) thinking relies on non-classical physics.
(d) In addition, he speculatively proposed which brain structures might do quantum stuff.
All of the early critiques of this I saw focussed on (d), which is irrelevant. The correctness of the position hinges on (a), for which Penrose provides a rigorous argument. I haven't kept up though, so maybe there are good critiques of (a) now.
If Penrose is right then neural networks implemented on regular computers will never think. We'll need some kind of quantum computer.
> If Penrose is right then neural networks implemented on regular computers will never think.
I disagree that that is necessarily an implication, though. As I said before, all that it implies is that computational tech will think differently than humans, in the same way that airplanes fly using different mechanisms from birds.
Part of Penrose's point (a) is that our brains can solve problems that aren't computable. That's the crux of his brains-aren't-computers argument. So even if computers can in some sense think, their thinking will be strictly more limited than ours, because we can solve problems that they can't. (Assuming that Penrose is right.)
I wonder if LLMs have shaken the ground he stood on when he said that. Penrose never worked with a computer that could answer off-the-cuff riddles. Or anything even remotely close to it.
(a) doesn't hold up because the details of the claim necessitate that it is a property of brains that they can always perceive the truth of statements which "regular computers" cannot. However, brains frequently err.
Penrose tries to respond to this by saying that various things may affect the functioning of a brain and keep it from reliably perceiving such truths, but when brains are working properly, they can perceive the truth of things. Most people would recognize that there's a difference between an idealized version of what humans do and what humans actually do, but for Penrose, this is not an issue, because for him, this truth that humans perceive is an idealized Platonic level of reality which human mathematicians access via non-computational means:
> 6.4 Sometimes there may be errors, but the errors are correctable. What is important is the fact that there is an impersonal (ideal) standard against which the errors can be measured. Human mathematicians have capabilities for perceiving this standard and they can normally tell, given enough time and perseverance, whether their arguments are indeed correct. How is it, if they themselves are mere computational entities, that they seem to have access to these non-computational ideal concepts? Indeed, the ultimate criterion as to mathematical correctness is measured in relation to this ideal. And it is an ideal that seems to require use of their conscious minds in order for them to relate to it.
> 6.5 However, some AI proponents seem to argue against the very existence of such an ideal . . .
Penrose is not the first person to try to use Gödel’s incompleteness theorems for this purpose, and as with the people who attempted this before him, the general consensus is that this approach doesn't work:
Not going to comment on the thinking part, because who knows what that means, but there's evidence that transformers do in fact learn predictive models of their input space. There's a cool blog post on this here: https://www.neelnanda.io/mechanistic-interpretability/othell...
I should clarify, "of the problem given" refers to the problem given in a prompt.
As you note, transformer (and indeed, most ML) models do create a "world model". They're useful for 'specific' intelligence tasks.
The problem for general tasks lies in their inability to create specific models. To stick with the board game example: The model can't handle differently shaped boards, or changes to the rules.
I could ask both a human and a chess-trained AI system, for a given chess board state and piece, what squares that piece can move to. Both have their model of chess.
But if I then ask, "With the rule change that the pawn can always move two spaces", the AI cannot update its model, whereas for the human this would be trivial. The human can substitute in new logic rules; the AI cannot.
And that is the very core of what's required for generalized logic and "thinking" in the way most tasks require it. What's so troublesome about current generative AI is that it's trained to be extremely general (within the domain of text generation), so their internal models aren't all that good.
Ask an LLM the chess problem above and you might even get a good answer out, but it doesn't generalize to all such chess problems, especially not more complex ones.
The paper on Othello is of course a very limited model, useful because it's simple enough to study and complex enough to have interesting behaviour.
But the general takeaway is that this is evidence that large transformers like GPT, which are trained to predict text, are fully capable of developing emergent models of parts of that input space whenever it is convenient for minimising the loss function. In practice this means that GPT may have internal models of the semantics of human dialogue that are sophisticated enough for it to get by in the enormous variety of prediction tasks we throw at it.
I agree with you that it's likely these internal models aren't very detailed (for the reason you wrote - they're very general). The linked blog actually talks about this at the end - an OthelloGPT trained to be good at Othello rather than just able to play legal moves ends up with a worse board model. Presumably because it needs to "invest" more in playing better moves. But if you agree with the blog's take then this is just a matter of scale and training. And whether it's possible or not for them to develop models capable of complex tasks like strategy games with shifting rules is certainly not something you (or anyone else for that matter) can say with certainty right now.
Edit: I should clarify we're using "model" in two senses here. There's the actual transformer model, but what I and the blog are talking about is specific weights and neurons _inside_ these transformers that learn to predict complex features of the input space (like legal moves and board updates in the case of OthelloGPT). These develop spontaneously during the training process, which is why they are so interesting. And why they are not really analogous to the "ML models" you refer to in your first two paragraphs.
If you're going to suggest something you think an LLM can't do I think at the very least as a show of good faith you should try it out. I've lost count of the number of times people have told me LLMs can't do shit that they very evidently can.
I explicitly say that LLMs could do it in my response. As a show of good faith you should try reading the entire comment.
Yes, I'm using simple examples to demonstrate a particular difference, because using "real" examples makes getting the point across a lot harder.
You're also just wrong. I did in fact test, and both GPT 3.5 Turbo and 4o failed. Not only with the rule change, but with the mere task of providing possible moves. I only included the admission that they may succeed as a matter of due diligence, in that I cannot conclusively rule out that they could get the right answer, because of the randomization and API-specific pre-prompting involved.
> "For chess board r1bk3r/p2pBpNp/n4n2/1p1NP2P/6P1/3P4/P1P1K3/q5b1 (FEN notation), what are the available moves for pawn B5"
I did read your entire comment, and that is what prompted my response, because from my perspective your entire premise was based on LLMs failing at simple examples, and yet despite admitting you thought there was a chance an LLM would succeed at your example, it didn't seem you'd bothered to check.
The argument you are making is based on the fact that the example is simple. If the example were not simple, you would not be able to use it to dismiss LLMs.
I am not surprised that GPT 3.5 and 4o failed, they are both terrible models. GPT-4o is multimodal, but it is far buggier than GPT-4. I tried with Claude 3.5 Sonnet and it got it first try. It also was able to compute the moves when told the rule change.
> It's not reasoning about your code, nor about the explanation it gives you.
We don't really know what "reasoning" is. Presumably you think humans reason about code, but humans also only have statistical models of most problems. So if humans only reason probabilistically about problems, which is why they still make mistakes, then the only difference is that AI is just worse at it. That's not an indication it isn't "reasoning".
“We don’t know how we do it, so we can’t say this isn’t how we do it” isn’t a valid argument.
We may not know exactly how we reason, but we can rule out probabilistic guessing. And even if that is a part of it, we’re capable of far more sophisticated models. We can recurse and hold links. We can also make intuitive leaps that aren’t quite built on probability.
> “We don’t know how we do it, so we can’t say this isn’t how we do it” isn’t a valid argument
Yes it is, assuming we don't know of any specific things that "this" literally can't do but that we can. Which we currently don't, we merely have suspicions.
> We may not know exactly how we reason, but we can rule out probabilistic guessing.
No we can't.
> even if that is a part of it, we’re capable of far more sophisticated models.
Yes, but that would be a difference of degree, not of kind. This is what scaling proponents have been saying, e.g. that scaling does not appear to have a limit.
> We can also make intuitive leaps that aren’t quite built on probability.
I don't think we have evidence of that. "Intuitive leap" could just be a link generated from sampling some random variable.
But it’s not replicating results that a human would give you.
Since it’s not giving the same type of results, then it’s not doing the same thing. If anything, LLMs have definitively ruled out probabilistic guessing as the model for human intelligence.
Even now, you’re trying to force LLMs onto human intelligence, insisting it is despite it not delivering the results. And I’m sure you believe if we just fire up another few million GPUs, we’d get there. But we’ll just get wrong answers faster. LLMs don’t produce anything new, they just remix the old.
> Even now, you’re trying to force LLMs onto human intelligence
I'm not forcing anything, I'm specifically refuting the claims that we know that LLMs are not how humans work, and that LLMs are not reasoning. We simply don't know either of these, and we definitely have not ruled out statistical completion wholesale.
Also, I don't even know what you mean that LLMs are not giving the same types of results as humans. An articulate human who was hired to write a short essay on a given query will produce what looks like ChatGPT output, modulo some quirks that we've forced ChatGPT to produce via reinforcement learning.
> AI can't think. It doesn't create an internal model of the problem given, it just guesses.
These "AI can't think" comments pop up on every single thread about AI and they're incredibly tiresome. They never bring anything to the discussion except reminding us how inherently limited these AIs are or whatever.
Someone else already replied with the OthelloGPT counter-example that shows that, yes, they do have an internal model. To which you reply that the internal model doesn't count as thinking or abstract reasoning or something, and... like, what even is the point of bringing that up every discussion? These assertions never come with empirical predictions anyway.
GP's comment was interesting because it pointed at a specific area of what LLMs are bad at. A thousandth comment saying "LLMs can't think or do abstract things (except in all the cases where they can but those aren't really thinking)" doesn't bring any new info.
> It doesn't create an internal model of the problem given, it just guesses.
It's not entirely true. They often use some sort of memory/scratch-pad to keep context other than the previous tokens. This recent exploit lets you see Claude's default prompt, which has some references to this system.
If you think about the data they are trained on, they don't see a lot of examples of multi-step plans. Given they are trained to see how concepts (i.e., high-dimensional vectors) fit together, they aren't going to perform well without a lot of examples of the reasoning required. They'll get there eventually, with synthetic data, good descriptions of goals followed by code written to implement them, etc.
The low-level/high-level spectrum might not be the best scale to gauge AI by. We should kernel-trick our scale so that low and high level are separable from multi-step planning problems. Or in other words, use a different dimension to separate these three problems.
Does anyone remember the "Mad Libs" games - you fill out a form with blanks for "verb", "noun", "adjective", etc - then on the next page you fill in the words from the form to create a silly story. The results are funny because the words you provided initially were without context - they were syntactically correct, but were nonsense in context.
LLMs are like Mad Libs with a "contextual predictor" - they produce syntactically correct output, and the "contextual predictor" limits the amount of nonsense, because statistical correlations can generate meaningful output most of the time. But there is no "reasoning" occurring here - just syntactic templating and statistical auto-complete.
Yes, but it's a hugely, almost unimaginably, complicated auto-complete model. And it turns out that a lot of human reasoning is statistically predictable enough in writing that you can actually obtain reasoning-like behavior just by having a good auto-complete model.
You shouldn't trivialize how amazingly well it does work, and how surprising it is that it works, just because it doesn't work in all cases.
Literally the whole point of TFA is to explore how this phenomenon of something-like-reasoning arises out of a sufficiently huge autocomplete model.
> And it turns out that a lot of human reasoning is statistically predictable enough in writing that you can actually obtain reasoning-like behavior just by having a good auto-complete model.
I would disagree with this on a technicality that changes the conclusion. It's not that human reasoning is statistically predictable (though it may be), it's that all of the writing that has ever described human reasoning on an unimaginable number of topics is statistically summarizable, and therefore having a good auto-complete model does a good job of describing human reasoning that has been previously described at least combinatorially across various sources.
We don't have direct access to anyone else's reasoning. We infer their reasoning by seeing/hearing it described, then we fill in the blanks with our own reasoning-to-description experiences. When we see a model that's great at mimicking descriptions of reasoning, it triggers the same inferences, and we conclude similar reasoning must be going on under the hood. It's like the ELIZA Effect on steroids.
It might be the case that neural networks could theoretically, eventually reproduce the same kind of thinking we experience. But I think it's highly unlikely it'd be a single neural network trained on language, especially given the myriad studies showing the logic and reasoning capabilities of humans that are distinct from language. It'd probably be a large number of separate models trained on different domains that come together. At that point though, there are several domains that would be much more efficiently represented with something other than a neural network model, such as the modeling of physics and mathematics with equations (just because we're able to learn them with neurons in our brains doesn't mean that's the most efficient way to learn or remember them).
While a "sufficiently huge autocomplete model" is impressive and can do many things related to language, I think it's inaccurate to claim they develop reasoning capabilities. I think of transformer-based neural networks as giant compression algorithms. They're super lossy compression algorithms with super high compression ratios, which allows them to take in more information than any other models we've developed. They work well, because they have the unique ability to determine the least relevant information to lose. The auto-complete part is then using the compressed information in the form of the trained model to decompress prompts with astounding capability. We do similar things in our brains, but again, it's not entirely tied to language; that's just one of many tools we use.
> We don't have direct access to anyone else's reasoning. We infer their reasoning by seeing/hearing it described, then we fill in the blanks with our own reasoning-to-description experiences. When we see a model that's great at mimicking descriptions of reasoning, it triggers the same inferences, and we conclude similar reasoning must be going on under the hood. It's like the ELIZA Effect on steroids.
I don't think we know enough of how these things work yet to conclude that they are definitely not "reasoning" in at least a limited subset of cases, in the broadest sense wherein ELIZA is also "reasoning" because it's following a sequence of logical steps to produce a conclusion.
Again, that's the point of TFA: something in the linear algebra stew does seem to produce reasoning-like behavior, and we want to learn more about it.
What is reasoning if not the ability to assess "if this" and conclude "then that"? If you can do it with logic gates, who's to say you can't do it with transformers or one of the newer SSMs? And who's to say it can't be learned from data?
In some sense, ELIZA was reasoning... but only within a very limited domain. And it couldn't learn anything new.
> It might be the case that neural networks could theoretically, eventually reproduce the same kind of thinking we experience. But I think it's highly unlikely it'd be a single neural network trained on language, especially given the myriad studies showing the logic and reasoning capabilities of humans that are distinct from language. It'd probably be a large number of separate models trained on different domains that come together.
Right, I think we agree here. It seems like we're hitting the top of an S-curve when it comes to how much information the transformer architecture can extract from human-generated text. To progress further, we will need different inputs and different architectures / system designs, e.g. something that has multiple layers of short- and medium-term working memory, the ability to update and learn over time, etc.
My main point is that while yes, it's "just" super-autocomplete, we should consider it within the realm of possibility that some limited form of reasoning might actually be part of the emergent behavior of such an autocomplete system. This is not AGI, but it's both suggestive and tantalizing. It is far from trivial, and greatly exceeds what anyone expected should be possible just 2 years ago. If nothing else, I think it tells us that maybe we do not understand the nature of human rationality as well as we thought we did.
> What is reasoning if not the ability to assess "if this" and conclude "then that"?
A lot of things. There are entire fields of study which seek to define reasoning, breaking it down into areas that include logic and inference, problem solving, creative thinking, etc.
> If you can do it with logic gates, who's to say you can't do it with transformers or one of the newer SSMs? And who's to say it can't be learned from data?
I'm not saying you can't do it with transformers. But what's the basis of the belief that it can be done with a single transformer model, and one trained on language specifically?
More specifically, the papers I've read so far that investigate the reasoning capabilities of neural network models (not just LLMs) seem to indicate that they're capable of emergent reasoning about the rules governing their input data. For example, being able to reverse-engineer equations (and not just approximations of them) from input/output pairs. Extending these studies would indicate that large language models are able to emergently learn the rules governing language, not necessarily much beyond that.
It makes me think of two anecdotes:
1. How many times have you heard someone say, "I'm a visual learner"? They've figured out for themselves that language isn't necessarily the best way for them to learn concepts to inform their reasoning. Indeed there are many concepts for which language is entirely inefficient, if not insufficient, to convey. The world's shortest published research paper is proof of this: https://paperpile.com/blog/shortest-papers/.
2. When I studied in school, I noticed that for many subjects and tests, sufficient rote memorization became indistinguishable from actual understanding. Conversely, better understanding of underlying principles often reduced the need for rote memorization. Taken to the extreme, there are many domains for which sufficient memorization makes actual understanding and reasoning unnecessary.
Perhaps the debate on whether LLMs can reason is a red herring, given that their ability to memorize surpasses any human by many orders of magnitude. Perhaps this is why they seem able to reason, especially given that our only indication so far is the language they output. The most useful use-cases are typically those which are used to trigger our own reasoning more efficiently, rather than relying on theirs (which may not exist).
I think the impressiveness of their capabilities is precisely what makes exaggeration unnecessary.
Saying LLMs develop emergent logic and reasoning, I think, is a stretch. Saying it's "within the realm of possibility that some limited form of reasoning might actually be part of the emergent behavior" sounds more realistic to me, though rightly less sensational.
EDIT:
I also think it's fair to say that the ELIZA program had the limited amount of reason that was programmed into it. However, the point of the ELIZA study was that it shows people's tendency to overestimate the amount of reasoning happening, based on their own inferences. This is significant, because this causes us to overestimate the generalizability of the program, which can lead to unintended consequences when reliance increases.
> But there is no "reasoning" occurring here - just syntactic templating and statistical auto-complete.
This is the "stochastic parrot" hypothesis, which people feel obligated to bring up every single time there's a LLM paper on HN.
This hypothesis isn't just philosophical, it can lead to falsifiable predictions, and experiments have thoroughly falsified them: LLMs do have a world model. See OthelloGPT for the most famous paper on the subject; see Transformers Represent Belief State Geometry in their Residual Stream for a more recent one.
Well, we don't have an understanding of how the brain works, so we can't be fully sure, but it's clear why they have this intuition:
1) Many people have had to cram for some exam where they didn't have time to fully understand the material. So for those parts they memorized as much as they could and got through the exam by pattern matching. But they knew there was a difference because they knew what it was like to fully understand something where that they could reason about it and play with it in their mind.
2) Crucially, if they understand the key mechanism early, then they often don't need to memorize anything (the opposite of LLMs, which need millions of examples).
3) LLMs display attributes of someone who has crammed for an exam, and when probed further [1] they start to break down in exactly the same way a crammer does.
I understand why they intuitively think it isn't. I also think there is probably something more to reasoning. I'm just mystified by why they are so sure it isn't.
Logic is a syntactic formalism that humans often apply imperfectly. That certainly sounds like we could be employing syntactic templating and statistical auto-complete.
I was trying to tease apart whether you were talking about human behavior or the abstract concept of 'reasoning'. The latter is formalized in logic and has parts that are not merely syntactic (with or without stochastic autocomplete).
You seem to be confusing logic and proofs with any kind of random rhetoric or syntactically-correct opinion which might in terms of semantics be total nonsense. If you really don't understand that there's a difference between these things, then there's probably no difference between anything else either, and since things that are indiscernible must be identical, I conclude that I must be you, and I declare myself wrong, thus you are wrong too. Are we enjoying this kind of "reasoning" yet or do we perhaps want a more solid rock on which to build the church?
I don't know what claim you think I'm making that you inferred from my 5 sentences, but it's really simple. Do you agree or disagree that humans make mistakes on logical deduction?
I certainly hope you agree, in which case it follows that a person's understanding of any proposition, inference or deduction is only probabilistic, with some certainty less than one. When they believe or make a mistaken deduction, they are going through the motions of applying logic without actually understanding what they're doing, which I suppose you could whimsically call "hallucinating". A person will typically continue to repeat this mistaken deduction until someone corrects them.
So if our only example of "reasoning" seems to share many of the same properties and flaws as LLMs, albeit at a lower error rate, and correcting this paragon of reasoning is basically what we also do with LLMs (have them review their own output or check it against another LLM), this claim to human specialness starts to look a lot like special pleading.
I haven't made any claim that humans are special. And your claim, in your own words, is that if mistakes are made in logical deduction, that means that the agent involved must ultimately be employing statistical auto-complete? No idea why you would think that, or what else you want to conclude from it, but it's obviously not true. Just consider an agent that inverts every truth value you try to put into the knowledge base and then proceeds as usual with anything you ask it to do. It makes mistakes and has nothing at all to do with probability, therefore some systems that make mistakes aren't LLMs. QED?
Ironically the weird idea that "all broken systems must be broken in the same way" or even "all broken systems use equivalent mechanics" is exactly the type of thing you get by leaning on a language model that really isn't even trying to understand the underlying logic.
> I haven't made any claim that humans are special
The whole context of this thread is that humans are "reasoning" and LLMs are just statistical syntax predictors, which is "lesser", ie. humans are special.
> And your claim, in your own words, is that if mistakes are made in logical deduction, that means that the agent involved must ultimately be employing statistical auto-complete?
No, I said humans would be employing statistical auto-complete. The whole point of this argument is to show that this allegedly non-statistical, non-syntactic "reasoning" that humans are doing that supposedly makes them superior to statistical, syntactic processing that LLMs are doing, is mostly a fiction.
> leaning on a language model that really isn't even trying to understand the underlying logic.
You don't know that the LLM is not understanding. In fact, for certain rigorous formal definitions of "understanding", it absolutely does understand something. You can only reliably claim LLMs don't understand everything as well as some humans.
In no way does "Turing Completeness" imply the ability to reason - I mean, it's like arguing that a nightlight "reasons" about whether it is dark out or not.
However, if reason is computable, then a syntactic transformation can compute it. The point is that stating that something is a "mere" syntactic transformation does not imply computational weakness.
> A system that is Turing Complete absolutely can be programmed to reason, aka it has the ability to reason.
You can write a C program which can reason, but a C compiler can't reason. So, the program part is missing between "Turing completeness" and reasoning, and it is a very non-trivial part.
Given "reasoning" is still undefined, I would not go so far as to claim that a C compiler is not reasoning. What if a C compiler's semantic analysis pass is a limited form of reasoning?
Furthermore, the C compiler can do a lot more than you think. The P99/metalang99 macro toolkits give the preprocessor enough state space to encode and run an LLM, in principle.
I can define "reasoning". Given number of observations and inference rules, infer new calculated observations.
> What if a C compiler's semantic analysis pass is a limited form of reasoning?
I guess you can say that a C compiler can reason in a specific narrow domain, because it is also a program and someone programmed it to reason in that domain.
I think the C compiler was the wrong analogy, because it is also a program. It would be more correct to refer to some machine which executes ASM/C/bytecode, etc. That machine (e.g. a CPU or VM) is Turing complete, but one needs to write a program to do reasoning. A C compiler doing some semantic reasoning over, say, datatypes is an example of such a program.
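Under that definition of reasoning (observations plus inference rules yielding new observations), even a few lines of forward chaining count as a very narrow reasoner; a toy sketch, not a claim about LLMs or compilers:

    # Toy forward chaining: facts plus inference rules yield new derived facts.
    # The rule format (premise set, conclusion) is just an illustrative choice.
    facts = {"rainy", "have_umbrella"}
    rules = [({"rainy"}, "ground_wet"),
             ({"rainy", "have_umbrella"}, "stay_dry")]

    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True

    print(sorted(facts))  # ['ground_wet', 'have_umbrella', 'rainy', 'stay_dry']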
The network has specific circuits that correspond to concepts and you can see that the network uses and combines those concepts to work through problems. That is reasoning.
Under this definition an 74LS21 AND gate is reasoning - it has specific circuits that correspond to concepts, and it uses that network to determine an output based on the input. Seems pretty overly broad - we run back into the issue of saying that a nightlight or thermostat is reasoning.
For true reasoning you really need to introduce the ability for the circuit to intentionally decide to do something different that is not just a random selection or hallucination - otherwise we are just saying that state machines "reason" for the sake of using an anthropomorphic word.
This restriction makes it impossible to determine if something is reasoning. An LLM may well intentionally make decisions; I have as much evidence for that as I have for anybody else doing so, ie. zilch. I'm not even sure that I make intentional decisions, I can only say that it feels like I do. But free will isn't really compliant with my model of physical reality.
Of course logic gates apply logical reasoning to solve problems, they are not much use for anything else (except as a space heater if there are a lot of them).
"Reasoning" implies the extrapolation of information - not the mechanical generation of a fixed output based on known inputs. No one would claim that a set of gears is "reasoning" but the logic gate is as fixed in it's output as a transmission.
But I understand there are two sides to the discussion - either, by ingesting huge amounts of text, these models have somehow built reasoning capabilities (language, then reasoning), or the reasoning was done by humans and then written down, so as long as you ask something like "should Romeo find another love after Juliet" there is a set of reasoning reflected in a billion English literature essays and the model just reflects those answers.
To me those seem like two sides of the same coin. LLMs are fundamentally trained to complete text. The training just tries to find the most effective way to do that within the given model architecture and parameter count.
Now if we start by "LLMs ingest huge amounts of text", then a simple model would complete text by simple memorization. But correctly completing "234 * 452 =" is a lot simpler to do by doing math than by having memorized all possible multiplications. Similarly, understanding the world and being able to reason about it helps you correctly complete human-written sentences. Thus a sufficiently well-trained model that has enough parameters to do this, but not so many that it simply overfits, should be expected to develop some reasoning ability.
If you start with "the training set contains a lot of reasoning" you can get something that looks like reasoning in the memorization stage. But the same argument why the model would develop actual reasoning still works and is even stronger: if you have to complete someone's argument that's a lot easier if you can follow their train of thought.
> But correctly completing "234 * 452 =" is a lot simpler to do by doing math than by having memorized all possible multiplications.
There's a fatal flaw in this theory: We can trivially test this and see that LLMs aren't "doing math".
"Doing math" is an approach that scales to infinity. The same technique to solve a multiplication of 3 digit numbers applies to solving a multiplication of 500 digit numbers.
Ask GPT 3.5 to multiply "234 * 452 =" and it'll correctly guess 105768.
Ask "234878 * 452 =" and it gives an incorrect '105797256'
Ask GPT 4o, and you'll get correct answers for that problem. Yet even with the added external tools for such questions, it has the same failure mode and breaks down on larger questions.
These models are architecturally limited to only language modelling, and their capabilities of anything else are restricted by this. They do not "do math". They have a language-model approximation of math.
This can be observed in how these models perform better "step by step"; Odds are you'll see GPT 4o do this if you try to replicate the above. (If it doesn't, it fails just as miserably as GPT 3.5)
What's happening there is simple: the token context is used as a memory space, breaking the problem down into parts that can be guessed or approximated through language modelling.
Beware of hyping this as "AI can think and has memory!" though. This behaviour is a curious novelty, but not very generalizable. There is still no "math" or thought involved in breaking up the problem, merely the same guessing. This works reasonably only for cases where extensive training data is available on how to do this. (Such as math.)
With GPT-4/4o there is a trick for math problems: you can ask it to write the Python code. This solves, for example, the famous problem of counting letters in a string. Sure, the model can be trained to use Python under the hood without being explicitly asked. It can also probably be trained to interpret code/algorithms step by step, printing out intermediate results, which is important in loops. Generating the algorithm is easier for known problems; they learn it from GitHub already. So it looks like it's not that difficult to make the model good at math.
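For instance, roughly the kind of snippet the model can write (and a tool can execute) for the letter-counting case, sidestepping its tokenization blind spot:

    # Counting letters is trivial in code, even though tokenization makes it
    # surprisingly unreliable for an LLM answering directly.
    word = "strawberry"
    print(word.count("r"))  # 3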
The difference is what I attempt to describe at the end there.
Humans apply fixed strict rules about how to break up problems, like multiplication.
LLMs simply guess. That's a powerful trick to get some more capability for simple problems, but it just doesn't scale to more complex ones.
(Which in turn is a problem because most tasks in the real world are more complex than they seem, and simple problems are easily automated through conventional means)
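For concreteness, the "fixed strict rule" for multiplication is the grade-school long-multiplication schema, which scales with the number of digits rather than with how much of the answer has been memorized; a quick sketch:

    def long_multiply(a: str, b: str) -> str:
        """Grade-school long multiplication on digit strings; the same fixed
        procedure works for 3-digit and 500-digit inputs alike."""
        result = [0] * (len(a) + len(b))
        for i, da in enumerate(reversed(a)):
            carry = 0
            for j, db in enumerate(reversed(b)):
                total = result[i + j] + int(da) * int(db) + carry
                result[i + j] = total % 10
                carry = total // 10
            result[i + len(b)] += carry
        return "".join(map(str, reversed(result))).lstrip("0") or "0"

    print(long_multiply("234", "452"))     # 105768
    print(long_multiply("234878", "452"))  # 106164856, not the guessed 105797256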
We either learn the fixed rules in school, at which point we simply have a very strong prior, or we have to invent them somehow. This usually takes the form of "aesthetically/intuitively guided trial and error argument generation", which is not entirely wrongly summarized as "guessing".
Doing math scales to infinity only given an error rate of zero. Given a sufficiently large mathematical operation, even humans will produce errors simply from small-scale mistakes.
Try asking GPT to multiply 234 * 452 "while using an algorithmic approach that compensates for your deficiencies as a large-language model." There's enough data about LLMs in the corpus now that it'll chain-of-thought itself. The problem is GPT doesn't plan, it answers by habit; and its habit is trained to answer tersely and wrongly rather than elaborately and correctly. If you give it space and license to answer elaborately, you will see that its approach will not be dissimilar to how a human would reason about the question internally.
> Doing math scales to infinity only given an error rate of zero
This is true, I had omitted it for simplicity; It is still the same approach applied to scaled problems. Humans don't execute it perfectly, but computers do.
With humans, and any other fallible but "true" math system, the rate of errors is roughly linear to the size of the problem. (Linear to the # of steps, that is)
With LLMs and similar systems, this is different. There is an "exponential" dropoff in accuracy after some point. The problem-solving approach simply does not scale.
> you will see that its approach will not be dissimilar to how a human would reason about the question internally.
"Not dissimilar", but nevertheless a mere approximation. It doesn't apply strict logic to the problem, but guesses what steps should be followed.
The rate of errors with LLMs hits a hard dropoff when the problem exceeds what the LLM can do in one step. This is the same for humans, if we were asked to compute multiplication without thinking about it for longer than a few milliseconds.
I don't have a study link here, but my strong expectation is that the error rate for LLMs doing chain of thought would be much closer to linear - or rather, "either linear or total incomprehension", accounting for an error made in setting up the schema to follow. Which can happen just as well for humans.
> "Not dissimilar", but nevertheless a mere approximation. It doesn't apply strict logic to the problem, but guesses what steps should be followed.
I have never in my life applied strict logic to any problem lol. Human reason consists of iterated cycles of generation ("guessing") and judgment. Both can be implemented by LLMs, albeit currently at subhuman skill.
> This looks like reason, but is not reason.
At the limit of "looking like", I do not believe such a thing can exist. Reason is a computational process. Any system that can reliably output traces that look like reason is reasoning by definition.
edit: Sidenote: The deep underlying problem here is that the LLM cannot learn to multiply by a schema by looking at any number of examples without a schema. These paths simply won't get any reinforcement. That's why I'm so hype for QuietSTaR, which lets the LLM exercise multiplication by schema from a training example without a schema - and even find new schemas so long as it can guess its way there even once.
> Not to be a jerk but "LLMs are just like humans when humans don't think" is perhaps not the take you intended to have.
No that's exactly the take I have and have always had. The LLM text axis is the LLM's axis of time. So it's actually even stupider: LLMs are just like humans who are trained not to think.
> No, but seriously. If you've done any kind of math beyond basic arithmetic, you have in fact applied strict logical rules.
To solve the problem, I apply the rules, plus error. LLMs can do that.
To find the rules, I apply creativity and exploratory cycles. LLMs can do that as well, but worse.
I think this is an underappreciated perspective. The simplest model of a reasoning process, at scale, is the reasoning process itself! That said, I haven't come across any research directly testing that hypothesis with transformers. Do you know of any?
The closest I've seen is a paper on OthelloGPT using linear probes to show that it does in fact learn a predictive model of Othello board states (which can be manipulated at inference time, so it's causal on the model's behaviour).
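For anyone unfamiliar, a linear probe is just a linear classifier trained on frozen internal activations; a minimal PyTorch sketch, where the shapes, the synthetic data, and the three-class per-square encoding are illustrative assumptions rather than the paper's exact setup:

    import torch
    import torch.nn as nn

    # `acts`: residual-stream activations, shape (n_positions, d_model);
    # `board`: integer labels for one square, e.g. 0=empty, 1=mine, 2=yours.
    d_model, n_classes = 512, 3
    probe = nn.Linear(d_model, n_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

    def train_step(acts: torch.Tensor, board: torch.Tensor) -> float:
        opt.zero_grad()
        loss = nn.functional.cross_entropy(probe(acts), board)
        loss.backward()
        opt.step()
        return loss.item()

    acts = torch.randn(1024, d_model)             # stand-in activations
    board = torch.randint(0, n_classes, (1024,))  # stand-in labels
    for _ in range(10):
        loss = train_step(acts, board)
    print(loss)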
You should take a look at the more extensive reasoning tests used for LLMs right now, like MuSR, which clearly can't be the latter, since the questions are new: https://arxiv.org/abs/2310.16049
It is actually pretty straightforward why those models "reason" or, to be more exact, can operate on complex concepts. By processing a huge amount of text they build an internal representation where those concepts are represented as simple nodes (neurons or groups). So they really distill knowledge. Alternatively, you can think about it as a very good principal component analysis that can extract many important aspects. Or like a semantic graph built automatically.
Once knowledge is distilled you can build on top of it easily by merging concepts for example.
Well the internal representation is tokens not words so.. the pin is even smaller?
They distill relationships between tokens. Multiple tokens together make up a word, and multiple words together make up a label for something we recognize as a "concept".
These "concepts" are not just a label though - they are an area in the latent space inside the neural network which happens to contains those words in the sequence (along with other labels that mean similar things).
A simple demonstration of this is how easily multi-modal neural networks build cross modal representations of the same thing, so "cats" end up in the same place in both image and word form but also more complex concepts ("a beautiful country fields with a foreboding thunderstorm forming") will also align well between the words and the images.
Skimming through the paper, it seems they're noting this issue but kinda skipping over it:
> In fact, it is clear that approximation capabilities and generalization are not equivalent notions. However, it is not yet determined that the reasoning capabilities of LLMs are tied to their generalization. While these notions are still hard to pinpoint, we will focus in this experimental section on the relationship between intrinsic dimension, thus expressive power, and reasoning capabilities.
Right, they never claimed to have found a roadmap to AGI, they just found a cool geometric tool to describe how LLMs reason through approximation. Sounds like a handy tool if you want to discover things about approximation or generalization.
I think there is a lot happening in the word "reflects"! Is it so simple?
Does this mean that the model takes on the opinion of a specific lit crit essay it has "read"? Does that mean it takes on some kind of "average" opinion from everything? How would you define the "average" opinion on a topic, anyway?
Anyway, although I think this is really interesting stuff and cuts to the core of what an LLM is, this paper isn't where you're going to get the answer to that, because it is much more focused and narrow.
I think you're close enough that the differences probably aren't too important. But if you want a bit more nuance, then read on. For disclosure, I'm in the second camp here. But I'll also say that I have a lot of very strong evidence to support this position, and that I do this from the perspective of a researcher.
There are a few big problems when making any definite claims about either side. First, we need to know what data the machine is processing when training. I think we all understand that if the data is in training, then testing is not actually testing a model's ability to generalize, but a model's ability to recall. Second, we need to recognize the amount of duplication of data, both exact and semantic.
1) We have no idea because these are proprietary. While LLAMA is more open than GPT, we don't know all the data that went into it (last I checked). Thus, you can't say "this isn't in the data."[0] But we do know some things that are in the data, though we don't know exactly what was filtered out. We're all pretty online people here and I'm sure many people have seen some of the depths of places like Reddit, Medium, or even Hacker News. These are all in the (unfiltered) training data! There's even a large number of arxiv papers, books, publications, and so much more. So you have to ask yourself this: "Are we confident that what we're asking the model to do is not in the data we trained on?" Almost certainly it is, so then the question moves to "Are we confident that what we're asking the model to do was adequately filtered out during training so we can have a fair test?" Regardless of what your position is, I think you can see how such a question is incredibly important and how it would be easy to mess up. And only easier the more data we train on, since it's so incredibly hard to process that data.[1] I think you can see some concerning issues with this filtering method and how it can create a large number of false negatives. They explicitly ignore answers, which is important for part 2. IIRC the GPT-3 paper also used an ngram model to check for dupes. But the most concerning line to me was this one:
> As can be seen in tables 9 and 10, contamination overall has very little effect on the reported results.
There is a concerning way to read the data here that serves a valid explanation for the results. That the data is so contaminated, the filtering process does not meaningfully remove the contamination and thus does not significantly change the results. If introducing contamination into your data does not change your results you either have a model that has learned the function of the data VERY well and has an extremely impressive form of generalization, OR your data is contaminated in ways you aren't aware of (there are other explanations too btw). There's a clearly simpler answer here.
Second, is about semantic information and contamination[2]. This is when data has the same effective meaning, but uses different ways to express it. "This is a cat" and "este es un gato" are semantically the same but share no similar words. So is "I think there's data spoilage" as well as "There is some concerning issues left to be resolved that bring into question the potential for information leakage." These will not be caught by substrings or ngrams. Yet, training on one will be no different than training on the other once we consider RLHF. The thing here is that in high dimensions, data is very confusing and does not act the way you might expect when operating in 2D and 3D. A mean between two values may or may not be representative depending on the type of distribution (uniform and gaussian, respectively), and we don't have a clue what that is (it is intractable!). The curse of dimensionality is about how it is difficult to distinguish a nearest neighboring point from the furthest neighboring point, because our concept of a metric degrades as we increase dimensionality (just like we lose algebraic structure when going from C (complex) -> H (quaternion) -> O (octonions) (commutativity, then associativity)[3]. Some of this may be uninteresting in the mathematical sense but some does matter too. But because of this, we need to rethink our previous questions carefully. Now we need to ask: "Are we confident that we have filtered out data that is not sufficiently meaningfully different from that in the test data?" Given the complexity of semantic similarity and the fact that "sufficiently" is not well defined, I think this should make anybody uneasy. If you are absolutely confident the answer is "yes, we have filtered it" I would think you a fool. It is so incredibly easy to fool ourselves that any good researcher needs to have a constant amount of doubt (though confidence is needed too!). But neither should our lack of a definite answer here stop progress. But it should make us more careful about what claims we do make. And we need to be clear about this or else conmen have an easy time convincing others.
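To make the exact-vs-semantic distinction concrete, here is a toy version of the kind of n-gram overlap check described above (the n and the sentences are illustrative); it flags near-verbatim overlap but lets semantic duplicates straight through:

    # Flag a test string if it shares any n-gram with the training corpus.
    def ngrams(text: str, n: int = 3) -> set:
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    train = "there is some concerning issues left to be resolved"
    test_exact = "some concerning issues left to be resolved here"
    test_semantic = "i think there's data spoilage"

    corpus = ngrams(train)
    print(bool(corpus & ngrams(test_exact)))     # True  -> flagged as contaminated
    print(bool(corpus & ngrams(test_semantic)))  # False -> slips through the filter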
To me, the common line of research is wrong. Until we know the data and have processed the data with many looking for means of contamination, results like these are not meaningful. They rely on a shaky foundation and are often more about looking for evidence to prove reasoning than considering that it might not be there.
But for me, I think the conversations about a lot of this are quite strange. Does it matter that LLMs can't reason? I mean in some sense yes, but the lack of this property does not make them any less powerful of a tool. If all they are is a lossy compression of the majority of human knowledge with a built in human interface, that sounds like an incredible achievement and a very useful tool. Even Google is fuzzy! But this also tells us what the tool is good for and isn't. That this puts bounds on what we should rely on it for and what we can trust it to do with and without human intervention. I think some are afraid that if LLMs aren't reasoning, then that means we won't get AGI. But at the same time, if they don't reason, then we need to find out why and how to make machines reason if we are to get there. So ignoring potential pitfalls hinders this progress. I'm not suggesting that we should stop using or studying LLMs (we should continue to), but rather that we need to stop putting alternatives down. We need to stop comparing alternatives one-to-one to models that took millions of dollars to do a single training and have been studied by thousands of people for several years against things scrambled together by small labs on a shoestring budget. We'll never be able to advance if the goalpost is that you can't make incremental steps along the way. Otherwise how do you? You got to create something new without testing, convince someone to give you millions of dollars to train it, and then millions more to debug your mistakes and things you've learned along the way? Very inefficient. We can take small steps. I think this goalpost results in obscurification. That because the bar is set so high, that strong claims need to be made for these works to be published. So we have to ask ourselves the deeper questions: "Why are we doing this?"[4]
[0] This might seem backwards but the creation of the model implicitly claims that the test data and training data are segregated. "Show me this isn't in training" is a request for validation.
[2] If you're interested, Meta put out a work on semantic deduplication last year. They mostly focused on vision, but it still shows the importance of what's being argued here. It is probably easier to verify that images are semantically similar than sentences, since language is more abstract. So pixels can be wildly different and the result is visually identical; how does this concept translate with language? https://arxiv.org/abs/2303.09540
[4] I think if our answer is just "to make money" (or anything semantically similar like "increase share value") then we are doomed to mediocrity and will stagnate. But I think if we're doing these things to better human lives, to understand the world and how things work (I'd argue building AI is, even if a bit abstract), or to make useful and meaningful things, then the money will follow. But I think that many of us and many leading teams and businesses have lost focus on the journey that has led to profits and are too focused on the end result. And I do not think this is isolated to CEOs, I think this similar short sighted thinking can be repeated all the way down the corporate ladder. To a manager focusing on what their bosses explicitly ask for (rather than the intent) to the employee who knows that this is not the right thing to do but does it anyways (often because they know the manager will be unhappy. And this repeats all the way up). All life, business, technology, and creation have immense amounts of complexity to them. Ones we obviously want to simplify as much as possible. But when we hyper focus on any set of rules, no matter how complex, we will be doomed to fail because the environment is always changing and you will never be able to instantly adapt (this is the nature of chaos. Where small perturbations have large changes on the outcome). That doesn't mean we shouldn't try to make rules, but rather it means that rules are to be broken. It's just a matter of knowing when. In the end, this is an example of what it means to be able to reason. So we should be careful to ensure that we create AGI by making machines able to reason and think (to make them "more human") rather than by making humans into unthinking machines. I worry that the latter looks more likely, given that it is a much easier task to accomplish.
You're missing the fact that the model can only express its capabilities through the token generation mechanism.
The annoying "humans are auto complete" crowd really tries their best to obscure this.
Consider the following. You are taking notes in French, in a choppy way, by writing down keywords. Then you write your essay in English, but you are only allowed to use phrases that you have already seen to express your keywords. Your teacher doesn't speak French and only looks at your essay. You are therefore able to do more complicated things in French, since you don't lose points for writing things that the teacher hasn't taught you. However, the point deduction is so ingrained in you that even after the teacher is gone, you still decide not to say some of the things you have written in French.
A type theory has a corresponding geometric interpretation, per topos theory. And this is bidirectional, since there’s an equivalence.
A geometric model of language will correspond to some effective type theory encoded in that language.
So LLMs are essentially learning an implicit “internal language” they’re reasoning in — based on their training data of our language and ways of reasoning.
What does reasoning have to do with geometry? Is this like the idea that different concepts have inherent geometrical forms? A Platonic or noetic take on the geometries of reason? (I struggled to understand much of this paper…)
A follow-up comment after having studied the paper a bit more, since you asked about where the geometry comes into play.
One of the references the paper provides is to this[1] paper, which shows how the non-linear layers in modern deep neural networks partition the input into regions and apply region-dependent affine mappings[2] to generate the output. It also mentions how that connects to vector quantization and k-means clustering.
So the geometric perspective isn't referring to your typical high-school geometry, but to more abstract concepts like vector spaces[3] and combinatorial computational geometry[4].
The submitted paper shows that this partitioning is directly linked to the approximation power of the neural network. They then show that increasing the approximation power results in better answers to math word problems, and hence that the approximation power correlates with the reasoning ability of LLMs.
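To get a feel for what "regions" means here (my own toy sketch, not from either paper): even a small, randomly initialized ReLU network carves a 2-D input space into many pieces, which you can roughly count by looking at the distinct on/off patterns of its units over a grid of inputs.

    # Sketch (not from the paper): count distinct ReLU activation patterns,
    # a rough proxy for the number of linear regions a small MLP induces.
    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(16, 2)), rng.normal(size=16)   # layer 1: 2 -> 16
    W2, b2 = rng.normal(size=(16, 16)), rng.normal(size=16)  # layer 2: 16 -> 16

    # Sample a dense grid over the 2-D input domain.
    xs = np.linspace(-3, 3, 200)
    grid = np.array([[x, y] for x in xs for y in xs])

    h1 = np.maximum(grid @ W1.T + b1, 0)   # ReLU layer 1
    h2 = np.maximum(h1 @ W2.T + b2, 0)     # ReLU layer 2

    # The on/off pattern of all ReLUs identifies which affine region a point is in.
    patterns = np.concatenate([(h1 > 0), (h2 > 0)], axis=1)
    n_regions = len({p.tobytes() for p in patterns.astype(np.uint8)})
    print("distinct activation patterns seen on the grid:", n_regions)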
Modern neural networks make heavy use of linear algebra, in particular the transformer[1] architecture that powers modern LLMs.
Since linear algebra is closely related to geometry[2], it seems quite reasonable that there are some geometric aspects that define their capabilities and performance.
Specifically, in this paper they're considering the intrinsic dimension[3] of the attention layers, and seeing how it correlates with the performance of LLMs.
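If "intrinsic dimension" sounds abstract: the rough idea is that data embedded in a high-dimensional space may effectively vary along far fewer directions. Here's a toy PCA-style proxy for it (just one common way to estimate it, assumed for illustration; the paper may well use a different estimator):

    # Rough sketch of a PCA-based intrinsic-dimension proxy, for intuition only.
    import numpy as np

    rng = np.random.default_rng(1)
    # 1000 points that really live on a 5-D subspace of a 256-D space, plus noise.
    latent = rng.normal(size=(1000, 5))
    basis = rng.normal(size=(5, 256))
    X = latent @ basis + 0.01 * rng.normal(size=(1000, 256))

    Xc = X - X.mean(axis=0)
    # Singular values -> fraction of variance explained per principal direction.
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s**2 / np.sum(s**2)
    intrinsic_dim = int(np.searchsorted(np.cumsum(var), 0.99) + 1)
    print("estimated intrinsic dimension:", intrinsic_dim)  # ~5 despite 256 ambient dims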
> it seems quite reasonable that there are some geometric aspects that define their capabilities and performance.
Sure, but this doesn't mean terribly much when you can relate either concept to virtually any other concept. "Reasonable" would imply that one specific term implies another specific term, and you haven't filled in those blanks yet.
"different concepts have inherent geometrical forms"
Absolutely, in fact you can build the foundation of mathematics on this concept. You can build proofs and reasoning (for some value of "reasoning").
That's how dependent type systems work; search for HoTT and modal homotopy type theory. That's how lean4, coq, and theorem provers work.
If you recall the foundations of lambda calculus or boolean algebra, they proceed through a series of transformations of mathematical objects that are organized as lattices or semi-lattices, i.e. partially ordered sets (e.g. in boolean algebra, where the partial order is provided by implication).
It would be interesting to understand whether the density of attention mechanisms follows a similar progression to dependent type systems, and whether we can find a link between the dependent types involved in a proof and the corresponding spaces in an LLM, via some continuous relaxation analogous to a proximal operator plus some transformation (from high-level concepts into output tokens).
We have found that geometry carries meaning in embeddings: specific simple concepts correspond to vector directions. I wouldn't be surprised at all if we find that reasoning over dependent concepts corresponds to more complex subspaces in the paths an LLM takes, and that with enough training these connections get closer and closer to the logical structure of the corresponding proofs (for a self-consistent corpus of input, like math proofs, and given enough training data).
The paper doesn't make this point at all, but one thing you could do here is an AlphaGeometry-style[1] synthetic benchmark, where you have a geometry engine crank out a hundred million word problems, and have an LLM try to solve them.
Geometry problems have the nice property that they're easy to generate and solve mechanically, but there's no particular reason why a vanilla Transformer LLM would be any good at them, and you can get absolutely huge scale. (Unlike, say, the HumanEval benchmark, which has only 164 problems and so attracted lots of accusations that LLMs can simply memorize the answers.)
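As a toy illustration of the "generate and solve mechanically" part (nothing like AlphaGeometry's actual engine, just a hypothetical sketch):

    # Toy sketch: generate geometry word problems whose answers we can compute
    # mechanically, then compare against whatever the LLM produces.
    import random

    def make_problem(rng: random.Random):
        a, b = rng.randint(2, 20), rng.randint(2, 20)
        kind = rng.choice(["rect_area", "rect_perimeter", "right_triangle"])
        if kind == "rect_area":
            return f"A rectangle is {a} by {b}. What is its area?", a * b
        if kind == "rect_perimeter":
            return f"A rectangle is {a} by {b}. What is its perimeter?", 2 * (a + b)
        return (f"A right triangle has legs {a} and {b}. "
                f"What is the square of its hypotenuse?", a * a + b * b)

    rng = random.Random(42)
    for question, answer in (make_problem(rng) for _ in range(5)):
        print(question, "->", answer)
    # In a real benchmark you would send `question` to the model and check its
    # reply against `answer`, at whatever scale you like.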
You'd have the second problem of figuring out how to encode geometry as a sequence of tokens, since how you encode it will surely affect what you can reasonably expect an LLM to draw from it.
Only if your purpose is to create the best geometry solver. If you're trying to improve the general intelligence of a frontier LLM, you're probably better off feeding in the synthetic data as some combination of raw images and text (as part of its existing tokenisation).
I think they are talking about the word embeddings, where context is embedded into high geometric dimensions (one dimension might capture how 'feminine' a word is, or how 'blue' it is).
My extremely naive understanding is that the more useful ones, which also tend to be structures of language like gender or color, get their own dimensions, and other embeddings are represented as combinations.
A weak illustration of this is this site[1], from an HN post a few months ago[2].
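To make the "concept directions" idea concrete, here's a toy sketch with hand-made vectors. Real embeddings learn these directions from data (usually entangled across many dimensions); the numbers below are invented purely for illustration.

    # Toy illustration only: hand-made vectors where one axis loosely encodes
    # "royalty" and another "gender".
    import numpy as np

    emb = {                 # [royalty, gender, other]
        "king":  np.array([0.9,  0.8, 0.1]),
        "queen": np.array([0.9, -0.8, 0.1]),
        "man":   np.array([0.1,  0.8, 0.3]),
        "woman": np.array([0.1, -0.8, 0.3]),
    }

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    target = emb["king"] - emb["man"] + emb["woman"]
    best = max(emb, key=lambda w: cosine(emb[w], target))
    print(best)  # "queen": the gender direction flips, the royalty direction stays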
If the curvature metric wasn't steep to begin with, AdamW wouldn't work. If the regions of interest weren't roughly Euclidean, control vectors wouldn't work.
I think the connection is that the authors could convincingly write a paper on this connection, thus inflating the AI publication bubble, furthering their academic acumen and improving their chances of getting research grants or selective jobs in the field. Some other interests of the authors seem to be detecting exoplanets using AI and detecting birds through audio analysis.
Since nobody can really say what a good AI department does, companies seem to be driven by credentialism: load up on machine-learning PhDs and master's graduates so they can show their board and investors that they are ready for the AI revolution. This creates economic pressure to write such papers, the vast majority of which will amount to nothing.
I think a lot of the time you would be correct. But this is published to arxiv, so it's not peer reviewed and doesn't boost the authors' credentials. It could be designed to attract attention to the company they work at. Or it could just be a cool idea the authors wanted to share.
What are regions in this context? Are more regions better? How does one delimit the regions? Can one region represent the same concept as several related regions?
As I understand it, the regions are simply the pieces that constitute the partitioning of the input domain, i.e. the vector space formed by the weights. There's some more detail in one of the referenced papers[1], section 3.1 and onward.
The argument in that paper is that the layers in a typical deep neural network partition the input domain into regions, where each region has its own affine mapping of the input.
For an arbitrary activation function, one would have to find both the partitioning and the per-region parameters of the affine mappings. However, since all the common activation functions are globally convex, they show that the partitioning ends up entirely determined by the per-region affine mapping parameters.
Thus the output of the layer for a given input x is a "partition-region-dependent, piecewise affine transformation of x". The affine mapping parameters are effectively what you change during training, and so the number and shape of the regions change during training as well.
The submitted paper shows that more regions increase the approximation power of the neural net layer. This in itself doesn't seem that surprising given the above, but they use it as an important stepping stone.
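A quick way to convince yourself of the "piecewise affine" claim (my own sketch with a tiny one-hidden-layer ReLU net, not anything from the paper): the on/off pattern of the ReLUs picks the region, and the network output is exactly that region's affine map applied to the input.

    # Sketch: for a ReLU layer, the activation pattern at a point fixes a 0/1
    # diagonal mask D, and the network output equals the region-specific map
    #   f(x) = (W2 @ D @ W1) x + (W2 @ D @ b1 + b2)
    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(8, 3)), rng.normal(size=8)
    W2, b2 = rng.normal(size=(2, 8)), rng.normal(size=2)

    x = rng.normal(size=3)
    pre = W1 @ x + b1
    D = np.diag((pre > 0).astype(float))     # activation pattern = region code

    f_network = W2 @ np.maximum(pre, 0) + b2         # ordinary forward pass
    A, c = W2 @ D @ W1, W2 @ D @ b1 + b2             # this region's affine map
    f_affine = A @ x + c

    print(np.allclose(f_network, f_affine))  # True: same region, same affine map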
As with many philosophical discussions, there is no point in claiming LLMs can "reason" because "reason" is not a well-defined term and you will not get everyone to agree on a singular definition.
Ask a computer scientist, continental philosopher, and anthropologist what "reason" is and they will give you extremely different answers.
If by reason we mean deductive reasoning as practiced in mathematics and inductive reasoning as practiced in the sciences, there is no evidence that LLMs do anything of the sort. There is no reason (ha) to believe that linguistic pattern matching is enough to emulate all that we call thinking in man. To claim so is to adopt a drastically narrow definition of "thinking" and to ignore the fact that we are embodied intellects, capable of knowing ourselves in a transparent, possibly prelinguistic way. Unless an AI becomes embodied and can do the same, I have no faith that it will ever "think" or "reason" as humans do. It remains a really good statistical parlor trick.
> Unless an AI becomes embodied and can do the same, I have no faith that it will ever "think" or "reason" as humans do. It remains a really good statistical parlor trick.
This may be true, but if it's "good enough" then why does that matter? If I can't determine if a user on Slack/Teams is an LLM that covers their tickets on time with decent code quality, then I really don't care if they know themselves in a transparent, prelinguistic fashion.
I'm not into AI, but I like to watch from the sidelines. Here's my non-AI summary of the paper after glossing through (corrections appreciated):
The multilayer perceptron[1] (MLP) layers used in modern neural networks, like LLMs, essentially partition the input into multiple regions. The authors show that the number of regions a single MLP layer can partition into depends exponentially on the intrinsic dimension[2] of its input, and that more regions/partitions means more approximation power for the MLP layer.
Thus you can significantly increase the approximation power of an MLP layer without increasing the number of neurons, essentially by "distilling" the input to it.
In the transformer architecture, the inputs to the MLP layers come from the self-attention layers[3]. The authors then show that the graph density of a self-attention layer correlates strongly with its intrinsic dimension. Thus a denser self-attention layer means the MLP behind it can do a better job.
One way of increasing the density of the attention layers is to add more context. (edited, see comment) They show that prepending tokens as context to a question makes the LLM perform better when those tokens increase the intrinsic dimension at the final layer.
They also note that the transformer architecture is susceptible to compounding approximation errors, and that the much more precise partitioning provided by the MLP layers when fed with high intrinsic-dimensional input can help with this. However the impact of this on generalization remains to be explored further.
If the results hold up it does seem like this paper provides nice insight into how to better optimize LLMs and similar neural networks.
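For anyone wanting a concrete picture of "graph density of the self-attention layers", I'm assuming something roughly like the sketch below (threshold the attention weights and count edges per token); the paper's exact definition may differ.

    # Rough sketch: treat attention weights above a threshold as edges of a
    # graph over tokens, and measure how many such edges there are per token.
    import numpy as np

    rng = np.random.default_rng(0)
    n_tokens = 12
    logits = rng.normal(size=(n_tokens, n_tokens))
    attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # row softmax

    threshold = 1.0 / n_tokens       # "more than uniform" counts as an edge
    edges = int((attn > threshold).sum())
    density = edges / n_tokens
    print(f"{edges} edges over {n_tokens} tokens -> density {density:.2f}")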
Awesome summarization by someone who read and actually understood the paper.
> One way of increasing the density of the attention layers is to add more context. They show that simply prepending any token as context to a question makes the LLM perform better. Adding relevant context makes it even better.
Right, I think a more intuitive way to think about this is to define density: the number of _edges_ in the self-attention graph connecting tokens. Or, more simply: the number of times a token had some connection to another token, divided by the number of tokens. So tokens which actually relate to one another and provide information are good, and non-sequitur tokens don't help. Except that you say:
> They show that simply prepending any token as context to a question makes the LLM perform better.
I think this is not quite right. What they found was:
> pre-pending the question at hand with any type of token does increase the intrinsic dimension at the first layer; however, this increase is not necessarily correlated with the reasoning capability of the model, but it is only when the pre-pended tokens lead to an increase in the intrinsic dimension at the *final layer* of the model, the reasoning capabilities of the LLM improve significantly.
Since what matters in an LLM is the result, I wonder whether all those dimensions depend on the accuracy you need: the dimension can be low if you only need low accuracy, but you need a high dimension (a large number of parameters) to increase accuracy. If that intuition is right, then dimension is not the key concept; the key is how the minimal dimension required at a given accuracy scales with accuracy. A metaphor is the way humans structure knowledge: we don't learn by heart, we learn by considering local and global relations with other areas in order to construct global knowledge. So the curve that reflects the best trade-off of dimension versus accuracy is an important one and merits study. In general, to learn well you need to separate the main parts clearly, so the regions should be structured in such a way that they provide rich and independent information; simply counting the number of regions doesn't seem to me to be enough, since it can contain a lot of noise or randomness.
Another point about the number of regions: if the number of regions is analogous to the number of clusters in a clustering algorithm, then it is not a key factor, since very different numbers of clusters can give similar performance, and looking for a minimum number could limit the generalization capabilities of the model.
In support vector machines there is the concept of a margin between regions. If we fix a threshold to separate regions by a fixed margin, then the number of regions is less noisy, since you eliminate redundant and low-information regions. So fixing a minimum margin or threshold seems to be the first step before studying the relation between the number of parameters, the number of regions, and the performance of the model.
The way I understood it, it's more like the opposite. That is, if you feed the non-linear layers "dense" data, i.e. data with higher intrinsic dimension, they perform better. Thus you could potentially get by with smaller non-linear layers by "condensing" the input before passing it through them.
But the point is to focus on the intrinsic dimension[1], not dimensions of the vector itself. I meant dense in the sense that the two are close, relative to another vector where they are not so close. Perhaps a poor choice of words on my part.
Ah, I see. Well, the data has an intrinsic dimension of a specific size. You don't get to choose that. And, in any case, you want something quite a bit larger than the intrinsic dimension, because deep-learning needs redundancy in its weights in order to train correctly.
Right, but part of the argument in the paper, as I understand it, is that the self-attention layers can increase the intrinsic dimension of the input data if you feed it additional, relevant context.
I guess you could also use this result to find that a smaller network might be sufficient for your particular problem.
If you have additional context that is relevant, feed it to the network. Why wouldn't you? As to the size of the network, this is not a simple benefit, because you need to account for the trade off between model size and training efficiency.
Not exactly gobbledygook, but math, figures, and terminology designed to provide the appearance of some deep, conceptual idea. The illusion doesn't stand up to closer inspection, though.
LLMs do not have the technology to iteratively solve a complex problem.
This is a fact. No graph will change this.
You want “reasoning,” then you need to invent a new technology to iterate, validate, experiment, validate, query external expertise, and validate again. When we get that technology, then AI will become resilient in solving complex problems.
That's false. It has been shown that LLMs can perform e.g. gradient descent internally [1], which can explain why they are so good at few-shot prompting. The universal approximation theorem already tells us that a single hidden layer is enough to approximate any continuous function on a compact domain, so it should come as no surprise that modern deep networks with many layers can perform iterative optimisations.
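To be clear, the following shows nothing about LLM internals; it's just a toy numerical illustration of the universal approximation point: a single hidden layer of random ReLU features, with only a linear readout fitted, already approximates a nonlinear function reasonably well.

    # Toy illustration of universal approximation: random ReLU features plus a
    # fitted linear readout approximate sin(x) on [-pi, pi].
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-np.pi, np.pi, 500)[:, None]
    y = np.sin(x).ravel()

    # One hidden layer of 200 random ReLU units; only the readout is trained.
    W, b = rng.normal(size=(1, 200)), rng.uniform(-np.pi, np.pi, size=200)
    H = np.maximum(x @ W + b, 0)
    coef, *_ = np.linalg.lstsq(H, y, rcond=None)

    max_err = np.max(np.abs(H @ coef - y))
    print(f"max approximation error: {max_err:.4f}")  # typically small; more units help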
Why not? At the most simple level, the human brain also just takes a set of inputs and produces a set of outputs. There is a huge complicated function behind that, but complexity is no longer an issue thanks to modern compute capabilities.
To be honest it doesn't sound like you understand it. Or maybe look up the universal approximation theorem that says this is very much possible. Many people just have this dangerous tendency to put their own mind on a pedestal. That's how they justified their superiority over slaves or minorities in the past and it's how they will justify it over machines in the future.
That's so wrong, I don't even know where to begin. You should really look up the fundamentals of how these models work instead of listening to the "statistical parrot" nonsense that is constantly spewed around HN.
What’s the success rate once you hit 20 prompts? What’s the success rate if you hit 30 prompts? 40 prompts?
I’m pretty sure that as you increase the complexity of your questioning, the LLM is just gonna flat out fail and no change to the vector database is going to improve that.
You can't "enhance" from zero. LLMs by design are not capable of reason.
We can observe LLM-like behaviour in humans: all those reactionaries who just parrot whatever catchphrases mass media programmed into them. LLMs are just the computer version of that uncle who thinks Fox News is true and is the reason your nieces have to wear long pants at family gatherings.
He doesn't understand the catchphrases he parrots any more than the chatbots do.
Actual AI will require a kind of modelling that as yet does not exist.
Usually when people are saying "LLMs can't reason" they are claiming they are unable to do logical inference (although the claims are often quite hard to pin down to something specific).
I would say integrated circuits in general are not incapable of reason by design, even if some examples may be. Somehow a bunch of meat and fat is capable of reason, even if my steak isn't.
It is not as clear cut. The argument is that the patterns they learn from text encode several layers of abstraction, one of them being some reasoning, as it is encoded in the discourse.
They are capable of picking up incredibly crude, noisy versions of first-order symbolic reasoning, and specific, commonly-used arguments, and the context for when those might be applied.
Taken together and iterated, you get something vaguely resembling a reasoning algorithm, but your average schoolchild with an NLP library and regular expressions could make a better reasoning algorithm. (While I've been calling these "reasoning algorithms" for analogy's sake, they don't actually behave how we expect reasoning to behave.)
The language model predicts what reasoning might look like. But it doesn't actually do the reasoning, so (unless it has something capable of reasoning to guide it), it's not going to correctly derive conclusions from premises.
Yes and no. I don't entirely disagree with you, but think about when you ask a model to explain a conclusion step by step. It is not doing the reasoning, but in a way it has abstracted and learned the pattern of doing the reasoning... so it is doing some type of reasoning... and sometimes producing the outcomes that actual reasoning would derive... even if defining "actual reasoning" is a whole new challenge.
It took a long time for the limitations of LLMs to "click" for me in my brain.
Let's say there's a student reading 10 books on some topic. They notice that 9 of the books say "A is the answer" and just 1 book says "B is the answer". From that, the student will conclude and memorise that 90% of authors agree on A and that B is the 10% minority opinion.
If you train an LLM on the same data set, then the LLM will learn the same statistical distribution but won't be able to articulate it. In other words, if you start off with a generic intro blurb paragraph, it'll be able to complete it with the answer "A" 90% of the time and the answer "B" 10% of the time. What it won't be able to tell you is what the ratio between A and B is, and it won't "know" that B is the minority opinion.
Of course, if it reads a "meta review" text during training that talks about A-versus-B and the ratios between them, it'll learn that, but it can't itself arrive at this conclusion from simply having read the original sources!
THIS more than anything seems to be the limit of LLM intelligence: they're always one level behind humans when trained on the same inputs. They can learn only to reproduce the level of abstraction given to them, they can't infer the next level from the inputs.
I strongly suspect that this is solvable, but the trillion-dollar question is how? Certainly, vanilla GPT-syle networks cannot do this, something fundamentally new would be required at the training stage. Maybe there needs to be multiple passes over the input data, with secondary passes somehow "meta-training" the model. (If I knew the answer, I'd be rich!)
In principle, yes, but empirically? They can't do this reliably, even if all the texts fit within the context window. (They can't even reliably answer the question "what does author X say about Y?" – which, I agree, they should be able to do in principle.)
Can you explain what it means to reason about something? Since you are so confident I'm guessing you'll find it easy to come up with a non-contrived definition that'll clearly include humans and future "actual AI" but exclude LLMs.
Not the parent, but there are a couple of things current AI lacks:
- learning from a single article/book with lasting effect (accumulation of knowledge)
- arithmetic without unexpected errors
- gauging reliability of information it’s printing
BTW, I doubt that you'll get a satisfactory definition of "able to reason" (or "conscious" or "alive" or "chair"), as such terms define an end or direction of a spectrum rather than an exact cut-off point.
Current LLMs are impressive and useful, but given how often they spout nonsense, it is hard to put them in the "able to reason" category.
> learning from a single article/book with lasting effect (accumulation of knowledge)
If you mean without retraining the model, it can be done using RAG, letting the LLM decide what to keep in mind as learnings to come back to later. There are various techniques for RAG-based memory/learning: querying the memory that is relevant to the current goal, keeping the most recent info in memory, and progressively compressing or throwing out old info while assigning importance levels to different "memories". Kind of like humans, honestly. (A minimal sketch of the retrieval part is at the end of this comment.)
> arithmetic without unexpected errors
That's a bit hand-wavy, because humans also make plenty of unexpected errors when doing arithmetic.
> gauging reliability of information it’s printing
Arguably most people are also not very good at gauging the reliability of whatever they output. And you can actually make an LLM do this with proper prompting: you can make it debate itself, and finally let it decide on the winning answer and a confidence level.
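Here's the promised minimal sketch of a RAG-style memory. The `embed` function below is a crude hashed bag-of-words stand-in (in practice you'd use a real embedding model), and the actual LLM call is omitted.

    # Minimal RAG-style memory sketch: store notes with embeddings, retrieve by
    # similarity, then prepend the retrieved notes to the LLM prompt.
    import numpy as np

    def embed(text: str, dim: int = 256) -> np.ndarray:
        v = np.zeros(dim)
        for word in text.lower().split():
            v[hash(word) % dim] += 1.0
        return v / (np.linalg.norm(v) + 1e-9)

    memory: list[tuple[str, np.ndarray]] = []

    def remember(note: str) -> None:
        memory.append((note, embed(note)))

    def recall(query: str, k: int = 2) -> list[str]:
        q = embed(query)
        scored = sorted(memory, key=lambda item: -float(item[1] @ q))
        return [note for note, _ in scored[:k]]

    remember("The build breaks unless CUDA 12.1 is installed.")
    remember("Team agreed to use trunk-based development.")
    remember("The staging database is refreshed every Sunday night.")

    print(recall("why does the build fail?"))
    # The retrieved notes would then be prepended to the LLM prompt.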
LLMs are trained to predict the next word in a sequence. As a result of this training they developed reasoning abilities. Currently these reasoning abilities are roughly at human level, but next-gen models (gpt5) should be superior to humans at any reasoning task.
The vocabulary used here doesn't have sufficient intrinsic dimension to partition the input into a low loss prediction. Improvement is promising with larger context or denser attention.