First of all, I would urge you to stop arbitrarily using negative words to make an argument. Saying that LLMs are "blathering" is equivalent to saying you and I are "smacking meat onto plastic to communicate" - it's completely empty of meaning. This "vibes-based arguing" is common in these discussions and a massive waste of time.
Now, I don't really understand what you mean by "almost impossible to maintain long-term context/attention". I write fiction in my spare time, and in my testing LLMs do very well at this - even subtle and complex simulations of environments, including keeping track of multiple "off-screen" dynamics like a pot boiling over.
There is nothing "1-dimensional" about the context, unless you mean that it is directional in time, which any human thought is as well, of course. As I said in my original reply, each token is represented by a multidimensional embedding, and even that is abstracted away by the time inference reaches the later layers. The word "citrus" isn't just a word for the LLM, just as it isn't just a word for you. Its internal representation retrieves all the contextual understanding related to it - properties, associated feelings, usage, every relevant abstract concept. And these concepts interact with the embedding of every other token in the input in a learned way, and with the positions they have relative to each other. Then, when an output is generated from that dynamic, that output influences the dynamic in a way that is just as multidimensional.
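To make that concrete, here is a toy single-head self-attention sketch in NumPy. The dimensions, random weights, and helper function are placeholders for illustration, not any particular model's code - just the shape of "every token interacting with every other token through learned projections":

```python
# Toy single-head self-attention (NumPy). Every token's embedding is compared
# against every other token's embedding via learned projections; the resulting
# weights mix the value vectors into a new, context-aware representation for
# each position. All weights below are random placeholders.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model) token embeddings (positional info already added)
    # Wq, Wk, Wv: learned projections, each (d_model, d_head)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)  # (seq_len, seq_len)
    return scores @ V  # each row: a context-mixed representation of one token

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))        # embeddings for 5 tokens
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)     # (5, 8)
```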
The model can maintain context as rich as it wants, and it can build upon that context in whatever way it wants as well. The problem is that in some domains it didn't get enough training time to build robust transformation rules, leading it to draw false conclusions.
You should reflect on why you are only able to provide vague, underdefined, and often incorrect arguments here. You're drawing distinctions that don't really exist and trying to hide that by appealing to false intuitions.
> The reasoning weakness... it's a fundamental architecturally-based limitation...
You have provided no evidence or reasoning for that conclusion. The river crossing puzzle is exactly what I had in mind when talking about specific domains. It is a common trick question with little to no variation, and LLMs have overfit on that specific form of the problem. Translate it to any other version - say transferring potatoes from one pot to the next, or even a mathematical description of sets being modified - and the models do just fine. This is like tricking a human with the "As I was going to Saint Ives" question, exploiting their expectation of having to do arithmetic because it looks superficially like a math problem, and then concluding that they are fundamentally unable to reason.
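For concreteness, here is one way the "sets being modified" variant can be posed - my own toy formulation of the classic crossing as frozenset manipulation plus breadth-first search, with no story wrapper at all:

```python
# The classic wolf/goat/cabbage crossing posed as plain set manipulation plus
# breadth-first search: a state made of frozensets and a handful of legal
# transformations on it, nothing more.
from collections import deque

ITEMS = frozenset({"wolf", "goat", "cabbage"})
UNSAFE = [{"wolf", "goat"}, {"goat", "cabbage"}]   # pairs that can't be left alone

def safe(bank):
    return not any(bad <= bank for bad in UNSAFE)

def solve():
    start = (ITEMS, "L")           # (items on the left bank, farmer's side)
    goal = (frozenset(), "R")
    queue, seen = deque([(start, [])]), {start}
    while queue:
        (left, side), path = queue.popleft()
        if (left, side) == goal:
            return path
        here = left if side == "L" else ITEMS - left
        for cargo in [None, *sorted(here)]:        # cross alone or with one item
            moved = frozenset() if cargo is None else frozenset({cargo})
            new_left = left - moved if side == "L" else left | moved
            new_side = "R" if side == "L" else "L"
            unsupervised = new_left if new_side == "R" else ITEMS - new_left
            state = (new_left, new_side)
            if safe(unsupervised) and state not in seen:
                seen.add(state)
                queue.append((state, path + [(cargo, new_side)]))

print(solve())   # 7 crossings, starting with [('goat', 'R'), (None, 'L'), ...]
```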
> People like Demis Hassabis (CEO of DeepMind) acknowledge the weakness too.
What weakness? That current LLMs aren't as good as humans when reasoning over certain domains? I don't follow him personally but I doubt he would have the confidence to make any claims about fundamental inabilities of the transformer architecture. And even if he did, I could name you a couple of CEOs of AI labs with better models that would disagree, or even Turing award laureates. This is by no means a consensus stance in the expert community.
> And even if he did, I could name you a couple of CEOs of AI labs with better models that would disagree, or even Turing award laureates. This is by no means a consensus stance in the expert community.
I disagree - there is pretty widespread agreement that reasoning is a weakness, even among the best models (note Chollet's $1M ARC Prize competition to spur improvements), but the big labs all seem to think that post-training can fix it. To me this is whack-a-mole wishful thinking (reminds me of CYC - just add more rules!). At least one of your "Turing award laureates" thinks Transformers are a complete dead end as far as AGI goes.
A weakness of the current models in some domains considered useful, yes - but not a fundamental limitation of the architecture. I see no consensus on the latter whatsoever.
The ARC challenge tests spatial reasoning, something we humans are obviously quite good at, given 4 billion years of evolutionary optimization. But as I said, there is no "general reasoning" - it's all domain dependent. A child does better at the spatial problems in ARC because it has that evolutionary advantage, but just as we don't worship calculators as superior intelligences because they can multiply 10^9-digit numbers in milliseconds, we shouldn't draw fundamental conclusions from humans doing well at a problem they are in many ways built to solve. If the failures of previous predictions - those that considered Chess or Go as unmistakable signals of true general reasoning - have taught us anything, it's that general reasoning simply does not exist.
The bet of current labs is synthetic data in pre-training, or slight changes to natural data that induce more generalization pressure for multi-step transformations on state in various domains. The goal is to change the data so models learn these transformations more readily and develop good heuristics for them - rather than the discontinuous patching you suggest.
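A purely hypothetical sketch of what such "multi-step transformations on state" data could look like - the box names, items, operations, and serialization format are all invented for illustration, not any lab's actual pipeline:

```python
# Hypothetical illustration: generating synthetic training examples that force
# a model to track how a small world state evolves over several operations,
# with the exact final state as the target.
import copy, json, random

INIT = {"box_a": ["apple", "key"], "box_b": ["coin"]}

def make_example(rng, n_steps=4):
    state = copy.deepcopy(INIT)
    steps = []
    for _ in range(n_steps):
        op = rng.choice(["add", "remove", "move"])
        occupied = [b for b, items in state.items() if items]
        if op == "add":
            item, box = f"item{rng.randint(0, 99)}", rng.choice(list(state))
            state[box].append(item)
            steps.append(f"put {item} into {box}")
        elif op == "remove" and occupied:
            box = rng.choice(occupied)
            item = state[box].pop(rng.randrange(len(state[box])))
            steps.append(f"take {item} out of {box}")
        elif op == "move" and occupied:
            src = rng.choice(occupied)
            dst = rng.choice([b for b in state if b != src])
            item = state[src].pop()
            state[dst].append(item)
            steps.append(f"move {item} from {src} to {dst}")
    prompt = f"Initial: {json.dumps(INIT)}. Then: " + "; ".join(steps) + ". Final state?"
    return {"prompt": prompt, "target": json.dumps(state)}

rng = random.Random(0)
for example in (make_example(rng) for _ in range(3)):
    print(example["prompt"], "->", example["target"])
```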
But yes, the next generation of models will probably reveal much more about where we're headed.
> If the failures of previous predictions - those that considered Chess or Go as unmistakable signals of true general reasoning - have taught us anything, it's that general reasoning simply does not exist.
I don't think DeepBlue or AlphaGo/etc were meant to teach us anything - they were just showcases of technological prowess by the companies involved, demonstrations of (narrow) machine intelligence.
But...
Reasoning (as distinct from simpler, shallow "reactive" intelligence) is basically multi-step chained what-if prediction, and may involve a branching exploration of alternatives ("ok, so that wouldn't work, so what if I did this instead ..."), so it could be framed as a tree search of sorts, not entirely dissimilar to the game-tree search used by DeepBlue or the MCTS used by AlphaGo.
Of course general reasoning is a lot more general than playing a game like Chess or Go, since the type of moves/choices available will vary at each step (these aren't all "game move" steps), as will the "evaluation function" that predicts what would happen if we took that step. But "tree search" isn't a bad way to conceptualize the process, and this is true regardless of the domain(s) of knowledge over which the reasoning is operating.
Which is to say that reasoning is in fact a generalized process, and one whose nature imposes some corresponding requirements (e.g. keeping track of state) on any machine that is to be capable of performing it ...
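To make that framing concrete, here's a minimal sketch of reasoning-as-tree-search. propose_moves, evaluate, and is_goal are placeholders standing in for domain knowledge, not any real system's API; the search loop - branch on "what ifs", score them, backtrack on dead ends - is the domain-independent part:

```python
# Minimal sketch: depth-limited best-first "what if" search over an abstract
# state. The domain plugs in via three callables; the search itself is generic.
from typing import Callable, List, Optional, Tuple

def reason(
    start,
    propose_moves: Callable[[object], List[Tuple[str, object]]],  # what-if branches
    evaluate: Callable[[object], float],    # predicted promise of a resulting state
    is_goal: Callable[[object], bool],
    depth: int = 5,
) -> Optional[List[str]]:
    """Follow the most promising branch first, backtracking
    ("that wouldn't work, so what if I did this instead") on failure."""
    def search(state, path, d):
        if is_goal(state):
            return path
        if d == 0:
            return None
        branches = sorted(propose_moves(state), key=lambda m: evaluate(m[1]), reverse=True)
        for label, next_state in branches:
            result = search(next_state, path + [label], d - 1)
            if result is not None:
                return result
        return None
    return search(start, [], depth)

# Toy usage: reach 10 from 1 using "+3" and "*2" moves.
plan = reason(
    start=1,
    propose_moves=lambda n: [("+3", n + 3), ("*2", n * 2)],
    evaluate=lambda n: -abs(10 - n),   # closer to 10 looks more promising
    is_goal=lambda n: n == 10,
)
print(plan)   # ['+3', '+3', '+3']  (1 + 3 + 3 + 3 = 10)
```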