I think reaching AGI has to do with the 4 E's: embodied, enacted, embedded and extended. These models have no embodiment, no complex external environment, and there is no goal to attain other than language games.
But they could be embodied, and some experiments have shown, for example, how an LM can guide an agent in a 3D virtual home to accomplish tasks. (*) Maybe AIs can be educated like children after they reach a certain threshold, by giving them robotic bodies and immersing them in human society, which is the most complex environment for intelligence.
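For a sense of what that looks like in practice, here's a minimal sketch in the spirit of the paper in (*): ask a language model for a free-form step-by-step plan, then snap each step onto the closest action the virtual environment actually accepts. The `query_lm` stub, the action list, and the string-similarity matching are all illustrative stand-ins, not the paper's exact pipeline.

```python
import difflib

# Admissible actions the virtual-home environment understands
# (illustrative subset, not a real action space).
ADMISSIBLE_ACTIONS = [
    "walk to kitchen", "open fridge", "grab milk", "close fridge",
    "walk to table", "put milk on table",
]

def query_lm(prompt: str) -> str:
    """Stand-in for a real language-model call; returns a canned plan here."""
    return "1. Go to the kitchen\n2. Open the fridge\n3. Take the milk out\n4. Shut the fridge"

def plan(task: str) -> list:
    """Ask the LM for free-form steps, then map each onto the closest admissible action."""
    raw = query_lm(f"Task: {task}\nStep-by-step plan:")
    steps = [line.split(".", 1)[-1].strip() for line in raw.splitlines() if line.strip()]
    grounded = []
    for step in steps:
        # Pick the admissible action most similar to the LM's free-form step.
        best = max(ADMISSIBLE_ACTIONS,
                   key=lambda a: difflib.SequenceMatcher(None, step.lower(), a).ratio())
        grounded.append(best)
    return grounded

print(plan("get milk for breakfast"))
```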
They have a complex external environment just as much as we do: the physical world generates their inputs. Just because those inputs are prefiltered through human language doesn't mean they're not exposed to the world's full complexity, just as we are after having it prefiltered through our eyeballs and optic nerves. And predicting human statements implicitly requires modeling agents with goals - see the joke explanations.
Human eyes aren't an input. The process of vision is bidirectional and relies on you choosing what to look at just as much as what happens to be sending photons your way.
Same for the other senses, and for learning; it's all active engagement with an environment. An ML language model, though, just gets text dumped into its weights until it works. It doesn't get to ask for more, and it's probably not the best design that we try to store everything in its weights instead of letting it look up an external library.
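As a sketch of what "letting it look up an external library" could mean: keep facts in a store the model can query at answer time, rather than hoping they were memorised during training. The word-overlap scoring and the tiny library below are toy stand-ins for a real embedding index; `build_prompt` just assembles the prompt a real system would hand to the model.

```python
# A tiny external "library" the model could consult instead of memorising everything.
LIBRARY = [
    "The Eiffel Tower is 330 metres tall.",
    "Water boils at 100 degrees Celsius at sea level.",
    "The Apollo 11 landing was in 1969.",
]

def words(text: str) -> set:
    """Toy stand-in for a text encoder: just the lowercase word set."""
    return set(text.lower().replace("?", "").replace(".", "").split())

def lookup(question: str, k: int = 2) -> list:
    """Return the k passages that share the most words with the question."""
    q = words(question)
    ranked = sorted(LIBRARY, key=lambda doc: len(q & words(doc)), reverse=True)
    return ranked[:k]

def build_prompt(question: str) -> str:
    """Paste the retrieved passages in front of the question; a real system
    would now send this prompt to the language model instead of relying on
    whatever happens to be baked into its weights."""
    context = "\n".join(lookup(question))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("How tall is the Eiffel Tower?"))
```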
I readily agree that it's not the best design, and that such a model will have a hard time figuring out things that aren't in its corpus. But reinforcement learning is also unlike human learning, and may well be better at fitting data into a world model without requiring reflective attention and prediction the way human learning does.
But LMs can't do causal interventions to separate correlation from causation unless they have access to the real environment. Would a scientist who can't run experiments be able to advance the field just by reading past papers? Even babies need to test causal laws by letting objects fall and observing what happens; we learn by interacting with our environment.
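A toy version of that correlation-versus-causation point, with made-up variables: in the simulated "corpus" below, a hidden confounder drives both X and Y, so a passive observer sees them strongly correlated; actually setting X yourself (the intervention a text-only learner can't perform) reveals it has no effect on Y at all.

```python
import random

random.seed(0)

def observe(n=100_000):
    """Passive data: a hidden confounder Z drives both X and Y."""
    xs, ys = [], []
    for _ in range(n):
        z = random.gauss(0, 1)
        xs.append(z + random.gauss(0, 0.1))   # X just reflects Z
        ys.append(z + random.gauss(0, 0.1))   # Y also reflects Z; X never touches Y
    return xs, ys

def intervene(x_value, n=100_000):
    """Experimental data: we set X ourselves, cutting its tie to Z."""
    # Y still depends only on Z, so the chosen x_value changes nothing.
    return [random.gauss(0, 1) + random.gauss(0, 0.1) for _ in range(n)]

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    var = lambda vs, m: sum((v - m) ** 2 for v in vs) / n
    return cov / (var(xs, mx) ** 0.5 * var(ys, my) ** 0.5)

xs, ys = observe()
print("correlation in passively observed data:", round(corr(xs, ys), 2))   # ~0.99
print("mean Y under do(X=-2):", round(sum(intervene(-2.0)) / 100_000, 2))  # ~0.0
print("mean Y under do(X=+2):", round(sum(intervene(+2.0)) / 100_000, 2))  # ~0.0
```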
Right, they can't create new evidence to learn from (yet). But we're feeding them a lot of data. Possibly more (and more diverse) data than a human takes in during childhood! I remember a lot of visual input, but I don't remember reading the entire internet. (Yes, I know it's an exaggeration, but come on, compare the character counts.)
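Rough numbers behind the "compare the character counts" quip; every figure below is an order-of-magnitude assumption (reading speed, hours per day, corpus size), not a measurement.

```python
# Back-of-envelope: how much text does a well-read human see vs. an LM's training corpus?
# All figures are rough assumptions.

WORDS_PER_MIN = 300          # fast but sustainable reading speed
HOURS_PER_DAY = 2
YEARS = 15                   # say, age 5 to 20
CHARS_PER_WORD = 6           # ~5 letters plus a space, English-ish

human_chars = WORDS_PER_MIN * 60 * HOURS_PER_DAY * 365 * YEARS * CHARS_PER_WORD

CORPUS_TOKENS = 300e9        # order of magnitude for a large web-scale corpus
CHARS_PER_TOKEN = 4          # common rule of thumb for subword tokenisers

corpus_chars = CORPUS_TOKENS * CHARS_PER_TOKEN

print(f"human reading:   ~{human_chars:.1e} characters")    # ~1e9
print(f"training corpus: ~{corpus_chars:.1e} characters")   # ~1e12
print(f"ratio: roughly {corpus_chars / human_chars:,.0f}x")
```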
You could say that predicting the next token is their only action, and minimizing the loss their only goal. It's a kind of impoverished environment. Imagine a human living in a cave, seeing only a scrolling line of text. Would that human learn anything new by getting out of the cave? Plato was so right: intelligence needs access to the full environment.
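To make "predicting the next token is their only action, and minimizing the loss their only goal" concrete, here's a minimal PyTorch sketch of that objective: shift the sequence by one position and score the model's predicted distribution against the token that actually came next. The tiny embedding-plus-linear model is a stand-in for a real transformer, not any particular LM.

```python
import torch
import torch.nn as nn

VOCAB = 50   # toy vocabulary size

# Stand-in model: embedding plus a linear head; a real LM puts a transformer in between.
model = nn.Sequential(nn.Embedding(VOCAB, 32), nn.Linear(32, VOCAB))

tokens = torch.randint(0, VOCAB, (1, 16))   # one "document" of 16 token ids

logits = model(tokens[:, :-1])              # a predicted distribution at every position
targets = tokens[:, 1:]                     # the "answer" is whatever token came next

# The entire goal: make the actual next token likely.
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), targets.reshape(-1))
loss.backward()                             # gradients of this one objective are all the feedback
print(loss.item())
```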
I mean, isn't the whole point of Plato that we do everything we do, which is a lot, certainly enough to be dangerous, just by sitting in a cave and manipulating levers while watching a shadow? This does not make text-trained AI seem safer.
Seen another way, the lesson of Plato is "Shadows On A Cave Wall Are All You Need - arxiv.org"
(*) Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents - https://arxiv.org/pdf/2201.07207.pdf