You missed the point. ChatGPT was trained on a gazillion words to "learn" a language. Children learn their language from a tiny fraction of that. Streams of visual, smell, touch, etc. input don't help with learning the grammar of (spoken) languages.
> visual, smell, touch, etc. input don't help with learning the grammar of (spoken) languages.
Of course they do! These are literally the things children learn to associate language with. "Ouch!" is what you say when you feel pain.
An ML model can learn to associate the word "ouch" with the words "pain" and "feel", but it doesn't actually know what pain is, because it doesn't feel.
Isn't it more complicated than that?
"Ouch" can be a lot of things, and that's where a lot of problems crop up in the AI world.
If one of my friends insults another friend, I might say, "OUCH!" I'm not in pain, but I might want to express that the insult was a bit much.
If someone tries to insult me and it's weak, I could reply with a dry, sarcastic "ouch."
Combine that with facial expression and tone of voice, and 'ouch' is highly contextual.
One problem with some of the tools used to take down offensive comments on social media platforms is that they don't get context.
Let's say that 'ouch' is highly offensive and you got into trouble for calling someone an "ouch." If I then want to discuss the issue and agree that you were being offensive, I could get into trouble with the ML/AI tools for quoting you; they can't tell using the word apart from mentioning it.
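To make that concrete, here's a toy sketch in Python of a context-blind keyword filter. The banned word and the sample comments are invented for illustration; real moderation models are fancier, but many fail on use vs. mention in exactly this way:

```python
# Toy sketch: a context-blind moderation filter. The banned word and the
# example comments are invented; this is not any real platform's system.

BANNED_WORDS = {"ouch"}  # pretend "ouch" is a slur in this toy world

def naive_filter(comment: str) -> bool:
    """Flag a comment if it contains any banned word, ignoring all context."""
    words = {w.strip('.,!?"\'').lower() for w in comment.split()}
    return not words.isdisjoint(BANNED_WORDS)

insult = "You are such an ouch."
quote = 'Calling him an "ouch" was offensive and you should apologize.'

print(naive_filter(insult))  # True  -- flagged, as intended
print(naive_filter(quote))   # True  -- also flagged, though it condemns the insult
```

Both comments get flagged, even though the second one is condemning the insult rather than making it.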
Second, saying "Ouch" is not even language. My cat says something when I step on her paw; that doesn't mean she understands language, or that she speaks one.
Third, you're right about pain, but an ML model can associate the word "red" with the color, "walk" with images of people walking, "sailboat" with certain images or videos, and plenty of other concepts. If that were what learning a language was, then AIs would understand language in lots of areas, if not in the specific domain of pain. But they don't.
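For what it's worth, that kind of word-image association is exactly what contrastive models like CLIP already do. A minimal sketch using the HuggingFace transformers library and the public openai/clip-vit-base-patch32 checkpoint (the image URL is just the standard example photo; swap in any image):

```python
# Minimal word-image association with CLIP (via HuggingFace transformers).
# The model scores how well each caption matches the image.
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["a sailboat", "people walking", "the color red", "a cat"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)

for text, p in zip(texts, probs[0].tolist()):
    print(f"{text}: {p:.2f}")  # highest score = best-matching caption
```

And a model that aces this matching task still doesn't "understand" sailboats in any deeper sense, which is rather the point.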
It's absolutely true that children learn (and even generate) language grammar from a ridiculously small number of samples compared to LLMs.
But could the availability of a world model, in the form of other sensory inputs, contribute to that capacity? Younger children who haven't fully mastered correct grammar are still able to communicate more sensibly than earlier LLMs could, whereas those LLMs tended toward grammatically correct gibberish. What if the missing secret sauce for better LLM training is figuring out how to wire, say, image recognition into the training process?
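One common recipe for that wiring is to project features from a frozen image encoder into the language model's token-embedding space and train on the mixed sequence. Here's a toy PyTorch sketch in that spirit; every name, dimension, and the tiny transformer are invented for illustration (this loosely mirrors the LLaVA-style approach, not any specific production system):

```python
# Toy sketch of "wiring image recognition into the training process":
# project frozen image-encoder features into the LM's embedding space
# and prepend them as soft tokens. All dimensions/names are made up.
import torch
import torch.nn as nn

VOCAB, D_TEXT, D_IMAGE, N_PATCHES = 1000, 256, 512, 16

class ToyMultimodalLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_TEXT)      # text token embeddings
        self.project = nn.Linear(D_IMAGE, D_TEXT)     # image feats -> "soft tokens"
        layer = nn.TransformerEncoderLayer(D_TEXT, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(D_TEXT, VOCAB)

    def forward(self, image_feats, token_ids):
        # image_feats: (batch, N_PATCHES, D_IMAGE) from a frozen vision encoder
        soft_tokens = self.project(image_feats)       # (batch, N_PATCHES, D_TEXT)
        text_tokens = self.embed(token_ids)           # (batch, seq, D_TEXT)
        seq = torch.cat([soft_tokens, text_tokens], dim=1)
        return self.lm_head(self.backbone(seq))      # next-token logits

model = ToyMultimodalLM()
feats = torch.randn(2, N_PATCHES, D_IMAGE)            # stand-in for encoder output
ids = torch.randint(0, VOCAB, (2, 8))
logits = model(feats, ids)
print(logits.shape)  # torch.Size([2, 24, 1000]) -- 16 image tokens + 8 text tokens
```

The language model then learns next-token prediction conditioned on the image "tokens," which is one plausible way sensory grounding could enter the training loop.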