I don't think the argument about DALL-E would work - it deals with pixels instead of words, but pixels are fundamentally just a different form of language, made of different mathematical patterns (obscured to us because, unlike symbolic manipulation, our visual system handles high-level patterns in images without engaging our conscious awareness).
I do agree that grounding is needed. All our language expresses or abstracts concepts related to how we perceive and interact with reality in continuous space and time. This perception and interaction is a huge correlating factor that our ML models don't have access to - and we're expecting them to somehow tease it out from a massive dump of weakly related snapshots of recycled high-level human artifacts, be they textual or visual. No surprise the models would rather latch onto any kind of statistical regularity in the data and get stuck in a local minimum.
Now, I don't believe the solution is actual embodiment - that would constrain the model too hard. But I do think the model needs to be exposed to the concepts of time and causality - which means it needs to be able to interact with the thing it's learning about, feed the results back into itself, and accumulate those results over time. A minimal sketch of that loop is below.
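To make the idea concrete, here's a toy Python sketch of such a loop - the environment, model, and update rule are all hypothetical stand-ins, not a proposal for a specific architecture. The point is only the shape of the loop: the learner's training data is generated by its own actions and their consequences, so time and causality are present in what it learns from.

    import random

    class ToyEnvironment:
        """A stand-in for anything the model can poke and observe."""
        def __init__(self):
            self.state = 0

        def step(self, action: int) -> int:
            # The outcome depends causally on the action and the prior state.
            self.state += action + random.choice([-1, 0, 1])
            return self.state

    class ToyModel:
        """A stand-in learner that accumulates its own interaction history."""
        def __init__(self):
            self.experience: list[tuple[int, int]] = []

        def act(self) -> int:
            # Placeholder policy; a real model would choose based on experience.
            return random.choice([-1, 1])

        def update(self, action: int, outcome: int) -> None:
            # Feed the result of the interaction back in, accumulating over time.
            self.experience.append((action, outcome))

    env = ToyEnvironment()
    model = ToyModel()
    for _ in range(100):
        action = model.act()
        outcome = env.step(action)
        model.update(action, outcome)

Contrast this with training on a static dump: here each observation is ordered in time and caused by the model's own action, which is exactly the correlating signal I'm arguing the text-and-image snapshots lack.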