Generally I don't buy these arguments that require embodiment, because they don't seem to align well with what else I know about my world.
Rather than your Thai text example, let's consider a friend of my sister, H. H has been profoundly blind from birth. Not "legally blind" with the world a blur; her eyes simply don't work. Her direct lived experience of a summer day is literally just feeling the warmth of the sun on her face; her eyes can't see visible light.
I've seen purple and H never will, so it seems to me you're arguing that I "know" what purple is and she doesn't, and thus ChatGPT doesn't know what purple is either. But I don't think I agree. I think we're both experiencing only a tiny fraction of reality, and ChatGPT is experiencing an even narrower sliver than either of us; it probably wouldn't do us any good to try to quantify it. If I "know what purple is", then so does H, and perhaps ChatGPT or a successor model will too.
That's an argument from ignorance, and it's not credible. The potential total scope of experience is irrelevant. The reality is that you have an embodied experience of purple shared with most humans. Unfortunately your sister doesn't. She will have a linguistic placeholder for the concept of purple, probably surrounded by verbal associations. But that's all.
It's an ironically apt analogy, because ChatGPT has the linguistic understanding of an entity that is deaf, dumb, and blind, with no working senses of any kind, relying instead on a golem-like automated mass of statistics with some query processing.
We tend to project intelligence onto linguistic ability, because it's a useful default assumption in our world. (If you've ever tried speaking a foreign language while not being very good at it, you'll know how the opposite feels. Humans assume that not being able to use language is evidence of low intelligence.)
But it's a very subjective and flawed assessment. Embodied experience is far more necessary for sentience than we assume, and apparent linguistic performance far less so.
There are a few particular problems we have with the words intelligence/sentience, mostly revolving around the fact that we evolved embodiment first and then added more and more complex intelligence/sentience on top of an ever-changing DNA structure.
Much like when humans started experimenting with flight: we tried to make flapping things like birds, but in the end it turned out that spinning blades give us capabilities above and beyond bodies that flap.
Back to the embodiment problem. As humans we have limits, like only having one body. It has a great number of sensors, but they are still very limited relative to what reality has to offer, hence we extend our senses with technology. Given that, there is no reason machine embodiment has to look anything like ours; a machine intelligence could, for example, have trillions of sensors spread across the planet.
I don't think embodiment is required to understand a lot of stuff. But language is how we talk about the world, and non-linguistic concepts have to be grounded in exposure to something other than language. I think there's an argument to be made that DALL-E "knows" more about a lot of words than a pure language model because it can relate phrases to visual concepts. But I do think that, for many concepts, understanding also proceeds from interaction. This doesn't necessarily need to be physical: I similarly think code generation tools need access to interpreters and the like to "understand" the code they're generating. Embodiment is not relevant to all concepts.
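To make that last point concrete, here's a minimal sketch of what "access to an interpreter" could look like: the generator proposes code, the interpreter actually runs it, and whatever comes back (success or a traceback) is fed into the next attempt. generate_code() is a placeholder for whatever model you'd use; only the feedback loop is the point.

    # Minimal sketch of grounding code generation in an interpreter.
    # generate_code() is a stand-in for any code model; the point is
    # the loop: propose, actually run it, feed the verdict back.
    import subprocess
    import sys
    import tempfile

    def generate_code(task, feedback):
        # Placeholder: call your code model with the task plus any
        # interpreter output gathered so far.
        raise NotImplementedError

    def run_in_interpreter(source):
        # Execute the candidate program; return (succeeded, combined output).
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(source)
            path = f.name
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=10)
        return proc.returncode == 0, proc.stdout + proc.stderr

    def grounded_codegen(task, max_rounds=5):
        feedback = []
        for _ in range(max_rounds):
            candidate = generate_code(task, feedback)
            ok, output = run_in_interpreter(candidate)
            if ok:
                return candidate       # survived contact with the interpreter
            feedback.append(output)    # the interpreter's verdict re-enters the context
        return None

The grounding lives in the return channel: the model's claim gets checked against something outside language, and that verdict becomes part of what it conditions on next time.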
I don't think the argument about DALL-E works, though: it deals with pixels instead of words, but that's fundamentally just a different form of language, made of different mathematical patterns (obscured to us because, unlike symbolic manipulation, our visual system handles high-level patterns in images without engaging our conscious awareness).
I do agree that grounding is needed. All our language expresses or abstracts concepts related to how we perceive and interact with reality in continuous space and time. This perception and interaction is a huge correlating factor that our ML models don't have access to - and we're expecting them to somehow tease it out from a massive dump of weakly related snapshots of recycled high-level human artifacts, be they textual or visual. No surprise the models would rather latch onto any kind of statistical regularity in the data and get stuck in a local minimum.
Now, I don't believe the solution is actual embodiment - that would constrain the model too hard. But I do think the model needs to be exposed to the concepts of time and causality, which means it needs to be able to interact with the thing it's learning about and feed the results back into itself, accumulating those results over time.
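For what it's worth, here's a toy sketch of that interact-and-accumulate loop (the env object, choose_action and update are all illustrative names, not a real training setup): the learner acts, the world answers, and the answer is folded back into the learner's own state over time.

    # Toy sketch of "interact, feed results back, accumulate over time".
    from dataclasses import dataclass, field

    @dataclass
    class InteractiveLearner:
        # Accumulated (action, outcome) pairs - the learner's own history.
        history: list = field(default_factory=list)

        def choose_action(self, observation):
            # Placeholder policy: a real system would condition on self.history.
            return observation

        def update(self, action, outcome):
            # The crucial bit: outcomes are fed back and accumulated over time.
            self.history.append((action, outcome))

    def run(learner, env, observation, steps):
        # env is assumed to expose a step(action) method that returns what
        # actually happened - that's where time and causality enter.
        for _ in range(steps):
            action = learner.choose_action(observation)
            outcome = env.step(action)       # the world answers back
            learner.update(action, outcome)  # the answer becomes part of the learner
            observation = outcome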