Am I nitpicking when I say that none of the techniques mentioned in this article capture meaning, but rather capture similarity? In particular, the article never mentions grammar at all. Is this because of limitations in current computing, or is the fundamental assumption that grammar is redundant?
When we roll out speech-based interfaces with these limitations, do we train our children to ignore grammar?
These models capture similarity on such a scale that they manage to learn the patterns of language well. Grammar is learned indirectly rather than being manually coded, particularly in the later models we discuss. Our focus here is on encoding text (and images) in a way that lets them be compared with other text or images based on their meaning or imagery rather than their syntax.
That said, the later transformer models have a good grasp of grammar. Much like when we learn to speak as children, it isn't necessary to be taught grammar directly; we can figure it out over time from the patterns of the language.
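To make the "compare by meaning rather than syntax" point concrete, here is a minimal sketch (not from the article) of how sentence embeddings are typically compared. It assumes the open-source sentence-transformers library and the all-MiniLM-L6-v2 model, which are illustrative choices, not necessarily the ones used in the article.

```python
from sentence_transformers import SentenceTransformer, util

# Load a small pretrained sentence-embedding model (an assumption for this sketch).
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat chased the mouse.",                 # reference
    "A mouse was being chased by the cat.",      # different grammar, same meaning
    "The cat chased the ball.",                  # similar grammar, different meaning
]

# Encode each sentence into a fixed-length vector (an embedding).
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity compares vectors, not word order: the paraphrase should
# typically score higher than the syntactically similar but different sentence.
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))
```

The model is never told any grammatical rules; the passive-voice paraphrase ends up close to the reference purely because the model has absorbed language patterns during training.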