It learns to recognize which objects correspond to which sounds, and vice versa. It doesn't really understand linguistic concepts like grammar, sentences, logic, etc.
It doesn't really have much in common with LLMs. If the ideas were combined, the results might be interesting, although probably requiring very significant compute.