
My layman take on LLMs is that they map tokens to points in an absurdly high-dimensional vector space (on the order of tens to hundreds of thousands of dimensions). The training process shifts those points around so that related tokens end up closer together, which eventually encodes pretty much any kind of relationship you could think of between the tokens, semantic or otherwise, as proximity in one or more dimensions. The latent space has enough dimensions to accommodate all those relationships, which is why even tasks that require a complex understanding of abstract concepts still boil down to adjacency search in that space.

In other words: the LLM isn't learning algorithms, it's building a high-dimensional point cloud in which related things sit closer together.
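
A minimal sketch of that picture in Python (the vocabulary, dimension count, and vectors are made-up stand-ins, not weights from any real model):

    import numpy as np

    # A tiny "point cloud": a few tokens embedded as points in d dimensions.
    d = 8
    rng = np.random.default_rng(0)
    vocab = ["cat", "dog", "kitten", "carburetor"]
    emb = {tok: rng.normal(size=d) for tok in vocab}

    # Mimic one effect of training: nudge "kitten" toward "cat",
    # since the data treats them as related.
    emb["kitten"] = 0.9 * emb["cat"] + 0.1 * emb["kitten"]

    def nearest(token, k=2):
        """Adjacency search: rank other tokens by cosine similarity."""
        q = emb[token]
        sims = {
            t: float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
            for t, v in emb.items() if t != token
        }
        return sorted(sims.items(), key=lambda kv: -kv[1])[:k]

    print(nearest("kitten"))  # "cat" comes out on top

Training amounts to running that nudging step at enormous scale, once per relationship the data exposes.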

Now, IIRC the visual model mentioned above works with a sub-1000-dimensional latent space, which to me feels like not enough... space to fit generalized concepts in. But then, the prompts to txt2img and img2img models that I've seen read more like stacks of additive modifiers, with individual tokens mostly independent of each other, so maybe that explanation still fits.
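
And a sketch of that "additive modifiers" reading, under the assumption that the prompt's embedding is roughly a sum of per-token vectors (again, made-up vectors, not any real model):

    import numpy as np

    # If prompt tokens act mostly independently, appending a modifier
    # just shifts the query point by that token's vector, and the
    # composition is order-independent because vector addition commutes.
    rng = np.random.default_rng(1)
    tokens = ("portrait", "oil_painting", "dramatic_lighting")
    emb = {t: rng.normal(size=8) for t in tokens}

    a = emb["portrait"] + emb["oil_painting"] + emb["dramatic_lighting"]
    b = emb["portrait"] + emb["dramatic_lighting"] + emb["oil_painting"]
    print(np.allclose(a, b))  # True: modifiers stack, order doesn't matter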



