Text generation, one of the things you contrast 3D with, isn't fixed-size either (capped in most models, but not fixed).
In fact, the data structures of a 3D scene can be serialized as text, and a properly trained text generation system could produce such a representation directly, though that's probably not the best route to decent text-to-3D.
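To make the serialization point concrete, here's a minimal sketch that writes a tiny mesh out as Wavefront OBJ, a real plain-text 3D format. The helper name `mesh_to_obj` and the sample square are made up for illustration, and this covers geometry only (no materials, lights, or scene hierarchy), so take it as a toy rather than a full scene format.

```python
def mesh_to_obj(vertices, faces):
    """Serialize a triangle mesh to Wavefront OBJ text (geometry only).

    Hypothetical helper for illustration: `vertices` is a list of (x, y, z)
    tuples, `faces` is a list of 0-based vertex index tuples.
    """
    lines = []
    for x, y, z in vertices:
        # One "v" record per vertex position
        lines.append(f"v {x:g} {y:g} {z:g}")
    for face in faces:
        # OBJ face indices are 1-based, so shift each index by one
        lines.append("f " + " ".join(str(i + 1) for i in face))
    return "\n".join(lines) + "\n"


# A unit square in the z=0 plane, split into two triangles
vertices = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
faces = [(0, 1, 2), (0, 2, 3)]

obj_text = mesh_to_obj(vertices, faces)
print(obj_text)
```

The result is an ordinary string that a tokenizer could consume like any other text. Whether a model can learn to emit long, consistent files of this kind is a separate question.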
Text reaches an LLM as a sequence of fixed-size embedding vectors, passed in one at a time. All tokens have the same shape, each token is processed one at a time, and the tokens have a predefined order. That is very different and vastly simpler.
Serializing 3D models as text is not going to work for anything even slightly non-trivial.
> A 3D scene is vastly more complex
3D scenes, in fact, are also data, numbers and tokens. (Well, numbers, but so are tokens.)