We do it for text, audio, and bitmapped images. A 3D scene file format is no different; you could train a model to output a Blender file instead of a bitmap.
It can learn anything you have data for.
Heck, we do it with geospatial data already, generating segmentation vectors. Why not 3D?
Also, the level of fault tolerance differs: if your pixels are a bit blurry, chances are no one notices at a high enough resolution. If your JSON is a bit blurry, you have problems.
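To make that concrete, here's a toy sketch (plain Python, made-up values, not any real pipeline): a slightly-off pixel is still a valid image, but one "blurred" character in structured output breaks the whole file.

```python
import json

# A pixel value that's off by a little is still a valid pixel; at high
# resolution nobody notices the difference.
pixel = 200
noisy_pixel = pixel + 3  # still renders fine

# One wrong character in a structured scene description and nothing parses.
good = '{"vertices": [[0, 0, 0], [1, 0, 0], [0, 1, 0]]}'
bad = '{"vertices": [[0, 0, 0], [1, 0, 0], [0, 1, 0]}'  # one bracket "blurred" away

json.loads(good)  # parses fine
try:
    json.loads(bad)
except json.JSONDecodeError as err:
    print("broken scene:", err)
```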
Text, audio, and bitmapped images are data. Numbers and tokens.
A 3D scene is vastly more complex, and the way you consume it is only tangentially related to the rendering we use to interpret it. It is a collection of arbitrary data structures.
We’ll need a new approach for this kind of problem.
Text gen, one of the things you contrast 3D to, similarly isn't fixed-size (capped in most models, but not fixed).
In fact, the data structures of a 3D scene can be serialized as text, and a properly trained text-gen system could generate such a representation directly, though that's probably not the best route to decent text-to-3D.
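As a rough illustration (a hand-rolled toy, not a claim about how production text-to-3D systems work), even a plain mesh flattens into ordinary text that a language model could in principle emit, here in Wavefront OBJ:

```python
# Toy example: a mesh is just numbers, and it flattens to plain text (Wavefront OBJ).
vertices = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]  # a single quad
faces = [(1, 2, 3, 4)]  # OBJ face indices are 1-based

lines = [f"v {x} {y} {z}" for x, y, z in vertices]
lines += ["f " + " ".join(str(i) for i in face) for face in faces]
obj_text = "\n".join(lines)

print(obj_text)
# v 0 0 0
# v 1 0 0
# v 1 1 0
# v 0 1 0
# f 1 2 3 4
```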
Text becomes a sequence of tokens, each mapped to a standard-sized embedding vector and fed to the LLM one at a time. All tokens have the same shape, and they come in a predefined order. It is a very different and vastly simpler problem.
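Roughly what I mean, as a toy sketch (NumPy, with a made-up vocabulary size and embedding dimension):

```python
import numpy as np

vocab_size, d_model = 1000, 16
rng = np.random.default_rng(0)

# Every token, whatever it "means", maps to an embedding of the exact same shape.
embedding_table = rng.standard_normal((vocab_size, d_model))

token_ids = [17, 4, 256, 3]  # a predefined, left-to-right order
for t in token_ids:          # processed one at a time
    vec = embedding_table[t]
    assert vec.shape == (d_model,)  # same shape for every token
```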
Serializing 3D models as text is not going to work for anything beyond trivial cases.