> Text, audio, and bitmapped images are data. Numbers and tokens. > A 3D scene i...

jncfhnb · on Nov 22, 2023

As I stated and you selectively omitted, 3D scenes are collections of many arbitrary data structures.

Not at all the same as fixed-size arrays representing images.

dragonwriter · on Nov 22, 2023

Text gen, one of the things you contrast 3D to, similarly isn't fixed size (capped in most models, but not fixed).

In fact, the data structures of a 3D scene can be serialized as text, and a properly trained text gen system could generate such a representation directly, though that's probably not the best route to decent text-to-3D.
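
To illustrate the serialization point, here is a minimal sketch that writes a triangle mesh out as Wavefront OBJ text, one existing plain-text 3D format. The `mesh_to_obj` helper and the sample triangle are hypothetical, made up for this example, and a full scene (materials, lights, hierarchy, animation) would need far more than this covers.

```python
# Hypothetical helper: serialize a triangle mesh as Wavefront OBJ text.
# Only vertex positions ("v") and faces ("f") are handled in this sketch.
def mesh_to_obj(vertices, faces):
    # One "v x y z" line per vertex.
    lines = [f"v {x} {y} {z}" for x, y, z in vertices]
    # One "f i j k" line per face; OBJ face indices are 1-based.
    lines += ["f " + " ".join(str(i + 1) for i in face) for face in faces]
    return "\n".join(lines) + "\n"


# Made-up sample data: a single triangle in the XY plane.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
faces = [(0, 1, 2)]
obj_text = mesh_to_obj(vertices, faces)
```

The result is an ordinary string, so in principle it could be tokenized like any other text. Whether a model can usefully generate it for large scenes is a separate question.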

jncfhnb · on Nov 22, 2023

Text is a sequence of standard-sized embedding vectors that get passed to an LLM one at a time. All tokens have the same shape. Each token is processed one at a time. All tokens also have a predefined order. It is very different and vastly simpler.

Serializing 3D models as text is not going to work for anything beyond trivial cases.
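
For the token-shape point, a toy sketch may help: every token maps to a vector of the same width, while the number of tokens varies with the input. The vocabulary, the `EMBED_DIM` width, the random embedding table, and the whitespace tokenizer are all assumptions for illustration. Real models use learned embeddings and subword tokenizers.

```python
import random

EMBED_DIM = 4  # assumed toy embedding width
VOCAB = {"a": 0, "red": 1, "cube": 2, "on": 3, "table": 4}  # made-up vocabulary

# Random stand-in for a learned embedding table: one row per vocabulary entry.
rng = random.Random(0)
EMBEDDINGS = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in VOCAB]


def embed(text):
    # Whitespace tokenization is a simplification for this sketch.
    return [EMBEDDINGS[VOCAB[word]] for word in text.split()]


short = embed("a cube")
longer = embed("a red cube on a table")
# Sequence lengths differ (2 vs 6), but every vector has EMBED_DIM entries.
```

So both sides of the exchange hold for text: the per-token shape is fixed, and the sequence length is not.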