We do it for text, audio, and bitmapped images. A 3D scene file format is no different; you could train a model to output a Blender file instead of a bitmap.
It can learn anything you have data for.
Heck, we do it with geospatial data already, generating segmentation vectors. Why not 3D?
Also, the level of fault tolerance differs: if your pixels are a bit blurry, chances are no one notices at a high enough resolution. If your JSON is a bit blurry, you have problems.
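To make that concrete, here's a toy sketch (plain Python, made-up values, not any real pipeline): a slightly-off pixel is still a valid image, but one "blurred" character in structured output breaks the whole file.

```python
import json

# A pixel value that's off by a little is still a valid pixel; at high
# resolution nobody notices the difference.
pixel = 200
noisy_pixel = pixel + 3  # still renders fine

# One wrong character in a structured scene description and nothing parses.
good = '{"vertices": [[0, 0, 0], [1, 0, 0], [0, 1, 0]]}'
bad = '{"vertices": [[0, 0, 0], [1, 0, 0], [0, 1, 0]}'  # one bracket "blurred" away

json.loads(good)  # parses fine
try:
    json.loads(bad)
except json.JSONDecodeError as err:
    print("broken scene:", err)
```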
Text, audio, and bitmapped images are data. Numbers and tokens.
A 3D scene is vastly more complex, and the way you consume it is only tangentially related to the rendering we use to interpret it. It is a collection of arbitrary data structures.
We’ll need a new approach for this kind of problem.
Text gen, one of the things you contrast 3D to, similarly isn't fixed-size (capped in most models, but not fixed).
In fact, the data structures of a 3D scene can be serialized as text, and a properly trained text-gen system could generate such a representation directly, though that's probably not the best route to decent text-to-3D.
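As a rough illustration (a hand-rolled toy, not a claim about how production text-to-3D systems work), even a plain mesh flattens into ordinary text that a language model could in principle emit, here in Wavefront OBJ:

```python
# Toy example: a mesh is just numbers, and it flattens to plain text (Wavefront OBJ).
vertices = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]  # a single quad
faces = [(1, 2, 3, 4)]  # OBJ face indices are 1-based

lines = [f"v {x} {y} {z}" for x, y, z in vertices]
lines += ["f " + " ".join(str(i) for i in face) for face in faces]
obj_text = "\n".join(lines)

print(obj_text)
# v 0 0 0
# v 1 0 0
# v 1 1 0
# v 0 1 0
# f 1 2 3 4
```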
Text becomes a sequence of tokens, each mapped to a standard-sized embedding vector and fed to the LLM one at a time. All tokens have the same shape, and they come in a predefined order. It is a very different and vastly simpler problem.
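Roughly what I mean, as a toy sketch (NumPy, with a made-up vocabulary size and embedding dimension):

```python
import numpy as np

vocab_size, d_model = 1000, 16
rng = np.random.default_rng(0)

# Every token, whatever it "means", maps to an embedding of the exact same shape.
embedding_table = rng.standard_normal((vocab_size, d_model))

token_ids = [17, 4, 256, 3]  # a predefined, left-to-right order
for t in token_ids:          # processed one at a time
    vec = embedding_table[t]
    assert vec.shape == (d_model,)  # same shape for every token
```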
Serializing 3D models as text is not going to work for anything beyond trivial cases.