
We do it for text, audio, and bitmapped images. A 3D scene file format is no different: you could train a model to output a Blender file format instead of a bitmap.

It can learn anything you have data for.

Heck, we do it with geospatial data already, generating segmentation vectors. Why not 3D?




>3D scene file format is no different

Not in theory, but the level of complexity is way higher and the amount of data available is much smaller.

Compare bitmaps to this: https://fossies.org/linux/blender/doc/blender_file_format/my...


Also the level of fault tolerance... if your pixels are a bit blurry, chances are no one notices at a high enough resolution. If your JSON is a bit blurry, you have problems.
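
A stdlib-only sketch of that asymmetry (the scene JSON here is made up for illustration): add noise to a bitmap and it still displays, but "blur" one character of structured text and the parser rejects the whole file.

    import json

    scene = '{"vertices": [[0, 0, 0], [1, 0, 0], [0, 1, 0]]}'
    json.loads(scene)  # parses fine

    # "blur" a single character, as a lossy generator might
    corrupted = scene.replace('[[', '[', 1)
    try:
        json.loads(corrupted)
    except json.JSONDecodeError as err:
        print("one bad character kills the whole file:", err)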


You can do "constrained decoding" on a code model, which keeps the output grammatically correct.
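
A minimal sketch of the idea, assuming hypothetical `model`, `tokenizer`, and `grammar` objects (not any real library's API): at each step, mask the logits so only tokens the grammar allows can be emitted.

    import math

    def constrained_decode(model, tokenizer, grammar, prompt, max_len=256):
        """Greedy decoding that only ever emits grammar-legal tokens."""
        tokens = tokenizer.encode(prompt)
        for _ in range(max_len):
            logits = model(tokens)                    # scores over the vocab
            allowed = grammar.allowed_tokens(tokens)  # set of legal token ids
            masked = [score if i in allowed else -math.inf
                      for i, score in enumerate(logits)]
            next_id = max(range(len(masked)), key=masked.__getitem__)
            tokens.append(next_id)
            if next_id == tokenizer.eos_id:
                break
        return tokenizer.decode(tokens)

Illegal tokens get probability zero, so the output is syntactically valid by construction; whether it's semantically sensible is still on the model.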

But we haven't gotten diffusion working well for text/code, so generating long files is a problem.


Recent results for code diffusion here: https://www.microsoft.com/en-us/research/publication/codefus...

I'm not experienced enough to validate their claims, but I love the choice of languages to evaluate on:

> Python, Bash and Excel conditional formatting rules.



Text, audio, and bitmapped images are data. Numbers and tokens.

A 3D scene is vastly more complex, and the way you consume it is tangential to the rendering we use to interpret it. It is a collection of arbitrary data structures.

We’ll need a new approach for this kind of problem.


> Text, audio, and bitmapped images are data. Numbers and tokens.

> A 3D scene is vastly more complex

3D scenes, in fact, are also data, numbers and tokens. (Well, numbers, but so are tokens.)


As I stated and you selectively omitted, 3D scenes are collections of many arbitrary data structures.

Not at all the same as the fixed-size arrays representing images.


Text gen, one of the things you contrast 3D to, similarly isn't fixed size (capped in most models, but not fixed).

In fact, the data structures of a 3D scene can be serialized as text, and a properly trained text gen system could generate such a representation directly, though that's probably not the best route to decent text-to-3d.
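
For what it's worth, Wavefront OBJ already is exactly that: a plain-text 3D format. A one-triangle example of the serialization:

    # One triangle serialized as Wavefront OBJ, a plain-text 3D format
    vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    faces = [(1, 2, 3)]  # OBJ face indices are 1-based

    lines = [f"v {x} {y} {z}" for x, y, z in vertices]
    lines += ["f " + " ".join(map(str, face)) for face in faces]
    print("\n".join(lines))
    # v 0.0 0.0 0.0
    # v 1.0 0.0 0.0
    # v 0.0 1.0 0.0
    # f 1 2 3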


Text becomes a sequence of fixed-size embedding vectors that get fed to an LLM one at a time. All tokens have the same shape, are processed one at a time, and come in a predefined order. It is very different and vastly simpler.
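
To make that concrete, a toy embedding lookup (sizes are arbitrary, just to show the uniform shape):

    import random

    vocab_size, d_model = 1_000, 16  # tiny demo sizes; real models use ~50k x 768+
    # every token id, whatever it "means", maps to a vector of the same shape
    embedding = [[random.gauss(0.0, 0.02) for _ in range(d_model)]
                 for _ in range(vocab_size)]

    token_ids = [17, 424, 313]  # a tokenized sentence (ids made up)
    vectors = [embedding[t] for t in token_ids]
    assert all(len(v) == d_model for v in vectors)  # uniform shape, fixed order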

Serializing 3D models as text is not going to work for anything beyond trivially simple cases.



