Stable Zero123: Quality 3D Object Generation from Single Images (stability.ai)
93 points by homarp 9 months ago | 9 comments



This looks like a fine-tune of the classic zero123 (https://github.com/cvlab-columbia/zero123). I’m excited to check out the quality improvements.

Though 3D model synthesis is one use case, I found the less advertised base reprojection model to be more useful for gamedev at the moment. You can generate a multiview spritesheet from a single image, and it’s fast enough for synthesis during a gameplay session. I couldn’t get a good quality/time balance doing the same with the 3D models, and the lack of mesh rigging or animation, combined with imperfections in a fully 3D model, tends to break the suspension of disbelief compared to what players are used to from full 3D. I’m sure this will change as the tech develops and we layer more AI on top (automatic animation synthesis is an active research area).
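
For the curious, the spritesheet step is only a few lines once you have a reprojection call; here’s a rough Python sketch, where novel_view is a stand-in for whatever Zero123-style call your setup exposes (not a real library API):

    from PIL import Image

    def make_spritesheet(source, novel_view, num_views=8, elevation_deg=15.0):
        # novel_view(image, elevation_deg, azimuth_deg) is a placeholder for
        # your Zero123-style reprojection call; it should return a PIL image
        # of the object seen from that camera pose.
        views = [novel_view(source, elevation_deg, i * 360.0 / num_views)
                 for i in range(num_views)]
        w, h = views[0].size
        sheet = Image.new("RGBA", (w * num_views, h))
        for i, view in enumerate(views):
            sheet.paste(view, (i * w, 0))
        return sheet

Eight views at 45-degree steps is usually enough for an 8-directional sprite.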

If you’re interested in this you might also want to check out deforum (https://github.com/deforum-art/deforum-stable-diffusion), which provides even more powerful camera controls on top of Stable Diffusion and is designed for full scenes rather than single objects.


> Stable Zero123 produces notably improved results compared to the previous state-of-the-art, Zero123-XL.

Zero123-XL might be state of the art for image-to-3D, but I'm not sure it is state of the art if your ultimate goal is text-to-3D. MVDream (https://mv-dream.github.io/) performs quite a bit better at that, in my opinion.


I can't help but feel that Stability AI is pushing themselves to irrelevancy with their business model and new policy of "research-only" models. Part of what made the original Stable Diffusion models such a force was the ability and willingness of others to build on top of them. This is such a force multiplier. People built so many abstractions around it that it became something much larger than just the original image-generation model.

Locking down your models so you can sell a mediocre API and web app is not the business model you want here as a mid-sized player. You just won't be able to keep up with the biggest players and their in-house models.

I think the license they should try is something akin to the GPL, where derivative models need to be shared and released under the same license. That gives you legal ammunition against companies building on top of your models for their own private services, while still allowing for a vibrant open-source ecosystem. Then you can offer more permissive licenses to those who can pay.


One of the key limiting factors to the adoption of augmented reality was the lack of available 3-D objects that you could then put into the AR space.

If you think about a company that has physical objects, the hardest thing you do as a model builder is creating a 3-D model that’s accurate to whatever the product is.

This is such a big problem that we devised a fairly novel large scanning system, which I proposed to Amazon as part of our digitization suite when I was running my computer vision and AR company. That was one of a dozen projects we were pursuing to get after this problem of rapid digitization of objects.

One of the key things we were trying to do, starting in 2017, was to come up with structure-from-motion algorithms, or otherwise build a large database with similarity matching, so that for the objects we saw in the environment we could identify the most likely object types.
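
As a rough sketch of the retrieval half of that idea (the embeddings and catalog here are placeholders; whether you compute them with CLIP, a custom CNN, or hand-built descriptors is a separate question):

    import numpy as np

    def most_likely_objects(query_embedding, catalog_embeddings, catalog_ids, top_k=5):
        # Rank catalog objects by cosine similarity to an embedding of the
        # observed object; this only illustrates the similarity-match step.
        q = query_embedding / np.linalg.norm(query_embedding)
        c = catalog_embeddings / np.linalg.norm(catalog_embeddings, axis=1, keepdims=True)
        scores = c @ q
        top = np.argsort(scores)[::-1][:top_k]
        return [(catalog_ids[i], float(scores[i])) for i in top]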

One of the major challenges is that it isn’t good enough for anybody, like a corporation, to pay for. The majority of time and money spent by companies trying to get into the AR space went to sending catalogs to India, Pakistan, etc. for thousands of 3-D modelers to create the 3-D models.

You can certainly understand how this becomes complicated quickly, including what is considered a canonical model, whether there is licensing for certain types of models, who has official authorization for a 3-D model, etc.

All this to say, what is presented here seems pretty darn close, or at least close enough that we can see we’re going to be able to fully automate this process, and hopefully that will allow for the adoption of some of these things that were previously rate-limited by the linear growth rate of objects in the available space.

Edit: I’ll be curious to see what their meshes look like and whether they optimize for polygon count, and in what way. Similarly, whether the output is a single volume or has discrete objects composing a new object class (extremely doubtful).

The last time I saw any major updates on this kind of thing was a Stanford paper that was trying to derive voxel spaces from images, if I recall correctly, and that was back in 2017 or 2018.


I am not downvoting you but...

> One of the key limiting factors to the adoption of augmented reality was the lack of available 3-D objects that you could then put into the AR space.

Is this true? There are so many assets. Problems with AR:

#1: All AR experiences are crummy.

#2: Delivery sucks. Phones are weak, the iOS GPU actually sucks, it overheats quickly.

> The majority of time and money spent by companies trying to get into the AR space went to sending catalogs to India, Pakistan, etc. for thousands of 3-D modelers to create the 3-D models.

The money was misused on assets. You could cut the cost of generating assets to zero and nothing would change.

Riddle me this: Apple, the most valuable retailer per square foot worldwide, has only about 100 SKUs that all already exist as CAD models, plus the #1 AR delivery platform. Why aren't they making AR product experiences? Because it makes no fucking sense, that's why.

Show me in the autos vertical who is having trouble with assets. Nobody. Pixyz is great but so what? Autos vertical issues: #1 crummy experiences, #2 delivery.

Clothing? Who is having trouble with clothing assets? Virtual try-ons make no sense. Who is buying $3 SHEIN cummerbunds and cares about the fit? Nobody.

Some people working at a clothing company or a car manufacturer might express that they are interested in piloting an augmented reality orbit-around of some SKU... but that doesn't mean it makes any sense!

Show me a SKU where an AR experience makes sense. Please tell me real estate, introduce me to the San Francisco family looking to buy a home that is saying, "man, give me an AR tour." Nobody has ever said that.


I was with you until the last part. Matterport tours of houses are really quite popular.

https://matterport.com/discover/tag/bayarea/


So it generates a fully rigged 3D model that can be animated by conventional means?

If it can do all that, and you add in motion capture from just a video, that will drastically cut the costs for all kinds of animation projects.

Given that it is possible to render photo-realistic people now from 3D models (subsurface scattering for the skin, etc.), we are well on the way to a full video production pipeline. Just give it some scans of the people and objects you want, type in a description of the scenes, generate the voices via text to speech, and press "render".

The next few years are going to be crazy.


I don't think it rigs the models - I think that video is made up of models generated by Stable Zero123 that were then rigged/animated/postprocessed in Blender.
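
For anyone curious what that Blender step might look like, here's a minimal sketch against Blender's Python API (the file path is a placeholder, and a real character needs a proper bone hierarchy rather than the single default bone used here):

    import bpy

    # Import a mesh exported from the image-to-3D model (placeholder path).
    bpy.ops.wm.obj_import(filepath="/tmp/generated_model.obj")  # Blender 3.2+
    mesh_obj = bpy.context.selected_objects[0]  # assuming a single imported mesh

    # Add a trivial, single-bone armature to skin the mesh against.
    bpy.ops.object.armature_add(location=(0.0, 0.0, 0.0))
    armature_obj = bpy.context.active_object

    # Parent the mesh to the armature with automatic vertex weights.
    bpy.ops.object.select_all(action='DESELECT')
    mesh_obj.select_set(True)
    armature_obj.select_set(True)
    bpy.context.view_layer.objects.active = armature_obj
    bpy.ops.object.parent_set(type='ARMATURE_AUTO')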


Ok so the $20/month subscription doesn't cover this one?



