Stable Fast 3D: Rapid 3D Asset Generation from Single Images (stability.ai)
321 points by meetpateltech 39 days ago | 84 comments



For all of the hype around LLMs, this general area (image generation and graphical assets) seems to me to be the big long-term winner of current-generation AI. It hits the sweet spot for the fundamental limitations of the methods:

* so-called "hallucination" (actually just how generative models work) is a feature, not a bug.

* anyone can easily see the unrealistic and biased outputs without complex statistical tests.

* human intuition is useful for evaluation, and not fundamentally misleading (i.e. the equivalent of "this text sounds fluent, so the generator must be intelligent!" hype doesn't really exist for imagery. We're capable of treating it as technology and evaluating it fairly, because there's no equivalent human capability.)

* even lossy, noisy, collapsed and over-trained methods can be valuable for different creative pursuits.

* perfection is not required. You can easily see distorted features in output, and iteratively try to improve them.

* consistency is not required (though it will unlock hugely valuable applications, like video, should it ever arrive).

* technologies like LoRA allow even unskilled users to train character-, style- or concept-specific models with ease.

I've been amazed at how much better image / visual generation models have become in the last year, and IMO, the pace of improvement has not been slowing as much as text models. Moreover, it's becoming increasingly clear that the future isn't the wholesale replacement of photographers, cinematographers, etc., but rather, a generation of crazy AI-based power tools that can do things like add and remove concepts to imagery with a few text prompts. It's insanely useful, and just like Photoshop in the 90s, a new generation of power-users is already emerging, and doing wild things with the tools.


> For all of the hype around LLMs, this general area (image generation and graphical assets) seems to me to be the big long-term winner of current-generation AI. It hits the sweet spot for the fundamental limitations of the methods:

I am biased (I work at Rev.com and Rev.ai), but I totally agree and would add one more thing: transcription. Accurate human transcription takes a really, really long time to do right. Often a ratio of 3:1-10:1 of transcriptionist time to original audio length.

Though ASR is only ~90-95% accurate on many "average" audio files, it is often 100% accurate on high-quality audio.

It's not only a cost savings thing, but there are entire industries that are popping up around AI transcription that just weren't possible before with human speed and scale.


Also the other way around: text to speech. We're at the point where I can finally listen to computer generated voice for extended periods of time without fatigue.

There was a project mentioned here on HN where someone was creating audiobook versions of public-domain content that would never have been converted by human narrators, because the time and expense wouldn't be economically feasible. That's a huge win for accessibility. Screen readers are also about to get dramatically better.


I'd add image to text - I use this all the time. For instance, I'll take a photo of a board or device, and ChatGPT/Claude/pick your frontier multimodal model is almost always able to classify it accurately and describe details, including chipsets, pinouts, etc.


I tried using ChatGPT for some handwritten text I couldn't make out and it failed miserably, just made stuff up.

Tried it on a PDF and it didn't even read the PDF.

I'm sure we'll get there, but... it's a real shame it lies when it can't figure something out.


Are you using 4o?

First, lying requires agency and intent, which LLMs don't have, so they can't lie.

Yes it makes stuff up when you put garbage in and uncritically consume the garbage. The key isn’t to look at it as an outsourcing of agency or the easy button but as a tool that gets you started on stuff and a new way of interacting with computers. It also confidently asserts things that are untrue or are subtly off base. To that extent, and in a literally very real sense, this is a very early preview of the technology - of a completely new computing technique that only reached bare minimum usability in the last two years. Would you rather not have early access or have to wait 20 years as accountants and product managers strangle it?

For OCR, I'm surprised anyone who has ever used it before would scan illegible handwriting in and expect not to get a bunch of garbage out, without it identifying that the garbage was semantically wrong. Frontier multimodal LLMs do an amazing job compared to the state of the art a year ago. Do they do an amazing job compared to an ever-shifting goalpost? Are all the guardrails of a mature 30-year-old software technique even discovered yet? No. But I'll tell you, the early days of HTTP were nothing like today. Was HTTP useless because it was so unreliable and flaky? No, it was amazing for those with the patience and the capacity to dream of building something truly remarkable at the time, like Google or Amazon or eBay.

The PDF issue you had is not expected. I upload PDFs all the time. For instance, when I'm working on something, like restringing some Hunter Douglas blinds in my house recently, I upload the instructions for the restring kit to a ChatGPT or Claude session, and it then becomes something I can ask iteratively how to tackle what I'm working on as I hit challenging spots in the process. It's not always right, and it confidently tells me subtly wrong things. But I pretty quickly realize what's wrong and what isn't as I work, and that's usually something ambiguous in the instructions that requires a lot more context on something very specific and likely not documented publicly anywhere. But 80% of the time my questions get answered as I work. That's -amazing- : I can scan a paper instruction sheet into a computer and get step-by-step guidance that I can interactively interrogate using my voice as I work, and it literally understands everything I ask and gives me cogent if sometimes off answers. This is literally the definition of the future I was promised.


> a project mentioned here on HN where someone was creating audio book versions of content in the public domain

Maybe this: https://news.ycombinator.com/item?id=40961385


That's the one! Thanks!


As an ex-Rev transcriber, I still remember the worst one I ever did.

It was a video for ESPN of an indoor motocross race, and the transcription was for the commentators. There were two fundamental problems:

1) The bike noise made the commentators almost inaudible

2) The commentators were using the [well-known to fans] nicknames of all the racers, and not their real names

I haven't used Rev for about three years, so I don't know how much better your auto-transcription system has gotten. I'd hope AI can solve #1, but #2 is a very hard problem to solve, simply because of the domain knowledge required. The nicknames were things like "Buttski McDumpleface" etc., and took a bunch of Googling to figure out.

I eventually got fired from Rev simply because the moderators hadn't heard of the Oxford comma :p


I agree. I think it's more of a niche use case than image models (and fundamentally harder to evaluate), but transcription and summarization are my current front-runners for the winning use case of LLMs.

That said, "hallucination" is more of a fundamental problem for this area than it is for imagery, which is why I still think imagery is the most interesting category.


Are there any models that can do diarization well yet?

I need one for a product, and the state of the art, e.g. pyannote, is so bad that it's better not to use it.
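
For context, the pyannote workflow I mean is roughly this (a minimal sketch; the checkpoint name and token handling are assumptions you'd need to verify for your setup):

    # Minimal pyannote diarization sketch (pyannote.audio 3.x assumed).
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token="HF_TOKEN",  # replace with your Hugging Face access token
    )
    diarization = pipeline("meeting.wav")

    # each track is (segment, track_id, speaker_label)
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{turn.start:.1f}s-{turn.end:.1f}s {speaker}")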


Deepgram has been pretty good for our product. Fast and fairly accurate for English.


Do they have a local model?

I keep getting burned by APIs with stupid restrictions that make use cases impossible which would be trivial if you could run the thing locally.


German public television already switched to automatic transcriptions a few years back.


> This general area (image generation and graphical assets) seems to me to be the big long-term winner of current-generation AI

I think it's easy to totally miss that LLMs are just being completely and quietly subsumed into a ton of products. They have been far more successful, and many image generation models use LLMs on the backend to generate "better" prompts for the models themselves. LLMs are the bedrock.


> it's becoming increasingly clear that the future isn't the wholesale replacement of photographers, cinematographers, etc.

I'd refrain from making any such statements about the future;* the pace of change makes it hard to see the horizon beyond a few years, especially relative to the span of a career. It's already wholesale-replacing many digital artists and editorial illustrators, and while it's still early, there's a clear push starting in the cinematography direction. (I fully agree with the rest of your comment, and it's strange how much diffusion models seem to be overlooked relative to LLMs when people think about AI progress these days.)

* (edit: about the future impact of AI on jobs).


I mean, my whole comment is a prediction of the future, so that's water under the bridge. Maybe you're right and this is the start of the apocalypse for digital artists, but it feels more like Photoshop in 1990 to me -- and people were saying the same stuff back then.

> It's already wholesale-replacing many digital artists and editorial illustrators

I think you're going to need to cite some data on a claim like that. Maybe it's replacing the fiverr end of the market? It's certainly much harder to justify paying someone to generate a (bad) logo or graphic when a diffusion model can do the same thing, but there's no way that a model, today, can replace a skilled artist. Or said differently: a skilled artist, combined with a good AI model, is vastly more productive than an unskilled artist with the same model.


Pay 10 unskilled artists to do a bad job and we will complain about 10 bad logos. Now, for a fraction of the price, generate 10,000 low-quality AI logos and flood the market with them. Market expectations will drop, and suddenly your AI will be on par with the artists...

(In case you think the market will not behave like that, just have a look at how we produce low-quality food and how many people are perfectly fine with that)...


Today an engineer does the job of 100, thanks to computers.


What happens when the AI takes the low end of the market is that the people who catered to the low end now have to try to compete more in the mid-to-high end. The mid end facing increased competition has to try to move up to the high end. So while AI may not be able to compete directly with the high end it will erode the negotiating power and thus the earning potential of the high end.


We have watched this same process repeat a few times over the last century with photography.


Or graphic design, or video editing, or audio mastering, or...every new tool has come with a bunch of people saying things like "what will happen to the linotype operators!?"

I sort of hate this line of argument, but it also has been manifestly true of the past, and rhymes with the present.


I agree, but I'm a bit biased: our start-up www.sticky.study is in this space.

What we've seen over the last year, trying out dozens of models and AI workflows, is that the fit between 1) a model's error tolerance and 2) its working context is super important.

AI hallucinations break a lot of otherwise useful implementations. It's just not trustworthy enough. Even with AI imagery, some use cases require precision - AI photoshoots and brand advertising come to mind.

The sweet spot seems to be as part of a pipeline where the user only needs a 90% quality output. Or you have a human + computer workflow - a type of "Centaur" - similar to Moravec's Paradox.


Image models are a great way to understand generative AI. It's like surveying a battlefield from the air as opposed to the ground.


>For all of the hype around LLMs, this general area (image generation and graphical assets) seems to me to be the big long-term winner of current-generation AI.

Let me show you the future: https://www.youtube.com/watch?v=eVlXZKGuaiE

This is an LLM controlling an embodied VR body in a physics simulation.

It is responding to human voice input not only with voice but body movements.

Transformers aren't just chatbots, they are general symbolic manipulation machines. Anything that can be expressed as a series of symbols is a thing they can do.


>This is an LLM controlling an embodied VR body in a physics simulation.

No it's not. It's VAM that's controlling the character, and it's literally just using a bog-standard LLM as a chatbot, feeding the text into a plugin in VAM, and VAM itself does the animation. Don't get me wrong, it's absolutely next level to experience chatbots this way, but it's still a chatbot.


The animation, not the movement decisions.

This is as naive as calling an industrial robot 'just a calculator'.


The movement decisions are also just text from the LLM and are heavily coupled with what's available in the scene. It's not some free autonomous agent. Nor were the movement decisions trained on any special type of tokens other than just text.


Yes and?


And it's just an LLM powered chatbot.


> anyone can easily see the unrealistic outputs without complex statistical tests.

This is key: we're all pre-wired with fast correctness tests.

Are there other data types that match this?


Audio to a lesser degree


Software (I mean the product, not the code)

Mundane tasks that can be visually inspected at the end (cleaning, organizing, maintenance and mechanical work)


LLMs are a breakthrough for the human-computer interface.

The knowledge answering is secondary, in my opinion.


I would argue the opposite — image generation is the clear loser. If you've ever tried to do it yourself, grabbing a bunch of LoRAs from Civitai to try to convince a model to draw something it doesn't initially know how to draw — it becomes clear that there's far too much unavoidable correlation between "form" and "representation" / "style" going on in even a SOTA diffusion model's hidden layers.

Unlike LLMs, which really seem to translate the text into "concepts" at a certain embedding layer, the (current, 2D) diffusion models will store (and thus require to be trained on) a completely different idea of a thing if it's viewed from a slightly different angle or is a different size. Diffusion models can interpolate but not extrapolate — they can't see a prompt that says "lion goat dragon monster" and come up with the ancient Greek Chimera unless they've actually been trained on a Chimera. You can tell them "Asian man, blond hair" — and if their training dataset contains Asian men and men with blond hair but never at the same time, then they won't be able to "hallucinate" a blond Asian man for you, because that won't be an established point in the model's latent space.

---

On a tangent: IMHO the true breakthrough would be a model for "text to textured-3D-mesh" — where it builds the model out of parts that it shapes individually and assembles in 3D space not out of tris, but by writing/manipulating tokens representing shader code (i.e. it creates "procedural art"); and then it consistency-checks itself at each step not just against a textual embedding, but also against an arbitrary (i.e. controlled for each layer at runtime by data) set of 2D projections that can be decoded out to textual embeddings.

(I imagine that such a model would need some internal "blackboard" of representational memory that it can set up arbitrarily-complex "lenses" for between each layer — i.e. a camera with an arbitrary projection matrix, through which is read/written a memory matrix. This would allow the model to arbitrarily re-project its internal working visual "conception" of the model between each step, in a way controllable by the output of each step. Just like a human would rotate and zoom a 3D model while working on it[1]. But (presumably) with all the edits needing a particular perspective done in parallel on the first layer where that perspective is locked in.)

Until we have something like that, though, all we're really getting from current {text,image}-to-{image,video} models is the parallel layered inpainting of a decently, but not remarkably exhaustive pre-styled patch library, with each patch of each layer being applied with an arbitrary Photoshop-like "layer effect" (convolution kernel.) Which is the big reason that artists get mad at AI for "stealing their work" — but also why the results just aren't very flexible. Don't have a patch of a person's ear with a big earlobe seen in profile? No big-earlobe ear in profile for you. It either becomes a small-earlobe ear or the whole image becomes not-in-profile. (Which is an improvement from earlier models, where just the ear became not-in-profile.)

[1] Or just like our minds are known to rotate and zoom objects in our "spatial memory" to snap them into our mental visual schemas!


I think you’re arguing about slightly different things. OP said that image generation is useful despite all its shortcomings, and that the shortcomings are easy to deal with for humans. OP didn’t argue that the image generation AIs are actually smart. Just that they are useful tech for a variety of use cases.


> Until we have something like that...

The kind of granular, human-assisted interaction interface and workflow you're describing is, IMHO, the high-value path for the evolution of AI creative tools for non-text applications such as imaging, video and music, etc. Using a single or handful of images or clips as a starting place is good but as a semi-talented, life-long aspirational creative, current AI generation isn't that practically useful to me without the ability to interactively guide the AI toward what I want in more granular ways.

Ideally, I'd like an interaction model akin to real-time collaboration. Due to my semi-talent, I've often done initial concepts myself and then worked with more technically proficient artists, modelers, musicians and sound designers to achieve my desired end result. By far the most valuable such collaborations weren't necessarily with the most technically proficient implementers, but rather those who had the most evolved real-time collaboration skills. The 'soft skill' of interpreting my directional inputs and then interactively refining or extrapolating them into new options or creative combinations proved simply invaluable.

For example, with graphic artists I've developed a strong preference for working with those able to start out by collaboratively sketching rough ideas on paper in real-time before moving to digital implementation. The interaction and rapid iteration of tossing evolving ideas back and forth tended to yield vastly superior creative results. While I don't expect AI-assisted creative tools to reach anywhere near the same interaction fluidity as a collaboratively-gifted human anytime soon, even minor steps in this direction will make such tools far more useful for concepting and creative exploration.


...but I wasn't describing a "human-assisted interaction interface and workflow." I was describing a different way for an AI to do things "inside its head" in a feed-forward span-of-a-few-seconds inference pass.


Thanks for the correction. Not being well-versed in AI tech, I misinterpreted what you wrote and assumed it might enable more granular feedback and iteration.


Honestly, I have yet to see an AI-generated image that makes me go "oh wow". It's missing those last 10 percent that always seem to elude neural networks.

Also, the very bad press gen AI gets is very much slowing down adoption, particularly among creative-minded people, who would be the most likely users.


Hop on Civitai.

There are plenty of mind-blowing images.


This is the third image-to-3D AI I've tested, and in all cases the examples they give already look like 2D renders of 3D models. My tests were with cel-shaded images (cartoony, not with realistic lighting), and the model outputs something very flat but with very bad topology, which is worse than starting with a low-poly mesh or extruding the drawing. I suspect it is unable to give decent results without accurate shadows from which the normal vectors could be recomputed, and thus lacks any 'understanding' of what the structure would be from the lines and forms.

In any case it would be cool if they specified the set of inputs that is expected to give decent results.


What stuck out to me from this release was this:

> Optional quad or triangle remeshing (adding only 100-200ms to processing time)

But it seems to have been optional. Did you try it with that turned on? I'd be very interested in those results, as I had the same experience as you: the models don't generate good enough meshes, so I was hoping this one would be a bit better at that.

Edit: I just tried it out myself on their Huggingface demo and even with the predefined images they have there, the mesh output is just not good enough. https://i.imgur.com/e6voLi6.png


It might not just be your tests.

All of my tests of img2mesh technologies have produced poor results, even when using images that are very similar to the ones featured in their demo. I’ve never got fidelity like what they’ve shown.

I’ll give this a whirl and see if it performs better.


All right, I was hesitating to try shading some images to see if that improves the quality. It's probably still too early.


Tried it with a collection of images, and in my opinion it performs -worse- than earlier releases.

It is however fast.


> 0.5 seconds per 3D asset generation on a GPU with 7GB VRAM

Holy cow - I was thinking this might be one of those datacenter-only models but here I am proven wrong. 7GB of VRAM suggests this could run on a lot of hardware that 3D artists own already.


I really can't wait for this technology to improve. Unfortunately, just from testing this, it seems not very useful. It takes more work to modify the bad model it approximates from the input image than to start from scratch with a good foundation. I would rather see something that took a series of steps to reach a higher-quality end product more slowly, instead of expecting everything to come from one image. Perhaps I'm missing the use case?


> not very useful

Useful for what? I think use cases will emerge.

A lot of critiques assume you're working in VFX or game development. By making image-to-3D (and by extension text-to-image-to-3D) effortless, a whole host of new applications open up, which might not be anywhere near so demanding.


Perhaps it'll require a series of segmentations and transforms that improve individual components and then work up towards the full 3D model of the image.


Not the holy grail yet, but pretty cool!

I see these as usable not as main assets, but as something you would add as a low-effort embellishment to add complexity to the main scene. The fact that they maintain their profile makes them usable in situations where a mere 2D billboard impostor (i.e. the original image always oriented towards the camera) would not cut it.

You can totally create a figure image (Midjourney|Bing|Dalle3), drag and drop it into the image input, and get a surprisingly good 3D representation - not a highly detailed model, but something you could very well put on a shelf in a 3D scene as an embellishment, where the camera never sees the back of it and the model is never at the center of attention.


I'm really excited for something in this area to really deliver, and it's really cool that I can just drag pictures into the demo on HuggingFace [0] to try it.

However... mixed success. It's not good with (real) cats yet - which was obvs the first thing I tried. It did reasonably well with a simple image of an iPhone, and actually pretty impressively with a pancake with fruit on top, terribly with a rocket, and impressively again with a rack of pool balls.

[0] https://huggingface.co/spaces/stabilityai/stable-fast-3d
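
If you'd rather script it than drag images in, the space can in principle be driven with the gradio_client package; this is just a sketch, and the space's actual endpoint names and arguments need to be checked first:

    # Sketch of driving the Hugging Face space programmatically.
    from gradio_client import Client

    client = Client("stabilityai/stable-fast-3d")
    client.view_api()  # prints the available endpoints and their parameters

    # hypothetical call shape -- confirm the real endpoint name/args via view_api() first
    # result = client.predict("cat.png", api_name="/run")
    # print(result)  # usually a local path to the generated .glb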


I'm going to 3D print so much dumb stuff with this.


They're still hesitant to show the untextured version of the models so I would assume it's like previous efforts where most of the detail is in the textures, and the model itself, the part you would 3D print, isn't so impressive.


You can download a .glb file (from the HuggingFace demo page) and open it locally (e.g. in MS 3D Viewer). I'm looking at a mesh from one of the better examples I tried and it's actually pretty good...
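
If you want to sanity-check the geometry beyond eyeballing it in a viewer, a quick trimesh sketch works ("output.glb" standing in for whatever the demo named your download):

    import trimesh

    # GLB files load as a Scene by default; force everything into a single mesh
    mesh = trimesh.load("output.glb", force="mesh")
    print("vertices:  ", len(mesh.vertices))
    print("faces:     ", len(mesh.faces))
    print("watertight:", mesh.is_watertight)  # holes matter if you want to 3D print it
    mesh.show()  # simple interactive viewer (needs pyglet)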


You know, I do wonder about this. If it's just for static assets, does it really matter? In something like Unreal, the textures are going to be virtualized and the geometry is going to be turned into LODed triangle soup anyway.

Has anyone tried to build an Unreal scene with these generated meshes?


Usually the problem is the model itself is severely lacking in detail, sure Nanite could make light work of a poorly optimized model but it's not going to fix the model being a vague blob which doesn't hold up to close scrutiny.


So don't use them in a context where they require close scrutiny?


Generate the accompanying normal map and then just tessellate it?


I was going to comment on the same thing; these 3D reconstructions often generate a mess of a topology, and this post does not show any of the mesh triangulations, so I assume they're still not good. Arguably, the meshes are bad even for rendering.


Presumably, these meshes can be cleaned up using standard mesh refinement algorithms, like those found in MeshLab: https://www.meshlab.net/#features
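
For example, a rough cleanup pass scripted through pymeshlab (MeshLab's Python bindings) might look like this; a sketch only, and the filter names follow recent pymeshlab releases, so older versions may name them differently:

    import pymeshlab

    ms = pymeshlab.MeshSet()
    ms.load_new_mesh("output.obj")

    ms.meshing_remove_duplicate_vertices()
    ms.meshing_remove_unreferenced_vertices()
    # quadric edge collapse decimation down to a target face count
    ms.meshing_decimation_quadric_edge_collapse(targetfacenum=5000)

    ms.save_current_mesh("output_clean.obj")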


Hopefully that's in the (near) future, but as of now there still exists 'retopo' in 3D work for a reason. Just like roto and similar menial tasks. We're getting there with automation though.


hueforge



It really looks like they've been doing that classic infomercial tactic of desaturating the images of the things they're comparing against to make theirs seem better.


Great result. Just had a play around with the demo models, and they preserve structure really nicely, although the textures are still not great. It's kind of a voxelized version of the input image.


You can interact with the models on their project page: https://stable-fast-3d.github.io/


Be still my miniature-painting heart.


Closer and closer to the automatic mapping drones from Prometheus.

I wonder what the optimum group of technologies is that would enable that kind of mapping? Would you pile on LIDAR, RADAR, this tech, ultrasound, magnetic sensing, etc etc. Although, you're then getting a flying tricorder. Which could enable some cool uses even outside the stereotypical search and rescue.


Are you talking about mapping tunnels with drones? That's already done and it doesn't really need any 'AI': it's plain old SLAM.

DARPA's subterranean challenge had many teams that did some pretty cool stuff in this direction: https://spectrum.ieee.org/darpa-subterranean-challenge-26571...


You already have Depth Anything V2, which can generate depth maps in real time even on an iPhone. Quality is pretty good and will probably improve further. Actually, in many ways those depth maps are much better quality than the iPhone LiDAR or TrueDepth camera (which can't handle transparent, metallic, or reflective surfaces and is also quite noisy).

https://github.com/DepthAnything/Depth-Anything-V2

https://huggingface.co/spaces/pablovela5620/depth-compare

https://huggingface.co/apple/coreml-depth-anything-v2-small
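
For a quick local test outside CoreML, the transformers depth-estimation pipeline can run it in a few lines (a sketch; the model id is assumed to be the small HF port, so double-check it on the hub):

    from PIL import Image
    from transformers import pipeline

    pipe = pipeline("depth-estimation",
                    model="depth-anything/Depth-Anything-V2-Small-hf")

    image = Image.open("photo.jpg")
    result = pipe(image)
    result["depth"].save("photo_depth.png")  # per-pixel depth as a grayscale image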


You don't need or want generative AI for mapping; you "just" need LiDAR and drones for SLAM.

https://www.youtube.com/watch?v=1CWWP9jb4cE


High-res images from multiple perspectives should be sufficient. If you have a consumer drone, this product (no affiliation) is extremely impressive: https://www.dronedeploy.com/

You basically select an area on a map that you want to model in 3D, and it flies your drone (take-off, flight path, landing), takes pictures, uploads them to their servers for processing, generates a point cloud, etc. Very powerful.


What you can do with WebODM is already quite impressive.


Looks very good on the examples, but testing it on a few IKEA chairs or a Donald Duck image gives very wrong results.

You can test here: https://huggingface.co/spaces/stabilityai/stable-fast-3d


Given that the graphics-asset part of AA or AAA games is the most expensive, I wonder if 3D asset generation could drastically lower that cost by 50% or more, at least for the same output. Though in reality, I guess artists will just spend more time in other areas.


Man it would be so cool to get AI-assisted photogrammetry. Imagine that instead of taking a hundred photos or a slow scan and having to labor over a point cloud, you could just take like three pictures and then go down a checklist. "Is this circular? How long is this straight line? Is this surface flat? What's the angle between these two pieces?" and getting a perfect replica or even a STEP file out of it. Heaven for 3D printers.



What I'd really like to see in these kinds of articles is examples of it not working as well. I don't necessarily want to see it being perfect; I'd quite like to see its limitations too.


For those reading from Stability - just tried it - API seems to be down and the notebook doesn't have the example code it claimed to have.


This is good news for the indie game dev scene, I suppose?


The models aren't really optimized for game dev. Fine for machinima, probably.


This is a great step forward.

I wonder whether RAG-based 3D animation generation could be done with this. Roughly (sketch after the list):

1. Textual description of a story.

2. Extract/generate keywords from the story using LLM.

3. Search and look up 2D images by the keywords.

4. Generate 3D models from the 2D images using Stable Fast 3D.

5. Extract/generate path description from the story using LLM.

6. Generate movement/animation/gait using some AI.

...

7. Profit??
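
A rough sketch of how steps 1-6 might glue together (every helper below is a hypothetical stub, not a real API; the point is just the data flow):

    from typing import List

    # stubs marking where an LLM, an image search/generator, Stable Fast 3D,
    # and an animation model would plug in
    def llm_extract_keywords(story: str) -> List[str]: ...                    # step 2
    def search_or_generate_image(keyword: str) -> bytes: ...                  # step 3
    def image_to_3d(image: bytes) -> bytes: ...                               # step 4
    def llm_extract_paths(story: str, keywords: List[str]) -> List[str]: ...  # step 5
    def animate(mesh: bytes, path_description: str) -> bytes: ...             # step 6

    def story_to_animation(story: str) -> List[bytes]:
        keywords = llm_extract_keywords(story)
        images = [search_or_generate_image(k) for k in keywords]
        meshes = [image_to_3d(img) for img in images]
        paths = llm_extract_paths(story, keywords)
        return [animate(m, p) for m, p in zip(meshes, paths)]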


Pre-generate a bunch of images via SDXL, convert them to 3D, and then serve the nearest mesh after querying.



