A working implementation of text-to-3D DreamFusion, powered by Stable Diffusion (github.com/ashawkey)
286 points by nopinsight on Oct 6, 2022 | 71 comments



>Stable-Diffusion is a latent diffusion model, which diffuses in a latent space instead of the original image space. Therefore, we need the loss to propagate back from the VAE's encoder

There is also an alternative way to handle this difference from the original paper, which should also work:

Instead of working in voxel color space, you push the latent into the voxel grid (i.e. instead of a voxel grid of 3D RGB colors, you have a voxel grid of dim_latent-dimensional latents; you can also use spherical harmonics if you want, as they work just the same in n dimensions).

Only the color prediction network differs; the density is kept the same.

The NeRF then renders directly into the latent space (so there are fewer rays to render), which means you only need to decode with the VAE for visualization purposes, not in the training loop.
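
A minimal sketch of that idea in PyTorch (module and dimension names are my own, not from any particular paper; Stable Diffusion's latent space has 4 channels): the appearance head predicts a latent vector per sample instead of an RGB color, while the density head is unchanged.

  import torch
  import torch.nn as nn

  class LatentNeRFHead(nn.Module):
      # Hypothetical appearance head: outputs a 4-channel latent instead of RGB.
      def __init__(self, feat_dim=64, latent_dim=4):
          super().__init__()
          self.sigma = nn.Linear(feat_dim, 1)            # density head, unchanged
          self.latent = nn.Linear(feat_dim, latent_dim)  # latent "color" head
      def forward(self, feat):
          sigma = torch.relu(self.sigma(feat))           # non-negative density
          z = self.latent(feat)                          # no sigmoid: latents aren't bounded to [0, 1]
          return sigma, z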


This sounds really interesting but I'm not sure I follow. I'm having a hard time expressing how I'm confused, though (maybe it's the unfamiliar NeRF terminology), but if you have the time I'd be very interested if you could reformulate this alternative method somehow (I've been stuck on this very issue for two days now trying to implement this myself).


NeRF is Neural Radiance Fields, the neural 3D reconstruction technique (NVIDIA's Instant NeRF is the fast implementation that was in the news recently.)

Basically, if I'm reading it right, this does the synthesis in latent space (which describes the scene rather than rendered voxels), then translates it into a NeRF. It sounds kind of like the Stable Diffusion description that was on here earlier.


Do you have references to NeRF papers that directly compute the field in latent code? Since NeRF-based methods essentially solve the rendering equation to learn the mapping, what would the alternative equation be for directly learning the latent code? Your idea is interesting, could you elaborate on it?


Sorry, no references for NeRF papers. But the idea is not new. My experience was originally with point clouds and 3D keypoint-feature SLAM.

In these, you represent the world as a sparse latent representation: a collection of 3D coordinates and their corresponding SIFT feature descriptors. You rotate and translate these keypoints to obtain a novel view of the features in a 2D image. (The descriptor can be taken as an interpolation of the descriptors weighted by the difference in orientation between views, since features only match for small viewing-angle differences, around 20°.) And then you could invert the features to retrieve a pixel-space image (for example https://openaccess.thecvf.com/content_cvpr_2016/papers/Dosov... ), although it's never needed in practice.

Coming back to NeRF, it's the same principle. When your NeRF has converged, if you don't have transparent objects, the density along a ray will be 0 except where the ray intersects the geometry. There, only a single voxel is hit, and you fetch the latent stored in that voxel (via spherical harmonics) for the direction given by the ray, as learned during training against a latent image.

The rendering equation is still the same, but instead of rendering a single ray, it is analogous to rendering a group of nearby rays to produce a patch of the image, of which the latent is a compressed representation. You have to be careful not to make the patch too big, because, like with a lens in the real world, the spherical transform flips the patch image under translation; a neural network should handle this transparently, though.

The converged representation is an approximation based on linearization and interpolation over positions and ray directions. Provided you have enough resolution, you can construct it manually from the solution and see how it behaves in the rendering.

Will the convergence process work? It depends on how well latents mix along a ray. The light transport equation is usually linear, and latents usually mix well linearly (even more so when weighted by a density). But if they don't mix well, you can learn a latent-mixing rule that helps it converge.
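
Concretely, "mixing linearly when weighted by density" is just the standard volume-rendering compositing applied to latent vectors instead of RGB colors; a sketch (function and argument names are mine):

  import torch

  def composite_latents(sigmas, latents, deltas):
      # sigmas: [N] densities, latents: [N, C] per-sample latents,
      # deltas: [N] distances between consecutive samples along the ray
      alphas = 1.0 - torch.exp(-sigmas * deltas)                               # [N]
      trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alphas]), 0)[:-1]  # transmittance
      weights = alphas * trans                                                 # [N]
      return (weights[:, None] * latents).sum(dim=0)                           # [C] rendered latent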

Also, once you have a latent NeRF, it won't directly give you an STL/OBJ, but you should have 3D-consistent views from which you could train a classical NeRF. Alternatively (and probably preferably), you can optimize a classical voxel grid to fit the latent voxel grid (i.e. one that yields the same image patches).



Key step in generating 3D – ask Stable Diffusion to score views from different angles:

  for d in ['front', 'side', 'back', 'side', 'overhead', 'bottom']:
    text = f"{ref_text}, {d} view"
https://github.com/ashawkey/stable-dreamfusion/blob/0cb8c0e0...
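
Roughly, the suffix is chosen from the sampled camera pose; an illustrative mapping (the thresholds here are made up, not the repo's exact values):

  def view_suffix(azimuth_deg, elevation_deg):
      # Map a sampled camera pose to one of the directional prompt suffixes.
      if elevation_deg > 60:
          return 'overhead'
      if elevation_deg < -60:
          return 'bottom'
      if azimuth_deg < 45 or azimuth_deg >= 315:
          return 'front'
      if 135 <= azimuth_deg < 225:
          return 'back'
      return 'side'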


I'm modestly surprised that those few angles give us enough data to build out a full 3D render, but I guess I shouldn't be too surprised, as that's tech that has been in high demand and well understood for years (that kind of front-cut / side-cut image is what 3D artists use to do their initial prototypes of objects if they're working from real-life models).


DreamFusion doesn't directly build a 3D model from those generated images. It starts with a completely random 3D voxel model, renders it from 6 different angles, then asks Stable Diffusion how plausible each render is as an image of "X, side view".

It then sprinkles some noise on the rendering, makes Stable Diffusion improve it a little, then adjusts the voxels to produce that image (using differentiable rendering.)

Rinse and repeat for hours.
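
A rough sketch of that loop (the helper functions and optimizer here are placeholders for illustration, not the actual stable-dreamfusion API; the real repo also encodes the render into Stable Diffusion's latent space via the VAE before the noise step):

  import torch

  # Placeholders: sample_camera, render_nerf, view_for, add_noise,
  # unet_eps, w, nerf_params, optimizer, ref_text.
  for step in range(10000):
      pose = sample_camera()                          # random viewpoint
      img = render_nerf(nerf_params, pose)            # differentiable render
      text = f"{ref_text}, {view_for(pose)} view"     # direction-dependent prompt
      t = torch.randint(20, 980, (1,))                # random diffusion timestep
      noise = torch.randn_like(img)
      noisy = add_noise(img, noise, t)
      with torch.no_grad():
          eps = unet_eps(noisy, t, text)              # frozen Stable Diffusion UNet
      grad = w(t) * (eps - noise)                     # score distillation gradient
      img.backward(gradient=grad)                     # pushes the gradient into the NeRF
      optimizer.step()
      optimizer.zero_grad()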


Thank you for the clarification; I hadn't grokked the algorithm yet.

That's interesting for a couple of reasons. I can see why that works. It also implies that for closed objects, the voxel data on the interior (where no images can see it) will be complete noise, as there's no signal to pick any color or lack of a voxel.


Yes, although not complete noise – probably empty. Haven't checked but assume there's regularization of the NeRF parameters.
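
One common form such a regularizer takes (a sketch of the general technique, not necessarily what this repo uses) is an entropy penalty on per-ray opacity, which pushes space toward being either empty or fully opaque rather than noisy:

  import torch

  def opacity_entropy_loss(alphas, eps=1e-5):
      # alphas: accumulated opacity per ray, values in [0, 1]
      a = alphas.clamp(eps, 1.0 - eps)
      return -(a * a.log() + (1 - a) * (1 - a).log()).mean()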


    text = f"{ref_text}, front cutaway drawing"
Maybe?


I don't think NeRFs require too many images to produce impressive results.


Given the way the language model works, these words could have multiple meanings. I wonder if training a form of textual inversion to represent these concepts more directly might improve the results. You could even try teaching it to represent finer-grained degree adjustments.


There's a small gallery of success and failure cases here [1].

It certainly doesn't look as good as the original, yet. I wonder if that's due to the implementation differences noted, less cherry picking in what they show, or inherent differences between Imagen and Stable Diffusion.

Maybe Imagen just has a much better grasp of how images translate to actual objects, where Stable Diffusion is stuck more on the 2d image plane.

1: https://github.com/ashawkey/stable-dreamfusion/issues/1


One cannot help but notice the success cases are expected to have symmetry along at least one axis, whereas the failure cases are not.


Aren't squirrels and frogs expected to have an axis of symmetry? I think the reason for the failures is the presence of faces; it seems to be trying to make a face visible from all angles.


Which probably has a lot to do with us taking nearly all our pictures of things with faces from the face facing direction.


I feel like the Cthulhu head is extra-successful, given the subject matter.

Non-Euclidean back-polygon imaging? Good work, algorithm. ;)


I guess a lot can be done to force the model to create properly connected 3D shapes instead of these thin protruding 2D slices. But I noticed something else. Some of the angles “in between” the frog faces have three eyes. I wonder if part of the issue might be that those don’t look especially wrong to Stable Diffusion. It’s often surprisingly confused about the number of limbs it should generate.


Currently working with a student group to build out a 3D scene generator (https://github.com/Cook4986/Longhand), and the prospect of arbitrary, hyper-specific mesh arrays on demand is thrilling.

Right now, we are relying on the Sketchfab API to populate our (Blender) scenes, which is an imperfect lens through which to visualize the contents of texts that our non-technical "clientele" are studying.

Since we are publishing these scenes via WebXR (Hubs), we have specific criteria related to poly counts (latency, bandwidth, etc) and usability. Regarding the latter concern, it's not clear that our end users will want to wait/pay for compute.

*copyedited


wow


All this news about image/3D/video generation just shows that we are living in the middle of an AI/ML breakthrough. Incredible to see news of extreme progress in the field like this popping up every day.


Browse Lexica.art; your mind will be blown by the range and amount of detail in some of the art.

Like this (nsfw content): https://lexica.art/?q=Intricate+goddess

There is an addictive and trippy quality to this, and it has yet to hit the mainstream. The art itself is stunning, but it goes beyond that: the ability to nudge it around and make variations is incredible. Now add the fact that you can train it with your own content. People are going to go bonkers with this, and it's going to open up a lot of debates too.


I can’t wait to generate novel 3D models to CNC/3D print! Can these be exported out as STL/OBJs?


In the usage notes, there's a line that mentions

  # test (exporting 360 video, and an obj mesh with png texture)
  python main_nerf.py --text "a hamburger" --workspace trial -O --test
So I guess so. That's pretty awesome.
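
For context, that kind of export typically works by evaluating the NeRF's density on a regular grid and running marching cubes over it; a rough sketch (assuming the PyMCubes package; the repo's exact code and threshold may differ):

  import numpy as np
  import mcubes  # PyMCubes

  # density: [R, R, R] array of NeRF densities sampled on a grid (illustrative file name)
  density = np.load('density_grid.npy')
  vertices, triangles = mcubes.marching_cubes(density, 10.0)  # threshold is illustrative
  mcubes.export_obj(vertices, triangles, 'mesh.obj')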


Heckin A!

I just got my 3D printer and was a bit too tipsy to assemble it the day it arrived - and have several things I want to print…

It will be interesting to experiment with describing the thing I want to print with text, instead of designing it in SolidEdge, and seeing what the AI thinks….

I wonder if you can feed it specific dimensions?

“A holder for a power supply for an e bike with two mounting holes 120mm apart with a carry capacity that is 5 inches long and 1.5 inches deep”


Well, here's 16 attempts from regular Stable Diffusion with that prompt [1], and here's what it thinks a technical drawing of it might look like [2].

Maybe two papers down the line :D For now you might have more luck with something less specific.

1: https://i.imgur.com/RPNCwyM.png

2: https://i.imgur.com/c9pfM8U.png


Still dope… but are those also obj or stl?

I like my DALL-E renditions of “Master Chief as Vitruvian Man as drawn by da Vinci”

And my “technical exploded diagrams of cybernetic exoskeleton suits in blueprint”

Try those out?


Feels like AI-generated art is approaching a sort of singularity at this point. Progress is getting very exponential.


These things take months and months to train (hardly fast progress). Any new model that comes out is generally already in the air (not unpredictable), and these applications were pretty much expected the day Stable Diffusion came out.


"months and months"

At the beginning of this year, most technical people would have told you that graphic design was a decade from being automated, and creative video production more.

Now we are at "months and months".


That this would come out was totally predictable and expected by most in-the-know people in the ML world (we basically had proto versions of it in summer 2021). Not really the unpredictable trajectory I associate with a singularity.


Only if you have low standards of what constitutes research on artificial intelligence and art perception.


Omg, imagine how useful this would be for video games or movies. Whipping up an asset in a matter of hours of computer time? Amazing


Incredible for indie developers especially. It's truly democratising.


Finally I can become a one man AAA game factory...


Or any sort of CAD


Maybe I just don't understand what goes into CAD (probably, even), but I don't see it? CAD needs to spit out precise dimensions for things at the end, while this sort of AI is just going to spit out "something more or less like this"; it doesn't seem likely to be usable?


A few papers down the road, hopefully. There's no inherent reason these models have to be as bad with numbers as they are now.

I do feel that new-gen mechanical CAD will be based on a dialog between a human and a generative suite. Whether it's based on diffusion or some other method remains to be seen.


I don't think this kind of AI will be that useful for CAD. The time-consuming part of CAD work is the precise detailing of parts that need to fit together, often iterated on because of various mechanical, electrical and material properties.


A lot of CAD work today is based on parametric design, constraint solvers and letting your CAD software solve various optimisation problems. One of the big problems with this is that the dimensionality and search space involved can be huge. Using AI to narrow the search space these solvers have to deal with will be a significant and obvious win, and work is already happening in this space.

Combine that with an AI that can generate an initial design from a rough specification, let the user apply some constraints, and then have the AI iteratively generate new designs (informed by known good designs) that fulfil those constraints, and it could be very powerful.


I would love an AI model that does text-to-SVG.


that would be lovely! perhaps it can also dream up interesting game of life patterns


OpenAI Codex can do that.


Would be cool to see it adapted to an img2img scenario, using one or more 'seed images'. It would be closer to a standard NeRF, but it would also be able to imagine novel angles, with guidance from a prompt.


Only downside with this is that each mesh takes around 5 hours to generate (on a V100, too). Obviously it'll speed up, but we're far from a panacea.


Can someone explain the significance of this? I am not familiar what DreamFusion is.


[Edited] The original Dreamfusion project was discussed here a few days ago: https://news.ycombinator.com/item?id=33025446


This is a different project/implementation, based on the open-source Stable Diffusion instead of Google's proprietary Imagen.


Edited; thanks for clarifying!

That makes this Stable-Dreamfusion adaptation even more promising.


I think it would be interesting to convert to a polygon mesh periodically in-the-loop. It could end up with more precise models.


How long does it take? 5 hours?


Jesus. Well thanks for your contribution to putting the entire creative industry out of work, I guess, little anime girl icon person. Ugh.


I have a friend, an old-school professional artist from before affordable computing, who has been using AI (and computers before that) to aid his creations for many years now. He runs everything himself on his own machines (a pretty expensive setup), experimenting and training, and he loves every iteration.

But I guess it depends on what the creative industry means to you. Pumping out web UIs or 3D gaming models was never, for the most part, the creative industry; learning to see what people like and copying it for different situations is not necessarily creative, and that is what AI does easily. Anything that doesn't require a lot of learning, practice and talent beyond manual work will be replaced by AI soon; the other stuff will take somewhat longer.

If you think this can replace you, you weren’t/aren’t in the creative industry. Same goes for coders afraid of no code.


So how many shitty "not creative industry" jobs did your friend take on the way to where he could have "a pretty expensive setup" to do this? What did he crank out solely to earn a paycheck with his art skills?


Your tone is not great, but he never had those jobs; he was born into a poor family (for NL), but his talent was recognised by HR Giger when he sent him a paintbrushed work (via DHL, with a frame and all, on a whim), and that was enough. He is not rich but makes a nice living. Note that this is the EU; there is not much risk of dying under a bridge even if you don't succeed. But he did succeed, as far as he is concerned. He never compromised anything, like you imply he must have done.

Edit: but you are also implying you think your job is gone with stuff like this? What do you do? Also, I am hoping I will be replaced: I have been thinking I would be replaced since the early 80s, as my work as a programmer is not so exciting (I love it and will keep doing it even if it's no longer viable, which I believe is very far off, AI-wise, for the 20% of people who do niche work; I think the same holds for creatives), but it seems closer now than ever.

Edit 2: looking at the work in your profile, it doesn't seem you will be replaced by anything soon; what is the anger about? Do you have public blogs/tweets about your feelings on this? Judging by your work (in your HN profile), you seem to be in the group not touched by this at all.


What is “soon” here? Admittedly I’m not particularly sanguine about the prospects of AI generated art or code taking many jobs in the near future, but at some point it could well happen even to talented engineers and artists. It’s nice of you to not mind being replaced, but of course not everyone will be happy about existential threats to their hard-earned livelihoods.


I have always been sceptical about AI; I majored in it during the AI winter of the 90s. But now that I have come back to it in the past few years, I am getting more positive. This no longer feels like a question of being happy about it, but of accepting that it will happen anyway, just as so much progress has happened, trampled over jobs, created other jobs, and still left a people shortage. Things like Copilot are now not far from replacing bad programmers; Copilot writes hundreds of lines a day for me, in my style. I know which prompts work, and that's the hard part, but it is really impressive what it comes up with and correctly 'guesses'.

Politically, keeping the gains of this from flowing to a few companies or individuals should be a priority. As has been proven (and as Carmack has said), AI is trivial after invention and will be cloned by open source in days or weeks; then training sets are key to making sure we all benefit. This needs to remain the case for society to keep working, imho.

But yeah, "soon" is 10 years for many things, I feel now, and 100+ (effectively infinite) years for others. I observe that the things Copilot does for me now are things millions of programmers cannot even come up with themselves, and those people will need to do something else. And that's close.

Replacing the talented people is further away because of 'understanding'; statistical pattern matching is maybe (we don't know) something different from understanding. When we manage to use temporal flow and conversational prompts, which a lot of people are researching and developing now, it will get really interesting. So far it seems inevitable… until the next AI winter.


I personally have a while yet before I can be replaced by AI art tools, yes.

What I am concerned and angry about is the next generation. The people who are just becoming pros, who are at the point where they are glad to get a job cranking out hundreds of models of sneakers for EA's Basketball Jam 2024 or doing a bunch of D&D character commissions or whatever other commercial art job because it is paying the bills (including working off their massive student debt) by doing what they actually trained to do instead of some shitty minimum-wage job or an even shittier "gig economy" thing. That's the window that all this AI art crap is making a lot smaller.


Sure, I know. But what is the use of being angry about it? It is happening. You are looking for a socialist/basic-income world but live in a capitalist one. That is the problem. If AI replaces people (it already does) without creating new jobs (it still does create them!), there is no route other than the political one. I will be dead and have (notably because of this) no kids, but I still vote for parties that understand this. Living in a country that requires you to work in the way you describe is already not OK and, if AI moves on, will collapse.

An AI winter might happen though if we don’t move on from this place.


Research moves forwards when people are willing to put time, money, and effort into it. Remember e-paper? Where'd that go? Patents made it hard to work on, and transmissive displays kept getting better, so we're still staring into flashlights while we're arguing about this.

One new law that explicitly redefines "fair use" to exclude "scraping half the entire internet and dumping everything you find into a training dataset", and that creates a new framework for properly licensing training data along with hefty penalties for distributing datasets without such licenses, would be a huge roadblock.

A grassroots effort of pissed-off artists starting a sideline in assassinating AI researchers would have a pretty chilling effect on the field, too. This may be a little extreme. There are probably solutions that don't go this far. It sure does make a pleasant revenge fantasy though. Time to go read some accounts of how the Unabomber was caught...


A new technology is developed with the potential to make you 100x more efficient at your job. Today, a creative artist can only contribute to a project through a narrow slice. Tomorrow, the same creative artist can single-handedly orchestrate an entire project.


It's more like there were tasks that were previously so unproductive they couldn't be done at all, and now they're productive enough you might be able to be employed doing them.

Automation creates jobs rather than destroying them. What destroys jobs is mainly bad macroeconomic conditions.


How very pessimistic. We should never shirk technological progress for fear of upsetting the status quo or established agenda. All of this is only a matter of time away from emerging. Have fun being on the forgotten side of history


Another day, another AI media generation project, and yet another comment by egypturnash lamenting the "death of the creative industry."


representative of the industry currently under threat of disruption is not happy about this and continues to be vocal about her unhappiness, film at 11


A representative of the luddite contingent perhaps.


This is a force multiplier. It doesn't take the place of artistic intent, dingus. Besides, you can't accomplish much with just "a model". This is an asset generator, hardly a threat to anyone, especially when these things will likely need some weight painting to touch up anyway.


I don’t think the commenter is upset that this particular model will be deployed, putting creative professionals out of work. It’s clearly a janky proof of concept. I think they’re upset about what follow on work could eventually mean.


tent cities by the beach can use the showers



