When Diego first showed me this animation, I wasn't completely sure what I was looking at, because I assumed the left and right sides were like composited together or something. But it's a unified screen recording; the right, generated side is keeping pace with the riffing the artist does in the little paint program on the left.
There is no substitute for low latency in creative tools; if you have to sit there holding your breath every time you try something, you aren't just linearly slowed down. There are points that are just too hard to reach in slow, deliberate, 30+ second steps that a classical diffusion generation requires.
When I first heard about consistency, my assumption was that it was just an accelerator. I expected we'd get faster, cheaper versions of the same kinds of interactions with visual models we're used to seeing. The fine hackers at Krea did not take long to prove me wrong!
There is no substitute for real-time when you're doing creative work.
That's why GitHub Copilot works so well; that's why ChatGPT struck a chord with people—it streamed the characters back to you quite fast.
At first, I was skeptical too. I asked myself, “What about Photoshop 1.0? They surely couldn't do it in real time.” It turns out that even then you needed it. Of course, the compute wasn't there to do even a simple translation of all the rasterized pixel values that form an image within a layer, but there was a trick they used: they showed you an outline that told you, the user, where the content _would_ render once you released the mouse.
And it did blow up! But not as much as changing the UI towards a (familiar) chat interface.
Good point! I agree with it but forgot to mention it: interaction matters.
With GitHub Copilot you are in familiar terrain, your code editor; with ChatGPT, you are talking to it the same way you'd talk to an assistant, via chat/email.
And we, at KREA, don't think AI for creativity will be the exception.
That's definitely true, the chat format (vs the completion format) made all the difference. So much so that ChatGPT blew up even though it was inferior in capabilities to GPT-3, just because it was (much) more usable.
As an investor, I hope you’re ready to bankroll the inevitable legal battles. These are not going to be restricted to the big players. Eleuther was recently sued, and they’re a nonprofit.
The moment you try to market this, you need to be prepared for the lawsuit. I’m preparing for one, and all I did was assemble a dataset. This model is built off of work which most people (rightly or wrongly) believe is not yours to sell.
I’m still not sure how I feel about it. I was forced to confront the question a few days ago, and I’ve been in a holding pattern since then. I’m not so much concerned about the lawsuits as getting the big question right. Ethics has a funny way of sneaking up on you in the long run.
At the very least, be prepared for a lengthy, grisly smear campaign. Two people wrote stories insinuating I somehow profited off of books3. Your crew will be profiting with intent.
One reason I’ve considered bowing out of ML is that I’d rather not be verbally spit on for the rest of eternity. It’s nice to have the support of colleagues, but unless you really care solely about money, you’ll be classified in the same bucket as Zuck: widely respected if successful by the people that matter, but never able to hold a normal relationship again. Most people probably prefer that tradeoff, but go into this with eyes wide open: you will be despised.
The way out is to help train a model on Creative Commons images. I don’t know if there’s enough data. And it’s certainly a bad idea to wait; your only chance of dominating this market is to iterate quickly, which means using existing models. But at this point, lawsuits are table stakes. You need to be prepared for when they happen, not if.
Also, join me in at least one sleepless night pondering the ethics of profiting off of this. Normally people only mention this as a social signal, not because they actually care. But if you sit down and think it through from first principles, the ethics (legality aside) is not at all clear. This also isn’t a case of a Snowmaker startup (https://x.com/snowmaker/status/1696026604030595497?s=61&t=jQ...); he notes that this only works when you have the general population on your side. All of those examples are of startups violating laws that people felt were dumb. Whereas I can tell you from firsthand trauma that copyright enthusiasts are religiously fanatical. Worse, they might be on the right side of the ethics question.
This was the first time in my life that a startup’s ethics gave me pause. Not just yours, but everyone who’s building creative tools off of these models. You’ll face a stiff headwind. Valve, for example, won’t approve any game containing any work generated by your tools. And everyone else is trying to build their own moat.
I’m not saying to consider giving up. I’m saying, really sit down and go through the mental exercise of deciding whether this is a battle you want to fight for at least three years legally and five years socially. I’m happy to provide examples of the type of abuse you and your team will face, ranging from sticks-and-stones-level insults to people directly calling for criminal liability (jail time). The latter is exceedingly unlikely, but being ostracized by the general public is not.
At the very least, you’ll need to have a solid answer prepared if you start hiring people and candidates ask for your stance. This comment is as much for your team as for you as an investor, since all of you will face these questions together.
Hi sillysaurusx. I'd love to get in contact with you. As someone who contributes datasets to academic NLP, you have a very unique and interesting perspective on this question.
I can't reach out to you via twitter as I am not a verified member, so I will reach out via email.
Given the pace at which features are added to the popular automatic1111 repo, this will be added soon, allowing you to try it for free on your own machine.
This is awesome. For future inputs, please give me a 3D scene with camera controls into which I can drop primitive shapes and human pose dolls, so that I can frame shots like I would with a camera. I can't draw, so even figuring out how to suggest "draw the image from the top left as the person looks over their shoulder" to the model is something I struggle with.
Come on, this is the wrong attitude. What they asked for could be prototyped in a day to see if it makes sense.
Have the left image in the demo be fed from a rendering of a simple 3D scene with a camera and simple colored primitives.
Do a search for "Blender GPT"; there are several ongoing efforts. I've tried a few, and it is exactly what you describe: basic 3D shapes, some basic rigged objects like pose figures and blocky vehicles. Just put them roughly where you want them, likewise with the camera and lights, and write your prompt. It's kinda ridiculous how fast and easy and, well, disposable all this is.
Yep! I saw YouTube videos of that workflow before posting the comment. I will say that I generally don't need Blender levels of control (or maybe I do and just don't know it yet) so a simplified set of primitives in a stripped down UI would be great.
The architecture is the same, but it has an additional model trained to predict the sampling trajectory, speeding up the process. So instead of 20 iterations with the original sampler that solves an SDE, you can use 8, 4, or sometimes even 2 iterations with this new ML-based sampler. That's not a new idea, but the author of this adapter figured out how to train it efficiently using very few GPU-hours.
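For anyone who wants to poke at this locally, here's a minimal sketch of that swap using Hugging Face diffusers and the published LCM-LoRA adapter; the repo IDs below are the commonly used ones and may not match what the demo actually runs:

    import torch
    from diffusers import DiffusionPipeline, LCMScheduler

    pipe = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")

    # Replace the default ODE/SDE solver with the consistency-model sampler
    pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
    # Load the LCM-LoRA adapter distilled from the base model
    pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")

    # 4 steps instead of the usual 20-50; guidance is kept low for LCM
    image = pipe(
        "a boy looking at the moon in a forest",
        num_inference_steps=4,
        guidance_scale=1.0,
    ).images[0]
    image.save("lcm_preview.png")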
I wonder if they could do a real-time preview using 2-4 iterations, and then 'refine' the quality with a higher number of iterations over time, much like how 3D editors preview ray-tracing.
I've lost the link, but IIRC someone on twitter demoed exactly that in Blender. "Neural rendering" in a 3D editor has some nice properties - the scene can be rendered into separate layers with separate depth maps and openpose skeletons generated from the rigs to serve as input for controlnets, objects can be tagged and tags composed into the prompt for the particular tile in case of tiled rendering, etc.
Latent diffusion is an iterative process: the image becomes clearer one step at a time.
The process can be viewed as a particle moving through the image space, one step at a time, toward its final position, which is the generated image.
A consistency model tries to predict the endpoint of that trajectory given the current position in the image space. Hence, what used to be a step-by-step process becomes a one-step process.
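To make that concrete, here's a toy sketch of the two sampling loops; `predict_noise` and `predict_endpoint` are made-up stand-ins for the real networks, so this only illustrates the control flow, not the math:

    import numpy as np

    # Made-up stand-ins for the real networks, just so the control flow runs;
    # in reality both are large U-Nets operating on latents.
    def predict_noise(x, t):
        return 0.1 * x        # diffusion model: estimates the noise in x at time t

    def predict_endpoint(x):
        return 0.0 * x        # consistency model: jumps straight to the endpoint

    def diffusion_sample(x_T, steps=20):
        # The "particle" x walks toward the final image one small step at a time.
        x = x_T
        for t in reversed(range(steps)):
            x = x - predict_noise(x, t)   # each solver step nudges x along its trajectory
        return x

    def consistency_sample(x_T):
        # One network evaluation maps any point on the trajectory to its endpoint.
        return predict_endpoint(x_T)

    x_T = np.random.randn(4, 64, 64)      # latent-sized noise
    print(diffusion_sample(x_T).shape, consistency_sample(x_T).shape)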
If you are talking about latent diffusion, no: the "particle" is in a high-dimensional space, for example a 10k-dimensional one. We are not supposed to interpret the meaning of that vector.
And when that particle has moved to the right location, there is a decoder that converts it into an image. The decoder network knows how to interpret it.
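If you want to see what that decoder looks like in practice, here's a sketch using the standard Stable Diffusion VAE from diffusers; the repo ID and the 4x64x64 latent shape are the usual SD convention, not anything specific to this demo:

    import torch
    from diffusers import AutoencoderKL

    # A commonly used SD VAE checkpoint (illustrative choice).
    vae = AutoencoderKL.from_pretrained(
        "stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16
    ).to("cuda")

    # The "particle": a 4x64x64 latent, i.e. a ~16k-dimensional vector.
    latents = torch.randn(1, 4, 64, 64, dtype=torch.float16, device="cuda")

    with torch.no_grad():
        # Undo the SD latent scaling, then decode to a 3x512x512 pixel image.
        image = vae.decode(latents / vae.config.scaling_factor).sample
    print(image.shape)  # torch.Size([1, 3, 512, 512])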
Does this mean that you need this vector state in order to generate the next step? I.e., I can't take an image (the pixels) and the prompt and run a few more steps on it?
> A consistency model tries to predict the endpoint of that trajectory given the current position in the image space. Hence, what used to be a step-by-step process becomes a one-step process.
No, that wasn't a sufficient explanation for me. What is the prediction method here? Why was diffusion necessary in the past? What tradeoffs does this approach have?
From playing around with it a bit locally, LCM is much, much faster, but generally the detail is much lower using the latest SDXL model.
If your prompt is very simple, such as "a boy looking at the moon in a forest", it does pretty well. If your prompt is much more complex and asks for a lot more detail or uses other LoRAs, it doesn't do nearly as well as other samplers and generates lower-quality, worse-matching images. Those other samplers take 30-40 steps, though, so they're several times slower.
From what I've seen, though, if you use ControlNet or pass in some guide images and rely on a simple prompt (say, an existing video whose style you're trying to change), LCM can generate images in near real time like the OP on an RTX 4090, and maybe on slower cards with smaller/older models.
Another benefit is the decreased experimentation time: you can iterate over seeds more quickly to find output you like, and then spend some time with other samplers/upscalers on that seed to make the result higher quality.
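For reference, here's roughly what that ControlNet-plus-LCM setup looks like in diffusers; the canny ControlNet and the blank guide image are just placeholders for whatever per-frame conditioning you'd actually extract:

    import numpy as np
    import torch
    from PIL import Image
    from diffusers import ControlNetModel, LCMScheduler, StableDiffusionControlNetPipeline

    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
    ).to("cuda")
    pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
    pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

    # Stand-in for a canny-edge map extracted from a video frame.
    guide = Image.fromarray(np.zeros((512, 512, 3), dtype=np.uint8))

    # Simple prompt plus strong structural conditioning, only a few steps per frame.
    frame = pipe(
        "watercolor style",
        image=guide,
        num_inference_steps=4,
        guidance_scale=1.5,
    ).images[0]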
If you use a tool like ComfyUI, you can use LCM-LoRa to generate fast but not perfect outputs and then refine them with the old school sampler. I’ve been playing around and have found the quality of LCM to be excellent when used in combination with IP Adapter instead of text prompts for conditioning.
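Not the ComfyUI graph itself, but a rough diffusers sketch of the same draft-then-refine idea (model IDs are illustrative, and the refiner pass simply reuses the base pipeline's components through img2img):

    import torch
    from diffusers import (AutoPipelineForImage2Image, AutoPipelineForText2Image,
                           DPMSolverMultistepScheduler, LCMScheduler)

    base = AutoPipelineForText2Image.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    base.scheduler = LCMScheduler.from_config(base.scheduler.config)
    base.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

    prompt = "a cozy cabin in a snowy forest at dusk"

    # Fast, rough draft: a few LCM steps to lock in the composition.
    draft = base(prompt, num_inference_steps=4, guidance_scale=1.0).images[0]

    # Refine with a conventional sampler at a higher step count via img2img.
    refiner = AutoPipelineForImage2Image.from_pipe(base)
    refiner.unload_lora_weights()
    refiner.scheduler = DPMSolverMultistepScheduler.from_config(refiner.scheduler.config)
    final = refiner(prompt, image=draft, strength=0.4,
                    num_inference_steps=30).images[0]
    final.save("refined.png")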
The exact neural network used for the prediction is left unspecified; apparently many networks can be used for this prediction method as long as they fulfill certain requirements.
> Why was diffusion necessary in the past?
In the paper, one way to train a consistency model is by distilling an existing diffusion model. But it can be trained from scratch too.
"why was it necessary in the past " doesn't bother me that much. Before people know to use molding to make candles, they did it by dipping threads into wax. Why was thread dipping necessary? it's just a stepping stone of technology development.
The stepping-stone way of seeing things reminds me a lot of the thesis behind the book “Why Greatness Cannot Be Planned: The Myth of the Objective” (2015).
From https://arxiv.org/abs/2310.04378 it sounds like it's a form of distillation of an SD model. So I'm guessing it can't be directly trained, but once you have a trained diffusion model you can distill a predictor which cuts out the iterative steps.
While it can do 1-step generation, the output quality looks a ton better with additional steps.
> An LCM demands merely 32 A100 GPU Hours of training for 2-step inference, as depicted in Figure 1. [1]
Now let's look at the caption under Figure 1:
> LCMs can be distilled from any pre-trained Stable Diffusion (SD) in only 4,000 training steps (∼32 A100 GPU Hours) for generating high quality 768×768 resolution images in 2∼4 steps or even one step, significantly accelerating text-to-image generation.
The training mentioned in your quote is distillation: it requires a previously trained SD model.
You could tell just by reading the introduction:
> We propose a one-stage guided distillation method to efficiently convert a pretrained guided diffusion model into a latent consistency model by solving an augmented PF-ODE.
There was a webcam demo a little while back that was pretty cool. It's mostly that the right-hand side seems to only mildly follow the patterns drawn on the left-hand side. It still seems sort of useful, but once it's running at GAN speeds (30-60 fps) and adheres more strongly to the input, people will find it both useful and a joy to work with.
Based on my general experience with text-to-image stuff, I assume this lack of adherence isn't always the case? Maybe another demo could show what it's like under ideal conditions.
Definitely. I've been looking at all this stuff way too closely for a while now, so I doubt my opinion is representative. Looks like there's positive reception coming from others in the comments here. I think the general cycle of "new model/technique - rush to implement - immediately obsoleted by another new model/technique - rush to implement..." has me a little burned out. Sorry about the dismissive comment.
Like you, I'm similarly situated -- and I'll probably take a few months (to a year) break from following this stuff too closely (e.g. trying to get these running locally on my own). I have a suspicion that things will be a bit more lightweight and/or optimized by then, and it'll be way more tractable+economical to play with and selfhost many of these models.
Not sure what you mean by my BS-threshold, but yes I would describe this tool as "near real time". It has too much latency to be used as a professional tool for artists, though perhaps that is not the intended audience.
Re BS: it's me. It's my way of casting doubt on business claims, as I have a callous regard for aspirational characterizations when money is an issue, since money and human labor (and often suffering) become cleverly undervalued through such claims.
Blender takes more than 2-3 seconds to apply all transformations to the wireframe view and show the final result. This is faster than that. Creatives would be happy with this.
Well, you are comparing a child (current tech) to an adult (Blender), to use Newton's words. Also, those precise controls come at the price of more workflow to deal with.
So by your logic Blender will always be ahead of this because it's always more mature?
Anyway, it's not about maturity but about control over the process. You can compare this to AR 3D drawing tools, which are also very new but already used to make actual art.
...yes? If you're marketing a tool as being "real time" I'd expect it to not have this much latency (e.g. when the blue is added to the mushroom). Imagine if Photoshop made you wait a second between each brush stroke.
I'm an illustrator. I'm fully capable of drawing/painting political figures in compromising or worse situations. Should the government blind me because my skills are dangerous?
Artists, intellectuals, journalists, and critics of all sorts have been jailed, beaten, or simply murdered in the past for expressing themselves in ways the government of the time did not like.
I think your situation is different, in that a) people who spend many years learning to be good illustrators tend to have standards for what they create in ways that, say, virulent racists using ML tools don't; and b) people rarely take illustrations as evidence of things that happened in real life, whereas they will do that with ML-generated fake photographs.
I think, like Paul Graham would say, we'll develop societal antibodies against this.
And, like Sam Altman would say, it is going to be net good for society, but that doesn’t mean there aren’t bumps along the way. We will need to learn to navigate them well.
In the general sense yes, but I wonder if there will be unexpected things that we’ll need to take into account with this new generation of tools.
Part of me thinks that this is another revolution in graphics, the same way Photoshop was, where you can work 10x faster. But another part of me ponders what happens when we're dealing with intelligence.
> But another part of me ponders what happens when we're dealing with intelligence.
This is a good point; diffusion models are an example of intelligence. Proof of that is that they became ubiquitous in the same year as large language models, thus they are the same.