Hacker News
Real-time image editing using latent consistency models (twitter.com/krea_ai)
220 points by dvrp 11 months ago | 107 comments



(Disclaimer: I'm an investor in Krea AI.)

When Diego first showed me this animation, I wasn't completely sure what I was looking at, because I assumed the left and right sides were like composited together or something. But it's a unified screen recording; the right, generated side is keeping pace with the riffing the artist does in the little paint program on the left.

There is no substitute for low latency in creative tools; if you have to sit there holding your breath every time you try something, you aren't just linearly slowed down. There are points that are simply too hard to reach in the slow, deliberate, 30+ second steps that classical diffusion generation requires.

When I first heard about consistency, my assumption was that it was just an accelerator. I expected we'd get faster, cheaper versions of the same kinds of interactions with visual models we're used to seeing. The fine hackers at Krea did not take long to prove me wrong!


Exactly.

There is no substitute for real-time when you're doing creative work.

That's why GitHub Copilot works so well; that's why ChatGPT struck a chord with people—it streamed the characters back to you quite fast.

At first, I was skeptical too. I asked myself, “What about Photoshop 1.0? They surely couldn't do it in real-time.” It turns out that even then you needed it. Of course, the compute wasn't there to do a simple translation of all the rasterized pixel values that form an image within a layer, but they had a trick: they showed you an outline that told you, the user, where the content _would_ render if you let go of the mouse.

You can see the workflow here:

> https://www.youtube.com/watch?v=ftaIzyrMDqE

It applies to general tools too; you can see the same in this Mac OS 8 demo (it runs in the browser!):

> https://infinitemac.org/1998/Mac%20OS%208.1


> that's why ChatGPT struck a chord with people—it streamed the characters back to you quite fast.

So did GPT-3, though. ChatGPT (3.5) was a bit faster, but not overly so.


And it did blow up! But not as much as it did once the UI changed to a (familiar) chat interface.

Good point! I agree with it but forgot to mention it: interaction matters.

With GitHub Copilot you are in familiar terrain, your code editor; with ChatGPT, you are talking to it the same way you'd talk to an assistant, via chat/email.

And we, at KREA, don't think AI for creativity will be the exception.


That's definitely true, the chat format (vs the completion format) made all the difference. So much so that ChatGPT blew up even though it was inferior in capabilities to GPT-3, just because it was (much) more usable.


As an investor, I hope you’re ready to bankroll the inevitable legal battles. These are not going to be restricted to the big players. Eleuther was recently sued, and they’re a nonprofit.

The moment you try to market this, you need to be prepared for the lawsuit. I’m preparing for one, and all I did was assemble a dataset. This model is built off of work which most people (rightly or wrongly) believe is not yours to sell.

I’m still not sure how I feel about it. I was forced to confront the question a few days ago, and I’ve been in a holding pattern since then. I’m not so much concerned about the lawsuits as getting the big question right. Ethics has a funny way of sneaking up on you in the long run.

At the very least, be prepared for a lengthy, grisly smear campaign. Two people wrote stories insinuating I somehow profited off of books3. Your crew will be profiting with intent.

One reason I’ve considered bowing out of ML is that I’d rather not be verbally spit on for the rest of eternity. It’s nice to have the support of colleagues, but unless you really care solely about money, you’ll be classified in the same bucket as Zuck: widely respected if successful by the people that matter, but never able to hold a normal relationship again. Most people probably prefer that tradeoff, but go into this with eyes wide open: you will be despised.

The way out is to help train a model on Creative Commons images. I don’t know if there’s enough data. And it’s certainly a bad idea to wait; your only chance of dominating this market is to iterate quickly, which means using existing models. But at this point, lawsuits are table stakes. You need to be prepared for when they happen, not if.

Also, join me in at least one sleepless night pondering the ethics of profiting off of this. Normally people only mention this as a social signal, not because they actually care. But if you sit down and think it through from first principles, the ethics (legality aside) is not at all clear. This also isn’t a case of a Snowmaker startup (https://x.com/snowmaker/status/1696026604030595497?s=61&t=jQ...); he notes that this only works when you have the general population on your side. All of those examples are of startups violating laws that people felt were dumb. Whereas I can tell you from firsthand trauma that copyright enthusiasts are religiously fanatical. Worse, they might be on the right side of the ethics question.

This was the first time in my life that a startup’s ethics gave me pause. Not just yours, but everyone who’s building creative tools off of these models. You’ll face a stiff headwind. Valve, for example, won’t approve any game containing any work generated by your tools. And everyone else is trying to build their own moat.

I’m not saying you should consider giving up. I’m saying, really sit down and go through the mental exercise of deciding if this is a battle you want to fight for at least three years legally and five years socially. I’m happy to provide examples of the type of abuse you and your team will face, ranging from “sticks and stones”-level insults to people directly calling for criminal liability (jail time). The latter is exceedingly unlikely, but being ostracized by the general public is not.

At the very least, you’ll need to have a solid answer prepared if you start hiring people and candidates ask for your stance. This comment is as much for your team as for you as an investor, since all of you will face these questions together.


Hi sillysaurusx. I'd love to get in contact with you. As someone who contributes datasets to academic NLP, you have a very unique and interesting perspective on this question.

I can't reach out to you via twitter as I am not a verified member, so I will reach out via email.


What’s your Twitter handle? I can DM you.

(It’s unfortunate that there’s no way for me to override that setting. The whole reason I made my DMs public is so people can contact me.)


@Hellisotherpe10

I don't use twitter much. I've already sent you a rather long email and would prefer that. I emailed shawnpresser@gmail.com


Are you an artist? Would you use this tool?


I'm an artist and I use Midjourney to work faster. You still need Photoshop skills, imagination, and aesthetic sense to get the job done.


Taste is still valuable. It will follow a power law dynamic—might finish my essay about it someday.


Taste is... hard to define. I know quite a lot of artists who lack taste, but nowadays you need to look for originality offline.

Online consumption shapes digital creative work significantly. AI won't help here either.


The SD community has been experimenting with real-time generation using LCM for the past few weeks now:

https://www.reddit.com/r/StableDiffusion/comments/17ovb4j/th...

https://www.reddit.com/r/StableDiffusion/comments/17kvpxn/re...

https://www.reddit.com/r/StableDiffusion/comments/17kekea/de...

https://www.reddit.com/r/StableDiffusion/comments/17ecdab/we...

There are more examples on that subreddit.

Given the pace at which features are added to the popular automatic1111 repo, this will be added soon, allowing you to try it for free on your own machine.


Someone just shared this on Twitter: https://huggingface.co/spaces/latent-consistency/super-fast-... You can play with it right now in your browser.


I think we are going to see insane things before end of year :)


Disclaimer: I'm actively working on this tool.

This is in a closed beta for now (while we work on provisioning enough GPU compute) but we're hoping to make this public later.


Is the model designed to run locally or does it run in the cloud somewhere?

Is it your own model or is it based on sdxl or something like that?


cloud, A100, so yeah, beast.

we are doing tests, will keep you guys posted.


Which cloud? I've been looking for somewhere to fine-tune Pixart, but EC2 et al. are usurious.


If you want to run a batch job, look into vast.ai.


Yeah, indeed. I know LCMs are way faster than diffusion models, but surely this demo is running on a beastly GPU...


yup, see comment above!


I searched for "latent consistency models" out of curiosity; I guess this is the project (with paper link):

https://github.com/luosiallen/latent-consistency-model


This is awesome. For future inputs please give me a 3d scene with camera controls into which I can drop primitive shapes, and human pose dolls so that I can frame shots like I would with a camera. I can't draw so even figuring out how to suggest "draw the image from the top left as the person looks over their shoulder" to the model is something I struggle with.


We haven't even deployed an open beta of this and you're already asking for 3D?

I'll never get tired of seeing the adaptation capabilities of human nature!

(But of course, we'd love to work on this!)


Come on, this is the wrong attitude. What they asked for could be prototyped in a day to see if it makes sense. Have the left image in the demo be fed from a rendering of a simple 3D scene with a camera and simple colored primitives.


I was being sarcastic! I’m super excited too!


Now make it 5D.


Introducing Tesseract Diffusion.


Do a search for "Blender GPT"; there are several ongoing efforts. I've tried a few, and it is exactly what you describe: basic 3D shapes, some basic rigged objects like pose figures and blocky vehicles. Just put them roughly where you want them, likewise with the camera and lights, and write your prompt. It's kinda ridiculous how fast and easy and, well, disposable all this is.


Great idea! It takes a surprising amount of artist know-how to illustrate actions using stick figures, but probably less so with something like this:

https://www.ikea.com/jp/ja/p/gestalta-artists-dummy-natural-...

Or maybe just pose yourself and upload a picture, trace it?


Yes. And what if I told you that you can prompt not just with basic shapes but with your custom assets? Or, even better, a model that understands your brand.


There are many Stable Diffusion plugins for Blender. Grey box a scene and “render” it with a prompt.


Yep! I saw YouTube videos of that workflow before posting the comment. I will say that I generally don't need Blender levels of control (or maybe I do and just don't know it yet) so a simplified set of primitives in a stripped down UI would be great.


Dude what


Does this have anything at all to do with LoRAs?


indeed; we're able to make it work with SDXL thanks to a new technique that got released yesterday called LCM-LoRA.

with LCM-LoRA you can turn models like SDXL into LCMs without the need for training, and you can add other style LoRAs like the ones you find on civit.ai

in case you're interested, here's the technical report about LCM-LoRA: https://arxiv.org/abs/2311.05556
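
if you want to try the rough idea at home, here's a minimal sketch using the Hugging Face diffusers library (the model IDs, prompt, and parameters below are illustrative, not our exact setup):

  # sketch: turn SDXL into an LCM by loading the LCM-LoRA adapter
  import torch
  from diffusers import DiffusionPipeline, LCMScheduler

  pipe = DiffusionPipeline.from_pretrained(
      "stabilityai/stable-diffusion-xl-base-1.0",
      torch_dtype=torch.float16,
      variant="fp16",
  ).to("cuda")

  # swap in the LCM scheduler and load the adapter; no retraining of SDXL needed
  pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
  pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")

  # 4 steps and low guidance instead of the usual ~30-50 steps
  image = pipe(
      "a watercolor mushroom in a forest",
      num_inference_steps=4,
      guidance_scale=1.0,
  ).images[0]
  image.save("mushroom.png")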


Yes, they are probably using: https://huggingface.co/blog/lcm_lora


Wow, this is extremely impressive, unless I'm missing a catch?


You're not missing anything, it's just a very clever technique.


That's fantastic news then!


apolinario is a friend:)


why is it faster than latent diffusion models?


it uses a new technique called "consistency" that lets latent diffusion models predict images in far fewer steps.

some links here:

- https://arxiv.org/abs/2310.04378

- https://arxiv.org/abs/2311.05556


The architecture is the same, but it has an additional model trained on the sampling trajectory computation, speeding up the process. So instead of 20 iterations with the original sampler that solves an SDE, you could use 8, 4, or sometimes even 2 iterations with this new ML-based sampler. That's not a new idea, but the author of this adapter figured out how to train it efficiently using very few GPU-hours.


I wonder if they could do a real-time preview using 2-4 iterations, and then 'refine' the quality with a higher number of iterations over time, much like how 3D editors preview ray-tracing.


I've lost the link, but IIRC someone on twitter demoed exactly that in Blender. "Neural rendering" in a 3D editor has some nice properties - the scene can be rendered into separate layers with separate depth maps and openpose skeletons generated from the rigs to serve as input for controlnets, objects can be tagged and tags composed into the prompt for the particular tile in case of tiled rendering, etc.


probably not, but are you talking about the guy who used ControlNet and AnimateDiff to make a video that looked relatively consistent?


No, it was running in the Blender viewport using the original LCM Dreamshaper model.


we’re thinking of doing that as well!

but first we want to make sure we can get this in the hands of a tight cohort of creatives and see how they use it.


latent diffusion is an iterative process; the image becomes clearer one step at a time.

The process can be viewed as a particle moving through the image space, one step at a time, toward its final position, which is the generated image.

a consistency model tries to predict the movement trajectory given the current position in the image space. Hence, what used to be a step-by-step process becomes a one-step process.
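
In very loose toy code, the difference in sampling structure looks like this (the real step function and consistency function are neural networks; the dummy arithmetic below is only there to show the shape of the computation):

  import numpy as np

  def denoise_step(x, t):
      # stand-in for one step of a diffusion sampler
      return 0.9 * x  # dummy update

  def consistency_fn(x, t):
      # stand-in for the learned consistency function f(x_t, t) -> x_0,
      # which jumps straight to the predicted end of the trajectory
      return (0.9 ** t) * x  # dummy: all remaining steps collapsed into one

  x = np.random.randn(4, 64, 64)  # the "particle": a random starting latent

  # classic diffusion: many small steps toward the final position
  x_diffusion = x
  for t in range(20, 0, -1):
      x_diffusion = denoise_step(x_diffusion, t)

  # consistency model: a single evaluation from the current position
  x_consistency = consistency_fn(x, 20)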


Would that motion vector have to include like a color delta?


If you are talking about latent diffusion, no: the "particle" is in a hyper-dimensional space, for example a 10k-dimensional space. We are not supposed to interpret the meaning of that vector.

and when that particle has moved to the right location, there is a decoder that converts it into an image. The decoder network knows how to interpret it.
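
For reference, here is a rough sketch of that last decoding step with the diffusers library (the model ID and latent shape are illustrative; a random latent decodes to noise, of course, but it shows where the image comes from):

  import torch
  from diffusers import AutoencoderKL

  # the VAE that ships with SDXL; its decoder turns latents into pixels
  vae = AutoencoderKL.from_pretrained(
      "stabilityai/stable-diffusion-xl-base-1.0", subfolder="vae"
  )

  # a 1x4x96x96 latent "particle" decodes into a 1x3x768x768 image tensor
  latents = torch.randn(1, 4, 96, 96)
  with torch.no_grad():
      image = vae.decode(latents / vae.config.scaling_factor).sample
  print(image.shape)  # torch.Size([1, 3, 768, 768])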


Exactly.

I actually wrote a micro essay on Twitter the other day about the meaning of the classic encoder-decoder network. It’s beautiful.

But yeah!

-

For reference: https://x.com/asciidiego/status/1722544108252836119


Does this mean that you need this vector state in order to generate the next step? I.e., I can't take an image (the pixels) and the prompt and run a few more steps on it?


oh wow, never thought of it that way


> a consistency model tries to predict the movement trajectory given the current position in the image space. Hence, what used to be a step-by-step process becomes a one-step process.

no, that wasn't a sufficient explanation for me. what is the prediction method here? why was diffusion necessary in the past? what tradeoffs does this approach have?


> what tradeoffs does this approach have?

From playing around with it a bit locally, LCM is much, much faster, but generally the detail is much lower using the latest SDXL model.

If your prompt is very simple, such as "a boy looking at the moon in a forest", it does pretty well. If your prompt is much more complex and asks for a lot more detail or uses other LoRAs, it doesn't do nearly as well as other samplers and generates lower-quality, worse-matching images. Those other samplers take 30-40 steps, though, so they're several times slower.

From what I've seen though, if you use ControlNet or pass in some guide images and rely on a simple prompt (say, an existing video whose style you're trying to change), LCM can generate images in near real time like the OP on an RTX 4090, and maybe on slower cards with smaller/older models.

Another benefit is the decreased experimentation time: you can more quickly iterate over seeds to find output you like, and then spend some time with other samplers/upscalers on that seed to make the result higher quality.
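
Roughly, that seed-scanning workflow looks like this with the diffusers library (model IDs, step counts, and the chosen seed are illustrative; how much of a draft's composition carries over to the slow pass isn't guaranteed):

  import torch
  from diffusers import DiffusionPipeline, LCMScheduler, EulerDiscreteScheduler

  pipe = DiffusionPipeline.from_pretrained(
      "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
  ).to("cuda")
  prompt = "a boy looking at the moon in a forest"

  # fast pass: LCM-LoRA at 4 steps, scan a batch of seeds quickly
  pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
  pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
  for seed in range(16):
      g = torch.Generator("cuda").manual_seed(seed)
      pipe(prompt, num_inference_steps=4, guidance_scale=1.0,
           generator=g).images[0].save(f"draft_{seed}.png")

  # slow pass: re-run the seed you liked with a regular sampler at 40 steps
  pipe.unload_lora_weights()
  pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)
  g = torch.Generator("cuda").manual_seed(7)  # whichever draft looked best
  pipe(prompt, num_inference_steps=40, guidance_scale=7.5,
       generator=g).images[0].save("final.png")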


If you use a tool like ComfyUI, you can use LCM-LoRA to generate fast but not perfect outputs and then refine them with an old-school sampler. I’ve been playing around and have found the quality of LCM to be excellent when used in combination with IP-Adapter instead of text prompts for conditioning.


Is it possible to use the seed from the LCM on the full-blown model to get a more detailed image, or is it a different latent space/decoder?


exactly. and there are more things you can do in terms of what you use as input for the NN!


in my defense, if you look at the original paper

https://arxiv.org/pdf/2303.01469.pdf

the exact neural network used for the prediction method is omitted. apparently many neural networks can be used for this prediction method as long as they fulfill certain requirements.

> why was diffusion necessary in the past?

in the paper, one way to train a consistency model is distilling an existing diffusion model. But it can be trained from scratch too.

"why was it necessary in the past " doesn't bother me that much. Before people know to use molding to make candles, they did it by dipping threads into wax. Why was thread dipping necessary? it's just a stepping stone of technology development.


Exactly.

The stepping-stone way of seeing things reminds me a lot of the thesis behind the book “Why Greatness Cannot Be Planned: The Myth of the Objective” (2015).


from https://arxiv.org/abs/2310.04378 it sounds like it's a form of distillation of an SD model. So I'm guessing it can't be directly trained, but once you have a trained diffusion model you can distill a predictor which cuts out the iterative steps.

While it can do 1-step generation, the output quality looks a ton better with additional steps.


You can train it directly. This is from the paper “An LCM demands merely 32 A100 GPUs Hours training for 2-step inference [...]”


Let's quote the whole sentence, shall we?

An LCM demands merely 32 A100 GPUs Hours training for 2-step inference, as depicted in Figure 1. [1]

Now let's look at the caption under Figure 1:

LCMs can be distilled from any pre-trained Stable Diffusion (SD) in only 4,000 training steps (∼32 A100 GPU Hours) for generating high quality 768×768 resolution images in 2∼4 steps or even one step, significantly accelerating text-to-image generation.

The training mentioned in your quote is distillation: it requires a previously trained SD model.

You could tell just by reading the introduction:

We propose a one-stage guided distillation method to efficiently convert a pretrained guided diffusion model into a latent consistency model by solving an augmented PF-ODE.

[1] Section 4.2 in https://arxiv.org/abs/2310.04378


Cool! The real-time feedback will have enormous ramifications for art creation workflows.


Exactly! Everything will change.

And this is without taking into account WebGPU and other advances in adjacent fields.


Nvidia had a solution like this several years ago. Something something studio.


that’s kind of different though


Is it different in the underlying methods and models that are used?

This is the Nvidia tool: https://www.nvidia.com/en-us/studio/canvas/


Yup! It is different; that one is based on GAN models.

That's why you can only use what they let you paint. In our demo, you can put whatever text you want.

And we want to make it so that you can assign a (custom) label to each shape. So you can type what each shape represents.


Could you apply the same technique for real-time video editing?


it would be interesting to play with LCM-LoRAs and AnimateDiff and see how much this technique can speed up video generation.

not sure if it's possible to just plug-and-play it or if we would need an extra LCM-LoRA for the motion module.

once we have these sorts of models producing frames in milliseconds we should be able to do something similar to this demo but with videos.


Color me underwhelmed.


what does it take to impress you?


There was a webcam demo a little while back that was pretty cool. It's mostly that the right-hand side seems to only mildly follow the patterns drawn on the left-hand side. It still seems sort of useful, but once it's running at GAN speeds (30-60 fps) and adheres more strongly to the input, people will find it both useful and a joy to work with.

Based on my general experience with text-to-image stuff, I assume this lack of adherence isn't always the case? Maybe another demo could show what it's like under ideal conditions.

edit: Webcam demo

https://github.com/radames/Real-Time-Latent-Consistency-Mode...

It does, however, exhibit similar issues, but the real-time constraints of live video make it quite interesting.


I agree with your points but I think this is the beginning of something bigger


Definitely. I've been looking at all this stuff way too closely for a while now, so I doubt my opinion is representative. Looks like there's positive reception coming from others in the comments here. I think the general cycle of "new model/technique - rush to implement - immediately obsoleted by another new model/technique - rush to implement..." has me a little burned out. Sorry about the dismissive comment.


Oh interesting, in our case it fuels us!

Perhaps it's because we've been doing generative AI for years now haha


Like you, I'm similarly situated -- and I'll probably take a few months (to a year) break from following this stuff too closely (e.g. trying to get these running locally on my own). I have a suspicion that things will be a bit more lightweight and/or optimized by then, and it'll be way more tractable+economical to play with and selfhost many of these models.


Something something <Best way to predict the future quote>


Still too slow to be used for real work but it's cool to see progress


I disagree.

A compromise: How about near-real time, would that pass under your BS-threshold?


Not sure what you mean by my BS-threshold, but yes I would describe this tool as "near real time". It has too much latency to be used as a professional tool for artists, though perhaps that is not the intended audience.


Re BS: it's me. It's my way of casting doubt on business claims, as I have a callous regard for aspirational characters when money is an issue, since money and human labor (and often suffering) become cleverly undervalued through such claims.


Our users are already using our tool professionally—even though latency is >5s at times.

I have to disagree with you.

Of course, people will use it in a weird way—which is exciting.


Blender takes >2-3 seconds to apply all transformations to a wireframe view and show the final result. This is faster than that. Creatives would be happy with this.


Blender offers much more precise controls, though... You can't make here what you can make in Blender, and probably never will.


Well, you are comparing a child (current tech) to an adult (Blender), to use Newton's words. Also, those precise controls come at the price of more workflow to deal with.


So by your logic Blender will always be ahead of this because it's always more mature?

Anyway, it's not about maturity but about control over the process; you can compare this to AR 3D drawing tools, which are also very new but already used to make actual art.


Depends on the acceleration of each tech, its inherent ease, and algorithmic progress.


Exactly! And I think there is a lot of room for optimization.


Really? Some users are already using KREA professionally.

FCB for example.


...yes? If you're marketing a tool as being "real time" I'd expect it to not have this much latency (e.g. when the blue is added to the mushroom). Imagine if Photoshop made you wait a second between each brush stroke.


That’s how it started!


How do you plan on stopping people from using this tool maliciously?


I'm an illustrator. I'm fully capable of drawing/painting political figures in compromising or worse situations. Should the government blind me because my skills are dangerous?


Some politicians would say: "Yes."

Artists, intellectuals, journalists, and critics of all sorts have been jailed, beaten, or simply murdered in the past for expressing themselves in ways the government of the time did not like.


I think your situation is different, in that a) people who spend many years learning to be good illustrators tend to have standards for what they create in ways that, say, virulent racists using ML tools don't; and b) people rarely take illustrations as evidence of things that happened in real life, whereas they will do that with ML-generated fake photographs.


I think, like Paul Graham would say, we'll develop societal antibodies against this.

And, like Sam Altman would say, it is going to be net good for society, but that doesn’t mean there aren’t bumps along the way. We will need to learn to navigate them well.


Those are articles of religious faith, not actual arguments.


Oh, interesting!

What do you think of this technology? What do you envision?


The same way you stop people from using a hammer maliciously?


In the general sense yes, but I wonder if there will be unexpected things that we’ll need to take into account with this new generation of tools.

Part of me thinks that this is another revolution in graphics, the same way Photoshop was, where you can work 10x faster. But another part of me ponders what happens when we’re dealing with intelligence.


> But another part of me ponders what happens when we’re dealing with intelligence.

This is a good point, diffusion models are an example of intelligence. Proof of that is that they became ubiquitous in the same year as large language models, thus they are the same.



