StreamDiffusion: A pipeline-level solution for real-time interactive generation (github.com/cumulo-autumn)
365 points by Flux159 8 months ago | 67 comments



Arxiv paper here https://arxiv.org/abs/2312.12491

I think it's possible to get faster than their default timings on a 4090 (I've been able to get 10fps without optimizations using SDXL Turbo and 1 iteration step), but their other improvements, like using a Stochastic Similarity Filter to skip unnecessary generations, are good for getting fast results without having to pin your GPU at 100% all the time.
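For reference, the unoptimized setup I mean is roughly this, using plain diffusers rather than StreamDiffusion's pipeline (just a sketch; the prompt and output path are placeholders, and the model id is the standard SDXL Turbo checkpoint):

    import torch
    from diffusers import AutoPipelineForText2Image

    # Plain diffusers, no TensorRT and no StreamDiffusion batching.
    pipe = AutoPipelineForText2Image.from_pretrained(
        "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
    ).to("cuda")

    # SDXL Turbo is distilled for a single denoising step with no
    # classifier-free guidance, which is what makes ~10fps feasible.
    image = pipe(
        "a photo of a red fox in the snow",  # placeholder prompt
        num_inference_steps=1,
        guidance_scale=0.0,
    ).images[0]
    image.save("out.png")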


This feels unreal. It feels like a decade passed within a year.


I can't wait until it can do my job; then I'll just run it on my PC and connect it to Slack so my employer will receive similar results to when I did it manually, and I'll be paid without spending any time actually working. I'll be able to focus on my hobbies for once. This is how this will all play out in the end, right?


> I will be paid without spending any time actually working

> This is how this will all play out in the end, right?

Somebody is gonna tell him, right? I don't want to be the one to crush such innocence.


You should try using AI to gauge sentiment.


It was sarcasm.


Ahh sorry, I missed that. My bad. Have run into too many people that legitimately believe this :/


Your employer can just replace you then and save the money


Start recording your thought process in great detail while solving problems, then train a model and sell the model to work as you, and roll in money (likely not, as you would be outcompeted by other models).


If you are the first, yes, for a time. Make sure you duplicate your work across 500 other jobs and become a millionaire, because it won't last long once everyone finds out.


Yep, you will totally be paid, don't worry!


The entire open source AI space feels like this right now. Basically every day there is some new advancement that makes something previously deemed impossible achievable, and it's actually really hard to keep up with all the changes.


100% agreed. I've been developing deep neural networks for over 10 years and this is just surreal.

On the bright side, one source of "sanity" that I'm finding is to review a collection of daily "hot" publications in AI/ML curated here: https://huggingface.co/papers


Thanks for sharing.


True. The fact that we now also have LLMs at 2.7B parameters that are actually coherent and useful for niche work, like Phi-2 running super fast at 25 t/s on my MacBook Air, or Mistral 0.2 at 12 t/s as a _somewhat_ knowledgeable assistant/coder, is crazy!

Coherent models required dozens of billions of parameters just a year ago, and something at Mistral's level was closer to 70B and reserved for server farms or crazy setups.

LocalLLaMA on Reddit has outdated info from just a few months ago.


Now, as a frontend developer, I understand why folks complain that the frontend landscape changes so fast it is impossible to keep up.


At least this is innovation, always moving forward. In the frontend space it's sometimes steps backward, reinventing the wheel, or switching just because a new tool looks shinier…


As a frontend developer, you understood why folks complain about the changing frontend landscape through... changes in ML, a completely different landscape from your own?

Are you an LLM?


This software develops faster than I can apt-get install it.


Not even ArchLinux’s rolling community repositories can keep up.

I’ve had to git clone everything like a package manager-less serf. Where even is my filesystem?!


Reminds me of an incremental game ( https://www.reddit.com/r/incremental_games/ ) - BTW, don't start playing one of those or you'll ruin your holidays... :)


This is the only thing that has me wondering whether this is how the progress curve feels in the opening act of the Singularity.


Generating images on high end hardware and then...?


We are at the bottom of the S-curve.


I just tried the realtime-text2img demo (it uses npm for the frontend, which I think is too much for this). Modified it to produce only 1 image instead of 16. Works well on a laptop with an RTX 3080. It's probably 2 images/sec.

EDIT: The `examples\screen` demo almost feels realtime. It says 4 fps in the window, but I don't know what that represents.

EDIT: Denoising in img2img is very low though, which means the returned image is only slightly different from the base image.


How's the actual quality, diversity, and alignment, though? I'm away from my GPU for a few days. It's always hard to judge generative papers without getting hands-on, because you write to the reviewers, which means you gotta cherry-pick (I think this is bad, but it's where we're at). They're using a tiny autoencoder? Artspew did that too and was getting higher FPS (they weren't using TensorRT, but were using Triton), but the quality was garbage (still cool). Regardless, these results are impressive even if the quality isn't anywhere near what's shown, but it's hard to tell.


If by diversity and alignment you mean not being disciplined to caricature and stigmatize racial stereotypes a la Disney films, that won't be coming from anywhere in Asia... Especially the "Asian" face makeup prevalent in NA, sometimes derogatorily called "Pocahontas" face. That one is an American special.


Not good, which is very understandable. You are not supposed to use the output as the final image. Find a good prompt/seed by iterating quickly, then go for a higher step count to render a higher quality image.
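That workflow, roughly, with plain diffusers (a sketch under my own assumptions, not this repo's API; the model id, prompt, and seed are placeholders):

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    prompt, seed = "a watercolor fox in a snowy forest", 1234  # placeholders

    # Cheap, fast pass to judge whether this prompt/seed is worth keeping.
    preview = pipe(
        prompt, num_inference_steps=8,
        generator=torch.Generator("cuda").manual_seed(seed),
    ).images[0]

    # Same seed (same starting noise), more steps for a higher quality render.
    final = pipe(
        prompt, num_inference_steps=50,
        generator=torch.Generator("cuda").manual_seed(seed),
    ).images[0]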


I use automatic1111's hires fix with SDXL Turbo on an RTX 4090 and it looks the best by far for high quality images once you find a good prompt/seed (but here's the thing: turning it on makes all the prompts/seeds that much better...)


I mean, there's nothing wrong with low diversity of outputs; it just has a different context and its uses are different. The problem is having all of this properly communicated so that we can evaluate properly.


Does 100fps mean I can provide a new input every 10 ms and get a new output every 10ms? Or do inputs need to be batched together to get that average throughput?


I haven't tried it, but taking an educated guess, I don't think batching should be required.

The slow part for models is loading the model up. But once the model is up, you can send it whatever input you want.

Parsing and sending the image data just doesn't pass my gut check as the bottleneck here.


That's not the issue; the issue is GPU utilization. Batching enables higher utilization and higher images-per-second throughput, but doesn't improve latency.


In essence, batching allows for more efficient usage of memory bandwidth. Without batching, you need to transfer the whole model from GPU memory to the GPU cores once for every image, which sets an upper bound on speed. With batching, the bottlenecks start showing up elsewhere.

For SD1.5, a 4090 is able to do ~17 it/s without batching and ~90-100 it/s with batching.

These numbers might be old at this point; I looked at it ~3 months ago.
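To illustrate what I mean at the diffusers level (a rough sketch, not StreamDiffusion's Stream Batch; the model id and prompt are just examples):

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    prompt = "an astronaut riding a horse"  # example prompt

    # One image per call: the full UNet weights stream through the GPU cores
    # for every single image, so memory bandwidth caps the throughput.
    single = pipe(prompt, num_inference_steps=20).images

    # Eight images per call: the same weight traffic is amortized over the
    # batch, so images/second goes up even though each call takes longer.
    batched = pipe([prompt] * 8, num_inference_steps=20).images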


Everyone does warmup before measuring. But measuring isn't always done right, because we should measure GPU time only, and some people naively use CPU time, which is problematic because the process is asynchronous (minimal sketch of what I mean at the end of this comment). They have a few timing scripts, though, and I'm away from my GPU. There are some interesting things, but it looks like they know how to time. It can also get confusing because: is it considering batches or not? Some works batch, some do single images. The only problem is when it isn't communicated correctly or is left ambiguous.

Their paper is unfortunately ambiguous. The abstract, intro, and conclusion suggest single-image generation by motivating with sequential generation (specifically mentioning the metaverse). The experiment section says:

> We note that we evaluate the throughput mainly via the average inference time per image through processing 100 images.

That implies batching, which fits their name, Stream Batch...

Looking at the code, I'm a bit confused. I'm away from my GPU so I can't run it. Maybe someone can let me know? This block[0] measures correctly but is using a downloaded image? Then it just opens the image in the preprocess? (multi looks identical.) This block[1] is using CPU time? But running on the GPU. (There's another like this.)

So I'm quite a bit confused tbh.

[0] https://github.com/cumulo-autumn/StreamDiffusion/blob/03e2a7...

[1] https://github.com/cumulo-autumn/StreamDiffusion/blob/03e2a7...
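For what it's worth, the timing pattern I mean (warmup, then CUDA events around the asynchronous launches rather than wall-clock time; `pipe` and `prompt` here are stand-ins for whatever generation call you're benchmarking):

    import torch

    # Warmup so compilation, caching, and allocations don't pollute the numbers.
    for _ in range(3):
        pipe(prompt, num_inference_steps=1)

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    torch.cuda.synchronize()
    start.record()
    n = 100
    for _ in range(n):
        pipe(prompt, num_inference_steps=1)
    end.record()
    torch.cuda.synchronize()  # kernel launches are async; wait before reading the clock

    ms_per_image = start.elapsed_time(end) / n
    print(f"{ms_per_image:.1f} ms/image, {1000 / ms_per_image:.1f} images/s")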


This more or less just worked as documented. Most of these demos tend to blow up and give really wonky deep errors.

Good job. Give it a try. Look into the server.py of realtime-txt2img to change the model if you want to generate something other than anime. Pointing it to say https://huggingface.co/runwayml/stable-diffusion-v1-5 works fine.

The results are genuinely fast. Not great, but fast. If you switch to SDXL via LCM-LoRA (https://huggingface.co/latent-consistency) you may get better results, but that's when it's going to get difficult and you'll start to run into those mysterious crashes I talked about, the kind that require, you know, actual work (rough sketch below).

my setup: 4090/3990x/CUDA 12.2/debian sid. ymmv.
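The model swap in plain diffusers looks roughly like this (a sketch, not realtime-txt2img's actual server.py, and I'm using the SD1.5 LCM-LoRA from the latent-consistency org rather than the SDXL one):

    import torch
    from diffusers import StableDiffusionPipeline, LCMScheduler

    # Point at whatever checkpoint you want instead of the default anime model.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # Optional: LCM-LoRA for few-step generation.
    pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
    pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

    image = pipe(
        "a lighthouse at dusk",  # placeholder prompt
        num_inference_steps=4,
        guidance_scale=1.0,
    ).images[0]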


How does the demo with the girl moving in and out of frame work? Is it ControlNet?


It's video input. From TFA:

> Stochastic Similarity Filter reduces processing during video input by minimizing conversion operations when there is little change from the previous frame, thereby alleviating GPU processing load, as shown by the red frame in the above GIF.
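In other words, something roughly like this (a hand-written sketch of the idea, not the repo's actual implementation; the helper name and threshold are made up):

    import torch
    import torch.nn.functional as F

    def probably_skip(prev_frame: torch.Tensor, cur_frame: torch.Tensor,
                      threshold: float = 0.98) -> bool:
        """Hypothetical sketch of a stochastic similarity filter."""
        sim = F.cosine_similarity(
            prev_frame.flatten().float(), cur_frame.flatten().float(), dim=0
        ).item()
        if sim < threshold:
            return False  # frame changed enough, always regenerate
        # The closer the frames are, the more likely we skip, but a static
        # scene still gets the occasional refresh instead of freezing forever.
        skip_prob = (sim - threshold) / (1.0 - threshold)
        return torch.rand(1).item() < skip_prob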


I think it’s just img2img with a prompt and RCFG scale, and no ControlNet, since there’s an open GitHub issue about adding ControlNet support at the moment.


So left is the source image and right is the resultant image?


Yes. Compare to AnimateDiff.


Maybe we're all living in a simulation^H^H^H^H^H pipeline-level solution for real-time interactive generation.


Maybe we're not?


What is the fps on Apple Silicon?


0 because there's no MPS support.

However, a Studio with an M1 Max 64GB is ~13x slower at generative AI with SD1.5 and SDXL than an RTX 4090 24GB at the same cost (~$1,800, refurb) right now.


> 0 because there's no MPS support. ... Studio with an M1 Max 64GB is ~13x slower at generative AI with SD1.5 and SDXL than an RTX 4090 24GB at the same cost (~$1,800, refurb)

Does the 4090 have a computer attached to it? It seems like with no computer, the speed would also be 0.


AI is best done in the Linux/Ubuntu/PyTorch/Nvidia ecosystem. Windows has some exposure thanks to WSL/Nvidia.

The Mac is not a great place for AI/ML yet. Both the hardware and the software present challenges. It'll take time.

When I was hacking on AI stuff on a MacBook, I had a second Framework laptop with an eGPU that I SSH'd into.


I think the tensor cores in the 4090 really help, and of course CUDA supporting every piece of hardware they offer (cough cough, ROCm) means that researchers are going to start there.

That said, I think Apple will have some interesting stuff in a year or two (M4 or more likely M5) where they can flex their NPU, Accelerate framework, and unified memory GPU and have it work with more modern requirements.

Time will tell what their software and hardware story is for local inference for generative AI.

Siri (dictation, some assistant stuff, and TTS) runs on device, and I doubt they want to undo that.

I doubt they will do much for training, but maybe a NUMA version of a MacPro with several M4 Ultras will prove me wrong?


> That said, I think Apple will have some interesting stuff in a year or two (M4 or more likely M5) where they can flex their NPU, Accelerate framework, and unified memory GPU and have it work with more modern requirements.

Plus two years for software support by the broader ecosystem.

Even Windows, with CUDA and drivers available, suffers from less support.


If we’re being snarky “Apple Silicon” won’t work without a motherboard and power supply either.


I think what's between the lines in the GP is that people blindly believing Apple's marketing graphs is annoying; Apple Silicon GPU marketing comparisons against NVIDIA GPUs are made against the laptop variants, which at some point were the exact same silicon as the desktop GPUs, software-limited to fit laptop power/cooling budgets, but that's no longer the case in the 30/40 series generations.


I get what you’re saying, but I don’t think there was snark. Just the fact that a 4090 without a computer attached won’t work. It’s not like you can buy Apple Silicon without a Mac attached.


You can just get a PCIe enclosure and use the hardware. Attaching it to a VM makes sense because of drivers, etc.


eGPUs don't work with Apple Silicon Macs, only Intel ones. We ran into a lot of the limitations early on, and this is the only reason we still have 2018 Mac Minis and 2019 MacBook Pros.

https://support.apple.com/en-us/102363

Five years, and still no solution. And somehow they're spinning memory bandwidth as some sort of prescient act of Apple genius for AI. It's insulting.


Hmm... you're right... I tried searching for ANY support... but there's really nothing yet.


You can imagine our confusion and surprise when we got our first M1s and had to lose a few displays and our eGPUs.

Apple made a strange choice with their hardware that effectively pushed our development to Linux and Windows. If Macs didn't make such nice front ends, they almost wouldn't have a place at all.


I run DrawThings with SDXL Turbo on my M1 Pro w/ 32GB RAM

I get a 512x512 5 step image generated in 5 seconds. No refiner, upscaler, or face restoration.

My understanding is that DrawThings hasn’t been optimized for SDXL Turbo and/or pipelined generation yet.

For reference: SDXL Base+Refiner with face restoration at 2k x 2k 50 step image generation takes about 120 seconds.


At least 1/8 or so, but yeah, getting it running on Apple silicon at at least 24fps would be huge. Some degree of interpolation might do it. You could maybe get away with 12fps, especially with an anime aesthetic, since that's basically animating on 2s.


Is there a video of it I can view anywhere?


Try clicking on the link?


Yo I just heard about MidJourney this year.

And this appears to be a local runtime stable diffusion streaming library?

Bruh.


Singularity is real, but it's people. Amazing fast-paced progress.


This paper is horribly written. It's like the authors are trying to sell me on themselves as researchers instead of helping me understand their research (y'know, the entire reason journals got started??). An entire section for "stream batching" was just too much, and none of their ideas were innovative or unique. It's incredibly dense simply because it's obfuscated, which makes me believe the authors themselves don't really understand what they're doing.

The results aren't even very good. They claim a 60x speedup, but compared to what? Hugging Face's Diffusers AutoPipeline... a company notorious for buggy code and inefficient pipelines. And that's for naively running the pipeline on every image. Give me a break.


> instead of helping me understand their research (y'know, the entire reason journals got started??)

ML is crazy right now, and people don't see papers as a means of researchers communicating with other researchers. You write papers to reviewers. But your reviewers are stochastic, so it's hard to write to them, because they may or may not be in your niche.

I'll add, though, that this isn't why journals were created, and that CS/ML doesn't typically use journals ({T,J}MLR, PAMI, and a few others exist, sure) and instead writes to conferences: fixed dates, zero-sum, a 1.5-shot setting (1 rebuttal, zero revisions). Journals were created for dissemination of papers, indirectly about communicating with one another, but you know... now we've got arXiv, and blogs and websites are sometimes way better, just like how papers got better with pictures thanks to computer graphics.


Somehow just hacking together code to create something is considered publishable these days. The code works, but it really is just pasted-together stuff from the last few weeks of research.


I don't think I have a problem with this, tbh, though this specifically looks more engineering and product oriented. What I do have a problem with is comparing papers across vastly different TRLs, and comparing works done with 100 GPU-years of compute to works with 1 GPU-year (or less). They're just completely different classes of work, and comparing them is devoid of context, you know?

The reason I don't have a problem is that I see papers as how we researchers communicate with other researchers. But I feel that's not how everyone sees them, and there's the aspect that this is how we're judged, so incentives get misaligned with the actual goal. Idk if the reward hacking is ironic or makes sense, because our job is to optimize. But don't let anyone try to convince you that reward (or any cost function) is enough.


The parameters and algorithms can be inferred from the code. Perhaps what’s unnecessary is the tradition of wrapping physical reality in human language semantics.

The complexity is in the hardware. Programming has only ever been templating desired machine state. Programmers fell into a religious-like state of seeing their more ornate efforts as essential to making a purpose-built counting machine count.



