I’ve been playing with diffusion a ton for the past few months, writing a new sampler that implements an iterative blending technique described in a recent paper. The latent space is rich in semantic information, so it can be a great place to apply various transformations rather than operating on the image directly. Yet it still has a significant spatial component, so things you do in one spatial area will affect that same area of the image.
Stable Diffusion 1.5 may be quite old now, but it is an incredibly rich model that still yields shockingly good results. SDXL is newer and more high tech, but it’s not a revolutionary improvement. It can be less malleable than the older model and harder to work with to achieve a given desired result.
> It can be less malleable than the older model and harder to work with to achieve a given desired result.
That has been my experience as well. It's frustrating because SDXL can be exquisite, but SD 1.5 is more "fun" to work with and more creative. I can throw random ideas into a mish-mash of a prompt and SD 1.5 will output an array of interesting things while SDXL will just seem to fall back to something "reasonable", ignoring anything "weird" in the prompt. SDXL also seems to have a lot more position bias in the prompt. SD 1.5 had a bit of that, paying more attention to words earlier in the prompt, but SDXL takes that to a new level.
But SDXL can draw hands consistently, so ... it's a tough choice.
With ComfyUI, you can chain SDXL > SD 1.5 or SD 1.5 > SDXL; to me it makes more sense to generate a basic image in SDXL Turbo and apply the effects of a checkpoint later.
Coming from Auto1111 for a year, I thought Comfy was basically always using img2img; then I figured out it wasn’t that but latent2latent… which is cool. Using XL to get the better prompting and 1.5 to get the checkpoints and LoRAs I want is making it all click now.
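A minimal sketch of that two-stage workflow using the diffusers library rather than ComfyUI nodes (the model IDs, resize, and strength value here are illustrative assumptions, not a recommendation):

```python
# Draft with SDXL Turbo, then re-style with an SD 1.5 checkpoint via img2img.
import torch
from diffusers import AutoPipelineForText2Image, StableDiffusionImg2ImgPipeline

prompt = "a lighthouse on a cliff at dusk, oil painting"

# Stage 1: fast base composition with SDXL Turbo (1 step, no CFG).
sdxl = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16
).to("cuda")
draft = sdxl(prompt, num_inference_steps=1, guidance_scale=0.0).images[0]

# Stage 2: hand the decoded image to an SD 1.5 checkpoint to apply its style,
# LoRAs, etc. Going through pixels avoids mismatched latent spaces, since the
# SDXL and SD 1.5 VAEs are not interchangeable.
sd15 = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
out = sd15(prompt, image=draft.resize((512, 512)), strength=0.55).images[0]
out.save("refined.png")
```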
SDXL (and possibly 2.1) switched to a different CLIP implementation that is geared for sentence-level understanding; SD 1.5 uses the old CLIP that works with tag-cloud-type prompts.
SDXL actually takes conditioning from either the old or the new CLIP, or both. The malleability of SDXL is not just down to the choice of the new CLIP; the UNet itself is more opinionated.
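This split is visible in the diffusers SDXL pipeline, which exposes both encoders and lets you pass separate text to each; a short sketch (model ID and prompts are just examples):

```python
# `prompt` goes to the original CLIP ViT-L encoder and `prompt_2` to the
# larger OpenCLIP encoder, so tag-style and sentence-style text can be fed
# to each one separately.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="masterpiece, watercolor, castle, fog, dawn",           # CLIP ViT-L
    prompt_2="a ruined castle emerging from morning fog at dawn",  # OpenCLIP bigG
    num_inference_steps=30,
).images[0]
```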
> But SDXL can draw hands consistently, so ... it's a tough choice.
Looking at the article photos it still has some way to go. I counted 3 cases of missing fingers, two cases of extra fingers (on the cartoon girl), and a few arm poses that in real life would need medical attention.
Entered this thread to write your comment. I find SDXL inferior to 1.5 and yes, much harder to work with.
Another issue I have is that the SDXL images you see on the web always have that “from a movie/ad”-ish coating. Can’t explain it, but it feels even more uncanny than 1.5.
SDXL is too resource-hungry for what it produces: 3x+ the model size, 12 GB of VRAM is barely enough for it, 40 steps is the minimum, and I don’t think training LoRAs will turn out to be feasible at all. I can’t lower the resolution without distortions, and even proportions are hard to deal with. It feels much less flexible than 1.5 in this regard.
Particularly considering the rich world of SD1.5 fine-tunes, SDXL leaves so much to be desired. I'm sure it will all be sorted out eventually, but right now, the momentum in the community just isn't there with SDXL the way it is with 1.5.
Just a terminology comment here. "Latent space" means a lot of different things in different models. For a GAN, for example, it means the "top concept" space, where moving around changes the entire concept of the image, which is notoriously difficult to control. For SD/SDXL it refers to the bottommost layer just above pixel space, which the VAE expands from 64x64 up to the generated 512x512 image in the case of SD 1.5.
This allows the rest of the network to be smaller while still generating a usable output resolution, so it's a performance "hack".
It's a really good idea to explore it and hack into it like in the article, to "remaster" the image so to speak!
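A quick way to see the shapes being described, assuming the diffusers AutoencoderKL (the random tensor stands in for a real image):

```python
# The SD 1.5 VAE maps a 512x512 RGB image to a 4x64x64 latent and back:
# an 8x spatial downscale, 48x fewer values overall.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"
)

img = torch.randn(1, 3, 512, 512)                   # stand-in for an image in [-1, 1]
with torch.no_grad():
    latents = vae.encode(img).latent_dist.sample()  # -> (1, 4, 64, 64)
    recon = vae.decode(latents).sample              # -> (1, 3, 512, 512)
print(latents.shape, recon.shape)
```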
Anyone know if the work shown here has been implemented in Automatic1111 or ComfyUI as an extension? If not, then that might be my first project to add, since these are quite simple (relatively speaking) to implement in code.
I think the argument is that since most colour space conversions can be expressed with relatively small neural nets (they are mostly just weighted sums of the channels), the autoencoder can dedicate a negligible proportion of its parameters to that job, which gives it the freedom to settle on whatever colour space training dictates.
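To make the "small neural net" point concrete, here's a sketch showing that an RGB-to-YCbCr conversion is just a 3x3 matrix plus offsets, i.e. a single tiny linear layer (BT.601 full-range coefficients):

```python
import torch
import torch.nn as nn

# One linear layer is enough to "choose" a colour space.
rgb_to_ycbcr = nn.Linear(3, 3, bias=True)
with torch.no_grad():
    rgb_to_ycbcr.weight.copy_(torch.tensor([
        [ 0.299,     0.587,     0.114   ],   # Y
        [-0.168736, -0.331264,  0.5     ],   # Cb
        [ 0.5,      -0.418688, -0.081312],   # Cr
    ]))
    rgb_to_ycbcr.bias.copy_(torch.tensor([0.0, 0.5, 0.5]))

pixel = torch.tensor([[1.0, 0.0, 0.0]])      # pure red, RGB in [0, 1]
print(rgb_to_ycbcr(pixel))                   # ~[0.299, 0.331, 1.0] -> Y, Cb, Cr
```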
I'm not entirely convinced by this idea myself. I have seen a few networks where inputs in the range -1..1 do a lot better than inputs in the range 0..2, even though that translation should be an easy step for the network to figure out. The benefit of preprocessing the inputs seems larger to me than my common sense says it should be.
I suspect that weight initializations are geared towards inputs being normal random variables with mean 0 and variance 1. Deviating from that makes the learning process unhappy.
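A small demonstration of that suspicion, under default PyTorch initialization (layer size and seed are arbitrary): shifting the input mean away from zero changes the statistics the very first layer produces.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(256, 256)

x_standard = torch.randn(1024, 256)    # mean 0, std 1, as the init assumes
x_shifted = x_standard + 1.0           # same data pushed into a 0..2-ish range

print(layer(x_standard).std().item())  # baseline output spread
print(layer(x_shifted).std().item())   # larger: the non-zero mean leaks into every output unit
```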
The format isn't explicit to the network. But the training data is usually in RGB, so that's probably the reasoning. I found a repo where someone tried different formats, but it's worth noting that it was for discrimination, so just because a network can discriminate doesn't mean it does the same thing generatively. Maybe I'll run some experiments: you could use a UNet for classification and then look at the bottom layer and do the same thing. It'd be hard to do with SD (or SDXL) because you'd need to retrain with the new format. Tuning could possibly work, but the network would likely be biased toward the RGB encoding.
> But the training data is usually in RGB, so that's probably the reasoning.
It's trivial to convert the values for training - basically 0% of the cost of the process. But there's likely more "meaning" in HSV than in RGB. So I don't think that would account for the difference.
ML systems generally do not care about human semantics, and they will not produce them naturally. The VAE works with 16-bit floats per channel, so compression is not an issue either; and if it were, HSV would be a poor choice too.
ML systems don't care, but humans do and better semantically-meaningful representations in training data usually lead to better results for us. In images you often care about "different colours of similar brightness" rather than "matching levels of 3 colour components", so there's a non-zero chance HSV/HLS would do better than RGB. It's nothing to do with compression.
Does it lead to better results though? For the system, the best representation would be one that it learned - which is the latent representation, 4 channels in this case. Would it learn a "better" representation when fed with HSL instead of RGB? If so, what's the intuition? RGB somewhat resembles human vision, whereas HSL exists for interactive editing, and YCbCr exists for compression. If anything, I would expect YCbCr to outperform.
HSV more closely resembles physical properties for most natural things. Hue and saturation variations are usually meaningful variations in the actual material, while brightness variations often end up being mostly about lighting rather than the material. It can be surprisingly effective for simple segmentation [1], which is why it's usually the first one implemented in computer vision classes.
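The classic classroom version of that segmentation looks something like the sketch below; the filename and the red-ish hue band are placeholders (OpenCV hue runs 0..179):

```python
import cv2
import numpy as np

bgr = cv2.imread("photo.jpg")              # any test image
hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)

# Select strongly saturated red-ish pixels regardless of brightness,
# so the V (lighting) channel matters much less than in RGB thresholds.
lower = np.array([0, 120, 50])             # H, S, V lower bounds
upper = np.array([10, 255, 255])
mask = cv2.inRange(hsv, lower, upper)

cv2.imwrite("mask.png", mask)
```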
Our eyes have RGB sensors, but I would claim I perceive the colors in my surroundings in something like HSV (although that could very well be from the way I learned colors). And I think this makes sense: if you're looking for something, you want a color perception that's not overly sensitive to lighting conditions, whereas RGB is directly tied to the lighting.
The segmentation aspect is interesting, but the problem I have with H is that it is circular, i.e. 0 and 1 represent virtually the same hue, and my intuition is that this lends itself poorly to an NN. The luminosity argument is valid, but that is not unique to HSL, hence my intuition that YCbCr (or something related) would outperform.
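One common workaround for the circularity concern (a sketch, not something from the thread or the article): feed hue to the network as the cosine and sine of its angle, so 0.0 and 1.0 land on the same point.

```python
import numpy as np

def encode_hue(h):
    """h in [0, 1); returns two features that are continuous across the wrap-around."""
    angle = 2.0 * np.pi * np.asarray(h)
    return np.stack([np.cos(angle), np.sin(angle)], axis=-1)

print(encode_hue(0.0))    # [1., 0.]
print(encode_hue(0.999))  # ~[1., -0.006], nearly identical to hue 0.0
```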
I don't think it's as simple as this naive approach suggests, but it's a good preliminary analysis. It's a good lesson that while being absolutely correct might be quite difficult, diving in and having a go might get you further than you think.
It's only patterns and textures for 8x8 patches, so I guess it could make sense; you're not going to need every conceivable pattern of 8x8 pixels in normal images.
If you quantize those 4 floats per 8x8 block, is that encoding better than, say, the venerable old JPEG 8x8 DCT + quantization?
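Rough back-of-the-envelope arithmetic for that question (the comparison is loose, since the VAE is lossy in a learned, semantic way rather than via DCT coefficient quantization):

```python
# 4 latent channels at fp16 per 8x8 block of pixels, before any further quantization.
pixels_per_block = 8 * 8
bits_per_block = 4 * 16
print(bits_per_block / pixels_per_block, "bits per pixel")  # 1.0, roughly typical JPEG territory
```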