It is fed in through a fascinating mechanism called "cross-attention", which originated in the Transformer architecture (first used to achieve state-of-the-art machine translation). It works something like associative memory: the network inside Stable Diffusion that generates the image (a UNet operating in latent space) "asks" the whole encoded prompt for information at almost every step. Each latent position produces a query vector Q, which is matched against key vectors K computed from the prompt, and the resulting weights are used to pull in the corresponding value vectors V [0].
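Here's a minimal sketch of that cross-attention step, assuming PyTorch-style tensors. The shapes and the projection matrices Wq/Wk/Wv are illustrative placeholders, not Stable Diffusion's exact dimensions or code:

    # Q comes from the UNet's latent image features; K and V come from the encoded prompt.
    import torch
    import torch.nn.functional as F

    def cross_attention(image_feats, text_feats, Wq, Wk, Wv):
        # image_feats: (num_latent_positions, d_model) -- UNet features at one layer
        # text_feats:  (num_tokens, d_text)            -- text-encoder outputs for the prompt
        Q = image_feats @ Wq                     # queries from the image side
        K = text_feats @ Wk                      # keys from the prompt
        V = text_feats @ Wv                      # values from the prompt
        scores = Q @ K.T / K.shape[-1] ** 0.5    # how strongly each latent position "asks" each token
        weights = F.softmax(scores, dim=-1)      # attention weights over prompt tokens
        return weights @ V                       # prompt information pulled into the image features

    # Toy shapes, just to show it runs:
    d_model, d_text, d_head = 64, 48, 32
    img = torch.randn(16, d_model)   # 16 latent positions
    txt = torch.randn(8, d_text)     # 8 prompt tokens
    out = cross_attention(img, txt,
                          torch.randn(d_model, d_head),
                          torch.randn(d_text, d_head),
                          torch.randn(d_text, d_head))
    print(out.shape)  # torch.Size([16, 32])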
How Stable Diffusion works [1] as a whole is not really hard to comprehend at a high level, but you'll need some prerequisites. The underlying probability theory is explained in Variational Autoencoders [2]. Diffusion Models [3] then built what is essentially a really cool "deep variational" autoencoder that uses many small noise/denoise steps but largely the same math (variational inference); they were unwieldy because they operated in pixel space. After that, Latent Diffusion Models [4] democratized the approach by vastly reducing the amount of computation needed, operating in latent space instead (by the way, that's why the images in this HN post look so cool: the denoising is not happening in pixel space!).
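As a rough sketch of how those pieces fit together at sampling time. All the names here (text_encoder, unet, scheduler, vae) are placeholders for the respective components, not any specific library's API:

    # Schematic denoising loop for latent diffusion (placeholder component names).
    import torch

    def sample(prompt, text_encoder, unet, scheduler, vae, steps=50, latent_shape=(1, 4, 64, 64)):
        cond = text_encoder(prompt)           # encode the prompt once; reused at every step
        latents = torch.randn(latent_shape)   # start from pure noise in *latent* space
        for t in scheduler.timesteps(steps):  # e.g. 50 noise levels, from high to low
            # The UNet predicts the noise in the current latents, conditioned on the
            # text via the cross-attention layers inside its blocks.
            noise_pred = unet(latents, t, context=cond)
            latents = scheduler.step(noise_pred, t, latents)  # one small denoise step
        return vae.decode(latents)            # only now map the latents back to pixels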
I understand neural networks, embeddings, convolutions, etc. The part that's unclear to me is specifically how the textual embeddings are linked into the img-to-img network trying to reduce the noise. In other words, I'm missing how the process is 'conditioned upon' the text. (I lack an understanding of the same for conditional GANs as well.)
If the answer is just that the textual embeddings are also fed as simple inputs to the network, then I already understand.