It is fed in through a fascinating mechanism called "cross-attention", which originated in the Transformer architecture (first used to achieve state-of-the-art machine translation). It works something like associative memory: the network inside Stable Diffusion that generates the image (a UNet operating in latent space) "asks" the whole encoded prompt for information at almost every step. Each latent position produces a query vector Q, which is matched against key vectors K computed from the prompt, and the resulting weights are used to pull in the corresponding value vectors V [0].
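Here's a minimal sketch of that cross-attention step, assuming PyTorch-style tensors. The shapes and the projection matrices Wq/Wk/Wv are illustrative placeholders, not Stable Diffusion's exact dimensions or code:

    # Q comes from the UNet's latent image features; K and V come from the encoded prompt.
    import torch
    import torch.nn.functional as F

    def cross_attention(image_feats, text_feats, Wq, Wk, Wv):
        # image_feats: (num_latent_positions, d_model) -- UNet features at one layer
        # text_feats:  (num_tokens, d_text)            -- text-encoder outputs for the prompt
        Q = image_feats @ Wq                     # queries from the image side
        K = text_feats @ Wk                      # keys from the prompt
        V = text_feats @ Wv                      # values from the prompt
        scores = Q @ K.T / K.shape[-1] ** 0.5    # how strongly each latent position "asks" each token
        weights = F.softmax(scores, dim=-1)      # attention weights over prompt tokens
        return weights @ V                       # prompt information pulled into the image features

    # Toy shapes, just to show it runs:
    d_model, d_text, d_head = 64, 48, 32
    img = torch.randn(16, d_model)   # 16 latent positions
    txt = torch.randn(8, d_text)     # 8 prompt tokens
    out = cross_attention(img, txt,
                          torch.randn(d_model, d_head),
                          torch.randn(d_text, d_head),
                          torch.randn(d_text, d_head))
    print(out.shape)  # torch.Size([16, 32])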
How Stable Diffusion works [1] as a whole is not really hard to comprehend at a high level, but you'll need some prerequisites. The underlying probability theory is explained in Variational Autoencoders [2]. Diffusion Models [3] then built what is essentially a really cool "deep variational" autoencoder that uses many small noise/denoise steps but largely the same math (variational inference); they were unwieldy because they operated in pixel space. After that, Latent Diffusion Models [4] democratized the approach by vastly reducing the amount of computation needed, operating in latent space instead (by the way, that's why the images in this HN post look so cool: the denoising is not happening in pixel space!).
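As a rough sketch of how those pieces fit together at sampling time. All the names here (text_encoder, unet, scheduler, vae) are placeholders for the respective components, not any specific library's API:

    # Schematic denoising loop for latent diffusion (placeholder component names).
    import torch

    def sample(prompt, text_encoder, unet, scheduler, vae, steps=50, latent_shape=(1, 4, 64, 64)):
        cond = text_encoder(prompt)           # encode the prompt once; reused at every step
        latents = torch.randn(latent_shape)   # start from pure noise in *latent* space
        for t in scheduler.timesteps(steps):  # e.g. 50 noise levels, from high to low
            # The UNet predicts the noise in the current latents, conditioned on the
            # text via the cross-attention layers inside its blocks.
            noise_pred = unet(latents, t, context=cond)
            latents = scheduler.step(noise_pred, t, latents)  # one small denoise step
        return vae.decode(latents)            # only now map the latents back to pixels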
I understand neural networks, embeddings, convolutions, etc. The part that's unclear to me is specifically how the textual embeddings are linked into the img-to-img network trying to reduce the noise. In other words, I'm missing how the process is 'conditioned upon' the text. (I lack an understanding of the same for conditional GANs as well.)
If the answer is just that the textual embeddings are also fed as simple inputs to the network, then I already understand.