Introduction to Diffusion Models for Machine Learning (assemblyai.com)
98 points by SleekEagle on May 12, 2022 | 10 comments




Hi there! I'm actually the author of the main article. I didn't reference the second article you linked in my writing (I did see it, but didn't go through it, as I thought it was just a basic summary. In hindsight, I wish I had gone through it - it would've saved me some headaches on derivations!)

I did reference the second article, although the score-matching connection to diffusion models was not observed by Song but instead by Ho et al. I didn't venture heavily into the score-matching aspect for the very reason you cited - Song's article on the subject is second to none!


Interesting. So do I understand this right: if you use such a model, you essentially don't have much control over the output other than that it's similar to your training data, because your input is just white noise? Or is there a way to combine this with another model that would allow you to generate images based on inputs like 'dog with party hat'?


In this formulation, no, you have no control over the output other than the fact that it is similar to your training data.

If you need to have control over the generated image, you would need to use a conditional diffusion model. https://arxiv.org/pdf/2111.13606.pdf
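Roughly, the idea (just a hedged sketch, not the exact architecture from that paper - every name below is a placeholder) is that the denoising network takes an embedding of the condition (a class label, a text encoding of "dog with party hat", etc.) as an extra input alongside the noisy image and the timestep:

    import torch
    import torch.nn as nn

    class ConditionalDenoiser(nn.Module):
        """Hypothetical toy denoiser; real models are U-Nets with attention."""
        def __init__(self, data_dim=64, cond_dim=16, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(data_dim + cond_dim + 1, hidden),
                nn.ReLU(),
                nn.Linear(hidden, data_dim),  # predicts the noise to remove
            )

        def forward(self, x_t, t, cond_emb):
            # Real models use sinusoidal timestep embeddings; a scalar feature
            # is enough here to show where t and the condition enter.
            t_feat = t.float().unsqueeze(-1) / 1000.0
            return self.net(torch.cat([x_t, t_feat, cond_emb], dim=-1))

    model = ConditionalDenoiser()
    x_t = torch.randn(8, 64)              # batch of noisy samples
    t = torch.randint(0, 1000, (8,))      # random timesteps
    cond = torch.randn(8, 16)             # e.g. a text/class embedding
    predicted_noise = model(x_t, t, cond)

At sampling time the same conditioning vector is fed in at every denoising step, so the reverse process gets steered toward samples consistent with it.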


Thanks!


I followed with interest until this sentence: “Where β_1, ..., β_T is a variance schedule (either learned or fixed) which, if well-behaved, ensures that x_T is nearly an isotropic Gaussian for sufficiently large T.”


I found that it's best to avoid most of these webpage explanations that pop up in Google (often on Towards Data Science and Medium). You can get a better understanding by reading intro sections of actual research papers.


Towards Data Science needs to die in a fire. The number of "articles" that are shameless ripoffs of other people's blogs or the tutorial docs of open source packages is unbelievable.


Learning to diffuse from the original image to an isotropic Gaussian, with an invertible transformation, means you are able to un-diffuse isotropic Gaussians back into images once training is complete. The idea is for the transformation to be general enough that this process returns images that could feasibly have been drawn from the distribution the training data was sampled from.
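As a rough sketch of the forward half (assuming the common fixed linear beta schedule - the numbers are just illustrative): the noised sample at step t has a closed form, and by the last step almost none of the original signal remains, which is the "nearly an isotropic Gaussian" property from the quoted sentence.

    import numpy as np

    T = 1000
    betas = np.linspace(1e-4, 0.02, T)      # fixed variance schedule (beta_1 ... beta_T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)         # abar_t = prod_{s<=t} (1 - beta_s)

    def q_sample(x0, t, rng=np.random.default_rng(0)):
        """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I) in closed form."""
        noise = rng.standard_normal(x0.shape)
        return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

    x0 = np.ones(64)                        # toy "image"
    print(alpha_bars[-1])                   # ~4e-5: x_T retains almost no signal
    x_T = q_sample(x0, T - 1)               # approximately N(0, I)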


I really like the descriptions from SUNDAE (https://arxiv.org/abs/2112.06749) if you have some background in general neural-net-style modeling, and I generally find the multinomial or binomial diffusion settings a bit simpler to think about conceptually (if a bit more difficult in practice due to the harshness of the noise). There are other papers focused on these settings too (even the original diffusion work in the NN sphere, http://proceedings.mlr.press/v37/sohl-dickstein15.html), but again - the math is at the forefront (https://arxiv.org/abs/2111.14822, https://arxiv.org/abs/2102.05379).

But a lot of the diffusion literature does focus on the math, since finding tighter bounds and proving that things converge to the true likelihood etc. are current and recent contributions in research (cf. https://proceedings.neurips.cc/paper/2021/hash/c11abfd29e4d9... or https://proceedings.neurips.cc/paper/2021/hash/b578f2a52a022...).

The summaries by Yang Song (https://yang-song.github.io/blog/2021/score/) and Lilian Weng (https://lilianweng.github.io/posts/2021-07-11-diffusion-mode...) are arguably the definitive summaries, but there is math there too.

Personally, I find the idea of training a model to go from more noise to less noise stepwise pretty intuitive (we used to call it iterative inference, I guess), but that simple message does get wrapped up in proofs and theorems quite a lot in the literature right now.
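As a sketch of how simple that stepwise objective can look in practice (placeholder model and data, assuming the standard noise-prediction parameterization): sample a random timestep, noise the data in closed form, and regress the noise that was added.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)

    # Placeholder denoiser; real models are U-Nets with timestep embeddings.
    class TinyDenoiser(nn.Module):
        def __init__(self, dim=64):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.ReLU(), nn.Linear(128, dim))

        def forward(self, x_t, t):
            return self.net(torch.cat([x_t, t.float().unsqueeze(-1) / T], dim=-1))

    model = TinyDenoiser()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    def training_step(x0):
        t = torch.randint(0, T, (x0.shape[0],))
        noise = torch.randn_like(x0)
        abar = alpha_bars[t].unsqueeze(-1)
        x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * noise  # closed-form forward noising
        loss = F.mse_loss(model(x_t, t), noise)              # regress the added noise
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()

    print(training_step(torch.randn(8, 64)))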

If you make an analogy to GAN generators, which go from noise -> data in one shot (and presumably might need to do this kind of iteration/denoising implicitly and internally), you are kind of relaxing the modeling problem and allowing for variable compute time at prediction (as opposed to trying to train a GAN with a huge number of layers in the generator).
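And the flip side at prediction time, as a rough sketch (assuming a noise-prediction model like the toy one above): one network evaluation per denoising step, so the step count becomes a compute/quality knob rather than a single fixed-depth pass.

    import torch

    @torch.no_grad()
    def sample(model, shape, T=1000):
        betas = torch.linspace(1e-4, 0.02, T)
        alphas = 1.0 - betas
        alpha_bars = torch.cumprod(alphas, dim=0)
        x = torch.randn(shape)                              # start from pure noise
        for t in reversed(range(T)):
            eps = model(x, torch.full((shape[0],), t))      # predicted noise at this step
            x = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
            if t > 0:
                x = x + betas[t].sqrt() * torch.randn_like(x)   # ancestral sampling noise
        return x

    # e.g. samples = sample(model, (8, 64)) with the toy denoiser above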

Similar analogies also hold for the VAE formulation, seeing it as a mapping from Gaussian noise (the latent Z) to data via the decoder, in the tradition of latent-variable modeling setups like LDA, with the encoder being a practical and useful necessity for mapping into this latent space (some early slides from Durk Kingma and Max Welling present the VAE in this light - particularly the "plate diagram" representation of the VAE highlights this well). Similar analogies also hold for flow-based models, and are used frequently to define and teach flow-based generative models (https://lilianweng.github.io/posts/2018-10-13-flow-models/).

Ultimately (in my opinion) each of these branches has its own "math corner" people spend time in - min-max game stuff for GANs, ELBO / bounds for VAEs (or deriving new priors), bijection / invertibility for flows, and now noise schedules for diffusion. Just part of research, I guess.

But these diffusion models are pretty straightforward to train, and pretty powerful in my experience so far - definitely worth cutting through the noise if you are interested in generative models but (like me) aren't overly invested in the math parts.

Jascha's slides with the "dye in water" analogy (starting ~slide 18, https://www.lri.fr/TAU_seminars/videos/Jascha_Sohl_Dickstein...) are a great intuitive introduction to the concept.




