When you hear "autoregressive model", think "predicting a sequence". These are a good fit for text-to-speech, since you can say "given some text, generate a spectrogram." GPT-2 is probably the most impressive example of autoregressive techniques (I think).
GANs, and especially StyleGAN, are good for generating high-quality images at resolutions up to 1024x1024. They take about 5 weeks and roughly $1k of GCE credits to train, on a dataset of around 70k photos in the case of FFHQ. Mode collapse is a concern: the generator latches onto a handful of outputs that reliably fool the discriminator and stops covering the full diversity of the training data. StyleGAN has some built-in techniques to combat this. IMLE recently showed that mode collapse can be avoided without GANs at all.
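To make the "game" concrete, here's a minimal sketch of one adversarial training step. This is the vanilla GAN objective, not StyleGAN's; the tiny MLPs and 2-D "images" are stand-ins I made up for illustration:

```python
# Minimal sketch of the GAN game (vanilla objective, toy networks).
import torch
import torch.nn as nn

latent_dim, data_dim = 8, 2
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real):  # real: (batch, data_dim) samples from the dataset
    batch = real.size(0)
    z = torch.randn(batch, latent_dim)

    # Discriminator tries to score real images as 1, generated ones as 0.
    fake = G(z).detach()
    d_loss = bce(D(real), torch.ones(batch, 1)) + bce(D(fake), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator tries to make D score its samples as real. If a few outputs
    # reliably fool D, G can collapse onto just those -- that's mode collapse.
    g_loss = bce(D(G(z)), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```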
Hmm.. what else... I’ll update this as I think of stuff. Any questions?
EDIT: Regarding IMLE vs GAN, here are some resources:
Mode collapse solved (original claim): https://twitter.com/KL_Div/status/1168913453744103426
Overview of mode collapse, why it occurs, and how to solve it with IMLE: https://people.eecs.berkeley.edu/~ke.li/papers/imle_slides.p...
Paper + code: https://people.eecs.berkeley.edu/~ke.li/projects/imle/scene_...
Some simple code for reproducing IMLE from scratch (I haven't seen this referenced many other places; I stumbled onto it by accident): https://people.eecs.berkeley.edu/~ke.li/projects/imle/
Super resolution with IMLE: https://people.eecs.berkeley.edu/~ke.li/projects/imle/superr...
For comparing images, I believe they use the standard VGG perceptual loss metric that StyleGAN uses. (See section 3.5 of https://arxiv.org/pdf/1811.12373.pdf)
It seems to me that the main disadvantage of IMLE is that you might not get the latent directions that you get with StyleGAN. E.g. I'm not sure you could "make a photograph smile" the way you can with StyleGAN. But in the paper, they show that you can at least interpolate between two latents in much the same way, and the interpolations look pretty solid.
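Re the VGG metric mentioned above: the usual recipe for that kind of perceptual distance looks something like this. A generic sketch only; the exact VGG layers and any per-layer weighting in the papers may differ (I'm assuming features up to relu3_3 here):

```python
# Rough sketch of a VGG perceptual distance (generic recipe; the exact
# layer choice in the IMLE/StyleGAN papers may differ).
import torch
import torch.nn.functional as F
import torchvision.models as models

vgg = models.vgg16(pretrained=True).features[:16].eval()  # up to relu3_3
for p in vgg.parameters():
    p.requires_grad_(False)

def perceptual_distance(x, y):
    """x, y: (batch, 3, H, W) images, normalized the way VGG expects."""
    return F.mse_loss(vgg(x), vgg(y))
```

The point is just that distance is measured between deep features rather than raw pixels, which makes it far more robust than the L2 pixel distance criticized below.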
IMLE (implicit maximum likelihood estimation), as far as I can tell, is a trivial method of parameterizing the distribution of a random variable and tuning it to make true data examples (e.g., images) more likely. The technique relies on finding, for each real example, the nearest generated sample, which in turn needs a metric of image distance. The original IMLE uses least-squares pixel distance, for example, which is not a very flexible or effective metric in practice (e.g., it is completely confused by rotation).
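If I've understood the papers right, the core loop is something like the sketch below. It deliberately uses the plain pixel L2 distance just described; real implementations would swap in a stronger metric:

```python
# Minimal IMLE-style training step, as I understand the idea: sample many
# generated images, then pull the nearest one toward each real example.
import torch

def imle_step(G, real, opt, n_samples=64, latent_dim=8):
    """G: generator; real: (batch, D) flattened real images."""
    z = torch.randn(n_samples, latent_dim)
    with torch.no_grad():
        fake = G(z)                                   # (n_samples, D)
        # For each real image, the index of the closest generated sample.
        nearest = torch.cdist(real, fake).argmin(dim=1)
    # Minimize the distance from each real image to its nearest sample.
    # The metric here is plain pixel L2 -- exactly the weakness noted above.
    loss = ((G(z[nearest]) - real) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

Because every real example pulls some generated sample toward it, no mode of the data can be ignored, which is why mode collapse doesn't arise; but note the explicit distance metric doing all the work.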
A GAN's key advantage is that it does NOT need an explicit distance metric for comparing images; instead, the discriminator effectively learns the metric in order to improve its ability to distinguish real images from generated/fake ones. Arguably this is the whole advantage of GANs.
So to argue that IMLE solves mode collapse is a false equivalence: it does so only by reintroducing the explicit distance metric that GANs exist to avoid.
I found a strange bifurcation recently while collecting papers on a sub-topic of this question: China-based authors citing other China-based authors extensively (in English, with math, of course). Meanwhile, the US and Western EU literature seems like the canon; in other words, all the papers referenced are the ones you would expect to be referenced, self-consistent, etc.
That's one of the incredibly unfortunate things about science out of China: it may or may not be trustworthy, as in the data may just be straight false. I'm not surprised you saw that split; I'd be leery of quoting/referencing a potentially false paper myself.
Autoregressive models use their own output at past time steps as part of the input to predict the next value. If your sequence generator does not do that then it’s not “autoregressive”.
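Concretely, that feedback loop looks like this. A sketch only: `model` is any hypothetical next-token predictor returning per-position logits, and greedy argmax decoding is just the simplest choice:

```python
# What "autoregressive" means in code: each new output is appended to the
# input before predicting the next one.
import torch

def autoregressive_generate(model, start, steps):
    """start: (1, t0) seed sequence of token ids; returns (1, t0 + steps)."""
    seq = start
    for _ in range(steps):
        logits = model(seq)                  # (1, t, vocab): one prediction per position
        next_tok = logits[:, -1].argmax(-1, keepdim=True)  # prediction at the last step
        seq = torch.cat([seq, next_tok], dim=1)  # feed the output back in as input
    return seq
```

A model that emits the whole sequence in one shot, without conditioning on its own earlier outputs, wouldn't qualify.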