I would disagree. We have image generation with a variety of architectures. Diffusion models aside, it still takes far fewer parameters to match state-of-the-art image generation with transformers (e.g. Parti).
Simplifying a bit, mapping (which is essentially the main goal of image generators, transformer-based ones especially) is just less complex than prediction.
It's like how bilingual LLMs can be much better translators than traditional "map this sentence to that sentence" translators. https://github.com/ogkalu2/Human-parity-on-machine-translati...