> What they (image generation models) need to learn simply isn't as complex.
This is the surprising part. People seem to intuit that images are richer and more complex than words; a picture is worth a thousand words. But apparently this isn't true? Or perhaps our training methods for text models are way worse than those we use for image models.
A picture may be worth a thousand words when the information you want to convey is visual. But that's not the case the overwhelming majority of the time.
Imagine having this discussion (or the comment thread as a whole) using exclusively pictures, for example. At least you can describe an image with words (even if the result is very lossy); most of the time it's not even possible to describe a text with images.
In my view, language is infinitely more versatile and powerful than images, and hence harder to learn.
The complexity of what is learned is rooted in the complexity required to complete the task. Predicting the next token seems deceptively simple, but you have to ask yourself what it takes to generate/predict passages of coherent text that display recursive understanding. Since language is communication between intelligent minds, there are a lot of complex abstractions encoded in it.
The typical text-to-image objective function is more about mapping/translation: map this text to this image. Neural networks are lazy. They'll only learn what is necessary for the task, and mapping typically requires fewer abstractions than prediction.
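To make the contrast concrete, here's a minimal sketch (PyTorch, toy dimensions, stand-in tensors rather than real models) of the two kinds of objective being discussed: next-token prediction over a vocabulary versus regressing toward a target image conditioned on a caption. It's only meant to illustrate the shape of the losses, not any particular model's training setup.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 1000, 32, 4
img_dim = 64 * 64 * 3

# Next-token prediction: a distribution over the whole vocabulary at every
# position, scored against the actual next token.
logits = torch.randn(batch, seq_len, vocab_size)           # model output (stand-in)
targets = torch.randint(0, vocab_size, (batch, seq_len))   # shifted input tokens
lm_loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))

# Text-to-image mapping: regress (or denoise toward) a single target image
# given a caption; here reduced to a plain MSE for illustration.
pred_img = torch.randn(batch, img_dim)    # model output (stand-in)
target_img = torch.rand(batch, img_dim)   # ground-truth pixels
map_loss = F.mse_loss(pred_img, target_img)

print(lm_loss.item(), map_loss.item())
```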
This is just a guess, but I don't think there's such a deep lesson here; language models and image models have simply been developed by mostly-different groups of researchers who chose different tradeoffs. In an alternate history it may very well have gone the other way around.
I would disagree. We have image generation with a variety of architectures. Diffusion models aside, state-of-the-art image generators built with transformers (e.g. Parti) still take far fewer parameters.
Simplifying a bit, mapping (which is essentially the main goal of image generators and especially transformer generators) is just less complex than prediction.