> What they (image generation models) need to learn simply isn't as complex.
This is the surprising part. People seem to intuit that images are richer and more complex than words; a picture is worth a thousand words. But apparently this isn't true? Or perhaps our training methods for text models are way worse than those we use for image models.
A picture may be worth a thousand words when the information you want to convey is visual. But that's not the case the overwhelming majority of the time.
Imagine having this discussion (or the comment thread as a whole) using exclusively pictures, for example. At least you can describe an image with words (even if the result is very lossy); most of the time it's not even possible to describe a text with images.
In my view, language is infinitely more versatile and powerful than images, and hence harder to learn.
The complexity of what is learned is rooted in the complexity required to complete the task. Predicting the next token seems deceptively simple, but you have to ask yourself what it takes to generate/predict passages of coherent text that display recursive understanding. Since language is communication between intelligent minds, there are a lot of complex abstractions encoded in it.
The typical text-to-image objective function is more about mapping/translation: map this text to this image. Neural networks are lazy. They'll only learn what is necessary for the task, and mapping typically requires fewer abstractions than prediction.
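To make the contrast concrete, here's a minimal sketch (PyTorch, toy dimensions, stand-in tensors rather than real models) of the two kinds of objective being discussed: next-token prediction over a vocabulary versus regressing toward a target image conditioned on a caption. It's only meant to illustrate the shape of the losses, not any particular model's training setup.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 1000, 32, 4
img_dim = 64 * 64 * 3

# Next-token prediction: a distribution over the whole vocabulary at every
# position, scored against the actual next token.
logits = torch.randn(batch, seq_len, vocab_size)           # model output (stand-in)
targets = torch.randint(0, vocab_size, (batch, seq_len))   # shifted input tokens
lm_loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))

# Text-to-image mapping: regress (or denoise toward) a single target image
# given a caption; here reduced to a plain MSE for illustration.
pred_img = torch.randn(batch, img_dim)    # model output (stand-in)
target_img = torch.rand(batch, img_dim)   # ground-truth pixels
map_loss = F.mse_loss(pred_img, target_img)

print(lm_loss.item(), map_loss.item())
```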
This is just a guess, but I don't think there's such a deep lesson here; language models and image models have simply been developed by mostly-different groups of researchers who chose different tradeoffs. In an alternate history it may very well have gone the other way around.
I would disagree. We have image generation with a variety of architectures. Diffusion models aside, state-of-the-art image generators built with transformers (e.g. Parti) still take far fewer parameters.
Simplifying a bit, mapping (which is essentially the main goal of image generators and especially transformer generators) is just less complex than prediction.