The complexity of what is learned is rooted in the complexity of the task. Predicting the next token sounds deceptively simple, but ask yourself what it takes to generate passages of coherent text that display recursive understanding. Since language is communication between intelligent minds, there are a lot of complex abstractions encoded in it.
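To make that concrete, here's a rough sketch of the next-token objective in PyTorch. The `model` here is a hypothetical stand-in for any network that returns per-position vocabulary logits; nothing about it is specific to one architecture:

    import torch
    import torch.nn.functional as F

    def next_token_loss(model, tokens):          # tokens: (batch, seq_len) int tensor
        logits = model(tokens[:, :-1])           # predict a distribution over each next token
        targets = tokens[:, 1:]                  # the labels are just the input, shifted by one
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), # (batch*(seq_len-1), vocab)
            targets.reshape(-1),                 # (batch*(seq_len-1),)
        )

Every position in every document is a supervision signal, and driving this loss down means accounting for whatever made the actual continuation plausible.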
The typical text-to-image objective is more about mapping/translation: map this text to this image. Neural networks are lazy; they'll only learn what's necessary for the task, and mapping typically requires fewer abstractions than prediction.
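A correspondingly bare-bones sketch of a mapping-style objective, for contrast. Real text-to-image models use richer losses (diffusion, adversarial, etc.), and `text_encoder`/`decoder`/`captions`/`images` are hypothetical stand-ins, but the supervision is still pairwise text -> image:

    import torch.nn.functional as F

    def text_to_image_loss(text_encoder, decoder, captions, images):
        z = text_encoder(captions)       # encode the caption
        pred = decoder(z)                # map it into pixel/latent space
        return F.mse_loss(pred, images)  # penalize distance to the one paired image

The first loss has to model everything that shapes the continuation; the second only has to land near a single paired target.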
It's like how bilingual LLMs can be much better translators than traditional "map this sentence to that sentence" systems: https://github.com/ogkalu2/Human-parity-on-machine-translati...