This is just a guess, but I don't think there's such a deep lesson here; language models and image models have simply been developed by mostly different groups of researchers who chose different tradeoffs. In an alternate history it might very well have gone the other way around.
I would disagree. We have image generation across a variety of architectures. Diffusion models aside, even state-of-the-art transformer-based image generators (e.g. Parti) take far fewer parameters than comparable language models (see the rough comparison below).
Simplifying a bit: mapping (essentially the main job of image generators, transformer-based ones in particular) is just a less complex task than prediction.
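
To put rough numbers on that parameter gap, here's a minimal sketch. The counts are the publicly reported figures from the Parti, GPT-3, and PaLM papers; treat them as ballpark values, not authoritative benchmarks:

```python
# Rough, publicly reported parameter counts, in billions.
# Assumption: figures taken from the Parti (2022), GPT-3 (2020),
# and PaLM (2022) papers; ballpark values only.
param_counts_billion = {
    "Parti (largest text-to-image transformer)": 20,
    "GPT-3 (language model)": 175,
    "PaLM (language model)": 540,
}

for model, billions in param_counts_billion.items():
    print(f"{model}: ~{billions}B parameters")

# The largest Parti model is roughly an order of magnitude smaller
# than contemporary state-of-the-art language models.
```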