
Machine learning models don't store training data. The space a picture takes up is irrelevant. For instance, Stable Diffusion would be the same size whether it was trained on 1 billion images, 200 million, or even a single image (or none at all).

Weights/parameters are configuration settings, not training data storage. When weights/neurons/parameters are updated after each training loop, you are essentially updating configuration settings that direct generation, not storing any particular training text or image.

Weights are what take up the space. The more parameters (weights) a model has, the bigger the model.

Image generators don't need the huge parameter counts that text generators need to be useful. What they need to learn simply isn't as complex.
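To make that concrete, here's a minimal sketch (PyTorch, toy architecture, nothing like Stable Diffusion's real one) showing that on-disk size comes purely from the parameter count, with no term anywhere for how many training images were seen:

    import torch.nn as nn

    # Toy stand-in for a generator: size is parameter count times bytes per parameter.
    model = nn.Sequential(
        nn.Linear(512, 1024),
        nn.ReLU(),
        nn.Linear(1024, 3 * 64 * 64),
    )

    n_params = sum(p.numel() for p in model.parameters())
    size_mb = n_params * 4 / 1e6  # float32 = 4 bytes per parameter
    print(f"{n_params} params ~ {size_mb:.1f} MB, whether trained on 1 image or 1 billion")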




> What they (image generation models) need to learn simply isn't as complex.

This is the surprising part. People seem to intuit that images are richer and more complex than words; a picture is worth a thousand words. But apparently this isn't true? Or perhaps our training methods for text models are way worse than those we use for image models.


A picture may be worth a thousand words when the information you want to convey is visual. But that's not the case the overwhelming majority of the time.

Imagine having this discussion (or the comment thread as a whole) using exclusively pictures, for example. At least you can describe an image with words (even if the result is very lossy); most of the time it's not even possible to describe a text with images.

In my view, language is infinitely more versatile and powerful than images, and hence harder to learn.


The complexity of what is learned is rooted in the complexity required to complete the task. Predicting the next token may seem simple, but you have to ask yourself what it takes to generate/predict passages of coherent text that display recursive understanding. Seeing as language is communication between intelligent minds, there's a lot of complex abstraction encoded in it.

The typical text-to-image objective function is more about mapping/translation: map this text to this image. Neural networks are lazy. They'll only learn what is necessary for the task, and mapping typically requires fewer abstractions than prediction.
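As a rough sketch of the two objectives (tensor shapes are made up for illustration, not taken from any real model): the language-model loss demands a prediction over a huge vocabulary at every position, while a typical diffusion-style text-to-image loss is closer to a regression against a mapping target:

    import torch
    import torch.nn.functional as F

    # Next-token prediction: every position predicts a distribution over the whole vocab.
    logits = torch.randn(8, 128, 50_000)            # (batch, sequence, vocab)
    targets = torch.randint(0, 50_000, (8, 128))
    lm_loss = F.cross_entropy(logits.reshape(-1, 50_000), targets.reshape(-1))

    # Diffusion-style mapping: predict the noise added to a (text-conditioned) latent image.
    pred_noise = torch.randn(8, 4, 64, 64)          # model output, latent space
    true_noise = torch.randn(8, 4, 64, 64)
    diff_loss = F.mse_loss(pred_noise, true_noise)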

It's like how bilingual LLMs can be much better translators than traditional map-this-sentence-to-this-sentence translators. https://github.com/ogkalu2/Human-parity-on-machine-translati...


This is just a guess, but I don't think there's such a deep lesson here; language models and image models have simply been developed by mostly-different groups of researchers who chose different tradeoffs. In an alternate history it may very well have gone the other way around.


I would disagree. We have image generation with a variety of architectures. Diffusion models aside, it still takes a lot fewer parameters to build state-of-the-art image generators with transformers (e.g. Parti).

Simplifying a bit, mapping (which is essentially the main goal of image generators, transformer-based ones especially) is just less complex than prediction.

It's like how bilingual LLMs can be much better translators than traditional map-this-sentence-to-this-sentence translators. https://github.com/ogkalu2/Human-parity-on-machine-translati...


True in this situation, but note that intermediate activations and gradients do take memory, and in other contexts that's the limiting factor. For example, purely convolutional image networks generally take fixed-size image inputs, and require cropping, downsampling, or sliding windows to reach those sizes - even though the convolution weights take a constant amount of memory regardless of input image size.
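A quick sketch of that trade-off (single conv layer, toy numbers): the weight memory is fixed, but the activation you have to keep around for the backward pass grows with input resolution:

    import torch
    import torch.nn as nn

    conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
    weight_bytes = sum(p.numel() * 4 for p in conv.parameters())  # constant, ~7 KB

    for size in (64, 256, 1024):
        x = torch.randn(1, 3, size, size)
        act_bytes = conv(x).numel() * 4  # grows quadratically with image side length
        print(f"input {size}x{size}: weights {weight_bytes} B, activation {act_bytes / 1e6:.1f} MB")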



