The "memory usage" section of the README highlights the surprising fact that image generation models need much less memory than text-based language models. ChatGPT itself is by far the most resource-hungry part of the system.
Why is that so? It seems counterintuitive. A single picture snapped with a phone takes more space to store than the text of all the books in a typical home library, yet Stable Diffusion runs with 5 GB of RAM while LLAMA needs 130 GB.
Image models can be very off and still produce a satisfying result. Consider that I could literally vary all the pixels in an image randomly by 10% and you'd just see it as a slightly lower-quality but otherwise perfectly cohesive image.
Language models have no such luck: the problem they're trying to solve is way "sharper", and it's very easy for their results to be strictly wrong if they're off even a little bit.
So you need a much larger model to get a sufficient level of "sharpness" for text.
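A minimal sketch of the pixel-noise claim above, using a NumPy array as a stand-in for a real photo (the resolution and the ±10% figure are just the numbers from the comment):

```python
import numpy as np

# Stand-in for a photo; in practice you'd load a real image, e.g. via
# np.asarray(PIL.Image.open("photo.png")) -- the path is hypothetical.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(512, 512, 3)).astype(np.float32)

# Perturb every pixel by up to +/-10% of the full 0..255 range.
noise = rng.uniform(-0.10, 0.10, size=image.shape) * 255.0
noisy = np.clip(image + noise, 0, 255).astype(np.uint8)

# Every single value has changed, yet a viewer would describe it as the
# same picture, just a bit noisier.
print("mean per-pixel change:", np.abs(noisy.astype(np.float32) - image).mean())
```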
Maybe another way to think of it is that the error-correction part of image generation models is offloaded to the human visual cortex, which is a very old evolutionary construct and thus has had time to become very resilient? In the case of text generation, maybe the error tolerance of the human brain is less developed, since human-level language is a newer evolutionary invention.
It'd be interesting if the parameter/complexity requirements are actually similar once you examine the system as a whole, meaning machine _and_ human brain.
> image generation models is offloaded to the human visual cortex which is a very old evolutionary construct and thus had time to become very resilient
This is a very important point. A group of my colleagues (who are not tech people) are much more impressed with the image generation models than with the chat interface, even though the images are often whacky or just wrong. Yet the fact that it tried is impressive to them, with their minds managing to fill in the blanks.
I wonder how this compares to how a toddler speaks vs. paints/draws; toddlers are typically better at the former than the latter. In both cases, we fill in the blanks in our minds.
Toddler speaking gets impressive/surprising quite fast, whereas the drawing usually does not. The most surprising thing about most toddler drawings is listening to the kid describe it or tell you about making it.
The consistency of descriptions is particularly surprising to me. Like you got a roughly circular collection of seemingly random scribbles, but they can tell you exactly which parts of it correspond to the person's nose, hair, arms, eyes, etc. And the descriptions seem to stay the same if you ask about the same picture on different days. Still not sure what to make of this phenomenon but it is fascinating.
It's kind of similar to audio vs video. Although audio requires less data and processing than video, it's much more difficult to get the audio right than the video, and if the audio is bad or even missing for some time, it's useless, while if the video is bad, stuck, or missing, it's not that big of a deal most of the time.
This is particularly true in a video conferencing situation. If the audio is bad, you miss out a lot. If the video is bad, it's not a big deal.
Deaf people would disagree :) If you talk in sign language on Zoom, missing parts of the video would ruin the conversation.
I don't think it's about precision in the case of audio vs video - if you remove all the even columns from a video it would be similar to reducing quality, and the same can be done with audio: removing half of the frequencies uniformly will just lower the quality.
That's a pretty specific case. You can get really good performance for a ton of tasks in video (video question answering, object identification and tracking, action recognition, etc) by just sampling a frame per second or even less frequently. Definitely can't do that with audio.
Right - The ‘palette’ for text generation is smaller: just 26 or so letters (plus some other characters), and if you put the wrong ones next to one another the result is garbage.
There’s something interesting in the fact that an image based system doesn’t need as much complexity to capture a semantic model as a verbal system does; I think there’s maybe a parallel there to the way that human minds find it easier to just ‘visualize’ some things as a basis for reasoning about them, but if we can’t ‘visualize’ and instead have to ‘think things through’ it’s a more intensive process.
Like, GPT has well known trouble counting - ask it for five things and it will give you four or six. Humans can offload some thinking about counting to visual/spatial reasoning though.
I was having some gourmet crème brûlée with my friend Zoë at the café near the entrepôt in Åland's capital city and that made me realize it's but your naïveté when you say English is contained in mere 26 letters, for there is a soupçon of über important words that have diacritical marks in them.
And if it is characters (as it is for some models), it's more than 26 of them for English. Between space, case, punctuation, and digits, it's basically 7-bit ASCII without most of the control characters (newline is semantically important, the rest not), almost 100 characters.
More like 256 of them - and an image is actually 3 layers, with RGB. If you are going into combinations, I could easily say that triplets of tokens are actually what's important, and so 250k^3 is the real number of whatever.
Actually, now that I think about it, neural networks work with real numbers, so a pixel is just 3 numbers. A typical input for an image model would then be around 300x300x3 values. The input for a language model is around 2,000 tokens, and while each token is fed into the model as an integer, mathematically it represents a 250k-length one-hot vector, so the input is 250k x 2,000 values. So roughly 270k vs 500M. Also, pixels next to each other in an image are related, so you can reduce model size by taking advantage of that (CNNs).
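Back-of-the-envelope version of those numbers (the 300x300 resolution, 2,000-token context, and 250k vocabulary are the figures assumed above; real models use learned embeddings rather than literal one-hot vectors):

```python
# Nominal input sizes from the comment above.
image_values = 300 * 300 * 3          # RGB values fed to an image model
tokens, vocab = 2_000, 250_000
one_hot_values = tokens * vocab       # tokens viewed as one-hot vectors

print(image_values)     # 270,000
print(one_hot_values)   # 500,000,000
```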
This is utterly wrong. There is a huge amount of redundancy in images compared to language. This redundancy is why image models have yet to surpass language models. In some sense, language is much easier than the vision problem.
I think the reason image models haven't surpassed language models is that no one has bothered to use as many images as the documents used to train GPT-3.5, to create as big a model, and then apply RLHF the way it was used to produce ChatGPT from GPT-3.5.
At any level of scale of model and scale of training set, image models do surpass language models.
Completely different model sizes? Stable Diffusion is a ~1B parameter model. The 7B parameter LLAMA model would be more comparable in RAM usage.
It's not accurate to say "LLAMA needs 130 GB" because there's more than one LLAMA.
The hard part in general is it often doesn't work very well to train small model sizes directly. You can train a very large one and distill it down, in some cases to only 1% of the original size while retaining 99%+ quality. So clearly the 1% size model exists. However training it directly usually doesn't work nearly as well. Best I can tell no one knows for sure why.
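To make the distillation idea concrete, here is a minimal PyTorch sketch of the usual soft-target loss; the `teacher`/`student` names, temperature, and commented training-loop lines are placeholders, not anyone's actual setup:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Match the student's softened distribution to the teacher's."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between teacher and student distributions,
    # scaled by T^2 as in the standard distillation formulation.
    return F.kl_div(log_soft_student, soft_teacher,
                    reduction="batchmean") * temperature ** 2

# Usage sketch: inside the training loop, freeze the teacher and
# backprop only through the student.
# with torch.no_grad():
#     teacher_logits = teacher(batch)
# loss = distillation_loss(student(batch), teacher_logits)
# loss.backward()
```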
Big tech cos are training the largest models as no one else has the hardware/power/money/etc to do so. So they tend to release the most massive ones that only they can make. There's also a sect in the ML community that thinks "scale" is the answer to the universe...
Then, either the tech cos or the community will make multiple other sizes of models and those are the ones that normies can use.
Model sizes are adjusted to reach a given quality. Many papers have justified empirically allocating more neurons to language than to visual understanding, like Imagen[0] (figure 4.a), as it yields better results. Debatably, it is also true in the human brain.
To me, it could highlight that there is more entropy in language than in images.
After all, games have a similar property: some games, like noughts and crosses, or checkers, can have the same model architecture reach optimal play with smaller sizes than are necessary for chess and go[1].
It is certainly true that language has a lot of rules (grammatical, lexical, syntactic) which are necessary to master it but irrelevant to having a mental model of a scene; it just sequences a state of things into a stream of symbols. Maybe additional entropy to learn in language which is absent in images includes motion and emotion.
> There's also a sect in the ML community that thinks "scale" is the answer to the universe...
Offtopic: Reminds me of Laurent Nottale and his controversial "Scale Relativity and Fractal Space-Time: A New Approach to Unifying Relativity and Quantum Mechanics"
Natural language can represent an abstraction of very diverse stuff and phenomena in the world. It can represent events (with time dimension) and interactions between entities and events, abstract and physical, and meta-interactions as well, at multiple layers of abstractions.
2D still images, at least the sort that humans are familiar with, are more limited in terms of representation power and thus more amenable to compression into network weights.
Perhaps models representing 4D phenomena (3D entities with time dimension, e.g. videos of real 3D models) would be more comparable in size to natural language models. Since LLMs can also represent abstract and unreal entities, while 4D representation can represent more details, it's hard to say which kind of models is richer.
exactly. visual information is more compressible than natural language: much of it boils down to locality, whereas language forms are highly Pareto-distributed, plus the conceptual system is a huge hypergraph, so it's rather the opposite of "local" organisation of information.
See also: The case for 4-bit precision, which shows effectively no output quality reduction for these 4bit quantization methods (and considerable speedup) https://arxiv.org/abs/2212.09720
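For intuition about what 4-bit storage looks like, here is a naive round-to-nearest blockwise quantizer in NumPy. GPTQ is considerably smarter about where it spends the error budget, so treat this purely as a sketch of the storage math, not the actual method:

```python
import numpy as np

def quantize_4bit(weights, block_size=64):
    """Naive symmetric 4-bit quantization with one scale per block."""
    w = weights.reshape(-1, block_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # int4 range: -8..7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale, shape):
    return (q * scale).reshape(shape).astype(np.float32)

rng = np.random.default_rng(0)
w = rng.normal(size=(4096, 4096)).astype(np.float32)

q, scale = quantize_4bit(w)
w_hat = dequantize(q, scale, w.shape)

print("mean abs error:", np.abs(w - w_hat).mean())
# Storage drops from 16 bits/weight (fp16) to ~4.25 bits/weight
# (4-bit values plus one fp16 scale per 64-weight block).
```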
absolutely amazing. I'm stunned how fast quantization was done.
do you think there's anything left to trim? like weight pruning, or LoRA, or I dunno, some kind of Huffman coding scheme that lets you mix 4-bit, 2-bit and 1-bit quantizations?
I can't edit my comment now, but it's 30B that needs 18GB of VRAM.
LLaMA-13B, GPT-3 175B level, only needs 10GB of VRAM with the GPTQ 4bit quantization.
>do you think there's anything left to trim? like weight pruning, or LoRA, or I dunno, some kind of Huffman coding scheme that lets you mix 4-bit, 2-bit and 1-bit quantizations?
Absolutely. The GPTQ paper claims negligible output quality loss with 3-bit quantization. The GPTQ-for-LLaMA repo supports 3-bit quantization and inference. So this extra 25% savings is already possible.
As of right now GPTQ-for-LLaMA is using a VRAM hungry attention method. Flash attention will reduce the requirements for 7B to 4GB and possibly fit 30B with a 2048 context window into 16GB, all before stacking 3-bit.
Pruning is a possibility but I'm not aware of anyone working on it yet.
I too have zero qualifications, but I think “text” is slightly more complicated - from an information theoretical perspective - than we give it credit for.
A letter carries significantly more information than a pixel. One word can change the meaning of the rest of the text. (“joke:”)
They're categorically different media, it's not just a matter of quantity. You could sum up 1000 pictures of bananas with the word "bananas", or you could spend 1000 different words describing nuances and context in just one of those pictures. Something is lost (and something is gained) either way.
In a sense, language is a clumsy facsimile of the concepts we mean to express, in that we search for words to express the ideas in our minds rather than the other way around. By contrast, an image represents precisely the concept it depicts, by definition.
We forgot this about language by about the time of the Enlightenment era, when the intellectuals of the time thought that forcing everything to inhabit the structures of language (i.e., "rationality") represented the highest moral good one could achieve.
Possibly it is because with things like Stable Diffusion we give it a lot of passes when things aren't exactly right. Images just have to be close enough.
With text, however, if it is off by even a single word, the whole meaning and readability can change. It needs a significantly larger data set to ensure clear readability.
LLM text is token based not pixel based. And LLM output is ~1K tokens, while a picture is ~1M pixels.
And pictures aren't necessarily made of pixels. They are modelled as a collection of waves (JPEG) and displayed as pixels. I don't know how LLM/whatever image models represent images, though.
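The "collection of waves" here is the DCT at the heart of JPEG: a block of pixels is re-expressed as weights on cosine patterns, and for natural images most of the weight lands in a few low-frequency terms. A tiny SciPy illustration on a synthetic 8x8 block (the gradient is just a stand-in for a smooth patch of a photo):

```python
import numpy as np
from scipy.fft import dctn

# Synthetic smooth 8x8 block: a gentle horizontal gradient.
block = np.tile(np.linspace(50, 200, 8), (8, 1))

coeffs = dctn(block, norm="ortho")

# Energy concentrates in the low-frequency corner; most coefficients
# are near zero, which is exactly what makes the block compressible.
energy = coeffs ** 2
print("fraction of energy in top-left 2x2:",
      energy[:2, :2].sum() / energy.sum())
```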
Images are more global: changing the color palette changes the tone (ha!) like a descriptive word in text.
Machine learning models don't store training data. The space a picture takes is irrelevant. For instance, Stable Diffusion would be the same size if it trained on 1 billion images as if it trained on 200 million, or even 1 image (or 0 images).
Weights/parameters are configuration settings not training data storage. When weights/neurons/parameters are updated after each training loop, you are essentially updating configuration settings that direct generations, not storing any particular training text or image.
Weights are what take up the space. The bigger the parameter size/the number of weights, the bigger the size of the model.
Image generators don't need the huge parameter numbers text generators need to be useful. What they need to learn simply isn't as complex.
> What they (image generation models) need to learn simply isn't as complex.
This is the surprising part. People seem to intuit that images are richer and more complex than words; a picture is worth a thousand words. But apparently this isn't true? Or perhaps our training methods for text models are way worse than those we use for image models.
A picture may be worth a thousand words when the information you want to convey is visual. But that's not the case the overwhelming majority of the time.
Imagine having this discussion (or the comment thread as a whole) using exclusively pictures, for example... at least you can describe an image with words (even if the result is very lossy), most of the time it's not even possible to describe a text with images.
In my view, language is infinitely more versatile and powerful than images, and hence harder to learn.
The complexity of what is learned is rooted in the complexity required to complete the task. Predicting the next token may seem deceptively simple but you have to ask yourself what it takes to generate/predict passages of coherent text that display recursive understanding. Seeing as language is the communication between intelligent minds, there's a lot of complex abstractions encoded in it.
The typical text to image objective function is more about mapping/translation. Map this text to this image. Neural Networks are lazy. They'll only learn what is necessary for the task. And mapping typically requires fewer abstractions than prediction.
This is just a guess, but I don't think there's such a deep lesson here; language models and image models have simply been developed by mostly-different groups of researchers who chose different tradeoffs. In an alternate history it may very well have gone the other way around.
I would disagree. We have image generation with a variety of architectures. Diffusion models aside, it still takes a lot fewer parameters to build state-of-the-art image generators with transformers (e.g. Parti).
Simplifying a bit, mapping (which is essentially the main goal of image generators and especially transformer generators) is just less complex than prediction.
True in this situation, but note that intermediate activations and gradients do take memory and in other contexts that's the limiting factor. For example purely convolutional image networks generally take fixed-size image inputs, and require cropping or downsampling or sliding windows to reach those sizes - despite the convolution memory usage being constant for whatever input image size.
Not an expert at all, but maybe because the accuracy of an LLM needs to be higher than Stable Diffusion's. Pictures can just 'look' good. But text very quickly can be 'off'. Put one word in the wrong place and the entire thing doesn't make sense anymore. And there's overall context you need to take care of: it can't repeat and can't go off the rails too much.
A picture at the end of the day is a 512x512 grid and there's many many combinations that would pass as a good result.
This!!!
I think the secret reason is that, on first glance, we are totally OK with six fingers in the picture. On closer examination we might balk at this but we won't notice if the "brush stroke" in a painting is slightly off (e.g. discontinuous in color by a few hues of red here or there).
In language, we would definitely notice a missing, or a Japanese word in the 中 of an English text.
So maybe the best and largest current image model is just equivalent to GPT-1 in precision... but we just don't notice because our brains "smooth out" the imperfections.
I have zero qualifications to talk about this subject, but from what little I do know it's because Stable Diffusion's network has "only" ~900m parameters, whereas ChatGPT has 175 billion parameters.
Sure, but that's just another way of saying SD needs less space than ChatGPT, right?
The question is why an image-generating model needs so many fewer parameters than a text-generating model in order to produce useful results, when our everyday experience teaches us that images need much more storage space than text to convey similar information.
I think that's because the mind can make more sense of a mismash of visual data and see patterns and pictures. Comparatively if drop words here or you can still make sense of what I wrote, but it's much more difficult for the reader to parse what I'm saying if I accidentally a few words. If my drawing has a clearly recognizable human figure but it has 7 fingers, it's still recognizable that it was trying to draw a human.
LLMs have been able to generate words for years now. Hell, Tay was back in 2016. Making sure ChatGPT is able to answer the way it does (ie filter out bad things) is part of what makes it hard, and thus bigger, to implement. But having a cohesive readable output is what's hard and takes up more space. Plus, unless I missed it, we don't actually know how many gigabytes the model file for ChatGPT is.
>images need much more storage space than text to convey similar information
That is because an image is a collection of bits that attempts to represent reality as it is. It is probably easier to relate to blue as a "color archetype" when you literally have a collection of bits that mean literal "blue" all the time. In languages, "blue" doesn't always mean the color.
Texts are abstractions/coded form of reality. Your mind itself contains the decryption codes to translate text into what it actually means. That decryption code for text apparently is much bigger and harder for a machine to crack than interpreting images.
Also, images are bigger than a text file because of the way data is stored. It probably has nothing to do with the amount of information stored inside. A book about quantum physics can be smaller than an image of a cat, for example.
It could suggest that humans have a much more sensitive and expressive sense of language than they do of images, which seems plausible. We can spot flawed language in more contexts than flawed images, and can produce a bigger range of representations in text than in images. So for a system to produce human-satisfying text, it needs to be far more prepared than one that produces images.
Conversely, this may underscore how inefficient pixel-like storage is for communicative and artistic images. Ten years from now, many of those kinds of images may only take a few hundred bytes and a good enough generator model to “decompress” them for display.
No it’s just that an image can’t be “false” or “grammatically incorrect”. Anything recognizable can make a satisfying image and people don’t seem too bothered by the kinds of mistakes image generators make.
Generated text, though, we want not to be just a pile of recognizable words but to follow some pretty strict rules and to actually be true.
Imagine you insisted Stable Diffusion only produce photorealistic images and judged it for every inaccuracy.
That’s exactly what I said. We have those words (false, grammatically incorrect, etc) because we have a more particular sense of language and identify more nuance in how language fails for us than how images fail for us.
It's more that nonsensical language is meaningless, because language has so little content density to begin with, so losing a little loses almost all of it, whereas a mishmashy picture has lots of pieces of meaning, even if the whole is a mess, because images are far more information dense.
This is the most likely explanation, I believe.
I expect that the local correlation structure in images is more homogenous than in language, there are more chances to make mistakes (e.g. confusing the intangible and tangible - You can't see green ideas sleeping furiously) and, perhaps, we might even criticise mistakes from long range interactions more in language.
Images need more space than text because images encode far more non-textual information.
Sure, an image of text in a standard font face/size/weight/color has low information, because it's only using asymptotically 0% of its expressive power. Just look at how a logo conveys more information just by styling text.
I think we are correlating the size of the input and output here with the size of the machine in the middle that creates it.
It's kind of like saying "Why would the factory creating tiny processors for phones (with only tiny bits of raw materials) need to be larger than that other factory that produces loads of big loaves of bread?"
The mistake you're making is thinking of weights as an avenue to store training data. This is wrong. I've hopefully explained it more in another comment but weights/neurons are essentially configuration settings.
When running diffusion models locally you often generate a lot of garbage. People with eight fingers per hand, cats with four tails, etc. When that happens you chalk that up to bad prompting or just the “magic” of image generation being too arcane to know.
When it comes to text, people don’t find “cats are an animal with four tails” an amusing statement in the same way that they do a drawing of one. The standard of acceptability is way higher.
I'm not convinced by this explanation. Small language models (similar in size to the Stable Diffusion network) don't just produce incorrect statements like "cats are an animal with four tails", they produce incoherent sentences with no relation to the text they are supposed to extrapolate. It's not that they are wrong, they don't even make sense much of the time. That's not true for SD. Yes, many of its output images have flaws, but the overall image usually shows the desired subject, and roughly resembles something a human artist might paint.
All decent replies, but the truth is that it is mostly a quirk of the transformer architecture, which scales quadratically in the length of the sequence because transformers look at all pairwise combinations of the input tokens. So you can get the memory to whatever level you want: just increase the sequence length (also known as context).
No it doesn't, it just improves the constant factor by a lot. Unless you are thinking of block sparse attention, which can change complexity (assuming you scale sparsity with size) but decreases quality.
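Either way, the quadratic term being argued about is easy to see from the attention score matrices alone. A rough estimate (the 32 heads and fp16 scores are illustrative assumptions, and this ignores weights, activations, and KV caches):

```python
# Rough size of the dense attention score matrices for one layer.
def attention_score_bytes(seq_len, n_heads=32, bytes_per_val=2):
    return n_heads * seq_len * seq_len * bytes_per_val

for n in (512, 2048, 8192):
    print(n, attention_score_bytes(n) / 2**20, "MiB per layer")
# 512 -> 16 MiB, 2048 -> 256 MiB, 8192 -> 4096 MiB: 4x the context,
# 16x the memory, which is the quadratic blow-up being described.
```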
Model sizes, as people have already pointed out. Also: keep in mind that the resolution of the images generated by txt2img models is usually around 512*512. If you wanted a 50-megapixel photo, the VRAM requirement would grow by a lot. Granted, the number of parameters in the model would still be the same.
Entropy. Take a colored pixel. It is connected to four other pixels, each of which could be one of, say, 256 colors but is more likely to be a shade similar to its neighbors. So you have roughly 1000 options. But then take a given English word. How many possible words might come after the word "the"? The set of possible connections between words in a paragraph is larger than the set of possible pixel colors in a given image (a real-world image, not white noise/static/random pixels).
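A toy way to see that branching-factor difference, using a tiny text sample and a synthetic smooth image (both far too small to be real statistics; it's just an illustration of what is being counted):

```python
import numpy as np

text = ("the cat sat on the mat while the dog watched the cat and "
        "the rain fell on the roof of the old house by the sea").split()

# Even in this tiny sample, many distinct words follow "the".
successors = {b for a, b in zip(text, text[1:]) if a == "the"}
print("distinct words after 'the':", len(successors))

# Neighbouring pixels in a smooth image, by contrast, mostly differ
# by a small amount: few effective "choices" per step.
img = np.tile(np.linspace(0, 255, 256), (256, 1)).astype(np.int16)
diffs = np.abs(np.diff(img, axis=1))
print("max neighbour difference:", diffs.max())
```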
Visual knowledge can be decomposed into a limited set of primitives and operations. Whereas textual knowledge can accumulate arbitrarily and is bounded only by time.
It’s the difference between learning a book of sports facts (language) and what sports facts look like (vision).
Another example would be those passages in the Bible that list all the things owned by some person. There is a simple grammar to generate those passages. But learning them required more memory.
If you want to do a good job of generating text, you have to develop a model of how the world works. For example, if you describe an experiment from a paper to ChatGPT and ask it to generate the results section of the paper, then ChatGPT probably needs to understand the phenomenon that the experiment is about and be able to model it to some degree in order to generate plausible results. If you think about ChatGPT in this way, then it is not just a text generator, but a world simulator. The more accurately you can simulate the world, the better you can generate text. I think this is where the model's size and complexity comes from. ChatGPT needs to know as much as it can about basically everything.
Putting it more generally, the difficulty of a computation isn't necessarily correlated to the filesize of the end product of that computation. Imagine simulating the entire world to try to predict what next week's lottery drawing numbers are going to be. Would require an unimaginable amount of data and computation, yet the output will be just a couple numbers.
Why is this so surprising? Almost all animals can process images, while only humans and arguably a few more can process language. Language is clearly much more complex from a processing point of view, even if it takes less memory when stored.
(I know image generation models also use language, but to a much simpler extent, at least for now).
While I think it’s nonsense comparing what we do when we learn and what a computer program does, I’ll speculate on this.
Our eyes are doing a lot of signal processes before the “image” hits our brain. My understanding is audio has less signal processing required before the “sound” hits our brain.
I like to think of stable diffusion as decompressing a text prompt into a 512x512 image. In some sense the 10 word text prompt and seed contains the same amount of information as a 262k pixel image. You could say to a transformer a picture is only worth a few words.
By extension, if every image in the Stable Diffusion dataset compresses down to a sentence of text, then Stable Diffusion needs "only" 260M images * 10 words worth of information to train. GPT-3, on the other hand, was trained on 45TB of text data.
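Putting rough numbers on that, taking the figures above at face value (the ~5 bytes per word average is my assumption):

```python
# Caption-equivalent information in the image training set.
captions = 260e6 * 10 * 5          # images * words * ~bytes per word
gpt3_text = 45e12                  # ~45 TB of raw training text

print(f"caption text: {captions / 1e9:.0f} GB")                     # ~13 GB
print(f"ratio: {gpt3_text / captions:,.0f}x more text for GPT-3")   # ~3,500x
```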
The model is doing more than spitting out text, it's spitting out language. Language has much different statistical properties than random text, and we would expect that we would need a significant amount of complexity to avoid generating complete trash from what is effectively an inferred probability distribution over long passages of language-bearing text.
Images are limited in required dimensions. You tweak some pixels and you’re done.
Language on the other hand… every word you say needs to be matched probabilistically with the next to make sense… imagine how many ways this reply alone could have been formulated :)
This was counter-intuitive to me too! I was recently playing around with some of the LLMs that can run on consumer hardware (via KoboldAI, RWKV, etc) and, boy, are they not as good as ChatGPT despite consuming all my Mac's resources. Meanwhile, I can get Stable Diffusion images in under a minute!
Here's a very unbaked thought, sparked by your question: a huge number of organisms of varying sizes can parse and react to visual representation, but the number of organisms that can handle any kind of language is much smaller.
if the intuition is based on images taking more space to encode than text, then it's a false intuition because the size of models is not correlated to the encoding size of the individual data points - it's correlated instead to the amount of relevant choices the model can make and the complexity of the dynamics in the model
I don't think general assertions like "language is more complicated" are congruent or meaningful, it really depends on what the model is trying to achieve; it's the complexity of that which will require a larger or smaller model
An image corresponds to a sentence or sometimes just a single word. What ChatGPT does is closer to video generation - a much harder task. I expect video generation models to be much larger than LLMs.
It is intriguing. I would have guessed human language, with all its structure, would require far fewer parameters. If one were to look at the possibilities for a 400x400 image and, say, 1000 words that describe it, the image would be drawn from a space of roughly (16M)^160,000 possibilities (160,000 pixels, each one of ~16M colors), whereas the 1000 words would come from roughly (40,000)^1,000 possibilities (assuming a ~40,000-word vocabulary). The space of all possible texts seems far smaller than the space of all possible images.

True, a visual image has a lot of redundancy, meaning I can change a lot of pixels and a person will still say both images are the same, whereas with language, if you change even a few characters, humans might recognize the change. But human language is very heavily structured: it is constrained by grammar, constrained by semantics ('purple banana danced on top of the super-scalar processor' is nonsense), etc. Once you apply these, the search space gets much, much more constrained. Images are also constrained; for example, if you take a random data point in the space above, it will look like noise to us. The visually interesting sub-space is much smaller. You can even constrain by a sort of stochastic visual grammar (see David Mumford's work), the idea being that humans have faces, faces have eyes, etc., so if you see a face, you are more likely to see co-occurring parts as well.

So both have a more constrained space that we are really interested in (one can define one's own version of this). Our training of models is to differentiate/generate within this space, and the question is whether one of these spaces is definitely much smaller than the other. I would have presumed the constrained visual space is much larger than the constrained text space. Thus my only answer to the current contradiction is that we seem to be doing better with vision models than with language models. It could partly be that, since we are more sensitive to errors in text output, it is harder to find simpler models that suffice.
Another way to look at this: consider the training data. A human child might see 65M images by the age of 5 (assuming 1 image per second given temporal redundancy, 10 hrs awake), would have heard 50-100M words (assuming 20-30k words/day) and spoken a few million words, and so has 'trained' for 20-40k hours. And the child can speak reasonably well by this time and detect common objects, etc. Stable Diffusion was trained on 170 million images (within an order of magnitude of the child), or about 3x10^13 bits of info, for 150k GPU hours, yielding a 1-billion-parameter model. GPT-3 was trained on 600x10^9 tokens for 900k GPU hours, yielding a 175-billion-parameter model. So it seems like Stable Diffusion is getting a lot better compression: roughly 30,000 bits of training data per parameter vs about 3.5 tokens per parameter.
Caveat: the human learning process is much more complex and more effective (as of now, at least). We also learn actively, by interacting with the world, changing the world, etc. Think of the child gazing at the apple and looking at it from different angles, or creating gibberish sentences very close to actual sentences and getting precise adult correction. We have a model of the world, and we reason about it and provide 'consistency guarantees' between various questions about it, correctness, etc. (again, all of this only to a certain extent). Try asking questions like "I have a nail on the wall that is parallel to the floor; now I hang a painting on the wall. How is the painting placed with respect to the floor?" Even a child would answer this.
> Why is that so? It seems counterintuitive. A single picture snapped with a phone takes more space to store than the text of all the books in a typical home library, yet Stable Diffusion runs with 5 GB of RAM while LLAMA needs 130 GB.
Can someone illuminate what's going on here?