Inference speed is heavily dependent on memory read/write speed rather than memory size. As long as you can fit the model in memory, what determines performance is the memory bandwidth.
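A minimal sketch of that rule of thumb, assuming single-stream decoding where roughly all of the weights are streamed from memory for each generated token; the model size and bandwidth figures below are illustrative assumptions, not measurements:

```python
def max_tokens_per_sec(model_bytes: float, bandwidth_bytes_per_sec: float) -> float:
    """Rough upper bound on decode speed when memory bandwidth is the bottleneck."""
    return bandwidth_bytes_per_sec / model_bytes

# Example: an 8B-parameter model at 4 bits/param is ~4 GB of weights,
# on hardware with ~100 GB/s of memory bandwidth (assumed figure).
model_bytes = 8e9 * 4 / 8   # parameters * bits per parameter / 8
bandwidth = 100e9           # bytes per second (assumed)
print(f"~{max_tokens_per_sec(model_bytes, bandwidth):.0f} tokens/sec upper bound")
# -> ~25 tokens/sec upper bound
```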
This is not universally true, although I see this claim repeated here quite often. It is especially untrue for small models, which tend to be compute-bound.
Is there a back-of-the-napkin way to calculate how much memory a given model will take? Or which parameter count/quantization combination will fit in a given amount of memory?
To find the absolute minimum, multiply the number of parameters by the bits per parameter, then divide by 8 if you want bytes. For example, 8 billion parameters at 4 bits each means "at least 4 billion bytes". For a back-of-the-napkin figure, add ~20% overhead to that (it really depends on your context setup and a few other things, but that's a good swag to start with) and then add whatever memory the base operating system is using in the background.
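A quick sketch of that napkin math in Python; the 20% overhead is the same rough default as above, not an exact figure:

```python
def napkin_memory_gb(params_billion: float, bits_per_param: float,
                     overhead: float = 0.20) -> float:
    """Weights = params * bits / 8 bytes, plus ~20% for context/runtime overhead."""
    weight_bytes = params_billion * 1e9 * bits_per_param / 8
    return weight_bytes * (1 + overhead) / 1e9  # decimal GB, napkin-grade

# 8B parameters at 4 bits each: ~4 GB of weights, ~4.8 GB with overhead,
# plus whatever the OS itself is using.
print(f"{napkin_memory_gb(8, 4):.1f} GB")  # -> 4.8 GB
```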
Extra tidbits to keep in mind:
- Running at a higher bits-per-parameter than the model was trained at adds nothing (other than compatibility on certain accelerators), but a lower bits-per-parameter than the model was trained at degrades quality.
- Different models may be trained at different bits-per-parameter. E.g. the 671-billion-parameter DeepSeek R1 (full) was trained at fp8, while the 405-billion-parameter Llama 3.1 was trained and released at a wider precision, so "full quality" benchmark results for DeepSeek R1 require less memory than for Llama 3.1 even though R1 has more total parameters (rough numbers in the sketch after this list).
- Lower quantizations will tend to run proportionally faster if you are memory-bandwidth bound, and that can be a reason to accept lower quality even if you can fit the higher-precision version of a model into memory (such as in this demonstration).
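To make those last two points concrete, here is the same weight-only math applied to the two models mentioned above, assuming the Llama 3.1 weights are 16-bit (bf16) as the "wider precision" comment implies; overhead and OS memory are ignored, so these are floor estimates only:

```python
def weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Weight footprint only: params * bits / 8, in decimal GB."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

print(f"DeepSeek R1 671B @ fp8:  ~{weight_gb(671, 8):.0f} GB")   # ~671 GB
print(f"Llama 3.1 405B @ bf16:  ~{weight_gb(405, 16):.0f} GB")   # ~810 GB

# Bandwidth point: if you are memory-bandwidth bound, halving the bits per
# parameter roughly halves the bytes read per token, so decode speed goes up
# roughly proportionally.
```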
Thank you.
So F16 would be 16 bits per weight, and F32 would be 32?
Next question, if you don't mind: what are the tradeoffs in choosing between a model with more parameters quantized to fewer bits versus a model with fewer parameters at full precision?
My current understanding is to prefer smaller quantized models over larger full-precision ones.