
Does adding memory help? There's an RPi 5 with 16GB RAM recently available.



Inference speed is heavily dependent on memory read/write speed rather than capacity. As long as you can fit the model in memory, what determines performance is the memory bandwidth.
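Back-of-the-envelope, if generation really is memory-bandwidth-bound, each generated token has to stream roughly the whole set of weights through memory once, so tokens/sec is capped near bandwidth divided by model size. A minimal sketch, assuming a dense model and roughly 17 GB/s of LPDDR4X bandwidth on the Pi 5 (both are assumptions, not measurements):

    # Rough upper bound for a memory-bandwidth-bound setup: each generated
    # token reads approximately every weight once. Illustrative numbers only.
    def max_tokens_per_sec(model_bytes: float, bandwidth_bytes_per_sec: float) -> float:
        return bandwidth_bytes_per_sec / model_bytes

    model_bytes = 8e9 * 4 / 8                     # 8B params at 4 bits/param ~= 4 GB
    print(max_tokens_per_sec(model_bytes, 17e9))  # ~4 tokens/sec ceiling at ~17 GB/s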


This is not universally true, although I see this phrase repeated here too often. And it is especially not true of small models: small models are compute-bound.


Memory capacity in itself doesn't help so long as the model + context fits in memory (and an 8B-parameter Q4 model should fit in a single 8 GB Pi).


Is there a back-of-the-napkin way to calculate how much memory a given model will take? Or what parameter/quantization model will fit in a given memory size?


To find the absolute minimum, you just multiply the number of parameters by the bits per parameter, and divide by 8 if you want bytes. In this case, 8 billion parameters at 4 bits each means "at least 4 billion bytes". For back-of-the-napkin purposes, add ~20% overhead to that (it really depends on your context setup and a few other things, but that's a good swag to start with) and then add whatever memory the base operating system is going to be using in the background.
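A minimal sketch of that napkin math, treating the ~20% overhead as a rough assumption rather than a measured figure:

    # Napkin estimate: params * bits-per-param / 8 gives bytes for the weights,
    # plus ~20% assumed overhead for context/KV cache and runtime bookkeeping.
    # Add whatever the base OS itself uses on top of this.
    def napkin_estimate_gb(n_params: float, bits_per_param: float, overhead: float = 0.20) -> float:
        weight_bytes = n_params * bits_per_param / 8
        return weight_bytes * (1 + overhead) / 1e9

    print(napkin_estimate_gb(8e9, 4))    # 8B at Q4  -> ~4.8 GB
    print(napkin_estimate_gb(8e9, 16))   # 8B at F16 -> ~19.2 GB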

Extra tidbits to keep in mind:

- A bits-per-parameter higher than what the model was trained at adds nothing (other than compatibility on certain accelerators), but a bits-per-parameter lower than what the model was trained at degrades the quality.

- Different models may be trained at different bits-per-parameter. E.g. the 671-billion-parameter DeepSeek R1 (full) was trained at fp8, while Llama 3.1 405B was trained and released at a higher parameter width, so "full quality" benchmark results for DeepSeek R1 require less memory than for Llama 3.1 even though R1 has more total parameters (rough weights-only numbers after this list).

- Lower quantizations will tend to run proportionally faster if you are memory-bandwidth-bound, and that can be a reason to lower the quality even if you can fit the larger version of a model into memory (such as in this demonstration).
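To make the DeepSeek vs. Llama point above concrete, here are the weights-only numbers, assuming fp8 is 1 byte per parameter and bf16 is 2 bytes per parameter (no context or runtime overhead included):

    # Weights-only comparison (approximate, no overhead):
    print(671e9 * 1 / 1e9)   # DeepSeek R1 671B at fp8  -> ~671 GB
    print(405e9 * 2 / 1e9)   # Llama 3.1 405B at bf16   -> ~810 GB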


Thank you. So F16 would be 16 bits per weight, and F32 would be 32? Next question, if you don't mind: what are the tradeoffs in choosing between a model with more parameters quantized to fewer bits versus a model with fewer parameters at full precision? My current understanding is to prefer smaller quantized models over larger full-precision ones.


Q4 = 4 bits per weight.

So Q4 8B would be ~4GB.


The 16 GB Pi 5 comes and goes. I was able to snag one recently when Adafruit got a delivery in — then they sold right out again.

But, yeah, performance aside, there are models that Ollama won't run at all because they need more than 8 GB.


The RPi 5 is difficult to justify. I'd like to see a 4x N150 mini PC benchmark.



