Inference speed is heavily dependent on memory read/write speed rather than memory size. As long as you can fit the model in memory, what determines performance is the memory bandwidth.
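A minimal sketch of that rule of thumb, assuming single-stream decoding where roughly all of the weights are streamed from memory for each generated token; the model size and bandwidth figures below are illustrative assumptions, not measurements:

```python
def max_tokens_per_sec(model_bytes: float, bandwidth_bytes_per_sec: float) -> float:
    """Rough upper bound on decode speed when memory bandwidth is the bottleneck."""
    return bandwidth_bytes_per_sec / model_bytes

# Example: an 8B-parameter model at 4 bits/param is ~4 GB of weights,
# on hardware with ~100 GB/s of memory bandwidth (assumed figure).
model_bytes = 8e9 * 4 / 8   # parameters * bits per parameter / 8
bandwidth = 100e9           # bytes per second (assumed)
print(f"~{max_tokens_per_sec(model_bytes, bandwidth):.0f} tokens/sec upper bound")
# -> ~25 tokens/sec upper bound
```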
This is not universally true, although I see this claim repeated here quite often. It is especially untrue for small models, which tend to be compute-bound.
Is there a back-of-the-napkin way to calculate how much memory a given model will take? Or which parameter count/quantization combination will fit in a given amount of memory?
To find the absolute minimum, multiply the number of parameters by the bits per parameter, then divide by 8 if you want bytes. For example, 8 billion parameters at 4 bits each means "at least 4 billion bytes". For a back-of-the-napkin figure, add ~20% overhead to that (it really depends on your context setup and a few other things, but that's a good swag to start with) and then add whatever memory the base operating system is using in the background.
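A quick sketch of that napkin math in Python; the 20% overhead is the same rough default as above, not an exact figure:

```python
def napkin_memory_gb(params_billion: float, bits_per_param: float,
                     overhead: float = 0.20) -> float:
    """Weights = params * bits / 8 bytes, plus ~20% for context/runtime overhead."""
    weight_bytes = params_billion * 1e9 * bits_per_param / 8
    return weight_bytes * (1 + overhead) / 1e9  # decimal GB, napkin-grade

# 8B parameters at 4 bits each: ~4 GB of weights, ~4.8 GB with overhead,
# plus whatever the OS itself is using.
print(f"{napkin_memory_gb(8, 4):.1f} GB")  # -> 4.8 GB
```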
Extra tidbits to keep in mind:
- Running at a higher bits-per-parameter than the model was trained at adds nothing (other than compatibility on certain accelerators), but a lower bits-per-parameter than the model was trained at degrades quality.
- Different models may be trained at different bits-per-parameter. E.g. the 671-billion-parameter DeepSeek R1 (full) was trained at fp8, while the 405-billion-parameter Llama 3.1 was trained and released at a wider precision, so "full quality" benchmark results for DeepSeek R1 require less memory than for Llama 3.1 even though R1 has more total parameters (rough numbers in the sketch after this list).
- Lower quantizations will tend to run proportionally faster if you are memory-bandwidth bound, and that can be a reason to accept lower quality even if you can fit the higher-precision version of a model into memory (such as in this demonstration).
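To make those last two points concrete, here is the same weight-only math applied to the two models mentioned above, assuming the Llama 3.1 weights are 16-bit (bf16) as the "wider precision" comment implies; overhead and OS memory are ignored, so these are floor estimates only:

```python
def weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Weight footprint only: params * bits / 8, in decimal GB."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

print(f"DeepSeek R1 671B @ fp8:  ~{weight_gb(671, 8):.0f} GB")   # ~671 GB
print(f"Llama 3.1 405B @ bf16:  ~{weight_gb(405, 16):.0f} GB")   # ~810 GB

# Bandwidth point: if you are memory-bandwidth bound, halving the bits per
# parameter roughly halves the bytes read per token, so decode speed goes up
# roughly proportionally.
```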
Thank you.
So F16 would be 16 bits per weight, and F32 would be 32?
Next question, if you don't mind: what are the tradeoffs in choosing between a model with more parameters quantized to fewer bits versus a model with fewer parameters at full precision?
My current understanding is to prefer smaller quantized models over larger full-precision ones.