
I was curious what it takes to run this; the smallest OVH public cloud instance with a GPU costs $500+/month before taxes.



You can run a 7B model on CPU relatively quickly. If you want to go faster, the best value in public clouds may be a rented Mac mini.
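For a sense of what CPU-only inference looks like, here is a minimal sketch using llama-cpp-python; the GGUF filename and thread count are placeholders, not anything specific to this release:

    # pip install llama-cpp-python
    from llama_cpp import Llama

    # Hypothetical path to a 4-bit-quantized 7B GGUF; needs roughly 4-5 GB of RAM.
    llm = Llama(
        model_path="./llama-2-7b-chat.Q4_K_M.gguf",
        n_ctx=2048,    # context window
        n_threads=8,   # tune to your physical core count
    )

    out = llm("Q: Name the planets in the solar system. A:", max_tokens=64, stop=["Q:"])
    print(out["choices"][0]["text"])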


Do you have any resources to read on how to host LLMs in general? I am looking for scalable ways to host our own models. Thanks.


Sorry, I haven’t followed the latest developments for running at scale since the summer. I don’t have concurrent users, so llama.cpp or diffusers are good enough for me.
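For a single user, the simplest self-hosted setup I know of is llama.cpp's bundled HTTP server plus a tiny client; the host, port, and model path below are assumptions for illustration:

    # Assumes llama.cpp's server is already running locally, e.g.:
    #   ./llama-server -m ./llama-2-7b-chat.Q4_K_M.gguf --port 8080
    import json
    import urllib.request

    payload = {"prompt": "Explain KV caching in one sentence.", "n_predict": 64}
    req = urllib.request.Request(
        "http://localhost:8080/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["content"])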


Could it run on a 4x 3090 24GB rig?

These can be built for about $4500 or less all-in.

Inference FLOPs will be roughly equivalent to ~1.8X A100 perf.


You could run it on a single high-end GPU. I can run Llama 2's models (except 70B) on my 4080.


This can run on a single 2060S with 8 GB.


With what degree of quantization?


Just the default weights ollama pulls (those are 4-bit quantized by default). It's fast too. 13B is where things get slow.
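If anyone wants to try the same thing, here is a minimal sketch with the ollama Python client (the model tag and prompt are just examples, and the local ollama daemon must already be running):

    # pip install ollama
    import ollama

    # The default "llama2" tag pulls a 4-bit-quantized 7B model.
    response = ollama.generate(model="llama2", prompt="Why is the sky blue?")
    print(response["response"])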


Does a 4x 3090 rig need NVSwitch?


Presumably some compute-per-hour service would make more sense for playing around with it?



