How to Run Llama 3 405B on Home Devices? Build AI Cluster

FloatArtifact · 2024-07-29T02:26:09 1722219969

An interesting idea, but there's no discussion about benchmarks in the article, cpu or other hardware requirements beyond 230 GB RAM for the cluster.

It seems impractical that a home would have 4 machines with 64 gb ram that would be dedicated to a distributed system. Max, core count, 16 cores from Consumer AMD CPUs? From a cost perspective build a system with 256gb ram and AMD Epic CPU?

The only thing I can think of that pushes a need for a distributed system is multi-GPU across multiple systems.

duffyjp · 2024-07-29T13:59:53 1722261593

There's a bit of info on the project page. It's CPU only, which is why GPU and VRAM is never mentioned: https://github.com/b4rtaz/distributed-llama

I'm using an old x99 board for my desktop currently. If I swap out the i7 with a Xeon it can take up to 512GB of ram. That would be pricey, but I could do 256 and the Xeon for under $300 total. Still a lot for a toy, and I'm sure it would be super slow...

adrian_b · 2024-07-29T11:16:02 1722251762

Unless you buy some obsolete Epyc CPU with a small number of cores, which would be slower than even one up-to-date Ryzen 9 9950X (which can beat even 32-core older Epycs in compute-limited tasks that use vector instructions), 4 Ryzen boxes with 16 cores & 64 GB ECC memory each should be significantly cheaper (probably not exceeding $6000) and also faster than one box with an Epyc CPU and 256 GB, where the CPU alone could be more expensive than the complete Ryzen systems, if bought at retail prices.

I actually happen to have in my home a cluster of five eATX cases, which stay on two adjacent IKEA metallic tables and which are connected in a ring of direct 10 Gb/s Ethernet links (i.e. they all have dual-port NICs). Therefore it is not really impractical, even if such a configuration may be infrequent.

In the distant past, I was using dual-socket Xeon or Opteron motherboards in those cases, because such MBs and CPUs were much cheaper than today. Then, by the time of Zen 1, when the Epyc CPUs were still very cheap, the upgrades have been to single-socket Epyc MBs. More recently, Epyc became much more expensive and Ryzen CPUs have much better performance per dollar, so the latest upgrades have been to Ryzen MBs. I look forward to the launch of Ryzen 9 9950X, which should double the throughput of the vector operations, 5 years after Ryzen 9 3950X (Zen 2) has done the same in 2019.

As my PC, I use an Intel NUC. Whenever I need to execute something for which that would be too weak, I launch it on one or more of the servers, by using Wake-on-LAN and shutting them down after the task is completed. In this way, my average electrical power consumption is much less than if I used a beefy desktop, while the peak performance is much higher.

imtringued · 2024-07-29T14:25:53 1722263153

> 4 Ryzen boxes with 16 cores & 64 GB ECC memory each should be significantly cheaper (probably not exceeding $6000)

Yeah, but then the question arises: why? Assuming 4x $700 for a mini PC setup, you still only get like 360GB/s memory bandwidth. It is a lot of money with nothing to show for it, really. Your model will run at 2 tokens per second. A GPU setup within the same budget could run a 70B model at high speeds.

This is why I am kinda bullish on the XDNA NPU from AMD. It represents the low end. People who run small models below 16GB in size. Since the hardware will be available on almost every AMD laptop, it will be much easier to actually utilize in say a video game. As of today, it is kind of annoying when an LLM based game asks you for an API key. The out of the box experience isn't that good.

adrian_b · 2024-07-29T14:58:10 1722265090

I have replied to the posting claiming that an Epyc might be better, which it is not.

The bandwidth per core is the same for a 16-core Ryzen and for a 96-core Epyc.

Of course if you compare it with a GPU, the GPU wins in memory bandwidth. However the GPUs are useless for FP64 computations, unless you are a big corporation that can afford the huge prices of the "datacenter" GPUs.

For ML/AI, GPUs are the normal choice, except when you cannot afford those with enough memory, when you may still want to fall back to cheap CPUs, as argued in the parent article.

A $1500 Ryzen 9 9950X system will have a double price in comparison with a mini-PC, but it will be 3 to 4 times faster in any program whose performance is dominated by the speed of the array operations. This speed ratio is true for the fastest CPU that could be put now in a mini-PC, i.e. AMD Strix Point (a HX 370 does 192 FP32 FMA per clock cycle, while a 9950X does 512 FP32 FMA per clock cycle and its clock frequency is much higher; it should be noted that these values are similar to those for integrated GPUs; the best iGPUs have a double number of ALUs, but their clock frequency is much lower, slightly more than half of that of a CPU). In comparison with a low-power Intel CPU or older AMD CPU, the speed advantage of 9950X would be even greater.

RealStickman_ · 2024-07-29T07:41:49 1722238909

If you just want to play with this a bit, around 250$/month should give you enough metal when renting from cheap VDS/dedicated server providers.

wkat4242 · 2024-07-29T10:59:22 1722250762

250$ is a lot of money to me though. I spent €290 on my 16GB GPU for my AI server but that was once off and I really had to think about it.

I'd love to see a llama model that fits now economically inside 16GB. The 8b is a bit too small when quantised even to 8 bits. A 16-20b model would be perfect.

But I think for 400b models to be viable, the hardware pricing really needs to catch up.

RealStickman_ · 2024-07-29T16:21:43 1722270103

> 250$ is a lot of money to me though. I spent €290 on my 16GB GPU for my AI server but that was once off and I really had to think about it.

Agreed, it's a lot of money and definitely more than I'd be willing to spend. However, compared to running such 400b models on a GPU cluster it's extremely cheap (and much slower)

malux85 · 2024-07-29T08:48:22 1722242902

Surely not with GPUs? Could you direct me to them please?

RealStickman_ · 2024-07-29T16:19:14 1722269954

This project only supports CPU inference at the moment, so no, no GPUs. I just estimated what some 256GB or RAM might cost monthly on the low end.

atemerev · 2024-07-29T16:00:31 1722268831

But this is not a problem, if you have money to buy 1TB memory, you can easily get a motherboard that supports it for around $800-$1000. It will be cheaper than attempting to build a cluster with 4x256GB RAM.

GPU inference is another thing, as high-VRAM GPUs are artificially priced way high so they could only be bought by corporations. However, if you attempt to build a cluster with say 10 4090s to obtain some 240GB VRAM, you won’t have enough electricity to run it at home.

I am currently building a 4x4090 rig, but that’s probably the maximum I could have at home giving my budget / available power restrictions. And that’s only 96G VRAM, slightly more than a single A100.

angoragoats · 2024-07-29T19:13:58 1722280438

> However, if you attempt to build a cluster with say 10 4090s to obtain some 240GB VRAM, you won’t have enough electricity to run it at home.

The average home (in the United States at least) has more than enough power to run 10 GPUs, or even more, with plenty left over for other appliances. Back of the napkin math is that ten high-end GPUs, at 350W each, need approximately 30-35A of power, and these days the bare minimum service in a home is 100A (and many homes have more than this; mine has 200A, for example). This could be accomplished on two standard 15A circuits, or perhaps more comfortably two 20A circuits.

You can also throttle your cards down to conserve power; I know that 3090s set to max power of 225W only lose ~5% of performance when running LLM workloads, while saving ~125W of power at full load.

atemerev · 2024-07-29T19:23:02 1722280982

Well, I live in an apartment in Switzerland, we have different standards.

angoragoats · 2024-07-29T19:42:25 1722282145

Yeah, makes sense. I was just trying to point out that "you won't have enough elecricity to run it at home" is not true for many single-family homeowners.

I'm not familiar with the typical Swiss apartment, so I'm sorry if this is a stupid question, but do you not think that your apartment could sustain a ~15A load at 230V? That's how much you'd need for ten RTX 4090s at full power.

atemerev · 2024-07-29T21:02:53 1722286973

It is not just the video cards. CPUs, mainboards, cooling systems, losses/inefficiencies… E.g. 4x4090 Comino Grando workstation is rated for 2811W, which is OK. 8x will probably trip my circuit breaker already, unless I plug it into higher-power circuit for kitchen appliances.

angoragoats · 2024-07-29T21:57:38 1722290258

Yep, in the US you’d need to use at least two separate circuits as well. I’m just saying that there probably is sufficient electricity for a crazy 10+ GPU setup, in most single family houses at least.

BizarroLand · 2024-07-30T19:12:05 1722366725

If you're going that far out you might as well just tap into your breaker box and add a 40a 110v circuit. As long as you can run your cluster near the breaker box it would not be a huge expense to do that, and if like most homes your breaker is in the garage then that would make venting the excess heat out fairly simple.

angoragoats · 2024-07-30T21:51:59 1722376319

I don’t condone doing this under any circumstances without first consulting an electrician. Lots of things could go wrong; for example, 40A @ 120V is not a very common configuration, and most standard household receptacles are rated for only 20A at most. Consult an electrician, folks.

BizarroLand · 2024-07-31T17:04:44 1722445484

Thanks, I did mean to imply that you would be paying an electrician to do the work to code, but I did also overlook that someone who wants to run a 405b Llama ai at home might be more of a diy'er than the average person.

Kirby64 · 2024-07-30T23:39:35 1722382775

Or just run your PSU off 240V. Much more standard supply, and most switched mode supplies already support 240V. They’re typically more efficient, too.

angoragoats · 2024-07-31T20:06:12 1722456372

That works for sure, though it's less common in the US to have a 240v circuit available (with an appropriate receptacle) anywhere you might put a large server. Using two 120v circuits, you're more likely to be able to set it up with what you have now, without requiring any electrical work.

Kirby64 · 2024-07-31T20:12:17 1722456737

The comparison was a 120V 40A circuit. A 240V-20A socket would be more common than a 120V 50A socket. Both technically exist, but a 240V 20A outlet would be cheaper, safer, and more common. A 240V 20A outlet is actually somewhat common for larger window mounted AC units.

angoragoats · 2024-07-31T23:29:24 1722468564

> The comparison was a 120V 40A circuit.

Yes, which was a suggestion in reply to my comment mentioning the use of two standard 120v circuits, which I still maintain is the easiest and cheapest path for most people.

gumboshoes · 2024-07-28T22:04:36 1722204276

A prediction: distributed LAN AI will soon be a normal network device, as simple as a media server to install and run. Querying your home AI instead of Alexa or Siri. Desktop integration. Hints of Tony Stsrk's Jarvis.

gumboshoes · 2024-07-28T22:04:36 1722204276

A prediction: distributed LAN AI will soon be a normal network device, as simple as a media server to install and run. Querying your home AI instead of Alexa or Siri. Desktop integration. Hints of Tony Stsrk's Jarvis.

cryptoboid · 2024-07-29T13:14:23 1722258863

I love this. How does it compare to something like https://petals.dev/?