Can My Water Cooled Raspberry Pi Cluster Beat My MacBook? (the-diy-life.com)
149 points by gilad on March 26, 2021 | hide | past | favorite | 66 comments



2GB Pi is $55 on amazon. 4GB version is $62. That'd be $450+ for the eight pies.

Ryzen 2700 launched at $300 (and tapered down to ~$200) and would pretty much run circles around such a Pi cluster.

Just saying. These Pi clusters can be a cool and fun thing to build but if you're looking for compute power, you'd be better served by a mid-range desktop.


Those 8 pis have all their hardware, though. The Ryzen needs all the rest of the hardware to finish off the computer.


Well, he also needed to add a cooler and a bunch of tubes & eight mounts to get coolant running on each Pi, eight ethernet cables, eight power cables, a power supply with enough outputs, a 16-port ethernet switch, and (correct me if I'm wrong) eight SD cards, if only to store a bootloader that can do pxe.

So eight Pies alone do not make a complete cluster, just as one Ryzen alone does not make a complete computer. A few times eight times cheap can turn out to be quite expensive, depending on how much cheap actually is.

I haven't paid much attention to component prices recently but my rule of thumb for budget builds is to start with $80 for each component that isn't a GPU or CPU. Go with stock cooler, grab 80 for PSU, 80 for mobo, 80 for RAM, 80 for storage, see where you end up. In the same ballpark, but the real computer will pack a whole lot more punch (even if you had less RAM total).


Jeff Geerling did a blog post[0] and video (series) on this very thing a while ago that goes into some of the hardware and costs. Whilst the costs are far from negligible, it looks like a hella lot of fun!

[0] https://www.jeffgeerling.com/blog/2020/raspberry-pi-cluster-...


Don’t forget this quote from that page:

> It's slightly more cost-effective and usually more power-efficient to build or buy a small NUC ("Next Unit of Computing") machine that has more raw CPU performance, more RAM, a fast SSD, and more expansion capabilities.

But is building some VMs to simulate a cluster on an NUC fun?

I would say, "No." Well, not as much as building a cluster of Raspberry Pis!

---

You don’t build a RPi cluster for speed or cost or efficiency. You do it because it’s fun. And that’s okay.


2GB pi starts at $35 at actual Raspberry Pi distributors, 4GB at $55, although there will probably be added shipping cost.


Yeah, nice. I just assumed Amazon is representative of what most people would end up paying for them.

For my region, the actual distributors start at 44 EUR for 2GB and 64 EUR for 4GB.


Would it though? The cluster has more cores (each Pi 4 is a quad-core), so for certain workloads it seems like it could realistically beat a single 8-core with higher clocks.

As always, a benchmark is the only thing that will prove it.


It absolutely would. These cores are in a completely different class.

If you care about benchmarks, a Pi4 will score around 200 points (1 core) or 550 points (4 core) at 1.5GHz on GB5. At 2GHz you can reach around 700 points multi-core.

Ryzen 2700 will easily go above 6000 points in multi-core bench without overclocking.

So even in a theoretical, embarrassingly parallel workload with minimal sharing between cluster nodes, the Zen will be faster than eight pies. It's not even a contest if you also need some I/O, shared memory, or heavy SIMD.


Where are you getting these "points" from? Without a real world benchmark I don't see why you're so confident about this.

The video in the OP showed 8 pis, and assuming 4 cores each that's 32 cores total.

A raytracing benchmark would be interesting because you could divide the work up-front and not have to worry about communication between nodes in the cluster, and each node doesn't need that much memory.

Maybe the Ryzen 2700 beats the 8-Pi cluster, but clearly there's some number X of Pis that will beat the Ryzen. Maybe X = 10 Pis, or 20, or 200? Idk, but it could also just be 8. There's no way to know for sure without a benchmark.


GB5 = Geekbench 5. It is a real world benchmark.


In terms of compute per dollar, yes. In terms of compute per joule, no.


I don't think that's true either. You can get 8-core desktop Ryzen parts that have a TDP of 65 watts. I have a passively cooled 8-core Ryzen system that uses a 240 watt power supply.

Also: by the time you've wired up 8 raspis you end up using quite a bit of power just to connect them all together with a switch.

Raspberry Pi 4s need a maximum of 15 watts each. So 120 watts just for the computers. Even if you discount the power consumption of the switch, my 240 watt Ryzen computer is still going to beat that joule-for-joule.

Edit: one more thing, that 240 watt system also powers a 75 watt GPU, so it's definitely more wattage than really required for the CPU alone.


You're calculating the raspi 4 power consumption based off of recommended USB power supply current rating. The actual expected load power consumption without peripherals is 1/5 to 1/3 of that (3-5 W rather than 15 W).

https://www.raspberrypi.org/documentation/hardware/raspberry...

https://www.pidramble.com/wiki/benchmarks/power-consumption

https://raspi.tv/2019/how-much-power-does-the-pi4b-use-power...


You can just undervolt the Ryzen and it will still run circles around the Raspberry Pi while consuming less power, if you are into that sort of thing.


That is not true. The energy efficiency will go down if you downclock because the static to dynamic power ratio will increase.


static to dynamic power ratio isn't efficiency though (and static power in modern desktop chips is tiny compared to the dynamic power). The dynamic power is not linearly related to processing speed (in fact it's much worse). If you downclock and undervolt a ryzen processor it will use much less power than the decrease in speed (e.g. from stock, if you drop performance by ~20% you might get a ~40% power decrease). Obviously at a certain point you will start to get worse again but most chips are not at peak performance/watt at their stock settings because raw performance and performance/cost also matters.
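The scaling described above follows from the classic dynamic-power model P ≈ C·V²·f. A quick illustration with made-up numbers (not measurements of any particular chip):

```python
# Rough dynamic-power model: P_dyn ~ C * V^2 * f.
# If you downclock AND undervolt by the same factor, dynamic power
# falls roughly with the cube of that factor, while performance
# roughly tracks frequency alone.
scale = 0.8                # illustrative: 20% lower frequency and voltage
perf_ratio = scale         # performance ~ frequency
power_ratio = scale ** 3   # V^2 contributes scale^2, f contributes scale
print(f"performance: {perf_ratio:.0%}, dynamic power: {power_ratio:.0%}")
# -> performance: 80%, dynamic power: 51%
```

Which is in the same ballpark as the "~20% less performance for ~40% less power" figure above, ignoring static power.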


That scaling has a limited range before static power becomes dominant. Compute efficiency is compute / energy (total power * time). Total power includes static power. An SBC pulls far less from the wall than an x86 desktop could ever hope to when calculating the first million digits of pi.

As an example: even if I halted my desktop at 0 MHz and it still magically took the same time to calculate the first million digits of pi as a raspberry pi, it still would be using far more power.


Static power ratio is increasing in modern processing nodes, to the point where a "rush to idle state" strategy starts to make sense because you can power down subsystems in idle states but you can't do that if you lower processing speed. The tradeoff would of course be different in a chip that was architecturally designed from the ground up for low-performance use, but that would be from things other than just a lower frequency for the same chip.


I would really like to see how a raspi cluster fares with "real world" loads, like running an app or a distributed system, instead of calculating pi. I'm genuinely curious whether there would be any gains from running Docker containers, each on a separate Raspberry Pi. I'm a web developer, so my technology stack is almost always the same: an app server, a background worker, a queue, and a database. Often the app server and the background worker are the same process, so a cluster of 3-4 Pi's would be sufficient for such workloads. Theoretically the combined horsepower of all these Pi's should stack up and deliver better performance than writing code on my M1. Or perhaps I'm trying to solve a problem that doesn't exist.


> Theoretically the combined horsepower of all these [4] Pi's should stack up and deliver better performance than writing code on my M1.

If that is the case, then M1 should be the slowest CPU Apple has used in the past 10 years.


M1 is good at ops/watt and also ops/second. M1 probably will still beat the Pi cluster by having a tighter Pareto curve.



I think that's a great idea! Personally I'd like a nice failover test using a Pi cluster: if Pis fail and get replaced on the fly, would processes still be up and running flawlessly? Probably a pretty useless and expensive setup, but I think it'd be awesome to watch.


This is fun and all, but the benchmarks here aren't really what the author seems to think they are. This isn't "computationally expensive": the script is basically all control flow, where the CPU spends more time doing variable lookups than actual computation. This means most of the pipeline width sits completely unused, which is a pretty large disservice to the M1 and the i5.

There's also no control for the thermal throttling of the M1, which is probably why the 100,000 example is performing worse.


There's no mention of M1 in the article


The video specifically says that it's an Air with an i5


>no control for thermal throttling

there's no control because that's part of what he's measuring.


I tried FindingPrimesMulti.py on a 3970x:

  Find all primes up to: 10000 using 256 processes.
  Time elasped: 0.51 seconds
  Number of primes found 1229

  Find all primes up to: 100000 using 256 processes.
  Time elasped: 36.71 seconds
  Number of primes found 9592

  Find all primes up to: 200000 using 256 processes.
  Time elasped: 149.55 seconds
  Number of primes found 17984
EDIT: ran it again:

  Find all primes up to: 200000 using 256 processes.
  Time elasped: 145.76 seconds
  Number of primes found 17984


That's interesting - I just tried on my Ryzen 5 3600 and got the following:

10k using 48 processes: 0.82 seconds (using the original single threaded script this was actually faster at 0.65)

100k using 48 processes: 28.41s

200k using 48 processes: 99.68s

--- EDIT - looking at resource monitor python.exe is only using 7-8% of total available CPU resources

--- EDIT 2 - switching to ThreadPool from multiprocessing.dummy brought the 10k result down from 0.8 seconds to 0.3 seconds, but didn't impact the 100k or 200k results
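The swap described in EDIT 2 is essentially a drop-in change, since multiprocessing.dummy mirrors the Pool API with threads. A minimal sketch (not the actual FindingPrimesMulti.py, whose structure may differ):

```python
# multiprocessing.dummy exposes the same Pool API but backed by threads,
# so pool startup is much cheaper -- though the GIL still serializes
# CPU-bound Python bytecode, which would explain why the larger runs
# saw no improvement.
from multiprocessing.dummy import Pool as ThreadPool

def is_prime(n):
    # Naive trial division up to sqrt(n); 0 and 1 are not prime.
    if n < 2:
        return False
    return all(n % i for i in range(2, int(n ** 0.5) + 1))

with ThreadPool(8) as pool:
    total = sum(pool.map(is_prime, range(2, 10_001)))
print(total)  # 1229 primes below 10,000
```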


sorry, was using python2 - using: python3 ./FindingPrimesMulti.py

  Find all primes up to: 10000 using 256 processes.
  Time elasped: 0.36 seconds
  Number of primes found 1229

  Find all primes up to: 100000 using 256 processes.
  Time elasped: 15.7 seconds
  Number of primes found 9592

  Find all primes up to: 200000 using 256 processes.
  Time elasped: 58.07 seconds
  Number of primes found 17984
(this was debian 10)


My M1 Air results are similar to yours.


Is this with a fixed multiprocessing script?


is this what you mean? FindingPrimesMulti-1.zip


If you have mpi4py then I would try this one:

https://github.com/joshjerred/mpi4py-with-multiprocessing-Ch...

To me this would be a fairer comparison between a single CPU and the Pi cluster.


After installing mpi4py3 via apt-get, running python3 primeMP.py is lots faster:

  Find all primes up to: 10000
  Nodes: 1
  Time elasped: 0.12 seconds
  [1229]
  Primes discovered: 1229

  Find all primes up to: 100000
  Nodes: 1
  Time elasped: 4.52 seconds
  [9592]
  Primes discovered: 9592

  Find all primes up to: 200000
  Nodes: 1
  Time elasped: 17.43 seconds
  [17984]
  Primes discovered: 17984
I even tried:

  Find all primes up to: 1000000
  Nodes: 1
  Time elasped: 383.2 seconds
  [78498]
  Primes discovered: 78498


That makes much more sense :)

I found this: https://setiathome.berkeley.edu/cpu_list.php

ARMv7 looks like it's the Raspberry Pi; looking at this, I can't see any computational value in such a cluster setup.


Using Ray distributed would be a better stress test. Computing primes this way probably isn't the best way to saturate the cores: you spend a lot of time on Python VM operations versus pure number crunching.

Using numeric arrays chunked into blocks of number ranges would be more efficient (and therefore "crunchier").

https://ray.io/
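A chunked numeric-array approach like the one suggested might look like the NumPy sketch below (block size and limit are made up for illustration; a real Ray version would ship each block to a remote task):

```python
import numpy as np

def primes_in_block(lo, hi, base_primes):
    # Sieve the half-open block [lo, hi) by crossing off multiples of each
    # base prime with a vectorized slice assignment -- no per-number Python loop.
    flags = np.ones(hi - lo, dtype=bool)
    for p in base_primes:
        # First multiple of p inside the block that is >= p*p.
        start = max(p * p, ((lo + p - 1) // p) * p)
        if start < hi:
            flags[start - lo::p] = False
    return np.flatnonzero(flags) + lo

limit = 200_000

# Base primes up to sqrt(limit), from a small classic sieve.
root = int(limit ** 0.5) + 1
sieve = np.ones(root + 1, dtype=bool)
sieve[:2] = False
for p in range(2, int(root ** 0.5) + 1):
    if sieve[p]:
        sieve[p * p::p] = False
base = np.flatnonzero(sieve)

# Chunk the range into blocks; in a cluster, each block is an independent task.
block = 50_000
count = sum(primes_in_block(lo, min(lo + block, limit + 1), base).size
            for lo in range(2, limit + 1, block))
print(count)  # 17984, matching the 200k runs above
```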


If they're thinking pure CPU benchmarking, prime95/mprime has several options, such as:

[this is from my recollection]

* completely in-register only, no caches — they warn an organic workload will never ever do this, and not to run it if you're not confident in your cooling

* somewhat more normal math-heavy workloads that don't do any IO but do access the caches like a normal person

* etc

The python benchmark might be fair, though? I wouldn't be surprised if the ARM chip on the pi is like, fast, but then when it came to doing something more holistic - like a bunch of python vm operations - something about the mac is more robust than the pi in a significant way.


The test script didn't share the load across processors so the author is rerunning the tests. It seems the graphs therefore are misleading (for now).

Nevertheless this is not as interesting as testing the M1 chip on the latest MacBook offering. I feel a bit misled but perhaps it was just my fondness for the M1 causing this bias.


I can confirm for you that there is no rational reason to assume the author's own MacBook is a new M1 system.


There are only 1229 primes up to 10,000, not 1230 as the article says. Not sure whether this is a bug in the code or a typo in the article. I still remember this because 25 years ago, as a teenager, I spent quite some time making primality testing as fast as I could. My implementation was certainly not as naive as the one from the article - only testing up to the square root of n, only testing against primes I had found before - but not sophisticated in any way; for that I lacked the mathematical knowledge. I cannot remember exactly how fast I got it, but I am pretty sure it was sub-one-second, like 0.2 or 0.3 seconds maybe, for the range up to 10,000. On a 50 MHz i486.
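The optimizations described (keeping the primes found so far, and only trial-dividing by those up to √n) look roughly like this in Python; a sketch, not the original i486 code:

```python
from itertools import takewhile

def primes_up_to(limit):
    primes = []
    for n in range(2, limit + 1):
        # Only trial-divide by previously found primes p with p*p <= n;
        # takewhile stops the scan at the sqrt bound.
        if all(n % p for p in takewhile(lambda p: p * p <= n, primes)):
            primes.append(n)
    return primes

print(len(primes_up_to(10_000)))  # 1229 -- not 1230
```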


1 is considered prime by some.


There are also people who consider the world to be flat though. There is a very thorough mathematical definition of prime numbers and 1 is not prime according to that definition.


The definition of a prime number is having exactly two factors. As the multiplicative identity, one has only one factor. Some might see primes as being numbers with >=2 factors. This interpretation is less useful. What is completely useless is a definition that makes exclusions of specific numbers as dogma. What does that say about anything? It’s totally arbitrary and unexplanatory.


No, that is not the correct definition of a prime. You can find the right one on Wikipedia.

The reason that 1 is not a prime is to preserve unique factorization. A basic fact from number theory is that every integer uniquely factors as a product of primes, say 21 = 7 times 3. If 1 were a prime, we'd also have 21 = 7 times 3 times 1. That's bad.

In more general number rings, other units (e.g. i) are also not considered primes for the same reason. The exclusion is not arbitrary.


As a complement of your comment: For more information, see Unique Factorization Domains (https://en.wikipedia.org/wiki/Unique_factorization_domain) for the generalization of how prime factorization work on things other than integers.


For a mathematician the suggestion that 1 might be a prime is on a par with a suggestion that a circle is technically a type of triangle.


Whether or not that is true, the code snippet in the article shows this comment:

   #0 and 1 are not primes


That's in the multiprocessing version, which was written by someone else and returns 1229 primes.

The original code includes 1 as a prime.


His workload is embarrassingly parallel[0] and the message-passing between the Pis is basically free, since the results are so small to send back (and there's nothing to send to each node on startup). He's effectively doing a trivial map-reduce[1].

He could probably get even better than the Pi cluster by using a (single) GPU.

[0] https://en.wikipedia.org/wiki/Embarrassingly_parallel

[1] https://en.wikipedia.org/wiki/MapReduce
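The trivial map-reduce described here can be sketched with just the stdlib: split the range into chunks, "map" each to a worker, then "reduce" by summing the tiny per-chunk counts. Chunk sizes are illustrative, not the author's script:

```python
from multiprocessing import Pool

def is_prime(n):
    # Trial division up to sqrt(n); 0 and 1 are not prime.
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def count_primes(bounds):
    # "Map" step: each worker counts primes in its own half-open range.
    lo, hi = bounds
    return sum(1 for n in range(lo, hi) if is_prime(n))

if __name__ == "__main__":
    limit, workers = 10_000, 4
    step = limit // workers
    chunks = [(lo, min(lo + step, limit + 1))
              for lo in range(2, limit + 1, step)]
    with Pool(workers) as pool:
        # "Reduce" step: each result is a single small integer, so the
        # communication cost is negligible -- exactly the situation described.
        total = sum(pool.map(count_primes, chunks))
    print(total)  # 1229 primes below 10,000
```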


I assumed it was the M1. (It's an Intel MacBook.)


Same. The M1 would be much faster. Also, the water cooling is going to add cost that could instead go toward more Pis. I know I can run mine at 1800 MHz with a tiny heatsink, thermal paste, and a 40mm fan without it going above 65C, so the cluster would be very easy to cool with a large fan.


Site got HOD. Here is an archive: https://archive.is/jhbQV


What is HOD ?


Hug of Death


From a quick overview, it seems like the speedup is achieved because the comparison is 4 local processes versus 16 processes; since there is only a single gather operation, it doesn't seem like latency would make much of a difference.

I wonder what the performance would look like on a 5-year-old 16-core CPU from eBay for like $20, compared to the Pi cluster.


I have a 2-socket E5-2650 v2 system with 16 cores, which is about as fast here as the modern Ryzen above (but 50% slower on real-world tasks):

0.8 seconds for 10,000

28.16 seconds for 100,000


This would be an interesting comparison against the Pi cluster on more realistic tasks; I have a suspicion the dual E5 will outperform the Pi cluster.


I'd be curious about the relative power draw -- is the performance/watt actually better?


M1 Air Results:

10,000: 0.52 seconds.

100,000: 41.48 seconds.

200,000: 157.68 seconds.


Please redo the benchmark against the M1.


See my comment above.


Don't know why I'm getting downvoted, but thanks, that's exactly what I was looking for.


I suspect because the author of the blogpost probably doesn't own an M1 Mac (he mentions benchmarking several machines and it seems unlikely he would have forgotten the M1 if he had one), so your comment (interpreted as a request to the author) amounts to "Please spend about $1000 on another computer because I want another benchmark result", not to mention shipping time would make it days before he could report back. In other words, it seems to be an unreasonable request.

If you meant to ask the other commenters here, then something like "Does anyone have an M1 Mac? If so, would you run his benchmarks and post the results here?" would have been clearer.


Ah okay, I didn't want the author to spend $1000, I just wanted someone with a M1 like wil421 to redo the benchmark.



