As you can see from the pictures, each of the devices uses its own power supply, so there should be considerable room to improve overall power consumption.
It is almost always the case that programs written for clusters -- including those which use MPI -- are communication bound. Right now the limiting factor is not these little processors, but rather the interconnect speed. If I remember correctly, Raspberry Pi has 10/100 megabit ethernet. [Edit: just checked, this is the case] So while this looks like a lot of fun, it's not very useful for anything meaningful yet.
Of course it's not fair to compare this to an infiniband cluster (that's not the point of this exercise), but I'd really be interested to see a cluster built on $0.50 ARM chips with at least a gigabit ethernet interconnect. A couple of years from now -- given the low entry cost and lower infrastructure costs (cooling/power consumption/etc) -- that could be a game changer.
There are a few different companies that have ARM + custom interconnect systems out there or in development. They're not necessarily cost-competitive yet, but they're an interesting start.
Oddly enough, Cray recently sold their interconnect tech to Intel [0]. Intel seems to be planning to integrate it on-chip down the road [1], which seems to leave Cray serving as a somewhat quirky system integrator longer-term.
It's even worse because the Ethernet is connected via USB. What I would love to see is 16 or 32 ARM cores on a single card connected via a high-speed interconnect such as InfiniBand, with 4 or 8 of these cards packed into a chassis.
I found it amusing because at Blekko I was playing with one and talked about building a cluster with them. I think it would be tremendously valuable as a teaching tool to build smallish (24 - 96 machine) clusters and teach folks to write distributed algorithms. It's a stretch to call it a 'super computer' but it is quite educational.
One of my favorite systems questions is to have someone walk through the design and implementation of a system where all the machines in the system respond to a query Q based on the contents of a linked list L. The system has an API which consists of L <- M(op) (mutate the list), R <- Q(id) (respond to a query based on the contents of the list), and R <- S() (report on the stability of the list). Start with M(op) being idempotent, then non-idempotent, etc. Folks who've had a good introduction to state machines will immediately recognize a number of problems that arise as you control for correctness. If folks get through the whole sequence, we're talking about a function f(C) which takes a correctness coefficient from 1.0 (fully correct) to 0.0 (unspecified) and looking at the performance of the system across that range.
That kind of stuff you could easily do on a 48 node Pi Cluster.
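For what it's worth, the exercise can be sketched in a few lines of plain Python. Everything here (the Replica class, op ids, the agreement check) is my own stand-in for the real distributed version, with simulated nodes instead of real machines:

```python
# Toy sketch of the M/Q/S API on simulated "nodes" (no real network).

class Replica:
    def __init__(self):
        self.list = []        # the replicated list L (as a Python list)
        self.applied = set()  # op ids already seen, to make replay idempotent

    def M(self, op_id, value):
        """Mutate: append value, ignoring duplicate deliveries of the same op."""
        if op_id in self.applied:
            return
        self.applied.add(op_id)
        self.list.append(value)

    def Q(self):
        """Query: answer based on the current contents of the list."""
        return sum(self.list)

def S(replicas):
    """Stability: do all replicas currently agree on L?"""
    return len({tuple(r.list) for r in replicas}) == 1

nodes = [Replica() for _ in range(3)]
for n in nodes:
    n.M("op-1", 10)
nodes[0].M("op-1", 10)         # duplicate delivery: harmless, M is idempotent
print(S(nodes), nodes[0].Q())  # True 10
```

Making M non-idempotent (drop the `applied` set) and delivering ops in different orders on different nodes is exactly where the interesting failures start.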
With the sort of work I do nowadays (large-scale ML and NLP), I generally need very little synchronization, i.e. my tasks are usually embarrassingly parallel. I typically save final results in a centralized store (DB or NFS) and look at it there.
MPI is essential for molecular dynamics simulations. You split the "box" of atoms/molecules up into different domains -- one on each processor. Occasionally you'll have particles wander into the next box. The information of these ghost particles must be passed around and MPI facilitates this.
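A toy, MPI-free sketch of that hand-off step -- the "ranks" here are just list slots, where a real code would use something like MPI_Sendrecv (or mpi4py) to pass the wandering particles to the neighbouring domain:

```python
# Toy 1-D domain decomposition: particles that drift past a domain's
# edge are handed to the neighbouring "rank" (periodic boundaries).
NDOMAINS, WIDTH = 4, 10.0
# One list of particle x-positions per domain.
domains = [[1.0 + 10 * i, 5.0 + 10 * i] for i in range(NDOMAINS)]

def step(domains, dt=1.0, v=6.0):
    """Move every particle, then migrate those that left their box."""
    new = [[] for _ in domains]
    for rank, parts in enumerate(domains):
        for x in parts:
            x += v * dt
            owner = int(x // WIDTH) % NDOMAINS   # which domain owns it now
            new[owner].append(x % (WIDTH * NDOMAINS))
    return new

domains = step(domains)
print([len(d) for d in domains])  # [2, 2, 2, 2]
```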
I ran some numerical simulations on my Raspberry Pi and on my laptop. With the LCD on, my laptop used 20.4 watt seconds to do the calculation and the RPi used 17.1 watt seconds. The RPi drew 3 watts and my laptop drew about 60 watts. I think, even with my screen on, that my laptop would be more efficient if I had used 2 cores instead of 1 in the calculation (or if I had just turned off the LCD).
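Spelling out the arithmetic (watt seconds are joules, so runtime = energy / power):

```python
# Implied runtimes from the figures above.
laptop_J, laptop_W = 20.4, 60.0
rpi_J, rpi_W = 17.1, 3.0

print(f"laptop: {laptop_J / laptop_W:.2f} s")  # ~0.34 s
print(f"rpi:    {rpi_J / rpi_W:.2f} s")        # ~5.70 s
```

So the laptop finished roughly 17x faster and, despite the screen, used only slightly more energy per run.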
My conclusion: the RPi doesn't even win on FLOPS/Watt, let alone $/FLOPS.
That might matter for a cluster, but that's (obviously) not the target market for a Pi.
If you have less than $100 to spend, a better $/FLOPS ratio on a MacBook (or anything else, really) doesn't matter. One is available, one plainly is not.
For people who CAN spend enough, this usually is just a (third? nth?) gadget to play with. Like in the article (because it's really just a neat way of playing with gadgets and Lego, not 'useful' in any sense that can be quantified).
It certainly doesn't have to! I just wanted to share the results of an experiment I did with numerical computing on the RPi.
Certainly laptops aren't the ideal platform to compare against anyway. I'd compare to a 1U node that you would buy for a compute cluster. One can buy a 64-core 1U node with 64 GB of RAM for about $7k. If one wants to play around with cluster computing, you can emulate an entire cluster on one of those!
Of course, that may (or may not) change if the RPi Foundation is able to get the programming documentation for the DSP/GPU part of the processor, as a recent blog post hinted.
Well, you are missing the point a bit: the RPi's mission is to be an accessible platform for teaching kids to code on. Its role as a toy for former-80s-8-bit-geeks is secondary.
what a pleasant surprise to see our little OSS project mentioned/used... surprised as this is a Windows/VS Python IDE & they're running Linux on the nodes - http://pytools.codeplex.com
RasPi is ~ 175 MFLOPS per unit (CPU only, discounting the GPU). So, this cluster works out to 11 GFLOPS, with 16GB of RAM for > $2500 USD.
For comparison, you could buy a motherboard, 32 GB of RAM, and an Intel i5 processor for $500 that will do over 20 GFLOPS.
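Rough $/GFLOPS from those figures (my arithmetic, using only the numbers quoted above):

```python
# Price/performance comparison from the quoted figures.
pi_cluster = {"gflops": 64 * 0.175, "usd": 2500}  # 64 boards x ~175 MFLOPS
i5_box     = {"gflops": 20.0,       "usd": 500}

for name, sys in [("pi cluster", pi_cluster), ("i5 box", i5_box)]:
    print(f"{name}: {sys['usd'] / sys['gflops']:.0f} $/GFLOPS")
# pi cluster: 223 $/GFLOPS
# i5 box: 25 $/GFLOPS
```

Roughly a 9x difference in favor of the commodity box.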
So, it doesn't really stack up well from a price/performance standpoint. The value of these systems is more teaching students how to work with parallel code.
Yes, from a performance standpoint there's no reason to go after this type of system. No one is arguing this is a good system for production work.
The application of this is a teaching model. It's a lot easier to demonstrate parallelism gains on this type of platform. Scaling beyond a single ARM core is going to give you immediate performance benefits. Scaling further out to the entire cluster will continue to show returns.
With a single desktop, once you go beyond ~4 cores the gains will drop off too quickly. You just won't be able to see gains out to 64 threads on a single CPU, whereas on this cluster you should.
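One standard way to show those diminishing returns is Amdahl's law; assuming, say, a 95%-parallel job (my figure, purely illustrative):

```python
# Amdahl's-law sketch: predicted speedup for a job that is 95% parallel.
def speedup(n, parallel_frac=0.95):
    return 1.0 / ((1.0 - parallel_frac) + parallel_frac / n)

for n in (1, 4, 16, 64):
    print(f"{n:3d} nodes: {speedup(n):.1f}x")
# prints roughly 1.0x, 3.5x, 9.1x, 15.4x
```

Students can measure the real curve on the cluster and compare it against the model, which is a nice lesson in itself.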
It also doesn't hurt to have a quirky architecture to get students excited by. And yes, you could also spend some time discussing the architectural trade-offs and why this is not a cost-effective system for production use.
Click the arrow that is to the left of the commenter's name. It seems you are one of today's lucky 10,000: http://xkcd.com/1053/
I made a bad assumption because your account is almost one year old. I'm sorry. Now I see that that information is not given anywhere on this site. I think this vote method was taken from Reddit.
I was trying to find a real comparison of MIPS of this thing compared to an i7 and landed on this page of performance data for the RPi. The most interesting thing to me is the power consumption data at the bottom, which shows idle-with-network draw is around 370 mA; that should mean (ignoring power supply efficiencies) that 64 of these things should use about 120 W at idle.
If those are switch-mode power supplies, most of that current will be drawn out of phase (i.e. reactive power); not important to the domestic customer, as the meter probably won't read it, but important to the supplier, who will come and make you fit power-factor correction equipment to your supply :-)
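For anyone unfamiliar, the real-vs-apparent power split looks like this (illustrative numbers of my own, not measurements of the Pi supplies):

```python
# Apparent vs. real power for a load drawing current out of phase.
V_rms = 230.0       # mains voltage
I_rms = 0.5         # current drawn, in amps
power_factor = 0.5  # cos(phi); 1.0 would be a purely resistive load

apparent_VA = V_rms * I_rms          # what the supplier's wires must carry
real_W = apparent_VA * power_factor  # what a domestic meter actually bills

print(apparent_VA, real_W)  # 115.0 57.5
```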
Well, to be explicit, they cannot netboot directly. I've built a u-boot that boots from the network via NFS on other ARM-based systems. You still need an MMC card to start it all off, but once you get the network boot loader loaded and running you can do whatever you want. I started with some code for the Pandaboard that used that trick to boot from a USB-attached drive.
http://www.southampton.ac.uk/~sjc/raspberrypi/pi_supercomput...
Anybody with more experience using MPI?