I'm really excited about our massively parallel future, not least because I have to run scientific code that would greatly benefit from it. But at the moment it's so hard to program for this sort of thing: can someone explain why, in simple terms, something like OpenCL or CUDA is so damn complicated? Is there any way to avoid having to have a low-level understanding of how a GPU or co-processor works, short of waiting for the vendor to implement an easier-to-use solution? I'm thinking about, e.g., Matlab's "parfor" (parallel for) command, which is super easy to use.
The article states that "All of these [CUDA/OpenCL] problems go away with the Phi. It's a pure x86 programming model that everyone is used to. It's a question of reusing, rather than rewriting, code" but I find it hard to believe I can just drop existing code into it and expect decent performance.
I expect that multicore MapReduce will become popular (e.g. http://mapreduce.stanford.edu/ or google for more literature).
I suppose it is strictly less powerful than regular MapReduce, but at least with Hadoop the system administration costs are too high for a lot of people, and machines are getting beefier, so you can get a lot more done on one machine. In another recent thread there was an MS Research paper about "ill-conceived" Hadoop clusters processing 14GB of data...
The main benefits I see are:
1) You don't have to write in a specialized language. You should be able to use any language with a good implementation. Scientific code often has Matlab, R, C++, and Python glued together.
2) MapReduce lets you write sequential code, which is easier to learn.
3) You can adapt/port sequential legacy code easily, so you can use a lot of your existing code.
MapReduce is of course similar to "parallel for" but more powerful -- parallel for is essentially the map stage. The reduce stage adds a lot. For some reason most people who haven't programmed MapReduce think of MapReduce as just mapping, and they don't understand reducing.
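To make the map/reduce split concrete, here's a toy word-count sketch in plain C++ (my own illustration of the two stages, not any particular framework's API):

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    // Map stage: each input chunk independently emits (word, 1) pairs.
    // This is the "parallel for" part -- chunks can be processed concurrently.
    std::vector<std::pair<std::string, int>> map_chunk(const std::string& chunk) {
        std::vector<std::pair<std::string, int>> pairs;
        std::istringstream in(chunk);
        std::string word;
        while (in >> word) pairs.emplace_back(word, 1);
        return pairs;
    }

    // Reduce stage: merge all pairs with the same key into a single count.
    // This is the part a plain parallel for doesn't give you.
    std::map<std::string, int> reduce(
            const std::vector<std::vector<std::pair<std::string, int>>>& all_pairs) {
        std::map<std::string, int> counts;
        for (const auto& pairs : all_pairs)
            for (const auto& kv : pairs)
                counts[kv.first] += kv.second;
        return counts;
    }

    int main() {
        std::vector<std::string> chunks = {"the cat sat", "the dog sat"};
        std::vector<std::vector<std::pair<std::string, int>>> mapped;
        for (const auto& c : chunks) mapped.push_back(map_chunk(c));  // parallelizable
        for (const auto& kv : reduce(mapped))
            std::cout << kv.first << ": " << kv.second << "\n";
    }

The map calls are independent, so they're the part you'd farm out across cores; the reduce is where the results from all of them get merged back together.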
If you want to do it quick and dirty, don't underestimate "xargs -P" :) That's your "parallel for" that works with any language. You can run that on your Matlab, Python, C++, etc. You need a serialization library but there are a lot of those around. It works well and with a minimum of programming effort.
Have you used OpenMP (http://en.wikipedia.org/wiki/OpenMP)? It has the flavor of parfor -- you identify the embarrassingly parallel loops in your C or Fortran, put in something like
#pragma omp parallel for
in front of them, and your code carries over pretty much intact -- it handles the thread wrappers. You can add other pragmas for the times when you need locking.
This is a much less intrusive setup than CUDA; you don't have to worry about loading data, or double/float conflicts.
The OpenMP extensions could be a very good fit for scientific programming on this coprocessor.
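For a sense of how little the code changes, here's a minimal sketch of my own (a toy loop, assuming a compiler with OpenMP enabled, e.g. -fopenmp):

    #include <omp.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1000000;
        std::vector<double> x(n, 1.0), y(n, 2.0);
        double sum = 0.0;

        // The only change to the serial loop is one pragma. OpenMP splits the
        // iterations across threads and handles thread creation/joining for you;
        // the reduction clause takes care of the locking you'd otherwise need
        // for 'sum'.
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; ++i) {
            x[i] = x[i] + 3.0 * y[i];
            sum += x[i];
        }

        std::printf("sum = %f, threads available = %d\n", sum, omp_get_max_threads());
    }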
Thanks; I'll take a look. But OpenMP is CPU-only, right? Apple's got their (currently less portable, admittedly) Grand Central Dispatch that does something similar. But as far as I know, if you want portable GPU code your only option is OpenCL, and even then it requires optimisation depending on what device you're using it on (or so I've heard).
OpenMP 4.0 is likely to have support for accelerator devices (i.e., move the necessary data onto the device, run the computation, and move the results back to the host). In fact, that's one of the ways you can use the Phi right now (Intel has extensions to OpenMP).
Or, if you can't be bothered to wait for such a standard, you should have a look at OpenACC[1], which does exactly this and exists now. You end up adding code like
#pragma acc kernels loop
on top of your for loops, and it does the low-level work for you.
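Roughly like this, as a toy sketch of my own (assuming a compiler with OpenACC support, e.g. PGI's; the array names and sizes are just illustrative):

    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1000000;
        std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);
        float* pa = a.data();  // raw pointers so the data clauses can name them
        float* pb = b.data();
        float* pc = c.data();

        // The directive tells the compiler to copy a and b to the device,
        // run the loop there, and copy c back -- no explicit buffer
        // management or kernel launch code.
        #pragma acc kernels loop copyin(pa[0:n], pb[0:n]) copyout(pc[0:n])
        for (int i = 0; i < n; ++i) {
            pc[i] = pa[i] + pb[i];
        }

        std::printf("c[0] = %f\n", pc[0]);
    }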
The primary theoretical advantage of Larrabee's descendants on Intel's side over the GPGPU lineage from Nvidia is that GPUs were never designed for things like branch-heavy code or debuggers.
Compiler support for CUDA is well ahead of that for the Xeon Phi. I haven't seen any evidence that Intel has been successful yet in extending the auto-vectorizing capabilities of ICC to massively parallel environments.
For CUDA, one can write code in Python, Haskell, C++, etc.
At this point, the Xeon Phi only offers vectorized C and Fortran. There is a narrow domain of HPC code designed to run on multi-socket Xeon processors that could probably be retargeted trivially to the Xeon Phi.
The Intel Xeon Phi homepage mentions that Cilk++ and Intel TBB support the Phi, which would imply that the Phi has C++ support.
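As a rough idea of what that C++ model looks like, here's a toy TBB sketch of my own (nothing Phi-specific; it assumes the TBB headers and library are available):

    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const size_t n = 1000000;
        std::vector<double> v(n, 1.0);

        // TBB chops the index range into chunks and runs the lambda on worker
        // threads; the same C++ source that targets host Xeons can be rebuilt
        // for (or offloaded to) the Phi with Intel's tools.
        tbb::parallel_for(tbb::blocked_range<size_t>(0, n),
            [&](const tbb::blocked_range<size_t>& r) {
                for (size_t i = r.begin(); i != r.end(); ++i)
                    v[i] = v[i] * 2.0 + 1.0;
            });

        std::printf("v[0] = %f\n", v[0]);
    }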
If you run Linux on the Phi (which Intel ships in their Manycore Platform Support Stack), then anything that runs on i386 Linux should run there, which should include Python and Haskell.
If you don't choose to use Linux on the Phi, then your tooling options will be limited, just like they are if you choose not to use a regular OS on a PC.
OpenCL is basically the systems programming language for GPUs, like C is for CPUs: as low-level as possible while still having the capability of being hardware-agnostic. Something like OpenCL has to exist in order for the higher-level alternatives to be possible. And just like MATLAB offers acceptable performance, we will eventually see some good high-level languages for GPGPU.
You can be sure it won't be any currently popular language, though, because almost none of them have support for the kind of pervasive parallelism needed (why isn't parfor the default kind of for loop?), and of those that do support that kind of parallelism (typically by being purely functional and supporting lazy evaluation), none come equipped with the necessary facilities to optimize the code for a particular GPU (by tweaking how the problem is split up).
The Xeon Phi does have an advantage in that it's easier to get the code running in the first place, but the difficulty of optimizing it for a massively parallel GPU-like architecture is (for now) exactly the same as faced by OpenCL and CUDA users.
If you can formulate your problem in terms of tensor math (for example, neural networks) you can use Theano [1]. It is a very high-level approach: you give it mathematical expressions and it generates and executes GPU code for you. When I last used it, it only supported CUDA, so it was Nvidia-only, but it may have been extended to OpenCL by now.
Thanks for the link, it looks interesting. My reluctance about learning CUDA (aside from the time investment, which I'm happy to believe isn't actually too bad) is that the lower the level I have to work at, the less time I'd be able to spend writing "useful code" and the less flexibility I'd have in the future if I want to move to something non-Nvidia.
I'm sure many other people are in the same position. I don't mind sacrificing a little performance for a much easier programming environment.
Learning CUDA is very much doable; if you're already a competent programmer you'll be up and running in a relatively short time. The biggest hurdle will be gaining sufficient insight into the intricacies of memory management and how to squeeze maximum performance out of your hardware, but if you're satisfied with just a sizeable bump then it should be easy enough.
If you want to go all out, you can probably get to the required level of knowledge with a few weeks to a few months of really hard work, depending on where you are coming from in terms of experience.
The docs are excellent, there are tons of examples, and Google will usually turn up a solution in case you hit a snag.
Lots of cores means lots of threads - 4 hyperthreads per core. So 200+ threads could be handy in high-bandwidth low-latency situations. E.g. it could make a dandy server for delivering low-latency streams like stock quotes. You would want a new kernel model, where you bound program threads to particular hyperthreads and blocked in user-space on events - so your hyperthread cache was always hot.
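That kind of pinning is already doable from user space on Linux; here's a minimal sketch of my own using the GNU-specific pthread affinity call (the CPU number is arbitrary, and the user-space event loop is only described in a comment):

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    static void* worker(void* arg) {
        // Pin this thread to one specific logical CPU (hyperthread).
        int cpu = *static_cast<int*>(arg);
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        // In the model described above, the thread would now block/poll on its
        // event queue in user space so its hyperthread's cache stays hot.
        std::printf("worker pinned to CPU %d (now on CPU %d)\n", cpu, sched_getcpu());
        return nullptr;
    }

    int main() {
        int cpu = 3;  // logical CPU number, chosen arbitrarily for the sketch
        pthread_t t;
        pthread_create(&t, nullptr, worker, &cpu);
        pthread_join(t, nullptr);
    }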
Not horribly relevant from a software perspective, but as a hardware geek I think the way they're doing threading is really interesting. Big OoO processors like a normal Xeon or a POWER7 usually use simultaneous multithreading (SMT), which means that you have instructions from two threads being fed to the execution units every clock cycle, and since they often aren't in contention for the same resources you get higher throughput. Some in-order processors like a Niagara often use block multithreading (BMT), where you run one thread until you get a cache miss, then switch to another thread with some delay as the pipeline is flushed.
What the Phi is doing is combining those approaches, running two threads simultaneously and switching threads out on cache misses. This way you only double rather than quadruple your control structures, but you don't have your cores sitting entirely unutilized when you're swapping threads. A really nifty compromise, I think.
I wonder if you could use this to run lots and lots of small virtualized nodes? I know that's not the intended use case but I wonder if it's possible and would perform well?
Memory bandwidth would limit the performance unless your VMs were running something with a very small memory footprint, about 160 megabytes per instance.
Having said that, I've used Unix workstations with less RAM attached than that through much less than 7GBps worth of bus...
The first paragraph mentions that you can boot an OS on each of those cores. Or did you mean something else? Of course memory is tight as rbanffy pointed out, so loading up so much duplicate OS overhead might not be the most productive use.
Edit: 1 GHz sounds like plenty until you realize it's in-order execution. This would be noticeably sluggish.
VMware has something called "Transparent Page Sharing", which allows VMs to share read-only pages. So it might actually be feasible to start up 50 VMs if they were all running the same software and only had a small amount of private state.
I read about the Xeon Phi a few months ago and I really want to get my hands on one. My problems are in the embarrassingly parallelizable class (or almost). Having said that, does anybody know how each Xeon Phi core performs with respect to a modern Intel processor (i7 or Xeon) for standard numerical code (Linpack etc.)?
They're Pentium-class x86 cores and barely anything more than front-end control processors for the vector hardware. The fact that it's x86 is almost incidental, IMHO; the vector ISA is all programmers should really care about on the Phi.
Maybe I'm missing something, but do in-order architectures even have much use for branch prediction? They can't speculatively execute based on the outcome of a conditional, right?
Sure they can. Branch prediction allows you to move an instruction along the pipeline before the instruction determining its outcome has been retired. Without branch prediction, every conditional jump will potentially stall the pipeline. With branch prediction, a correctly predicted branch executes quickly, and a mis-predicted branch results in a pipeline flush.
Instruction re-ordering is more about taking full advantage of multiple execution units (ALUs, etc.), or not completely stalling the pipeline to wait on a memory fetch.
I was curious about this, since the point is that you can run "Xeon" code on the Xeon Phi, but the Phi doesn't support SSE, MMX, or AVX so wouldn't you need to recompile to take advantage of the vector hardware?
I'm sure a lot of us here would love to have a cheap supercomputer to perform some heavily parallelizable workloads on our servers.
And is this going to dramatically lower the cost of virtual private instances?
I really can't wait to see some benchmarks.
GPGPUs are notoriously hard to extract high performance from. If you're an enterprise customer with no readily available GPGPU code, the Xeon Phi makes much more sense than GPUs for a few reasons.
First, the talent pool for HPC x86 programmers is an order of magnitude larger than for expert GPGPU programmers - Xeon Phi is just a virtual x86 server rack with TCP/IP messaging.
Second, the amount of time and effort needed to extract useful performance from GPGPUs is considerable; if it's for internal use and you're not selling the code to the masses, you're likely to get the same amount of performance with less time on the Phi, unless you're going for "the best, regardless of money & time".
Last, most enterprise customers will want ECC and other compute features. Those are only sold in the pro-level $3k+ Teslas, which happen to be more expensive than the Phi.
Where GPGPU does make sense: consumer-level hardware using already-written software (workstations and hobbyists in particular) and businesses where performance/watt is crucial at any cost.
The Phi architecture is closer to a GPU than to a rack of x86 servers.
With 60 cores reading memory over a common ring bus, latency will kill you unless you tile your loops to maximize cache reuse [1] (see the sketch below), at which point you might as well write GPU code that preloads blocks of data into local memory and works there.
Also, to beat the performance of a normal x86 CPU you must use the vector instructions, which gives you all the little problems GPU warps are known to cause.
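To illustrate what tiling means here, a minimal cache-blocking sketch of my own for a matrix multiply (the block size is illustrative, not tuned for the Phi's caches or vector width; C is assumed to start zeroed):

    #include <algorithm>
    #include <vector>

    // The naive triple loop touches whole rows/columns of B for every element
    // of C, blowing out the cache. The tiled version works on BLOCK x BLOCK
    // tiles that fit in cache, so each loaded cache line is reused many times.
    void matmul_tiled(const std::vector<double>& A, const std::vector<double>& B,
                      std::vector<double>& C, int n) {
        const int BLOCK = 64;  // illustrative tile size, not tuned
        for (int ii = 0; ii < n; ii += BLOCK)
            for (int kk = 0; kk < n; kk += BLOCK)
                for (int jj = 0; jj < n; jj += BLOCK)
                    // Work entirely inside one tile of A, B and C.
                    for (int i = ii; i < std::min(ii + BLOCK, n); ++i)
                        for (int k = kk; k < std::min(kk + BLOCK, n); ++k) {
                            double a = A[i * n + k];
                            for (int j = jj; j < std::min(jj + BLOCK, n); ++j)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }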
A Phi is not quite as you imagine it; it's more like a single machine with 60/61 cores (when you cat /proc/cpuinfo, there are 60/61 entries).
The main optimisation techniques for GPUs aren't difficult to grasp (in my opinion), although not all classes of problem are suited to execution on a GPU.
How do differences in supporting-hardware/density change the effective pricing? It seems to me like the Phi cores could end up cheaper once you factor in everything else that you need to get that many cores of something else.
The 50-core Xeon sounded great up until the $2600 price tag. You can buy a lot of CPU cores for that much money if you have quad-socket boards like that.