I'm really excited about our massively parallel future, not least because I have to run scientific code that would greatly benefit from it. But at the moment it's so hard to program for this sort of thing: can someone explain why, in simple terms, something like OpenCL or CUDA is so damn complicated? Is there any way to avoid having to have a low-level understanding of how a GPU or co-processor works, short of waiting for the vendor to implement an easier-to-use solution? I'm thinking about, e.g., Matlab's "parfor" (parallel for) command, which is super easy to use.
The article states that "All of these [CUDA/OpenCL] problems go away with the Phi. It's a pure x86 programming model that everyone is used to. It's a question of reusing, rather than rewriting, code" but I find it hard to believe I can just drop existing code into it and expect decent performance.
I expect that multicore MapReduce will become popular (e.g. http://mapreduce.stanford.edu/ or google for more literature).
I suppose it is strictly less powerful than regular MapReduce, but at least with Hadoop the system administration costs are too high for a lot of people, and machines are getting beefier, so you can get a lot more done on one machine. In another recent thread there was an MS Research paper about "ill-conceived" Hadoop clusters processing 14GB of data...
The main benefits I see are:
1) You don't have to write in a specialized language. You should be able to use any language with a good implementation. Scientific code often has Matlab, R, C++, and Python glued together.
2) MapReduce lets you write sequential code, which is easier to learn.
3) You can adapt/port sequential legacy code easily, so you can use a lot of your existing code.
MapReduce is of course similar to "parallel for" but more powerful -- parallel for is essentially the map stage. The reduce stage adds a lot. For some reason most people who haven't programmed MapReduce think of MapReduce as just mapping, and they don't understand reducing.
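To make the map/reduce split concrete, here's a toy word-count sketch in plain C++ (my own illustration of the two stages, not any particular framework's API):

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    // Map stage: each input chunk independently emits (word, 1) pairs.
    // This is the "parallel for" part -- chunks can be processed concurrently.
    std::vector<std::pair<std::string, int>> map_chunk(const std::string& chunk) {
        std::vector<std::pair<std::string, int>> pairs;
        std::istringstream in(chunk);
        std::string word;
        while (in >> word) pairs.emplace_back(word, 1);
        return pairs;
    }

    // Reduce stage: merge all pairs with the same key into a single count.
    // This is the part a plain parallel for doesn't give you.
    std::map<std::string, int> reduce(
            const std::vector<std::vector<std::pair<std::string, int>>>& all_pairs) {
        std::map<std::string, int> counts;
        for (const auto& pairs : all_pairs)
            for (const auto& kv : pairs)
                counts[kv.first] += kv.second;
        return counts;
    }

    int main() {
        std::vector<std::string> chunks = {"the cat sat", "the dog sat"};
        std::vector<std::vector<std::pair<std::string, int>>> mapped;
        for (const auto& c : chunks) mapped.push_back(map_chunk(c));  // parallelizable
        for (const auto& kv : reduce(mapped))
            std::cout << kv.first << ": " << kv.second << "\n";
    }

The map calls are independent, so they're the part you'd farm out across cores; the reduce is where the results from all of them get merged back together.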
If you want to do it quick and dirty, don't underestimate "xargs -P" :) That's your "parallel for" that works with any language. You can run that on your Matlab, Python, C++, etc. You need a serialization library but there are a lot of those around. It works well and with a minimum of programming effort.
Have you used OpenMP (http://en.wikipedia.org/wiki/OpenMP)? It has the flavor of parfor -- you identify the embarrassingly parallel loops in your C or Fortran, put in something like
#pragma omp parallel for
in front of them, and your code carries over pretty much intact -- it handles the thread wrappers. You can add other pragmas for the times when you need locking.
This is a much less intrusive setup than CUDA; you don't have to worry about loading data, or double/float conflicts.
The OpenMP extensions could be a very good fit for scientific programming on this coprocessor.
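For a sense of how little the code changes, here's a minimal sketch of my own (a toy loop, assuming a compiler with OpenMP enabled, e.g. -fopenmp):

    #include <omp.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1000000;
        std::vector<double> x(n, 1.0), y(n, 2.0);
        double sum = 0.0;

        // The only change to the serial loop is one pragma. OpenMP splits the
        // iterations across threads and handles thread creation/joining for you;
        // the reduction clause takes care of the locking you'd otherwise need
        // for 'sum'.
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; ++i) {
            x[i] = x[i] + 3.0 * y[i];
            sum += x[i];
        }

        std::printf("sum = %f, threads available = %d\n", sum, omp_get_max_threads());
    }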
Thanks; I'll take a look. But OpenMP is CPU-only, right? Apple's got their (currently less portable, admittedly) Grand Central Dispatch that does something similar. But as far as I know, if you want portable GPU code your only option is OpenCL, and even then it requires optimisation depending on what device you're using it on (or so I've heard).
OpenMP 4.0 is likely to have support for accelerator devices (i.e., move the necessary data onto the device, run the computation, and move the results back to the host). In fact, that's one of the ways you can use the Phi right now (Intel has extensions to OpenMP).
Or, if you can't be bothered to wait for such a standard, you should have a look at OpenACC[1], which does exactly this and exists now. You end up adding code like
#pragma acc kernels loop
on top of your for loops, and it does the low-level work for you.
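Roughly like this, as a toy sketch of my own (assuming a compiler with OpenACC support, e.g. PGI's; the array names and sizes are just illustrative):

    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1000000;
        std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);
        float* pa = a.data();  // raw pointers so the data clauses can name them
        float* pb = b.data();
        float* pc = c.data();

        // The directive tells the compiler to copy a and b to the device,
        // run the loop there, and copy c back -- no explicit buffer
        // management or kernel launch code.
        #pragma acc kernels loop copyin(pa[0:n], pb[0:n]) copyout(pc[0:n])
        for (int i = 0; i < n; ++i) {
            pc[i] = pa[i] + pb[i];
        }

        std::printf("c[0] = %f\n", pc[0]);
    }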
The primary theoretical advantage of Larrabee's descendants on Intel's side over the GPGPU lineage from Nvidia is that GPUs were never designed for things like branch-heavy code or debuggers.
Compiler support for CUDA is well ahead of that for the Xeon Phi. I haven't seen any evidence that Intel has been successful yet in extending the auto-vectorizing capabilities of ICC to massively parallel environments.
For CUDA, one can write code in Python, Haskell, C++, etc.
At this point, the Xeon Phi only offers vectorized C and Fortran. There is a narrow domain of HPC code designed to run on multi-socket Xeon processors that could probably be retargeted trivially to the Xeon Phi.
The Intel Xeon Phi homepage mentions that Cilk++ and Intel TBB support the Phi, which would imply that the Phi has C++ support.
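As a rough idea of what that C++ model looks like, here's a toy TBB sketch of my own (nothing Phi-specific; it assumes the TBB headers and library are available):

    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const size_t n = 1000000;
        std::vector<double> v(n, 1.0);

        // TBB chops the index range into chunks and runs the lambda on worker
        // threads; the same C++ source that targets host Xeons can be rebuilt
        // for (or offloaded to) the Phi with Intel's tools.
        tbb::parallel_for(tbb::blocked_range<size_t>(0, n),
            [&](const tbb::blocked_range<size_t>& r) {
                for (size_t i = r.begin(); i != r.end(); ++i)
                    v[i] = v[i] * 2.0 + 1.0;
            });

        std::printf("v[0] = %f\n", v[0]);
    }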
If you run Linux on the Phi (which Intel ships in their Manycore Platform Support Stack), then anything that runs on i386 Linux should run there, which should include Python and Haskell.
If you don't choose to use Linux on the Phi, then your tooling options will be limited, just like they are if you choose not to use a regular OS on a PC.
OpenCL is basically the systems programming language for GPUs, like C is for CPUs: as low-level as possible while still having the capability of being hardware-agnostic. Something like OpenCL has to exist in order for the higher-level alternatives to be possible. And just like MATLAB offers acceptable performance, we will eventually see some good high-level languages for GPGPU.
You can be sure it won't be any currently popular language, though, because almost none of them have support for the kind of pervasive parallelism needed (why isn't parfor the default kind of for loop?), and of those that do support that kind of parallelism (typically by being purely functional and supporting lazy evaluation), none come equipped with the necessary facilities to optimize the code for a particular GPU (by tweaking how the problem is split up).
The Xeon Phi does have an advantage in that it's easier to get the code running in the first place, but the difficulty of optimizing it for a massively parallel GPU-like architecture is (for now) exactly the same as faced by OpenCL and CUDA users.
If you can formulate your problem in terms of tensor math (for example, neural networks) you can use Theano [1]. It is a very high-level approach: you give it mathematical expressions and it generates and executes GPU code for you. When I last used it, it only supported CUDA, so it was Nvidia-only, but it may have been extended to OpenCL by now.
Thanks for the link, it looks interesting. My reluctance about learning CUDA (aside from the time investment, which I'm happy to believe isn't actually too bad) is that the lower the level I have to work at, the less time I'd be able to spend writing "useful code" and the less flexibility I'd have in the future if I want to move to something non-Nvidia.
I'm sure many other people are in the same position. I don't mind sacrificing a little performance for a much easier programming environment.
Learning CUDA is very much doable; if you're already a competent programmer you'll be up and running in a relatively short time. The biggest hurdle will be gaining sufficient insight into the intricacies of memory management and how to squeeze maximum performance out of your hardware, but if you're satisfied with just a sizeable bump then it should be easy enough.
If you want to go all out, you can probably get to the required level of knowledge with a few weeks to a few months of really hard work, depending on where you are coming from in terms of experience.
The docs are excellent, there are tons of examples, and Google will usually turn up a solution in case you hit a snag.
Lots of cores means lots of threads - 4 hyperthreads per core. So 200+ threads could be handy in high-bandwidth low-latency situations. E.g. it could make a dandy server for delivering low-latency streams like stock quotes. You would want a new kernel model, where you bound program threads to particular hyperthreads and blocked in user-space on events - so your hyperthread cache was always hot.
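That kind of pinning is already doable from user space on Linux; here's a minimal sketch of my own using the GNU-specific pthread affinity call (the CPU number is arbitrary, and the user-space event loop is only described in a comment):

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    static void* worker(void* arg) {
        // Pin this thread to one specific logical CPU (hyperthread).
        int cpu = *static_cast<int*>(arg);
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        // In the model described above, the thread would now block/poll on its
        // event queue in user space so its hyperthread's cache stays hot.
        std::printf("worker pinned to CPU %d (now on CPU %d)\n", cpu, sched_getcpu());
        return nullptr;
    }

    int main() {
        int cpu = 3;  // logical CPU number, chosen arbitrarily for the sketch
        pthread_t t;
        pthread_create(&t, nullptr, worker, &cpu);
        pthread_join(t, nullptr);
    }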
Not horribly relevant from a software perspective, but as a hardware geek I think the way they're doing threading is really interesting. Big OoO processors like a normal Xeon or a POWER7 usually use simultaneous multithreading (SMT), which means that you have instructions from two threads being fed to the execution units every clock cycle, and since they often aren't in contention for the same resources you get higher throughput. Some in-order processors like a Niagara often use block multithreading (BMT), where you run one thread until you get a cache miss, then switch to another thread with some delay as the pipeline is flushed.
What the Phi is doing is combining those approaches, running two threads simultaneously and switching threads out on cache misses. This way you only double rather than quadruple your control structures, but you don't have your cores sitting entirely unutilized when you're swapping threads. A really nifty compromise, I think.
I wonder if you could use this to run lots and lots of small virtualized nodes? I know that's not the intended use case but I wonder if it's possible and would perform well?
Memory bandwidth would limit the performance unless your VMs were running something with a very small memory footprint, about 160 megabytes per instance.
Having said that, I've used Unix workstations with less RAM attached than that through much less than 7GBps worth of bus...
The first paragraph mentions that you can boot an OS on each of those cores. Or did you mean something else? Of course memory is tight as rbanffy pointed out, so loading up so much duplicate OS overhead might not be the most productive use.
Edit: 1 GHz sounds like plenty until you realize it's in-order execution. This would be noticeably sluggish.
VMware has something called "Transparent Page Sharing", which allows VMs to share read-only pages. So it might actually be feasible to start up 50 VMs if they were all running the same software and only had a small amount of private state.
I read about the Xeon Phi a few months ago and I really want to get my hands on one. My problems are in the embarrassingly parallelizable class (or almost). Having said that, does anybody know how each Xeon Phi core performs with respect to a modern Intel processor (i7 or Xeon) for standard numerical code (Linpack etc.)?
They're Pentium-class x86 cores and barely anything more than front-end control processors for the vector hardware. The fact that it's x86 is almost incidental, IMHO; the vector ISA is all programmers should really care about on the Phi.
Maybe I'm missing something, but do in-order architectures even have much use for branch prediction? They can't speculatively execute based on the outcome of a conditional, right?
Sure they can. Branch prediction allows you to move an instruction along the pipeline before the instruction determining its outcome has been retired. Without branch prediction, every conditional jump will potentially stall the pipeline. With branch prediction, a correctly predicted branch executes quickly, and a mis-predicted branch results in a pipeline flush.
Instruction re-ordering is more about taking full advantage of multiple execution units (ALUs, etc.), or not completely stalling the pipeline to wait on a memory fetch.
I was curious about this, since the point is that you can run "Xeon" code on the Xeon Phi, but the Phi doesn't support SSE, MMX, or AVX so wouldn't you need to recompile to take advantage of the vector hardware?
I'm sure a lot of us here would love to have a cheap supercomputer to perform some heavily parallelizable workloads on our servers.
And is this going to dramatically lower the cost of virtual private instances?
I really can't wait to see some benchmarks.
GPGPUs are notoriously hard to extract high performance from. If you're an enterprise customer with no readily available GPGPU code, the Xeon Phi makes much more sense than GPUs for a few reasons.
First, the talent pool for HPC x86 programmers is an order of magnitude larger than for expert GPGPU programmers - Xeon Phi is just a virtual x86 server rack with TCP/IP messaging.
Second, the amount of time and effort needed to extract useful performance from GPGPUs is considerable; if it's for internal use and you're not selling the code to the masses, you're likely to get the same amount of performance with less time on the Phi, unless you're going for "the best, regardless of money & time".
Last, most enterprise customers will want ECC and other compute features. Those are only sold in the pro-level $3k+ Teslas, which happen to be more expensive than the Phi.
Where GPGPU does make sense: consumer-level hardware using already-written software (workstations and hobbyists in particular) and businesses where performance/watt is crucial at any cost.
The Phi architecture is closer to a GPU than to a rack of x86 servers.
With 60 cores reading memory over a common ring bus, latency will kill you unless you tile your loops to maximize cache reuse [1] (see the sketch below), at which point you might as well write GPU code that preloads blocks of data into local memory and works there.
Also, to beat the performance of a normal x86 CPU you must use the vector instructions, which gives you all the little problems GPU warps are known to cause.
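To illustrate what tiling means here, a minimal cache-blocking sketch of my own for a matrix multiply (the block size is illustrative, not tuned for the Phi's caches or vector width; C is assumed to start zeroed):

    #include <algorithm>
    #include <vector>

    // The naive triple loop touches whole rows/columns of B for every element
    // of C, blowing out the cache. The tiled version works on BLOCK x BLOCK
    // tiles that fit in cache, so each loaded cache line is reused many times.
    void matmul_tiled(const std::vector<double>& A, const std::vector<double>& B,
                      std::vector<double>& C, int n) {
        const int BLOCK = 64;  // illustrative tile size, not tuned
        for (int ii = 0; ii < n; ii += BLOCK)
            for (int kk = 0; kk < n; kk += BLOCK)
                for (int jj = 0; jj < n; jj += BLOCK)
                    // Work entirely inside one tile of A, B and C.
                    for (int i = ii; i < std::min(ii + BLOCK, n); ++i)
                        for (int k = kk; k < std::min(kk + BLOCK, n); ++k) {
                            double a = A[i * n + k];
                            for (int j = jj; j < std::min(jj + BLOCK, n); ++j)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }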
A Phi is not quite as you imagine it; it's more like a single machine with 60/61 cores (when you cat /proc/cpuinfo, there are 60/61 entries).
The main optimisation techniques for GPUs aren't difficult to grasp (in my opinion), although not all classes of problem are suited to execution on a GPU.
How do differences in supporting-hardware/density change the effective pricing? It seems to me like the Phi cores could end up cheaper once you factor in everything else that you need to get that many cores of something else.
The 50-core Xeon sounded great up until the $2600 price tag. You can buy a lot of CPU cores for that much money if you have quad-socket boards like that.