I'm really excited about our massively parallel future, not least because I have to run scientific code that would greatly benefit from it. But at the moment it's so hard to program for this sort of thing: can someone explain, in simple terms, why something like OpenCL or CUDA is so damn complicated? Is there any way to avoid needing a low-level understanding of how a GPU or co-processor works, rather than expecting the vendor to provide an easier-to-use solution? I'm thinking of, e.g., Matlab's "parfor" (parallel for) command, which is super easy to use.

The article states that "All of these [CUDA/OpenCL] problems go away with the Phi. It's a pure x86 programming model that everyone is used to. It's a question of reusing, rather than rewriting, code" but I find it hard to believe I can just drop existing code into it and expect decent performance.




I expect that multicore MapReduce will become popular (e.g. http://mapreduce.stanford.edu/ or google for more literature).

I suppose it is strictly less powerful than regular MapReduce, but at least with Hadoop the system administration costs are too much for a lot of people, and machines are getting beefier, so you can get a lot more done on one machine. In another recent thread there was a MS research paper about "ill-conceived" Hadoop clusters processing 14GB of data...

The main benefits I see are:

1) You don't have to write in a specialized language. You should be able to use any language with a good implementation. Scientific code often has Matlab, R, C++, and Python glued together.

2) MapReduce lets you write sequential code, which is easier to learn.

3) You can adapt/port sequential legacy code easily, so you can use a lot of your existing code.

MapReduce is of course similar to "parallel for" but more powerful -- parallel for is essentially the map stage. The reduce stage adds a lot. For some reason most people who haven't programmed MapReduce think of MapReduce as just mapping, and they don't understand reducing.
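
To make that split concrete, here's a minimal single-machine sketch in plain C++ (sum-of-squares is just a made-up example): the transform is the embarrassingly parallel "map" stage that a parallel for covers, and the accumulate is the "reduce" stage that combines the per-item results.

    #include <algorithm>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<double> samples = {1.0, 2.0, 3.0, 4.0};

        // Map: apply an independent function to every element.
        // This is the stage a parallel for (or a pool of workers) handles.
        std::vector<double> squared(samples.size());
        std::transform(samples.begin(), samples.end(), squared.begin(),
                       [](double x) { return x * x; });

        // Reduce: combine the per-element results into a single value.
        double sum = std::accumulate(squared.begin(), squared.end(), 0.0);
        std::printf("%f\n", sum);  // 30.000000
        return 0;
    }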

If you want to do it quick and dirty, don't underestimate "xargs -P" :) That's your "parallel for" that works with any language. You can run that on your Matlab, Python, C++, etc. You need a serialization library but there are a lot of those around. It works well and with a minimum of programming effort.


Have you used OpenMP (http://en.wikipedia.org/wiki/OpenMP)? It has the flavor of parfor -- you identify the embarrassingly parallel loops in your C or Fortran, put in something like

    #pragma omp parallel for
in front of them, and your code carries over pretty much intact -- it handles the thread creation and scheduling for you. You can add other pragmas for the times when you need locking.
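
To flesh that out, a minimal sketch (a made-up loop, assuming a compiler with OpenMP support, e.g. g++ -fopenmp):

    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1 << 20;
        std::vector<double> y(n);
        double sum = 0.0;

        // The loop body is untouched; the pragma splits the
        // iterations across threads.
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            y[i] = 2.0 * i + 1.0;  // any independent per-iteration work
        }

        // A reduction clause handles the shared accumulator
        // without explicit locking.
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; ++i) {
            sum += y[i];
        }

        std::printf("%f\n", sum);
        return 0;
    }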

This is a much less intrusive setup than CUDA; you don't have to worry about loading data, or double/float conflicts.

The OpenMP extensions could be a very good fit for scientific programming on this coprocessor.


Thanks; I'll take a look. But OpenMP is CPU-only, right? Apple has their (currently less portable, admittedly) Grand Central Dispatch, which does something similar. But as far as I know, if you want portable GPU code your only option is OpenCL, and even then it requires optimisation depending on what device you're running it on (or so I've heard).


OpenMP 4.0 is likely to have support for accelerator devices (i.e., move the necessary data onto the device, run the computation, and move the results back to the host). In fact, that's one of the ways you can use the Phi right now (Intel has extensions to OpenMP).

Or if you can't be bothered to wait for such a standard, you should have a look at OpenACC[1], which does exactly this and exists now. You end up adding code like

    #pragma acc kernels loop
on top of your for loops, and it does the low-level work for you.
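
For illustration, a minimal sketch (a made-up SAXPY loop; assumes a compiler with OpenACC support, e.g. PGI's; the data clauses just make the transfers explicit):

    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1 << 20;
        std::vector<float> x(n, 1.0f), y(n, 2.0f);
        const float a = 3.0f;
        float* xp = x.data();
        float* yp = y.data();

        // The directive asks the compiler to offload the loop,
        // copying x to the device and y both ways.
        #pragma acc kernels loop copyin(xp[0:n]) copy(yp[0:n])
        for (int i = 0; i < n; ++i) {
            yp[i] = a * xp[i] + yp[i];  // independent per-iteration work
        }

        std::printf("%f\n", yp[0]);  // 5.000000
        return 0;
    }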

[1] http://www.openacc-standard.org/


The primary theoretical advantage of Larrabee's descendants on Intel's side over Nvidia's GPGPU lineage is that GPUs were never designed for things like branch-heavy code or debuggers.

Compiler support for CUDA is well ahead of that for the Xeon Phi. I haven't seen any evidence that Intel has yet succeeded in extending the auto-vectorizing capabilities of ICC to massively parallel environments.

For CUDA, one can write code in Python, Haskell, C++, etc.

At this point, the Xeon Phi only offers vectorized C and Fortran. There is a narrow domain of HPC code designed to run on multi-socket Xeon processors that could probably be retargeted trivially to the Xeon Phi.


The Intel Xeon Phi homepage mentions that Cilk++ and Intel TBB support the Phi, which would imply that the Phi has C++ support.
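
For what it's worth, TBB code is ordinary C++; here's a minimal sketch (a made-up fill-and-sum; link with -ltbb) of the kind of source that would presumably carry over if the Phi support works as advertised:

    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>
    #include <tbb/parallel_reduce.h>
    #include <cstdio>
    #include <functional>
    #include <vector>

    int main() {
        const size_t n = 1 << 20;
        std::vector<double> y(n);

        // Parallel map: TBB splits the index range across worker threads.
        tbb::parallel_for(size_t(0), n, [&](size_t i) {
            y[i] = 2.0 * i + 1.0;
        });

        // Parallel reduce over the results.
        double sum = tbb::parallel_reduce(
            tbb::blocked_range<size_t>(0, n), 0.0,
            [&](const tbb::blocked_range<size_t>& r, double acc) {
                for (size_t i = r.begin(); i != r.end(); ++i) acc += y[i];
                return acc;
            },
            std::plus<double>());

        std::printf("%f\n", sum);
        return 0;
    }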

If you run Linux on the Phi (which Intel ships in their Manycore Platform Support Stack), then anything that runs on 386 Linux should run there, which should include Python and Haskell.

If you choose not to run Linux on the Phi, then your tooling options will be limited, just as they are if you choose not to use a regular OS on a PC.


OpenCL is basically the systems programming language for GPUs, like C is for CPUs: as low-level as possible while still being hardware-agnostic. Something like OpenCL has to exist in order for the higher-level alternatives to be possible. And just as MATLAB manages to offer acceptable performance at a high level, we will eventually see some good high-level languages for GPGPU.

You can be sure it won't be any currently popular language, though, because almost none of them have support for the kind of pervasive parallelism needed (why isn't parfor the default kind of for loop?), and of those that do support that kind of parallelism (typically by being purely functional and supporting lazy evaluation), none come equipped with the necessary facilities to optimize the code for a particular GPU (by tweaking how the problem is split up).

The Xeon Phi does have an advantage in that it's easier to get the code running in the first place, but the difficulty of optimizing it for a massively parallel GPU-like architecture is (for now) exactly the same as faced by OpenCL and CUDA users.


If you can formulate your problem in terms of tensor math (for example, neural networks), you can use Theano [1]. It is a very high-level approach: you give it mathematical expressions and it generates and executes GPU code for you. When I last used it, it only supported CUDA, so it was Nvidia-only, but it may have been extended to OpenCL by now.

1. http://deeplearning.net/software/theano/


Use thrust (http://thrust.github.com/). Or just learn CUDA. It's really not that bad.
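
To give a feel for Thrust, here's a minimal sketch (a made-up sum of squares; builds with nvcc). The STL-style containers and algorithms hide the kernel launches and the host/device copies:

    #include <thrust/device_vector.h>
    #include <thrust/functional.h>
    #include <thrust/host_vector.h>
    #include <thrust/reduce.h>
    #include <thrust/transform.h>
    #include <cstdio>

    int main() {
        const int n = 1 << 20;
        thrust::host_vector<float> h_x(n, 1.5f);

        // Constructing a device_vector from a host_vector copies to the GPU.
        thrust::device_vector<float> d_x = h_x;
        thrust::device_vector<float> d_y(n);

        // "Map": square every element on the device.
        thrust::transform(d_x.begin(), d_x.end(), d_y.begin(),
                          thrust::square<float>());

        // "Reduce": sum the results on the device.
        float total = thrust::reduce(d_y.begin(), d_y.end(), 0.0f,
                                     thrust::plus<float>());

        std::printf("%f\n", total);  // n * 1.5f * 1.5f
        return 0;
    }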


Thanks for the link, it looks interesting. My reluctance about learning CUDA (aside from the time investment in learning, which I'm happy to believe isn't actually too bad) is that the lower the level I have to work at, the less time I'd be able to spend writing "useful code", and the less flexibility I'd have in the future if I want to move to something non-Nvidia.

I'm sure many other people are in the same position. I don't mind sacrificing a little performance for a much easier programming environment.


Learning CUDA is very much doable; if you're already a competent programmer you'll be up and running in a relatively short time. The biggest hurdle will be gaining sufficient insight into the intricacies of memory management and how to squeeze maximum performance out of your hardware, but if you're satisfied with just a sizeable bump it should be easy enough.

If you want to go all out, you can probably get to the required level of knowledge with a few weeks to a few months of really hard work, depending on where you're coming from in terms of experience.

The docs are excellent, there are tons of examples, and Google will usually turn up a solution if you hit a snag.



