Parallel Programming for C and C++ Done Right (speakerdeck.com)
90 points by timClicks on Aug 22, 2012 | 25 comments



Foreword: I'm biased, as I worked on CUDA for several years.

The conclusions offered by this deck are mostly FUD.

First of all, Haswell, the architecture where those transactional memory primitives are available, isn't out for another year. Saying Knights Corner was available in November 2011 is also deceptive; Intel demo'd it at SC11, but you can't buy one yet.

Second, he helpfully glosses over the fact that Cilk++'s elemental functions are identical to how CUDA and ISPC work: write a specially decorated single-threaded function, use a specially decorated function call, and end up with parallel work. I think it's exceedingly likely that the industry will standardize on this as the data-parallel methodology of choice within the next ten years. That timeframe will depend on how quickly GPUs and CPUs converge in terms of functionality (with vastly different performance characteristics). Task-parallel stuff will be done with something else.
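Roughly, the pattern looks like this in CUDA (a minimal sketch; the kernel name, sizes, and launch configuration are just illustrative):

    #include <cuda_runtime.h>

    // A scalar-looking function, specially decorated with __global__.
    __global__ void scale(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // which element this invocation owns
        if (i < n)
            data[i] *= factor;                          // plain single-threaded body
    }

    int main() {
        const int n = 1 << 20;
        float *d = 0;
        cudaMalloc((void **)&d, n * sizeof(float));
        // A specially decorated call site turns it into parallel work:
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }

Cilk++ elemental functions and ISPC follow the same shape: the body reads as sequential code, and the decoration tells the compiler/runtime to run it across many lanes or threads.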

The really difficult question will be how to get performance portability. C++ (or Fortran) code that runs well on Haswell will probably run like crap on KNC and vice-versa due to differences in the number of threads you need in flight, cache sizes, vast latency differences, etc. (Look at OpenCL running on two GPUs or especially CPU vs GPU as an example today.) Solving that is going to be the real challenge.


I'd love to get into a part of the industry where people care about stuff like this (purely for selfish reasons... web development is fun too but the barrier to entry is a lot lower).

Can anyone comment on the number and quality of "hard core" C++ jobs and what you think the trajectory will be like? Right now I use C++ almost exclusively at work but the code/concepts involved aren't too difficult.


C++ is increasingly used for high-performance and massively parallel systems. While I used to work on systems primarily written in Java (large-scale analytics and databases), everything new is being done in C++, especially any kind of high-end compute environment. C++ has a few very real advantages on modern architectures and when tackling modern problems. With C++11, it is also a pretty decent programming language in terms of expressiveness.

C++ has two big advantages over alternatives like Java: very low and deterministic processing latency (important for real-time) and a very efficient memory model (important for throughput). These are really both about memory, and C++ gives you detailed control in a way few other languages do. As to why this is important: memory performance has been scaling more slowly than most other aspects of computing architecture, to the point that it is the bottleneck for a growing number of applications. C++ allows you to be very efficient with the memory architecture without much effort. If your application is fundamentally bound by memory performance, competent C++ can get you 2-10x returns on performance in real systems relative to languages like Java. (For some other tight-loop, CPU-bound codes, not so much.)
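As a minimal sketch of what "detailed control" means in practice (the types and names here are made up), compare a contiguous layout with the pointer-chasing layout you typically get from a managed object graph:

    #include <memory>
    #include <vector>

    struct Tick { double price; double volume; };

    // Contiguous array of PODs: sequential, prefetch-friendly access.
    double sum_contiguous(const std::vector<Tick>& ticks) {
        double total = 0.0;
        for (const Tick& t : ticks) total += t.price * t.volume;
        return total;
    }

    // One heap allocation per element: a pointer chase (and likely a cache miss) per access.
    double sum_indirect(const std::vector<std::unique_ptr<Tick>>& ticks) {
        double total = 0.0;
        for (const auto& p : ticks) total += p->price * p->volume;
        return total;
    }

Both loops do the same arithmetic; on memory-bound data sets the difference comes almost entirely from how the bytes are laid out.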

I'm pretty bullish on C++11. It is not so much that I am a fan of the language as that I am a fan of what it can do in terms of performance for databases and large-scale systems. Most of the high-scale and high-performance development going forward seems to be targeted at C++ these days. That was not always the case, but the requirements of modern applications are somewhat forcing that choice.


Do you think we'll see a lot of HPC apps using C++11 for parallelism within a node, or will they stick to MPI for that? Most of the apps I've seen are MPI-only or OpenMP + MPI -- I'm not convinced that C++11 will be very relevant to them, given the minor overhead of MPI within a node (and the productivity savings of having only one API for parallelism).


MPI is a messaging interface rather than a parallelism API. Even in C++, most of the parallelism constructs are a bit of a strawman because many high-performance computing codes are written as single-threaded processes locked to individual cores and communicating over a messaging interface of some type. The parallelism is implemented at a higher level than either the messaging interface or the code. Many supercomputing platforms support MPI, but not all of them do.

The practice of a single process locked to each core communicating over a messaging interface has trickled down to more generic massively distributed systems work because it has very good properties on modern hardware. You end up doing a fair amount of functional programming in this model because multiple tasks are managed via coroutines and other lightweight event models. This architecture is very easy to scale out because it treats every core -- on the same chip, same motherboard, or same network -- as a remote resource that has to be messaged.
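In code, the model is roughly this (a minimal MPI sketch; the tag and payload are illustrative):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int token = rank;
        if (rank != 0) {
            // Every worker process sends its result to rank 0...
            MPI_Send(&token, 1, MPI_INT, 0, 7, MPI_COMM_WORLD);
        } else {
            // ...which treats every other core, local or remote, the same way.
            for (int src = 1; src < size; ++src) {
                MPI_Recv(&token, 1, MPI_INT, src, 7, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("rank 0 received %d from rank %d\n", token, src);
            }
        }
        MPI_Finalize();
        return 0;
    }

Each process is single-threaded; you launch one per core (e.g. with mpirun) and all coordination happens through messages.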

MPI has one significant problem for massively parallel systems in that it has tended to be brittle when failures occur, and on sufficiently large systems failures are a routine occurrence. There are ways to work around it, but it is not the most resilient basis for communication in extremely large systems. At the high end of HPC, MPI and similar interfaces are commonly used, but many of the next-generation non-HPC systems operating at a similar scale use custom network processing engines built on top of IP that give more fine-grained control over network behavior and semantics. This is not faster than MPI (often slower, in fact) and tends to be a bit more complex, but it allows more robustness and resilience to be built in at a lower level. MPI was designed for a set of assumptions that work for many classic supercomputing applications but don't match many current use cases.


The major thing that MPI did right, and that almost all other models have done wrong, is library support. Things like attribute caching on communicators are essential to me as a parallel library developer, but look superfluous in the simple examples and for most applications.
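For readers unfamiliar with attribute caching: a library can hang its own per-communicator state off the communicator itself instead of keeping a global lookup table. A hedged sketch (the struct and names are made up; the MPI calls are the standard ones):

    #include <mpi.h>
    #include <stdlib.h>

    typedef struct { int plan_id; } lib_state;    // library-private, per-communicator state

    static int lib_keyval = MPI_KEYVAL_INVALID;

    void lib_attach(MPI_Comm comm) {
        if (lib_keyval == MPI_KEYVAL_INVALID)
            MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, MPI_COMM_NULL_DELETE_FN,
                                   &lib_keyval, NULL);
        lib_state *s = (lib_state *)malloc(sizeof *s);
        s->plan_id = 42;
        MPI_Comm_set_attr(comm, lib_keyval, s);    // cached on the communicator
    }

    lib_state *lib_lookup(MPI_Comm comm) {
        lib_state *s = NULL;
        int found = 0;
        MPI_Comm_get_attr(comm, lib_keyval, &s, &found);
        return found ? s : NULL;                   // recovered on any later call into the library
    }

This is exactly the kind of plumbing that looks superfluous in a hello-world but that a reusable parallel library cannot live without.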

The other thing that is increasingly important in the multicore CPU space is memory locality. It's vastly more common to be limited by memory bandwidth and latency than by the execution unit. When we start analyzing approaches with a parallel complexity model based on memory movement instead of flops, the separate address space in the MPI model doesn't look so bad. The main thing that it doesn't support is cooperative cache sharing (e.g. weakly synchronized using buddy prefetch), which is becoming especially important as we get multiple threads per core.

As for fault tolerance, the MPI forum was not happy with any of the deeper proposals for MPI-3. They recognize that it's an important issue and many people think it will be a large enough change that the next standard will be MPI-4. From my perspective, the main thing I want is a partial checkpointing system by which I can perform partial restart and reattach communicators. Everything else can be handled by other libraries. My colleagues in the MPI-FT working group expect something like this to be supported in the next round, likely with preliminary implementations in the next couple years. For now, there is MPIX_Comm_group_failed(), MPIX_Comm_remote_group_failed(), and MPIX_Comm_reenable_anysource().


Any chance you can comment on the fault tolerance proposal in MPI-3?

Also, do you have any examples of custom network engines on top of IP?


If you want parallel/concurrent programming there are better languages than C++. However, additional requirements (real-time, performance, size, ...) make C++ desirable again, since C++ tries very hard to avoid tradeoffs in these terms.


My work in concurrency with C++ has been from the AAA games angle. I have found that it's a field where knowledge of parallel programming is incredibly valuable but few people end up putting the effort in to make their code take advantage of the hardware. People who are interested in parallel programming are a rare asset to have.

At least within my company we haven't used much in the way of parallel programming languages like OpenCL/CUDA/Cilk, preferring instead to use a task-based concurrency library not entirely unlike Intel's TBB, but more cross-platform. There's more investigation into GPGPU happening these days -- I'm curious to see if it'll pan out for more than just rendering applications. I'm also curious to see if the other (physics/audio/animation) areas can pull GPU time away from the rendering guys :)
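For a flavor of the task-based style, here's a rough sketch using TBB's task_group (the per-system functions are stand-ins; our in-house library is similar in spirit, not in API):

    #include <tbb/task_group.h>

    void update_physics()   { /* stand-in for real work */ }
    void update_audio()     { /* stand-in for real work */ }
    void update_animation() { /* stand-in for real work */ }

    void run_frame() {
        tbb::task_group frame;
        frame.run([] { update_physics(); });    // tasks, not raw threads;
        frame.run([] { update_audio(); });      // the scheduler maps them onto cores
        frame.run([] { update_animation(); });
        frame.wait();                           // join point for this frame's work
    }

The nice property is that engine programmers express units of work and their join points, and never touch thread counts or core affinity directly.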

The trajectory in this field is predictable. You get a hardware platform, have 5-7 years to figure out how to use it, and once you finally have it figured out something entirely different comes along.


Video games, for example - runtime, tools, etc. We had to optimize our tool pipeline to parallelize things like DXT compression, vertex welding, mip-map creation, filters, etc. Since other restrictions meant we could not safely convert two or more assets at the same time, we chose to data-parallelize (where possible) the actual asset conversion. OpenMP was used for that and helped a lot (VS2008, then VS2010).
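The data-parallel part of a pipeline like that ends up looking roughly like the sketch below (compress_block is a made-up stand-in for a real DXT codec):

    #include <cstdint>
    #include <cstring>
    #include <vector>

    struct Block { uint8_t texels[64]; };            // one 4x4 RGBA block
    struct CompressedBlock { uint8_t bytes[8]; };    // e.g. a DXT1 block

    CompressedBlock compress_block(const Block& b) { // stand-in for the real codec
        CompressedBlock out;
        std::memcpy(out.bytes, b.texels, sizeof out.bytes);
        return out;
    }

    void compress_texture(const std::vector<Block>& in, std::vector<CompressedBlock>& out) {
        out.resize(in.size());
        // Blocks are independent, so the loop parallelizes trivially across cores.
        #pragma omp parallel for
        for (long i = 0; i < (long)in.size(); ++i)
            out[i] = compress_block(in[i]);
    }

The outer read/convert/write flow stays serial per asset; only the per-block work inside the conversion fans out.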

Since the process was read, process (with data parallelization), write - and reading was much bigger than writing (you read big model, animation, or texture files and write ones ~8x smaller) - it was also "parallelized" by prefetching files into the file cache, reading them on a second thread in advance.

Allowing only one writer to the HDD turned out to be beneficial (with something like a RAID of SSDs you can do multiple writers efficiently, but that is rather pricey).


Embedded development on a lot of consumer electronics devices is mostly in C++ these days. Some of the UI elements are changing to HTML5 UIs without a real browser (wired directly into C++ methods talking to HW through "magic").

Admittedly, a lot of these jobs are overseas. However, I believe the Amazon and MS consumer electronics groups are still in the US, and I know for a fact that until this year all of the Toshiba TV firmware was being done in the US.

However, my experience in the CE industry has been that we aren't worried about massively performant parallelism, just "moderately performant" parallelism. I want to say that as of a few years ago Toshiba TVs had something like 10-30 threads running at all times in parallel.


I can't comment on wider industry trends, but I know for sure that some companies (specifically in NYC) pay decent money ($150k+) to people who do this stuff. And they can't find enough developers at this salary level (maybe they could if they paid more).


The problem is that people with this level of skill can make much more in the finance industry.


Data processing, but it's not perfect. Most of the work is for questionable clients: creepy user-tracking (anti-privacy), financial powerhouses, three letter government agencies.


Cilk/Cilk+ is not the answer, despite years of Intel promoting it and open-sourcing it. Intel has touted many other || technologies in the past: OpenMP, TBB, and now Cilk+. These are all useful tools, but none of them is the way to boldly move forward with || programming, IMHO. I believe easier || programming will come from actors, CSPs, channels, and other technologies that provide safe concurrency as the working basis.


Cilk/Cilk+ is not the answer, despite years of Intel promoting it and open-sourcing it.

Intel removed the features that gave Cilk a reason to exist: inlets and aborts. They're useful for (partially) parallelizing hard-to-parallelize problems that don't fit well into the other frameworks like OpenMP etc. Why did they remove them? I'm guessing because they were difficult to do well, and the algorithms that need them don't present such nice linear scaling graphs for marketing slides.

However, without those features, Cilk just doesn't distinguish itself enough from OpenCL, OpenMP, etc. Parallelizing easy-to-parallelize problems isn't the problem; it's the others we need help dealing with!
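To be concrete, this is the kind of "easy" fork-join code that today's Cilk Plus handles fine (a minimal sketch):

    #include <cilk/cilk.h>

    long fib(int n) {
        if (n < 2) return n;
        long a = cilk_spawn fib(n - 1);  // child strand may run in parallel
        long b = fib(n - 2);
        cilk_sync;                       // wait for the spawned child
        return a + b;
    }

What's missing is any inlet/abort-style hook to act on a child's result early or to cancel sibling work, which is exactly what something like a parallel alpha-beta search needs.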

The poster children/demos for the original Cilk were parallelized chess programs, some of which did quite well in real tournaments. It's a very well studied area that exhibits a lot of parallelism, but not in a form that's easy to extract (hence there are no competitive ones using OpenMP, GPUs, etc.). Cilk was able to do it, a major achievement. But you can't even construct those in Intel's crippled Cilk version any more. Well, not any that would be competitive, anyway, which is the point to begin with.

If your solution only solves the easy problems others have already solved, what exactly is your reason to exist?


True, Concurrency is not Parallelism [http://concur.rspace.googlecode.com/hg/talk/concur.html], but sometimes all you need is parallel execution of a single algorithm, and you need a set of primitives that succinctly and efficiently express that parallelism... not to mention integrated access to SIMD and GPUs, where only parallelism need apply. I'd say there needs to be both a high level "break your app's work into a heterogeneous set of processes" API, like CSP, and a low level "run an algorithm on this array really fast" API.


Did you mean CPS instead of CSP? Not trying to be snarky/pedantic, just making sure there isn't some acronym in this space I am unfamiliar with (not entirely unlikely :)).

Edit: nope, you probably meant CSPs (http://en.wikipedia.org/wiki/Communicating_sequential_proces...), sorry.


Has anyone had any real-world experience using X10? http://en.wikipedia.org/wiki/X10_(programming_language)

It came out of IBM (Eclipse license) at about the same time as Watson, and I've gotten the impression that it (compiled down to C++) was the main language used for it.

One of the authors was a guest professor when I took my parallel programming course and wound up teaching about half the classes, so its abilities and use may have been exaggerated slightly, but it has a lot of constructs built in that I'd imagine would be terrible to implement otherwise - good globally synchronized clocks and memory management across everything in the current "place" (roughly one physical computer), although you still had to manage memory you sent to different places manually.

But Wikipedia says Watson's built mostly on Hadoop, where the coolest features wouldn't really have much of an effect, so it may be just a crazy research language. I was just curious.


Watson is not "built on" Hadoop. From http://www.aaai.org/ojs/index.php/aimagazine/article/view/23...:

To preprocess the corpus and create fast run-time indices we used Hadoop. UIMA annotators were easily deployed as mappers in the Hadoop map-reduce framework. Hadoop distributes the content over the cluster to afford high CPU utilization and provides convenient tools for deploying, managing, and monitoring the corpus analysis process.

The online parts of Watson use UIMA: http://uima.apache.org/


X10 is a research language: interesting concepts, crappy implementation.

X10 is also dying as IBM is shutting down its research groups. Unless some other institution (some university chair?) steps up for maintenance, its development will halt soon.


IBM is shutting down its research groups

That is not a true statement.


Lots of (R)s and TMs in this Intel TBB advertisement.


TBB is OSS, and we use parts of it (atomics) daily, pretty much as advertised.


The first part of the title is descriptive, but there's not much of anything "Done Right" in there that I can see. Just the same old.



