Is Parallel Programming Hard, And, If So, What Can You Do About It? (kernel.org)
122 points by conductor on March 11, 2014 | 26 comments



To clarify something not evident in the title, but which became obvious when I looked through the doc: this is about small-scale parallelism, the kind you find in a single (non-exotic) machine.

Massive-scale parallelism uses different data structures, algorithms, and programming models, none of which are discussed here.


I write distributed-memory software, and that was my first reaction when I saw this a few years ago. But we are getting more on-node parallelism (and further on-node parallelism is where most of the next 10 years of performance gains are expected to come from), so these techniques are more relevant to our large-scale computing than you'd otherwise think. Especially as we push strong-scaling limits and attempt to coordinate threads in order to share caches, the standard application of domain decomposition entails higher overheads than we would like to pay.


The question is even fuzzier: how much of that will be massive data parallelism via an on-node GPU? We are already reaching a point where a single machine with 4 CUDA cards can outperform a distributed cluster of 1000 nodes on some applications (the trick would be to eventually do both).

The memory hierarchy is our enemy here: the reason GPUs have done so well is that they schedule memory just as much as (if not more than) they do computation. If you are going to go through the trouble of coordinating threads to share caches (if this is possible at all), you might have a GPU-friendly problem.


1. Coordinating threads on a GPU means using shared memory effectively, which is required to get decent performance on almost any application. With many cores and hardware threads sharing cache on CPUs, it is also important for many applications if you insist on getting the best performance; for example, it is mandatory for reaching 90% efficiency on HPL on Blue Gene/Q. Of course, not content with merely doing clever things to make their machines look good, IBM had John patent his cooperative prefetch: http://www.google.com/patents/US8490071

2. GPUs have less "close" memory (registers/shared memory/cache) per vector lane than current CPUs. This means that to get high efficiency, you have to find fine-grained parallelism without additional overhead. This is hard, and it is common to see GPU algorithms make more round-trips to global memory than analogous CPU algorithms, eating into any benefit in raw bandwidth (see the sketch after this list).

3. GPUs have a relatively narrow range of problem sizes in which they perform well. You typically have to use a significant fraction of device memory to expose enough parallelism to keep all the cores busy, and yet the device has limited memory compared to the CPU. On a GPU-heavy configuration, you have placed 90% of your compute next to 10% of your memory, connected by a straw (PCI bus) to the rest of the memory and the network. That is not a recipe for a versatile machine. CPUs give you vastly more flexibility in turn-around time (e.g., strong scale at >50% efficiency over a factor of 1000 as compared to 10). GPU performance results usually choose a problem size that fills device memory, but science/engineering is often not that convenient.

4. Even for problems in which GPUs perform optimally (like DGEMM or HPL), the ratio in energy efficiency is only 2x. See http://green500.org for example. Note that Blue Gene/Q is a CPU architecture that delivers the same energy efficiency as the GPU-heavy Titan. Also note that Haswell improves Intel efficiency by 2x over Sandy Bridge. The 1000x myth needs to die.

5. Enterprise GPUs (those with ECC) and Xeon Phi (MIC) are expensive ($3k-4k MSRP) relative to CPUs, and still need a host in almost all configurations. In performance tests, normalize-by-shrinkwrap needs to die. Normalize by total acquisition cost or by total energy consumption (always include the host, memory, network as applicable).
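
To make point 2 concrete from the CPU side, here is a rough sketch of blocking/tiling, the standard trick for reusing "close" memory instead of making extra round-trips to "far" memory. Sizes are illustrative and untuned, and the output array is assumed to be zeroed:

    /* Blocked matrix multiply: the inner loops work on a BxB tile that
     * fits in cache, so each element is fetched from "far" memory far
     * fewer times than in the naive triple loop. Assumes c[] is zeroed
     * and B divides N; both sizes are illustrative, not tuned. */
    #include <stddef.h>

    #define N 1024
    #define B 64    /* tile edge, chosen so a few tiles fit in cache */

    void matmul_blocked(const double *a, const double *b, double *c)
    {
        for (size_t ii = 0; ii < N; ii += B)
            for (size_t kk = 0; kk < N; kk += B)
                for (size_t jj = 0; jj < N; jj += B)
                    /* work stays within one set of tiles */
                    for (size_t i = ii; i < ii + B; i++)
                        for (size_t k = kk; k < kk + B; k++) {
                            double aik = a[i * N + k];
                            for (size_t j = jj; j < jj + B; j++)
                                c[i * N + j] += aik * b[k * N + j];
                        }
    }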


1. Yes, that's the whole point. The GPU makes mandatory what we muck around with on CPUs.

2. Every GPU generation adds more things like cache. The story changes every two years in favor of GPUs.

3. My colleague trains huge multi-gigabyte models on GPUs, so it's not impossible. Terabytes of data are still the domain of MPI and, increasingly, MapReduce.

4. This is really not a concern for us, nor anyone who is in it for the performance (6 hours vs. 6 days).

5. The Phi is still kind of a joke. Tesla is quite competitive in terms of pricing.


Can you point to a problem where a GPU actually outperforms a CPU by 250x and the CPU is not being criminally underused? I have tried to find such examples and never found one.

Unless, maybe, you mean the communication costs are the real bottleneck in such cases? In which case I don't see the relevance of the GPU angle.


Yes, distributed clusters are limited by communication costs, not computation power. Parallel computing in general is limited by communication costs, even on one node (there it shows up as the time it takes to service a cache miss). Minimizing communication is important in both cases.

DNN training is one problem where the GPU solution vastly outperforms the distributed HPC solution.


GPUs are useful for algebraic geometry. The speedup is not quite 250x, but something like 80x or so.

http://www.mpi-inf.mpg.de/~emeliyan/phd_thesis.pdf


Switching one of the fancier-looking PS4 titles over to software rendering might do it. :)


I have written a pretty simple job-worker system in C, and I'm interested in this field. Do you have any recommendations for books/resources?
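
(For context, here is the core of what I mean by a job-worker system, as a minimal POSIX-threads sketch; the names are illustrative and there's no shutdown path:)

    /* Minimal job-worker queue: workers block on a condition variable
     * until jobs arrive. Sketch only: no shutdown, no malloc checks. */
    #include <pthread.h>
    #include <stdlib.h>

    typedef struct job {
        void (*fn)(void *arg);
        void *arg;
        struct job *next;
    } job_t;

    static job_t *head;    /* pending jobs (LIFO for simplicity) */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;

    void submit(void (*fn)(void *), void *arg)
    {
        job_t *j = malloc(sizeof(*j));
        j->fn = fn;
        j->arg = arg;
        pthread_mutex_lock(&lock);
        j->next = head;
        head = j;
        pthread_cond_signal(&nonempty);   /* wake one sleeping worker */
        pthread_mutex_unlock(&lock);
    }

    void *worker(void *unused)
    {
        (void)unused;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (head == NULL)
                pthread_cond_wait(&nonempty, &lock);
            job_t *j = head;
            head = j->next;
            pthread_mutex_unlock(&lock);
            j->fn(j->arg);    /* run the job outside the lock */
            free(j);
        }
        return NULL;
    }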


For CUDA books, you can go through Wilt's; it's pretty dense: https://twitter.com/CUDAHandbook

There are a handful of papers on attacking the 13 (formerly 7) dwarf problems: http://tom.scogland.com/pubs/pdf/feng-ocd-icpe12.pdf

I don't have any specifics on the Kaveri memory model, but there are lots of small reviews: http://www.tomshardware.com/reviews/a10-7850k-a8-7600-kaveri...

Compiler research and thread debugging: http://iacoma.cs.uiuc.edu/iacoma-papers/Illinois_parallelism...


And Eijkhout's book is an excellent place to start (web server down at the moment): http://tacc-web.austin.utexas.edu/veijkhout/public_html/Arti...

Introduction to High Performance Scientific Computing - Victor Eijkhout with Edmond Chow, Robert van de Geijn


I can tell I live in a different world—I immediately thought of multi-threaded programming on a single machine.

Modern individual machines give us all kinds of opportunities for parallel execution: even my phone has vector instructions and two CPU cores. Modern Intel CPUs have 256-bit wide vectors, meaning you can do 8 single-precision floating point operations in a single instruction, assuming you can phrase your problem the right way. If that's not enough, you can throw a pile of these cores at the problem, assuming you can coordinate the work between threads.
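
For instance, assuming an AVX-capable CPU and a compiler flag like -mavx, adding two arrays 8 floats at a time looks roughly like this sketch:

    /* Sketch: adding two float arrays 8 elements at a time with AVX
     * intrinsics. Assumes n is a multiple of 8; compile with
     * something like: cc -O2 -mavx */
    #include <immintrin.h>

    void add_arrays(const float *a, const float *b, float *out, int n)
    {
        for (int i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(&a[i]);   /* load 8 floats */
            __m256 vb = _mm256_loadu_ps(&b[i]);
            __m256 vc = _mm256_add_ps(va, vb);    /* 8 adds, 1 instruction */
            _mm256_storeu_ps(&out[i], vc);
        }
    }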

My experience has been that most programmers struggle to build multi-threaded applications, with countless hours lost to tracking down race conditions, deadlocks, and unexpectedly bad performance, which is a shame.
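
The classic first stumble is the lost-update race, sketched below: two threads doing unsynchronized read-modify-writes on a shared counter, so the final value is usually well short of what you'd expect.

    /* Data race: both threads do an unsynchronized read-modify-write
     * on `counter`, so increments get lost and the printed total is
     * usually well below 2000000. */
    #include <pthread.h>
    #include <stdio.h>

    static long counter;    /* shared and unprotected: this is the bug */

    static void *bump(void *unused)
    {
        (void)unused;
        for (int i = 0; i < 1000000; i++)
            counter++;      /* load, add, store: not atomic */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump, NULL);
        pthread_create(&t2, NULL, bump, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld (expected 2000000)\n", counter);
        return 0;
    }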

Scanning through this book, I don't see any coverage of vector instructions, but it does look like it covers coordinating multiple threads in amazing depth. I feel like I already have a decent handle on multi-threaded programming, but I will be reading this book, and I'll probably learn plenty.


> My experience has been that most programmers struggle to build multi-threaded applications, with countless hours lost to tracking down race conditions, deadlocks, and unexpectedly bad performance, which is a shame.

A lot of it is caused by tools that are by design prone to such conditions: unless you consciously and meticulously follow very strict guidelines, these problems are inevitable. Some tools try to solve this with a different design approach (the Rust language, for example) that implicitly prevents many of those pitfalls. I wonder if the book focuses on shared memory and locking only, or covers a broader range of methodologies (at first glance it's mostly about the classic shared-memory, mutual-exclusion approach).


Here is Paul E. McKenney's announcement of the release: http://paulmck.livejournal.com/36854.html


Define "hard".

Does it mean, you need greater knowledge? (communication models, deadlock/livelock, etc)

Does it mean, you need greater attention to detail? (mistakes that wouldn't matter become serious)

Does it mean, you need greater working memory? (remembering what needs locks and what already has locks, in addition to the more ordinary side effects, possible exceptions, etc.)

Does it mean, you need greater fluid intelligence? (reasoning about which locks can be safely composed, or which memory transactions will have too much contention)
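
On that last one, the canonical composition hazard is lock ordering. A minimal pthreads sketch (names are illustrative):

    /* Lock-ordering deadlock: thread1 holds a and wants b while
     * thread2 holds b and wants a. Each function looks fine in
     * isolation; composing them is what kills you. */
    #include <pthread.h>

    static pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

    void *thread1(void *unused)
    {
        (void)unused;
        pthread_mutex_lock(&a);
        pthread_mutex_lock(&b);    /* blocks if thread2 already holds b */
        /* ... critical section ... */
        pthread_mutex_unlock(&b);
        pthread_mutex_unlock(&a);
        return NULL;
    }

    void *thread2(void *unused)
    {
        (void)unused;
        pthread_mutex_lock(&b);
        pthread_mutex_lock(&a);    /* blocks if thread1 already holds a */
        /* ... critical section ... */
        pthread_mutex_unlock(&a);
        pthread_mutex_unlock(&b);
        return NULL;
    }
    /* The fix: impose a global lock order (always a before b). */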


I'm coming at this from more of a physics viewpoint than a CS one. To me, parallel programming is all about dividing a big problem into many little ones, so the meat lies in the boundary conditions. I don't have time to read this PDF right now, but it contains just two instances of the word "boundary". That sets off alarm bells for me.
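
To make the analogy concrete: in a 1D stencil computation, the chunk edges play exactly the role of boundary conditions. A rough serial sketch, with illustrative sizes:

    /* 1D domain decomposition for a 3-point stencil (e.g., explicit
     * heat equation). Each chunk needs one "ghost" value from each
     * neighbor per step; those chunk boundaries are where all the
     * coordination lives. Sizes are illustrative. */
    #include <stddef.h>

    #define N      1024
    #define CHUNKS 4
    #define W      (N / CHUNKS)

    void step_chunk(const double *u, double *unew, size_t c)
    {
        size_t lo = c * W, hi = lo + W;
        for (size_t i = lo; i < hi; i++) {
            /* u[i-1] and u[i+1] may belong to a neighboring chunk:
             * in a threaded version these reads are the
             * synchronization points; in MPI they become explicit
             * halo exchanges. */
            double left  = (i == 0)     ? u[i] : u[i - 1];
            double right = (i == N - 1) ? u[i] : u[i + 1];
            unew[i] = u[i] + 0.25 * (left - 2.0 * u[i] + right);
        }
    }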


Your alarm bells are calibrated for physics; it makes sense that they would false-alarm on a different subject.


Is there a way to get this book in a single column format rather than the default 2-column format? Reading a 2-column book is extremely difficult for me :(


Here you go! http://kernel.org/pub/linux/kernel/people/paulmck/perfbook/p...

Each release is typeset both single-column and double-column. Single column works well for the larger-format ebook readers, and double-column works well for laptop/desktop use and for hardcopy.


Is the book available in another format, e.g. ePub? PDF is not really good on an e-reader.


A number of us have tried a variety of tools to produce e-reader formats, but none of them handle this book at all well. :-(

A number of people have reported good results with the single-column format on higher-end ebook readers. The single-column version of the first electronic edition may be found here: http://kernel.org/pub/linux/kernel/people/paulmck/perfbook/p...


No, it isn't hard. However, you're probably using the wrong tool for the job, and have constructed your problem based on managerial input rather than mathematical formalism.

And, as the final nail in the coffin, you're probably asking your program to do something that violates the known laws of the universe with regard to the speed at which information can travel.


"Just get it done; or we'll get someone in here who will happily violate the known laws of the universe"


too close to home


Too bad you can't use unique_lock in C! Part of the problem is the difficulty of implementing this stuff without RAII.
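
True in standard C. As a rough approximation, GCC and Clang offer the non-standard cleanup attribute, which runs a function when a variable goes out of scope; a sketch:

    /* Approximating a scoped lock in C with the (non-standard,
     * GCC/Clang) cleanup attribute: the unlock runs automatically on
     * scope exit, including early returns. One SCOPED_LOCK per scope. */
    #include <pthread.h>

    static void unlocker(pthread_mutex_t **m)
    {
        pthread_mutex_unlock(*m);
    }

    #define SCOPED_LOCK(m) \
        pthread_mutex_t *scoped_lock_ __attribute__((cleanup(unlocker))) = (m); \
        pthread_mutex_lock(scoped_lock_)

    static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

    int guarded_work(int x)
    {
        SCOPED_LOCK(&big_lock);
        if (x < 0)
            return -1;      /* unlock still happens here */
        return x * 2;       /* ...and here */
    }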



