Having spent quite a bit of time designing massively parallel algorithms (concurrency starting at several thousand cores and going up), I find that computer scientists are often baffled when I tell them that FP and immutability don't help. In practice, they solve the wrong problem and create a new one: massively parallel systems are almost always bandwidth bound (either memory or network), and making copies of everything just aggravates that. You can't build hardware buses to compensate, or companies like Cray would have long ago.
If you look at effective highly scalable parallel codes of this nature, you see two big themes: pervasive latency hiding and designing the topology of the software to match the topology of the hardware. The requisite code to implement this is boring single-threaded C/C++, which is completely mutable, but no one cares because "single-threaded". Immutability burns much-needed bandwidth for no benefit. This, among other reasons, is why C and C++ are ubiquitous in HPC.
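To make "latency hiding" concrete for anyone who hasn't seen it: the canonical pattern is starting a non-blocking halo exchange, computing everything that doesn't depend on it, and only then finishing the edges. A minimal MPI sketch with a made-up 1-D stencil; the sizes and the update rule are placeholders, not from any real code:

    /* Latency-hiding sketch: overlap a non-blocking halo exchange with the
       interior update.  1-D ring decomposition, hypothetical stencil;
       compile with mpicc, run with mpirun. */
    #include <mpi.h>
    #include <stdlib.h>

    #define N 1024                              /* cells owned by this rank */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double *u = calloc(N + 2, sizeof *u);   /* u[0], u[N+1] are halos */
        double *v = calloc(N + 2, sizeof *v);
        int lo = (rank + size - 1) % size, hi = (rank + 1) % size;
        MPI_Request r[4];

        /* 1. kick off the halo exchange ... */
        MPI_Irecv(&u[0],     1, MPI_DOUBLE, lo, 0, MPI_COMM_WORLD, &r[0]);
        MPI_Irecv(&u[N + 1], 1, MPI_DOUBLE, hi, 1, MPI_COMM_WORLD, &r[1]);
        MPI_Isend(&u[1],     1, MPI_DOUBLE, lo, 1, MPI_COMM_WORLD, &r[2]);
        MPI_Isend(&u[N],     1, MPI_DOUBLE, hi, 0, MPI_COMM_WORLD, &r[3]);

        /* 2. ... do all the work that doesn't need it, hiding the latency ... */
        for (int i = 2; i < N; i++)
            v[i] = 0.5 * u[i] + 0.25 * (u[i - 1] + u[i + 1]);

        /* 3. ... and only then touch the two cells that needed the halos. */
        MPI_Waitall(4, r, MPI_STATUSES_IGNORE);
        v[1] = 0.5 * u[1] + 0.25 * (u[0] + u[2]);
        v[N] = 0.5 * u[N] + 0.25 * (u[N - 1] + u[N + 1]);

        free(u); free(v);
        MPI_Finalize();
        return 0;
    }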
The challenge is that CS programs don't teach pervasive software latency hiding, nor is much ink spilled on how you design algorithms to match the topology of the hardware; both are fairly deep and obscure theory.
We don't need new languages, we need more software engineers who understand the nature of massive parallelism on real hardware, which up until now is largely tribal knowledge among the specialists who design such codes. (One of the single most important experiences I had as a software engineer was working on several different supercomputing architectures with a different cast of characters; there are important ideas in that guild about scaling code that are completely missing from mainstream computer science.)
Having spent a good bit of time in HPC from a practical and academic perspective, I'll second that claim.
One example I like to bring up is the textbook hotplate, where the new temperature of a cell is the average of its own temperature and the temperatures of its neighbors. Each iteration is embarrassingly parallel, but the result of one iteration is needed before the next can proceed. Instead of synchronizing the entire cluster between iterations, each node computes the results for an extra N cells in each direction, allowing N iterations to occur before synchronization is necessary (after the first iteration the outermost cells are inaccurate, after the second iteration the outer two layers are inaccurate, and so on; you just have to sync before the staleness reaches the cells you care about). The extra work gives diminishing returns as you break the problem into smaller chunks, since the halo grows relative to the interior. But even for smaller chunks, doubling or tripling the work per node is still faster than synchronizing.
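In code, the bookkeeping is just a valid region that shrinks by one cell per side per iteration. A serial 1-D sketch of one node's chunk between syncs (sizes are made up; real codes are 2-D/3-D, but the extra indexing obscures the idea):

    /* Deep-halo hotplate sketch: carry HALO extra cells per side so the
       chunk can run HALO iterations with zero communication.  After
       iteration k the outer k cells per side are stale; we exchange
       before the staleness reaches the CHUNK cells we own. */
    #include <string.h>

    #define CHUNK 256        /* cells this node owns */
    #define HALO  8          /* extra cells per side = iterations per sync */
    #define W (CHUNK + 2 * HALO)

    static void relax(double *dst, const double *src, int lo, int hi) {
        for (int i = lo; i < hi; i++)
            dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0;
    }

    void run_between_syncs(double u[W]) {
        double tmp[W];
        for (int k = 1; k <= HALO; k++) {
            relax(tmp, u, k, W - k);          /* valid region shrinks by 1/side */
            memcpy(u + k, tmp + k, (W - 2 * k) * sizeof *u);
        }
        /* now refill the HALO cells on each side from the neighbors and repeat */
    }

The trade is explicit here: each sync-free iteration costs O(HALO) redundant cells per side, which is nothing next to a cluster-wide barrier.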
My sole contribution to HPC was inventing a graph algorithm (circa 2009) for breadth-first search that didn't require the usual iterated barrier-synchronization superstep you get with BSP algorithms. Instead, every node ran free with almost no synchronization, and a clever error-correction mechanism backed out all the incorrect traversals at a completely negligible cost (in both space and time) relative to letting the computation run as though it were embarrassingly parallel. It basically allowed every node to always be doing (mostly) constructive work without waiting on any other node.
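Not the parent's algorithm, which I haven't seen published, but the general "run free and correct later" shape is easy to show. A label-correcting BFS: a vertex's distance may be set optimistically wrong, and a later, shorter path simply overwrites it and re-propagates, with no per-frontier barrier. Run serially in FIFO order it degenerates to plain BFS; the payoff is that many workers can drain the queue concurrently and the corrections absorb the resulting races:

    /* Label-correcting BFS sketch: no level-by-level superstep; wrong
       speculative distances are backed out by re-relaxation.  Serial for
       clarity; a parallel version needs an MPMC queue and atomic
       min-updates on dist[].  All names here are hypothetical. */
    #include <limits.h>
    #include <stdlib.h>

    typedef struct { const int *adj; int deg; } Vertex;

    void bfs(const Vertex *g, int n, int src, int *dist) {
        /* serially, FIFO order means each vertex enters the queue once */
        int *queue = malloc(n * sizeof *queue);
        int head = 0, tail = 0;
        for (int i = 0; i < n; i++) dist[i] = INT_MAX;
        dist[src] = 0;
        queue[tail++] = src;
        while (head < tail) {
            int u = queue[head++];
            for (int j = 0; j < g[u].deg; j++) {
                int v = g[u].adj[j];
                if (dist[u] + 1 < dist[v]) {   /* correction, not a barrier */
                    dist[v] = dist[u] + 1;
                    queue[tail++] = v;         /* re-enqueue to propagate the fix */
                }
            }
        }
        free(queue);
    }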
But yeah, burning a bit of CPU to eliminate synchronization is frequently a huge win. This is where knowing how much compute you can do within a given latency window is helpful.
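Rough numbers to make that window concrete (my figures, ballpark only): a ~3 GHz core issuing a few instructions per cycle retires on the order of 10^10 simple ops/s, while even a fast interconnect round trip is on the order of a microsecond, so one avoided synchronization buys a budget of roughly 10^4 operations per core, multiplied by every core that would otherwise have been waiting. Any redundant recompute cheaper than that is effectively free.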
I'm not about to join the FP circle jerk, but well-executed immutability can reduce memory bandwidth requirements rather than increase them. In large C (and Java) codebases, the uninitiated will often copy large structures and buffers when they don't need to, because they don't trust the other code[rs]. If you have a tree-like data structure such as a nested map, immutability lets you share the unchanged parts of the structure by reference even when "modifying" the contents.
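Concretely, that's path copying. A bare-bones persistent BST in C (a toy: no balancing, no reclamation): an "update" allocates only the O(log n) spine and shares every untouched subtree by pointer, which is why it doesn't cost full-copy bandwidth:

    /* Path-copying sketch: "modifying" a persistent tree copies only the
       nodes on the path to the change; both the old and the new root stay
       valid and share every untouched subtree. */
    #include <stdlib.h>

    typedef struct Node {
        int key, val;
        const struct Node *left, *right;   /* shared, never mutated */
    } Node;

    static const Node *mk(int key, int val, const Node *l, const Node *r) {
        Node *n = malloc(sizeof *n);
        n->key = key; n->val = val; n->left = l; n->right = r;
        return n;
    }

    /* Returns a new root; the old version is untouched. */
    const Node *insert(const Node *t, int key, int val) {
        if (!t)           return mk(key, val, NULL, NULL);
        if (key < t->key) return mk(t->key, t->val, insert(t->left, key, val), t->right);
        if (key > t->key) return mk(t->key, t->val, t->left, insert(t->right, key, val));
        return mk(key, val, t->left, t->right);   /* overwrite: copy one node */
    }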
That said, the added complexity in memory management means that if you can afford good programmers, it's better that they just try to avoid this practice in the first place. So HPC folks take the mutable approach.
I guess it's hard to say if an average, middle-of-the-road programmer will ever be able to productively work on parallel systems.
Very well put; I have just one addendum: add Fortran to the list. If you find yourself programming multidimensional-array-like data structures as flat arrays in C or C++, use Fortran for the kernels and Python for the glue instead. It's just as performant, comes with far more batteries included, and is much more comfortable for achieving high performance.
>> If you look at effective highly scalable parallel codes of this nature, you see two big themes: pervasive latency hiding and designing the topology of the software to match the topology of the hardware
Do you have any good references for learning general HPC and those topics? The best I could find when I last searched were Udacity's High Performance Computing MOOC and its reference book, "Introduction to Parallel Computing".
In certain circumstances, new languages can help software engineers get high performance. For example, the Lift project at the University of Edinburgh: it's a skeletal (think functional, but with the emphasis on immutability and higher-order functions rather than category theory) domain-specific language for writing GPU kernels. The team there has managed to use it to implement matrix-multiplication kernels that beat everything apart from NVIDIA's own assembly-language implementations, just by using the safety and separation of concerns that a functional approach brings.