Having spent quite a bit of time designing massively parallel algorithms (concurrency starting at several thousand cores and going up), I find that computer scientists are often baffled when I tell them that FP and immutability don't help. In practice, they solve the wrong problem and create a new one: massively parallel systems are almost always bandwidth bound (either memory or network), and making copies of everything just aggravates that. You can't build hardware buses to compensate, or companies like Cray would have long ago.
If you look at effective highly scalable parallel codes of this nature, you see two big themes: pervasive latency hiding and designing the topology of the software to match the topology of the hardware. The requisite code to implement this is boring single-threaded C/C++, which is completely mutable, but no one cares because "single-threaded". Immutability burns much-needed bandwidth for no benefit. This, among other reasons, is why C and C++ are ubiquitous in HPC.
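To make "latency hiding" concrete for anyone who hasn't seen it: the canonical pattern is starting a non-blocking halo exchange, computing everything that doesn't depend on it, and only then finishing the edges. A minimal MPI sketch with a made-up 1-D stencil; the sizes and the update rule are placeholders, not from any real code:

    /* Latency-hiding sketch: overlap a non-blocking halo exchange with the
       interior update.  1-D ring decomposition, hypothetical stencil;
       compile with mpicc, run with mpirun. */
    #include <mpi.h>
    #include <stdlib.h>

    #define N 1024                              /* cells owned by this rank */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double *u = calloc(N + 2, sizeof *u);   /* u[0], u[N+1] are halos */
        double *v = calloc(N + 2, sizeof *v);
        int lo = (rank + size - 1) % size, hi = (rank + 1) % size;
        MPI_Request r[4];

        /* 1. kick off the halo exchange ... */
        MPI_Irecv(&u[0],     1, MPI_DOUBLE, lo, 0, MPI_COMM_WORLD, &r[0]);
        MPI_Irecv(&u[N + 1], 1, MPI_DOUBLE, hi, 1, MPI_COMM_WORLD, &r[1]);
        MPI_Isend(&u[1],     1, MPI_DOUBLE, lo, 1, MPI_COMM_WORLD, &r[2]);
        MPI_Isend(&u[N],     1, MPI_DOUBLE, hi, 0, MPI_COMM_WORLD, &r[3]);

        /* 2. ... do all the work that doesn't need it, hiding the latency ... */
        for (int i = 2; i < N; i++)
            v[i] = 0.5 * u[i] + 0.25 * (u[i - 1] + u[i + 1]);

        /* 3. ... and only then touch the two cells that needed the halos. */
        MPI_Waitall(4, r, MPI_STATUSES_IGNORE);
        v[1] = 0.5 * u[1] + 0.25 * (u[0] + u[2]);
        v[N] = 0.5 * u[N] + 0.25 * (u[N - 1] + u[N + 1]);

        free(u); free(v);
        MPI_Finalize();
        return 0;
    }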
The challenge is that CS programs don't teach pervasive software latency hiding, nor is much ink spilled on how you design algorithms to match the topology of the hardware; both are fairly deep and obscure theory.
We don't need new languages, we need more software engineers who understand the nature of massive parallelism on real hardware, which up until now is largely tribal knowledge among the specialists who design such codes. (One of the single most important experiences I had as a software engineer was working on several different supercomputing architectures with a different cast of characters; there are important ideas in that guild about scaling code that are completely missing from mainstream computer science.)
Having spent a good bit of time in HPC from a practical and academic perspective, I'll second that claim.
One example I like to bring up is the textbook hotplate, where the new temperature of a cell is the average of its own temperature and the temperatures of its neighbors. Each iteration is embarrassingly parallel, but the result of one iteration is needed before the next can proceed. Instead of synchronizing the entire cluster between iterations, each node computes the results for an extra N cells in each direction, allowing N iterations to occur before synchronization is necessary (after the first iteration the outermost cells are inaccurate, after the second iteration the outer two layers are inaccurate, and so on; you just have to sync before the staleness reaches the cells you care about). The extra work gives diminishing returns as you break the problem into smaller chunks, since the halo grows relative to the interior. But even for smaller chunks, doubling or tripling the work per node is still faster than synchronizing.
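In code, the bookkeeping is just a valid region that shrinks by one cell per side per iteration. A serial 1-D sketch of one node's chunk between syncs (sizes are made up; real codes are 2-D/3-D, but the extra indexing obscures the idea):

    /* Deep-halo hotplate sketch: carry HALO extra cells per side so the
       chunk can run HALO iterations with zero communication.  After
       iteration k the outer k cells per side are stale; we exchange
       before the staleness reaches the CHUNK cells we own. */
    #include <string.h>

    #define CHUNK 256        /* cells this node owns */
    #define HALO  8          /* extra cells per side = iterations per sync */
    #define W (CHUNK + 2 * HALO)

    static void relax(double *dst, const double *src, int lo, int hi) {
        for (int i = lo; i < hi; i++)
            dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0;
    }

    void run_between_syncs(double u[W]) {
        double tmp[W];
        for (int k = 1; k <= HALO; k++) {
            relax(tmp, u, k, W - k);          /* valid region shrinks by 1/side */
            memcpy(u + k, tmp + k, (W - 2 * k) * sizeof *u);
        }
        /* now refill the HALO cells on each side from the neighbors and repeat */
    }

The trade is explicit here: each sync-free iteration costs O(HALO) redundant cells per side, which is nothing next to a cluster-wide barrier.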
My sole contribution to HPC was inventing a graph algorithm (circa 2009) for breadth-first search that didn't require the usual iterated barrier-synchronization superstep you get with BSP algorithms. Instead, every node ran free with almost no synchronization, and a clever error-correction mechanism backed out all the incorrect traversals at a completely negligible cost (in both space and time) relative to letting the computation run as though it were embarrassingly parallel. It basically allowed every node to always be doing (mostly) constructive work without waiting on any other node.
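Not the parent's algorithm, which I haven't seen published, but the general "run free and correct later" shape is easy to show. A label-correcting BFS: a vertex's distance may be set optimistically wrong, and a later, shorter path simply overwrites it and re-propagates, with no per-frontier barrier. Run serially in FIFO order it degenerates to plain BFS; the payoff is that many workers can drain the queue concurrently and the corrections absorb the resulting races:

    /* Label-correcting BFS sketch: no level-by-level superstep; wrong
       speculative distances are backed out by re-relaxation.  Serial for
       clarity; a parallel version needs an MPMC queue and atomic
       min-updates on dist[].  All names here are hypothetical. */
    #include <limits.h>
    #include <stdlib.h>

    typedef struct { const int *adj; int deg; } Vertex;

    void bfs(const Vertex *g, int n, int src, int *dist) {
        /* serially, FIFO order means each vertex enters the queue once */
        int *queue = malloc(n * sizeof *queue);
        int head = 0, tail = 0;
        for (int i = 0; i < n; i++) dist[i] = INT_MAX;
        dist[src] = 0;
        queue[tail++] = src;
        while (head < tail) {
            int u = queue[head++];
            for (int j = 0; j < g[u].deg; j++) {
                int v = g[u].adj[j];
                if (dist[u] + 1 < dist[v]) {   /* correction, not a barrier */
                    dist[v] = dist[u] + 1;
                    queue[tail++] = v;         /* re-enqueue to propagate the fix */
                }
            }
        }
        free(queue);
    }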
But yeah, burning a bit of CPU to eliminate synchronization is frequently a huge win. This is where knowing how much compute you can do within a given latency window is helpful.
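Rough numbers to make that window concrete (my figures, ballpark only): a ~3 GHz core issuing a few instructions per cycle retires on the order of 10^10 simple ops/s, while even a fast interconnect round trip is on the order of a microsecond, so one avoided synchronization buys a budget of roughly 10^4 operations per core, multiplied by every core that would otherwise have been waiting. Any redundant recompute cheaper than that is effectively free.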
I'm not about to join the FP circle jerk, but well-executed immutability can reduce memory bandwidth requirements rather than increase them. In large C (and Java) codebases, the uninitiated will often copy large structures and buffers when they don't need to, because they don't trust the other code[rs]. If you have a tree-like data structure such as a nested map, immutability lets you share the unchanged parts of the structure by reference even when "modifying" the contents.
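Concretely, that's path copying. A bare-bones persistent BST in C (a toy: no balancing, no reclamation): an "update" allocates only the O(log n) spine and shares every untouched subtree by pointer, which is why it doesn't cost full-copy bandwidth:

    /* Path-copying sketch: "modifying" a persistent tree copies only the
       nodes on the path to the change; both the old and the new root stay
       valid and share every untouched subtree. */
    #include <stdlib.h>

    typedef struct Node {
        int key, val;
        const struct Node *left, *right;   /* shared, never mutated */
    } Node;

    static const Node *mk(int key, int val, const Node *l, const Node *r) {
        Node *n = malloc(sizeof *n);
        n->key = key; n->val = val; n->left = l; n->right = r;
        return n;
    }

    /* Returns a new root; the old version is untouched. */
    const Node *insert(const Node *t, int key, int val) {
        if (!t)           return mk(key, val, NULL, NULL);
        if (key < t->key) return mk(t->key, t->val, insert(t->left, key, val), t->right);
        if (key > t->key) return mk(t->key, t->val, t->left, insert(t->right, key, val));
        return mk(key, val, t->left, t->right);   /* overwrite: copy one node */
    }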
That said, the added complexity in memory management means that if you can afford good programmers, it's better that they just try to avoid this practice in the first place. So HPC folks take the mutable approach.
I guess it's hard to say if an average, middle-of-the-road programmer will ever be able to productively work on parallel systems.
Very well put; I have just one addendum: add Fortran to the list. If you find yourself programming multidimensional-array-like data structures as flat arrays in C or C++, use Fortran for the kernels and Python for the glue instead. It's just as performant, comes with far more batteries included, and is much more comfortable for achieving high performance.
>> If you look at effective highly scalable parallel codes of this nature, you see two big themes: pervasive latency hiding and designing the topology of the software to match the topology of the hardware
Do you have any good references for learning general HPC and those topics? The best I could find when I last searched were Udacity's High Performance Computing MOOC and its reference book, "Introduction to Parallel Computing".
In certain circumstances, new languages can help software engineers get high performance. For example, the Lift project at the University of Edinburgh: it's a skeletal (think functional, but with the emphasis on immutability and higher-order functions rather than category theory) domain-specific language for writing GPU kernels. The team there has managed to use it to implement matrix-multiplication kernels that beat everything apart from NVIDIA's own assembly-language implementations, just by using the safety and separation of concerns that a functional approach brings.