This question, what to do with a 'shared nothing' machine, is the real heart of the matter.
The original article talked about putting ARM9 cores on a chip, which is also overkill: your typical object neither needs nor cares about the complex memory architecture of an ARM9. Had they chosen a Cortex-M, they could have put twice the number of cores on their chip with no loss of computability.
But as John points out, the challenge is how you actually program these things. One interesting proposal I heard was to compile code with every loop unrolled and every branch hard-coded either one way or the other. For every "branch" in a piece of code you would generate two binaries, one where it was taken and one where it wasn't, so four if statements would mean 2^4 = 16 different binaries (pruned for cases that couldn't happen). Then you ran your input on every copy of the program simultaneously. [1]

The issue, though, and it became crystal clear to me as I was working my way through how Google processed logs and log data, was that feeding the output of one computation into the input of another became the limiter: the interconnect of the fabric was the real bottleneck. That analysis got a tremendous validation when the underlying network upgrades started rolling out, because it became possible to run the same sort of test on a group of machines connected in the 'old' way and on the same machines connected in the 'new' way. That demonstrated just how important inter-machine communication is in making them productive.
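Going back to the branch-enumeration proposal for a moment, here is a toy sketch of the idea (the function and variant names are mine, purely for illustration): a function with two data-dependent branches expands into 2^2 = 4 specialized variants, each with both branches hard-coded. A real system would generate these mechanically, run all of them, and keep the result from the variant whose hard-coded guesses matched the actual input.

```c
/* Original function: two data-dependent branches, A and B. */
int original(int x, int y) {
    int r = 0;
    if (x > 0) r += x;   /* branch A */
    if (y > 0) r += y;   /* branch B */
    return r;
}

/* The four specialized variants, one per combination of branch
 * outcomes. Each is only valid when its hard-coded guesses hold,
 * which the harness must check against the actual inputs. */
int variant_tt(int x, int y) { return x + y; }               /* A taken, B taken */
int variant_tf(int x, int y) { (void)y; return x; }          /* A taken, B not   */
int variant_ft(int x, int y) { (void)x; return y; }          /* A not,   B taken */
int variant_ff(int x, int y) { (void)x; (void)y; return 0; } /* neither taken    */
```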
So a model for optimizing both parallelization (interdependencies in state) and scheduling (interdependencies in time) is essential for making a shared-nothing cluster perform better than a very large shared-memory machine.
[1] Branch prediction is good enough these days to make this not much of a win.
> Today, loop unrolling is usually a lose if it makes the loop much bigger.
Is this true?
For example, suppose I have a loop that does 1 million iterations and an increment of `i += 1`. If I unroll that to have an increment of `i += 10`, now I only have to do 100,000 iterations. That is 900,000 eliminated branches.
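To make that concrete, here is a minimal C sketch (the array-summing workload and names are my own, just to illustrate the point): the unrolled version executes the loop branch one tenth as often, at the cost of a ten-times-larger body.

```c
#include <stddef.h>

#define N 1000000  /* assume N is a multiple of 10 */

/* Rolled: the loop branch executes 1,000,000 times. */
long sum_rolled(const long *a) {
    long s = 0;
    for (size_t i = 0; i < N; i += 1)
        s += a[i];
    return s;
}

/* Unrolled by 10: the loop branch executes only 100,000 times,
 * eliminating 900,000 branches, but the body is ten times larger. */
long sum_unrolled(const long *a) {
    long s = 0;
    for (size_t i = 0; i < N; i += 10)
        s += a[i]     + a[i + 1] + a[i + 2] + a[i + 3] + a[i + 4]
           + a[i + 5] + a[i + 6] + a[i + 7] + a[i + 8] + a[i + 9];
    return s;
}
```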
Modern superscalar CPUs are computing the branch at the same time they're doing the operation. They're also renaming registers, so that the reuse of the same register on the next iteration may actually use a different register internal to the CPU. It's entirely possible for five or so iterations of the loop to be running simultaneously.
Classic loop unrolling is more appropriate to machines with lots of registers, like a SPARC. There, you wrote load, load, load, load, operate, operate, operate, operate, store, store, store, store. All the loads, all the operates, and all the stores would overlap. AMD-64, unlike IA-32, has enough registers that you could do this. But it may not be a win on a deeply pipelined CPU.
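For what it's worth, that pattern looks something like this in C (a hedged sketch of my own, assuming the element count is a multiple of four): four independent temporaries stand in for four registers, giving the loads, the operates, and the stores room to overlap.

```c
#include <stddef.h>

/* SPARC-style unrolling: batch the loads, then the operations, then
 * the stores. The four temporaries are independent, so a machine with
 * enough registers can overlap all four of each phase. */
void scale4(float *dst, const float *src, float k, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        float t0 = src[i];       /* load, load, load, load */
        float t1 = src[i + 1];
        float t2 = src[i + 2];
        float t3 = src[i + 3];
        t0 *= k;                 /* operate, operate, operate, operate */
        t1 *= k;
        t2 *= k;
        t3 *= k;
        dst[i]     = t0;         /* store, store, store, store */
        dst[i + 1] = t1;
        dst[i + 2] = t2;
        dst[i + 3] = t3;
    }
}
```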
The key thing missing is the qualifier: *if it makes the loop much bigger*.
That example does not make the loop much bigger, since it's trivial to combine multiple of those additions into one bigger one (even if extra checks for overflow behavior might be needed, depending on the data type). Many other operations do not compress like this, e.g. ones that iterate over many elements of a large data structure, need temporary results, and so on. There, unrolling is likely to cause costly instruction-cache misses.
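To illustrate the difference with a sketch of my own (both functions are hypothetical examples, not from the thread): the first loop compresses, the second does not.

```c
#include <stddef.h>

/* Compresses: ten unrolled copies of `i += 1` fold back into a single
 * `i += 10`, so unrolling doesn't actually grow the loop body. */
size_t count_to(size_t n) {
    size_t i = 0;
    while (i < n)
        i += 1;
    return i;
}

/* Does not compress: every step loads memory and feeds a loop-carried
 * value, so unrolling duplicates real work, inflating the body and
 * pressuring the instruction cache. */
struct node { struct node *next; long val; };

long chase(const struct node *p) {
    long s = 0;
    while (p != NULL) {
        s += p->val;
        p = p->next;
    }
    return s;
}
```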