> Today, loop unrolling is usually a lose if it makes the loop much bigger.
Is this true?
For example, suppose I have a loop that does 1 million iterations and an increment of `i += 1`. If I unroll that to have an increment of `i += 10`, now I only have to do 100,000 iterations. That is 900,000 eliminated branches.
What am I missing?
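Concretely, the unrolling being described might look like this sketch (a summation loop; `sum_simple`, `sum_unrolled`, and `N` are illustrative names, not from the thread, and `N` is assumed divisible by 10):

```c
#include <assert.h>

#define N 1000000

/* Straightforward loop: one backward branch per iteration. */
long sum_simple(const int *a) {
    long s = 0;
    for (int i = 0; i < N; i++)
        s += a[i];
    return s;
}

/* Unrolled by 10: one backward branch per 10 iterations,
   so roughly 900,000 branches are eliminated. */
long sum_unrolled(const int *a) {
    long s = 0;
    for (int i = 0; i < N; i += 10) {
        s += a[i]     + a[i + 1] + a[i + 2] + a[i + 3] + a[i + 4]
           + a[i + 5] + a[i + 6] + a[i + 7] + a[i + 8] + a[i + 9];
    }
    return s;
}
```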
Modern superscalar CPUs compute the branch at the same time they're doing the operation, and a correctly predicted branch costs close to nothing. They're also renaming registers, so reuse of the same architectural register on the next iteration may actually map to a different physical register inside the CPU. It's entirely possible for five or so iterations of the loop to be in flight simultaneously.
Classic loop unrolling is more appropriate to machines with lots of registers, like a SPARC. There, you wrote load, load, load, load, operate, operate, operate, operate, store, store, store, store, and all the loads, all the operates, and all the stores would overlap. AMD-64, unlike IA-32, has enough registers that you could do this, but it may still not be a win on a deeply pipelined CPU.
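A sketch of that load/operate/store style of unrolling, assuming a simple summation loop and an element count divisible by four (`sum4` is an illustrative name, not from the thread):

```c
/* Four independent accumulators, in the spirit of the SPARC-style
   load/load/load/load, op/op/op/op schedule: the four dependency
   chains don't touch each other, so a machine with enough registers
   (or a superscalar CPU doing renaming) can overlap all of them. */
long sum4(const int *a, int n) {   /* n assumed divisible by 4 */
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```

The point of the separate accumulators is to break the single serial add chain; summing into one variable would force each add to wait for the previous one.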
The key thing you're missing is the qualifier: *if it makes the loop much bigger*.
That example does not make the loop much bigger, since it's trivial for the compiler to fold several of these additions into one bigger one (though, depending on the data type, it may need extra checks to preserve overflow behavior). Many other loop bodies do not compress like this, e.g. ones that iterate over many elements of a large data structure, need temporary results, and so on. Unrolling those can make the body spill out of the instruction cache, and then the unrolling likely causes costly cache misses.
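The folding point can be shown with a minimal sketch (`count_unrolled` is an illustrative name; `n` is assumed divisible by 10):

```c
/* Ten copies of the trivial increment fold straight back into one
   constant add, so unrolling barely grows the code here. */
long count_unrolled(long n) {
    long count = 0;
    for (long i = 0; i < n; i += 10) {
        count += 1; count += 1; count += 1; count += 1; count += 1;
        count += 1; count += 1; count += 1; count += 1; count += 1;
        /* any optimizing compiler emits this body as `count += 10` */
    }
    return count;
}
```

A body of ten independent memory operations or ten temporaries has no such single-instruction equivalent, which is where the code-size cost comes from.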