> Today, loop unrolling is usually a lose if it makes the loop much bigger.
Is this true?
For example, suppose I have a loop that does 1 million iterations and an increment of `i += 1`. If I unroll that to have an increment of `i += 10`, now I only have to do 100,000 iterations. That is 900,000 eliminated branches.
What am I missing?
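Concretely, the unrolling being described might look like this sketch (a summation loop; `sum_simple`, `sum_unrolled`, and `N` are illustrative names, not from the thread, and `N` is assumed divisible by 10):

```c
#include <assert.h>

#define N 1000000

/* Straightforward loop: one backward branch per iteration. */
long sum_simple(const int *a) {
    long s = 0;
    for (int i = 0; i < N; i++)
        s += a[i];
    return s;
}

/* Unrolled by 10: one backward branch per 10 iterations,
   so roughly 900,000 branches are eliminated. */
long sum_unrolled(const int *a) {
    long s = 0;
    for (int i = 0; i < N; i += 10) {
        s += a[i]     + a[i + 1] + a[i + 2] + a[i + 3] + a[i + 4]
           + a[i + 5] + a[i + 6] + a[i + 7] + a[i + 8] + a[i + 9];
    }
    return s;
}
```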
Modern superscalar CPUs compute the branch at the same time they're doing the operation, and a correctly predicted branch costs close to nothing. They're also renaming registers, so reuse of the same architectural register on the next iteration may actually map to a different physical register inside the CPU. It's entirely possible for five or so iterations of the loop to be in flight simultaneously.
Classic loop unrolling is more appropriate to machines with lots of registers, like a SPARC. There, you wrote load, load, load, load, operate, operate, operate, operate, store, store, store, store, and all the loads, all the operates, and all the stores would overlap. AMD-64, unlike IA-32, has enough registers that you could do this, but it may still not be a win on a deeply pipelined CPU.
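A sketch of that load/operate/store style of unrolling, assuming a simple summation loop and an element count divisible by four (`sum4` is an illustrative name, not from the thread):

```c
/* Four independent accumulators, in the spirit of the SPARC-style
   load/load/load/load, op/op/op/op schedule: the four dependency
   chains don't touch each other, so a machine with enough registers
   (or a superscalar CPU doing renaming) can overlap all of them. */
long sum4(const int *a, int n) {   /* n assumed divisible by 4 */
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```

The point of the separate accumulators is to break the single serial add chain; summing into one variable would force each add to wait for the previous one.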
The key thing you're missing is the qualifier: *if it makes the loop much bigger*.
That example does not make the loop much bigger, since it's trivial for the compiler to fold several of these additions into one bigger one (though, depending on the data type, it may need extra checks to preserve overflow behavior). Many other loop bodies do not compress like this, e.g. ones that iterate over many elements of a large data structure, need temporary results, and so on. Unrolling those can make the body spill out of the instruction cache, and then the unrolling likely causes costly cache misses.
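The folding point can be shown with a minimal sketch (`count_unrolled` is an illustrative name; `n` is assumed divisible by 10):

```c
/* Ten copies of the trivial increment fold straight back into one
   constant add, so unrolling barely grows the code here. */
long count_unrolled(long n) {
    long count = 0;
    for (long i = 0; i < n; i += 10) {
        count += 1; count += 1; count += 1; count += 1; count += 1;
        count += 1; count += 1; count += 1; count += 1; count += 1;
        /* any optimizing compiler emits this body as `count += 10` */
    }
    return count;
}
```

A body of ten independent memory operations or ten temporaries has no such single-instruction equivalent, which is where the code-size cost comes from.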