Modern superscalar CPUs compute the branch at the same time they're doing the operation. They also rename registers, so reusing the same register on the next iteration may actually use a different physical register inside the CPU. It's entirely possible for five or so iterations of the loop to be in flight simultaneously.

Classic loop unrolling is more appropriate to machines with lots of registers, like a SPARC. There, you wrote load, load, load, load, operate, operate, operate, operate, store, store, store, store, so that all the loads, all the operates, and all the stores would overlap. AMD64, unlike IA-32, has enough registers that you could do this. But it may not be a win on a deeply pipelined, out-of-order CPU, which is already overlapping iterations on its own.
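
For anyone who hasn't seen the classic pattern, here's a rough sketch in C of a loop unrolled by four in the load/operate/store style. The function names and the "add a constant" operation are made up for illustration, and a compiler is free to reorder or re-roll any of this, so treat it as a picture of the idea rather than a guaranteed speedup.

```c
#include <stddef.h>

/* Plain loop: one load, one operate, one store per iteration.
   (Hypothetical example: add k to every element.) */
void add_const_simple(int *a, size_t n, int k)
{
    for (size_t i = 0; i < n; i++)
        a[i] += k;
}

/* Same work, unrolled by four in the classic style.
   Each group uses four separate temporaries, which is why the
   technique wants a machine with plenty of registers. */
void add_const_unrolled(int *a, size_t n, int k)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        /* load, load, load, load */
        int r0 = a[i];
        int r1 = a[i + 1];
        int r2 = a[i + 2];
        int r3 = a[i + 3];

        /* operate, operate, operate, operate */
        r0 += k;
        r1 += k;
        r2 += k;
        r3 += k;

        /* store, store, store, store */
        a[i]     = r0;
        a[i + 1] = r1;
        a[i + 2] = r2;
        a[i + 3] = r3;
    }

    /* Leftover elements when n is not a multiple of four. */
    for (; i < n; i++)
        a[i] += k;
}
```

Whether the unrolled version beats the simple one on a given chip is something you'd have to measure.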