The problem is that on superscalar processors the correspondence between the number of instructions and speed breaks down, partly because the processor does its own optimization on the fly and can do multiple things in parallel.
A programmer should be careful about second-guessing the compiler. And a compiler should be careful about second-guessing the processor.
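To make the instruction-count point concrete, here's a minimal sketch in C. The function names `sum_single` and `sum_four` are made up for illustration. The second version executes more instructions per element, but its four additions per iteration don't depend on each other, so a superscalar core can overlap them. Whether it is actually faster depends on the compiler, flags, and CPU, so treat this as something to measure rather than a guaranteed win.

```c
#include <stddef.h>

/* One accumulator: each add needs the result of the previous add,
   so the loop is a single serial dependency chain. */
double sum_single(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four accumulators (hypothetical variant): more code and more
   instructions, but the four adds in each iteration are independent,
   which gives the processor work it can run in parallel. */
double sum_four(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    /* Handle the 0 to 3 leftover elements. */
    for (; i < n; i++)
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

Note that the two functions add in a different order, so for general floating-point input the results can differ in the last bits. That reordering is also why a compiler usually won't do this transformation on its own without something like `-ffast-math`.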
I'm not sure if you're implying this is premature optimisation. It isn't.
It's a performance-sensitive standard-library function, the kind of thing that deserves optimisation in assembly. It's also the kind of problem that can be accelerated with SIMD, but that necessarily means more complex code. That's why the standard library implementations aren't always dead simple.
Here's a pretty in-depth discussion [0]. They discuss CPU throttling, caches, and being memory-bound.