If the macro scale is dominated by the code you're optimizing, which is often the case (say, some inner loops of a codec, signal processing, etc.), then the cache misses that may follow aren't really a concern. Obviously(?) taking some random code that doesn't run frequently and trying to optimize it won't make a difference. Microbenchmarks are pretty useful in this context since they let you iterate faster. More generally, the question is: what is the bottleneck? If the bottleneck is fetching stuff into caches, then that's what you optimize for; if the bottleneck is some computation running from caches, then that's what you optimize.
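For concreteness, here's a minimal sketch of the kind of microbenchmark harness I mean (the kernel and the constants are made up, and note the working set fits in cache, which is exactly the caveat raised further down the thread):

```c
#include <stdio.h>
#include <time.h>

/* Toy kernel standing in for "the inner loop you're optimizing". */
static long sum(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++) s += a[i];
    return s;
}

int main(void) {
    enum { N = 1 << 16, REPS = 10000 };
    static int a[N];
    for (int i = 0; i < N; i++) a[i] = i;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile long sink = 0;                 /* keep the loop from being optimized away */
    for (int r = 0; r < REPS; r++) sink += sum(a, N);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.2f ns per call (sink=%ld)\n", ns / REPS, sink);
    return 0;
}
```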
The big cautionary tale of microbenchmarking is memcpy. If you look at the GNU memcpy, it's ridiculously long and has obviously been guided by extensive benchmarking. And memcpy often shows up in profiles, so it is a big target. But in real systems a small simple loop will perform better due to icache pressure.
Memcpy might not be a good example, however, because it's often replaced by a loop (or a single move, or eliminated altogether) by the compiler. It's such a common function that it's understood by the compiler, and the tradeoffs of the real function are also understood and taken into account (of course I'm sure there are bad optimizations).
> in real systems a small simple loop will perform better due to icache pressure
Do you have a source for this? Seems surprising GCC would leave such low-hanging fruit. G++ makes the effort to reduce std::copy to a memmove call when it can, or at least some of the time when it can (or at least, it did so in 2011). [0]
Related to this: does GCC treat memcpy differently when it can determine at compile-time that it's just a small copy?
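To illustrate the constant-size case (my own example, not taken from GCC's documentation): when the size is a small compile-time constant, GCC and Clang typically fold the memcpy into a plain load/store instead of emitting a call.

```c
#include <string.h>
#include <stdint.h>

/* With a compile-time-constant 8-byte size, the memcpy below is
 * typically lowered to a single 64-bit load at -O1 and above;
 * no library call is emitted. */
uint64_t read_u64(const unsigned char *p) {
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```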
The problem is that on superscalar processors the correspondence between number of instructions and speed breaks down, partly because the processor does its own optimization on the fly and can do multiple things in parallel.
A programmer should be careful about second guessing the compiler. And a compiler should be careful about second guessing the processor.
I'm not sure if you're implying this is premature optimisation. It isn't.
It's a performance-sensitive standard-library function, the kind of thing that deserves optimisation in assembly. It's also the kind of problem that can be accelerated with SIMD, but that necessarily means more complex code. That's why the standard library implementations aren't always dead simple.
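For illustration only, here's a rough sketch of the SIMD flavor of copy loop I have in mind, using SSE2 intrinsics (the function name and structure are mine, not from any libc):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Illustrative only: copy 16 bytes per iteration with unaligned SSE2
 * loads/stores, then finish the tail byte by byte. A real libc memcpy
 * adds alignment handling, larger unrolls, and size-based dispatch. */
static void *copy16(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        d += 16; s += 16; n -= 16;
    }
    while (n--) *d++ = *s++;
    return dst;
}
```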
Here's a pretty in-depth discussion [0]. They discuss CPU throttling, caches, and being memory-bound.
It's amazing to me that every CPU doesn't have specialized memory copying instructions given how common an operation this is and how it can be a bottleneck. We have hundreds of vector instructions to optimize all kinds of special cases but little for this most general thing?
Intel has `rep movs`, which is recommended for general use; you can beat it for sizes < 128B using a specialized loop. A very good memcpy for general purposes would just branch on the size to either a specialized small copy or `rep movs` for larger sizes, roughly like the sketch below.
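A sketch of that dispatch, assuming x86-64 and GCC/Clang inline-asm syntax (the 128-byte cutoff is the one mentioned above, not a tuned constant):

```c
#include <stddef.h>

/* Sketch: byte loop for small copies, rep movsb for everything else. */
static void *copy_dispatch(void *dst, const void *src, size_t n) {
    void *ret = dst;
    if (n < 128) {
        unsigned char *d = dst;
        const unsigned char *s = src;
        while (n--) *d++ = *s++;
    } else {
        /* rep movsb copies RCX bytes from [RSI] to [RDI]; on CPUs with
         * ERMSB it moves whole cache lines internally. */
        __asm__ volatile("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(n)
                         :
                         : "memory");
    }
    return ret;
}
```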
x86 does: it's called REP MOVS, and it copies entire cache lines at a time (if the size of the copy lets it). AFAIK it has done this since the P6 family (Pentium II), and has only gotten better over time. As a bonus, it is only 2 bytes long. I believe the Linux kernel uses it as its default memcpy()/memmove().
Funnily enough, the REP family of instructions has existed at least since the original 8086/8088. It's very much a relic of the CISC heritage of the x86 ISA, and it was probably meant to make writing assembly code easier and to save space, rather than to improve performance.
I remember it was actually quite common in early PC programs, but I don't remember well enough whether it was also generated by compilers, or just existed in hand-optimized assembly (which was of course extremely common back then).
As I tried to indicate, the microbenchmark-guided unrolled loops look better than the simple `rep` instruction because the benchmarks aren't realistic. They don't suffer from mispredicted branches, they don't experience icache pressure. Basically the microarchitectural reality is absent in a small benchmark.
There's also the fact that `rep` startup cost was higher in the past than it is now. I think it started to get really fast around Ivy Bridge.
This is all discussed in Intel's optimization manual, by the way.
Yes, some compilers emit AVX move instructions for memcpy or for compiler-generated copies (e.g. a pass-by-value struct), in addition to the rep movs mentioned in the sibling posts.
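A small example of the compiler-generated case (my own, chosen just to show the shape): with AVX enabled, GCC and Clang will usually copy a struct like this with vector moves rather than a memcpy call.

```c
struct blob { double v[4]; };  /* 32 bytes */

/* The assignment below is a compiler-generated copy; built with
 * -O2 -mavx, GCC/Clang typically emit a vmovups load/store pair
 * instead of calling memcpy. */
void store_blob(struct blob *dst, struct blob src) {
    *dst = src;
}
```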