
If the macro scale is dominated by the code you're optimizing, which is often the case (say, the inner loops of a codec, signal processing, etc.), then the cache misses that may follow aren't really a concern. Obviously(?) taking some random code that doesn't run frequently and trying to optimize it won't make a difference. Microbenchmarks are pretty useful in this context since they let you iterate faster. More generally, the question is what the bottleneck is: if the bottleneck is fetching stuff into caches, then that's what you optimize for; if the bottleneck is some computation that is running from caches, then that's what you optimize...
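For concreteness, here's a minimal sketch of the kind of microbenchmark harness being described (process_block is a made-up stand-in for the hot kernel, and note that the number it prints only tells you about the in-cache case):

    /* Minimal microbenchmark sketch for a hot inner loop.
     * process_block is a hypothetical stand-in for the codec /
     * signal-processing kernel being tuned. */
    #include <stdio.h>
    #include <time.h>

    #define N     (1 << 16)
    #define ITERS 10000

    static float buf[N];

    /* Hypothetical kernel under test. */
    static void process_block(float *x, int n)
    {
        for (int i = 0; i < n; i++)
            x[i] = x[i] * 0.5f + 1.0f;
    }

    int main(void)
    {
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++)
            process_block(buf, N);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        /* Print buf[0] too so the work can't be optimized away. */
        printf("%.3f ns/element (buf[0]=%f)\n",
               secs * 1e9 / ((double)ITERS * N), buf[0]);
        return 0;
    }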



The big cautionary tale of microbenchmarking is memcpy. If you look at the GNU memcpy, it's ridiculously long and has obviously been guided by extensive benchmarking. And memcpy often shows up in profiles, so it is a big target. But in real systems a small, simple loop will perform better due to icache pressure.


Memcpy might not be a good example, however, because it's often replaced by a loop (or a move, or eliminated altogether) by the compiler: it's such a common function that it's understood by the compiler, and the tradeoffs of the real function are also understood and taken into account (of course I'm sure there are bad optimizations).


> in real systems a small simple loop will perform better due to icache pressure

Do you have a source for this? Seems surprising GCC would leave such low-hanging fruit. G++ makes the effort to reduce std::copy to a memmove call when it can, or at least some of the time when it can (or at least, it did so in 2011). [0]

Related to this: does GCC treat memcpy differently when it can determine at compile-time that it's just a small copy?

[0] https://stackoverflow.com/a/4707028/
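On the compile-time question: as the sibling comment notes, compilers do understand memcpy, and with a small constant size GCC and Clang typically inline it into a few plain moves instead of calling the library function. A rough sketch (copy_header and struct header are made-up names for illustration):

    #include <string.h>
    #include <stdint.h>

    struct header { uint32_t magic; uint32_t len; };

    void copy_header(struct header *dst, const struct header *src)
    {
        /* 8-byte constant size: GCC/Clang at -O2 usually emit a single
         * 64-bit load/store here rather than a call to memcpy. */
        memcpy(dst, src, sizeof *src);
    }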


The problem is that on superscalar processors the correspondence between the number of instructions and speed breaks down, partly because the processor does its own optimization on the fly and can do multiple things in parallel.

A programmer should be careful about second guessing the compiler. And a compiler should be careful about second guessing the processor.


I'm not sure if you're implying this is premature optimisation. It isn't.

It's a performance-sensitive standard-library function, the kind of thing that deserves optimisation in assembly. It's also the kind of problem that can be accelerated with SIMD, but that necessarily means more complex code. That's why the standard library implementations aren't always dead simple.

Here's a pretty in-depth discussion [0]. They discuss CPU throttling, caches, and being memory-bound.

[0] https://news.ycombinator.com/item?id=18260154


Only personal experience. If you look at the memcpy in llvm's libc, it was contributed by Googlers who share my experience and perspective.


It's amazing to me that every CPU doesn't have specialized memory copying instructions given how common an operation this is and how it can be a bottleneck. We have hundreds of vector instructions to optimize all kinds of special cases but little for this most general thing?


Intel has `rep movs`, which is recommended for general use; you can beat it for sizes < 128 B using a specialized loop. A very good general-purpose memcpy would just branch on the size to either a specialized small copy or `rep movs` for larger sizes.
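A minimal sketch of that shape, not glibc's actual code (the 128-byte cutoff and the inline asm are illustrative, x86-64 only):

    #include <stddef.h>

    static void *my_memcpy(void *dst, const void *src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = src;

        if (n < 128) {
            /* Small copy: a plain loop keeps code size tiny and avoids
             * rep startup latency. */
            for (size_t i = 0; i < n; i++)
                d[i] = s[i];
        } else {
            /* Large copy: let the microcoded string move do the work;
             * on ERMS-capable CPUs it moves whole cache lines. */
            __asm__ volatile("rep movsb"
                             : "+D"(d), "+S"(s), "+c"(n)
                             :
                             : "memory");
        }
        return dst;
    }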

The glibc version is just bananas. https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=sysde...

Compare the newer ERMS implementation: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86...


What's weird is that if you git blame a bit, a lot of the elaborate stuff was contributed by Intel.


Intel is at least as guilty of chasing misleading benchmarks as anyone else in the history of our field.


No AVX2-optimized version? I'm disappointed…

Edit: it's rolled into the erms version.


x86 does: it's called REP MOVS, and it copies entire cache lines at a time (if the size of the copy lets it). AFAIK it has done this since the P6 family (Pentium II), and it has only gotten better over time. As a bonus, it is only 2 bytes long. I believe the Linux kernel uses it as its default memcpy()/memmove().

Here's some very interesting discussion about it: https://stackoverflow.com/questions/43343231/enhanced-rep-mo...


Funnily enough, the REP family of instructions has existed at least since the original 8086/8088. It's very much a relic of the CISC heritage of the x86 ISA, and it was probably meant to make writing assembly code easier and to save space, rather than to improve performance.

I remember it was actually quite common in early PC programs, but I don't remember well enough whether it was also generated by compilers, or just existed in hand-optimized assembly (which was of course extremely common back then).


Why isn't it the default everywhere?


As I tried to indicate, the microbenchmark-guided unrolled loops look better than the simple `rep` instruction because the benchmarks aren't realistic. They don't suffer from mispredicted branches, and they don't experience icache pressure. Basically, the microarchitectural reality is absent in a small benchmark.

There's also the fact that `rep` startup cost was higher in the past than it is now. I think it started to get really fast around Ivy Bridge.

This is all discussed in Intel's optimization manual, by the way.


Neat. I was unaware of this manual. For the lazy but curious:

https://software.intel.com/content/www/us/en/develop/downloa...

And if you want ALL THE THINGS:

https://software.intel.com/content/www/us/en/develop/article...


Yes, some compilers emit AVX move instructions for memcpy or for compiler-generated moves (e.g. a struct passed by value), in addition to the rep move mentioned in the sibling posts.
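For illustration (made-up types; the exact instruction selection depends on target flags): a largish struct passed by value forces the compiler to generate the copy itself, and with AVX enabled that copy is typically done with 32-byte vector moves rather than a memcpy call.

    #include <stdint.h>

    struct frame { uint8_t bytes[64]; };

    /* Trivial consumer, kept out of line so the call-site copy stays. */
    __attribute__((noinline))
    uint8_t consume(struct frame f) { return f.bytes[0]; }

    uint8_t forward(const struct frame *src)
    {
        /* The 64-byte copy for the by-value argument is generated by
         * the compiler; with -O2 -mavx2 it is typically a couple of
         * 32-byte vector loads/stores (vmovdqu-style). */
        return consume(*src);
    }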


llvm-libc's memcpy is heavily size-optimized for this reason. A dedicated instruction that is fast for all cases would be ideal, though.



