Fast memcpy, a system design (sigarch.org)
137 points by matt_d on Dec 19, 2022 | 79 comments



fleet-wide profiling revealed that 25-35% of all CPU time was spent just moving bytes around[...]If data movement were faster, more work could be done on the same processors.

Here's a different thought experiment: how much of that data movement is actually necessary? If data movement is taking up so much time, why not investigate why, and whether it's even necessary, instead of assuming it's the hardware that needs to be faster? As the old saying goes, "the fastest way to do something is to not do it at all." One pattern that the widespread use of HLLs seems to particularly encourage is making lots of unnecessary temporary copies of data. The biggest inefficiency isn't the hardware (unless your goal is to sell more hardware, but that's a different rant...) --- it's the insanely complex and bloated layers of software that get layered on top of what is otherwise perfectly adequate hardware. From that perspective, faster hardware only leads to even more wasteful and inefficient software. Making memory copies faster will only cause software to make more unnecessary copies.

But after reading the whole thing, I can say that this is certainly a very strange article. It's written by someone who is presumably quite knowledgeable about hardware, yet seems completely oblivious to the fact that x86 has had a dedicated instruction for this since the 8086, and 16 bytes per cycle was already achievable with the P6 (1996) "fast strings" enhancement:

https://stackoverflow.com/questions/33902068/what-setup-does...


While you are right, we are where we are and there is tremendous leverage in making an extremely common (even if undesirable) thing more efficient. It kind of doesn't matter why millions of lines of code perform memcpys - they do, and we're unlikely to change the majority of them anytime soon.

P.S. I work on the type of systems where we obsess over unneeded memory accesses and copies. Minimizing those is not as trivial as it may seem, especially if you care about memory safety.


> While you are right, we are where we are and there is tremendous leverage in making an extremely common (even if undesirable) thing more efficient.

Aren't you worried that this will just lead to another manifestation of the Jevons paradox [0]? I get that you already mentioned that everyone in your "field" understands that memory is the bottleneck to avoid, but still.

[0] https://en.wikipedia.org/wiki/Jevons_paradox


Memory gets moved for a lot of reasons. Any type of FFI usually requires memory copies. For instance, if you pass a C string from Swift to C, the C side needs to copy the memory to use it after the immediate call. Or if C returns a string to Swift, Swift needs to copy the string into its own format.
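
A minimal C-side sketch of why that copy is hard to avoid (remember_name and retained_name are made up for illustration): the callee only gets a pointer that is valid for the duration of the call, so keeping the string around means duplicating it.

    #include <stdlib.h>
    #include <string.h>

    static char *retained_name;

    /* Called with a string borrowed from the other side of the FFI
       boundary; the pointer is only guaranteed valid for this call. */
    void remember_name(const char *borrowed) {
        free(retained_name);
        retained_name = strdup(borrowed);  /* copy, so we can use it later */
    }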


This is my standpoint too.

Assuming Google uses gRPC and protobufs extensively, I can only imagine a chaotic polyglot environment where buffers are copied and transformed very often.

I work at a fairly large Python shop using libraries written in C. The cost of FFI is non-trivial; memory management and moving data are the usual culprits.


There is also I/O: most interactions with disks or NICs involve copying twice. We know efficient zero-copy I/O is achievable; it's just that it requires a lot of work from kernel and app developers for a very limited ROI.
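
For the file-to-socket case, one of the classic zero-copy paths on Linux is sendfile(2); a minimal sketch (error handling omitted):

    #include <sys/sendfile.h>
    #include <sys/types.h>

    /* Ask the kernel to move `len` bytes from a file to a socket
       without bouncing them through a user-space buffer. */
    ssize_t send_zero_copy(int sock_fd, int file_fd, size_t len) {
        off_t offset = 0;
        return sendfile(sock_fd, file_fd, &offset, len);
    }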


While the overall point is valid, this isn’t strictly true. A Swift string is frequently backed by a contiguous UTF-8 buffer terminated with a \0, so passing to C is just a matter of passing the pointer and requires no copies.


But Swift has no way of signaling to C that the memory can be freed so I’m pretty sure it does a copy when using String’s cString constructor.


When you pass a Swift String to C you do so with a given lifetime. By default this is the duration of the call.


Always two of them there are. Time and space. Spatial separation has the great benefit that it frees you from separation in time, which is technically and mentally more challenging. If you want or need that anything can happen at any time on either side, you end up copying (and ideally closing the door after). Very much like Erlang does with the Beam.


Cynical answer: because that's not sexy.

I have some coworkers trying to re-architect a big chunk of our application to selectively provide data because "only about 10% is used on each interaction". Nobody has bothered to check what % of the data is never used at all. If it's 50%, then you're complicating a whole lot of code and reuse semantics in order to get a 2-4x improvement in 'performance' that you're selling as a 10x improvement.

But to your bigger question, defensive copies are about keeping someone else from munging your data. Borrow Semantics prevent some of those scenarios, which at least moves some of that inefficiency to compile time, where the tax is paid by the perpetrator and their peers, instead of by the users.


A lot of the additional data movement is a.) defensive copies because too many data structures are mutable and b.) change of data format to internal software representations, e.g. wire->decode->IR1->...->IR1->encode->wire.


> change of data format to internal software representations

There is so much of this. Is there any hope for a world where we can ship working memory without costly XDR? Does anyone have data on the rate of change of time or energy spent on XDR translation?


I have thought a lot about this over the years, and I think one of the issues is that it is hard to specify fixed data formats in most programming languages (in a declarative way), leading to lots of code that decodes byte-by-byte or word-by-word. Since writing all that sucks, many programmers have turned to higher-level data formats such as JSON or protobufs that are decoded by libraries.
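
To make the byte-by-byte style concrete, here is a sketch in C of the kind of hand-written decode that piles up when the language has no declarative way to describe the wire layout (the header format is invented for the example):

    #include <stdint.h>

    struct header {
        uint16_t type;
        uint32_t length;
    };

    /* Reassemble each little-endian field by hand from the wire bytes. */
    static struct header decode_header(const uint8_t *p) {
        struct header h;
        h.type   = (uint16_t)(p[0] | (p[1] << 8));
        h.length = (uint32_t)p[2] | ((uint32_t)p[3] << 8)
                 | ((uint32_t)p[4] << 16) | ((uint32_t)p[5] << 24);
        return h;
    }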

I have been working on a new feature in Virgil to make writing explicit byte-by-byte layouts easier and directly supported by the compiler. It's still a work in progress though.


If the heap were treated like a typed database, with transaction semantics, many of our problems would disappear. Layout should not be an implementation detail. I know Wasm won't solve this explicitly, but there is at least a chance while everything is being redefined.


Wouldn’t that move the data transformation costs from I/O time to all the time?


Not if the encoding can simply be consumed in place, needing only a validation pass if it's coming from an untrusted source. Cap'n Proto is pretty close to that for one example.
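
A rough C sketch of what "consumed in place" means in practice (the layout here is invented, not Cap'n Proto's actual encoding): validate once, then read fields through pointers into the original buffer instead of unpacking into a separate object graph.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct msg_view {
        uint32_t count;
        const uint8_t *payload;   /* points into the original buffer */
    };

    static bool view_message(const uint8_t *buf, size_t len, struct msg_view *out) {
        if (len < 4)
            return false;                   /* validation pass */
        uint32_t count;
        memcpy(&count, buf, sizeof count);  /* 4-byte header read, no bulk copy */
        if (count > len - 4)
            return false;
        out->count = count;
        out->payload = buf + 4;
        return true;
    }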


> a world where we can ship working memory

Yes, you can have zero serialization latency: https://capnproto.org/


It's a lot of work to reason about programs in order to reduce copies. A different, cheaper and more feasible approach is to reason about alignments and sizes.


I've seen some long-running work on the Rust compiler to use high-level guarantees (lifetimes, ownership...) to improve codegen, and one of the big tasks seems to be 'remove unnecessary memory copies'. Fun to watch; I hope there'll be a cool write-up.


I'd imagine a lot of that was just the same simple operation on a stream of data; there isn't that much to optimize there aside from finding a way to operate on more data at a time.


Just make memcpy a bona fide hardware instruction already. Since it could take unbounded execution time, you could, e.g., limit it to a fixed, large size, like 4KB. That'd be a huge benefit, just in handling all the alignment/misalignment cases. And it's big enough that the loop you need to put around it has relatively low overhead. You could specify that address translations (and therefore faults) happen at the beginning, and no data is written.
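
A sketch of how software might wrap such an instruction, with a purely hypothetical intrinsic __hw_memcpy_4k() standing in for it (neither the name nor the instruction exists today):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical: one instruction that copies up to 4 KB, handling
       all alignment cases, with any faults taken before data is written. */
    extern void __hw_memcpy_4k(void *dst, const void *src, size_t n);

    void big_memcpy(void *dst, const void *src, size_t n) {
        uint8_t *d = dst;
        const uint8_t *s = src;
        while (n > 4096) {                 /* cheap outer loop for large copies */
            __hw_memcpy_4k(d, s, 4096);
            d += 4096; s += 4096; n -= 4096;
        }
        if (n)
            __hw_memcpy_4k(d, s, n);
    }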


ARM did so in ARMv8.8, and x86 has had REP MOVS since forever (though it only became consistently faster than SW sequences relatively recently).


Not just an instruction, but something the MMU can use to coalesce memory in the background, if the MMU can do multiple levels of indirection or RC pages, then it can also do GC. The MMU has been too underpowered for too long.


x86 has had that since the 8086. It's called REP MOVS. As far as I know, it's still the fastest general-purpose way to do copying.
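
For reference, here's a minimal GCC/Clang inline-asm sketch of a REP MOVSB copy on x86-64 (dst, src and count go in RDI, RSI and RCX, and the instruction does the rest):

    #include <stddef.h>

    static void rep_movsb_copy(void *dst, const void *src, size_t n) {
        __asm__ volatile ("rep movsb"
                          : "+D"(dst), "+S"(src), "+c"(n)
                          :
                          : "memory");
    }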


It certainly wasn't the fastest general-purpose way to do copying for the decades of CPUs between the 8086 (late 1970s) and the introduction of enhanced rep movsb (ERMSB) in Ivy Bridge (2012). For post-Ivy Bridge CPUs, I believe it was non-optimal for some usage patterns for a while, too. It may be optimal in very recent Intel CPUs, but that's a far cry from saying it's been the fastest since its introduction in the 8086.

E.g., Agner: https://www.agner.org/optimize/blog/read.php?i=372

E.g., Stackoverflow: https://stackoverflow.com/questions/43343231/enhanced-rep-mo...


Right. The follow-on to ERMSB, FSRM ("fast short rep movs"), which first appeared in Icelake, finally makes it consistently competitive with SW sequences¹.

¹ but you still want to branch around it when the length is zero ("when the length is zero?!" I hear you cry; it turns out that if you instrument actual systems, you will find that this happens shockingly often).


> FSRM finally makes it consistently competitive with SW sequences

Mateusz Guzik says it's decent above 128 bytes, but that software sequences still win below that.


It depends on the exact distribution of sizes (and especially if the size[s] are statically knowable—e.g. if you are copying exactly 31 bytes, or something like an unknown size between 48 and 62 bytes, a SW sequence will still win), but it is now _competitive_ if not actually as fast (previously it was often 2-3x slower in that range, even when the length was not fixed).
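
For the curious, a sketch of the fixed-size trick being referred to (copy31 is just an illustrative name): a 31-byte copy done as two overlapping 16-byte SSE loads and stores, with no loop and no length branches.

    #include <emmintrin.h>

    /* Copy exactly 31 bytes: bytes 0-15 and bytes 15-30 overlap by one. */
    static void copy31(void *dst, const void *src) {
        __m128i lo = _mm_loadu_si128((const __m128i *)src);
        __m128i hi = _mm_loadu_si128((const __m128i *)((const char *)src + 15));
        _mm_storeu_si128((__m128i *)dst, lo);
        _mm_storeu_si128((__m128i *)((char *)dst + 15), hi);
    }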


I've heard this before and unfortunately it isn't.


Do you have any more information to share?

The ultra-bloated unrolled vector implementations may appear a little faster in microbenchmarks but the amount of cache they take up and other effects like AVX downclocking (for those implementations that use AVX) mean that in practice they can actually be slower overall.


I've heard this directly from engineers at Intel.


Is this pre- or post-ERMS?


I don't believe it's been consistently the fastest way to copy memory even post-ERMSB. (It would be great and convenient if it were!)


Perhaps for very small sizes (just use a mov directly) and very large ones (that fall out of your cache, so you’d probably want non-temporal stores). I think it would be difficult to cover all cases.
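
A sketch of the large-copy end of that spectrum: non-temporal (streaming) stores, which bypass the cache so a huge copy doesn't evict the working set. This assumes a 16-byte-aligned destination and a length that is a multiple of 16.

    #include <emmintrin.h>
    #include <stddef.h>

    static void copy_nontemporal(void *dst, const void *src, size_t n) {
        char *d = dst;
        const char *s = src;
        for (size_t i = 0; i < n; i += 16)
            _mm_stream_si128((__m128i *)(d + i),
                             _mm_loadu_si128((const __m128i *)(s + i)));
        _mm_sfence();   /* order the streaming stores before returning */
    }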


It pretty much is (except for length-zero) starting in Icelake, with FSRM (aka ERMSB II, The Reckoning)


It'd be great if this were true, since that would finally put an end to the madness. But I've hoped for this for many, many years, and I've always been disappointed when someone spends ungodly hours on some hyper-tuned AVX thing that just barely ekes past, and we get stuck with another unwieldy monster for a decade.


I heard this around 2018 or so.


memcpy implementation is a rabbit hole.

The main trap is trying to benchmark and profile it without the context of the whole application.

I recommend reading a comment in the ClickHouse source code (disclaimer: written by me):

https://github.com/ClickHouse/ClickHouse/blob/master/base/gl...


> The main trap is trying to benchmark and profile it without the context of the whole application.

Emery Berger's "Performance (Really) Matters" talk also goes into this[0]. He paints a really, really depressing picture of how unreliable benchmarks truly are. Like, things like link order having ridiculous impacts and stuff like that.

[0] https://www.youtube.com/watch?v=7g1Acy5eGbE&t=306s


I wonder if 25-35% of all CPU cycles being devoted to moving bytes around is a design problem, i.e. a result of the RPC-oriented programming model we used at Google.

I know there were also measurements about protobuf encode/decode, and I think it was a similarly large portion of the total workload. (Anecdotally, I think that's why people say the protobuf libraries are huge and slow to compile -- because they're highly optimized; that's probably what empirically resulted in the fastest execution time.)

It feels like a content-oriented and value-oriented model would have less of this profile, i.e. with (correct) local caching and differential compression


Interestingly, the publicly available protobuf implementation is quite bad in that it only uses std::string for string & repeated bytes fields, with std::string_view support only coming soon. I believe Google’s internal protobuf code supports Abseil’s string_view equivalent.



Ah yeah, I think the historical issue is that Google doesn't use std::string internally for performance reasons. They have their own C++ string implementation, which protobuf uses internally, like all other code.

And then when they open sourced it, I guess they left out that code. So the open source version is probably "de-optimized" a bit unfortunately.


In my entire long career in software, faster memcpy is a constant topic. I remember using the IBM PC DMA chip to make it faster.


> We introduce two new instructions to handle the short move case quickly

These are more or less built into AVX-512 already...

(The hot path is: load immediate; bzhi; kmov; load; store. 5 instructions rather than 2, but still trivial. If you look at the _latencies_ involved, the difference is trivial.)
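
Roughly that hot path written with intrinsics, as a sketch for n <= 64 (needs AVX-512BW and BMI2; copy_short_avx512 is just an illustrative name):

    #include <immintrin.h>
    #include <stddef.h>

    static void copy_short_avx512(void *dst, const void *src, size_t n) {
        __mmask64 k = _cvtu64_mask64(_bzhi_u64(~0ULL, n));  /* bzhi + kmov */
        __m512i v  = _mm512_maskz_loadu_epi8(k, src);       /* masked load  */
        _mm512_mask_storeu_epi8(dst, k, v);                 /* masked store */
    }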

(Afaik it's in sve too.)


Please, quit messing with my scroll speed.


I know, it's awful isn't it? The irony of an article about detailed low-level optimisation of memcpy (which is probably being used unnecessarily anyway), re-implementing scrolling in javascript...

You've got to admire it, really.


If anyone is curious or otherwise interested, the version of memcpy() used for Google production is basically what is in LLVM's libc, https://libc.llvm.org


LLVM's implementation is here, more or less: https://github.com/llvm/llvm-project/blob/main/libc/src/stri...


I really like this refutation of the GNU-style micro-optimized-but-slow-in-practice mess of assembly.


I would like such a refutation, notionally. I do not particularly like this refutation, considering that it is just as overabstracted, and probably not much faster (if at all).

What I would really like is to see more widespread hoisting of dispatch. Rather than calling 'memcpy', call 'memcpy_small', 'memcpy8', etc.
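
A sketch of what that hoisting could look like, with illustrative implementations behind the names suggested above (memcpy8, memcpy_small): the call site, which knows its size distribution, picks the specialized routine instead of every call re-deriving it from the length.

    #include <stddef.h>
    #include <string.h>

    /* Exactly 8 bytes: compiles to a single load/store pair. */
    static inline void memcpy8(void *dst, const void *src) {
        memcpy(dst, src, 8);
    }

    /* Caller promises n is small (say, under 64 bytes). */
    static inline void memcpy_small(void *dst, const void *src, size_t n) {
        char *d = dst;
        const char *s = src;
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    }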


That's the whole gist of gchatelet's implementation. Make it easy for the compiler to inline memcpy, so the compiler gets a chance to propagate the information it has about the parameters. In many cases it can eliminate all the branches but one.

The GNU way of using a huge, branchy assembly block that is selected at runtime using ifunc means that the compiler never even got a chance.
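
A small illustration of the size-propagation point (struct header is a made-up example): when the copy can be inlined and the size is a compile-time constant, the whole call collapses to a few plain moves, something an ifunc-dispatched assembly blob can never be specialized into.

    #include <string.h>

    struct header { unsigned id; unsigned len; unsigned crc; };

    void copy_header(struct header *dst, const struct header *src) {
        /* The constant 12-byte size is visible to the compiler, so this
           becomes a couple of loads and stores with no branches. */
        memcpy(dst, src, sizeof *dst);
    }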

Regarding the question of whether or not they are faster, see section 4.4 of this paper. Replacing the glibc memcmp with something trivial resulted in up to 1% speedup in Google web search, when considering the whole program. It doesn't microbenchmark as well as glibc memcmp, but it is a better function that doesn't wreck the performance of the rest of the system.

https://storage.googleapis.com/pub-tools-public-publication-...


> Make it easy for the compiler to inline memcpy, so the compiler gets a chance to propagate the information it has about the parameters

I have low hopes for compilers. Inlining heuristics are terribly complicated, and optimisations that result therefrom will only be things that the compiler can prove for _all_ invocations. Inlining won't get you 'this call site is usually large', or 'this call site is small, but predictable', or 'this call site is really unpredictable, so use weird branchfree tricks'. (A JIT compiler might do better, but people do not usually JIT c or c++.)

> Replacing the glibc memcmp with something trivial resulted in up to 1% speedup in Google web search, when considering the whole program. It doesn't microbenchmark as well as glibc memcmp, but it is a better function that doesn't wreck the performance of the rest of the system

It's not all or nothing, and rep cmps is utter trash. I was careful to keep my memcmp, for instance, below 4 cache lines (or 3, considering only the version that doesn't overread)—<https://github.com/moon-chilled/fancy-memcmp/blob/master/mem...>—and it actually has reasonable throughput.

(Also: inlining is not the way to go if you are frontend-bound...)


> (Also: inlining is not the way to go if you are frontend-bound...)

When a workload is frontend-bound, the culprit is usually either instruction cache misses (e.g. a lot of unpredictable virtual calls or generally poor code layout) or branch mispredictions (e.g. a lot of unpredictable branches in your code). I fail to see how inlining the code can be correlated with these two effects, other than through what I believe is a myth: that inlining will (1) grow the executable size, (2) put more pressure on the instruction cache, and (3) therefore end up degrading performance. In my experience, I have rarely seen even (1) take place (compilers and linkers are way too smart nowadays), and I think I have never managed to measure the effects of (2) and (3) in a non-micro-benchmark scenario.

Anyway, if I were to optimize a workload found to be frontend-bound, eliminating branches and getting rid of dynamic dispatch would be the first two things to check. Inlining maybe, but only maybe, in third place.


> We suggest a few new instructions

Not really a novel idea, particularly since ARM already announced memcpy instructions earlier this year.


Very interesting and hardcore solutions. Adding new instructions to improve data copying in a more "clever" way than just the repeat prefix seems like an interesting path.

There was a silly typo in the very first sentence that threw me off:

When I worked at Google, fleet-wide profiling revealed that 25-35% of all CPU time was spent just moving bytes around: memcpy, strcmp, [...]

I think that should be `strcpy()`, since obviously `strcmp()` doesn't move data around.


strcpy is a better example, but strcmp does memory-to-cpu moves, which is part of what they’re trying to speed up.


The prefetching explained later would likely benefit strcmp too.


This reminds me of the old Motorola M68K dbcc instruction, which was a specialized looping instruction where you could loop back only a small number of instructions which would all be cached tightly in a tiny cache that existed only for the dbcc case.

Are there modern analogues to that dbcc instruction?


A branch?


dbcc had a specific semantic that you could only branch back a very small amount and that the CPU would cache the body of the loop. A modern equivalent would use a small amount of cache that is not even L1.


You should try to code in such a way that if memcpy were made 100X slower, you wouldn't notice.


Ok, how?


Can’t we get content addressable memory already? That would make memcpy moot in many cases.


Something similar in this paper, putting an FPGA between the processor and DRAM so you can arbitrarily remap bits of it: https://arxiv.org/abs/2109.14349


Content addressable memory with built-in copy-on-write, fine granularity, and security tags? Not sure it is possible.


It would be amazing though, wouldn’t it? I’m thinking lots of tiny compute units near the RAM chips.


Sounds like Belt.


Ok, I’ll bite: in which cases does content addressable memory obsolete memcpy? It’s just a lookup table in hardware…


Content addressable memory is the ultimate "zero copy architecture". Imagine a "CPU instruction" or library call which determines that, "no, I have already seen this address" (where address isn't an address as much as a unique hash) and can pass that hash along to the destination instead of running a copy.

Overhead increases by effectively removing pointers as we know them, replacing them with hash+length, but the reward is lower memory pressure and better cache locality. It also carries the promise of making a "buffer overrun" not only impossible but unthinkable. In such a world, incrementing "one extra" would not step into some other heap or stack, because everything would be segmented.


On GCP adding RAM to a machine type increases costs by a stupid amount. Now it seems like they’re thinking of ways to penalise you for actually using it, because you may interfere with another tenant on the same machine.


Intel is known to analyze code from actual applications and invent new instructions that seem to be missing.

Surely they are aware of the memcpy problem, so if it's true that servers spend a lot of time copying memory, what is Intel (or AMD) doing regarding this?



Ah, a new name for IOAT (PCIe enumerated DMA engine in the SoC). I used this back on Broadwell.


> what is Intel (or AMD) doing regarding this?

In this case they haven't invented new instructions at all but have been steadily improving REP MOVS over time.


So does that mean that further major optimizations are basically impossible?

You don't always need the operation finished right away. Maybe REP MOVS could return a sort of a handle that you then wait for when you actually need to use the destination, and the CPU can keep chugging along in the background. Like when you submit commands to the GPU or how you can wait at memory fences for other threads in a warp to complete.


Further optimizations are certainly possible, especially as the hardware improves and gains additional capabilities. With regard to your suggestion, this is essentially what already happens, thanks to the magic of out-of-order cores.




