fleet-wide profiling revealed that 25-35% of all CPU time was spent just moving bytes around [...] If data movement were faster, more work could be done on the same processors.
Here's a different thought experiment: how much of that data movement is actually necessary? If data movement was taking up so much time, why not investigate why, and whether it's even necessary, instead of assuming it's the hardware that needs to be faster? As the old saying goes, "the fastest way to do something is to not do it at all." One pattern that the widespread use of HLLs seems to particularly encourage is making lots of unnecessary temporary copies of data. The biggest inefficiency isn't the hardware (unless your goal is to sell more hardware, but that's a different rant...); it's the insanely complex and bloated layers of software that get layered on top of what is otherwise perfectly adequate hardware. From that perspective, faster hardware only leads to even more wasteful and inefficient software. Making memory copies faster will only cause software to make more unnecessary copies.
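To make that concrete, here's a small, hypothetical C++ sketch of the kind of incidental temporary copying I mean; it's an illustration of my own, not something from the article:

```cpp
// Hypothetical illustration only: two full copies of the payload are made
// even though the caller just wants to read a trimmed view of it.
#include <string>

std::string trimmed(std::string s) {    // copy #1: pass-by-value
  auto b = s.find_first_not_of(' ');
  if (b == std::string::npos) return {};
  auto e = s.find_last_not_of(' ');
  return s.substr(b, e - b + 1);        // copy #2: substr allocates and copies
}
```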
But after reading the whole thing, I can say that this is certainly a very strange article. It's written by someone who is presumably quite knowledgeable about hardware, yet seems completely oblivious to the fact that x86 has had a dedicated instruction for this since the 8086, and 16 bytes per cycle was already achievable with the P6 (1996) "fast strings" enhancement:
https://stackoverflow.com/questions/33902068/what-setup-does...
While you are right, we are where we are, and there is tremendous leverage in making an extremely common (even if undesirable) thing more efficient. It kind of doesn't matter why millions of lines of code perform memcpys - they do, and we're unlikely to change the majority of them anytime soon.
P.S. I work on the type of systems where we obsess over unneeded memory accesses and copies. Minimizing those is not as trivial as it may seem, especially if you care about memory safety.
> While you are right, we are where we are and there is tremendous leverage in making an extremely common (even if undesirable) thing more efficient.
Aren't you worried that this will just lead to another manifestation of the Jevons Paradox? I get that you already mention that everyone in your "field" understands that memory is the bottleneck to avoid, but still.
Memory gets moved for a lot of reasons. Any type of FFI usually requires memory copies. For instance, if you pass a C string from Swift to C, the C side needs to copy the memory to use it after the immediate call. Or if C returns a string to Swift, Swift needs to copy the string into its own format.
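As a sketch of why the callee has to copy (made-up names, C++ on the C side): a pointer received across the FFI boundary is only guaranteed to be valid for the duration of the call, so anything kept for later has to be duplicated.

```cpp
// Hypothetical sketch: `name` is only guaranteed to be valid for the duration
// of this call (e.g. when Swift bridges a String to a temporary C string), so
// keeping it around for later forces a copy.
#include <string>

static std::string saved_name;

extern "C" void register_name(const char* name) {
  saved_name = name;  // copies the bytes into storage the callee owns
}
```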
Assuming Google uses gRPC and protobufs extensively, I can only imagine a chaotic polyglot environment where buffers are copied and transformed very often.
I work at a fairly large Python shop using libraries in C. The cost of FFI is non-trivial; memory management and moving data are the usual culprits.
There is also I/O: most interactions with disks or NICs involve copying twice. We know efficient zero-copy I/O is achievable; it's just that it requires a lot of work from kernel and app developers for a very limited ROI.
While the overall point is valid, this isn't strictly true. A Swift string is frequently backed by a contiguous UTF-8 buffer terminated with a \0, so passing to C is just a matter of passing the pointer and requires no copies.
Always two of them there are. Time and space. Spatial separation has the great benefit that it frees you from separation in time, which is technically and mentally more challenging. If you want or need that anything can happen at any time on either side, you end up copying (and ideally closing the door after). Very much like Erlang does with the Beam.
I have some coworkers trying to re-architect a big chunk of our application to selectively provide data because "only about 10% is used on each interaction". Nobody has bothered to check what % of the data is never used at all. If it's 50%, then you're complicating a whole lot of code and reuse semantics in order to get a 2-4x improvement in 'performance' that you're selling as a 10x improvement.
But to your bigger question, defensive copies are about keeping someone else from munging your data. Borrow Semantics prevent some of those scenarios, which at least moves some of that inefficiency to compile time, where the tax is paid by the perpetrator and their peers, instead of by the users.
A lot of the additional data movement is a.) defensive copies because too many data structures are mutable and b.) change of data format to internal software representations, e.g. wire->decode->IR1->...->IR1->encode->wire.
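A minimal, made-up C++ example of the defensive-copy case in (a): because the container is mutable and shared, the accessor hands out a copy rather than a reference.

```cpp
// Hypothetical defensive copy: the getter returns the vector by value so
// callers can't mutate the internal state, at the cost of copying every
// element on each call.
#include <string>
#include <vector>

class Config {
 public:
  std::vector<std::string> hosts() const { return hosts_; }  // defensive copy
 private:
  std::vector<std::string> hosts_ = {"a.example", "b.example"};
};
```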
> change of data format to internal software representations
There is so much of this. Is there any hope for a world where we can ship working memory without costly XDR? Does anyone have data on the rate of change of time or energy spent on XDR translation?
I have thought a lot about this over the years, and I think one of the issues is that it is hard to specify fixed data formats in most programming languages (in a declarative way), leading to lots of code that decodes byte-by-byte or word-by-word. Since writing all that sucks, many programmers have turned to higher-level data formats such as JSON or protobufs that are decoded by libraries.
I have been working on a new feature in Virgil to make writing explicit byte-by-byte layouts easier and directly supported by the compiler. It's still a work in progress though.
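For instance, the sort of hand-written, field-by-field decoding I'm describing tends to look like this (a made-up header layout, not any real format):

```cpp
// Hypothetical fixed wire layout: 4-byte magic, 2-byte version, 2-byte flags.
// Hand-decoding it means spelling out every offset yourself.
#include <cstdint>
#include <cstring>

struct Header {
  uint32_t magic;
  uint16_t version;
  uint16_t flags;
};

Header decode_header(const uint8_t* buf) {
  Header h;
  std::memcpy(&h.magic,   buf + 0, sizeof h.magic);    // assumes a little-endian host
  std::memcpy(&h.version, buf + 4, sizeof h.version);
  std::memcpy(&h.flags,   buf + 6, sizeof h.flags);
  return h;
}
```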
If the heap were treated like a typed database, with transaction semantics, many of our problems would disappear. Layout should not be an implementation detail. I know Wasm won't solve this explicitly, but it at least has a chance while everything is being redefined.
Not if the encoding can simply be consumed in place, needing only a validation pass if it's coming from an untrusted source. Cap'n Proto is pretty close to that for one example.
It's a lot of work to reason about programs in order to reduce copies. A different, cheaper, and more feasible approach is to reason about alignments and sizes.
I've seen some long-running work on the Rust compiler to use high-level guarantees (lifetimes, ownership...) to improve codegen, and one of the big tasks seems to be 'remove unnecessary memory copies'. Fun to watch; I hope there'll be a cool write-up.
I'd imagine a lot of that was just the same simple operation on a stream of data; there isn't that much to optimize there aside from finding a way to operate on more data at a time.
Just make memcpy a bona fide hardware instruction already. Since it could take unbounded execution time, you could, e.g., limit it to a fixed, large size, like 4KB. That'd be a huge benefit, just in handling all the alignment/misalignment cases. And it's big enough that the loop you need to put around it has relatively low overhead. You could specify that address translations (and therefore faults) happen at the beginning, before any data is written.
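Roughly, software would wrap the proposed instruction in a loop like this (the helper name is invented, and memcpy stands in for the real hardware operation):

```cpp
// Sketch only: hw_memcpy_4k is a made-up stand-in for the proposed
// "copy at most 4KB, translate and fault up front" instruction.
#include <cstddef>
#include <cstdint>
#include <cstring>

static inline void hw_memcpy_4k(void* d, const void* s, size_t n) {
  std::memcpy(d, s, n);  // placeholder for the hypothetical instruction
}

void memcpy_chunked(void* dst, const void* src, size_t n) {
  auto* d = static_cast<uint8_t*>(dst);
  auto* s = static_cast<const uint8_t*>(src);
  while (n > 4096) {            // the outer loop has relatively low overhead
    hw_memcpy_4k(d, s, 4096);
    d += 4096;
    s += 4096;
    n -= 4096;
  }
  if (n) hw_memcpy_4k(d, s, n);
}
```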
Not just an instruction, but something the MMU can use to coalesce memory in the background. If the MMU can do multiple levels of indirection or RC pages, then it can also do GC. The MMU has been too underpowered for too long.
It certainly wasn't the fastest general-purpose way to do copying for the decades of CPUs between the 8086 (late 1970s) and the introduction of enhanced rep movsb (ERMSB) in Ivy Bridge (2012). For post-Ivy Bridge CPUs, I believe it was non-optimal for some usage patterns for a while, too. It may be optimal in very recent Intel CPUs, but that's a far cry from saying it's been the fastest since its introduction in the 8086.
Right. The follow-on to ERMSB, FSRM ("fast short rep movs"), which first appeared in Icelake, finally makes it consistently competitive with SW sequences¹.
¹ but you still want to branch around it when the length is zero ("when the length is zero?!" I hear you cry; it turns out that if you instrument actual systems, you will find that this happens shockingly often).
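In code, that zero-length check looks something like this (x86-64 GCC/Clang inline asm; my sketch, not any particular library's implementation):

```cpp
#include <cstddef>

// Sketch: rep movsb is not free even when the count is zero, so skip it
// entirely for the (surprisingly common) n == 0 case.
static inline void repmovs_copy(void* dst, const void* src, size_t n) {
  if (n == 0) return;
  asm volatile("rep movsb"
               : "+D"(dst), "+S"(src), "+c"(n)
               :
               : "memory");
}
```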
It depends on the exact distribution of sizes (and especially if the size[s] are statically knowable—e.g. if you are copying exactly 31 bytes, or something like an unknown size between 48 and 62 bytes, a SW sequence will still win), but it is now _competitive_ if not actually as fast (previously it was often 2-3x slower in that range, even when the length was not fixed).
The ultra-bloated unrolled vector implementations may appear a little faster in microbenchmarks but the amount of cache they take up and other effects like AVX downclocking (for those implementations that use AVX) mean that in practice they can actually be slower overall.
Perhaps for very small sizes (just use a mov directly) and very large ones (that fall out of your cache, so you’d probably want non-temporal stores). I think it would be difficult to cover all cases.
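For the very-large case, the non-temporal-store path might look roughly like this (an SSE2 sketch of mine; assumes a 16-byte-aligned destination and a size that's a multiple of 16):

```cpp
#include <cstddef>
#include <emmintrin.h>

// Sketch of a streaming copy for buffers much larger than the cache.
// Assumes dst is 16-byte aligned and n is a multiple of 16.
void memcpy_nt(void* dst, const void* src, size_t n) {
  auto* d = static_cast<__m128i*>(dst);
  auto* s = static_cast<const __m128i*>(src);
  for (size_t i = 0; i < n / 16; ++i)
    _mm_stream_si128(d + i, _mm_loadu_si128(s + i));  // store bypasses the cache
  _mm_sfence();  // order the non-temporal stores before returning
}
```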
It'd be great if this were true, since that would finally put an end to the madness. But I've hoped this for many, many years and always been disappointed when someone spends ungodly hours on some hyper-tuned AVX thing that just barely ekes past, and we get stuck with another unwieldy monster for a decade.
> The main trap is trying to benchmark and profile it without the context of the whole application.
Emery Berger's "Performance (Really) Matters" talk also goes into this[0]. He paints a really, really depressing picture of how unreliable benchmarks truly are. Like, things like link order having ridiculous impacts and stuff like that.
I wonder if 25-35% of all CPU cycles being devoted to moving bytes around is a design problem, i.e. a result of the RPC-oriented programming model we used at Google
I know there were also measurements about protobuf encode/decode, and I think it was a similarly large portion of the total workload. (Anecdotally, I think that's why people say the protobuf libraries are huge and slow to compile -- they're highly optimized, and that's probably what empirically resulted in the fastest execution time.)
It feels like a content-oriented and value-oriented model would have less of this profile, i.e. with (correct) local caching and differential compression
Interestingly, the publicly available protobuf implementation is quite bad in that it only uses std::string for string & repeated bytes fields, with std::string_view support only coming soon. I believe Google's internal protobuf code supports ABSL's string_view equivalent.
Ah yeah, I think the historical issue is that Google doesn't use std::string internally for performance reasons. They have their own C++ string implementation, which protobuf uses internally, like all other code.
And then when they open sourced it, I guess they left out that code. So the open source version is probably "de-optimized" a bit unfortunately.
> We introduce two new instructions to handle the short move case quickly
These are more or less built into AVX-512 already...
(The hot path is: load immediate; bzhi; kmov; load; store. 5 instructions rather than 2, but still trivial. If you look at the _latencies_ involved, the difference is trivial.)
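Spelled out with intrinsics, that hot path is roughly the following (a sketch for n <= 64; needs AVX-512BW and BMI2):

```cpp
#include <cstddef>
#include <immintrin.h>

// Sketch of the "load immediate; bzhi; kmov; load; store" sequence for
// copying up to 64 bytes with AVX-512 masked loads/stores (requires n <= 64).
static inline void copy_short(void* dst, const void* src, size_t n) {
  __mmask64 m = _cvtu64_mask64(_bzhi_u64(~0ULL, static_cast<unsigned>(n)));  // low n bits set
  __m512i v = _mm512_maskz_loadu_epi8(m, src);                               // masked load
  _mm512_mask_storeu_epi8(dst, m, v);                                        // masked store
}
```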
I know, it's awful isn't it? The irony of an article about detailed low-level optimisation of memcpy (which is probably being used unnecessarily anyway), re-implementing scrolling in javascript...
If anyone is curious or otherwise interested, the version of memcpy() used for Google production is basically what is in LLVM's libc, https://libc.llvm.org
I would like such a refutation, notionally. I do not particularly like this refutation, considering that it is just as overabstracted, and probably not much faster (if at all).
What I would really like is to see more widespread hoisting of dispatch. Rather than calling 'memcpy', call 'memcpy_small', 'memcpy8', etc.
That's the whole gist of gchatelet's implementation. Make it easy for the compiler to inline memcpy, so the compiler gets a chance to propagate the information it has about the parameters. In many cases it can eliminate all the branches but one.
The GNU way of using a huge, branchy assembly block that is selected at runtime using ifunc means that the compiler never even got a chance.
Regarding the question of whether or not they are faster, see section 4.4 of this paper. Replacing the glibc memcmp with something trivial resulted in up to 1% speedup in Google web search, when considering the whole program. It doesn't microbenchmark as well as glibc memcmp, but it is a better function that doesn't wreck the performance of the rest of the system.
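A toy version of the inlining idea described above (my own sketch, not the actual LLVM libc code): when the size is a compile-time constant at the call site, the compiler keeps exactly one of these branches and discards the rest.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical sketch of an always-inline memcpy whose branches the compiler
// can prune when it knows the size at the call site.
[[gnu::always_inline]] inline void inline_memcpy(char* dst, const char* src, size_t n) {
  if (n == 0) return;
  if (n <= 16) {
    if (n >= 8) {  // two overlapping 8-byte copies cover every length in [8, 16]
      __builtin_memcpy(dst, src, 8);
      __builtin_memcpy(dst + n - 8, src + n - 8, 8);
      return;
    }
    for (size_t i = 0; i < n; ++i) dst[i] = src[i];  // 1..7 bytes
    return;
  }
  std::memcpy(dst, src, n);  // bulk path: hand off to the library routine
}
```

With a constant size at the call site (say, `inline_memcpy(p, q, 12)`), everything but the two overlapping 8-byte copies folds away.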
> Make it easy for the compiler to inline memcpy, so the compiler gets a chance to propagate the information it has about the parameters
I have low hopes for compilers. Inlining heuristics are terribly complicated, and optimisations that result therefrom will only be things that the compiler can prove for _all_ invocations. Inlining won't get you 'this call site is usually large', or 'this call site is small, but predictable', or 'this call site is really unpredictable, so use weird branchfree tricks'. (A JIT compiler might do better, but people do not usually JIT c or c++.)
> Replacing the glibc memcmp with something trivial resulted in up to 1% speedup in Google web search, when considering the whole program. It doesn't microbenchmark as well as glibc memcmp, but it is a better function that doesn't wreck the performance of the rest of the system
It's not all or nothing, and rep cmps is utter trash. My memcmp, for instance, I was careful to keep below 4 cache lines (or 3, considering only the version that doesn't overread)—<https://github.com/moon-chilled/fancy-memcmp/blob/master/mem...>—and it actually has reasonable throughput.
(Also: inlining is not the way to go if you are frontend-bound...)
> (Also: inlining is not the way to go if you are frontend-bound...)
When a workload is frontend-bound, the culprit is usually either instruction cache misses (e.g. a lot of unpredictable virtual calls or generally poor code layout) or branch mispredictions (e.g. a lot of unpredictable branches in your code). I fail to see how inlining the code can be correlated with these two effects, other than through what I believe is a myth: that inlining will (1) grow the executable size, (2) put more pressure on the instruction cache, and (3) therefore end up degrading performance. In my experience, rarely have I seen even (1) take place (compilers and linkers are way too smart nowadays), and I think I have never managed to measure the effects of (2) and (3) in a non-micro-benchmark scenario.
Anyway, if I were to optimize a workload found to be frontend-bound, eliminating branches and getting rid of dynamic dispatch would be the first two things to check. Inlining would come maybe, but only maybe, in third place.
Very interesting and hardcore solutions; adding new instructions to improve data copying in a more "clever" way than just the repeat-prefix seems like an interesting path.
There was a silly typo in the very first sentence that threw me off:
When I worked at Google, fleet-wide profiling revealed that 25-35% of all CPU time was spent just moving bytes around: memcpy, strcmp, [...]
I think that should be `strcpy()`, since obviously `strcmp()` doesn't move data around.
This reminds me of the old Motorola M68K dbcc instruction, which was a specialized looping instruction where you could loop back only a small number of instructions which would all be cached tightly in a tiny cache that existed only for the dbcc case.
Are there modern analogues to that dbcc instruction?
dbcc had a specific semantic that you could only branch back a very small amount and that the CPU would cache the body of the loop. A modern equivalent would use a small amount of cache that is not even L1.
Something similar in this paper, putting an FPGA between the processor and DRAM so you can arbitrarily remap bits of it: https://arxiv.org/abs/2109.14349
Content addressable memory is the ultimate "zero copy architecture". Imagine a "CPU instruction" or library call which determines that, "no, I have already seen this address" (where address isn't an address as much as a unique hash) and can pass that hash along to the destination instead of running a copy.
Overhead increases by effectively removing pointers as we know them, replacing them with hash+length, but the reward is lower memory pressure and better cache locality. It also carries the promise of making a "buffer overrun" not only impossible but unthinkable. In such a world, incrementing "one extra" would not step into some other heap or stack, because everything would be segmented.
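A toy sketch of the idea (made-up types, with std::hash standing in for a real content hash): the "pointer" becomes a hash-plus-length handle, and storing the same bytes twice costs nothing extra.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Toy content-addressed store: handles are hash + length rather than raw
// addresses, and handing data to another component means handing the handle.
struct Handle {
  uint64_t hash;
  size_t len;
};

class ContentStore {
 public:
  Handle put(std::string bytes) {
    uint64_t h = std::hash<std::string>{}(bytes);              // stand-in for a real content hash
    auto it = blobs_.try_emplace(h, std::move(bytes)).first;   // duplicates are stored only once
    return {h, it->second.size()};
  }
  const std::string& get(const Handle& h) const { return blobs_.at(h.hash); }

 private:
  std::unordered_map<uint64_t, std::string> blobs_;
};
```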
On GCP adding RAM to a machine type increases costs by a stupid amount.
Now it seems like they’re thinking of ways to penalise you for actually using it, because you may interfere with another tenant on the same machine.
Intel is known to analyze code from actual applications and invent new instructions that seem to be missing.
Surely they are aware of the memcpy problem, so if it's true that servers spend a lot of time copying memory, what is Intel (or AMD) doing regarding this?
So does that mean that further major optimizations are basically impossible?
You don't always need the operation finished right away. Maybe REP MOVS could return a sort of handle that you then wait on when you actually need to use the destination, and the CPU can keep chugging along in the background. Like when you submit commands to the GPU, or how you wait at memory fences for other threads in a warp to complete.
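Something like this, as an API sketch (all names invented; the placeholder bodies just copy synchronously):

```cpp
#include <cstddef>
#include <cstring>

// Invented API sketch of a copy that returns a handle to wait on later.
struct CopyToken {};  // opaque handle the hardware would hand back

inline CopyToken async_copy(void* dst, const void* src, size_t n) {
  std::memcpy(dst, src, n);  // placeholder: real hardware would only *start* the copy
  return {};
}

inline void copy_wait(CopyToken) {}  // placeholder: real hardware would block here

void producer(void* dst, const void* src, size_t n) {
  CopyToken t = async_copy(dst, src, n);
  // ... unrelated work overlaps with the in-flight copy ...
  copy_wait(t);  // wait only when the destination is actually needed
}
```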
Further optimizations are certainly possible, especially as the hardware improves and gains additional capabilities. With regards to your suggestion, this is essentially what already happens, thanks to the magic of out-of-order cores.