Drepper's "What Every Programmer Should Know About Memory" [1] is a good resource on a similar topic. Not so long ago, there was an analysis done on it in a series of blog posts [2] from a more modern perspective.
In a similar vein, Andrew Kelley, the creator of Zig, gave a nice talk about how to make use of the different speeds of different CPU operations when designing programs: Practical Data-Oriented Design https://vimeo.com/649009599
Another (more portable) way to get the cache line size is to use std::hardware_(destructive|constructive)_interference_size from C++17.
While it's more portable, there are some drawbacks: (1) the statically known cache line size might not be right if you deploy your code to a target CPU different from the one you compiled for, and (2) using this value to define structure and member alignment in headers can lead to interesting bugs if different source files including the header are built with different flags. Also, your toolchain might not support it yet.
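For what it's worth, a minimal sketch of how that can look, assuming a toolchain whose standard library actually ships these constants (they live in <new> and need C++17 plus library support):

    #include <atomic>
    #include <cstdint>
    #include <new>  // std::hardware_destructive_interference_size

    // Align per-thread data to the destructive interference size instead of a
    // hard-coded 64, so adjacent elements shouldn't falsely share a line
    // (at least by the compiler's notion of the line size).
    struct alignas(std::hardware_destructive_interference_size) PerThreadCounter {
        std::atomic<std::uint64_t> value{0};
    };

    PerThreadCounter counters[2];  // counters[0] and counters[1] land on separate lines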
Yeah, 64 bytes is kind of an unstated x86 thing. It'd be hell for them to change that; a lot of perf-conscious code aligns to 64-byte boundaries to combat false sharing.
The Cortex-A9, with its 32-byte cache lines, is one prominent counterexample.
But my point was more that the size is baked into x86 in a pretty deep way these days. You'd be looking at new releases from all software that cares about such things on x86 to support a different cache line size without major perf regressions. So all of the major kernels, probably the JVM and CLR, game engines (and good luck there).
IMO Intel should stick a "please query the size of the cache line if you care about its length" clause into APX, to push code today to stop #defining CACHE_LINE_SIZE (64) on x86.
> IMO Intel should stick a "please query the size of the cache line if you care about its length" clause into APX, to push code today to stop #defining CACHE_LINE_SIZE (64) on x86.
CPUID with EAX=1: bits 8-15 (i.e., the second byte) of EBX in the result give the CLFLUSH line size in 8-byte units, so multiply by 8 to get the cache line size in bytes. It's been there since the Pentium 4, apparently.
You can also get line size for each cache level with CPUID EAX=4, along with the set-associativity and other low-level cache parameters.
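A rough sketch of what querying it looks like with GCC/Clang on x86 (uses <cpuid.h>, x86-only, and falls back to 64 if the leaf isn't available):

    #include <cpuid.h>
    #include <cstdio>

    static unsigned cache_line_size() {
        unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return 64;                      // fall back to the usual x86 value
        // EBX bits 15:8 = CLFLUSH line size in 8-byte units
        return ((ebx >> 8) & 0xffu) * 8;
    }

    int main() {
        std::printf("cache line size: %u bytes\n", cache_line_size());
    }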
> IMO Intel should stick a "please query the size of the cache line if you care about its length" clause into APX, to push code today to stop #defining CACHE_LINE_SIZE (64) on x86.
This doesn't help.
The issue with increasing the cache line size is false sharing, and false sharing isn't something that's only dealt with by people who know or care about cache line width. Mostly, people just write oblivious code, test it, and if it's slow, possibly do something about it. A way to query the cache line width covers only the infinitesimally small portion of cases where someone actually consciously padded structs to the cache line width, and misses all the cases where things just happen not to fit on the same line with 64B lines but do fit on the same line with 128B lines.
Like it or not, coherency line width = 64B is just a part of the living x86 spec, and won't be changed. If possible, we'd probably wish for it to be larger. But at least we were lucky not to be stuck with something smaller.
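To make the "oblivious code" point concrete, here's a hedged, illustrative layout (names made up): two counters written by different threads that happen to sit 64 bytes apart. They land on separate 64-byte lines today, but would falsely share a hypothetical 128-byte line, and no query API saves code like this.

    #include <cstddef>

    struct Counters {
        long produced;                  // written by thread A, offset 0
        char pad[64 - sizeof(long)];    // padding that "happens" to put the next field at 64
        long consumed;                  // written by thread B, offset 64
    };

    static_assert(offsetof(Counters, consumed) == 64,
                  "separate lines at 64B width, one shared line at 128B width");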
Something I've experienced first hand. Programming the PS3 forced you to manually do what CPU caches do in the background, which is why the PS3 was a pain in the butt for programmers who were used to object-oriented style programming.
It forced you to think in terms of: [array of input data -> operation -> array of intermediate data -> operation -> array of final output data]
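A minimal sketch of that pipeline shape (made-up names, plain flat arrays rather than real SPU code):

    #include <cstddef>
    #include <vector>

    struct Inputs  { std::vector<float> px, py, vx, vy; };  // parallel arrays of input data
    struct Outputs { std::vector<float> px, py; };          // arrays of output data

    // The "operation": one tight loop over flat arrays, sequential and prefetch-friendly,
    // instead of a virtual call per object scattered around the heap.
    Outputs integrate(const Inputs& in, float dt) {
        Outputs out;
        out.px.resize(in.px.size());
        out.py.resize(in.py.size());
        for (std::size_t i = 0; i < in.px.size(); ++i) {
            out.px[i] = in.px[i] + in.vx[i] * dt;
            out.py[i] = in.py[i] + in.vy[i] * dt;
        }
        return out;
    }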
Our OOP game engine had to transform its OOP data into arrays of input data before feeding them into an operation, which meant a lot of unnecessary memory copies. We had to break objects into "operations", which was not intuitive, but that got rid of a lot of the memory copies. Only then did we manage to get decent performance.
The good thing is that by doing this we also got an automatic performance increase on the Xbox 360, because we were consciously (or unconsciously?) optimizing for cache usage.
I learned so much from this blog and from the discussion. HN is so awesome. +1 for learning about lscpu -C here.
A while back I had to create a high speed streaming data processor (not a Spark cluster or similar creature), but a C program that could sit in-line in a high speed data stream, match specific patterns, and take actions based on the type of pattern that hit. As part of optimizing for speed and throughput, a colleague and I did an obnoxious level of experimentation with read sizes (slurps of data) to minimize IO wait queues and memory pressure. Being aligned with the cache line size, at either 1x or 2x, was the winner. Good low-level, close-to-the-hardware C fun for sure.
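Something in the spirit of that experiment (sizes and names are illustrative guesses, not the actual program): read the stream in chunks that are a multiple of the line size, into a buffer aligned to it.

    #include <cstddef>
    #include <stdlib.h>   // posix_memalign, free
    #include <unistd.h>   // POSIX read()

    int main() {
        const std::size_t line  = 64;        // assumed cache line size
        const std::size_t chunk = 2 * line;  // the "1x or 2x" slurp size
        void* buf = nullptr;
        if (posix_memalign(&buf, line, chunk) != 0)
            return 1;
        ssize_t n;
        while ((n = read(STDIN_FILENO, buf, chunk)) > 0) {
            // ... pattern matching over the chunk would go here ...
        }
        free(buf);
        return n < 0 ? 1 : 0;
    }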
I think cache coherency protocols are less intuitive and less talked about when people discuss caching, so it would be nice to have some discussion of that too.
But otherwise this is a good general overview of how caching is useful.
Really cool stuff and a nice introduction, but I'm curious how much modern compilers do for you already. Especially if you shift to the JIT world: what ends up being the difference between code where people optimize for this vs. code written in a style optimized for readability/reuse/etc.?
JIT compilers can't compensate for poorly organized data. Ultimately, understanding these low-level concepts affects high-level algorithm design and selection.
Watching the Andrew Kelley video mentioned above really drives home the point that even if your compiler automatically optimizes structure ordering to minimize padding and alignment issues, it can't fix higher-level decisions. An example is using two separate lists of structs to track their state, rather than a single list where each struct carries an enum recording its state.
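A hedged sketch of that example (illustrative names), contrasting the two layouts:

    #include <vector>

    // One list, where every element carries a state tag (plus padding for it):
    struct MonsterTagged {
        enum class State { Alive, Dead } state;
        float x, y, hp;
    };
    std::vector<MonsterTagged> all_monsters;

    // Two lists, where the state is implied by which list an element lives in,
    // so each element is smaller and each pass touches only the data it needs:
    struct Monster { float x, y, hp; };
    std::vector<Monster> alive_monsters;
    std::vector<Monster> dead_monsters;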
JIT languages tend to have the worst language-provided locality, as they are often accompanied by GCs and a lack of value types (there are exceptions to this, but it's broadly the case). And a JIT cannot rearrange the heap memory layout of objects, as it must be hot-swappable. This is why, despite incredibly huge investments in them, such languages never quite reach AOT performance, no matter how much theoretical advantage a JIT could have.
AOT'd languages could rearrange a struct for better locality; however, the majority of languages (if not all) rigidly require that fields be laid out in the order they're defined, for various reasons.
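A quick illustration of why that matters, assuming a typical 64-bit ABI (exact sizes are ABI-dependent):

    #include <cstdint>

    struct AsDeclared {     // 1 + 7 pad + 8 + 1 + 7 pad = 24 bytes on a typical 64-bit ABI
        std::uint8_t  a;
        std::uint64_t b;
        std::uint8_t  c;
    };

    struct Reordered {      // 8 + 1 + 1 + 6 pad = 16 bytes: same fields, just reordered
        std::uint64_t b;
        std::uint8_t  a;
        std::uint8_t  c;
    };

    static_assert(sizeof(AsDeclared) > sizeof(Reordered),
                  "declaration order bakes padding into the type");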
I wish people would stop saying this. It's CS woo: the idea that some magical solution exists that saves you from having to think real hard about the hardware, that magical abstractions save the day.
All of these things boil down to combinatorial optimization problems (does bin packing ring a bell?). And there are no widely available compilers or JITs or whatever that bundle ILP solvers. Thus, what you're really getting with every compiler is a heuristic/approximate solution (to many, many combinatorial optimization problems). Decide for yourself whether you're comfortable with your code just being approximately good, or whether you need to actually understand how your system works.
"On the other hand, data coming from main memory cannot be assumed to be sequential and the data cache implementation will try to only fetch the data that was asked for."
Not correct. Prefetching has been around for a while, and it's rather important in optimization.
Hi. Gustavo Pezzi here (from pikuma.com). You're absolutely right; I'll make sure to pass this to Gabriel and I'm sure we'll add this detail to the blog post soon. Thank you for the heads up.
Lots of fun stuff here. Last I checked, Intel's prefetching only works in the positive direction! A reason to have loops move forward. Also, number of simultaneous prefetching streams depends on architecture.
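For explicit control there's also software prefetching; a hedged sketch using GCC/Clang's __builtin_prefetch (the prefetch distance here is an illustrative guess, not a tuned value):

    #include <cstddef>

    long sum_forward(const long* data, std::size_t n) {
        long total = 0;
        for (std::size_t i = 0; i < n; ++i) {
            if (i + 64 < n)
                __builtin_prefetch(&data[i + 64]);  // hint: ~8 lines ahead of the read below
            total += data[i];                       // forward, stride-1 access
        }
        return total;
    }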
To ensure that the member is itself still aligned properly in "global space". The start of the struct is assumed to be universally aligned (malloc, etc. make that a requirement, in fact) or aligned to the requirements of the struct itself (e.g., in an array). Thus any offset into the struct only needs to be aligned to the requirements of the largest type.
It's not. It's equal to the maximum alignment of its members. For primitive types (like integers, floating-point types and pointers), size == alignment on most machines nowadays (although on 32-bit machines it can be a toss-up whether a 64-bit integer is 64-bit or 32-bit aligned), so it can look like it's based on size.
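A small check of that rule (the per-type alignments in the comments are typical values; they're ABI-dependent):

    struct S {
        char   c;   // alignment 1
        int    i;   // alignment 4, typically
        double d;   // alignment 8 on most 64-bit ABIs
    };

    static_assert(alignof(S) == alignof(double),
                  "struct alignment = max alignment of its members");
    static_assert(sizeof(S) % alignof(S) == 0,
                  "size is padded up to a multiple of the alignment");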
[1] https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
[2] https://samueleresca.net/analysis-of-what-every-programmer-s...