Exploring How Cache Memory Works (pikuma.com)
129 points by _xivi 6 months ago | 33 comments



Drepper's "What Every Programmer Should Know About Memory" [1] is a good resource on a similar topic. Not so long ago, a series of blog posts [2] analyzed it from a more modern perspective.

[1] https://people.freebsd.org/~lstewart/articles/cpumemory.pdf

[2] https://samueleresca.net/analysis-of-what-every-programmer-s...


In a similar vein, Andrew Kelley, the creator of Zig, gave a nice talk about designing programs around the different speeds of different CPU operations: Practical Data-Oriented Design https://vimeo.com/649009599


In case you are wondering about your cache-line size on a Linux box, you can find it in sysfs, with something like:

    cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
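
If you need the value programmatically at runtime, here's a minimal sketch that just reads that same sysfs file (the fallback of 64 bytes is an assumption for when the file isn't there, not a guarantee):

  // Minimal sketch: read the L1 coherency line size from the sysfs file above.
  // Falls back to 64 bytes (assumed, not guaranteed) if the read fails.
  #include <fstream>
  #include <iostream>

  int main() {
      std::ifstream f("/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size");
      unsigned line_size = 0;
      if (!(f >> line_size) || line_size == 0)
          line_size = 64;  // assumed fallback if sysfs is unavailable
      std::cout << "cache line size: " << line_size << " bytes\n";
  }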


  grep . /sys/devices/system/cpu/cpu*/cache/index*/coherency_line_size
would be better, but

  lscpu -C
is more useful.


Didn't know about 'lscpu -C'.. thanks!


Another (more portable) way to get the cache line size is to use std::hardware_(destructive|constructive)_interference_size from C++17.

While it's more portable, there are some drawbacks: (1) the statically known cache line size might not be right if you deploy your code to a target CPU different from the one you compiled for; (2) using this value to define structure and member alignment in headers can lead to interesting bugs if different source files including the header are built with different flags. Also: your toolchain might not support it yet.
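
For instance, a minimal sketch (assuming a toolchain that actually defines the constant):

  // Minimal sketch: keep two independently-updated counters on separate
  // cache lines to avoid false sharing between the threads that touch them.
  #include <atomic>
  #include <new>  // std::hardware_destructive_interference_size

  struct Counters {
      alignas(std::hardware_destructive_interference_size) std::atomic<long> a{0};
      alignas(std::hardware_destructive_interference_size) std::atomic<long> b{0};
  };

Since it's a compile-time constant, drawbacks (1) and (2) above apply to this sketch too.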


Wait wait wait.

M2 processors have 128 byte wide cache lines?? That's a big deal. We've been at 64 bytes since what, the Pentium?


In practice, Intel CPUs have pulled in at least 128 bytes per memory access for a very long time.

64-byte cache lines are still there as alignment boundaries for things like atomics, but accessing memory pulls in two cache lines at a time.


Memory fetch length and coherency unit size are different things, and the latter matters much more.


Yeah, 64 bytes is kind of an unstated x86 thing. It'd be hell for them to change that; a lot of perf-conscious code aligns to 64-byte boundaries to combat false sharing.


All ARM-designed cores also use 64-byte lines. It's not just an x86 thing.


Some Cortex-A53s have 16-byte cachelines, which I found out the hard way recently.


The Cortex-A9 had 32-byte cache lines, for one prominent counterexample.

But my point was more that the size is baked into x86 in a pretty deep way these days. You'd need new releases of all the software that cares about such things on x86 before a different cache line size could be supported without major perf regressions. So all of the major kernels, probably the JVM and CLR, game engines (and good luck there).

IMO Intel should stick a "please query the size of the cache line if you care about its length" clause into APX, to push code today to stop #defining CACHE_LINE_SIZE (64) on x86.


> IMO Intel should stick a "please query the size of the cache line if you care about its length" clause into APX, to push code today to stop #defining CACHE_LINE_SIZE (64) on x86.

CPUID EAX=1: bits 15-8 (i.e., the second byte) of EBX in the result give the cache line size (strictly, the CLFLUSH line size, in 8-byte units). It's been there since the Pentium 4, apparently.

You can also get line size for each cache level with CPUID EAX=4, along with the set-associativity and other low-level cache parameters.
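
A minimal sketch using GCC/Clang's <cpuid.h>:

  // Minimal sketch (x86, GCC/Clang): leaf 1 reports the CLFLUSH line size
  // in EBX bits 15:8, in 8-byte units, so multiply by 8 to get bytes.
  #include <cpuid.h>
  #include <cstdio>

  int main() {
      unsigned eax, ebx, ecx, edx;
      if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
          unsigned line_bytes = ((ebx >> 8) & 0xff) * 8;
          std::printf("cache line size: %u bytes\n", line_bytes);
      }
      return 0;
  }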


> IMO Intel should stick a "please query the size of the cache line if you care about its length" clause into APX, to push code today to stop #defining CACHE_LINE_SIZE (64) on x86.

This doesn't help.

The issue with increasing cache line size is false sharing. And false sharing isn't something that's only dealt with by people who know or care about cache line width. The problem is that mostly, people just write oblivious code, then test it, and if it is slow, possibly do something about it. A way to query cache line width gets the infinitesimally small portion of cases where someone actually consciously padded structs to cache line width, and misses all the cases where things just happen to not fit on the same line with 64B lines and do fit on the same line with 128B lines.

Like it or not, coherency line width = 64B is just a part of the living x86 spec, and won't be changed. If possible, we'd probably wish for it to be larger. But at least we were lucky not to be stuck with something smaller.
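
To make that concrete, here's a toy layout (field names made up) that is fine with 64B lines but would false-share with 128B lines:

  // Toy example: two counters written by different threads. At offsets 0 and
  // 64 they land on different 64-byte lines, but would share one 128-byte
  // line, so code that never mentions the line size can still regress.
  #include <atomic>

  struct Stats {
      std::atomic<long> hits{0};    // offset 0
      char other_data[56];          // unrelated fields
      std::atomic<long> misses{0};  // offset 64
  };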


> The Cortex A9 had 32 byte cache lines for one prominent counterexample.

OK, all ARM-designed cores for the last 15 years then :)


From a Raspberry Pi 5, the L2 cache line is 128 bytes:

  Vendor ID:              ARM
  Model name:             Cortex-A76

  $ lscpu -C
  NAME ONE-SIZE ALL-SIZE WAYS TYPE        LEVEL SETS PHY-LINE COHERENCY-SIZE
  L1d       64K     256K    4 Data            1  256                      64
  L1i       64K     256K    4 Instruction     1  256                      64
  L2       512K       2M    4 Unified         2 1024                     128
  L3         2M       2M   16 Unified         3 2048                      64


Something I've experienced first-hand. Programming the PS3 forced you to manually do what CPU caches do in the background, which is why the PS3 was a pain in the butt for programmers who were used to object-oriented style programming.

It forced you to think in terms of: [array of input data -> operation -> array of intermediate data -> operation -> array of final output data]

Our OOP game engine had to transform its OOP data into arrays of input data before feeding it into each operation, which meant a lot of unnecessary memory copies. We had to break objects into "operations", which was not intuitive, but that got rid of a lot of memory copies. Only then did we manage to get decent performance.
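
Roughly the shape it pushes you towards (a simplified sketch, not actual engine code):

  // Simplified sketch of the array -> operation -> array style: each stage
  // streams over flat input arrays and writes flat output arrays, instead of
  // chasing pointers through scattered objects. Names are illustrative only.
  #include <cstddef>
  #include <vector>

  struct Position { float x, y, z; };
  struct Velocity { float x, y, z; };

  void integrate(const std::vector<Velocity>& vel, float dt,
                 std::vector<Position>& pos) {
      for (std::size_t i = 0; i < pos.size(); ++i) {
          pos[i].x += vel[i].x * dt;
          pos[i].y += vel[i].y * dt;
          pos[i].z += vel[i].z * dt;
      }
  }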

The good thing is that by doing this we also got an automatic performance increase on the Xbox 360, because we were (consciously? unconsciously?) optimizing for cache usage.


I learned so much from this blog and from the discussion. HN is so awesome. +1 for learning about lscpu -C here.

A while back I had to create a high-speed streaming data processor (not a Spark cluster or similar creature), but a C program that could sit in-line in a high-speed data stream, match specific patterns, and take actions based on the type of pattern that hit. As part of optimizing for speed and throughput, a colleague and I did an obnoxious level of experimentation with read sizes (slurps of data) to minimize IO wait queues and memory pressure. Reads aligned to the cache-line size, either 1x or 2x, were the winner. Good low-level, close-to-the-hardware C fun for sure.
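
Something in the spirit of this (a stripped-down sketch, not the original code; the sizes and fd handling are illustrative only):

  // Stripped-down sketch of the idea: slurp data in chunks that are a small
  // multiple (1x or 2x) of the cache-line size, into a buffer aligned to it.
  #include <stdlib.h>   // posix_memalign
  #include <unistd.h>   // read

  void* make_line_aligned_buffer(size_t line_size, size_t lines_per_read) {
      void* buf = nullptr;
      if (posix_memalign(&buf, line_size, line_size * lines_per_read) != 0)
          return nullptr;
      return buf;
  }

  ssize_t slurp(int fd, void* buf, size_t line_size, size_t lines_per_read) {
      return read(fd, buf, line_size * lines_per_read);
  }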


I think cache coherency protocols are less intuitive and less talked about when people discuss caching, so it would be nice to have some discussion of that too.

But otherwise this is a good general overview of how caching is useful.


Great article. I have always had an open question in my mind about struct alignment and this explained it very succinctly.


Really cool stuff and a nice introduction, but I'm curious how much modern compilers do for you already. Especially if you shift to the JIT world: what ends up being the difference between code where people optimize for this vs. write in a style optimized for code readability/reuse/etc.?


JIT compilers can't compensate for poorly organized data. Ultimately, understanding these low-level concepts affects high-level algorithm design and selection.

Watching the Andrew Kelley video mentioned above really drives home the point that even if your compiler automatically optimized structure ordering to minimize padding and alignment issues, it couldn't fix higher-level decisions. An example is using two separate lists of structs to track their state, rather than a single list with each struct carrying an enum to record its state.
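
i.e. something like this (a toy sketch of the two layouts):

  // Toy sketch of the two designs. No compiler will turn one into the other;
  // that's a decision the programmer has to make up front.
  #include <vector>

  struct Monster { float hp; float x, y, z; };

  // Option A: one list, each element carries its state as an enum.
  enum class State { Alive, Dead };
  struct MonsterWithState { Monster m; State state; };
  std::vector<MonsterWithState> all_monsters;

  // Option B: two lists; the state is implied by which list a monster is in,
  // so a pass over the live monsters never touches dead ones at all.
  std::vector<Monster> alive_monsters;
  std::vector<Monster> dead_monsters;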


JIT'd languages tend to have the worst language-provided locality, as they are often accompanied by GCs and a lack of value types (there are exceptions, but it's broadly the case). And a JIT cannot rearrange the heap memory layout of objects, as it must be hot-swappable. This is why, despite incredibly huge investments, such languages just never reach AOT performance, despite how much theoretical advantage a JIT could have.

AOT'd languages could rearrange a struct for better locality; however, the majority of (if not all) languages rigidly require that fields be laid out in the order defined, for various reasons.


> the majority of (if not all) languages rigidly require that fields be laid out in the order defined, for various reasons

The as-if rule gives an escape hatch, although in practice it is hard to take advantage of, especially without whole program optimization.
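
For example, the kind of reordering you usually have to do by hand (sizes assume a typical 64-bit ABI):

  // Padding depends on declaration order, and compilers generally will not
  // reorder fields for you. Sizes assume a typical 64-bit ABI, 8-byte doubles.
  struct BadOrder  { char a; double b; char c; };  // sizeof == 24 (14 bytes padding)
  struct GoodOrder { double b; char a; char c; };  // sizeof == 16 ( 6 bytes padding)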


I wish people would stop saying this. It's CS woo: the idea that some magical solution exists that saves you from having to think real hard about the hardware, that magical abstractions save the day.

All of these things boil down to combinatorial optimization problems (does bin packing ring a bell?). And there are no widely available compilers or JITs or whatever that bundle ILP solvers. Thus, what you're really getting with every compiler is a heuristic/approximate solution (to many, many combinatorial optimization problems). Decide for yourself whether you're comfortable with your code just being approximately good, or if you need to actually understand how your system works.


"On the other hand, data coming from main memory cannot be assumed to be sequential and the data cache implementation will try to only fetch the data that was asked for."

Not correct. Prefetching has been around for a while, and it's rather important for optimization.


Hi. Gustavo Pezzi here (from pikuma.com). You're absolutely right; I'll make sure to pass this to Gabriel and I'm sure we'll add this detail to the blog post soon. Thank you for the heads up.


Lots of fun stuff here. Last I checked, Intel's prefetching only works in the positive direction! A reason to have loops move forward. Also, the number of simultaneous prefetching streams depends on the architecture.
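
A trivial sketch of what that means for loop direction (whether the backward walk actually hurts depends on the particular prefetcher):

  // A forward, ascending-address walk is the friendliest pattern for
  // hardware prefetchers; whether the descending walk is actually slower
  // depends on the microarchitecture.
  #include <cstddef>

  float sum_forward(const float* a, std::size_t n) {
      float s = 0;
      for (std::size_t i = 0; i < n; ++i) s += a[i];   // ascending addresses
      return s;
  }

  float sum_backward(const float* a, std::size_t n) {
      float s = 0;
      for (std::size_t i = n; i-- > 0; ) s += a[i];    // descending addresses
      return s;
  }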


Why is the natural alignment of structs equal to the size of their largest member?


To ensure that member is itself still aligned properly in "global space". The start of the struct is assumed to be universally aligned (malloc, etc., make that a requirement, in fact) or aligned to the requirements of the struct itself (e.g., in an array). Thus any offset into the struct only needs to be aligned to the requirements of the largest type.

https://www.kernel.org/doc/html/latest/core-api/unaligned-me... has a lot more general context on alignment and why it's important


It's not. It's equal to the maximum alignment of its members. For primitive types (like integers, floating-point types, and pointers), size == alignment on most machines nowadays (although on 32-bit machines it can be a toss-up whether a 64-bit integer is 64-bit or 32-bit aligned), so it can look like it's based on size.
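
For example (offsets and sizes assume a typical 64-bit target):

  // The struct's alignment is the max of its members' alignments, and padding
  // keeps each member naturally aligned. Values assume a typical 64-bit
  // target where alignof(double) == 8.
  struct S {
      char   c;  // offset 0, then 7 bytes of padding
      double d;  // offset 8
      short  s;  // offset 16, then 6 bytes of tail padding
  };             // sizeof(S) == 24, alignof(S) == alignof(double) == 8
  static_assert(alignof(S) == alignof(double), "alignment = max member alignment");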


Super interesting. Thank you!



