How L1 and L2 CPU caches work, and why they’re an essential part of modern chips

userbinator · on Nov 1, 2014

In the late 70s/early-to-mid 80s memory was actually faster than the CPU, and this is partly what motivated the RISC philosophy - at the time, instruction decoding was the main bottleneck and there was plenty of memory bandwidth available so it was a good idea to design CPUs to make better use of that. Now that it's the exact opposite, and we have multilevel caches, I wonder how different CPUs would be today if the engineers back then had realised that their surplus of memory performance would be very short-lived and cores would become faster while latencies to access memory continue to increase, and designed for this long term situation.

colanderman · on Nov 1, 2014

Does memory really bottleneck instruction throughput though? I've never ever seen code that stalled on an instruction fetch.

CPUs generally hide latency by auto-prefetching the instruction stream into cache, and memory bandwidth has been keeping up just fine with CPU speed. Hence memory latency rarely hinders instruction throughput, unless your code jumps around unpredictably every few instructions (most doesn't; the only common class of program I can think of which does are emulators/interpreters).

Probably the only differences in ISA design/implementation between this universe and one where memory latency tracked CPU speed would be that (a) instructions would need not be prefetched into cache (possibly there'd be no cache), and (b) we'd still have wacky instructions like the TMS9900's X instruction, which executes the instruction at a given memory address. [1]

[1] http://en.wikipedia.org/wiki/Texas_Instruments_TMS9900#Instr...

sliverstorm · on Nov 1, 2014

You totally can hide latency with auto-prefetching. But there are a few problems.

- This depends on correct branch prediction, otherwise you could easily wind up having fetched the wrong fork

- L2 latency is, what, 12 cyles? L3 is 28? (Sandy Bridge). To completely hide this latency, we would need the prefetcher to run 28 cycles ahead. Considering caches work in cachelines and not instructions, this means we have to run potentially hundreds of instructions ahead of execution, without mispredicting one branch...

Whether your code is going to bottleneck all depends on size and code patterns. If your software fits in the L1 icache, you will only miss when context-switching. If your software is 100MB and does not exhibit good code locality, you might be stalling all over the place.

ganzuul · on Nov 1, 2014

Is this code locality what makes functional style programming perform well?

Athas · on Nov 1, 2014

No. There are only a few properties of functional programming that help performance, and they are mostly spent compensating for the overhead typically imposed by high-level languages:

* Easier parallelisation (although compiler auto-parallelisation has turned out to not work well in practice).

* Compacting garbage collectors can mimimise fragmentation, and thus improve locality. This is not specific to functional languages, though, but a general benefit of automatic memory management.

* For pure languages, you may be able to implement generational garbage collectors slightly more efficiently, as no object of an older generation will have references to objects in a young generation. This is a very fragile property though - even laziness breaks it - and I'm not sure anyone has tried to use it in practice.

* Purity means that the compiler has to be less careful about optimisations and can make more assumptions about the code. This is part of what makes automatic (or semi-automatic) loop fusion practical in functional languages.

In general, I would not say that functional-style languages "perform well". They perform "well enough", but naive performance is not the reason to pick a functional language over an imperative one.

m0th87 · on Nov 1, 2014

> For pure languages, you may be able to implement generational garbage collectors slightly more efficiently, as no object of an older generation will have references to objects in a young generation. This is a very fragile property though - even laziness breaks it - and I'm not sure anyone has tried to use it in practice.

That's how it works in Haskell, despite the laziness: https://www.haskell.org/haskellwiki/GHC/Memory_Management

Athas · on Nov 1, 2014

Huh - how can that work? Take some partially evaluated object 'F <thunk>', where F is a data constructor. This object is old and thus stored in the oldest generation. When we force the thunk, we have to update the reference in the object to point to the resulting value - which, as it is newly allocated, should be in the nursery. Is the trick simply to allocate in whichever generation our referencing pointer belongs? As I see it, the basic problem is that lazy evaluation requires mutation at the heap-level.

colanderman · on Nov 1, 2014

As someone who's programmed extensively in both C and OCaml, I wish my functional programs could approach the performance of my C programs.

Although I doubt code locality differs much between the two (although FP might fare worse due to the code space overhead imposed by polymorphism), data locality is generally much worse in FP languages due to poorer control over data layout. (Garbage collection, run-time type tagging, excessive indirection, and inability to specify structure layout all contribute to this.)

The purported benefits of FP performance generally stem from either (a) the ability to ignore pointer aliasing effects when optimizing, which is not unique to FP (see Fortran, or C's restrict keyword), (b) the ability to automatically parallelize code due to immutability of data, which is generally far less a benefit than it sounds, or (c) the ability for the compiler or runtime to optimize away computations which are never used, which, while easier with a pure FP language, is again not unique to FP (see most modern C compilers).

pron · on Nov 2, 2014

As others have said, functional-style programs usually perform worse than imperative ones, large because they have worse (data) memory locality due to their use of persistent data structures that don't normally reuse memory addresses (and thus cache lines), so every "mutation" is a cache-miss.

hcrisp · on Nov 1, 2014

Relevant article by Francesc Alted, creator of PyTables: "Why Modern CPUs Are Starving And What Can Be Done About It". http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf

jokoon · on Nov 1, 2014

Isn't there any C/C++ tutorial on how to avoid cache misses ?

I already read the classical argument of having data oriented designs and to avoid linked lists, but I've never really heard of anything concrete on performance in programming.

All I hear is "profile profile profile". I guess most of the time it will point out a place where my code is slow, but if I find one, how do I solve it ?

For example, can I create some simple bit of code that do a cache miss at will ? Any example of such code ? How do I detect a cache miss ?

bboreham · on Nov 1, 2014

Cache misses are independent of language; however C++ gives you a lot of control over where items are placed in memory.

To create a cache miss in one processor all you need to force the processor to look at M different memory addresses that map to the same cache location, where M is higher than the N-way associativity for your processor. The specifics of this differ for each processor, but as an example, Intel Sandy Bridge CPUs have 8-way associativity, so if you fetch 9 addresses that are 4096 bytes apart you will cause a cache miss.

Lots of small tips and links if you Google, e.g.

http://stackoverflow.com/questions/8744088/what-is-the-best-...

http://stackoverflow.com/questions/3359524/detecting-cache-m...

The full detail is in books like http://www.intel.co.uk/content/www/uk/en/architecture-and-te...

bboreham · on Nov 1, 2014

Another interesting case: suppose you have a contiguous array of objects that are 48 bytes big, and your cache line is 64 bytes. Then every second object will straddle two cache lines. For random access to any one such object, padding them out to 64 bytes each will run faster on average, as it will access main memory only once and evict fewer cache lines. But you can only fit 3/4 as many objects in your cache now, so if you need to access a lot of them then your program will run slower.

So, "profile, profile, profile", but also understand factors like this that pull in different directions in different scenarios.

queensnake · on Nov 1, 2014

There's Ulrich Drepper's "What every programmer should know about memory" - http://lwn.net/Articles/250967/. It's long, it's from 2007, but its fundamentals are right.

ehsanu1 · on Nov 1, 2014

The reason all you hear is "profile profile profile" is that the performance implications of a given code change may sometimes be completely non-obvious, due to all the complex trickery in the processor and cache hierarchy. So what do you do after profiling and finding a performance bottleneck? If you can't identify exactly what the bottleneck's cause is and the solution is not obvious, I suppose all that's left is to do a best guess of the cause and think of a possible solution. Then profile after implementing your solution to check if it made a (positive) difference. And do the profiling/benchmarks even if the solution seems obvious, since it's so easy to be wrong.

Detecting cache misses can be done with valgrind: http://valgrind.org/docs/manual/cg-manual.html

There are architecture-specific instruction-level profilers as well. This video mentions a couple and is also a great overview of how non-intuitive performance can be with modern CPUs: http://channel9.msdn.com/Events/Build/2014/4-587

gioele · on Nov 1, 2014

For general programs, i.e. programs not meant to be run a exact particular machine, it is better to focus on better algorithms rather than better code.

In this case it is useful to look at "Cache-oblivious algorithms": Cache oblivious algorithms allow you to write code that performs well regardless of the architecture and processor model they are run on. From Wikipedia: «a cache-oblivious algorithm (or cache-transcendent algorithm) is an algorithm designed to take advantage of a CPU cache without having the size of the cache (or the length of the cache lines, etc.) as an explicit parameter.» http://en.wikipedia.org/wiki/Cache-oblivious_algorithm

Sadly the field is new and there are not many such algorithms around, but you can already find algorithms for many common tasks such as traversing linked lists and calculating FFTs.

TillE · on Nov 1, 2014

Accessing contiguous chunks of memory in order minimizes cache misses. It's not a whole lot more complicated than that.

This is a good intro video on the subject: https://www.youtube.com/watch?v=16ZF9XqkfRY

jokoon · on Nov 1, 2014

woooh thanks a lot for that video! I hope this will bother donald knuth :)

gtani · on Nov 1, 2014

here's a few books/theses that have decent intro's to the realm of temporal/spatial locality, GPU etc. Not C++ specific (except CUDA code is C++, i think)

http://tacc-web.austin.utexas.edu/veijkhout/public_html/istc...

http://www.mpi-sws.org/~turon/turon-thesis.pdf

ganzuul · on Oct 31, 2014

ARM is using L3 cache as a DMA-like interconnect, on the SoC side of things.

shortsightedsid · on Nov 1, 2014

I'm not sure about ARMv8 but in ARMv7 SoCs the ARM Subsystem does not include L3. Plus each L1 is tied to each core.

ganzuul · on Nov 1, 2014

I believe this is a relevant source: http://www.arm.com/products/system-ip/interconnect/corelink-...

The L3 cache is 'integrated', which I presume means it's different names for the same thing.

shortsightedsid · on Nov 2, 2014

Ah! Thanks for sharing. It is for ARMv8 which I really need to read up on.

bluedino · on Nov 1, 2014

This is probably a simple question, but it may not be obvious to others as well: How effective are caches on a system that is running multiple different processes at one time, much less hundreds? Is each working set that a process uses at any certain time small enough to fit in the cache along with the others?

Looking back at caches that used to be 4k or 64k (external) and are now 256k and measured in megabytes (external), I suppose that could be the case.

pron · on Nov 2, 2014

The number of threads a CPU can run at once is equal to the number of cores (we'll ignore hyperthreading for the sake of simplicity), and the number of cores is equal to the number of (L1) caches.

If you're referring to kernel-level multitasking, then these processes don't really run concurrently from the CPUs perspective; in fact, they're switching interval is quite long (relative to the processor speed).

However, the cost of bringing in a thread's data into the cache once it's been scheduled by the kernel is not negligible at all, and is part of the significant total task-switching cost.

mmphosis · on Nov 1, 2014

All that complexity (and space) to handle the cache, why not simply add more cache, lots more cache, and don't use RAM at all?

acveilleux · on Nov 1, 2014

Cost. More cache means bigger die size and lower yield. Also once the area of the cache gets too big, the latency increases too much. That's why the P3 moved the L2 cache back on the die while it was in a separate package on the P2...

sliverstorm · on Nov 1, 2014

Also density. DRAM is orders of magnitude more dense per silicon area than SRAM, but you can't generally put DRAM cells on the same chip as the CPU.

Sharlin · on Nov 1, 2014

Yes. It's important to realize that at current clock speeds, much of the latency is simply the direct result of the finite speed of light. You simply can't fit gigabytes of memory in a location that's physically close enough to the processor.

_xy8h · on Nov 1, 2014

Hence the location of memory slots always being physically parallel to the CPU socket, squiggles in the traces in the innermost lines from memory socket to CPU (to equal the distance of the longer traces along the edges of the socket to the CPU so timings remain in sync), and why die shrinks are more important for performance than scaling out... If a chip is too physically big, the speed of light can limit its performance before other inefficiencies will.

Really fascinating and reinforces how no matter how one can imagine how much "faster" computers will get in the future, there is still that speed limit of information.

xxs · on Nov 2, 2014

>>much of the latency is simply the direct result of the finite speed of light.<<

This should be in bold: besides the 6 transistors per bit for L1, there is so much space to fit on a die that's close enough to the core.

Everlag · on Nov 1, 2014

The more cache you have, the harder it is to get it to the frequency you desire. A larger cache also means a more power hungry cpu producing more heat and subject to significant reliance on Moore's Law in order to avoid stalling out on cache size and being forced to go back to RAM.

Also, its a lot easier to scale outside of a chip with external, or on-package, DRAM which allows for modularity within the the same cpu segment.