What every programmer should know about memory (2007) [pdf] (freebsd.org)
203 points by quackulus on June 21, 2023 | 46 comments



https://stackoverflow.com/questions/8126311/how-much-of-what...

Discussion of it in a slightly more modern context, worth looking at.


I get by with the latency numbers every programmer should know[0]. Knowing a bit about how cache sharing works is helpful sometimes[1]; there's a quick false-sharing sketch below.

[0] https://gist.github.com/hellerbarde/2843375?permalink_commen...

[1] https://en.wikipedia.org/wiki/False_sharing
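
Here's what false sharing looks like in practice, as a toy sketch of my own (not from the Wikipedia article): two threads each bump their own counter, but if the counters share a 64-byte cache line, every write from one core invalidates the other core's copy. Giving each counter its own line makes the exact same loops run several times faster on typical hardware.

    /* Hypothetical false-sharing demo (assumes POSIX threads and a
       64-byte cache line).  Build: cc -O2 -pthread false_sharing.c
       Comment out the `pad` member and re-time it to see the slowdown. */
    #include <pthread.h>
    #include <stdio.h>
    
    struct counters {
        long a;
        char pad[64 - sizeof(long)];  /* keeps `b` on its own cache line */
        long b;
    } c;
    
    static void *bump(void *p) {
        volatile long *x = p;         /* volatile: force a store per iteration */
        for (long i = 0; i < 100000000L; i++)
            (*x)++;
        return NULL;
    }
    
    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump, &c.a);
        pthread_create(&t2, NULL, bump, &c.b);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("a=%ld b=%ld\n", c.a, c.b);
        return 0;
    }

Hand-rolled padding is crude; in real code you'd usually reach for aligned allocations or per-thread structures, but it keeps the effect visible here.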


What only a small subset of highly specialized programmers should know about memory (2007) [pdf]


I'm a very high level programmer, turned Data Scientist, turned founder (I successfully exited my prev startup). I still read this paper (a few years ago). I guess if you're "born" a programmer, all these nerdy/curious things are entertaining. One of my "developer" passions is multithreading and concurrency. Even now running my second startup, I, from time to time, sit to write some concurrent puzzle programs.

I also like to follow Tobi Lütke. The guy is the founder/CEO of Shopify (with all that that might entail), and he still tweets amazingly interesting low-level coder stuff.


> sit to write some concurrent puzzle programs.

I concur. There's something about concurrency and multithreading that tickles your brain in a weird way. It keeps me stimulated.

My favorite is setting up a scenario that would normally be considered a disastrous concurrency failure. Like intentionally messing up a `volatile` state. You can't ever have those in prod, of course, but seeing them in a controlled, _expected_ way tickles my brain in a very perverse way.
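
For anyone who hasn't tried it, here's a minimal sketch of the kind of controlled disaster I mean (a toy of my own, never production code): two threads hammer a plain counter with non-atomic increments, and the lost updates show up right in the final total.

    /* Deliberately broken toy: `volatile` stops the compiler from caching
       the counter in a register, but ++ is still a non-atomic
       read-modify-write, so the threads trample each other's updates.
       Build: cc -O2 -pthread race.c */
    #include <pthread.h>
    #include <stdio.h>
    
    volatile long counter = 0;
    
    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; i++)
            counter++;                 /* load, add, store: racy by design */
        return NULL;
    }
    
    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("expected 2000000, got %ld\n", counter);
        return 0;
    }

Swapping the increment for an atomic (or a mutex) makes the number come out right again, which is half the fun of the exercise.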


Any sample of your puzzle programs that you may be willing to share? I want to run my own multithreading experiments, but being a JavaScript dev, it's not something I deal with very often. Also, great work on DataWars. The onboarding experience and the site layout are pleasant.


The principles here apply at basically every level of the stack.


Yes, every developer needs 114 pages worth of knowledge about memory.


Yes, why not?

You don't have to understand all of it, but if you can't be bothered to get the gist of this, maybe computers aren't for you?

Note that I'm implicitly making a distinction between people who write code and people who are explicitly developers.


This is gatekeeping. People who develop software are developers. Many (probably most) of them have done so for years without 114 pages' worth of in-depth knowledge of how memory works. So by definition they don't need to know it to do their jobs successfully.


They don't need to know it, but then it should also not be a surprise to you when programs take gigabytes of memory and several seconds to do things that used to take kilobytes of memory and milliseconds, for the same workload.

Delving deep gives you fundamental knowledge of systems. Sometimes there isn't time for that, in which case you use the generic library that does stuff you never needed and ship the product. But in 10 years' time, when that needs to actually become performant code, you need to know how to fix it, or someone else will be paid to do it instead.


Yeah, I read some of it before, and had no idea why I would need this.


Whenever articles like this bubbled up, I used to want to delve right in to satisfy some inner need to be deeply knowledgeable about the craft. I don't feel that way anymore now that I'm in my 30s, something flipped. Now I'm much more interested in higher-level abstractions and building complete products that solve some real problems, rather than occupying myself learning about the nitty gritty details of how software works under the hood.

I think it's maybe nice to know how we interact with memory in software (memory allocation/deallocation, maybe how manual memory mgmt is done, something about garbage collection?), but that's about it.


I feel the same way now that I'm older. The nitty-gritty of some of the memory management stuff just isn't relevant. The work I'm doing now is at a different scale and I let the language designers worry about this. It is still useful to know some details as they apply when I'm programming, but a lot of this stuff just doesn't command my attention!


I dunno, I'm a game programmer and it seems a bit too much for me personally. I just John Carmack it and always do the simplest thing for me as the dev. If the benchmarks aren't fit for purpose¹ then I'll look at the profiler to see what the lowest-hanging fruit is to optimise. I'm not gonna guess what is slow; I am not a computer, I just look at the profiler.

¹This almost never happens.


Ah yes, John Carmack. The developer who is known for writing simple, easy to understand code.

    float Q_rsqrt( float number )
    {
     long i;
     float x2, y;
     const float threehalfs = 1.5F;
    
     x2 = number * 0.5F;
     y  = number;
     i  = * ( long * ) &y;                       // evil floating point bit level hacking
     i  = 0x5f3759df - ( i >> 1 );               // what the fuck?
     y  = * ( float * ) &i;
     y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
    // y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed
    
     return y;
    }


He could utterly suck as a programmer, or even be fictitious, but that "John Carmack approach" as I call it is still great imo.


Except the reason Quake (and other such first-person shooter games) are very fast and optimized... is that John Carmack encouraged high-performance programming, which is necessarily only practiced and understood by a very small number of programmers.

Carmack (and id Software's games) are legendary in terms of their performance. And the source code of these games... filled with raw assembly and other such tricks like the one posted above... is nearly impossible for the "typical" programmer to read.

----------

This is absolutely not a company (or engineer) who did things "simply". He did things with an extremely difficult, high-performance-first engineering mindset.

I dare say that John Carmack's practice is closer to "premature optimization", built around abusing incredibly low-level tricks to improve the performance of their various video games... and strategically choosing the parts of the video-game simulation where people would prefer high performance over accuracy. (IIRC, the Fast Inverse Square Root algorithm above is only good to ~4 digits, or about 12 bits, of accuracy.)

This was a programmer (or at least a programming team) that, at a high level, decided that 12-ish bits of accuracy (out of a maximum of 23 bits of precision) was enough for their purposes in the vast majority of their code, and then used that slight performance increase to have a smoother-feeling game than their competition.

----------

IIRC, John Carmack is a legendary programmer in terms of low-level 3d details. He is a good programmer, but his style is the complete opposite of what you're calling it.
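
Also, if anyone wants to check the accuracy figure above rather than trust my IIRC, a small harness like this (my own sketch; it assumes you paste the Q_rsqrt definition from the snippet upthread above main) reports the worst relative error against 1.0f/sqrtf over a sweep of inputs:

    /* Sketch: measure the relative error of the quoted Q_rsqrt against
       the libm reference.  Build: cc -O2 rsqrt_err.c -lm */
    #include <math.h>
    #include <stdio.h>
    
    float Q_rsqrt(float number);   /* the function from the quoted snippet */
    
    int main(void) {
        double worst = 0.0;
        for (float x = 0.01f; x < 1.0e6f; x *= 1.001f) {
            double exact  = 1.0 / sqrt((double)x);
            double approx = (double)Q_rsqrt(x);
            double err    = fabs(approx - exact) / exact;
            if (err > worst) worst = err;
        }
        printf("worst relative error seen: %.4f%%\n", worst * 100.0);
        return 0;
    }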


Carmack didn't write that



You wouldn't actually know how to optimize it, other than by guessing, if you didn't have a rough idea of the computer's memory hierarchy.


What does "John Carmack it" mean? Take the simplest and correct approach?


Like what I described: start with the simplest/easiest code for you, the developer, and only optimise it / make it more complicated later if it's benchmarked and not up to the standard you're aiming for (of feature quality or performance or whatever). There was a bit about it in that book "Masters of DOOM". I also personally equate the simplest/easiest code for you with the most maintainable code.


Hardware memory characteristics may have changed in the last 16 years, but the attractions of misleading clickbait titles have not.


For a great read with more recent numbers, I recommend you check out this post: https://news.ycombinator.com/item?id=22287993


The most important stuff in this paper isn't the numbers.


Still, the parent is a valuable link. Thanks for sharing.


Absolutely.


1. RAM is a parallel machine of its own. This document describes DDR3 (which is 2 versions out from modern DDR5), but that doesn't really matter. Today's DDR5 is just more parallel.

2. This document very precisely describes the parallelization mechanism of DDR3. Reads/writes to DDR3 are parallelized by channel, rank, and bank. DDR4 added the bank group as an additional layer of parallelism (above bank, below rank).

3. An individual bank has its own performance characteristics: it is much faster when it can stay "open", incurring only a CAS latency rather than a full close/precharge+RAS+CAS cycle.

4. CPUs try to exploit this parallelism with various mechanisms: software prefetching, hardware prefetchers, levels of cache, and so forth (see the sketch after this list).

5. NUMA is more relevant today, as EPYC systems are innately quad-NUMA. HugePages are also more relevant today, as 96MB+ of L3 per CCX has appeared on Ryzen/EPYC machines, meaning you need HugePages to access L3 without a TLB miss. The TLB isn't really RAM, but it's closely related to RAM and can become a bottleneck before you hit a RAM bottleneck, so it's still important to learn if you're optimizing against RAM bottlenecks.
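
For point 4, here's a minimal sketch of what software prefetching looks like in code (my own illustration using the GCC/Clang __builtin_prefetch builtin; nothing here is from the paper): ask for a cache line a handful of iterations ahead of where you're working, so the memory system is already servicing the miss by the time you get there. For a plain linear scan the hardware prefetchers usually do this for you anyway, so as always, measure before keeping it.

    /* Hypothetical prefetch sketch (GCC/Clang builtin).  The distance of
       16 elements is a guess and needs tuning per platform/workload. */
    #include <stddef.h>
    
    #define PREFETCH_DIST 16
    
    long sum_with_prefetch(const long *data, size_t n) {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + PREFETCH_DIST < n)
                /* args: address, 0 = read, 3 = keep in all cache levels */
                __builtin_prefetch(&data[i + PREFETCH_DIST], 0, 3);
            sum += data[i];
        }
        return sum;
    }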


One thing that is missing from this document: there are a limited number of line fill buffers in the L1 cache of each core, which limits the number of concurrent memory accesses that can be in flight, well before anything related to channels/ranks/banks begins to matter. https://lemire.me/blog/2018/11/05/measuring-the-memory-level...
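
The usual way to make that limit visible (roughly what the linked post does; the code below is my own reconstruction, so treat it as a sketch) is to chase several independent pointer chains in the same loop. Each extra chain is another cache miss the core can keep in flight, and the per-access cost stops improving once the fill buffers are saturated.

    /* Sketch: memory-level parallelism via independent pointer chases.
       Rebuild with LANES = 1, 2, 4, 8, ... and compare ns per access.
       Build: cc -O2 mlp.c */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    
    #define N     (1u << 24)    /* 64 MB of uint32_t: well past L3 */
    #define LANES 4
    #define STEPS 1000000
    
    static uint64_t rng = 88172645463325252ULL;
    static uint32_t xorshift(void) {          /* crude PRNG for the shuffle */
        rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17;
        return (uint32_t)rng;
    }
    
    int main(void) {
        uint32_t *next = malloc(N * sizeof *next);
        for (uint32_t i = 0; i < N; i++) next[i] = i;
        /* Sattolo's algorithm: one big random cycle, so no lane gets stuck */
        for (uint32_t i = N - 1; i > 0; i--) {
            uint32_t j = xorshift() % i;
            uint32_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
    
        uint32_t pos[LANES];
        for (int l = 0; l < LANES; l++) pos[l] = (uint32_t)l * (N / LANES);
    
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int s = 0; s < STEPS; s++)
            for (int l = 0; l < LANES; l++)
                pos[l] = next[pos[l]];        /* LANES independent misses in flight */
        clock_gettime(CLOCK_MONOTONIC, &t1);
    
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%d lanes: %.1f ns/access (checksum %u)\n",
               LANES, ns / ((double)STEPS * LANES), pos[0]);
        free(next);
        return 0;
    }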


I read this doc many years ago, and I remember it was worth the time. If folks are wondering what has changed since 2007 that this doc doesn't cover, I'd say it's CCX awareness for AMD CPUs.


FYI - this doesn't seem to be related to FreeBSD (even though it's hosted there).

It just appears to be an external PDF that FreeBSD committer Lawrence Stewart saved to their fileshare, written by someone else.


Yeah, there's a pretty famous history on this paper (and with Ulrich Drepper).


Please do tell


He's been the maintainer of glibc (he's still a developer) and of a variety of other key bits of open-source software. He's kind of famously abrasive/obtuse: https://news.ycombinator.com/item?id=2378013. The paper was kind of a canonical work, referenced by most everyone (which is no doubt why there's a copy of it on the FreeBSD site).


In this thread: the reason why software is so shit in 2023


What would be an up to date alternative to this document?


Is there an updated version of this?


(2007).


Does the age affect how I, as a programmer, should read it?


Literally, yes.

As an example, the north bridge isn't as explicit today as the document makes it seem. A lot of the functionality that used to be reserved for the north bridge is now tucked away in the CPU or the motherboard itself.

Still, worth reading.


Tbh, this paper is a bit much, unless you have a lot of time on your hands for it, and also the necessary understanding of CPU architecture to really grok it.

I think there are better blog posts out there that explain the concepts. "Cache locality" and "memory cost of context switching" are some terms to put into search engines. Maybe add "thread local storage" in there too.

The big takeaway from this paper though, at least when I first read it (and the concept has risen to prominence in the years since, backed up by benchmarks), is that Big O isn't really the end of the argument when it comes to performance.

The way CPUs work is that it's extremely fast to rip through contiguous datasets. Jumping around to random memory addresses is way slower. A corollary to that is that jumping between threads is very slow for the same reasons: loading and unloading random chunks of memory. There are a lot of details around the specifics of why, though: different cache levels and their latencies, what happens when there is a cache miss, etc.

This led HFT-type programs to favor linear arrays for storing things and pinning threads to CPU cores, amongst other things, but that's the gist of it.
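
A tiny benchmark along those lines (my own sketch, nothing HFT-specific about it) makes the gap easy to see: sum the same array once in order and once through a shuffled index, so the work is identical but only the access pattern changes.

    /* Sketch: identical work, different access order.  The shuffled pass
       defeats the hardware prefetcher and turns most reads into cache
       misses; expect a large, machine-dependent gap once the array is
       much bigger than L3.  Build: cc -O2 access_order.c */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    
    #define N (1u << 24)    /* 64 MB of ints */
    
    static double now_ms(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
    }
    
    int main(void) {
        int *data = malloc(N * sizeof *data);
        uint32_t *idx = malloc(N * sizeof *idx);
        for (uint32_t i = 0; i < N; i++) { data[i] = 1; idx[i] = i; }
        /* Fisher-Yates shuffle of the index array (rand() is crude but
           fine for a demo) */
        for (uint32_t i = N - 1; i > 0; i--) {
            uint32_t j = (uint32_t)rand() % (i + 1);
            uint32_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
    
        long sum = 0;
        double t0 = now_ms();
        for (uint32_t i = 0; i < N; i++) sum += data[i];        /* contiguous */
        double t1 = now_ms();
        for (uint32_t i = 0; i < N; i++) sum += data[idx[i]];   /* scattered  */
        double t2 = now_ms();
    
        printf("sequential %.1f ms, shuffled %.1f ms (sum=%ld)\n",
               t1 - t0, t2 - t1, sum);
        free(data); free(idx);
        return 0;
    }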


Yes, because the advice is necessarily hardware dependent.


This is why I think this document was too detailed and should have omitted low-level details. The way you optimize your code hasn't really changed since 2007, but the hardware churn since then triggers people to say "well akshually the northbridge..."


Yes and no. A lot of the concepts about memory management and how memory functions are still the same or similar. It's a good baseline to start with, and then look at how things have changed since then (they certainly have changed, but most of the paper still applies).


Perhaps yes. Multiple cores are now much more of a thing than they were in 2007; that at least could be a change in emphasis. OSes may do things differently in terms of how they implement memory mapping. And so on.



