Thank you! I wanted to reply last night but got rate limited. I've been questioning my side project recently (hence why no updates for a month) so hearing positive feedback from people does help motivate.
"Virtual addresses are the size of a CPU register. On 32 bit systems each process has 4 gigabytes of virtual address space all to itself, which is often more memory than the system actually has."
Yes, VIRTUAL memory. Most operating systems (Windows, Linux) leave 2 or 3 GB for the user process and reserve the rest of the address space for themselves.
That way a userspace-to-kernel switch does not require changing the active page table (and it also avoids a switcharoo each time the kernel needs to access userspace memory).
If you are thinking about the “which is often more memory than the system actually has" part, I don't know if it's outdated even today: the vast majority of Linux systems these days are Android phones, and I wouldn't be surprised at all if a good proportion of those didn't have more than 4GB of RAM.
I think that's probably still true for what 32-bit systems are still out there today.
And regardless, I think the majority of systems running Linux today are phones, which usually have 4GB or less of RAM.
But I expect the FAQ was probably originally thinking about desktop or server systems, so, yeah, the intent there is probably out of date. Those types of systems are rarely 32-bit these days, and usually have a bit more than 4GB of RAM.
> I think the majority of systems running Linux today are phones, which usually have 4GB or less of RAM.
Even this is quickly becoming less and less true (for new phones). Even the Pinephone comes with 3 GB of RAM at a $200 price point, and that's inflated because of the niche, low volume nature of its production.
Samsung's "mid range" A series smartphones, for instance, start at 3GB at the absolute lowest end, with most models coming with 6 GB of memory. I expect this will be even more common in a year or two.
App-switching (multitasking) without LRU apps getting force-closed to make room for active apps. In other words, if you like to keep apps open, more RAM will reduce the chances of an app opened a while ago having to "start fresh" when you switch back to it, losing whatever state it had when you last used it.
It means that it can address 40-bits of address space worth of physical memory, but that virtual memory addresses can use 48 bits. Physical addresses are just your RAM bytes numbered 1 through whatever. Virtual address space is the address space of a process, which includes mapped physical memory, unmapped pages, guard pages, and other virtual memory tricks.
> Physical addresses are just your RAM bytes numbered 1 through whatever.
Not really. There are lots of holes in the physical address map. Look at /proc/iomem. Look at all of the gunk in there at addresses lower than the amount of RAM you have. Look at the highest “System RAM” address. It will be higher than the amount of actual physical RAM that you have.
Your CPU can handle 39-bit physical memory addresses (up to 512 GB of physical memory), and 48-bit virtual addresses (256 TB). Your operating system maintains a mapping from virtual to physical addresses, usually arranging the map so that every process has a separate memory space. Pointers are all still 64 bits long though.
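To make that concrete, here's a rough sketch (Linux-only, the formatting and helper-free layout are mine) that prints the pointer width next to the "address sizes" line the CPU reports in /proc/cpuinfo:

    /* Pointers are 64 bits regardless of how many address bits the CPU
     * actually implements; the implemented widths are what /proc/cpuinfo
     * reports as "address sizes". */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        printf("sizeof(void *) = %zu bits\n", 8 * sizeof(void *));

        FILE *f = fopen("/proc/cpuinfo", "r");
        char line[256];
        while (f && fgets(line, sizeof line, f)) {
            if (strncmp(line, "address sizes", 13) == 0) {
                fputs(line, stdout);   /* e.g. "39 bits physical, 48 bits virtual" */
                break;
            }
        }
        if (f) fclose(f);
        return 0;
    }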
In practice the actual usable address space for userland is 64 TiB, due to the user/kernel split and the kernel maintaining a virtual mapping of the entire physical address space (minus I/O ranges) [0].
However, newer incoming 5-level paging Intel chips [1] will allow up to 57 bits of address space: 128 PiB in theory, though in practice 32 PiB of userland memory. See also [0] for discussion of the practical limit for 5-level paging too!
True, though /proc/cpuinfo only reports the size, which is ultimately what the CPU cares about. Plus the most relevant limit is what your motherboard and wallet supports, which is often far lower.
Indeed, and as you say, realistically you are hardly likely to hit those limits in any plausible (esp. home) setup. The actual meaningful limit is usually the CPU's physical one, as home CPUs very often have stringent memory limits (often 32 GiB or so), and of course you rely on the motherboard's limitations as well.
Having said that I did write a patch to ensure that the system would boot correctly with 256 TiB of RAM [0] so perhaps I am not always a realist... or dream of the day I can own that system ;)
Oddly enough the unused bits are in the middle of the address. They're also sign-extended rather than filled with zeros, so sometimes they are ones and other times they are zeros.
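For the curious, here's a minimal sketch of what "canonical" means with 48-bit virtual addresses on x86-64 (the helper name is mine): bits 63:48 must all be copies of bit 47, i.e. a sign extension.

    #include <stdbool.h>
    #include <stdint.h>

    /* Assumes x86-64 with 4-level paging (48-bit virtual addresses). */
    static bool is_canonical_48(uint64_t addr) {
        /* Bits 63:48 must be copies of bit 47: all zeros or all ones. */
        uint64_t upper = addr >> 47;   /* bit 47 and everything above it */
        return upper == 0 || upper == 0x1FFFF;
    }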
Hmm, it appears that the top byte on arm64 is only ignored if TBI (Top Byte Ignore) is enabled.
I don't think pointer signing requires TBI though. Pointer signing uses the PAC instruction to sign a pointer, and the AUT instruction to verify and unpack the signed pointer, but in its signed/packed form it is not a usable pointer. So actual addressable pointers need not support non-canonical addresses.
It's for a different purpose (as in, to mitigate security bugs to some extent), and it isn't an Apple-only feature but an Arm one (it's only just rolling out on Cortex cores, with the Cortex-A78C and A78AE).
Yes generally for userspace addresses they are 0. But more importantly they can be used for other stuff, commonly referred to as pointer tagging / smuggling etc.
It's a useful optimisation technique where you can add some extra metadata without having to dereference a pointer.
The reason why amd64 checks whether addresses are “canonical” is to discourage exactly this trick. On almost all platforms that simply ignored the upper byte of the pointer (m68k, s390, IIRC even early ARMs) this led to significant compatibility issues.
As for storing tags in pointers on 64-bit platforms, it is probably better to use the 3 low order bits. Another useful trick is what was used in PDP-10 MacLisp and is used by the BDW GC: encode the type information in the virtual memory layout itself.
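To illustrate the low-order-bit variant: malloc results are at least 8-byte aligned, so the bottom 3 bits are free to carry a small tag. A hedged sketch (the tag values and helper names are made up for illustration):

    #include <assert.h>
    #include <stdint.h>

    #define TAG_MASK ((uintptr_t)0x7)   /* low 3 bits, free due to alignment */
    #define TAG_INT  1
    #define TAG_PAIR 2

    static void *tag_ptr(void *p, uintptr_t tag) {
        assert(((uintptr_t)p & TAG_MASK) == 0 && tag <= TAG_MASK);
        return (void *)((uintptr_t)p | tag);
    }

    static uintptr_t ptr_tag(void *p)   { return (uintptr_t)p & TAG_MASK; }
    static void     *untag_ptr(void *p) { return (void *)((uintptr_t)p & ~TAG_MASK); }

The upside over high-bit tags is that the pointer stays canonical; the price is that you have to mask before every dereference.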
I guess it checks it when you actually try to dereference the pointer?
On Intel too you still have to "repair" the pointer before you use it.
It's definitely not the safest optimisation but it can be used to great effect when needed.
I think Intel is adding CPU support for pointer tagging operations in the future, which should make them a lot easier / safer / more efficient to work with, though I can't find a reference now; it doesn't refer to it as pointer tagging.
Any more information on encoding the type information in virtual memory layout? Sounds cool.
I guess you have different types allocated in specific regions?
Most general purpose ISAs with some kind of intrinsic support for tagged pointers (e.g. SPARC, and IIRC RISC-V has something similar) also prefer the tags in the low order bits.
And you are right that the tag-inside-address trick involves allocating objects of the same type in distinct contiguous regions. Usually it's arranged so that a whole page contains objects of the same type (as far as the tagging scheme is concerned), and then either by masking off the lower ten-ish bits of the pointer you get to a type header, or you keep some global out-of-line map of page frame -> type.
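A rough sketch of that layout trick (the page size, header fields, and one-type-per-page policy here are illustrative assumptions, not how the BDW GC literally lays things out):

    #include <stdint.h>

    #define PAGE_SIZE 4096u

    struct page_header {
        uint32_t type_id;    /* every object on this page has this type */
        uint32_t obj_size;
    };

    /* Recover the type of any interior pointer without dereferencing it:
     * round the pointer down to its page boundary and read the header there. */
    static struct page_header *header_of(void *obj) {
        return (struct page_header *)((uintptr_t)obj & ~((uintptr_t)PAGE_SIZE - 1));
    }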
On 32-bit systems, 4 GiB is indeed often more memory than the system has (think 512 MiB for some Raspberry Pis). And on 64-bit x86 systems each process has 256 PiB, which is also more memory than the system has.
Other than that, I also think that even when outdated, computing history is worth reading anyway, since it gives you a natural understanding of _why_ we do what we do these days. In your day job, it also gives you a different appreciation for what people did and why they did it, and why 'this horrible code' may have made sense at the time.
Furthermore, performance engineering is fundamentally about the tension between code and hardware limitations. If the hardware limitations are different, you'll get different code, but the principles remain the same.
If you're curious, write a basic emulator for older hardware (the NES is a great choice); it's both fun and eye-opening!
Edit: the NES emulator will answer 'how do you fit Super Mario Bros in 32k, and how can it run on such limited hardware?'
Sometimes, but a description of the state of the art in the past does not become a historical tract with the passage of time. The better ones do; others just become outdated.
Well, the ones which fail (and which become outdated) can also teach us valuable lessons: looking at the current state of the art doesn't necessarily tell you what happens if you do things differently.
In other words, we tend to focus on positive results, but negative ones ('don't do this or.. !') can be equally interesting and useful.
Highly relevant. The only part that I would discount is that he was pretty bullish on the prospects for hardware transactional memory, and his forward-looking statements about it didn't pan out. In fairness, much of the industry was bullish about HTM at that time.
In contrast, software transactional memory is still a pretty neat abstraction for some concurrency problems.
(And, of course, hardware transactional memory can be used to implement 'software' transactional memory faster than in software.)
However, STM only really works well in languages that are pure by default, like e.g. Haskell (or perhaps Erlang might be close enough). In a language with pervasive mutation and side effects, it's too annoying to use. Microsoft tried to make it work for .NET for a while, and gave up.
Been a while since I've read them but I recall there was info about the FSB and northbridge, which no longer exist outside the CPU. They've been replaced by internal memory controllers and PCI-E controllers.
Actually no, malloc doesn't allocate any memory; it just updates the process's VMAs to say that the allocated virtual range is valid. The pages are then faulted in on write. This is where things like the OOM killer become very confusing for people.
In Linux (in sane configurations), allocations are just preorders.
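You can watch the preorder behaviour for yourself with something like this (Linux-only sketch; resident_kib is just a helper of mine that reads /proc/self/statm):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    static long resident_kib(void) {
        long pages = 0, resident = 0;
        FILE *f = fopen("/proc/self/statm", "r");
        if (f) {
            if (fscanf(f, "%ld %ld", &pages, &resident) != 2)
                resident = 0;
            fclose(f);
        }
        return resident * (sysconf(_SC_PAGESIZE) / 1024);
    }

    int main(void) {
        size_t len = (size_t)1 << 30;      /* ask for 1 GiB */
        char *p = malloc(len);
        if (!p) return 1;
        printf("after malloc:   RSS = %ld KiB\n", resident_kib());

        memset(p, 1, len / 4);             /* fault in a quarter of the pages */
        printf("after touching: RSS = %ld KiB\n", resident_kib());

        free(p);
        return 0;
    }

With default overcommit settings the first RSS figure barely moves after the malloc; only the touched pages show up in the second one.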
EDIT: I can't reply below due to rate limiting:
I'd argue that overcommit just makes the difference between allocation and backing very stark.
Your memory IS in fact allocated in the process's VMAs; it's just that the anonymous pages cannot necessarily be backed.
This differs, obviously, in other OSes as pointed out. Also differs if you turn overcommit off, but since so much in Linux assumes it, your system will soon break if you try it.
This depends on the OS. Solaris and Windows both do strict accounting by default, and overcommit is opt-in at a fine-grain API level. Linux is relatively extreme in its embrace of overcommit. So extreme that strict accounting isn't even possible--even if you disable overcommit in Linux, there are too many corner cases in the kernel where a process (including innocent processes) will be shot down under memory pressure. Too many Linux kernel programmers designed their subsystems with the overcommit mentality. That said, I still always disable overcommit as it makes it less likely for innocent processes to be killed when under heavy load.
An example of a split-the-difference approach is macOS, which AFAIU implements overcommit but also dynamically instantiates swap so that overcommit-induced OOM killing won't occur until your disk is full.
Also, it's worth mentioning that on all these systems process limits (see, e.g., setrlimit(2)) can still result in malloc returning NULL.
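For instance, a tiny sketch of the rlimit case (sizes are arbitrary): with RLIMIT_AS capped, malloc fails up front even on a system that overcommits.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/resource.h>

    int main(void) {
        /* Cap this process's address space at 256 MiB. */
        struct rlimit lim = { 256UL << 20, 256UL << 20 };
        if (setrlimit(RLIMIT_AS, &lim) != 0) return 1;

        void *p = malloc((size_t)1 << 30);          /* ask for 1 GiB */
        printf("malloc(1 GiB) returned %p\n", p);   /* expect (nil) */
        free(p);
        return 0;
    }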
> Solaris and Windows both do strict accounting by default, and overcommit is opt-in at a fine-grain API level.
Not sure what you mean by this - I don't think Windows has overcommit in any form, whether opt-in or opt-out. What it does have is virtual address space reservation, but that's separate from commitment; reserved virtual memory is not backed by any page, no matter how much free RAM you have, until you explicitly tell the system to commit physical memory to it.
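To illustrate the distinction (Windows-only sketch, sizes arbitrary):

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        SIZE_T reserve_len = (SIZE_T)1 << 30;   /* reserve 1 GiB of address space */

        /* MEM_RESERVE only carves out address space; nothing is charged
         * against the commit limit, and touching it would fault. */
        void *base = VirtualAlloc(NULL, reserve_len, MEM_RESERVE, PAGE_NOACCESS);
        if (!base) return 1;

        /* Explicitly commit the first 64 KiB before using it. */
        char *chunk = VirtualAlloc(base, 64 * 1024, MEM_COMMIT, PAGE_READWRITE);
        if (!chunk) return 1;
        chunk[0] = 42;                          /* fine now */
        printf("reserved %p, committed first 64 KiB\n", base);

        VirtualFree(base, 0, MEM_RELEASE);
        return 0;
    }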
In fact I'm not even sure 'opt-in' overcommit is possible in principle, because if you opt in to overcommit, you jeopardize the integrity of other applications, which likely did not opt in.
I thought there was a flag or commonly used library function that would do VirtualAlloc(MEM_RESERVE) and then from an in-process page fault handler attempt VirtualAlloc(MEM_COMMIT). But I guess I was wrong? I assume it's possible, just not as common as I thought.
I don't know of a common (or uncommon) function like this, though I think you could indeed implement it if you really want to (likely via AddVectoredExceptionHandler). It still requires explicitly telling the OS to commit just-in-time, so it's not "overcommitting". The closest built-in thing to this that I know of is PAGE_GUARD, which is internally used for stack extension, but that's all I can think of. The use cases for such a thing would be incredibly niche though, like some kind of high-performance sparse page management where every single memory access instruction counts. Maybe if you're writing a VM or emulator or something. Something that's only appropriate for << 1% of programs.
I don't think the C standard specifies this behavior. malloc must return either a pointer where you can store an object, or null. I think platform details about when accesses to that pointer might fail are outside the scope of the language / stdlib standard.
Are failures when accessing the allocated pointer due to overcommit substantially different from failures due to ECC errors or other hardware failure, with regard to what is specified in the C standard?
(FWIW I don't particularly like overcommit-by-default either)
So if I malloc 2 MB or 2 GB or whatever in a C program running on Linux, but I have not yet either read from or written to that memory, then what's the state? Has the C library forced Linux to actually allocate it, or has it not? Or does it depend, and if so, on what?
It depends on the overcommit setting. By default it's on, and that indicates Linux doesn't promise to back it with a physical page. Only the virtual address range is allocated (i.e. the only guarantee is that future allocations within your process won't return addresses from that range). This implies that if you try to write to it, your write might segfault due to OOM. If overcommit is turned off, then Linux promises it will be backed with a physical page if you try to write to it, meaning your write won't segfault due to OOM. Aside from these, I think everything else is an implementation detail, but generally OSes map unwritten pages to the same zero page as an optimization, and then when a write occurs they back it with a physical page.
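The knob in question is /proc/sys/vm/overcommit_memory (0 = heuristic overcommit, the default; 1 = always overcommit; 2 = strict accounting). A trivial sketch to check which mode you're running in:

    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
        int mode = -1;
        if (f) {
            if (fscanf(f, "%d", &mode) != 1) mode = -1;
            fclose(f);
        }
        printf("vm.overcommit_memory = %d\n", mode);
        return 0;
    }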
> Also differs if you turn overcommit off, but since so much in Linux assumes it, your system will soon break if you try it.
I agree, reliance on overcommit has resulted in stability problems in Linux. But IME stability problems aren't induced by disabling overcommit, they're induced by disabling swap. The stability problems occur precisely because by relying on magical heuristics to save the day, we end up with an overall MM architecture that reacts extremely poorly under memory pressure. Whether or not overcommit is enabled, physical memory is a limited resource, and when Linux can't relieve physical memory pressure by dumping pages to disk, bad things happen, especially when under heavy I/O load (e.g. the buffer cache can grab pages faster than the OOM killer can free them).
And that's why strict allocation tracking (no overcommit) should be the default. But those of us in favor of guaranteed forward progress and sensible resource accounting lost this fight a long time ago.
In the C standard malloc should return null if it can’t fulfill the request. Linux violates this but it usually works out in the end since virtual memory makes true OOM very rare.
I don’t know what true OOM means, but my desktop has crashed I think at least three times in the last four months and the console said “OOM killer”. About 15GB of usable RAM, 2GB swap drive. I just have to have the usual applications open plus another browser in addition to Firefox, namely Chrome. (But naturally I don’t try to actively reproduce the behavior since I usually have better things to do than wait 10 minutes from everything becoming unresponsive -- even switching from the graphical session to a console -- to the OOM killer finally deciding to kill Chrome.) And I don’t run any virtual machines, just a big, fat IDE and stuff like that.
Your problem is the 2GB of swap. Get rid of it and it will just crash without 10min of slowdown (while swap disk is getting written to). </sarcasm>
Linux overcommitting memory and especially Chrome/Firefox being big fat memory hogs are the problem. In fact, every application which doesn't cope with malloc running out of memory, or which assumes everybody has multiple gigs of memory to spare, should "reevaluate".
I have made a few patches to the mm subsystem, some simply inspired by researching for the articles.