Thank you! I wanted to reply last night but got rate limited. I've been questioning my side project recently (hence why no updates for a month) so hearing positive feedback from people does help motivate.
"Virtual addresses are the size of a CPU register. On 32 bit systems each process has 4 gigabytes of virtual address space all to itself, which is often more memory than the system actually has."
Yes, VIRTUAL memory. Most operating systems (Windows, Linux) leave 2 or 3 GB for the user process and reserve the rest of the address space for themselves.
That way a userspace-to-kernel switch does not require changing the active page table (and it also avoids a switcharoo each time the kernel needs to access userspace memory).
If you are thinking about the “which is often more memory than the system actually has" part, I don't know if it's outdated even today: the vast majority of Linux systems these days are Android phones, and I wouldn't be surprised at all if a good proportion of those didn't have more than 4GB of RAM.
I think that's probably still true for what 32-bit systems are still out there today.
And regardless, I think the majority of systems running Linux today are phones, which usually have 4GB or less of RAM.
But I expect the FAQ was probably originally thinking about desktop or server systems, so, yeah, the intent there is probably out of date. Those types of systems are rarely 32-bit these days, and usually have a bit more than 4GB of RAM.
> I think the majority of systems running Linux today are phones, which usually have 4GB or less of RAM.
Even this is quickly becoming less and less true (for new phones). Even the Pinephone comes with 3 GB of RAM at a $200 price point, and that's inflated because of the niche, low volume nature of its production.
Samsung's "mid range" A series smartphones, for instance, start at 3GB at the absolute lowest end, with most models coming with 6 GB of memory. I expect this will be even more common in a year or two.
App-switching (multitasking) without LRU apps getting force-closed to make room for active apps. In other words, if you like to keep apps open, more RAM will reduce the chances of an app opened a while ago having to "start fresh" when you switch back to it, losing whatever state it had when you last used it.
It means that it can address 40-bits of address space worth of physical memory, but that virtual memory addresses can use 48 bits. Physical addresses are just your RAM bytes numbered 1 through whatever. Virtual address space is the address space of a process, which includes mapped physical memory, unmapped pages, guard pages, and other virtual memory tricks.
> Physical addresses are just your RAM bytes numbered 1 through whatever.
Not really. There are lots of holes in the physical address map. Look at /proc/iomem. Look at all of the gunk in there at addresses lower than the amount of RAM you have. Look at the highest “System RAM” address. It will be higher than the amount of actual physical RAM that you have.
Your CPU can handle 39-bit physical memory addresses (up to 512 GB of physical memory), and 48-bit virtual addresses (256 TB). Your operating system maintains a mapping from virtual to physical addresses, usually arranging the map so that every process has a separate memory space. Pointers are all still 64 bits long though.
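To make that concrete, here's a rough sketch (Linux-only, the formatting and helper-free layout are mine) that prints the pointer width next to the "address sizes" line the CPU reports in /proc/cpuinfo:

    /* Pointers are 64 bits regardless of how many address bits the CPU
     * actually implements; the implemented widths are what /proc/cpuinfo
     * reports as "address sizes". */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        printf("sizeof(void *) = %zu bits\n", 8 * sizeof(void *));

        FILE *f = fopen("/proc/cpuinfo", "r");
        char line[256];
        while (f && fgets(line, sizeof line, f)) {
            if (strncmp(line, "address sizes", 13) == 0) {
                fputs(line, stdout);   /* e.g. "39 bits physical, 48 bits virtual" */
                break;
            }
        }
        if (f) fclose(f);
        return 0;
    }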
In practice the actual usable address space for userland is 64 TiB, due to the user/kernel split and the kernel maintaining a virtual mapping of the entire physical address space (minus I/O ranges) [0].
However, newer incoming 5-level paging Intel chips [1] will allow up to 57 bits of address space: 128 PiB in theory, though in practice 32 PiB of userland memory. See also [0] for discussion of the practical limit for 5-level paging too!
True, though /proc/cpuinfo only reports the size, which is ultimately what the CPU cares about. Plus the most relevant limit is what your motherboard and wallet supports, which is often far lower.
Indeed, and as you say, realistically you are hardly likely to hit those limits in any plausible (esp. home) setup. The actual meaningful limit is usually the CPU's physical one, as home CPUs very often have stringent memory limits (often 32 GiB or so), and of course you rely on the motherboard's limitations as well.
Having said that I did write a patch to ensure that the system would boot correctly with 256 TiB of RAM [0] so perhaps I am not always a realist... or dream of the day I can own that system ;)
Oddly enough the unused bits are in the middle of the address. They're also sign-extended rather than filled with zeros, so sometimes they are ones and other times they are zeros.
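For the curious, here's a minimal sketch of what "canonical" means with 48-bit virtual addresses on x86-64 (the helper name is mine): bits 63:48 must all be copies of bit 47, i.e. a sign extension.

    #include <stdbool.h>
    #include <stdint.h>

    /* Assumes x86-64 with 4-level paging (48-bit virtual addresses). */
    static bool is_canonical_48(uint64_t addr) {
        /* Bits 63:48 must be copies of bit 47: all zeros or all ones. */
        uint64_t upper = addr >> 47;   /* bit 47 and everything above it */
        return upper == 0 || upper == 0x1FFFF;
    }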
Hmm, it appears that the top byte on arm64 is only ignored if TBI (Top Byte Ignore) is enabled.
I don't think pointer signing requires TBI though. Pointer signing uses the PAC instruction to sign a pointer, and the AUT instruction to verify and unpack the signed pointer, but in its signed/packed form it is not a usable pointer. So actual addressable pointers need not support non-canonical addresses.
It's for a different purpose (as in, to mitigate security bugs to some extent), and it isn't an Apple-only feature but an Arm one (it's only just rolling out on Cortex cores, with the Cortex-A78C and A78AE).
Yes generally for userspace addresses they are 0. But more importantly they can be used for other stuff, commonly referred to as pointer tagging / smuggling etc.
It's a useful optimisation technique where you can add some extra metadata without having to dereference a pointer.
The reason why amd64 checks whether addresses are “canonical” is to discourage exactly this trick. On almost all platforms that simply ignored the upper byte of the pointer (m68k, s390, IIRC even early ARMs) this led to significant compatibility issues.
As for storing tags in pointers on 64-bit platforms, it is probably better to use the 3 low order bits. Another useful trick is what was used in PDP-10 MacLisp and is used by the BDW GC: encode the type information in the virtual memory layout itself.
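To illustrate the low-order-bit variant: malloc results are at least 8-byte aligned, so the bottom 3 bits are free to carry a small tag. A hedged sketch (the tag values and helper names are made up for illustration):

    #include <assert.h>
    #include <stdint.h>

    #define TAG_MASK ((uintptr_t)0x7)   /* low 3 bits, free due to alignment */
    #define TAG_INT  1
    #define TAG_PAIR 2

    static void *tag_ptr(void *p, uintptr_t tag) {
        assert(((uintptr_t)p & TAG_MASK) == 0 && tag <= TAG_MASK);
        return (void *)((uintptr_t)p | tag);
    }

    static uintptr_t ptr_tag(void *p)   { return (uintptr_t)p & TAG_MASK; }
    static void     *untag_ptr(void *p) { return (void *)((uintptr_t)p & ~TAG_MASK); }

The upside over high-bit tags is that the pointer stays canonical; the price is that you have to mask before every dereference.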
I guess it checks it when you actually try to dereference the pointer?
On Intel too you still have to "repair" the pointer before you use it.
It's definitely not the safest optimisation but it can be used to great effect when needed.
I think Intel is adding CPU support for pointer tagging operations in the future, which should make them a lot easier / safer / more efficient to work with, though I can't find a reference now; it doesn't refer to it as pointer tagging.
Any more information on encoding the type information in virtual memory layout? Sounds cool.
I guess you have different types allocated in specific regions?
Most general purpose ISAs with some kind of intrinsic support for tagged pointers (e.g. SPARC, and IIRC RISC-V has something similar) also prefer the tags in the low order bits.
And you are right that the tag-inside-address trick involves allocating objects of the same type in distinct contiguous regions. Usually it's arranged so that a whole page contains objects of the same type (as far as the tagging scheme is concerned), and then either by masking off the lower ten-ish bits of the pointer you get to a type header, or you keep some global out-of-line map of page frame -> type.
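A rough sketch of that layout trick (the page size, header fields, and one-type-per-page policy here are illustrative assumptions, not how the BDW GC literally lays things out):

    #include <stdint.h>

    #define PAGE_SIZE 4096u

    struct page_header {
        uint32_t type_id;    /* every object on this page has this type */
        uint32_t obj_size;
    };

    /* Recover the type of any interior pointer without dereferencing it:
     * round the pointer down to its page boundary and read the header there. */
    static struct page_header *header_of(void *obj) {
        return (struct page_header *)((uintptr_t)obj & ~((uintptr_t)PAGE_SIZE - 1));
    }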
On 32-bit systems, 4 GiB is indeed often more memory than the system has (think 512 MiB for some Raspberry Pis). And on 64-bit x86 systems each process has 256 PiB, which is also more memory than the system has.
Other than that, I also think that even when outdated, computing history is worth reading anyway, since it gives you a natural understanding of _why_ we do what we do these days. In your day job, it also gives you a different appreciation for what people did and why they did it, and why 'this horrible code' may have made sense at the time.
Furthermore, performance engineering is fundamentally about the tension between code and hardware limitations. If the hardware limitations are different, you'll get different code, but the principles remain the same.
If you're curious, write a basic emulator for older hardware (the NES is a great choice); it's both fun and eye-opening!
Edit: the NES emulator will answer 'how do you fit Super Mario Bros in 32k, and how can it run on such limited hardware?'
Sometimes, but a description of the state of the art in the past does not become a historical tract with the passage of time. The better ones do; others just become outdated.
Well, the ones which fail (and which become outdated) can also teach us valuable lessons: looking at the current state of the art doesn't necessarily tell you what happens if you do things differently.
In other words, we tend to focus on positive results, but negative ones ('don't do this or.. !') can be equally interesting and useful.
Highly relevant. The only part that I would discount is that he was pretty bullish on the prospects for hardware transactional memory, and his forward-looking statements about it didn't pan out. In fairness, much of the industry was bullish about HTM at that time.
In contrast, software transactional memory is still a pretty neat abstraction for some concurrency problems.
(And, of course, hardware transactional memory can be used to implement 'software' transactional memory faster than in software.)
However, STM only really works well in languages that are pure by default, like e.g. Haskell (or perhaps Erlang might be close enough). In a language with pervasive mutation and side effects, it's too annoying to use. Microsoft tried to make it work for .NET for a while, and gave up.
Been a while since I've read them but I recall there was info about the FSB and northbridge, which no longer exist outside the CPU. They've been replaced by internal memory controllers and PCI-E controllers.
Actually no, malloc doesn't allocate any memory; it just updates the process's VMAs to say that the allocated virtual range is valid. The pages are then faulted in on write. This is where things like the OOM killer become very confusing for people.
In Linux (in sane configurations), allocations are just preorders.
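You can watch the preorder behaviour for yourself with something like this (Linux-only sketch; resident_kib is just a helper of mine that reads /proc/self/statm):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    static long resident_kib(void) {
        long pages = 0, resident = 0;
        FILE *f = fopen("/proc/self/statm", "r");
        if (f) {
            if (fscanf(f, "%ld %ld", &pages, &resident) != 2)
                resident = 0;
            fclose(f);
        }
        return resident * (sysconf(_SC_PAGESIZE) / 1024);
    }

    int main(void) {
        size_t len = (size_t)1 << 30;      /* ask for 1 GiB */
        char *p = malloc(len);
        if (!p) return 1;
        printf("after malloc:   RSS = %ld KiB\n", resident_kib());

        memset(p, 1, len / 4);             /* fault in a quarter of the pages */
        printf("after touching: RSS = %ld KiB\n", resident_kib());

        free(p);
        return 0;
    }

With default overcommit settings the first RSS figure barely moves after the malloc; only the touched pages show up in the second one.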
EDIT: I can't reply below due to rate limiting:
I'd argue that overcommit just makes the difference between allocation and backing very stark.
Your memory IS in fact allocated in the process's VMAs; it's just that the anonymous pages cannot necessarily be backed.
This differs, obviously, in other OSes as pointed out. Also differs if you turn overcommit off, but since so much in Linux assumes it, your system will soon break if you try it.
This depends on the OS. Solaris and Windows both do strict accounting by default, and overcommit is opt-in at a fine-grain API level. Linux is relatively extreme in its embrace of overcommit. So extreme that strict accounting isn't even possible--even if you disable overcommit in Linux, there are too many corner cases in the kernel where a process (including innocent processes) will be shot down under memory pressure. Too many Linux kernel programmers designed their subsystems with the overcommit mentality. That said, I still always disable overcommit as it makes it less likely for innocent processes to be killed when under heavy load.
An example of a split-the-difference approach is macOS, which AFAIU implements overcommit but also dynamically instantiates swap so that overcommit-induced OOM killing won't occur until your disk is full.
Also, it's worth mentioning that on all these systems process limits (see, e.g., setrlimit(2)) can still result in malloc returning NULL.
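For instance, a tiny sketch of the rlimit case (sizes are arbitrary): with RLIMIT_AS capped, malloc fails up front even on a system that overcommits.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/resource.h>

    int main(void) {
        /* Cap this process's address space at 256 MiB. */
        struct rlimit lim = { 256UL << 20, 256UL << 20 };
        if (setrlimit(RLIMIT_AS, &lim) != 0) return 1;

        void *p = malloc((size_t)1 << 30);          /* ask for 1 GiB */
        printf("malloc(1 GiB) returned %p\n", p);   /* expect (nil) */
        free(p);
        return 0;
    }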
> Solaris and Windows both do strict accounting by default, and overcommit is opt-in at a fine-grain API level.
Not sure what you mean by this - I don't think Windows has overcommit in any form, whether opt-in or opt-out. What it does have is virtual address space reservation, but that's separate from commitment; reserved virtual memory is not backed by any page, no matter how much free RAM you have, until you explicitly tell the system to commit physical memory to it.
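To illustrate the distinction (Windows-only sketch, sizes arbitrary):

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        SIZE_T reserve_len = (SIZE_T)1 << 30;   /* reserve 1 GiB of address space */

        /* MEM_RESERVE only carves out address space; nothing is charged
         * against the commit limit, and touching it would fault. */
        void *base = VirtualAlloc(NULL, reserve_len, MEM_RESERVE, PAGE_NOACCESS);
        if (!base) return 1;

        /* Explicitly commit the first 64 KiB before using it. */
        char *chunk = VirtualAlloc(base, 64 * 1024, MEM_COMMIT, PAGE_READWRITE);
        if (!chunk) return 1;
        chunk[0] = 42;                          /* fine now */
        printf("reserved %p, committed first 64 KiB\n", base);

        VirtualFree(base, 0, MEM_RELEASE);
        return 0;
    }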
In fact I'm not even sure 'opt-in' overcommit is possible in principle, because if you opt in to overcommit, you jeopardize the integrity of other applications, which likely did not opt in.
I thought there was a flag or commonly used library function that would do VirtualAlloc(MEM_RESERVE) and then from an in-process page fault handler attempt VirtualAlloc(MEM_COMMIT). But I guess I was wrong? I assume it's possible, just not as common as I thought.
I don't know of a common (or uncommon) function like this, though I think you could indeed implement it if you really want to (likely via AddVectoredExceptionHandler). It still requires explicitly telling the OS to commit just-in-time, so it's not "overcommitting". The closest built-in thing to this that I know of is PAGE_GUARD, which is internally used for stack extension, but that's all I can think of. The use cases for such a thing would be incredibly niche though, like some kind of high-performance sparse page management where every single memory access instruction counts. Maybe if you're writing a VM or emulator or something. Something that's only appropriate for << 1% of programs.
I don't think the C standard specifies this behavior. malloc must return either a pointer where you can store an object, or null. I think platform details about when accesses to that pointer might fail are outside the scope of the language / stdlib standard.
Are failures when accessing the allocated pointer due to overcommit substantially different from failures due to ECC errors or other hardware failure, with regard to what is specified in the C standard?
(FWIW I don't particularly like overcommit-by-default either)
So if I malloc 2 MB or 2 GB or whatever in a C program running on Linux, but I have not yet either read from or written to that memory, then what's the state? Has the C library forced Linux to actually allocate it, or has it not? Or does it depend, and if so, on what?
It depends on the overcommit setting. By default it's on, and that indicates Linux doesn't promise to back it with a physical page. Only the virtual address range is allocated (i.e. the only guarantee is that future allocations within your process won't return addresses from that range). This implies that if you try to write to it, your write might segfault due to OOM. If overcommit is turned off, then Linux promises it will be backed with a physical page if you try to write to it, meaning your write won't segfault due to OOM. Aside from these, I think everything else is an implementation detail, but generally OSes map unwritten pages to the same zero page as an optimization, and then when a write occurs they back it with a physical page.
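The knob in question is /proc/sys/vm/overcommit_memory (0 = heuristic overcommit, the default; 1 = always overcommit; 2 = strict accounting). A trivial sketch to check which mode you're running in:

    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
        int mode = -1;
        if (f) {
            if (fscanf(f, "%d", &mode) != 1) mode = -1;
            fclose(f);
        }
        printf("vm.overcommit_memory = %d\n", mode);
        return 0;
    }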
> Also differs if you turn overcommit off, but since so much in Linux assumes it, your system will soon break if you try it.
I agree, reliance on overcommit has resulted in stability problems in Linux. But IME stability problems aren't induced by disabling overcommit, they're induced by disabling swap. The stability problems occur precisely because by relying on magical heuristics to save the day, we end up with an overall MM architecture that reacts extremely poorly under memory pressure. Whether or not overcommit is enabled, physical memory is a limited resource, and when Linux can't relieve physical memory pressure by dumping pages to disk, bad things happen, especially when under heavy I/O load (e.g. the buffer cache can grab pages faster than the OOM killer can free them).
And that's why strict allocation tracking (no overcommit) should be the default. But those of us in favor of guaranteed forward progress and sensible resource accounting lost this fight a long time ago.
In the C standard malloc should return null if it can’t fulfill the request. Linux violates this but it usually works out in the end since virtual memory makes true OOM very rare.
I don’t know what true OOM means, but my desktop has crashed I think at least three times in the last four months and the console said “OOM killer”. About 15GB of usable RAM, 2GB swap drive. I just have to have the usual applications open plus another browser in addition to Firefox, namely Chrome. (But naturally I don’t try to actively reproduce the behavior since I usually have better things to do than wait 10 minutes from everything becoming unresponsive -- even switching from the graphical session to a console -- to the OOM killer finally deciding to kill Chrome.) And I don’t run any virtual machines, just a big, fat IDE and stuff like that.
Your problem is the 2GB of swap. Get rid of it and it will just crash without 10min of slowdown (while swap disk is getting written to). </sarcasm>
Linux overcommitting memory and especially Chrome/Firefox being big fat memory hogs are the problem. In fact, every application which doesn't cope with malloc running out of memory, or which assumes everybody has multiple gigs of memory to spare, should "reevaluate".
I have made a few patches to the mm subsystem, some simply inspired by researching for the articles.