> Or, why does an out-of-memory system hang instead of just telling the next process "nope, we don't have 300KB to allocate to you anymore"
Blame UNIX for that, and the fork() system call.
It's a design quirk. fork() duplicates the process. So suppose your web browser consumes 10GB RAM out of the 16GB total on the system, and wants to run a process for anything. Like it just wants to exec something tiny, like `uname`.
1. 10GB process does fork().
2. Instantly, you have two 10GB processes
3. A microsecond later, the child calls exec(), completely destroying its own state and replacing it with a 36K binary, freeing 10GB RAM.
So there's two ways to go there:
1. You could require step 2 to be a full copy. Which means either you need more RAM, a huge chunk of which would always sit idle, or you need a lot of swap, for the same purpose.
2. We could overlook the memory usage increase and pretend that we have enough memory, and only really panic if the second process truly needs its own 10GB RAM that we don't have. That's what Linux does.
The problem with #2 is that dealing with this happens completely in the background, at times completely unpredictable to the code. The OS allocates memory when the child changes memory, like when it does "a=1" somewhere. A program can't handle memory allocation failures there because as far as it knows, it's not allocating anything.
So what you get is this fragile fiction that sometimes breaks and requires the kernel to kill something to maintain the system in some sort of working state.
Windows doesn't have this issue at all because it has no fork(). New processes aren't children and start from scratch, so firefox never gets another 10GB sized clone. It just starts a new, 36K sized process.
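For reference, the fork()-then-exec() sequence from the numbered steps above looks roughly like this in Python. This is a minimal sketch assuming a Unix-like system; `run_via_fork_exec` is a name made up for illustration, and the Python interpreter stands in for a tiny binary like `uname`.

```python
import os
import sys


def run_via_fork_exec(argv):
    """Run argv in a child created with fork() + exec(); return its exit code."""
    pid = os.fork()                      # steps 1-2: the child starts as a clone of us
    if pid == 0:
        try:
            os.execvp(argv[0], argv)     # step 3: replace the clone with the new program
        finally:
            os._exit(127)                # only reached if exec itself failed
    _, status = os.waitpid(pid, 0)       # parent: wait for the child and reap it
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)          # child was killed by a signal


if __name__ == "__main__":
    # Hypothetical tiny command, standing in for `uname`.
    code = run_via_fork_exec([sys.executable, "-c", "print('hello from the child')"])
    print("child exited with", code)
```

Between `os.fork()` and `os.execvp()` the child still has the parent's whole address space mapped, which is the window the rest of this discussion is about.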
> Blame UNIX for that, and the fork() system call.
At least that design failure of UNIX was fixed long ago. There are posix_spawn(3) and various clone(2) flavours, which allow spawning a new process without copying the old one. And a lot of memory-intensive software actually uses them, so modern Linux distros can be used without memory overprovisioning.
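To illustrate the posix_spawn(3) route, here is a small sketch using Python's `os.posix_spawn` wrapper (Python 3.8+, Unix only). The helper name `spawn_and_capture` is hypothetical. It shows the one piece of process setup that most callers need, redirecting stdout into a pipe, expressed as a file action rather than as code run in a forked child.

```python
import os
import sys


def spawn_and_capture(argv):
    """Start argv with posix_spawn() and capture its stdout through a pipe."""
    read_fd, write_fd = os.pipe()        # Python creates these non-inheritable
    # File actions are applied in the new process before the program starts:
    # here, make the pipe's write end the child's stdout (fd 1).
    file_actions = [(os.POSIX_SPAWN_DUP2, write_fd, 1)]
    pid = os.posix_spawn(argv[0], argv, os.environ, file_actions=file_actions)
    os.close(write_fd)                   # parent keeps only the read end
    with os.fdopen(read_fd, "rb") as pipe:
        output = pipe.read()             # returns once the child closes its stdout
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status), output
```

Note that `argv[0]` has to be a full path here, since `os.posix_spawn` does not search `PATH` (`os.posix_spawnp` does).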
I'd rather blame people who are still using fork(2) for anything that can consume more than 100MB of memory.
I'm someone who likes to use fork() and then actually use both processes as they are, with shared copy-on-write memory. I'm happy to use it on things consuming much more than 100MB of memory. In fact that's where I like it the most. I'm probably a terrible person.
But what would be better? This way I can massage my data in one process, and then fork as many other processes that use this data as I like without having to serialise it to disk and then load it again. If the data is not modified after fork it consumes much less memory (only the page tables). Usually a little is modified, consuming only a little extra memory. If all of it is modified it doesn't consume more memory than I would have used otherwise (hopefully, not sure if the Linux implementation still keeps the pre-fork copy around).
(And no, not threads. They would share modifications, which I don't want. Also since I do this in python they would have terrible performance.)
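A rough sketch of that pattern, assuming a Unix-like system: prepare the data once, fork one child per job, and send back only the small results. `fork_workers` is a hypothetical helper, not something from a library, and the error handling is deliberately minimal.

```python
import os
import pickle


def fork_workers(data, jobs):
    """Fork one child per job; each child sees `data` without reloading it."""
    children = []
    for job in jobs:
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:                          # child: has a copy-on-write view of `data`
            status = 1
            try:
                os.close(read_fd)
                with os.fdopen(write_fd, "wb") as out:
                    pickle.dump(job(data), out)   # only the result is serialised
                status = 0
            finally:
                os._exit(status)              # never fall back into the parent's code
        os.close(write_fd)                    # parent keeps only the read end
        children.append((pid, read_fd))

    results = []
    for pid, read_fd in children:
        with os.fdopen(read_fd, "rb") as src:
            results.append(pickle.load(src))
        os.waitpid(pid, 0)                    # reap the child
    return results
```

For example, `fork_workers(list(range(1000)), [sum, max, len])` returns `[499500, 999, 1000]`. One caveat in CPython specifically: reference counting writes to object headers, so some pages get copied even when a child only reads the data.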
So if I got it right, you're using fork(2) as a glorified shared memory interface. If my memory is (also) right, you can allocate shared read-only mapping with shm_open(3) + mmap(2) in parent process, and open it as a private copy-on-write mapping in child processes.
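Something along these lines, as a sketch: Python's `mmap` module exposes a private copy-on-write mapping as `ACCESS_COPY`. A temporary file stands in for the shm_open(3) object here (that substitution is an assumption for the sake of a self-contained example), and `private_view_demo` is a made-up name.

```python
import mmap
import tempfile


def private_view_demo():
    """Write through a copy-on-write mapping; the backing object stays untouched."""
    with tempfile.TemporaryFile() as backing:     # stand-in for a shm_open() object
        backing.write(b"shared data")
        backing.flush()
        # ACCESS_COPY requests a private mapping: writes land in this process's
        # own copy of the page and are never written back to the backing file.
        with mmap.mmap(backing.fileno(), 0, access=mmap.ACCESS_COPY) as view:
            view[0:6] = b"SHARED"
            seen_in_mapping = bytes(view[:])
        backing.seek(0)
        seen_in_file = backing.read()
    return seen_in_mapping, seen_in_file
```

Calling `private_view_demo()` returns `(b"SHARED data", b"shared data")`: the mapping sees the modification, the backing object does not.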
I have used fork as a stupid simple memory arena implementation. fork(); do work in the child; only malloc, never free; exit. It is much, much heavier than a normal memory arena would be, but also much simpler to use. Plus, if you can split the work in independent batches, you can run multiple children at a time in parallel.
As with all such stupid simple mechanisms, I would not advise its use if your program spans more than one .c file and more than a thousand lines.
posix_spawn() is great, but Linux doesn't implement it. glibc does based on fork()+exec(). Other Unix(-like) OSes do implement posix_spawn() as system call. Also while you can use posix_spawn() in the vast majority of cases, if it doesn't cover certain process setup options that you need you still have to use fork()+exec(). But yeah, it would be good if Linux had it as a system call. It would probably help PostgreSQL.
Yes, of course. I did remember it's not fork() but some other *fork(), and couldn't remember the name, just the kind of thing it does. It's also not exec(), but probably execvp() or execvpe() or something like that.
When using vfork() what are you allowed to do? Can you even do IO redirection (piping)? The man page says:
> ... the behavior is undefined if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork() ...
But the time between fork and exec is exactly where you do a lot of setup, like IO redirection, dropping privileges, setuid(), setting a signal mask (nohup) etc. and I don't think you can do that without setting any variables. You certainly write to the stack when calling a function.
If you can't do these things you can't really use it to implement posix_spawn(). I guess it could use vfork() in the case no actions are required, but only then.
Can a modern distro really be used without over provisioning? Because the last time I tried it either the DE or display server hard locked immediately and I had to reboot the system.
Having this ridiculous setting as the default has basically ensured that we can never turn it off because developers expect things to work this way. They have no idea what to do if malloc errors on them. They like being able to make 1TB allocs without worrying about the consequences and just letting the kernel shoot processes in the head randomly when it all goes south. Hell, the last time this came up many swore that there was literally nothing a programmer could do in the event of OOM. Learned helplessness.
It's a goddamned mess and like many of Linux's goddamned messes not only are we still dealing with it in 2023, but every effort to do anything about it faces angry ranty backlash.
Almost everything in life is overprovisioned, if you think about it: Your ISP, the phone network, hospitals, bank reserves (and deposit insurance)...
What makes the approach uniquely unsuitable for memory management? The entire idea of swapping goes out of the window without overprovisioning as well, for better or worse.
Perhaps there is some confusion because I used "overprovision" when the appropriate term here is "overcommit", but Windows manages to work fine without unix-style overcommit. I suspect most OSs in history do not use unix's style of overcommit.
> What makes the approach uniquely unsuitable for memory management?
The fact that something like OOM killer even needs to exist. Killing random processes to free up memory you blindly promised but couldn't deliver is not a reasonable way to do things.
What an absurdly whataboutism filled response. Meanwhile Windows has been doing it the correct way for 20 years or more and never has to kill a random process just to keep functioning.
So you're saying the correct way to support fork() is to... not support it? This seems pretty wasteful in the majority of scenarios.
For example, it's a common pattern in many languages and frameworks to preload and fully initialize one worker process and then just fork that as often as required. The assumption there is that, while most of the memory is theoretically writable, practically, much of it is written exactly once and can then be shared across all workers. This both saves memory and the time needed to uselessly copy it for every worker instance (or alternatively to re-initialize the worker every single time, which can be costly if many of its data structures are dynamically computed and not just read from disk).
How do you do that without fork()/overprovisioning?
I'm also not sure whether "giving other examples" fits the bill of "whataboutism", as I'm not listing other examples of bad things to detract from a bad thing under discussion – I'm claiming that all of these things are (mostly) good and useful :)
> How do you do that without fork()/overprovisioning?
You use threads. The part that fork() would have kept shared is still shared, the part that would have diverged is allocated inside each thread independently.
And if you find dealing with locking undesirable you can use some sort of message system, like Qt signals to minimize that.
> the part that would have diverged is allocated inside each thread independently
That’s exactly my criticism of that approach: It’s conceptually trickier (fork is opt-in for sharing; threads are opt-out/require explicit copying) and requires duplicating all that memory, whether threads end up ever writing to it or not.
Threads have their merits, but so do subprocesses and fork(). Why force developers to use one over the other?
> Threads have their merits, but so do subprocesses and fork(). Why force developers to use one over the other?
I used to agree with you, but fork() seems to have definitely been left behind. It has too many issues.
* fork() is slow. This automatically makes it troublesome for small background tasks.
* passing data is inconvenient. You have to futz around with signals, return codes, socketpair or shared memory. It's a pain to set up. Most of what you want to send is messages, but what UNIX gives you is streams.
* Managing it is annoying. You have to deal with signals, reaping, and doing a bunch of state housekeeping to keep track of what's what. A signal handler behaves like an annoying, really horribly designed thread.
* Stuff leaks across easily. A badly designed child can feed junk into your shared filehandles by some accident.
* It's awful for libraries. If a library wants to use fork() internally that'll easily conflict with your own usage.
* It's not portable. Using fork() automatically makes your stuff UNIX only, even if otherwise nothing stops it from working on Windows.
I think the library one is a big one -- we need concurrency more than ever, but under the fork model different parts of the code that are unaware of each other will step on each other's toes.
Unless you're sure that you're going to write to the majority of the copy-on-write memory resulting from fork(), this seems like overkill.
Maybe there should be yet another flavor of fork() that does copy-on-write, but treats the memory as already-copied for physical memory accounting purposes? (Not sure if "copy-on-write but budget as distinct" is actually representable in Linux's or other Unixes' memory model, though.)
> Maybe there should be yet another flavor of fork() that does copy-on-write, but treats the memory as already-copied for physical memory accounting purposes?
How about a version which copies the pages but marks them read-only in the child process, except for a set of ranges passed to fork (which would be copy-on-write as now). The child process then has to change any read-only pages to copy-on-write (or similar) to modify them.
This allows the OS to double-count and hence deny fork if the range of pages passed to fork leads to out of memory situation. It also allows the OS to deny the child process changing any read-only pages if it would lead to an out of memory situation. Both of those scenarios could be gracefully handled by the processes if they wish.
It would also keep the current positive behavior of the forked process having read access to the parent memory for data structures or similar.
Option 3: vfork() has existed for a long time. The child process temporarily borrows all of the parent's address space. The calling process is frozen until the child exits or calls a flavor of exec. Granted, it's pretty brittle and any modification of non-stack address space other than changing a variable of type pid_t is undefined behavior before exec is called. However, it gets around the disadvantages of fork() while maintaining all of the flexibility of Unix's separation of process creation (fork/vfork) and process initialization (exec*).
vfork followed immediately by exec gives you Windows-like process creation, and last I checked, despite having the overhead of a second syscall, was still faster than process creation on Windows.
No, that's not how it works. The process table gets duplicated and copy-on-write takes care of the pages. As long as they are identical they will be shared, there is no way that 10GB of RAM will be allocated to the forked process and that all of the data will be copied.
This is the only right answer. What actually happens is you instantly have two 10G processes which share the same address space, and:
3. A microsecond later, the child calls exec(), decrementing the reference count to the memory shared with the parent[1] and faulting in a 36k binary, bringing our new total memory usage to 10,485,796KB (10,485,760K + 36K)
CoW has existed since at least 1986, when CMU developed the Mach kernel.
What GP is really talking about is overcommit, which is a feature (on by default) in Linux which allows you to ask for more memory than you have. This was famously a departure from other Unixes at the time[2], a departure that fueled confusion and countless flame wars in the early Internet.
> 2. We could overlook the memory usage increase and pretend that we have enough memory, and only really panic if the second process truly needs its own 10GB RAM that we don't have. That's what Linux does
"pretend" → share the memory and hope most of it will be read-only or unallocated eventually; "truly needs to own" → CoW
It will never happen. To begin with all of the code pages are going to be shared because they are not modified.
Besides that the bulk of the fork calls are just a preamble to starting up another process and exiting the current one. It's mostly a hack to ensure continuity for stdin/stdout/stderr and some other resources.
It will most likely not happen? It's absolutely possible to write a program that forks and both forks overwrite 99% of shared memory pages. It almost never happens, which is GP's point, but it's possible and the reason it's a fragile hack.
What usually happens in practice is you're almost OOM, and one of the processes running in the system writes to a page shared with another process, forcing the system to start good ol' OOM killer.
Sorry, but no, it can't happen, you can not fork a process and end up with twice the memory requirements just because of the fork. What you can do is to simply allocate more memory than you were using before and keep writing.
The OOM killer is a nasty hack, it essentially moves the decision about what stays and what goes to a process that is making calls way above its pay grade, but overcommit and OOM go hand in hand.
It does not happen using fork()/exec() as described above. For it to happen we would need to fork() and continue using old variables and data buffers in the child that we used in the parent, which is a valid but rarely used pattern.
Please read the parent comments. Overcommit is necessary exactly because the kernel has to reserve memory for both processes, and overcommit allows it to reserve more memory than is physically present.
If the kernel did not have to reserve memory for the forked process, overcommit would not be necessary.
This is a misconception you and parent are perpetuating. fork() existed in this problematic 2x memory implementation _way_ before overcommit, and overcommit was non-existent or disabled on Unix (which has fork()) before Linux made it the default. Today with CoW we don't even have this "reserve memory for forked process" problem, so overcommit does nothing for us with regard to fork()/exec() (to say nothing of the vfork()/clone() point others have brought up). But if you want you can still disable overcommit on linux and observe that your apps can still create new processes.
What overcommit enables is more efficient use of memory for applications that request more memory than they use (which is most of them) and more efficient use of page cache. It also pretty much guarantees an app gets memory when it asks for it, at the cost of getting oom-killed later if the system as a whole runs out.
I think you've got it backwards: With overcommit, there is no memory reservation. The forked process gets an exact copy of the other's page table, but with all writable memory marked as copy-on-write instead. The kernel might well be tallying these up to some number, but nothing important happens with it.
Only without overcommit does the kernel need to start accounting for hypothetically-writable memory before it actually is written to.
But large fraction, if all you do afterwards is an exec call. Given 8 bytes per page table entry and 4k pages, it's 1/512 memory wasted. So if your process uses 8GB, it's 16MB. Still takes noticeable time if you spawn often.
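The 1/512 figure can be checked with a few lines of arithmetic. This sketch counts only last-level page table entries and ignores the upper levels of the table, which add a small amount on top; `page_table_bytes` is a name chosen for illustration.

```python
def page_table_bytes(mapped_bytes, page_size=4096, entry_size=8):
    """Bytes of last-level page table entries needed to map `mapped_bytes`."""
    pages = -(-mapped_bytes // page_size)      # ceiling division
    return pages * entry_size


GIB = 1024 ** 3
MIB = 1024 ** 2
print(page_table_bytes(8 * GIB) // MIB, "MiB")   # prints "16 MiB"
```

With 8 bytes per entry and 4096-byte pages the ratio is 8/4096 = 1/512, so an 8GB mapping needs about 16MB of entries that fork() has to copy.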
I've never had the page tables be the cause of out of memory issues. Besides, they are usually pre-allocated to avoid recursive page faults, and nothing would stop you from making the page tables themselves copy-on-write during a fork as well.
Aren't page tables nested? I don't know if any OS or hardware architecture actually supports it, but I could imagine the parent-level page table being virtual and copy-on-write itself.
CoW is a strategy where you don't actually copy memory until you write to it. So, when the 10GB process spawns a child process, that child process also has 10GB of virtual memory, but both processes are backed by the same pages. It's only when one of them writes to a page that a copy happens. When you fork+exec you never actually touch most of those pages, so you never actually pay for them.
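The visible effect of CoW, that a write in one process never shows up in the other, can be demonstrated with a short sketch (Unix only; `child_write_is_private` is a hypothetical name). It shows the isolation semantics, not the page-level bookkeeping behind them.

```python
import os


def child_write_is_private():
    """Fork, let the child overwrite a buffer, and report what each side sees."""
    buf = bytearray(b"parent data")
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:                              # child
        try:
            os.close(read_fd)
            buf[:6] = b"CHILD!"               # lands in the child's own copy of the page
            os.write(write_fd, bytes(buf))
        finally:
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as src:
        child_view = src.read()
    os.waitpid(pid, 0)
    return bytes(buf), child_view             # parent's buffer is unchanged
```

`child_write_is_private()` returns `(b"parent data", b"CHILD! data")`: the parent still sees its original bytes after the child has modified its copy.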
(Obviously, that's the super-simplified version, and I don't fully understand the subtleties involved, but that's exactly what GP means: it's harder to analyse)
To make it slightly more complicated: you don't pay for the 10 GB directly, but you still pay for setting up the metadata, and that scales with the amount of virtual memory used.
We actually ran into this a long time ago with Solaris and Java.
Java has JSP, Java Server Pages. JSP processing translates a JSP file into Java source code, compiles it, then caches, loads, and executes the resulting class file.
Back then, the server would invoke the javac compiler through a standard fork and exec.
That’s all well and good, save when you have a large app server image sucking up the majority of the machine. As far as we could tell, it was a copy on write kind of process, it didn’t actually try to do the actual work when forking the app server. Rather it tried to do the allocation, found it didn’t have the space or swap, and just failed with a system OOM error (which differs from a Java out of memory/heap error).
As I recall adding swap was the short term fix (once we convinced the ops guy that, yes it was possible to “run out of memory”). Long term we made sure all of our JSPs were pre-compiled.
Later, this became a non issue for a variety of reasons, including being able to run the compiler natively within a running JVM.
I find your writing style really pleasant and understandable! Much more so than the StackExchange answer. I really like the breakdown into steps, then what could happen steps and the follow ups. Where can I read more (from you?) in this style about OS and memory management?
posix_spawn isn't powerful enough since it only supports a limited set of process setup operations. So if you need to do some specific syscalls before exec that aren't covered by it then the fork/exec dance is still necessary.
In principle one can use vfork+exec instead, but that is very hard to get right.
The Windows example is a non-sequitur as in both cases, you end up with the 36K sized process both in Windows and Linux if you want to spawn a sub-process that exec's. The fork() => exec() path is not the issue (if there is an issue at all here), and if you use threading the memory is not forked like this to start with (on either of the OSes).
I guess the case you want to highlight is more if you for example mmap() 10 GB of RAM on that 16 GB machine that only has 5 GB unused swap space left and where all of the physical RAM is filled with dirty pages already. Should the mmap() succeed, and then the process is killed if it eventually tries to use more pages than will fit in RAM or the backing swap? This is the overcommit option which is selectable on Linux. I think the defaults seem pretty good and accept that a process can get killed long after the "explicit" memory mapping call is done.
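On Linux the selectable option is the `vm.overcommit_memory` sysctl, readable at `/proc/sys/vm/overcommit_memory`. A small sketch for checking which mode a machine is in; `overcommit_mode` and `MODES` are names made up here, and the function returns `None` on systems where the file does not exist.

```python
from pathlib import Path

# Meaning of the three values of vm.overcommit_memory on Linux.
MODES = {
    0: "heuristic overcommit (the default)",
    1: "always overcommit",
    2: "strict accounting, no overcommit",
}


def overcommit_mode(path="/proc/sys/vm/overcommit_memory"):
    """Return the Linux overcommit mode as an int, or None if it can't be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    mode = overcommit_mode()
    print(mode, MODES.get(mode, "unknown or not Linux"))
```

In mode 2 the mmap() in the scenario above would be refused up front instead of succeeding and risking a kill later.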
> blame UNIX for that, and the fork() system call.
Given that most code I have seen would not be able to handle an allocation failure gracefully, I wouldn't call it "blame". If the OS just silently failed memory allocations on whatever program tried to allocate next, you would basically end up with a system where random applications crash, which is similar to what the OOM killer does, just with no attempt to be smart about it. Even better, it is outright impossible to gracefully handle allocation failures in some languages; see for example variable length arrays in C.
No code is written to handle allocation failure because it knows that it's running on an OS with overcommit where handling allocation failure is impossible. Overcommit means that you encounter the problem not when you call `malloc()` but when you do `*pointer = value;`, which is impossible to handle.
> Also, why would you bother to handle it gracefully when the OS won't allow you to do it?
There are many situations where you can get an allocation failure even with over provisioning enabled.
> Just don't use VLAs then? "Problem" solved.
Yes, just don't use that language feature that is visually identical to a normal array. Then make sure that your standard library implementation doesn't have random malloc calls hidden in functions that cannot communicate an error and abort instead https://www.thingsquare.com/blog/articles/rand-may-call-mall.... Then ensure that your dependencies follow the same standards of handling allocation failures ... .
I concede that it might be possible, but you are working against an ecosystem that is actively trying to sabotage you.
A small reminder: In the age of Unix, multiuser systems were very common. Fork was the optimal solution for serving as many concurrent users or programs as possible while keeping the implementation simple.
So much of design constraints in our base abstractions are not relevant today but we're still cobbling together solutions built on legacy technical decisions.