I think it was when I was reading the zlib source code, while dinosaurs roamed outside my college window, that I first saw a variant of this: a library that statically allocated memory for a graceful shutdown routine before doing anything else, so that even if memory were later exhausted by a bug in other code, it could still exit cleanly.
It's still done. I did the code review for the guy who wrote it for our server: if we're crashing due to an OOM situation, we still have enough memory to write out a minidump.
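Something like this sketch of the reserve-then-release pattern (toy code, not the zlib or minidump implementation; the names and the 1 MiB size are made up):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical emergency reserve, sized for whatever the shutdown path
 * needs (formatting a crash report, flushing buffers, etc.). */
#define RESERVE_SIZE (1 << 20)

static void *emergency_reserve;

static void init_emergency_reserve(void) {
    emergency_reserve = malloc(RESERVE_SIZE);
    if (emergency_reserve == NULL) {
        fputs("cannot even reserve shutdown memory, aborting\n", stderr);
        abort();
    }
    /* Touch it so it is actually backed by pages on overcommitting
     * systems (see the virtual-memory discussion further down). */
    memset(emergency_reserve, 0, RESERVE_SIZE);
}

/* Called when an allocation fails later on. */
static void out_of_memory_handler(void) {
    /* Give the reserve back so the cleanup path has headroom to
     * allocate: write a minidump, flush logs, close files, ... */
    free(emergency_reserve);
    emergency_reserve = NULL;

    fputs("out of memory: shutting down cleanly\n", stderr);
    exit(1);
}

int main(void) {
    init_emergency_reserve();

    /* Normal operation; on any failed allocation, bail out cleanly. */
    void *work = malloc(RESERVE_SIZE);   /* stand-in for real work */
    if (work == NULL) out_of_memory_handler();

    free(work);
    free(emergency_reserve);
    return 0;
}
```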
TigerBeetle uses Deterministic Simulation Testing to test and keep testing these paths. Fuzzing and static allocation are force multipliers when applied together, because you can now flush out leaks and deadlocks in testing, rather than letting them spill over into production.
Without static allocation, it's a little harder to find leaks in testing, because the limits that would define a leak are not explicit.
Oh, for me too, but I wish it were easier, like something supported seriously by my language of choice, my OS of choice, or my libc of choice, with the flick of a flag... I don't retest the whole libc and kernel every time (although, err, someone should, right? :-)
The OOM killer can be disabled completely on a system, or a process can be excluded from it by setting its OOM score adjustment under `/proc/<pid>` (`oom_score_adj`).
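Concretely, writing -1000 to `/proc/<pid>/oom_score_adj` exempts the process entirely (lowering the value needs CAP_SYS_RESOURCE, so usually root). A quick sketch of a process opting itself out:

```c
#include <stdio.h>

int main(void) {
    /* Exempt this process (and any children, which inherit the value)
     * from the Linux OOM killer. -1000 means "never select me". */
    FILE *f = fopen("/proc/self/oom_score_adj", "w");
    if (f == NULL) {
        perror("open /proc/self/oom_score_adj");
        return 1;
    }
    fprintf(f, "-1000\n");
    if (fclose(f) != 0) {   /* the actual write error surfaces on flush */
        perror("write oom_score_adj");
        return 1;
    }

    /* ... rest of the program ... */
    return 0;
}
```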
It's not generally considered good practice in production, but I think that's largely because most software doesn't do what's being suggested here... in the scenario where you do have software like that, it seems like the right thing to do.
It's definitely good practice in production and is often necessary.
The techniques mentioned above will (perhaps surprisingly) not eliminate errors related to OOM, due to the nature of virtual memory. Your program can OOM at runtime even if you malloc all your resources up front, because allocating address space is not the same thing as allocating memory. In fact, memory can be deallocated (swapped out), and then your application may OOM when it tries to access memory it has previously used successfully.
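A small Linux-only illustration of that distinction (assumes the default overcommit setting; `VmRSS` comes from `/proc/self/status`): the malloc succeeds instantly, but resident memory only grows once the pages are actually touched.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Print this process's resident set size, as reported by the kernel. */
static void print_rss(const char *label) {
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL) return;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "VmRSS:", 6) == 0) {
            printf("%-12s %s", label, line);
            break;
        }
    }
    fclose(f);
}

int main(void) {
    size_t size = (size_t)1 << 30;   /* 1 GiB of address space */

    print_rss("start");
    char *p = malloc(size);          /* address space only; near-instant */
    if (p == NULL) { perror("malloc"); return 1; }
    print_rss("after malloc");       /* RSS barely moves */

    memset(p, 1, size);              /* page faults: the real allocation */
    print_rss("after memset");       /* RSS jumps by roughly 1 GiB */

    free(p);
    return 0;
}
```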
Without looking, I can confidently say that tigerbeetle does in fact dynamically allocate memory -- even if it does not call malloc at runtime.
> It's definitely good practice in production and is often necessary.
I'd be curious if you have any resources/references on what's considered good practice for that now, then.
It's been a long time since I did much ops stuff outside a few personal servers, so my background may well be out of date... but I've certainly heard the opposite in the past. The argument tended to run along the lines that most software doesn't even attempt to recover and continue from a failed malloc, may not be able to shut down cleanly even if it tried, and the kernel may have more insight into how best to recover... so just let it do its thing.
Sure. A lot of systems are single-app systems. They run a single instance of an RDBMS, app, etc., or several instances, all of which represent the sole purpose of the system. In these cases it's generally good practice to disable the OOM killer, because if the app is killed, nothing else on the system has value anyway.
It would be ideal not to tickle the OOM killer at all, but it does happen. A great example is Redis with bgsave. There's a lot to criticize about Redis and bgsave, and I don't mean to defend its architecture; its behavior is fairly extreme, which makes it a useful illustration. Because Redis forks and the child writes the entire in-memory state to disk, the process can appear to the system to have suddenly doubled its memory use. It's a huge, sudden memory-pressure event, exacerbated by any writes forcing CoW allocations while the bgsave runs.
Many other app servers or database systems can see similarly sudden memory pressure. You basically never want the primary app on a system to be killed, so it's usually best to just disable the OOM killer entirely for these processes.
There's often no failed malloc() in these situations: malloc() is really just allocating address space without any attached memory, which is nearly free and almost always succeeds. The failure occurs later, when an allocated page is first written to, causing a page fault and an actual allocation. There's no associated function call or system call; page faults are triggered simply by accessing memory. This is why the OOM killer exists: when memory can't be produced, there's no function call to return a failure from. Such is the idiosyncratic behavior of lazily-allocated memory in modern virtual memory systems.
tldr: Malloc (almost) never fails, because malloc allocates address space, not memory. Memory writes trigger the failures, because writes cause the page faults and the actual allocations. And the failures often surface somewhere else entirely, via the action-at-a-distance magic of the OOM killer.
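If you want "allocate everything up front" to actually stick under that model, the usual mitigations are to pre-fault every page and then pin them so they can't be swapped out again. A rough sketch (Linux; mlockall needs a big enough RLIMIT_MEMLOCK or CAP_IPC_LOCK, and the pool size here is arbitrary):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t size = 256u * 1024 * 1024;   /* everything the app will ever use */

    char *pool = malloc(size);
    if (pool == NULL) { perror("malloc"); return 1; }

    /* Touch every page now, so any failure happens at startup rather
     * than as a page fault in the middle of serving traffic. */
    memset(pool, 0, size);

    /* Pin current and future pages in RAM so they can't be swapped out
     * and faulted back in later. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
    }

    /* ... run entirely out of `pool` from here on ... */

    free(pool);
    return 0;
}
```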
Oh for sure. Not casting any shade at all on your work - I'm really happy to see what you're doing. This kind of thing has a lot of value even in a virtual memory environment.
C++ on Windows. It's an old codebase that has been used to run custom test hardware.
We also target Linux with the same codebase, but the environment is a bit different. On Linux we're using a system slice and we're not doing minidumps.
IIRC, .NET does something similar for OutOfMemoryExceptions. The exception is thrown before actually running out of memory. That allows just enough space for the OutOfMemoryException object (and associated exception info) to be allocated.
I believe the runtime preallocates an OOME on startup, so if one is encountered it can just fill in the blanks and throw without needing any additional allocations.
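A loose C analogue of that idea (not how the CLR actually does it, just the shape of it): the error object and its message buffer exist from startup, so reporting an OOM never allocates.

```c
#include <stdio.h>

/* The "exception" lives in static storage, so raising it needs no heap. */
struct oom_report {
    const char *where;
    char message[256];
};

static struct oom_report preallocated_oom;

static void raise_oom(const char *where) {
    preallocated_oom.where = where;
    snprintf(preallocated_oom.message, sizeof preallocated_oom.message,
             "out of memory in %s", where);
    fprintf(stderr, "%s\n", preallocated_oom.message);
}

int main(void) {
    void *p = NULL;                  /* stand-in for a failed allocation */
    if (p == NULL) raise_oom("main");
    return 0;
}
```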
Iirc Perl does this too. I'm on my phone right now, but iirc $^M was a variable you could fill with about 20k of worthless data that could be freed in case of an out-of-memory condition.