> some systems, such as Linux, won't fail a memory allocation but instead crash you when you try to use it
False.
1) You can disable overcommit, and there are many of us that do this as a matter of course on all our Linux servers.
2) malloc can fail because of process limits (e.g. setrlimit or cgroups).
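To illustrate 2), here's a minimal sketch (POSIX; the 64 MiB cap and 1 MiB request size are arbitrary illustration values): with RLIMIT_AS capped via setrlimit, malloc returns NULL instead of the process being OOM-killed, regardless of the overcommit setting.

    /* Minimal sketch: cap the address space, then allocate until malloc
     * fails cleanly.  Never touches the memory, never gets OOM-killed. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/resource.h>

    int main(void) {
        struct rlimit rl = { 64u << 20, 64u << 20 };  /* ~64 MiB address-space cap */
        if (setrlimit(RLIMIT_AS, &rl) != 0) {
            perror("setrlimit");
            return 1;
        }
        size_t total = 0;
        for (;;) {
            void *p = malloc(1 << 20);                /* 1 MiB per request */
            if (p == NULL) {                          /* allocation fails, no crash */
                fprintf(stderr, "malloc failed after ~%zu MiB\n", total >> 20);
                return 0;
            }
            total += 1 << 20;
        }
    }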
I don't program in C++, but I do use RAII-like patterns in C. By that I mean that when I create and initialize objects, all the necessary internal resources--particularly those that rely on dynamic allocation--are also created and initialized in the same routine.
That means most places where memory allocation can fail are grouped closely together into a handful of initialization routines, and the places where allocation failure results in unwinding of a logical task are even fewer. (While C doesn't automatically release resources, following an RAII-like pattern means deallocations are just as structured and orderly as allocations.)
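To make that concrete, here's a rough sketch of what I mean in C (the struct and its fields are invented for illustration): every dynamic resource the object owns is acquired in one init routine and released in one destroy routine, so a failure partway through construction unwinds to a clean error and the rest of the program never sees a half-built object.

    #include <stdlib.h>
    #include <string.h>

    struct parser {
        char   *buf;
        size_t *offsets;
        size_t  cap;
    };

    /* All allocations grouped here; returns 0 on success, -1 on failure. */
    int parser_init(struct parser *p, size_t cap) {
        memset(p, 0, sizeof *p);
        p->buf = malloc(cap);
        if (p->buf == NULL)
            goto fail;
        p->offsets = calloc(cap, sizeof *p->offsets);
        if (p->offsets == NULL)
            goto fail;
        p->cap = cap;
        return 0;                 /* fully constructed */
    fail:
        free(p->buf);             /* free(NULL) is a no-op, so partial failure is fine */
        memset(p, 0, sizeof *p);
        return -1;                /* caller decides how far to unwind */
    }

    /* Deallocation is just as structured as allocation. */
    void parser_destroy(struct parser *p) {
        free(p->offsets);
        free(p->buf);
        memset(p, 0, sizeof *p);
    }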
I can understand bailing on allocation failure in scripting languages--not only is there much more dynamic allocation taking place, but allocation happens piecemeal all over the program, and often in a very unstructured manner (strings of variable length generated all over the place). Furthermore, script execution often occurs in a context that can be isolated and therefore unwound at that interface boundary--i.e. a C- or C++-based service executing a script to handle a transaction.
But in languages like C++ and Rust, especially for infrastructure software and libraries, it's a sin IMO. These are languages intended for use in situations where you can carefully structure your code, they make it trivial to minimize the number of allocations (because they're POD-oriented), and they permit one to group and isolate allocations in ways (e.g. RAII) that make it practical (if not trivial) to unwind and recover program state.[1]
But why even bother?
1) Because these languages are often used in situations where failure matters. A core piece of system software that falls over on malloc failure is a core piece of system software that is unreliable, and programs that rely on it can behave in unpredictable and even insecure ways.
1.a) Go authors take the view that reliability comes from running multiple instances in the cloud. Yes, that's one way, but it's not the only way, not always an option, and in any event anybody with enough experience dealing with "glitches" in cloud services understands that at least in terms of QoS there's no substitute for well-written, reliable service instances.
1.b) Security. It's often trivial to create memory pressure on a box. OOM killers are notorious for killing random processes, and even without overcommit the order of allocations across all processes is non-deterministic. Therefore, not handling OOM gives attackers a way to selectively kill critical services on a box. Disruption of core services can tickle bugs across the system.
2) Overcommit is an evil all its own. It leads to the equivalent of buffer bloat. Overcommit makes it difficult if not impossible to respond to memory resource backpressure. This leads to reliance on magic numbers and hand-tweaking various sorts of internal limits of programs. We've come full circle to the 1980s where enterprise software no longer scales automatically (which for a brief period in the late 1990s and early 2000s was a real goal, often achieved), but instead requires lots of knob turning to become minimally reliable. Nobody questions this anymore. (Ironically, Linux helped lead the way to knob-free enterprise operating systems by making kernel data structures like process tables and file descriptor tables fully dynamic rather than statically sized at compile or boot time, so the kernel automatically scaled from PCs to huge servers without requiring a sysadmin to tweak this stuff. Notably, Linux doesn't just crash if it can't allocate a new file descriptor entry, nor do most programs immediately crash when open fails. Even more ironically, overcommit on Linux was originally justified to support old software which preallocated huge buffers; but such software was written in an era where sysadmins were expected to tailor hardware and system resources to the software application. Overcommit has perpetuated the original sin.)
Not all software needs to handle OOM. Not even all C, C++, or Rust components. But for infrastructure software OOM should be handled no differently than file or network errors--subcomponents should be capable of maintaining consistent state and bubbling the error upward to let the larger application make the decision. And if you're not writing critical infrastructure software, why are you using these languages anyhow?[2] If a language or framework doesn't permit components to do this, then they're fundamentally flawed, at least insofar as they're used (directly or indirectly) for critical services. You wouldn't expect the Linux kernel to panic on OOM (although some poorly written parts will, causing no end of headaches). You wouldn't expect libc to panic on OOM. There's no categorical dividing line beyond which developers are excused from caring about such issues.
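Concretely, "bubbling the error upward" just means a subcomponent reports OOM the same way it reports an I/O error--by returning it, with its own state left consistent. A sketch (names hypothetical; assumes a POSIX realloc that sets errno):

    #include <errno.h>
    #include <stdlib.h>
    #include <string.h>

    struct linebuf {
        char  *data;
        size_t len, cap;
    };

    /* Returns 0 on success, -1 with errno set (ENOMEM) on failure; the
     * buffer's existing contents are untouched either way. */
    int linebuf_append(struct linebuf *b, const char *s, size_t n) {
        if (b->len + n > b->cap) {
            size_t newcap = b->cap ? b->cap * 2 : 64;
            while (newcap < b->len + n)
                newcap *= 2;
            char *p = realloc(b->data, newcap);
            if (p == NULL)
                return -1;        /* report OOM like any other error */
            b->data = p;
            b->cap = newcap;
        }
        memcpy(b->data + b->len, s, n);
        b->len += n;
        return 0;
    }

The caller treats that -1 exactly like a failed read or write: drop the transaction, shed load, or propagate further up--the library doesn't get to make that call.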
[1] Granted, Rust is a little more difficult as it's hostile to allocation-free linked-lists and trees, such as in BSD <sys/queue.h> and <sys/tree.h>. Hash tables require [potential] insertion-time allocation. Still, it's not insurmountable. Many forms of dynamic allocation that can't be rolled into a task-specific initialization phase, such as buffers and caches, are colocated with other operations, like file operations, which already must handle spurious runtime failures, and so the failure paths can be shared.
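For comparison, this is the kind of allocation-free intrusive structure I mean, sketched with BSD <sys/queue.h> (the element type is invented): the links live inside the element, so insertion never allocates and therefore can never fail.

    #include <sys/queue.h>

    struct conn {
        int fd;
        TAILQ_ENTRY(conn) link;           /* linkage embedded in the element */
    };

    TAILQ_HEAD(conn_list, conn);

    /* Cannot fail: no malloc anywhere on this path. */
    void conn_track(struct conn_list *list, struct conn *c) {
        TAILQ_INSERT_TAIL(list, c, link);
    }

    /* usage: struct conn_list list; TAILQ_INIT(&list); conn_track(&list, &some_conn); */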
[2] Maybe for performance? Performance-critical tasks are easily farmed out to libraries, and libraries are particularly well suited to handling OOM gracefully by unwinding state up to the interface boundary.
Linux, in most distributions, enables overcommit. That is a fact that anyone distributing software is going to have to deal with. Saying that you personally choose to disable it whenever possible doesn't make that fact go away.
> But for infrastructure software OOM should be handled no different than file or network errors--subcomponents should be capable of maintaining consistent state and bubbling the error upward to let the larger application make the decision.
OOM, to me, is more like a null pointer dereference or a division by zero. If it happens, it's because I as a programmer screwed up, either by leaking memory, having a data structure that needs to be disk-based instead of memory-based, or by failing to consider resource bounds.
The problem with trying to handle memory allocation is that a) it can happen just about anywhere, b) you have to handle it without allocating any more memory, and c) if there is a single place where you forget to handle allocation failure, your program is not robust to OOM errors. I rather expect that the fraction of programs that could not be crashed with a malicious allocator that returns allocation failure at the worst possible time is well south of 0.01%.
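For the curious, such an allocator is easy to approximate with an LD_PRELOAD shim along these lines (glibc-specific sketch; the countdown is arbitrary, and it glosses over the usual corner cases of interposing malloc):

    /* Build: cc -shared -fPIC -o failmalloc.so failmalloc.c -ldl
     * Run:   FAIL_AFTER=1000 LD_PRELOAD=./failmalloc.so ./program */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdlib.h>

    static long countdown = -1;

    void *malloc(size_t n) {
        static void *(*real_malloc)(size_t);
        if (real_malloc == NULL) {
            real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
            const char *s = getenv("FAIL_AFTER");
            countdown = s ? atol(s) : -1;
        }
        if (countdown > 0 && --countdown == 0)
            return NULL;                  /* injected failure */
        return real_malloc(n);
    }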
That's like saying a NULL pointer dereference or division by zero can happen anywhere. Only for poorly written code that doesn't think through and maintain its invariants. Languages like C++ and Rust make it easier to check your invariants, but plenty of C code does this, including kernels and complex libraries. And they don't do it by inserting assertions before every pointer dereference or division operation.
As I said, the RAII-pattern is one way to keep your invariants such that most allocations only happen at resource initialization points, not at every point in a program where the object is manipulated.
> b) you have to handle it without allocating any more memory,
Unwinding state doesn't need more memory, not in languages like C, C++, or Rust. If it did, the kernel would panic whenever it cleaned up after a program that terminated while out of memory.
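Concretely, an unwind path in C looks like this sketch (names hypothetical): it only calls free() and close(), neither of which allocates, so the OOM case is handled with exactly the same machinery as the open() failure.

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int handle_request(const char *path) {
        int   rc  = -1;
        int   fd  = -1;
        char *buf = NULL;

        fd = open(path, O_RDONLY);
        if (fd < 0)
            goto out;
        buf = malloc(4096);
        if (buf == NULL)          /* OOM: unwind exactly like the open() failure */
            goto out;

        /* ... do the actual work ... */
        rc = 0;
    out:
        free(buf);                /* free(NULL) is a no-op */
        if (fd >= 0)
            close(fd);
        return rc;                /* caller sees an ordinary error */
    }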
This argument is a common refrain from GUI programmers who wish to throw up a notification window. But that's distinct from recovering to a steady state.
In those particular contexts where recovery isn't possible or practical, then you can't recover. Again, scripting languages are an obvious example, though some, like Lua, handle OOM and let you recover at the C API boundary. (Notably, in Lua's case the Lua VM context remains valid and consistent, and in fact you can handle OOM gracefully purely within Lua, but only at points where the VM is designed to be a recovery point, such as at pcall or coroutine.resume.) But the existence of such contexts doesn't mean recovery is never possible or even rarely possible.
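A sketch of the Lua case from the host side (assumes Lua 5.2+ headers; the error handling is deliberately minimal): lua_pcall reports allocation failure as LUA_ERRMEM and the lua_State remains consistent, so the host simply fails the one transaction and keeps the VM.

    #include <lauxlib.h>
    #include <lua.h>
    #include <stdio.h>

    int run_transaction(lua_State *L, const char *script) {
        if (luaL_loadstring(L, script) != LUA_OK) {
            lua_pop(L, 1);                /* pop the compile error message */
            return -1;
        }
        switch (lua_pcall(L, 0, 0, 0)) {
        case LUA_OK:
            return 0;
        case LUA_ERRMEM:
            /* VM is still usable; drop this transaction and carry on */
            fprintf(stderr, "transaction aborted: out of memory\n");
            lua_pop(L, 1);
            return -1;
        default:
            fprintf(stderr, "script error: %s\n", lua_tostring(L, -1));
            lua_pop(L, 1);
            return -1;
        }
    }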
> c) if there is a single place where you forget to handle allocation failure, your program is not robust to OOM errors.
If you keep an RAII pattern then handling OOM is little different than deallocating objects, period. In that case, your statement is the equivalent of saying that because memory bugs exist, no program should bother deallocating memory at all.
Now, I've seen programs that can't handle deallocation; programs that were written to be invoked as one-shot command-line utilities and never concerned themselves with memory leaks. Trying to fix them after the fact so they can run from a service process is indeed usually a hopeless endeavor. Likewise, trying to fix a program that didn't concern itself with OOM is also a hopeless endeavor. But it doesn't follow that when starting a project from scratch one shouldn't bother with OOM at all, any more than it follows that nobody should bother with deallocation.
The objections people have to OOM handling are self-fulfilling. When programmers don't consider or outright reject OOM handling then of course their code will be littered with logic that implicitly or explicitly relies on the invariant of malloc never failing. So what? You can't infer anything from that other than that programs only work well, if at all, when the implicit or explicit assumptions they were written under continue to hold.