> some systems, such as Linux, won't fail a memory allocation but instead crash you when you try to use it
False.
1) You can disable overcommit, and there are many of us that do this as a matter of course on all our Linux servers.
2) malloc can fail because of process limits (e.g. setrlimit or cgroups).
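To illustrate 2), here's a minimal sketch (POSIX; the 64 MiB cap and 1 MiB request size are arbitrary illustration values): with RLIMIT_AS capped via setrlimit, malloc returns NULL instead of the process being OOM-killed, regardless of the overcommit setting.

    /* Minimal sketch: cap the address space, then allocate until malloc
     * fails cleanly.  Never touches the memory, never gets OOM-killed. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/resource.h>

    int main(void) {
        struct rlimit rl = { 64u << 20, 64u << 20 };  /* ~64 MiB address-space cap */
        if (setrlimit(RLIMIT_AS, &rl) != 0) {
            perror("setrlimit");
            return 1;
        }
        size_t total = 0;
        for (;;) {
            void *p = malloc(1 << 20);                /* 1 MiB per request */
            if (p == NULL) {                          /* allocation fails, no crash */
                fprintf(stderr, "malloc failed after ~%zu MiB\n", total >> 20);
                return 0;
            }
            total += 1 << 20;
        }
    }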
I don't program in C++, but I do use RAII-like patterns in C. By that I mean that when I create and initialize objects, all the necessary internal resources--particularly those that rely on dynamic allocation--are also created and initialized in the same routine.
That means most places where memory allocation can fail are grouped closely together into a handful of initialization routines, and the places where allocation failure results in unwinding of a logical task are even fewer. (While C doesn't automatically release resources, following an RAII-like pattern means deallocations are just as structured and orderly as allocations.)
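To make that concrete, here's a rough sketch of what I mean in C (the struct and its fields are invented for illustration): every dynamic resource the object owns is acquired in one init routine and released in one destroy routine, so a failure partway through construction unwinds to a clean error and the rest of the program never sees a half-built object.

    #include <stdlib.h>
    #include <string.h>

    struct parser {
        char   *buf;
        size_t *offsets;
        size_t  cap;
    };

    /* All allocations grouped here; returns 0 on success, -1 on failure. */
    int parser_init(struct parser *p, size_t cap) {
        memset(p, 0, sizeof *p);
        p->buf = malloc(cap);
        if (p->buf == NULL)
            goto fail;
        p->offsets = calloc(cap, sizeof *p->offsets);
        if (p->offsets == NULL)
            goto fail;
        p->cap = cap;
        return 0;                 /* fully constructed */
    fail:
        free(p->buf);             /* free(NULL) is a no-op, so partial failure is fine */
        memset(p, 0, sizeof *p);
        return -1;                /* caller decides how far to unwind */
    }

    /* Deallocation is just as structured as allocation. */
    void parser_destroy(struct parser *p) {
        free(p->offsets);
        free(p->buf);
        memset(p, 0, sizeof *p);
    }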
I can understand bailing on allocation failure in scripting languages--not only is there much more dynamic allocation taking place, but allocation happens piecemeal all over the program, and often in a very unstructured manner (strings of variable length generated all over the place). Furthermore, script execution often occurs in a context that can be isolated and therefore unwound at that interface boundary--i.e. a C- or C++-based service executing a script to handle a transaction.
But in languages like C++ and Rust, especially for infrastructure software and libraries, it's a sin IMO. These are languages intended for use in situations where you can carefully structure your code, they make it trivial to minimize the number of allocations (because they're POD-oriented), and they permit one to group and isolate allocations in ways (e.g. RAII) that make it practical (if not trivial) to unwind and recover program state.[1]
But why even bother?
1) Because these languages are often used in situations where failure matters. A core piece of system software that falls over on malloc failure is a core piece of system software that is unreliable, and programs that rely on it can behave in unpredictable and even insecure ways.
1.a) Go authors take the view that reliability comes from running multiple instances in the cloud. Yes, that's one way, but it's not the only way, not always an option, and in any event anybody with enough experience dealing with "glitches" in cloud services understands that at least in terms of QoS there's no substitute for well-written, reliable service instances.
1.b) Security. It's often trivial to create memory pressure on a box. OOM killers are notorious for killing random processes, and even without overcommit the order of allocations across all processes is non-deterministic. Therefore, not handling OOM gives attackers a way to selectively kill critical services on a box. Disruption of core services can tickle bugs across the system.
2) Overcommit is an evil all its own. It leads to the equivalent of buffer bloat. Overcommit makes it difficult if not impossible to respond to memory resource backpressure. This leads to reliance on magic numbers and hand-tweaking various sorts of internal limits of programs. We've come full circle to the 1980s where enterprise software no longer scales automatically (which for a brief period in the late 1990s and early 2000s was a real goal, often achieved), but instead requires lots of knob turning to become minimally reliable. Nobody questions this anymore. (Ironically, Linux helped lead the way to knob-free enterprise operating systems by making kernel data structures like process tables and file descriptor tables fully dynamic rather than statically sized at compile or boot time, so the kernel automatically scaled from PCs to huge servers without requiring a sysadmin to tweak this stuff. Notably, Linux doesn't just crash if it can't allocate a new file descriptor entry, nor do most programs immediately crash when open fails. Even more ironically, overcommit on Linux was originally justified to support old software which preallocated huge buffers; but such software was written in an era where sysadmins were expected to tailor hardware and system resources to the software application. Overcommit has perpetuated the original sin.)
Not all software needs to handle OOM. Not even all C, C++, or Rust components. But for infrastructure software OOM should be handled no differently than file or network errors--subcomponents should be capable of maintaining consistent state and bubbling the error upward to let the larger application make the decision. And if you're not writing critical infrastructure software, why are you using these languages anyhow?[2] If a language or framework doesn't permit components to do this, then they're fundamentally flawed, at least insofar as they're used (directly or indirectly) for critical services. You wouldn't expect the Linux kernel to panic on OOM (although some poorly written parts will, causing no end of headaches). You wouldn't expect libc to panic on OOM. There's no categorical dividing line beyond which developers are excused from caring about such issues.
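Concretely, "bubbling the error upward" just means a subcomponent reports OOM the same way it reports an I/O error--by returning it, with its own state left consistent. A sketch (names hypothetical; assumes a POSIX realloc that sets errno):

    #include <errno.h>
    #include <stdlib.h>
    #include <string.h>

    struct linebuf {
        char  *data;
        size_t len, cap;
    };

    /* Returns 0 on success, -1 with errno set (ENOMEM) on failure; the
     * buffer's existing contents are untouched either way. */
    int linebuf_append(struct linebuf *b, const char *s, size_t n) {
        if (b->len + n > b->cap) {
            size_t newcap = b->cap ? b->cap * 2 : 64;
            while (newcap < b->len + n)
                newcap *= 2;
            char *p = realloc(b->data, newcap);
            if (p == NULL)
                return -1;        /* report OOM like any other error */
            b->data = p;
            b->cap = newcap;
        }
        memcpy(b->data + b->len, s, n);
        b->len += n;
        return 0;
    }

The caller treats that -1 exactly like a failed read or write: drop the transaction, shed load, or propagate further up--the library doesn't get to make that call.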
[1] Granted, Rust is a little more difficult as it's hostile to allocation-free linked-lists and trees, such as in BSD <sys/queue.h> and <sys/tree.h>. Hash tables require [potential] insertion-time allocation. Still, it's not insurmountable. Many forms of dynamic allocation that can't be rolled into a task-specific initialization phase, such as buffers and caches, are colocated with other operations, like file operations, which already must handle spurious runtime failures, and so the failure paths can be shared.
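For comparison, this is the kind of allocation-free intrusive structure I mean, sketched with BSD <sys/queue.h> (the element type is invented): the links live inside the element, so insertion never allocates and therefore can never fail.

    #include <sys/queue.h>

    struct conn {
        int fd;
        TAILQ_ENTRY(conn) link;           /* linkage embedded in the element */
    };

    TAILQ_HEAD(conn_list, conn);

    /* Cannot fail: no malloc anywhere on this path. */
    void conn_track(struct conn_list *list, struct conn *c) {
        TAILQ_INSERT_TAIL(list, c, link);
    }

    /* usage: struct conn_list list; TAILQ_INIT(&list); conn_track(&list, &some_conn); */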
[2] Maybe for performance? Performance-critical tasks are easily farmed out to libraries, and libraries are particularly well suited to handling OOM gracefully by unwinding state up to the interface boundary.
Linux, in most distributions, enables overcommit. That is a fact that anyone distributing software is going to have to deal with. Saying that you personally choose to disable it whenever possible doesn't make that fact go away.
> But for infrastructure software OOM should be handled no different than file or network errors--subcomponents should be capable of maintaining consistent state and bubbling the error upward to let the larger application make the decision.
OOM, to me, is more like a null pointer dereference or a division by zero. If it happens, it's because I as a programmer screwed up, either by leaking memory, having a data structure that needs to be disk-based instead of memory-based, or by failing to consider resource bounds.
The problem with trying to handle memory allocation is that a) it can happen just about anywhere, b) you have to handle it without allocating any more memory, and c) if there is a single place where you forget to handle allocation failure, your program is not robust to OOM errors. I rather expect that the fraction of programs that could not be crashed with a malicious allocator that returns allocation failure at the worst possible time is well south of 0.01%.
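For the curious, such an allocator is easy to approximate with an LD_PRELOAD shim along these lines (glibc-specific sketch; the countdown is arbitrary, and it glosses over the usual corner cases of interposing malloc):

    /* Build: cc -shared -fPIC -o failmalloc.so failmalloc.c -ldl
     * Run:   FAIL_AFTER=1000 LD_PRELOAD=./failmalloc.so ./program */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdlib.h>

    static long countdown = -1;

    void *malloc(size_t n) {
        static void *(*real_malloc)(size_t);
        if (real_malloc == NULL) {
            real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
            const char *s = getenv("FAIL_AFTER");
            countdown = s ? atol(s) : -1;
        }
        if (countdown > 0 && --countdown == 0)
            return NULL;                  /* injected failure */
        return real_malloc(n);
    }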
That's like saying a NULL pointer dereference or division by zero can happen anywhere. Only for poorly written code that doesn't think through and maintain its invariants. Languages like C++ and Rust make it easier to check your invariants, but plenty of C code does this, including kernels and complex libraries. And they don't do it by inserting assertions before every pointer dereference or division operation.
As I said, the RAII-pattern is one way to keep your invariants such that most allocations only happen at resource initialization points, not at every point in a program where the object is manipulated.
> b) you have to handle it without allocating any more memory,
Unwinding state doesn't need more memory, not in languages like C, C++, or Rust. If it did, the kernel would panic whenever it cleaned up after a program that terminated while out of memory.
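Concretely, an unwind path in C looks like this sketch (names hypothetical): it only calls free() and close(), neither of which allocates, so the OOM case is handled with exactly the same machinery as the open() failure.

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int handle_request(const char *path) {
        int   rc  = -1;
        int   fd  = -1;
        char *buf = NULL;

        fd = open(path, O_RDONLY);
        if (fd < 0)
            goto out;
        buf = malloc(4096);
        if (buf == NULL)          /* OOM: unwind exactly like the open() failure */
            goto out;

        /* ... do the actual work ... */
        rc = 0;
    out:
        free(buf);                /* free(NULL) is a no-op */
        if (fd >= 0)
            close(fd);
        return rc;                /* caller sees an ordinary error */
    }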
This argument is a common refrain from GUI programmers who wish to throw up a notification window. But that's distinct from recovering to a steady state.
In those particular contexts where recovery isn't possible or practical, then you can't recover. Again, scripting languages are an obvious example, though some, like Lua, handle OOM and let you recover at the C API boundary. (Notably, in Lua's case the Lua VM context remains valid and consistent, and in fact you can handle OOM gracefully purely within Lua, but only at points where the VM is designed to be a recovery point, such as at pcall or coroutine.resume.) But the existence of such contexts doesn't mean recovery is never possible or even rarely possible.
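A sketch of the Lua case from the host side (assumes Lua 5.2+ headers; the error handling is deliberately minimal): lua_pcall reports allocation failure as LUA_ERRMEM and the lua_State remains consistent, so the host simply fails the one transaction and keeps the VM.

    #include <lauxlib.h>
    #include <lua.h>
    #include <stdio.h>

    int run_transaction(lua_State *L, const char *script) {
        if (luaL_loadstring(L, script) != LUA_OK) {
            lua_pop(L, 1);                /* pop the compile error message */
            return -1;
        }
        switch (lua_pcall(L, 0, 0, 0)) {
        case LUA_OK:
            return 0;
        case LUA_ERRMEM:
            /* VM is still usable; drop this transaction and carry on */
            fprintf(stderr, "transaction aborted: out of memory\n");
            lua_pop(L, 1);
            return -1;
        default:
            fprintf(stderr, "script error: %s\n", lua_tostring(L, -1));
            lua_pop(L, 1);
            return -1;
        }
    }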
> c) if there is a single place where you forget to handle allocation failure, your program is not robust to OOM errors.
If you keep an RAII pattern then handling OOM is little different than deallocating objects, period. In that case, your statement is the equivalent of saying that because memory bugs exist, no program should bother deallocating memory at all.
Now, I've seen programs that can't handle deallocation; programs that were written to be invoked as one-shot command-line utilities and never concerned themselves with memory leaks. Trying to fix them after the fact so they can run from a service process is indeed usually a hopeless endeavor. Likewise, trying to fix a program that didn't concern itself with OOM is also a hopeless endeavor. But it doesn't follow that when starting a project from scratch one shouldn't bother with OOM at all, any more than it follows that nobody should bother with deallocation.
The objections people have to OOM handling are self-fulfilling. When programmers don't consider or outright reject OOM handling then of course their code will be littered with logic that implicitly or explicitly relies on the invariant of malloc never failing. So what? You can't infer anything from that other than that programs only work well, if at all, when the implicit or explicit assumptions they were written under continue to hold.