> Crashing and losing all existing connections often merely leads to an avalanch...

wahern · on Oct 13, 2022

In Erlang a "process" is more akin to a thread or fiber in most other languages, and "crashing" more like an exception that is caught just across a logical task boundary. Importantly, in Erlang a process "crashing" doesn't kill all other connections and tasks within the kernel-level Erlang process.

And that's why Erlang is so resilient: the semantics of the language make it easier to isolate tasks and subtasks, minimizing blast radius and the risk of reaching an inconsistent or unrecoverable state. I often use Lua (as a high-level glue language) to accomplish something similar: I can run a task on a coroutine, and if a constraint or assertion throws an error (exceptions in Lua are idiomatically used sparingly, similar to how they're used in Go), it's much easier to recover. This is true even for malloc failures, which can be caught at the same boundaries (pcall, coroutine.resume, etc.) as any other error.[1] It's also common to use multiple separate Lua VM states in the same kernel process, communicating using sockets, data-copying channels, etc., achieving isolation behaviors even closer to what Erlang provides, along with the potential performance costs.

[1] While trickier in a language like Lua, malloc failure is trivially recoverable as long as your steady state (i.e. the event loop, etc.) is free of dynamic allocations. The Lua authors are careful to document which interfaces and operations might allocate, and to keep a core set of operations allocation-free. Of course, you still need to design your own components likewise, ideally such that allocations for connection requests, subtasks, etc. can be front-loaded (RAII style) or, failing that, isolated behind convenient recovery points. Erlang makes much of this discipline perfunctory.
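
To make the "catch it at the task boundary" idea concrete, here is a minimal sketch in Python rather than Lua. It is only an analogy to a pcall-style boundary: `run_isolated`, `handle_request`, and `serve` are hypothetical names, and Python's memory behavior differs from Lua's, so treat the `MemoryError` remark as illustrative.

```python
def run_isolated(task, *args):
    """Run one task behind an error boundary, loosely like Lua's pcall.

    Returns (True, result) on success or (False, error) on failure.
    """
    try:
        return True, task(*args)
    except Exception as err:
        # MemoryError is a subclass of Exception, so it is caught at the
        # same boundary as any other error raised by the task.
        return False, err


def handle_request(payload):
    # Hypothetical task: a violated constraint raises instead of continuing.
    if not isinstance(payload, int):
        raise ValueError("payload must be an int")
    return payload * 2


def serve(payloads):
    # One failing task does not stop the loop or affect the other tasks.
    results = []
    for payload in payloads:
        ok, value = run_isolated(handle_request, payload)
        results.append(value if ok else None)
    return results
```

With this, `serve([1, "x", 3])` keeps going past the bad input: only the failing task is lost, and the loop itself allocates little beyond the results list.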

lmm · on Oct 14, 2022

Isolation is indeed a cornerstone of what makes this style work, but "let it crash" is another one, and just as important IMO. "In this language crashing would take down other connections" does not make it safe to continue without crashing and doesn't remove the need to crash and recover, or something equivalent to that - rather it's an argument for finding a way to separate connection tracking from more complicated logic that might get into unexpected states.
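
One possible shape for that separation, sketched in Python under my own assumptions (the `Handler` and `ConnectionTable` classes are hypothetical, not from any library): the table only tracks connections, while the per-connection logic lives in a disposable object that is thrown away and rebuilt when it fails.

```python
class Handler:
    """Hypothetical per-connection logic whose state might go bad."""

    def __init__(self):
        self.count = 0

    def handle(self, message):
        if message == "bad":
            raise RuntimeError("unexpected state")
        self.count += 1
        return self.count


class ConnectionTable:
    """Tracks connections only; deliberately contains no complicated logic."""

    def __init__(self):
        self.handlers = {}  # connection id -> Handler
        self.restarts = 0

    def open(self, conn_id):
        self.handlers[conn_id] = Handler()

    def dispatch(self, conn_id, message):
        try:
            return self.handlers[conn_id].handle(message)
        except Exception:
            # "Crash" just this handler: discard its state and start fresh.
            # The connection itself stays registered in the table.
            self.handlers[conn_id] = Handler()
            self.restarts += 1
            return None
```

The handler still crashes and recovers from a known-good state; what changes is that the crash no longer takes the connection table with it.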

throwawaymaths · on Oct 14, 2022

Whoa. In the Erlang equivalent here, the client would crash (GenServer.call will time out if the called server is overloaded), not the service.
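
A rough Python analogy of that behavior, using threads and queues instead of Erlang processes. This is a sketch, not how GenServer is implemented; `Server`, `call`, and `CallTimeout` are hypothetical names. The point it shows is that the caller gives up and fails while the slow server keeps running.

```python
import queue
import threading
import time


class CallTimeout(Exception):
    """Raised in the caller when no reply arrives in time."""


class Server:
    def __init__(self, work_seconds):
        self.inbox = queue.Queue()
        self.work_seconds = work_seconds
        self.handled = 0
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while True:
            request, reply_box = self.inbox.get()
            time.sleep(self.work_seconds)  # simulate an overloaded server
            self.handled += 1
            reply_box.put(request * 2)


def call(server, request, timeout):
    # Synchronous call: send the request, then wait a bounded time for a reply.
    reply_box = queue.Queue(maxsize=1)
    server.inbox.put((request, reply_box))
    try:
        return reply_box.get(timeout=timeout)
    except queue.Empty:
        # The failure lands in the caller; the server is not interrupted.
        raise CallTimeout(f"no reply within {timeout}s") from None
```

A caller with a short timeout raises `CallTimeout`, and a later caller with a longer timeout still gets its answer from the same server.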