
> Crashing and losing all existing connections often merely leads to an avalanche of subsequent requests, accelerating overload of your systems. I've seen this countless times.

Sure, but again, that's a scenario that you need to handle anyway.

> Being crash tolerant is one thing. But "crash early, crash often" is absolutely horrible advice.

I've found it to be good advice. It's a big part of why Erlang systems have been so reliable for decades.




In Erlang, a "process" is more akin to a thread or fiber in most other languages, and "crashing" is more like an exception that is caught just across a logical task boundary. Importantly, an Erlang process "crashing" doesn't kill the other connections and tasks within the kernel-level Erlang process.
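To make the "exception caught just across a task boundary" idea concrete, here is a rough sketch in Python's asyncio rather than Erlang. The `handle` and `serve` names and the failing connection are made up for illustration, and the isolation is much weaker than Erlang's, since these tasks share one heap.

```python
import asyncio


async def handle(conn_id: int) -> str:
    """Hypothetical per-connection task; connection 2 hits an unexpected state."""
    await asyncio.sleep(0)  # yield, as a real handler would while waiting on I/O
    if conn_id == 2:
        raise RuntimeError("unexpected state")
    return f"conn {conn_id} served"


async def serve(conn_ids):
    # return_exceptions=True converts each failure into a value at the task
    # boundary, so one failed task does not cancel its siblings.
    results = await asyncio.gather(
        *(handle(c) for c in conn_ids), return_exceptions=True
    )
    return dict(zip(conn_ids, results))


def run_demo():
    return asyncio.run(serve([1, 2, 3]))
```

Calling `run_demo()` gives normal results for connections 1 and 3 and a `RuntimeError` object for connection 2; the failure stays inside the task that raised it.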

And that's why Erlang is so resilient, precisely because the semantics of the language make it easier to isolate tasks and subtasks, minimizing blast radius and the risk of reaching an inconsistent or unrecoverable state. I often use Lua (as a high-level glue language) to accomplish something similar: I can run a task on a coroutine, and if a constraint or assertion throws an error (exceptions in Lua are idiomatically used sparingly, similar to how they're used in Go), it's much easier to recover. This is true even for malloc failures, which can be caught at the same boundaries (pcall, coroutine.resume, etc) as any other error.[1] It's also common to use multiple separate Lua VM states in the same kernel process, communicating using sockets, data-copying channels, etc, achieving isolation behaviors even closer to what Erlang provides, along with the potential performance costs.

[1] While trickier in a language like Lua, as long as your steady state (the event loop, etc.) is free of dynamic allocations, malloc failure is trivially recoverable. The Lua authors are careful to document which interfaces and operations might allocate, and to keep a core set of operations allocation-free. Of course, you still need to design your own components likewise, ideally such that allocations for connection requests, subtasks, etc. can be front-loaded (RAII style), or if not, isolated behind convenient recovery points. Erlang makes much of this discipline automatic.


Isolation is indeed a cornerstone of what makes this style work, but "let it crash" is another one, and just as important IMO. "In this language crashing would take down other connections" does not make it safe to continue without crashing, and it doesn't remove the need to crash and recover or do something equivalent. Rather, it's an argument for finding a way to separate connection tracking from the more complicated logic that might get into unexpected states.
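A minimal sketch of that separation, in Python for illustration only: the `Supervisor` and `Worker` classes are hypothetical. A simple owner keeps the connection table, and the complicated logic is thrown away and rebuilt from a known-good state whenever it fails.

```python
class Worker:
    """Hypothetical 'complicated logic' with internal state that can go bad."""

    def __init__(self):
        self.seen = 0

    def handle(self, request: str) -> str:
        self.seen += 1
        if request == "poison":
            # Let it crash: no attempt to repair the state locally.
            raise ValueError("unexpected state")
        return f"ok:{request}:{self.seen}"


class Supervisor:
    """Owns the connection table; the worker is disposable."""

    def __init__(self):
        self.connections = set()
        self.worker = Worker()
        self.restarts = 0

    def request(self, conn_id: int, request: str) -> str:
        self.connections.add(conn_id)
        try:
            return self.worker.handle(request)
        except Exception:
            # Discard the possibly inconsistent worker and restart from
            # a known-good initial state; connections are untouched.
            self.worker = Worker()
            self.restarts += 1
            return "error:retry"
```

The design choice is that the part that must survive (connection tracking) does almost nothing that can fail, so restarting the worker never costs a connection.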


Whoa. In the Erlang equivalent here, the client would crash (GenServer.call will time out if the called server is overloaded), not the service.
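Roughly what that looks like, sketched with Python threads and queues instead of a real GenServer. The `server`, `call`, and `demo` functions are invented for this example, and the sleep stands in for an overloaded server.

```python
import queue
import threading
import time


def server(inbox: queue.Queue, delay: float) -> None:
    """Hypothetical server loop: one message at a time, slow when overloaded."""
    while True:
        msg = inbox.get()
        if msg is None:  # shutdown sentinel
            return
        payload, reply_box = msg
        time.sleep(delay)  # simulated load
        reply_box.put(payload.upper())


def call(inbox: queue.Queue, payload: str, timeout: float) -> str:
    reply_box = queue.Queue(maxsize=1)
    inbox.put((payload, reply_box))
    # On timeout, queue.Empty is raised in the caller; the server keeps running.
    return reply_box.get(timeout=timeout)


def demo(delay: float, timeout: float) -> str:
    inbox = queue.Queue()
    t = threading.Thread(target=server, args=(inbox, delay), daemon=True)
    t.start()
    try:
        return call(inbox, "ping", timeout)
    except queue.Empty:
        return "caller timed out"
    finally:
        inbox.put(None)
        t.join()
```

With a fast server, `demo(0.0, 1.0)` returns the reply; with a slow one, `demo(0.3, 0.05)` fails on the calling side while the server finishes its work normally.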



