Graceful Shutdown

zimbatm · on Feb 17, 2019

Any sufficiently-complex program is bound to re-implement an inferior version of Erlang OTP :-p

You quickly find out that recording the list of processes / tasks / ..., including the relationships between them, is a useful thing to have. With that first-class registry, it's possible to implement all sorts of shutdown and restart logic.

rumcajz · on Feb 17, 2019

Disclaimer: I have almost no experience with Erlang/OTP.

But the question here is not whether you can implement different kinds of shutdown, it's rather which kind covers the common graceful shutdown use cases.

If Erlang/OTP has such an algorithm, I'd love to hear about it.

_asummers · on Feb 17, 2019

Erlang has an explicit concept of processes and supervisors, and communication by message passing. When a worker process dies, the supervisor is made aware of it, and upon receipt of the message, can decide what to do. In some cases that means restarting the worker to some known good state (turn it off and on), in others just restarting a generic worker (a web server, for example), and in others the supervisor cannot recover on its own, so it crashes itself and lets its own supervisor do the same exercise. It also has a notion of hot code reloading, which can go and change process state to satisfy the requirements of the new app version (think upgrading a telephone pole remotely).

rumcajz · on Feb 17, 2019

I meant it the other way round. The question was not about what happens when the child dies. It was rather how the parent lets the child know that it should terminate. Or that it should eventually terminate, but that it should attempt a clean shutdown first.

_asummers · on Feb 17, 2019

The supervisor can also terminate its children. The communication goes both ways. But the children may also be supervisors with their own children, which all have their own termination behavior. The VM is designed around reliability and addresses this sort of behavior with language constructs.

rumcajz · on Feb 17, 2019

Ok, I guess I'll have to give it a closer look. But are you positive they have a good story about interactions between hard and soft shutdowns? E.g. when the grandparent requests a hard shutdown but the parent a soft shutdown. Or vice versa. Or if the cancellation timeouts at different levels of the hierarchy don't match?

_asummers · on Feb 17, 2019

It was designed for dealing with telephone switches, which could go offline for any reason, and need to be worked around. I recommend the entirety of LYSE, but [0] this chapter should give you some footing. To your specific question, upon receipt of the DOWN message, the processes can respond according to what makes sense to the domain.

[0] https://learnyousomeerlang.com/supervisors

rumcajz · on Feb 17, 2019

Thanks!

keypusher · on Feb 17, 2019

Is this not a solved problem? You have the load balancer stop sending new connections to the host. You send SIGTERM to the process. You wait until a timeout, then you SIGKILL the process. This is how it's done in Kubernetes, ECS, and other systems I've worked with. Trying to engineer the entire lifecycle from within the webserver has a wide array of problems that the author is bending over backwards to end up only partially solving.

rumcajz · on Feb 17, 2019

Here's a scenario: Your webserver has an open long-lived WS connection. You stop the load balancer. The connection still lives. Then you send SIGTERM to the webserver. The connection still lives. The main coroutine of the webserver catches the SIGTERM and wants to do a graceful shutdown. But the connection in question is running in a separate coroutine. So it has to send a graceful shutdown signal to that coroutine. Etc. You end up having to deal with the problems described in the article.

ninkendo · on Feb 17, 2019

If you have a websocket architecture, the client needs to tolerate a severed connection and transparently reconnect, without user-visible impact. Anything else is going to lead to disaster... the server end can't stay up forever.

The right thing to do in this case is for the server to also disconnect any websockets when it gets the SIGTERM, in addition to stopping the accept loop and the other things it would normally do. Clients simply have to reconnect (and have enough tolerance in the socket's in-band protocol to recover state, roll back transactions, etc.) It's a scenario that's bound to happen to a certain percentage of your clients anyway, due to all sorts of external factors, and must be handled regardless.

rumcajz · on Feb 17, 2019

The CLOSE frame exists in the WS protocol for a reason. It's there so that both sides know that all the messages were delivered before the connection was closed.

adontz · on Feb 17, 2019

Why is the coroutine long-lived? I imagine this as the event loop machinery just stopping processing events, with all sockets then closed [by the OS].

Twirrim · on Feb 17, 2019

One note of caution: One way to handle this is by having your application stop responding to health checks from the load-balancer. The LB marks the host as dead and stops sending new connections to it. However there are some load-balancers that see a failed health-check and immediately flush all connections to a "failed" host. Be sure you know what your LB's behaviour is, and how that fits in to your graceful shutdown model.
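As an illustration of the application side only, here is a small Go sketch in which a health endpoint starts failing once draining begins. The `/healthz` path, the `draining` flag, and the `probe` helper are assumptions for the example; whether the load balancer then merely stops routing new connections or also cuts existing ones is exactly the behaviour the comment says to check.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// draining is flipped to true when graceful shutdown begins.
var draining atomic.Bool

// healthz answers the load balancer's health check: 200 while serving,
// 503 once the process has started draining.
func healthz(w http.ResponseWriter, r *http.Request) {
	if draining.Load() {
		http.Error(w, "draining", http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintln(w, "ok")
}

// probe returns the status code a health check would currently see.
func probe() int {
	rec := httptest.NewRecorder()
	healthz(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
	return rec.Code
}

func main() {
	fmt.Println(probe()) // 200 while serving normally
	draining.Store(true) // e.g. set from the SIGTERM handler
	fmt.Println(probe()) // 503 once draining
}
```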

adontz · on Feb 17, 2019

Actually, you don't even need a load balancer node or software. SO_REUSEPORT turns the Linux kernel into a load balancer. You just run the new server alongside the old one and ask the old one to stop serving requests.

maxxxxx · on Feb 17, 2019

As a long-time desktop developer I always find it interesting how little thought is put into shutdown on the server side. The only way seems to be to take it off the load balancer and hope that everything has finished after some time.

I have spent numerous hours figuring out how to shut down multithreaded desktop apps within a reasonable time. It’s not easy.

skybrian · on Feb 17, 2019

This problem seems similar to processing a request that has a deadline, after which the request is discarded and any resources should be freed.

In Go, you can keep track of deadlines using a Context [1]. Subrequests also need to know the deadline (or perhaps a shorter deadline), so they get passed a derived context. The same mechanism also supports cancellation.

[1] https://golang.org/pkg/context/

latchkey · on Feb 18, 2019

I deployed a binary golang agent [1] across about 2000 raspberry pi class machines (for litecoin mining) using overseer [2].

It allowed me to implement zero-downtime upgrades, which was pretty cool. Internally within the agent, I used the context pkg to implement a 'Manager' which provided cron-like functionality where each thing could be shut down gracefully [3]. I used the same mechanism to shut the internal webserver down gracefully as well [4].

[1] https://github.com/blockassets/bam_agent

[2] https://github.com/jpillora/overseer

[3] https://github.com/blockassets/bam_agent/blob/f7edce599b770b...

[4] https://github.com/blockassets/bam_agent/blob/f7edce599b770b...

rumcajz · on Feb 17, 2019

It was proposed to introduce structured concurrency to Go, but it's not easy, not least because of the context package and how the two would play along. The issue was closed with a "we need a larger discussion" comment.

https://github.com/golang/go/issues/29011

If you are interested in that you may want to participate.

jkcxn · on Feb 17, 2019

Kotlin is doing something like this with coroutine cancellation. Not quite the same, because the cancellation requires cooperation, but I reckon you could add a timeout and get the composability that the article talks about.

https://kotlinlang.org/docs/reference/coroutines/cancellatio...

protomyth · on Feb 17, 2019

It's interesting that very few new languages actually take into account the maintenance of programs coded in the language. I still think that something like tuple spaces (Linda) that can be "frozen" would actually help in breaking apart programs so shutdowns and, more likely, partial shutdowns are possible.

dnautics · on Feb 17, 2019

Erlang-OTP

protomyth · on Feb 17, 2019

Yeah, that's one language, and I'm really not sure they went far enough.

dnautics · on Feb 17, 2019

OTP is still evolving. It's more like a standard library than a first-class language or VM construct. What do you want? Perhaps some additional libraries could help.