Make resilient Go servers using timeouts, deadlines and context cancellation

JyB · on Jan 26, 2020

Almost all articles I've seen explaining the context pkg are done with net/http examples.

That's fine but I feel like it might not be the best introduction for a novice as a lot of concepts are mixed together and they might miss the bigger picture. Context is not just for web servers. You don't have to know how net/http works.

You could simply demonstrate the usefulness of the context package by showing how to properly clean up your program on a SIGTERM. Or even gracefully stop a long running operation so you are not afraid to stop/start your program at the "wrong" time.

morelisp · on Jan 26, 2020

> gracefully stop a long running operation so you are not afraid to stop/start your program at the "wrong" time.

This is one of my major pain points with Go's contexts. Where I work we do have an "application wrapper" that gets cancelled on various signals, and it's very handy for some things. But one thing it's not good at is shutting things down safely!

Something we do pretty commonly is start multiple servers (e.g. a public and a private HTTP server, or an HTTP server and DB background task, etc.) When we get SIGINT, we want to cancel both contexts (easy), then wait for both to stop before continuing with our exit process (hard). Yes, this is the canonical case for sync.WaitGroup, but those are hard to use correctly when you need to transition into and out of "acceptable" states rather than just count down (well, hard to use correctly period, to be honest - probably half the time I see a junior dev using them they increment within the goroutine). And it's hard to time out waiting for the WaitGroup when you want to continue despite unclean shutdown - WaitGroup.Wait doesn't itself take a context.

This is further complicated by the fact that for servers you often don't really want to use the "application context" as the parent of the request contexts. Rather you want the server to shut down cleanly when that context is cancelled, processing all pending requests to completion but without immediately cancelling them. So the base request context is ideally something like "cancelling X seconds after the application context" which is not part of the standard context toolbox.

And of course different libraries are not really consistent in how they shut down - http.Server lets you shut it down with a timeout and so returns an error, but you also need to check the error return of the Serve method (and you can't distinguish a graceful stop from a hard stop from trying to restart a stopped server), grpc.Server offers only hard and graceful stops with no timeout and only the Serve method returns an error, and sarama.Client provides only a synchronous close that returns an error.

I've not used them but I'm told C#'s cancellation tokens are akin to Go's context but are closely integrated with their async task state machine, such that it's easy to hand out cancellation tokens and wait for the tasks waiting for those tokens to finish.

K0SM0S · on Jan 26, 2020

> “Almost all articles I've seen explaining the context pkg are done with net/http examples. That's fine but I feel like it might not be the best introduction for a novice as a lot of concepts are mixed together and they might miss the bigger picture. Context is not just for web servers. You don't have to know how net/http works.”

Indeed. But I think the seminal blog post¹ on the matter has somewhat skewed perception initially. The first sentence reads:

> In Go servers, each incoming request is handled in its own goroutine. Request handlers often start additional goroutines to access backends such as databases and RPC services.

That pretty much sets a mental context (pun unintended) within the server paradigm. People then just overlook the fact that any Go app can (and often probably should) make use of `context` — CLI anyone? Unless I'm mistaken, `context` is fundamentally about the underlying machine, not network (it just adds to confusion that the latter is represented as "context" within the former, i.e. files and folders and variables in a POSIX environment).

> You could simply demonstrate the usefulness of the context package by showing how to properly clean up your program on a SIGTERM.

Very nice one! I never saw a book titled "Linux Mechanics" but this would be akin to applying F⁻¹ to some object whose inherent motion comes from F. Like, the basics, "how do I stop this thing?" This is a cultural approach to computers that I think is fading along with the desktop culture, which itself got blurred by tons of layers of abstraction (think 1980-2020).

In that vein I think it's very interesting to experiment with Linux namespaces² (notably cgroups and network namespaces). Create elementary "container components" (i.e. network namespace, UTS..) with Go and see from there how context applies (or not). You'll be effectively clean-rooming³ Docker, in some abstract sense. Distribute it and you've got a skeleton for Kubernetes. It's a really logical conclusion/next step when diving deep into Linux in the current context — DevOps etc. You just see that culture emerging from the tech.

I think Go is incredibly well-positioned to learn these things as we speak: simple, efficient, easy to read, asks you to consider every step carefully (error handling), fluent with OS calls, etc.

[1]: https://blog.golang.org/context

[2]: http://man7.org/linux/man-pages/man7/namespaces.7.html

[3]: Ignoring common licensing or business concerns attributed to this approach, playing with "clean room" work is excellent technical training to master topics beyond intermediary level, well into expertise: that which no one can really directly impart on you, that which you must build for yourself, quite literally so at times. Can you build it, given Search and enough time?

https://en.wikipedia.org/wiki/Clean_room_design

regecks · on Jan 26, 2020

I tried running Go HTTP servers bare to the internet (after Cloudflare promoted doing so in a blog post), but went back to using a reverse proxy the next time.

The main benefit seems to be convenience. I can upgrade and graceful-restart nginx instead of having to rebuild and redeploy the Go server (involving a full app restart). Not having to worry about goroutine leaks because some jerk decided to send the request line @ 1 byte/sec is just an added bonus.

jrockway · on Jan 26, 2020

You have to worry about the jerk sending requests at 1 byte per second no matter which webserver you use. It's always a problem to let an unlimited number of people ask for an unlimited amount of resources; it's just that things like goroutines are heavier than a file descriptor or a few bytes of RAM, so you'll notice wasted goroutines more quickly than wasted fds or memory.

Typically, you need to consider the total amount of memory you want your web server to use, how much of that memory one request can use, and how long a request can use that memory. (File descriptors must also be considered.)

Envoy has a section in their documentation about this here: https://www.envoyproxy.io/docs/envoy/latest/configuration/be...

nginx similarly has a number of knobs to turn: https://www.nginx.com/blog/tuning-nginx/

I use Envoy as my web proxy and nginx to serve static content. My envoy configuration is complicated and my nginx configuration is simple, as a result. I imagine that if you are hosting a serious amount of traffic with Nginx as the edge proxy, more tuning is required. I've never tried, so I don't really know.

earthboundkid · on Jan 26, 2020

You have to worry about slow requests somewhere in your stack, but with a good network architecture, you can assume the problem is solved at the open internet interface and ignore it inside your trusted zone.

toast0 · on Jan 26, 2020

File descriptors are mostly a soft limit --- you can usually easily set the os and process limit higher than what your stack can process. The maximum setting for stock FreeBSD is the number of pages divided by four (so one fd per 16kB of ram on x86). Most systems will run out of ram much before FDs if the FD limit is all the way up.

If you have a reasonable amount of ram, and a reasonable way to manage the slow connections, chances are a Slowloris attack is going to use more resources on the attackers side and not be effective. Async i/o in C based servers works pretty well. FreeBSD accept filters can work if protocol appropriate; the kernel doesn't return the socket to accept until data matching a pattern has been sent, see accf_http; but that doesn't work if the client sends the handshake quickly and further data slowly. If you really need to use a stack that doesn't work for this, putting a proxy that captures whole requests at whatever speed and then sends them as fast requests works too.

zzzcpan · on Jan 26, 2020

> You have to worry about the jerk sending requests at 1 byte per second no matter which webserver you use.

Not necessarily. It's just free webservers don't bother dealing with it, but there are plenty of simple approaches. Like just dropping connections that are sending requests slower than some threshold or dropping the slowest connection when some total number of connections is reached. Or more complicated, which also works to protect from all kinds of attacks, dropping the highest malicious score or the lowest reputation score client when some resource usage threshold is reached.

None of these are easy to implement with synchronous multithreaded networking code though, like in Go. Realistically it's only viable with asynchronous single threaded programming models or an actor model.

jlokier · on Jan 26, 2020

> None of these are easy to implement with synchronous multithreaded networking code though, like in Go. Realistically it's only viable with asynchronous single threaded programming models or an actor model.

It's hard to see why synchronous multi-threaded code would find these things any more difficult than async or actor models.

All three models are equally able to access shared data structures to keep track of resource usage statistics, per-connection statistics, and timers.

OS kernels do this routinely, and are essentially multi-threaded on SMP architectures or with kernel pre-emption.

zzzcpan · on Jan 26, 2020

Basically the reason is you can't just kill a thread that shares memory with other threads. Go doesn't even have an ability to kill goroutines, so your only choice is manual context tracking and manual cancellation in every piece of code. But if you are in an event loop, for example, you can just destroy any client at any point. Same with actors, if you are in an actor, you can just kill other actors.

jlokier · on Jan 28, 2020

Thanks, that's an interesting point of view.

Unfortunately, with event loops and async programming, including async-await models, cancellation is just as fiddly and needs to be explicitly handled by client event handlers/awaiters.

For example, think of JavaScript and its promises or their async-await equivalent.

There is no standard, generic way to cancel those operations in progress, because it's a tricky problem.

zzzcpan · on Jan 28, 2020

> cancellation is just as fiddly and needs to be explicitly handled by client event handlers/awaiters

That's not true. In event loops, to do cancellation you simply remove the event handlers for the associated client from whatever event notification mechanism you are using and delete (free) the client's data structures, including futures, promises or whatever you are using. Since references to all of them are necessary for event loops to be able to even call event handlers, no awareness of any of it on the event handlers' side is required.

jlokier · on Jan 28, 2020

That's not true; it only applies to a subclass of simpler event scenarios.

For example, in an event loop system you may have some code that operates on two shared resources by obtaining a lock on the first, doing some work, then obtaining a lock on the second, then intending to do more work and then release both locks. All asynchronously non-blocking, using events (or awaits).

While waiting for the second lock, the client will have registered an event handler to be called when the second lock is acquired.

("Lock" here doesn't have to mean a mutex. It can also mean other kinds of exclusive state or temporary ownership over a resource.)

If the client is then cancelled, it is essential to run a client-specific code path which cleans up whatever was performed after the first lock was obtained, otherwise the system will remain in an inconsistent state.

Simply removing all the client's event handlers (assuming you kept track of them all) and freeing unreferenced memory will result in an inconsistent state that breaks other clients.

This is the same basic problem as with cancelling threads. And just like with event/await systems, some thread systems do let you cancel threads, and it is safe in simple cases, but an unsafe pattern in more general cases like the above example. Which is why thread systems tend to discourage it.

zzzcpan · on Jan 28, 2020

Nope, event loops and asynchronous programming in general don't have a concept of taking a lock, because the code in any event handler already has exclusive access to everything. I.e. everything is effectively sequentially consistent.

There are some broken ideas out there that mix different concurrency models, in particular async programming with shared memory multithreading, not realizing they are binding themselves to the lowest common denominator, but I was never talking about any of them.

jlokier · on Feb 4, 2020

We are clearly working with very different kinds of event loops and asynchronous programming then.

I think you use "in general" to mean "in a specific subset" here...

It is not true that every step in async programming is sequentially consistent, except in a particular subset of async programming styles.

The concept of taking an async mutex is not that unusual. Consider taking a lock on a file in a filesystem, in order to modify other files consistently as seen by other processes.

In your model where everything is fully consistent between events, assuming you don't freeze the event loop waiting for filesystem operations, you've ruled out this sort of consistent file updating entirely! That's quite an extreme limitation.

In actual generality, where things like async I/O takes place, you must deal with consistency cleanup when destroying event-driven tasks.

For an example that I would think fits in what you consider a reasonable model:

You open a connection to a database (requiring an event because it has a time delay), submit your read-and-write transaction (more events because of the time needed to read or to stream large writes), then commit and close (a third event). If you kill the task between steps 2 and 3 by simply deleting the pending callback, what happens?

What should happen when you kill this task is the transaction is aborted.

But in garbage collected environments, immediate RAII is not available and the transaction will linger, taking resources until it's collected. A lingering connection containing transaction data; this is often a problem with database connections.

In a less data-laden version, you simply opened, read, and closed a file. This time, it's a file handle that lingers until collected.

You can call the more general style "broken" if you like, but it doesn't make problems like this go away.

These problems are typically solved by having a cancellation-cleanup handler run when the task is killed, either inline in the task (its callback is called with an error meaning it has been cancelled), or registered separately.

They can also be solved by keeping track of all resources to clean up, including database and file handles, and anything else. That is just another kind of cleanup handler, but it's a nice model to work with; Erlang does this, as do unix processes. C++ does it via RAII.

In any case, all of them have to do something to handle the cancellation, in addition to just deleting the task's event handlers.

Twirrim · on Jan 26, 2020

One cloud service I worked for had no end of quirks exposed by someone using a 28.8k modem to upload data (if it'd been a 36.6k modem it'd have been fine). It wasn't causing impact on other customers (we already had stuff to handle "slow PUTters"), they were the only one getting a bunch of 500s, but it did expose a series of unrealised assumptions in the service components.

ronsor · on Jan 26, 2020

Can you share more information? I can't imagine anyone using a 28.8k modem within the past 5 years.

Twirrim · on Jan 26, 2020

There isn't much to share. We made numerous attempts to reach out to the customer but never got a response, so we don't know exactly what they were doing.

We don't know for sure it was a 28.8K modem, it just appeared to be, given the speed they uploaded and the slight variances we saw in the speed (if it was a throttled upload, they tend to be pretty rigid in performance).

One of the main things it exposed was that certain libraries we used had buffers in them. We'd proxy the data from the customer to another back end service. By default the library would open the connection to the back end, and wait for the small buffer to fill before sending the data. The back end service would terminate a connection if the connection was open but idle for $x number of seconds. The user was on the threshold of that timeout. Probably half the time they PUT, they'd be slow enough to trigger that back-end timeout, resulting in them getting a 500. I believe eventually they put a small buffer on the ingestion path too before pushing along to the back end, but given those PUTs could get really large, we couldn't buffer the entire content before sending along.

tills13 · on Jan 26, 2020

CloudFlare itself offers request buffering, so you should be good on that front.

telendt · on Jan 26, 2020

There's an ugly bug in http.TimeoutHandler though - it obscures stack traces so that it's impossible to use them to locate a panic in the decorated handler: https://github.com/golang/go/issues/27375

diamondo25 · on Jan 26, 2020

Ok, good, contexts are now making sure you can handle upcoming timeouts decided by an upper layer (caller function). But how about the time.After function? It'll still be running in the background? So you can still have a memory or 'processing power' leak?

rakoo · on Jan 26, 2020

The function you write after time.After should use the same context, and check its Done channel before continuing execution

cpuguy83 · on Jan 26, 2020

Probably best not to use `time.After`, because it indeed starts a timer that you have no control over unless you are waiting for the full time.

telendt · on Jan 26, 2020

Good point. To those unaware, the time.After is equivalent to time.NewTimer(d).C, but "the underlying Timer is not recovered by the garbage collector until the timer fires" (quote from the doc).

That slowAPICall function should look like:

    func slowAPICall(ctx context.Context) string {
        d := rand.Intn(5)
        t := time.NewTimer(time.Duration(d) * time.Second)
        defer t.Stop()
        ...
    }

diamondo25 · on Jan 26, 2020

So you have to propagate the Context, hm. IIRC, go test will panic the test case when it times out. Not exactly sure tho. It would be nice if there was a kind of 'abort' feature to clean up subroutines spun off this thread

JyB · on Jan 26, 2020

In most cases, at the inner-most level you end up calling some sort of external library (sql, api-client, ...) that will handle the Done() channel itself.

All you have to do is make sure to pass to the library the context that carries your timeout or cancellation signal. The "rule" that everyone seems to follow is to always take a context.Context as the first argument if your library handles cancellation.

thwarted · on Jan 26, 2020

The best there is with the context package is to make sure to call the cancel function given to you by contexts that have cancelation. Usually you do this via defer. The cancel function is a no-op if the context is finished otherwise. All this ends up doing though is making sure that things that clean themselves up know to clean themselves up eventually.

morelisp · on Jan 26, 2020

> Usually you do this via defer.

I agree this is usually done by defer, but you probably should not do it that way unless your code is very simple. Consider a function body which I've seen variants of many times:

    ctx, cancel := context.WithTimeout(pctx, timeout)
    defer cancel()
    resp, err := do(ctx, req)
    if err == nil {
        process(resp)
    }
    return err

Safe yes, but optimal? process doesn't use the context, and may take longer than the timeout. The context will continue running, with some associated resource cost (at the very least, the context's goroutine and timer). A minimal change is:

    ctx, cancel := context.WithTimeout(pctx, timeout)
    resp, err := do(ctx, req)
    cancel()
    if err == nil {
        process(resp)
    }
    return err

Which disposes of those resources much earlier.

(Depending on your Go compiler version there is also a potential cost associated simply with using defer; this is independent of that.)

mjpuser · on Jan 26, 2020

Nice intro to timeouts and context. Next step would be dealing with state changes that happen in a cancelled request.

awinter-py · on Jan 26, 2020

absolutely agree with the risk of slow clients saturating your connection limit

when doing DB work with these, I'm a little shakier -- once I start a multistep DB write, I probably want it to finish. Yes I can use a transaction to roll back the whole thing, but I think there are cases where rollback is wrong and I'd rather keep the write.

so while cancellation is cool, it's also a little fraught and hard to test.

shhsshs · on Jan 26, 2020

In those rare cases you can choose not to propagate the same context through those operations. Only check for cancellation once the operations have all finished.

namanaggarwal · on Jan 26, 2020

Shouldn't read header timeout be less than read timeout?

gigatexal · on Jan 26, 2020

I scrolled through the whole article and didn’t get a blaring ad like that.

silisili · on Jan 26, 2020

Yeah, I quit reading when I get the unblockable full page ad asking me about paying 50 dollars for a Go course. Good spam bot.

Operyl · on Jan 26, 2020

I did not get such an ad on my end? And this browser has no blockers of any kind. I scrolled through it in its entirety.

silisili · on Jan 26, 2020

So you know I'm not lying... https://ibb.co/v43sMXv

It's not actually unskippable, but on mobile you have to zoom out to click the X at top right.

Operyl · on Jan 26, 2020

I wouldn’t call that an ad, but even then I never had it trigger. Must’ve hit a JavaScript error on my end or something.

teichmann · on Jan 26, 2020

I’m not sure if I consider that an ad, but I got the same after scrolling way down. I’m on Mobile Safari.

atombender · on Jan 26, 2020

Appears after a while. Not based on scrolling, as far as I can tell.

yiyus · on Jan 26, 2020

That makes sense. Using a timeout is much more appropriate for this specific article.