Yes, properly handling timeouts and cancellation is the next frontier for programmers to conquer. I was just thinking about this the other day because some program I was using locked up, and of course it worked fine when restarted, and I began to wonder why this happens so frequently. A lot of obscure things can cause hangs, but if every blocking operation has a timeout, the number goes way down.
I think it's unfortunate that even new languages still treat timeouts and cancellation as an afterthought. For example, every Go program I've ever written says:
select {
case <-ctx.Done():
	return nil, ctx.Err()
case thing := <-thingICareAboutCh:
	return thing, nil
}
Instead of:
return <-thingICareAboutCh, nil
The language designers thought about needing to give up on blocking operations, and then said "meh, let the programmer decide on a case-by-case basis". And that's the state of the art.
(Getting off topic, this is why I avoid mutexes and other concurrency operations that aren't channels; you can't cancel your wait on them. Not being able to cancel something means that if there are any bugs in your program, you'll find out when the program leaks thousands of goroutines that are stuck waiting for something that will never happen and runs out of memory. Even if the thing they're waiting for does happen, the browser that's going to tell someone about it has long been closed, and so you'll just die with a write error when you finally generate a response. If you have a timeout and a cancellation on every blocking task, your program gives up when the user gives up, and will run unattended for a lot longer.)
> this is why I avoid mutexes and other concurrency operations that aren't channels; you can't cancel your wait on them
Mutexes are designed to solve different problems -- albeit with some overlap. Channels are great if you have a one-to-one relationship, or even many-to-one, but they're not so good with many-to-many relationships, i.e. where you need several goroutines to have read access and several to have write access. I'm not saying they can't be used that way, but if you're not careful it's even easier to have goroutines "stuck waiting for something that will never happen" with channels than it is with mutexes. So I'd be very nervous about shoehorning channels into a design they're not well suited for.
Also, you can cancel your wait on a mutex if there is another goroutine which unlocks that mutex. In fact, if you browse the Go source you'll see that channels use mutexes internally.
I like how Rust handles cancellations. Async is polling based. If you stop polling and drop the Future object, it's cancelled. All of it. Like in the library from the article, every await point is a point where an async task can end.
I think that this is a good example of where dynamic scope is helpful.
We’re used to lexical scope: it’s easy to reason about, and it is a really good default. But sometimes it makes sense for one function to apply settings for all the functions it calls, without interfering with other functions, scopes, threads or processes (like setting a global would).
It’d be nice to be able to say ‘this function should time out within 10 ms’ and then have every function it calls automatically time out.
Go’s contexts integrate timeouts and cancellation, and permit one to add any value, should one wish to, but you have to be disciplined and add a context argument to every single function. It’d be better, I think, to support it natively in the language. Lisp does this: any variable declared with DEFPARAMETER or DEFVAR is dynamic, and you can locally declare something dynamic too.
One can fake dynamic scoping with thread-local storage and stacks or linked lists, if one needs it, but it can get ugly.
Dynamic scoping doesn’t get the attention or respect I think it deserves. It’s arguably the wrong thing by default, but when it’s useful, it’s really useful.
I think that what you're describing is how Trio's cancellation scopes work [1]. Trio is an alternative async library for Python [2] (as opposed to asyncio). It's pretty neat.
I'm glad I'm not the only one who noticed that a context is basically a form of dynamic scoping, but with a tendency towards weaker type guarantees. Dynamic scope is incredibly useful for cross-cutting concerns. However, when I implemented this I found several problems in practice that I've never reconciled.
The two big issues were:
1) "User managed" namespaces
2) Multi-process programs, or perhaps multi scope programs
The issue with "user-managed" namespaces is basically one of lexical scoping. In traditional dynamic scope, all variables in the dynamic scope are global, so people tend to accidentally overwrite each other's variables. I've thought through some ways around this, but none have been elegant.
The more important, and foundational one takes some explaining. Consider some code:
type Work struct {
	Ctx    context.Context
	Result chan<- int
}

func NewWork(ctx context.Context, c chan<- int) *Work {
	return &Work{
		Ctx:    ctx,
		Result: c,
	}
}

type Worker struct {
	c <-chan *Work
}

func NewWorker(c <-chan *Work) *Worker {
	return &Worker{
		c: c,
	}
}

func (w *Worker) Work(ctx context.Context) {
	for work := range w.c {
		// We now have two contexts:
		// work.Ctx, the context of the request that produced the work, and
		// ctx, our "local" worker context.
		// Both are useful!
		_ = work
	}
}

func main() {
	workers := make([]*Worker, 10)
	workChan := make(chan *Work, len(workers))
	for i := range workers {
		workerContext := context.Background()
		workers[i] = NewWorker(workChan)
		go workers[i].Work(workerContext)
	}
	http.HandleFunc("/bar", func(w http.ResponseWriter, r *http.Request) {
		result := make(chan int)
		workChan <- NewWork(r.Context(), result)
		fmt.Fprintf(w, "Got Result, %d", <-result)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
As we can see in the Work method above, context is more than just your function call chain. Now arguably dynamic scope doesn't need to handle this, but in CSP-style languages it is very commonly the case that our most relevant scope does not come from dynamic scope.
Front-end code is a different beast, but this immediately reminded me of React's Context API: if you squint, it sort of looks like dynamic scope. It's also nestable, and since you can wrap the context provider in a custom component, you can have it consume the parent context (if any) and compose them however you like. Plus, since accessing that context is explicit, you avoid the usual complaint that dynamic scope makes program execution harder to reason about.
I wonder if there's a reasonable way to apply this approach to more general-purpose code.
This is something that is really nice in gevent. Under the hood it's doing something similar to what the article says - every time you make a blocking call (in gevent, this means yielding to the event loop until your event occurs), you might have a gevent.Timeout raised.
Since gevent is generally used by monkey-patching all standard IO, most code doesn't even need to be aware of this feature - it just treats a timeout as an unhandled exception.
From the user's perspective, it can be used simply as a context manager that cancels the timeout on exit from the block:
with gevent.Timeout(10):
    requests.get(...)
By default this will cause the Timeout to be raised, which you can then catch and handle. As a shorthand, you can also give it an option to suppress the exception, effectively jumping to the end of the with block upon timeout:
response = None
with gevent.Timeout(10, False):
    response = requests.get(...)
if response is None:
    # handle timeout
I often wonder why SO_RCVTIMEO/SO_SNDTIMEO are available for sockets but not for file descriptors in general. Setting the timeout once and then using classic read/write on blocking FDs is easy, and the usual error-code handling applies.
Disk file descriptors are bizarre, if you think about it: why do they still not support any non-blocking primitives to this date? You can sorta simulate it by spawning a helper process that does "while (true) { write(stdout, read(fd)); }" or "while (true) { write(fd, read(stdin)); }" and giving it a pipe (you can epoll a pipe), but that's horrible.
All single-threaded epoll-or-equivalent-based network servers hit this problem sooner or later and become multi-threaded network servers (also, there is async DNS resolution, but that's a whole separate barrel of worms).
I think it is because the network is inherently different from disk. A disk might be slow, but data will eventually arrive (bad disk or network mounts being the exception of course).
> bad disk or network mounts being the exception of course
That's kind of the point. Programs shouldn't assume that any particular file is local. Any file may be hosted on a network mount, and that means network-style asynchronous interfaces should be used to access the data.
I definitely assumed this would contain clever tips about how to handle it when coworkers don't respond to your emails, don't accomplish things they promised, or don't follow-through. Maybe some automation methods to handle those situations.
Very systematic and accessible description of the problem and various alternatives including their origins. I learned a lot reading the post, thank you!
I'm not convinced the task-based approach doesn't work. Perf-wise there's no reason that tasks have to have the overhead of threads.
Syntactically, I think it is worth distinguishing between things that can time out and things that can't, because you usually need to do some sort of cleanup on timeout.
In fact as far as I can tell, the cancel scopes provided by Trio with the async await syntax are exactly isomorphic to Scala's tasks from the cats library (where they are called IO).
Also, I'm not sure I understand the author's preference for thinking of timeouts as level-triggered rather than edge-triggered. While it's an interesting way of thinking about the problem, and would be the natural way a timeout is implemented in, e.g., an FRP system (many flavors of FRP are essentially entirely level-triggered systems), it doesn't seem like the way you'd implement things in a non-reactive system. What's wrong with just killing the entire tree of operations on a timeout (as is usual when you propagate, say, an exception), or in a top-down manner when you put a timeout on a task?
Timeouts are fundamentally tied with concurrency (they are a concurrent operation: you're racing the clock against your code and seeing who wins) and to me the tricky thing about timeouts is exactly the same trickiness that you face with concurrency, namely shielding critical sections. How you decide to pass timeout arguments seems like a secondary concern. Just like with normal concurrency, you need to make sure that certain critical sections are atomic with respect to timing out, either by disallowing timeouts during that critical section (you therefore need to make the critical section as small as possible, ideally a single atomic swap operation) or implementing a reasonable form of rollback. (Of course you can always take the poll-based approach where you poll for timeout status, but again this is just a specialization of a general concurrency strategy)
FP libraries have pretty much solved this IMO. You create a value that describes what you want to happen and that description can include cancellation if some condition is met (e.g. it takes too long). There are limitations imposed by the runtime on what you can actually cancel (e.g. I don't believe all OS calls can be interrupted) but beyond that it works as specified.
Here's one example of such a library, though without a bit of FP background it probably doesn't make a great deal of sense:
I'm not sure what the situation is on Haskell. In Scala there are at least 3 such libraries.
I feel your objection is making the great the enemy of the good. These libraries are vastly better than the default way of working in imperative languages (I know from experience with both!) and we should try to make the innovations more widely known rather than nitpicking.
Boost.ASIO (C++) does not expose the SO_RCVTIMEO socket option and instead makes you use a deadline_timer explicitly. It's very annoying, but this article kind of explains why it is that way.
I have spent way too much of my time as a developer over the years hacking on software to remove ill-conceived timeouts where some developer said--sometimes not even in one place but for some insane reason at every single level of the entire codebase--"this operation couldn't possibly take longer than 10 seconds"... and then it does, because my video is longer than they expected or I have more files in a single directory than they expected or my network is slower than they expected (whether because I have more packet loss or more competition or more indirection) or my filesystem had more errors to fix during fsck than they expected or I had activated more plugins than they expected or I had installed more fonts than they expected or I had more email that matches my search query than they expected or more people tried to follow me than they expected (for months back when Instagram was new I seriously couldn't open the Instagram app because it usually took more than the magic 10 seconds--an arbitrary timeout from Apple--to load my pending follower request list for my private account; the information would get increasingly cached every load, so if I ran the app over and over again eventually it would work) or my DNS configuration was more broken than they expected or I had a more difficult-to-use keyboard than they expected or I had more layers of security on my credit card than they expected or any number of things that they didn't expect (can you appreciate how increasingly specific these examples started becoming, as I started having horrifying flashbacks of timeouts I had to remove because some idiot developer decided they could predict how long something could take and then aborted the operation, which seems like the worst possible way of handling that situation? :/).
Providing the user a way to cancel something is great, but programming environments should make timeouts maximally difficult to implement, preferably so complex that no one ever implements them at all (and yes, I appreciate that this is a pipe dream, as a powerful abstraction tends to make timeouts sadly so easy that people strew them around liberally... but certainly no timeout arguments should be provided on any APIs lest someone arbitrarily guess "10 seconds"): if the user, all the way up at the top of the stack, wants to give up, they can press a cancel button. And to be clear: I don't think timeouts are something mostly just amateur programmers tend to get wrong and which can be used effectively by experts (as is the case with goto statements or random access memory or multiple inheritance)... I have never seen a timeout--a true "timeout" mind you, as opposed to an idempotent retry (where the first operation is allowed to still be in flight and the second will, without restarting, merge with the first attempt as opposed to causing a stampede; these make sense when you have lossy networks, for example)--in a piece of software that was a feature instead of a bug, where the software would not have been trivially improved by nothing more than simply deleting the timeout, and I would almost go so far as to say they are theoretically unsound.
It's a double edged sword. A lot of system failures happen because every single thread winds up accidentally blocking on something that is never going to return for one reason or another. In such cases a timeout is able to unblock the threads.
Maybe the developers collected lots of data and said "p99 latency is X, so let's set the timeout to X+alpha". If something takes drastically longer than most requests, it's probably a sign that something is going wrong. Maybe your request was in the 99.99999th percentile and they timed you out.
Or yeah the devs just guessed and guessed badly and the timeouts cause more problems than they solve.
Consider load balancers that redirect requests to another server if one of them is taking too long to respond (no matter whether it's slow or gone altogether). Eliminating timeouts there and pushing the problem waaaay back onto the client side, with the user constantly hovering over a "cancel" or "retry" button, would actually lead to a worse user experience. Imagine pressing "refresh" in your browser every now and then on random internet pages just because they occasionally take soooo much longer to load! (Also, unlike in typical client-side software, timeouts on the server side are often configurable, so that software can better fit a variety of environments and/or requirements.)
Sometimes there is no interactive user at all who is capable of manually cancelling an operation. Batch data-processing and real-time systems both really need reasonable timeouts set on every blocking operation.
I do agree that timeouts are often used as a crutch for bad code, but the use case discussed in the article (preventing a potentially malicious bad client from hogging resources) is legitimate.