Ruby's Timeout is dangerous and Thread.raise is terrifying (2015) (jvns.ca)
160 points by thunderbong 31 days ago | 126 comments



I'm fascinated by this general problem - I see it as a close relative of the "sandbox" problem, where I want to be able to safely run user-defined untrusted operations.

For timeouts, I want to be able to build web application features which are guaranteed to terminate early after e.g. 2s in a way that resets all resources related to the operation.

The best way I've found of doing this so far involves processes: run the time-bound operation in an entirely separate process and terminate it early if necessary.

I do that with "rg" here: https://github.com/simonw/datasette-ripgrep/blob/a40161f6273...


Even processes are not safe if they modify state outside of themselves (files on the filesystem or whatever else).

It seems to me that the root issue is that encapsulation becomes our enemy here. We want to be able to call some code and not care about the details of how it is implemented. But reliable cancellation requires some sort of design that lets us know exactly what side-effects code has, control them, and force them to be done in a way that's transactional, while still letting us cancel the operation at the right time. We don't want a black box.

I suspect the two general solutions are: 1) elbow grease: manually inspecting the code that you're calling to understand whether it's safe to cancel and designing a bespoke sandbox: a thread in some cases, a process in others, re-architecting the code you're calling in others... 2) something about algebraic effects or capability systems that I can't speak to because I only have vague ideas how they work and haven't applied them in anger.


> Even processes are not safe if they modify state outside of themselves (files on the filesystem or whatever else).

All processes should be prepared for a sudden crash without corrupting state. Things get OOM killed, machines get unplugged, networks go offline. We have things like journals and write-ahead logs and at-least-once messaging precisely because of these kinds of problems. If your process can't handle being terminated at any time then you have problems regardless of whether it's wrapped in timeout code.

Doing this for a block of code within a process is a lot harder, because a) there's generally more surface area of externally-observable state, and b) code does not normally need to be prepared to handle it.
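
To make that concrete, here's a minimal sketch (in Go, since it comes up later in the thread) of the classic write-temp-then-rename pattern: a single-file state update that survives being killed at any instruction. The package and function names are just for illustration.

    package atomicwrite

    import (
        "os"
        "path/filepath"
    )

    // WriteFileAtomic replaces path with data such that a crash or kill at
    // any point leaves either the old contents or the new contents on disk,
    // never a partially written mix.
    func WriteFileAtomic(path string, data []byte) error {
        tmp, err := os.CreateTemp(filepath.Dir(path), ".tmp-*")
        if err != nil {
            return err
        }
        defer os.Remove(tmp.Name()) // harmless no-op once the rename succeeds

        if _, err := tmp.Write(data); err != nil {
            tmp.Close()
            return err
        }
        if err := tmp.Sync(); err != nil { // force the bytes to disk before renaming
            tmp.Close()
            return err
        }
        if err := tmp.Close(); err != nil {
            return err
        }
        // Rename is atomic on POSIX filesystems: readers observe old or new, never partial.
        return os.Rename(tmp.Name(), path)
    }

A write-ahead log is the same idea generalized to a stream of small updates.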


Your comment boils down to 'all code should be perfect'. Which is a lovely request, but doesn't really help.

In particular, I'd challenge you to find one large program that handles OOM situations 100% correctly, especially since most code runs atop an OS configured to overcommit memory. But even if it weren't, I doubt there's any sizeable codebase that handles every memory allocation failure correctly and gracefully.


I don't think GP's statement was that all code everywhere must be perfect.

Just that code which is designed to be run in a separate process with the express intent of allowing termination at timeout should also be designed to not change state outside of itself. If somehow it really needs to (e.g. a 1TB working file), either that process or its invoker needs to have cleanup code that assumes processes have been terminated.

Doesn't mean that ALL code needs to be perfect, or even that this code will be, just that a thoughtful design will at least create the requirements and tests that make this model work.


Not perfect, but “crash-only” or at least as robust as such. Probably involving transactions of some sort. It is indeed a tall order, but if you’re sending kill signals to threads from the outside, that’s the reality you’re in. Find another abort mechanism if that’s too big an ask (and in most cases it justifiably is; that's why Java doesn't even allow it anymore).


You can never be careful enough that you won’t drop crumbs on your keyboard, but if you make a rule to never eat near your computer, you will never drop crumbs on your keyboard.

Writing high quality code is not about being perfect — it’s about changing how you write code.

Writing code that can recover from unexpected errors at any point is simply a matter of being mindful about ordering your instructions and persisting data precisely.


> Your comment boils down to 'all code should be perfect'.

This is clearly false from a cursory skim of the parent comment, which says

> All processes should be prepared for a sudden crash without corrupting state.

...which is leagues away from "all code should be perfect" - for instance, an obvious example is programs that can crash at any time without state corruption, but contain logic bugs that result in invalid data being (correctly) written to disk, independent of crashing or being killed.


The convo has been a bit light on examples - I think a canonical example of how to achieve this can be found in ACID databases and high-uptime mainframes. Definitely doable, but if performance and large-scale data are also concerns, doing it from scratch (e.g. starting with raw file handles) is probably a 5-10+ year project to write your own stable DB.


This is generally pretty slow and complicated.


I've become quite partial to Go's implementation. It uses a context.Context that may or may not have one of a few ways of communicating that whatever has this context should stop processing (e.g. a timeout, a deadline, or I believe one or two others).

That context then has a .Done() method that returns a channel; when that channel is closed, whatever functions are using that context are expected to stop themselves at the soonest point that makes sense to them.

Typically this is done inside a for loop in long-running processes. E.g. for something that copies, it looks like

    for {
        select {
        case <-ctx.Done():
            // we should stop and return a timeout error or something
        default:
            // copy some number of bytes, or check if a network call is done
        }
    }
It does require all of the involved functions to implement support for this, though I think most things do at this point. I wouldn't call a library high quality unless it supports context.Context for long-running operations.

It gives library authors the ability to determine at what points their code can be interrupted, run cleanup code as part of the timeout, etc.

> The best way I've found of doing this so far involves proceses: run the time-bound operation in an entirely separate process and terminate it early if necessary.

This doesn't handle remote resources cleanly, does it? E.g. if I were to lock a Postgres table for a query, and that query times out, will that correctly unlock the table and close the client? Or e.g. lock files? I'm sure some of that can be handled very carefully by managing it in the main process, but that seems error prone.


> It does require all of the involved functions to implement support for this, though I think most things do at this point.

Except for reading and writing data from a file using 'os.File', or reading and writing data from a network socket using a 'net.Conn'.

Support for contexts is quite lacking in that the 'io.Writer' and 'io.Reader' interface don't have it, and those are the most important places to have it.

Context also has the problem of waiting for cancellation to complete.

Once you call "cancel()", it async tells a lot of goroutines to teardown, but it's painfully hard to know when they've noticed the cancellation and halted work, which in practice often leads to very subtle data-races.

> [Terminating processes] doesn't handle remote resources cleanly, does it? E.g. if I were to lock a Postgres table for a query, and that query times out, will that correctly unlock the table and close the client? Or e.g. lock files?

Both postgres and file locks will correctly handle cleanup if the process dies (postgres notices the connection is dead and ends the transaction, the kernel releases filesystem locks a process is holding when it terminates).

This is necessary because a process may exit basically at any time for any number of reasons, such as the kernel OOM-killing it.


> Except for reading and writing data from a file using 'os.File', or reading and writing data from a network socket using a 'net.Conn'.

> Support for contexts is quite lacking in that the 'io.Writer' and 'io.Reader' interface don't have it, and those are the most important places to have it.

In a context world, you would use io.Writer/Reader or net.Conn to write small bits of data and check whether the context is cancelled in between 1KB writes (or whatever size).
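
Roughly like this, as a sketch (the chunk size is arbitrary and the function name is made up). Note it only bounds the time between chunks; a single Read that blocks forever is the edge case below.

    package copyctx

    import (
        "context"
        "io"
    )

    // CopyWithContext copies src to dst in small chunks, checking for
    // cancellation between chunks so a timeout is observed within roughly
    // one chunk's worth of work.
    func CopyWithContext(ctx context.Context, dst io.Writer, src io.Reader) (int64, error) {
        buf := make([]byte, 1024)
        var written int64
        for {
            if err := ctx.Err(); err != nil {
                return written, err // cancelled or deadline exceeded
            }
            n, err := src.Read(buf)
            if n > 0 {
                w, werr := dst.Write(buf[:n])
                written += int64(w)
                if werr != nil {
                    return written, werr
                }
            }
            if err == io.EOF {
                return written, nil
            }
            if err != nil {
                return written, err
            }
        }
    }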

There is an edge case where it hangs (e.g. on writing to a crappy NFS share) but to the best of my knowledge, that stems from the kernel not being able to interrupt already-queued IO and some knock-on effects related to PIDs owning FDs. E.g. `ls` can't be interrupted when trying to list an NFS dir that's unstable.

Would love to be told I'm wrong there if I am.

> Once you call "cancel()", it async tells a lot of goroutines to teardown, but it's painfully hard to know when they've noticed the cancellation and halted work, which in practice often leads to very subtle data-races.

I typically just defer a function in the goroutine that either writes to an "IsDead" channel or sets a mutex-protected boolean (depending on whether I need a single notification that it's dead, or a persistent way to check whether it's dead). It's not as simple as I'd like, but it's also not terribly hard.
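
A sketch of the one-shot variant (closing a channel rather than writing to it, so multiple waiters work too); `work` here is a stand-in for whatever the goroutine actually does:

    package waitstop

    import (
        "context"
        "time"
    )

    // RunWithTimeout starts work in a goroutine and returns only after that
    // goroutine has actually exited, so no stale work can overlap with
    // whatever the caller does next.
    func RunWithTimeout(parent context.Context, timeout time.Duration,
        work func(context.Context) error) error {

        ctx, cancel := context.WithTimeout(parent, timeout)
        defer cancel()

        done := make(chan struct{})
        var err error
        go func() {
            defer close(done) // signals "really stopped", however work exits
            err = work(ctx)
        }()

        <-done // blocks until the worker has observed cancellation (or finished)
        return err
    }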

> Both postgres and file locks will correctly handle cleanup if the process dies (postgres notices the connection is dead and ends the transaction, the kernel releases filesystem locks a process is holding when it terminates).

I was under the impression that it takes time for Postgres to notice the connection is dead; am I incorrect there? I thought that if a process terminates unexpectedly, Postgres would wait for its own timeout before terminating the client and freeing any resources used by it. I know it won't leak memory for forever, but having a table locked for 30 extra seconds could be a big problem in some situations (i.e. a monolithic DB that practically the whole company uses).


> In a context world, you would use io.Writer/Reader or net.Conn to write small bits of data and check whether the context is cancelled in between 1KB writes (or whatever size).

So don't use 'io.ReadAll' or 'io.Copy', since they don't take a context and thus don't internally do what you're suggesting. I guess the stdlib authors don't know how to use context either.

Anyway, `reader.Read()`, even with just 1KB, can still take arbitrarily long. There's plenty of cases where you wait minutes or hours for data on a socket, and waiting that long to respect a context cancellation is of course unacceptable.

> Postgres .. connection timeout

Killing a process closes all its file descriptors, including sockets, and closing the tcp socket should cause the kernel to send a FIN to the server. Postgres should react to the client end of the socket closing pretty quickly.

This does rely on you using the linux kernel tcp stack, not a userspace tcp stack (in which case all bets are off), but in practice that's pretty much always the case.


> In a context world, you would use io.Writer/Reader or net.Conn to write small bits of data and check whether the context is cancelled in between 1KB writes (or whatever size).

That can still block pretty much indefinitely. Imagine you're a client reading from a server, but the server isn't in any hurry to send anything, and keepalives are keeping the TCP connection open, and no network blips occur for months, so your goroutine is blocked on that read for months.

The much simpler and more robust thing is to propagate context cancellation to socket close. The close will abort any blocked reads or writes.

e.g.

    go func() {
      <-ctx.Done()
      _ = conn.Close()
    }()
You'll still observe and return an error in the read/write call, and close is idempotent, so this doesn't take anything away from your existing logic and really just acts as a way to propagate cancellation.

I don't know how well this works for other types of closeable reader/writer implementations. It may not even be thread-safe for some of them. But this worked great when I tried it for sockets.

> I typically just defer a function in the goroutine that either writes to an "IsDead" channel or sets a mutex-protected boolean

I try to just use `errgroup` whenever possible, even if no error can be returned. It's just the most fool-proof way I've found to make sure you return only when all nested goroutines are returned, and if you're consistent about it then this applies recursively too. It's a way to fake "structured concurrency" with quite readable code and very few downsides.
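
For anyone unfamiliar, the shape of that is roughly the sketch below (golang.org/x/sync/errgroup; the fetch functions are placeholders). Wait() only returns once every goroutine started with Go() has returned, and the derived context is cancelled as soon as one of them fails.

    package fetch

    import (
        "context"

        "golang.org/x/sync/errgroup"
    )

    func fetchBoth(ctx context.Context, fetchUsers, fetchOrders func(context.Context) error) error {
        g, ctx := errgroup.WithContext(ctx)

        g.Go(func() error { return fetchUsers(ctx) })
        g.Go(func() error { return fetchOrders(ctx) })

        // Wait returns the first non-nil error, and only after *all* goroutines
        // started with Go have returned - nothing can outlive this call.
        return g.Wait()
    }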


Sockets and pipes generally have SetReadDeadline() and SetWriteDeadline(). With io.Reader and io.Writer in general you have to resort to a separate goroutine and a channel, otherwise they would have to conform to more restricted interfaces, say ReadDeadliner/WriteDeadliner, which is not always possible.
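
One common bridge between the two, as a sketch: copy the context's deadline (if it has one) onto the connection, so a blocked Read fails with a timeout error instead of hanging. It only covers deadlines, not explicit cancellation; the close-the-socket trick upthread covers that case.

    package connctx

    import (
        "context"
        "net"
        "time"
    )

    // ReadWithContext applies ctx's deadline (if any) to the connection before
    // reading, so a blocked Read returns a timeout error rather than hanging.
    func ReadWithContext(ctx context.Context, conn net.Conn, buf []byte) (int, error) {
        if deadline, ok := ctx.Deadline(); ok {
            if err := conn.SetReadDeadline(deadline); err != nil {
                return 0, err
            }
            defer conn.SetReadDeadline(time.Time{}) // clear the deadline afterwards
        }
        return conn.Read(buf)
    }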


At least two correctness risks remain with Go's approach:

goroutines observe this cancellation asynchronously. You can cancel an operation from your point of view, and begin another one (a retry of the first, or another operation altogether), but the original one is still running, creating side effects that get you into unintended states. If one can be running, potentially any number can be. You have to make sure to actually join on all past operations completing before beginning any new ones, and not all libraries give you a way to synchronously join on asynchronous operations. If you write your own, it's very possible, it just takes a lot of care.

When you select { } over multiple non-default arms like this, and more than one of them is "ready", which one gets selected is random. This avoids starvation and is the right way to implement select { }, but most code that checks for cancellation incorrectly pretends this is not the case and that it will observe cancellation at the earliest possible time. It actually has an exponential probability series of observing cancellation later and later, compounding with the above issue. If the work done between selects is long (e.g. CPU or IO work) this compounds even further. The correct solution is to select for cancellation again on just one non-default arm, but that is not "idiomatic" so nobody does it.
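
Spelled out, that looks roughly like the sketch below (`jobs` and `process` are stand-ins): checking the cancellation arm by itself first means a ready Done channel can't repeatedly lose the random choice to ready work.

    package worker

    import "context"

    func loop(ctx context.Context, jobs <-chan int, process func(int)) error {
        for {
            // Check cancellation on its own first; in a combined select a ready
            // arm is chosen at random, so cancellation can keep losing the toss.
            select {
            case <-ctx.Done():
                return ctx.Err()
            default:
            }

            select {
            case <-ctx.Done():
                return ctx.Err()
            case job := <-jobs:
                process(job)
            }
        }
    }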

All of this is manageable with due care. Some libraries make it impossible because they kindly encapsulate not just what you don't need to know but what you actually do need to know if you want correct deterministic behavior. In my experience, very few developers, even of popular libraries, actually understand the semantics Go promises and how to build correct mechanisms out of them.


The context done channels are clearly the way when dealing with all native Go code.

Although to the grandparent's point, when you're dealing with executables or libraries outside of your control, the only true way I know of to get a "handle" on them is to create a process, with its pid becoming your handle.

In situations like image processing, conversion, video transcoding, document conversion, etc. you're often dealing with non-Go-native libraries (although this problem transcends language), and there's no built-in way to time-bound their execution. That is to say, you often need to consider the halting problem and put time bounds and preemption around execution. So what I've had good success with is adding a process manager around those external processes, and when a timeout or deadline is missed, killing the pid. You can also give users controls to kill processes.

Obviously there are considerations with resource cleanup and all sorts of negative consequences to this, depending on your use case, but it does provide options for time bounding and preempting things that are otherwise non-preemptable.
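
For reference, Go's standard library already packages the pid-as-handle idea for external executables; a rough sketch (the ffmpeg invocation and the timeout are just examples):

    package transcode

    import (
        "context"
        "fmt"
        "os/exec"
        "time"
    )

    func transcodeWithTimeout(parent context.Context, input, output string) error {
        ctx, cancel := context.WithTimeout(parent, 2*time.Minute)
        defer cancel()

        // When ctx expires, the child process is killed and Run returns an error.
        cmd := exec.CommandContext(ctx, "ffmpeg", "-i", input, output)
        if err := cmd.Run(); err != nil {
            if ctx.Err() == context.DeadlineExceeded {
                return fmt.Errorf("transcode timed out: %w", err)
            }
            return err
        }
        return nil
    }

The caveat stands, though: the child gets no chance to clean up after a hard kill.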


Ahh, I hadn't considered operating across languages. That does make it awkward if you can't inject some Go (or other) controls in the middle by having Go manage the loop and only calling incremental processing in the other library.

That is awkward. My first thought is "just don't use the library" but that's obviously a non-starter for a lot of things, and my second thought was "transpile it" which sounds worse.

I suppose the signals do allow the binary/library to do its own cleanup if it's well-behaved, so it's really a binary/library quality issue at the end of the day as is something Go/Python/whatever native. There isn't a massive semantic difference between ctx.Done() and a SIGHUP handler; a SIGHUP handler can also defer killing the process until a sane point after cleanup.


Exactly!


All processes can crash at any time due to out-of-memory, bugs, hardware failures, etc. so this should not introduce additional inter-process failure modes. It may reveal existing failure modes, of course!


The general issue is isolation, for sure: you need to be able to clean up the operation regardless of what state it is in when you abort it, and that means that anything you share with it needs to be designed for this. Threads in most OSs simply don't provide enough isolation to do this; two threads share enough resources that you can't effectively do this (in part because many other parts of the system assume that threads are not an isolation barrier), whereas processes and the means of sharing things between processes are specifically designed so that this is possible.


I recently had to impose a timeout on a dbus operation using python and the only way I could get it working reliably was with a subprocess. Everything I tried using threads was crazily unreliable.


> the only way I could get it working reliably was with a subprocess

Totally agree.

I've worked in the past on a multi-camera computer vision system in which I used video4linux to capture images from USB cameras which would sometimes misbehave and block on an ioctl() forever. Not even SIGALRM helped.

The only mechanism that could properly interrupt this situation was a process killed via SIGKILL, which caused the OS to clean up all the resources afterwards. Eventually, we connected the cameras to a USB-controlled relay that mechanically replugged the cameras after a certain number of consecutive timeouts (ioctl freezes).


> multi-camera computer vision system

That sounds interesting too.


You can sandbox time limits if you firewall state. If nothing in the code being timed out can affect its caller other than raising a "thing has timed out" error, then timing out is safe. Process boundaries provide this firewall on most systems, but they don't have to be the only boundary. I once read about an academic language that had this feature - I don't remember what it was called. It had checked and unchecked exceptions like Java. Checked exceptions are thrown, declared and caught like regular exception control flow. Unchecked exceptions can be raised anywhere, and can only be caught in a way that creates a state firewall. I don't remember how that was enforced.


In such a system each component still needs to be interruptible. If you're waiting on a blocking operation like disk or network I/O, the caller might give up but the request will still hang around until completion at which point it gets thrown out.

In the case of a blocked process, you may be able to force kill it at a system level but then you risk uncleaned up state (leaked connections/resources)

For instance, this is the default behavior with Postgres (queries complete even if the connection is closed)


In cooperative settings, say when writing your own web server, .NET Core does this quite well with CancellationToken. Basically it's just a convenient synchronized bool that everyone passes to their callees and occasionally checks to see if they should abort what they are doing. Most async APIs have overloads that take a CancellationToken, so even those operations are cancelled as soon as you cancel the token. They are really useful to impose time limits and for making sure you stop processing a request after it was aborted by the user. And because there isn't much magic, it isn't too hard to make sure you reset your resources.

But that only works if you trust the code you are executing. If you don't, you pretty much have to either use the primitives provided by your OS or run your own interpreter (Lua and WASM are popular for a reason).


In Go there are contexts that do something similar.

.NET and Go seem to have fairly good request cancellation facilities.


I've had a similar problem before.

I wanted to add a "!calc" trigger to an IRC bot written in Python. Rather than implement the Shunting Yard Algorithm [0] to make sure PEMDAS was handled correctly, I wanted to use `eval` (With some extra steps with the AST library to make a whitelist of AST nodes to prevent code injection), but with Python supporting essentially infinite integer precision, someone writing "!calc 9 * 9 * 9 * 9 * 9 * 9 * [..]" would eat up my entire CPU.

I solved it the same way. I execute it in a separate process and terminate it if it takes more than a couple seconds.

[0] https://en.wikipedia.org/wiki/Shunting_yard_algorithm


Interesting, there's a ruby feature request[0] for a safer thread api (this has been a known issue for a long time), and I just saw that it got assigned a couple of months ago. Maybe this'll get addressed in the next ruby version.

[0] https://bugs.ruby-lang.org/issues/17849


I hit a pretty brutal (and fun, if you are into this sort of thing) problem with Timeout 11 years ago: https://stackoverflow.com/questions/17237743/timeout-within-...


Reminds me of early in my career as I was struggling with turning errors in my code into useful messages to the user. I thought exceptions were so cool and that I could just pop them up the stack and voila `print(err.msg)`.

It took me a while to realize that good error reporting was part of the UX as much as any other feature I wanted to create for users, and that it deserved a seat at the table of the API interfaces.

I think handling timeouts properly is exactly the same case; like TFA mentioned, there is no such thing as a safe/general way to externally terminate a thread of execution. If you want timeouts to be part of your UX, you need to build it just like any other feature.


>general way to externally terminate a thread of execution

There's really no safe way to cover all cases unless you create a really scoped-down interface that is explicit in what it does.

One use case might allow abrupt interruption with no cleanup (like in the case where the OS is powering down), and in another it might be imperative to clean up (a long-running program that can leak resources).


Some notes about this:

* This issue plays out differently for compiled languages vs interpreted languages, but does apply to both. In particular, in many interpreted languages, the unfixable (for user code, at least) issue is "what happens if I get an exception after constructing the object but before assigning it to a local/member variable"? For compiled languages a similar issue may be "what if it's still in a register", though they generally handle this.

* Even for languages without dangerous thread APIs, it usually applies to MemoryError, and possibly others (e.g. signals, at least "interrupt" and "quit") as well (e.g. Python, despite having increasingly broken signal handling in general, still causes this for the main thread)

* True asynchronous exceptions aren't actually necessary; it suffices to manually check for exceptions on every non-finite backward jump (that is, every non-unrollable loop iteration except the first), every (generally, uninterruptible) syscall (or call to native code), and every memory allocation. For some particular language design choices, you might also want to check every memory access, though this is unlikely to be the same.

* This is much less of an issue in a language with destructors; try-finally, Python-style-with, try-with-resources, defer, etc. are all fatally flawed. It's still possible to write bad code in a good language if you try hard enough, but at least it is possible to write good code as well.


One of the things I've come to appreciate about Reactive Streams is the ability to catch a cancel event and handle it appropriately in your stream. A cancel event can be a timeout or a manual disposal of your async process, but whatever the case you are given the opportunity to deal with it.

    fun doLongRunning(): Single<Result> {
        return Single.create { emitter ->
            val longProcess = /* whatever */
            // register cleanup first, so a timeout or disposal can stop the work
            emitter.setCancellable {
                cleanup(longProcess)
            }
            emitter.onSuccess(longProcess_result)
        }
    }

    doLongRunning()
        .timeout(5, TimeUnit.SECONDS)
        .subscribe({ result ->
            // use the result
        }, { error ->
            // timeout (or other error); disposal triggers the cancellable above
        })


What language is this? It looks like some sort of Java/JVM.


Probably RxJava https://github.com/ReactiveX/RxJava (with Kotlin?)


That's kotlin


Bear in mind that this article is old. Since then, Ruby supports a handle_interrupt mechanism not far from the method described as working in Java: https://rubyapi.org/3.3/o/thread#method-c-handle_interrupt


The same kind of issue prevents Celery (a python task queue) "soft timeout" feature from working reliably.

It throws an exception _anywhere_ in the worker process to signal the timeout, but in practice the exception is often eaten by too-generic except blocks in library code and never reaches user code.

An interesting solution is structured concurrency because it introduces "raise-safe" points (the "await"s), but the ecosystem is not there yet.


>but in practice the exception is often eaten by too-generic except blocks in library code and never reaches user code

Interesting, C# has a special case for this:

>ThreadAbortException is a special exception that can be caught, but it will automatically be raised again at the end of the catch block.


I like Python but it's definitely a pain when it comes to request cancellation

I saw a StackOverflow post about some code that only worked on Linux and not Windows, and it turned out sleeps on Windows are implemented using an uninterruptible API (I think there are other APIs on Windows, but the developers of Python picked the one they did since it was simple/low overhead).

* Maybe not a perfect description but it was something about a non interruptible Windows API

Not sure if they changed it but iirc Python was using a less precise timer on Windows that couldn't sleep less than 14-15ms as well


> C++: std::threads are not interruptible.

This is kinda half-true, since C++ inherits aspects of the C environment, in this case on POSIX systems pthreads async cancellation [https://pubs.opengroup.org/onlinepubs/7908799/xsh/pthread_ca..., https://www.man7.org/linux/man-pages/man3/pthread_cancel.3.h...]

You can control this through pthread_setcancelstate(), and if it is enabled it will only trigger on specific documented places in C library / system API calls. It is vaguely saner than this generic "throw exception on another thread" concept, but still pretty insane IMHO. Haven't used it myself, not sure if there are circumstances/use cases where I'd enable it.

(To be fair, a lot of other languages are built on top of the C library too and inherit this same behavior, so this isn't quite C++ specific.)


C++ also has stop_token, so although threads are not natively interruptible, a mechanism for implementing it exists in the standard.


C has this problem too: see signal-safety(7).

> * If a signal handler interrupts the execution of an unsafe function, and the handler terminates via a call to longjmp(3) or siglongjmp(3) and the program subsequently calls an unsafe function, then the behavior of the program is undefined.

Also note that pthread_cancel() provides a way to block asynchronous cancellations via pthread_setcancelstate().


Am I weird that I've been programming in C for decades and have never used longjmp? For the longest time I thought it was some legacy function that was only used on ancient segmented memory systems, but it instead turns out to be a complete program flow destroyer that I can't see a good use for.

I have seen some libraries that do error handling via the user passing a callback for a longjmp, but even that seems ill advised. Fortunately it is always optional.


LambdaMOO which dates from the early 90’s uses setjmp/longjmp to implement structured exceptions in C, which it throws on task timeouts (though not for anything else afaik). Even has try/throw/catch macros.


Longjmp is used by Postgres for transaction aborts. With C, there's not really a better option available.


That's not from within signal handlers, though. (i.e. it relates to this specific longjmp discussion but not the root post re. exceptions on other threads.)


It unfortunately is used from within signal handlers, albeit only in specific cases (SIGFPE). There used to be several more, but we luckily largely cleaned that up over the last few years.


Meh. Well. Good to hear on the cleanup. Didn't know it used to be different :/

Re. SIGFPE, to be fair, it feels a bit like the "asynchronous vs. synchronous abort¹" thing on CPUs; synchronous aborts are reasonably doable while on asynchronous aborts you're pretty much left with torching things down far and wide.

(SIGFPE should hopefully be synchronous; it's in fact closely connected to sync/async CPU aborts...)

[¹ frequently also called exceptions, depending on the CPU architecture, but this post already uses "exception" for the language level concept]


> Meh. Well. Good to hear on the cleanup. Didn't know it used to be different :/

If you want to be scared: Until not too long ago postgres' supervisor process would start some types of subprocesses from within a signal handler... Not entirely surprisingly, that found bugs in various debugging tools (IIRC at least valgrind, rr, one of the sanitizer libs).

> Re. SIGFPE, to be fair, it feels a bit like the "asynchronous vs. synchronous abort¹" thing on CPUs; synchronous aborts are reasonably doable while on asynchronous aborts you're pretty much left with torching things down far and wide.

Agreed, I think it's quite reasonable to use signals + longjmp() for the FP error case. In fact, I think we should do so more widely - we lose a fair bit of performance due to all kinds of floating point error checking that we could set up to instead signal.


To qualify this a bit, the reference to longjmp() is kind of a red herring; in general the list of things you can do in a signal handler is very limited (as documented by signal-safety(7) that you reference.)

longjmp() is just one prominent item on that list of things you can't (or shouldn't do.) None of the items on that list allow you to cleanly terminate the thread (you can send another signal to yourself or another thread, but that's only very conditionally helpful.)

pthread_cancel() does the one and only sane thing - implement a flag mechanism that is checked in a bunch of well-specified places, and gated with an enable. Whether these well-specified places work for a particular application is definitely a tough question, but either way anything sane will be some type of "set flag and interrupt long-running (I/O) operations" combination.


This post got me curious about similar scenarios in Elixir, and despite working with Elixir every day, I'm a bit surprised by one of the results I found:

  # Recursive function that never terminates:
  iex> f = fn i, f -> if rem(i, 100000) == 0 do IO.inspect(i) end; f.(i+1, f) end
  # Start the function in a task with a 1 ms timeout
  iex> Task.async(fn -> f.(0, f) end) |> Task.await(1)
My expectation here is that the task would output 0, then get killed when it hits the timeout. And I do get a timeout "exit" message logged with the child pid. But ALSO, the numbers keep printing as though the child task is still running! It appears to be specific to the configuration of the iex process but I'm not sure what it is - any Elixir/Erlang folks who can explain exactly what is happening here?


The source code for Task is very readable but also kind of subtle, and makes for a good study. I would say definitely give it a shot to trace the flow from Task.async[0] to Task.await[1] to Task.Supervised.start_link[2] to Task.Supervised.reply[3]. There is some subtle interplay with regard to waiting for messages/timeouts and process links.

[0] - https://github.com/elixir-lang/elixir/blob/v1.16.3/lib/elixi... [1] - https://github.com/elixir-lang/elixir/blob/v1.16.3/lib/elixi... [2] - https://github.com/elixir-lang/elixir/blob/v1.16.3/lib/elixi... [3] - https://github.com/elixir-lang/elixir/blob/v1.16.3/lib/elixi...


Task.await tries to exit the calling process when the timeout hits, but IEx traps the exit in that process, so it doesn't terminate and thus the linked task process doesn't either, I think? If I do all of this wrapped in another task, rather than directly in IEx, then I observe the innermost process get terminated by the process link after the intervening one doesn't trap the exit.

Relevant from https://hexdocs.pm/elixir/1.4.5/Task.html, which you've probably already seen:

> If the timeout is exceeded, await will exit; however, the task will continue to run. When the calling process exits, its exit signal will terminate the task if it is not trapping exits.


Bletch, I had the wrong version of the documentation bookmarked—here's the revised relevant sentences from https://hexdocs.pm/elixir/1.16.2/Task.html#await/2 (my system has 1.16.2):

> If the timeout is exceeded, then the caller process will exit. If the task process is linked to the caller process which is the case when a task is started with async, then the task process will also exit.

Reasoning is the same though; self() preceding/following it in the IEx session still shows the same evaluator process alive.


Good call - I think you're right about IEx trapping the exit. The confusing part is that it still logs out this message:

  ** (exit) exited in: Task.await(%Task{mfa: {:erlang, :apply, 2}, owner: #PID<0.954.0>, pid: #PID<0.964.0>, ref: #Reference<0.1455049351.208994307.4788>}, 1)
    ** (EXIT) time out
    (elixir 1.14.3) lib/task.ex:830: Task.await/2


Yeah, I agree that the way that comes out is really awkward. The first line there is reporting the context from in the Task module (because task.ex includes it explicitly in the exit call), and then the second is reporting a strerror-like translation of the :timeout reason, but the lines aren't clearly linked together that way and look more like chained events. All of the %Task{} stuff of course is just the inspection of the argument, but if your eye jumps to the “pid” part it can look like it's reporting that that's the exiting process even though it's not. And then the part where the first “exited” is written in past tense as though the exit happened, when in fact it's describing what the trap just caused to not actually fully happen, is probably the most confusing of all.


I never learned Elixir, but my first guess is IO.inspect is sending a message that is printed by a different process. Then the prints after exit are just the IO process working through its mailbox.

Alternatively, the await might be killing the waiting process, not the process being waited on.


Good thoughts, but the printing continues indefinitely, and the documentation for Task.await explicitly says the child process will be killed: "If the timeout is exceeded, then the caller process will exit. If the task process is linked to the caller process which is the case when a task is started with async, then the task process will also exit". Processes can be configured with the behavior you describe, but it's not the case with Task.await.


As mentioned in the article, this isn't limited to Ruby. It just shows that the Ruby timeout can be dangerous, which is a great reminder to those who use it. It would be great if the danger was mentioned in Ruby's Timeout class documentation.

The same type of danger can be encountered with the Linux `kill` command, depending on which signal you send, losing power, the operating system terminating the process, etc. If you have state in flux, hard-killing a process can have unexpected outcomes.

I'd also like to say that just because something is dangerous, doesn't mean it should not be used. Just use it carefully. Think of it as dynamite. Dynamite is very useful in some cases, and always needs to be handled with care.


I recently had to debug an issue due to the Java equivalent; the analysis is basically the same. Raising an exception anywhere including code that is not expecting it tends to cause subtle and unrepeatable bugs.


But you can't do that in Java. Did you use the Thread.stop() method they mentioned in the blog post (removed in Java 8)??


This is a minor inaccuracy in Julia's post: Thread.stop() is deprecated, but it has not been disabled or removed: https://docs.oracle.com/en%2Fjava%2Fjavase%2F17%2Fdocs%2Fapi...().

InterruptedException is better, but requires something closer to cooperative multi-tasking. If the thread is in a long-running loop that's not doing IO or it swallows the InterruptedException, you won't actually be able to kill the thread.

At work, someone had implemented an interface that let you kill a thread prior to my joining the company, so I found out that the warnings are very real when someone did that and it left the application unable to fetch DB connections.


ThreadGroup.stop() was removed in 23, so maybe in the future it'll actually be removed.


InterruptedException is better but still handled incorrectly/badly in user code all the time. Sort of like interrupting processes: one process dying won’t corrupt the state of another, but the other process can continue on with bad assumptions.


The underlying problem is lack of isolation, and this is an area where the Erlang process model shines: processes are fully isolated and you can send a signal externally to terminate any process you want.


Rust has a similar problem, where any future can be dropped at any time in async code, effectively cancelling the future. This means all async code has to be ready to be unexpectedly cancelled at any await point, and implement any defensive cleanup code, etc

Some more discussion of the problems this creates in Rust: https://without.boats/blog/asynchronous-clean-up/


It's definitely an issue, but it's also an improvement on the situation in other languages, as the await points where cancellation is possible are visible in the code.

One solution would be linear types that can't be dropped, but interactions with generics and panics make it hard.


If a future is canceled it will run its destructors, so cleanup will usually happen correctly even if the developer didn’t think about it. Connections and files will be closed, memory released, locks unlocked. There are exceptions to that of course, because not running something to completion may break business logic - but no language can protect from those kinds of errors. And cancellation can happen only at await points, which makes it much easier to analyze than being interrupted at any place like in Java.


This is not a bug, it's a feature. You have the ability to handle errors at defined points. As reality is such that errors can happen at any of these points, being able to gracefully handle them is as good as it gets. If you want to compose error handling along the type-of-error axis or the await-point-axis, the language already provides tools for you to do so. Again, not being able to blatantly ignore these errors is a feature, not a bug.

In NodeJS/Browser environments, you have the exact same behavior where promises you are awaiting can get rejected.


Maybe there is a nuance that I am missing, but having in mind that an exception can be thrown "in any of your code, regardless of whether it could have possibly raised an exception before" when writing code seems sane to me. It seems like the issue being raised is that people are expecting code snippets to be perfectly transactional, which also seems wrong.


Yes. This is why you should always use context managers and finally blocks in languages with exceptions. Always assume that anything can fail at any time and accept that not everything can nor should be recovered from.

On the other end of the spectrum you have Go, where in fact every line does need to have its error conditions explicitly checked. How do they handle this situation?

What is dangerous with finally-blocks is if they start consuming a lot of time, such that the timeout interruption itself makes the timeout miss its deadline. Usually you need 2 timeouts, one for the main operation and a second for the cleanup.
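
A sketch of that two-timeout split in Go terms (doWork and cleanup are hypothetical): the cleanup gets its own deadline derived from a fresh context, because the work context is typically already expired by the time cleanup runs.

    package twophase

    import (
        "context"
        "time"
    )

    func runWithCleanup(parent context.Context,
        doWork, cleanup func(context.Context) error) error {

        workCtx, cancelWork := context.WithTimeout(parent, 5*time.Second)
        defer cancelWork()

        err := doWork(workCtx) // main operation, bounded by its own deadline

        // The work context may already be expired here, so the cleanup gets a
        // separate, fresh budget - otherwise it would fail immediately.
        cleanupCtx, cancelCleanup := context.WithTimeout(context.Background(), 2*time.Second)
        defer cancelCleanup()

        if cerr := cleanup(cleanupCtx); cerr != nil && err == nil {
            err = cerr
        }
        return err
    }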


So what's the alternative to Timeout? Or does it mean that if I think I Timeout is the solution then my approach is wrong?


The other answer is correct I believe, but I think the most common recommendation I've seen for rails boils down to "Use Timeout, and then kill and restart the process" (ref: [0], [1]), which obviously doesn't feel great for performance when a timeout in one thread can cause all other threads to need to be killed.

[0] https://www.schneems.com/2017/02/21/the-oldest-bug-in-ruby-w...

[1] https://github.com/zombocom/rack-timeout (c.f. term_on_timeout)


In .Net, use CancellationToken ( https://learn.microsoft.com/en-us/dotnet/api/system.threadin... , https://learn.microsoft.com/en-us/dotnet/api/system.threadin... ).

The two big benefits are (1) if you pass the cancellation token to the task that might be canceled, hopefully it will be obvious that the task might be canceled; and (2) the task that might be canceled has a way to find out it’s being canceled and, if that ever happens, it can clean up.


And in Ruby's case?


(Almost) every OS kernel has an API to handle timeouts: poll, select, kqueue, epoll, etc. Just use that.


The word "just" is doing a lot of work there!


You can replace every blocking call with a timed one. In Python you can just call settimeout on a socket. In C you might want to write your own recv_timeout function, but it's not that difficult.


I have no idea how I would use settimeout with a socket to solve any of my problems that involve causing a piece of code to terminate early if it takes longer than a specific amount of time.


If you are using a piece of software which has a part that can take a long time to execute and doesn't let you set timeouts for long-running operations, then I would send a PR to solve that issue.

When said software is closed source and doesn't have that feature then that company sells a problem and not a solution.


What is the code doing? Most of the time it's waiting for a socket. If it's not doing that, do the equivalent for whatever it is waiting for. If it's CPU-bound, add interrupt checks.


Places I've wanted to implement this in the past include:

- Run a subprocess such as "rg" and terminate early if necessary

- Run a SQLite SQL query that errors if it takes more than a second

- Same but for other DBs - MySQL, PostgreSQL, Elastic, Mongo etc

- Execute a fragment of JavaScript in something like QuickJS with a time limit

- Brute force some kind of algorithm with a time limit - using a library I did not write myself


Even Windows has WaitForMultipleObjects() https://learn.microsoft.com/en-us/windows/win32/api/synchapi...


For web workers, use a single thread per process model, then kill the worker if there's a timeout.


Depends heavily on how trusted/untrusted the process is. If you trust the process won't actively try to circumvent you, just forking and calling alarm() as the first thing is often enough. The process can cancel the alarm, so it's not a precaution against malice.

If you can't trust the process, fork, wait in the parent, and kill the child if it hasn't terminated.

I don't think alarm() is exposed in Ruby, unless it's been added recently, but it's easy enough to add. However since fork + kill is more reliable anyway, it's often a better choice.


> Depends heavily how trusted/untrusted the process is.

Timeout is for threads not processes, and not everything that would be done on a thread can be conveniently rewritten to be a separate process.


If you want reliable timeouts, that is your tradeoff. There are too many awful failure scenarios to resolve with threads.


Timers / cooperative timeouts. AFAIK they’re the only way to ensure that effectful code will never end with a corrupted state.


You can manually check an "is timed out" object and ensure that you're configuring timeouts that are smaller on any blocking code (basically I/O)

Say you're looping over 1m items, maybe every 1k you check to see if you're out of time and return or raise
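
In Go terms (borrowing the context type from upthread), that pattern is just a cheap check once per batch; the item type, batch size, and process function are placeholders.

    package batch

    import "context"

    func processAll(ctx context.Context, items []string, process func(string)) error {
        for i, it := range items {
            if i%1000 == 0 { // check once per batch so the check itself stays cheap
                if err := ctx.Err(); err != nil {
                    return err // out of time: stop and report the cancellation
                }
            }
            process(it)
        }
        return nil
    }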


You want something patterned like select. Structuring the application around threads blocking on queues and avoiding exceptions entirely makes things far more sane.


Discussed at the time:

Why Ruby’s Timeout is dangerous and Thread.raise is terrifying - https://news.ycombinator.com/item?id=10638629 - Nov 2015 (20 comments)


Apologies dang, unrelated.

But I hope you can take a quick look at user dopp0, who at a glance seems to have been mistakenly shadowbanned for quite some time (not affiliated in any way, just saw a comment now).


Thanks for watching out for a fellow user!


The approach taken to this problem by Cats Effect (a Scala concurrency library) is interesting. It allows cancellation of a fiber from outside, but lets blocks of code be marked as uncancelable. If a fiber is cancelled while executing one of these blocks, it will complete the block before cancelling. This protects against cancellation in between two operations that would leave the program in a broken state.

The drawback of this approach is that the onus is on anybody writing code which might be cancelled to correctly mark the uncancelable regions.


I don't think you can do better than this.

There's a fundamental tension between 'stop now' and 'close your file handles before stopping'.


From how you describe it, it sounds like it's taken the same approach as Haskell's async exceptions + mask/uninterruptibleMask, which to me seems like one of the best solutions around, so props to them


Haskell seems to be the only language where throwing exceptions at another thread isn't a box of footguns. Anyone know why that is?


I don't know why you think that it isn't. You can do so, but AFAIK almost no libraries are written with the expectation that their computations might be interrupted at any time by another thread. The best you can hope for is that you "only" interrupt some pure computation and not throw your exception into the middle of some delicately sequenced IO side effects.


> AFAIK almost no libraries are written with the expectation that their computations might be interrupted at any time by another thread.

Haskell libraries are written this way all the time. Exception-safe libraries with async-aware handling are pretty commonplace nowadays (partially thanks to safe-exceptions and unsafeio providing a nice API to deal with them.)


Absolutely false. At least not in my experience. Only novice Haskell programmers write code that can't handle computations being interrupted by another thread. A published and maintained Haskell library is unlikely to have these issues.


Erlang / Elixir is almost above that:

- the shared-nothing structuring means it's harder to corrupt external state

- that processes can die at any moment is part of the culture

- the supervision tree means the most likely case for killing another process is a parent killing a misbehaving child, which they can then handle as if the child had faulted


Another key to Erlang's success here is the ports system. Basically "external" resources have a defined cleanup mechanism even if the process that created them dies. An open socket under the hood has its own thread it can use to clean up. With effort you can do this with your own things too through the linking mechanism, having some process watch another process using your resources and then have an execution thread that can clean up if the user spontaneously dies.

Some of it too is just Erlang being too, err, for lack of a better word "weak" to have some of the problematic resources. Since Erlang has no "locks", you simply can't take a lock and then kill the thread responsible for unlocking it. You can construct such locks at a higher level, but Erlang's design tends to encourage other designs. For instance, if I were designing a higher-level lock, I'd design it to link to the process taking the lock and release it if it dies. This can still theoretically get you into trouble but especially if this is documented as the semantics of the lock, you have to work a lot harder at it.

If you really worked at it you could manage to screw up your Erlang system with thread killing through your own lock implementation and such, but it still will be something you can recover from the REPL, you won't confuse BEAM itself. In an imperative language you'd be ranging from lucky to be able to reach in somehow and fix it to there simply being no way.


While I agree in the aggregate, I do want to mention that Haskell code written by novices is absolutely not correct in the face of asynchronous exceptions. Novice people write `hGetContents h <* hClose h` which obviously leaks file descriptors in the face of both synchronous and asynchronous exceptions.

There are also plenty of footguns when you attempt to catch exceptions. A piece of code might intend to catch all synchronous exceptions (raised in the current thread) but accidentally catches asynchronous exceptions. Again this is a novice issue. When that happens, cancelling a thread doesn't work because thread cancellation is implemented with async exceptions.

What makes things good is that, firstly, pure code can only raise exceptions but not catch them. (If you need error handling in pure code, you use the Either monad, not exceptions.) This dramatically reduces the chances of coding mistakes because exceptions aren't as overused as in some other languages. Secondly, the Haskell community more or less shuns direct use of these low-level exception handling APIs and everyone uses the async library https://hackage.haskell.org/package/async-2.2.5/docs/Control... which solves not only this problem but also provides nice abstractions to build concurrent computations. Check out its Concurrently type: you now have an Applicative that represents concurrent computations, which lets you reuse all your intuition from other Applicative instances. You don't even need to manually create threads or kill threads. Overall, insofar as Haskell is the only language where throwing exceptions at another thread isn't a box of footguns, it's only because the language allows a sufficient amount of abstraction power that frees the programmer from using the error-prone low-level APIs. The actual low-level APIs are still full of footguns.


Absolutely this. And when you look through the bug history of those low-level APIs there's a lot of evidence.

That said, the other big difference with Haskell is the low-level API actually provides functionality to solve the exact problem of an async exception being raised anywhere via `mask`. It's still hard to use correctly but at least it's possible.


Async is a blessing. So many languages that claim to be "great at concurrency" don't even provide 10% of the async API.


I recommend reading "Parallel and Concurrent Programming in Haskell" by Simon Marlow in which he explains that back in 1997 he decided to "build concurrency right into the core of the system rather than making it an optional extra or an add-on library".


It has good exception libraries that differentiate synchronous vs asynchronous exceptions.

It also has good learning material on the subject in the form of thorough blog posts.

It has robust, typed abstractions for throwing, catching, and masking exceptions.

Its exception runtime implementation is better built too.


bracket [1] is easy to use and ensures that the cleanup operation runs, the Control.Exception module is also well documented [2]

[1]: https://hackage.haskell.org/package/base-4.20.0.1/docs/Contr...

[2]: https://hackage.haskell.org/package/base-4.20.0.1/docs/Contr...


Not to mention the widespread safe-exceptions and unliftio provide an API that makes it hard to shoot yourself in the foot.

Pretty funny that Haskell of all languages also has best-in-class exceptions & exception handling haha.


"The finest imperative programming language"


This is the main reason I insist on curb - we need reliable HTTP timeouts.


Can you share a reference to what curb is? I can only find a project containing Ruby `curl` bindings.


Yep. Multithreaded coding is hard.


I'm not surprised; the whole language (Ruby) and ecosystem favor "ease of use" (i.e. the ability to write code fast) over correctness in so many places. Subtle details that other languages would require you to handle in order to robustly handle non-happy-path conditions are often ignored, lumped together, return just nil, etc. It seems careless, but a large group of engineers love it because it gives the illusion that the world is simple.


If you read TFA, you'd have seen this problem isn't exclusive to Ruby at all, several other languages have a similar API.


This might shock you, but you don't have to have a global monopoly on making bad design choices in order to do so.


Aborting threads has issues in most languages. Timing out an operation however is a solved problem with many safe implementations.


Ruby is not alone in this. Go shares similar philosophy to just half-ass the implementation. Well, at least it's not as brittle and impossibly slow as Ruby.


Do you have an example? My impression of the Go standard library is that they picked a pretty decent compromise of abstraction levels over the various OS facilities they use under the hood.


The classic reference here is “I want off Mr. Golang’s Wild Ride”:

https://fasterthanli.me/articles/i-want-off-mr-golangs-wild-...


"Ruby is not alone in this"

Agreed. My point was rather that this is not Ruby's only dangerous shortcut.


> There is no way to safely interrupt an arbitrary block of code. Anything could be happening at the end of that 5 seconds.

IKR!? Like, I'll be click-click-clicking away on my 1982 IBM PC/XT, I mean I'm REALLY hammering it, like REALLY leaning into the keystrokes. Then, all of a sudden, WHAM! Out pops some random lines of code; right at the end of my thread.abort block. I mean, D00000D. OMFG. This can happen anywhere! Anytime! How is any programmer supposed to know what their code is executing?!

I guess I tend to believe, if you don't know what you are doing, then don't do it.

If this is that big of an issue, maybe try real estate. Say it with me, "three-brrr", "two-brrr". Good. Now together...

Also, "idempotent". Everyone keeps dancing around that. Not sure why.



