The Problem with Threads (2006) [pdf] (eecs.berkeley.edu)
80 points by DonbunEf7 on May 4, 2018 | 55 comments



This piece seems to have predicted what has since become a very active area of everyday software development.

What are the alternative paradigms that have actually come into common use? Coroutines and async/await are what I hear about online, but what are the others? I've seen people tout zmq-communicating-processes with standard patterns as the solution to all problems, and I'm happy not to have to maintain the results.

Have we effectively “solved” the concurrency problem, and if so what's left as an exercise for the future?


Although this paper talks initially about concurrency, you can see that really he's talking about concurrency specifically for the purpose of parallelism.

Coroutines don't solve the parallelism part, because they're concurrent but exclusive.

Async/await as implemented in JavaScript doesn't solve the parallelism part either for the same reason, and async/await as implemented in C# has exactly the same problem as threads.
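
To make "concurrent but exclusive" concrete, here's a minimal sketch in Python's asyncio (standing in for coroutines generally; the names are illustrative): both tasks make progress by interleaving on a single thread, so they never run simultaneously.

    import asyncio

    async def worker(name):
        for i in range(3):
            print(name, i)          # always runs on the single event-loop thread
            await asyncio.sleep(0)  # the only points where tasks can interleave

    async def main():
        # "a" and "b" are concurrent but exclusive: a CPU-bound loop with no
        # await in either one would starve the other entirely.
        await asyncio.gather(worker("a"), worker("b"))

    asyncio.run(main())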

There are many ideas for how to solve the problem - but I think anyone who is honest will tell you none are a perfect solution to all situations where you want parallelism or concurrency.

For example, to use zmq-communicating-processes effectively you need a problem where you can divide the data or tasks cleanly a priori. We simply don't have the mathematical understanding of how to do that for some important algorithms that people really need to run in parallel today, such as triangulation or mesh refinement.

We probably need some radical new idea, or maybe it's looking increasingly like only a mix of ideas will work.


Triangulation and mesh refinement seem suitable for divide and conquer, except perhaps in pathological cases.


They’re the canonical example everyone uses of where it doesn’t work well!

In both of them there is tons of parallelism but you can’t work out what work is separate (divide it) until you have started the work.


Let's say you have a domain X that you need to triangulate. You break it along a plane, into two domains A and B, about equally large.

Imagine that you have a magical black box system that can triangulate A and B.

Would this not help you to triangulate X? I can hardly believe it wouldn't. (Again, perhaps in pathological cases yes, or if a near-optimal solution is not good enough).


IIRC, triangulation iterations can replace previous cuts, so A and B might not have any meaning in the next iteration if you triangulate properly. That's why it's okay to cut at random: it only affects the outcome indirectly.


Wait a second, are we both talking about Delaunay triangulation?


Yes. After checking Wikipedia I was referring to the edge flipping operation which is described in the “flip algorithms” section. So there's also a divide and conquer algorithm in the article, but it needs extra steps to fix the divide edge. Was that what you were talking about?


Yes, that's what I was thinking about.


Isn't that something you could approach with work-stealing techniques? You'd need to build the work pile as you go but that seems appropriate. Maybe I'm missing something?


The problem is: given a list of tasks that you want to solve in parallel, you cannot know ahead of time which jobs use the same data. You have to start solving them to figure that out. The problem isn't distributing the work, it's knowing when it's safe to do two things in parallel.


At risk of being that guy: the actor model (à la Erlang) is pretty good at concurrency. If you're unfamiliar, it's basically no shared state, and communication with other actors (Erlang processes) by sending asynchronous messages to the other actor's message queue.

The code for each actor is usually pretty small and easy to reason about. However, emergent behavior of the system, and ordering between messages from multiple actors can become tricky. Also, exposure to this idea long term will warp your mind :)
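
As a rough sketch of the shape of it, here's a toy actor in Python (a thread with a private mailbox is only an approximation of an Erlang process; all names here are made up): state lives inside the actor, and the outside world can only send it messages.

    import threading, queue

    class Counter:
        """Toy actor: private state, a mailbox, one message handled at a time."""
        def __init__(self):
            self.mailbox = queue.Queue()
            self.count = 0      # state touched only by the actor's own thread
            threading.Thread(target=self._run, daemon=True).start()

        def send(self, msg):    # the only way other code interacts with the actor
            self.mailbox.put(msg)

        def _run(self):
            while True:
                tag, reply_to = self.mailbox.get()
                if tag == "incr":
                    self.count += 1
                elif tag == "get":
                    reply_to.put(self.count)   # reply via the sender's own queue

    c = Counter()
    c.send(("incr", None))
    c.send(("incr", None))
    reply = queue.Queue()
    c.send(("get", reply))
    print(reply.get())  # 2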


The actor model is very vulnerable to race conditions - that’s a big downside to it.


It depends on what is racing. If you have the same/dependent information in two (or more) actors, you're going to have a coordination challenge.

So try not to do that. On the other hand, everything that happens with state within an actor is inherently non-racy, because an actor is sequential code and no other actor can mess with its state.


The actor model is a great tool, but I think it's best looked at as a low level concurrency primitive. Most of the time, folks should be working with higher level constructs in conceptually simpler control flow paradigms like call-and-return (async/await) or streams.


In Erlang, all those control flow paradigms exist in library form.

You are right about the actor model being a low-level choice, but it's a choice that has to be made, since the whole system revolves around allowing/disallowing shared state.


It's also relatively easy to implement async/await using actor primitives.


Data parallelism in CUDA, OpenGL, and other GPU APIs is doing fantastically and has for decades. (If writes are allowed, these APIs technically have the same problems as threads, but in practice they're easier to deal with since traditional mutex locks and condition variables are mostly unavailable in that environment, and the APIs force you to carefully declare the sharing semantics of your data buffers.)

Most parallel (not concurrent) problems map well to the data parallel model. Even Make is basically a data parallel API with read-only constant data, just with a more complex dependency graph.
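
As a toy illustration of the data-parallel shape in Python (a pure function standing in for a GPU kernel; the names are made up): one function mapped independently over every element, with no shared writable state.

    from multiprocessing import Pool

    def shade(pixel):
        # A pure "kernel": reads its input, writes only its own output slot.
        return min(255, pixel * 2)

    if __name__ == "__main__":
        pixels = list(range(10))
        with Pool() as pool:
            out = pool.map(shade, pixels)  # each element processed independently
        print(out)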


Promises in javascript have become quite popular. Unfortunately, they're not understood very well so they aren't being used much in areas where they can improve the performance of javascript applications and instead are being used to reduce nested callbacks.


There's nothing wrong with using promises to reduce nested callbacks.


Really? At my company, promises have long since won the day. Now I'm trying to get people on async/await, which gets you like 90% of the way to the simplicity of synchronous code.
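
For illustration, here's the same chain of two dependent steps in Python's asyncio, callback style versus async/await (the JavaScript situation is analogous; step is a made-up stand-in for an async call):

    import asyncio

    async def step(x):            # hypothetical async operation
        await asyncio.sleep(0)
        return x + 1

    def callback_style(loop):
        done = loop.create_future()
        def on_first(t):          # nesting grows with every dependent step
            t2 = loop.create_task(step(t.result()))
            t2.add_done_callback(lambda t2: done.set_result(t2.result()))
        loop.create_task(step(0)).add_done_callback(on_first)
        return done

    async def await_style():
        a = await step(0)         # same chain, reads top to bottom
        return await step(a)

    async def main():
        loop = asyncio.get_running_loop()
        print(await callback_style(loop), await await_style())  # 2 2

    asyncio.run(main())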


On the parallelism side, it might be instructive to look at the languages that have attempted it so far. IBM's X10, Cray's Chapel, and Taft's ParaSail come to mind.


You don't even need to get that exotic. OpenGL shaders, for instance, offer a simple, safe data parallelism model.


I don't know much about those. Thanks for the tip!


What do you mean by "the concurrency problem"?


Deadlocks and data races.

Which boils down to problems created by the POSIX implementation with condvars, mutexes, and semaphores, and no lock-free or wait-free data structures.

With threads there are also minor hidden costs: limited stack size, the high cost of context switches, and a random order of evaluation.

Lock-free threading semantics need to know ownership (copy or ref) and the relationships between data to be able to fix these problems. I only know of a few little-known languages that actually solved them.
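
To make the data-race half concrete, here's the classic unsynchronized counter in Python (whether any particular run actually loses updates depends on the interpreter and timing, which is exactly the problem):

    import threading

    counter = 0

    def bump(n):
        global counter
        for _ in range(n):
            counter += 1  # read, add, store: another thread can slip in between

    threads = [threading.Thread(target=bump, args=(200_000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)  # anything up to 800000; lost updates are silent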


My hunch: modern programming often requires concurrent execution of software, but most of the ways we have to model concurrency in code are at best hard to learn, and frequently orders of magnitude harder to use correctly.


Node.js is pretty good in this sense. Except for the very hard parts, node's async nature lets you introduce a good amount of concurrency in your code by default, resulting in a decent amount of IO being concurrent. You have to get used to a fully-async programming model, though.


Then that sounds like a problem with mainstream languages. Concurrency with the actor model is easier to learn and use than OOP, which many people seem to be able to use.


I find a lot of (most?) OOP code I read is either spaghetti (with weird object interdependencies), diffuse (way too many classes and subclasses, so following the flow of computation becomes difficult), or both. It's a great tool, but perhaps harder than widely realized.

And concurrency is even harder, especially with the ever-popular "tweak until it parses/compiles, then ship" approach, or when many people are working on the same section of code.

The Hewitt actor approach reduces interdependencies dramatically, at some cost for certain algorithms and with added clarity for many others. And it scales somewhat automatically beyond one machine, which is a big win these days.

My $0.02.


The actor model is non-deterministic and doesn't solve all concurrency problems, such as creating fine-grained or irregular data parallelism. There isn't (and it may not be possible to have) one single solution to 'the concurrency problem'.


The actor model is very deterministic, but it can model unbounded non-determinism, i.e. any concurrency problem, including fine-grained and irregular data parallelism. It's up to the compiler to generate SIMD instructions out of it, if that's what you mean.


How do you argue it's deterministic? If one actor asks two other actors to do a job and send the result back, those results come in a nondeterministic order. That's a race condition. It's easy to write programs with bugs in them because of this.

Something like fork-join is deterministic because results come in a fixed order.

And for generating SIMD from actors? Or handling irregularity efficiently? I feel like you’re making the ‘sufficiently clever compiler’ argument.

We cannot currently efficiently solve all parallelism problems in practice using actors, and we don’t know how we would be able to.


Non-deterministic order is not a race condition. You also need some sort of shared resource and being unaware of said non-deterministic order of incoming messages. With actor model you can't have shared resources and can't be unaware that messages come in no specific order.

And I'm not arguing for a sufficiently clever compiler, just that you can express any concurrency with actors. You can definitely create a convention backed by actors that compiles into SIMD if you need it.


> [it is a general race condition if] there is some order among events a and b but the order is not predetermined by the program

(Christoph von Praun)

If two actors do some work concurrently and when finished send a message to another actor, the order those messages arrive at the other actor, event a and event b, is not predetermined by the program. So it's a race condition.

    actor a {
      do some work;
      send 'a' to x;
    }

    actor b {
      do some work;
      send 'b' to x;
    }

    actor x {
      receive; <- has this received from 'a' or 'b'? Nobody knows. They've raced.
    }
You can express any concurrency with actors, but we do not know how to do so as efficiently as with other concurrency models for parallelism. Someone might be able to implement it efficiently, but nobody has managed it yet, so we're still reliant on shared memory and other approaches to concurrency.


Come on, you are literally trying to describe message passing as a race condition.


It's not my definition! And you're right, normal message passing does leave you vulnerable to race conditions and your program can run a different way each time you run it! That's a major problem with it.

That's why I think other models of concurrency, such as the fork-join model, where the equivalent of 'messages' has to arrive in a deterministic order and so there are no race conditions, are safer.


Message passing forces you to explicitly handle non-deterministic order, how can it leave you vulnerable to race conditions? If you need to receive a specific message first, you wait for that specific message, that's it. Simple and deterministic.


This is a real error I've seen someone make using Erlang in making a parallel sort when teaching parallel programming to masters students.

    actor A {
      receive half an array
      sort it
      send it to C
    }

    actor B {
      receive half an array
      sort it
      send it to C
    }

    actor C {
      (a, b) = divide input array into halves
      send a to A
      send b to B
      receive a'
      receive b'
      merge a', b'
    }
They send to A, send to B, and then think they're going to receive from A first, because they sent first, but B could finish first instead. Sometimes their program works, sometimes it doesn't.

Yeah, it's their fault, but the model hasn't helped them avoid the bug, and worse, they may never see it until they deploy to production.

If we used a fork-join model, they could not have made this mistake, and even if they did make some kind of mistake, at least they'd see it every time they ran their program.

    (a, b) = divide input array into halves
    fork {
      sort a
    }
    b' = sort b
    a' = join
    return a' + b'
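
Here's the same sketch as (roughly) runnable Python, using concurrent.futures as a stand-in fork-join mechanism; the join names exactly which result it wants, so the merge order can't race:

    import heapq
    from concurrent.futures import ThreadPoolExecutor

    def parallel_sort(xs):
        mid = len(xs) // 2
        a, b = xs[:mid], xs[mid:]
        with ThreadPoolExecutor() as pool:
            fa = pool.submit(sorted, a)  # "fork": sort a on another thread
            b_sorted = sorted(b)         # meanwhile sort b on this thread
            a_sorted = fa.result()       # "join": we name which result we want
        return list(heapq.merge(a_sorted, b_sorted))

    print(parallel_sort([5, 3, 8, 1, 9, 2]))  # [1, 2, 3, 5, 8, 9]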


As with most things in Erlang; if it's important, you must make it explicit. Implicit ordering works in your fork-join example with only a single fork, but if you require an ordering, you must be explicit about passing information through to enforce the ordering you need.

If you instead did

    fork {
      sort a
    }
    fork {
      sort b
    }
    a' = join
    b' = join
you would have the same problem as in Erlang. Or you could have actor C sort b inside the actor, between send a to A and receive a', and you would also have an implicit ordering.

In this case, merge sort could work with either order if a stable sort isn't required, or if the sort key is the whole element.

If it matters, this is easy to defend against: you just send a tag in the message to actors A and B (a ref in Erlang would be perfect for this case; if the merge happened in a fourth actor, a numeric indication of ordering would be more useful), and use that to enforce an ordering when receiving the replies.
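
A sketch of that defense in Python terms (one shared reply queue standing in for the parent's mailbox; the names are made up): tag each reply, and let the receiver reassemble a fixed order regardless of arrival order.

    import threading, queue

    replies = queue.Queue()

    def worker(tag, half):
        replies.put((tag, sorted(half)))  # the tag travels with the reply

    threading.Thread(target=worker, args=("a", [3, 1, 2])).start()
    threading.Thread(target=worker, args=("b", [9, 7, 8])).start()

    got = dict(replies.get() for _ in range(2))  # arrival order nondeterministic...
    a_sorted, b_sorted = got["a"], got["b"]      # ...but tags restore a fixed order
    print(a_sorted, b_sorted)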


> you would have the same problem as in Erlang

Ah but that's not how fork-join works - you fork multiple jobs, and then you must join them all at the same time - you can't join just one.

You have to do something like

    (a, b) = join


If you have to join all the jobs at the same time (which is pretty inflexible), how is the ordering of the results determined? My exposure to this model was in the Perl threads module (and the forks module, which offers the same API with OS forking instead), where you join on a specific thread id, so you can easily enforce ordering by first joining a and then joining b. I assumed a join with no parameters would join a thread that's ready and/or wait for the first to become ready, because that seems like the most useful/basic interface; anything more specific (like joining all the threads, or joining them in some order) could be added as needed in the context where it's needed.


> If you have to join all the jobs at the same time (which is pretty inflexible), how is the ordering of the results determined?

The model is that you can start a sequence of jobs to run in parallel, and then you have to wait for them all to finish. You get the results in the same order as the jobs you created. The order can't vary.

Think about a diamond shape - one job creates two more jobs, and then they both send their results to the original job, which cannot continue until all child jobs are finished.

> because that seems like the most useful/basic interface

Yes, useful and basic, but the problem is it makes it easy to cause race conditions, which is where this thread started! You think you'll get some thread being ready first, so you write code assuming that without even thinking about it, and then once in a trillion you actually get the other result first. Yes, it's a programmer bug, but the point is that because it's non-deterministic they may not notice until the one time it actually matters and someone dies.


Fork-join is not really a concurrency model; I don't understand why you are trying to push it in every message. But if you insist on treating it as a concurrency model...

> The model is that you can start a sequence of jobs to run in parallel, and then you have to wait for them all to finish.

Yes, that's the model.

> You get the results in the same order as the jobs you created.

No, you don't get the results in the same order. Jobs still finish in random order and store results before synchronization happens. Synchronization happens on join after that. And instead of relying on order you specify exactly from where you are getting the result of each individual job. So, if you have to specify that, why do you need an order then? Oh, you don't need it and you don't get to have it. It's not Erlang, where you can actually have a deterministic order and can wait for messages in any order you want, while it will reorder them for you.


> No, you don't get the results in the same order.

You and I just don't seem to be on the same page about what fork-join is, so we probably aren't going to agree on this.

> It's not Erlang, where you can actually have a deterministic order

Can, but my point is you can also not have a deterministic order, which is how Erlang programs can end up being racy, which is the problem with them if you are trying to solve the original problems of threads.


So, if you want this inflexible model, you could easily build it as a library in Erlang, just as you're clearly using someone's inflexible fork-join library (or an inflexible language implementation). In the meantime you're missing out on things like: fork A, B, C, with A expected to run much longer than the other two; when B and C return, fork D using the results from B and C; finally wait for A and D to return. Or fork A, B, C and merge the first two that return, then merge that one and the third. Or fork these ten jobs, but only run four of them at any given time (resource constraints). My first example you could probably structure into your model with an extra fork; the second example won't fit in your model; the third fits if you fork four workers and add a shared queuing mechanism, but that feels more complex.

These kind of techniques are key to using parallelism to reduce latency. Always having to wait for everyone to finish at each step makes for a lot of waiting.


Fork-join is not what you think it is, then. It's just a behavior, uncommon outside of shared-memory programming. And there is no order of results. There can be, if it is implemented on top of message passing, but then writing code that relies on order instead of specific names actually becomes error-prone.

And lack of order of messages is still not a race condition.


Erlang lets you receive the first message first even if it arrives last; you just have to specify which message, exactly as in your example. But order doesn't actually matter for sorting: you cannot possibly make a mistake with respect to ordering here.


I've always liked section 3 of this paper, specifically the concept that "infinite interleavings" make threads executing in parallel non-deterministic and difficult to reason about. That gets to the heart of why threaded programs are so prone to heisenbugs.

"They make programs absurdly nondeterministic, and rely on programming style to constrain that nondeterminism to achieve deterministic aims."

You can't write an infinite number of test cases for all those interleavings, and it requires hard thought to suss out where any problems might lie.


This talk about FoundationDB was brought up recently, and it's pretty amazing. I recommend watching the whole thing, but to be brief: they are taming the "infinite interleavings" problem through determinism.

"Testing Distributed Systems w/ Deterministic Simulation" by Will Wilson

https://www.youtube.com/watch?v=4fFDFbi3toc

They wrote an interesting Actor DSL that compiles to C++ and is completely deterministic, and they torture this deterministic engine with generated test cases on a cluster every night.

I guess you could say that the whole cluster is necessarily non-deterministic, but an individual node is deterministic, given an ordering of the messages it receives.


This is just my opinion, but I've never found that part of multi-threading difficult. Interleaving doesn't matter except where resources are shared between multiple threads, and the solution is to protect the resource with a mutex.

Sometimes it's hard to tell when a resource is shared, but that has more to do with not knowing how the code works than it does with multi-threading.


> Sometimes it's hard to tell when a resource is shared, but that has more to do with not knowing how the code works than it does with multi-threading.

With respect, this sort of thing works a lot better for small codebases where you're the only one working on it. Multithreading when you can't contain the entire relevant codebase in your brain is where the real challenge is.


> the solution is to protect the resource with a mutex.

Then you have deadlocks.
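
A minimal Python sketch of the canonical failure: two threads take the same two locks in opposite orders, and each can end up waiting on the other forever.

    import threading

    lock_a = threading.Lock()
    lock_b = threading.Lock()

    def worker_1():
        with lock_a:      # holds A...
            with lock_b:  # ...and waits for B
                pass

    def worker_2():
        with lock_b:      # holds B...
            with lock_a:  # ...and waits for A
                pass

    # If each thread grabs its first lock before either grabs its second,
    # both block forever. Run-dependent, just like the races.
    t1 = threading.Thread(target=worker_1)
    t2 = threading.Thread(target=worker_2)
    t1.start(); t2.start()
    t1.join(); t2.join()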


Ayup.

My favorite one is where adding debug traces causes the heisenbug to disappear because the printf() inserted a memory fence somewhere deep in the logging library.

Nothing like debugging via atomics.


I identify with this bit from the paper

> To offer a third analogy, a folk definition of insanity is to do the same thing over and over again and to expect the results to be different. By this definition, we in fact require that programmers of multithreaded systems be insane. Were they sane, they could not understand their programs.

I actually had this exact notion when implementing pthreads for a course. I noted to myself "Gee, I keep doing the same thing and every time I get a different result... I must be insane according to the definition"



