There was a paper long ago that showed duality between semaphore/locking code and message-queuing code. So folks figured they were the same.
Not so! semaphore/locking is very hard if more than one semaphore exists. Look at the 'dining philosophers problem' and so on.
But queuing! Done right, that can be proved correct through static analysis. Do threads that consume one queue block on another queue? Draw the graph of queue-blocking: if it loops anywhere, you could have a problem.
I.e. if your message-consumers don't block at all, then you cannot have a problem with deadlock.
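To illustrate, the check is just cycle detection over a small graph; a rough sketch (the queue names and edges here are made up):

// Edge (a, b) means: a consumer of queue `a` can block on queue `b`.
// Returns true if the queue-blocking graph contains a loop (a potential deadlock).
fn has_blocking_cycle(num_queues: usize, edges: &[(usize, usize)]) -> bool {
    let mut adj = vec![Vec::new(); num_queues];
    for &(a, b) in edges {
        adj[a].push(b);
    }
    // 0 = unvisited, 1 = on the current DFS path, 2 = fully explored
    fn dfs(q: usize, adj: &[Vec<usize>], state: &mut [u8]) -> bool {
        match state[q] {
            1 => return true, // we came back to a queue we're already blocked behind
            2 => return false,
            _ => {}
        }
        state[q] = 1;
        for &next in &adj[q] {
            if dfs(next, adj, state) {
                return true;
            }
        }
        state[q] = 2;
        false
    }
    let mut state = vec![0u8; num_queues];
    (0..num_queues).any(|q| dfs(q, &adj, &mut state))
}

fn main() {
    // Hypothetical pipeline: 0 = parse, 1 = layout, 2 = paint.
    assert!(!has_blocking_cycle(3, &[(0, 1), (1, 2)])); // a DAG: fine
    assert!(has_blocking_cycle(3, &[(0, 1), (1, 0)]));  // a loop: trouble
}

Run that over the real consumer/producer relationships in your system, and an empty result means no consumer can ever be waiting, directly or transitively, on itself.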
You CAN have a problem with messages stalling however - languishing on a list waiting for something to complete that might never complete. But at runtime this can be debugged fairly easily - if all your queues are visible to the debugger.
In the semaphore implementation, the locking situation depends on the state of threads and their stacks, which all have to be frisked and back-executed to find out who holds what lock etc. Not always debugger-friendly to do.
I favor a multi-threading environment of threads dedicated to queues, and all the queues visible to the debugger. That kind of setup has never done me dirty.
One trick I used to use when writing multithreaded java code for debugging purposes was to never use the built in synchronized/implicit object lock, since that made it quite difficult to understand from a stack trace what object was being locked. Instead, we would define an explicit inner lock and lock on that explicitly. The class name would then show up in stack traces making it much easier to track down.
In cases where not all operations on the objects necessitate the same lock semantics, this is also handy because you can split or nest locks to reduce write-blocks-read and read-blocks-write situations.
But if you're finding yourself using this more than a couple of times in a project, you're probably uncovering violations of the Single Responsibility Principle. Especially if you're splitting locks.
But ... semaphores and queues are indeed very similar, and everything you said about a queue is also true for a semaphore. A queue has two fundamental operations:
trait Queue<T> {
    fn push(&self, t: T);     // never blocks; just adds an item
    fn pop(&self) -> T;       // waits until the queue isn't empty, then does the pop()
}
A semaphore is just a Queue where T is a constant (typically the unit type, which has zero size). Since it's always the same thing, there is no need to store the actual items: pop() can just create copies as needed, which also means push() can just discard its argument. That means you can get away with just storing a count of how many items are in the queue, not the items themselves. Now rename:
push --> signal
pop --> wait
And bingo, we have a semaphore. Which also means if you are having difficulty reasoning about semaphores but find queues easy to think about, just reverse the name transformation above and semaphores become just as easy.
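To make that concrete, here's a rough sketch (not production code) of a counting semaphore written as exactly that degenerate queue, a count plus a condition variable, with the renamed operations:

use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// A "queue of ()" that only stores how many items it holds.
struct Semaphore {
    count: Mutex<usize>,
    nonempty: Condvar,
}

impl Semaphore {
    fn new(initial: usize) -> Self {
        Semaphore { count: Mutex::new(initial), nonempty: Condvar::new() }
    }

    // push(()) with the argument discarded -- just bump the count.
    fn signal(&self) {
        *self.count.lock().unwrap() += 1;
        self.nonempty.notify_one();
    }

    // pop() that manufactures its own () -- wait until the "queue" is nonempty.
    fn wait(&self) {
        let mut count = self.count.lock().unwrap();
        while *count == 0 {
            count = self.nonempty.wait(count).unwrap();
        }
        *count -= 1;
    }
}

fn main() {
    let sem = Arc::new(Semaphore::new(0));
    let sem2 = Arc::clone(&sem);
    let t = thread::spawn(move || {
        sem2.wait(); // blocks until someone signals
        println!("got the signal");
    });
    sem.signal();
    t.join().unwrap();
}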
Rust exploits a similar equivalence with a mapping and a set: a set is just a mapping that maps every key to the unit type.
The other half of it is threads blocking on a single queue, and system message queues being a debugger-friendly concept (not just a locally declared object). Code discipline is 2/3 of the solution.
Surprisingly, our study shows that it is as easy to make concurrency bugs with message passing as with shared memory, sometimes even more. For example, around 58% of blocking bugs are caused by message passing.
...
Our study found that message passing does not necessarily make multithreaded programs less error-prone than shared memory. In fact, message passing is the main cause of blocking bugs. To make it worse, when combined with traditional synchronization primitives or with other new language features and libraries, message passing can cause blocking bugs that are very hard to detect.
I'd also add that in the modern world we've "solved" many multi-threading problems by splitting the queues and data stores out into multiple services. Now we don't have any threading problems - only "distributed system" problems, and if there's a bug there that's the architect's fault, not the programmer's!
The results of concurrency bugs with message passing tend to be less bad than the results of concurrency bugs with semaphore/locking code, although this is often because message passing tends not to involve shared memory, while semaphore/locking tends to.
Blocking bugs are a service issue, but incorrect locking can lead to deadlock or unprotected writes that corrupt data.
Depending on what they mean by 'detect', I've found it's actually quite easy to detect blocking issues in a message passing environment (Erlang, specifically): if a process has a blocking issue, it will end up stuck waiting for a message that will never arrive, but its mailbox will keep growing as new requests are added. Scanning for large mailboxes, as well as growing mailboxes, is easily automated and detects the issue. Of course, that's runtime detection; it's hard to detect these with code analysis, because it's not exactly a code problem. It's a system problem, and the system is an emergent property of the running code; it's not explicitly described anywhere, at least in Erlang. This is simultaneously amazingly powerful and limiting.
Although that's true, I found in the past that a pure queues/actors based approach didn't work well for some kinds of app. I had to introduce locks again and ended up with a hybrid solution.
The problem was that the app had a mix of low latency requests to access some parts of a large data structure, but the actor had to do a combination of fast and slow requests, so low latency requests would end up sitting in the queues for unacceptably long periods even when they needed access to only a small and stable part of the overall data. A related problem was the code size and complexity increase that came from having to make immutable snapshots of the underlying mutable data structures. Adding locking fixed all that and it didn't lead to any noticeable increase in bugginess. If anything the code was simpler, because the readers could simply access the data when it was in an immutable (locked) state, get what they needed and get out as quickly as possible, with RW locks ensuring good scalability and memory locality.
The actors approach also ran into circular deadlock when increasing the granularity of the actors, but unlike with locks there was no tooling support to find the issues (Guava has cycle-detecting lock classes).
The experience soured me on anything claiming to be a silver bullet for concurrency. It felt like the problems had been transformed into different shapes but not actually decreased in volume. Since then I use whatever approach feels intuitively right at the time.
I mostly agree with you, but it feels a bit handwave-y. That such bugs are less serious is maybe true for a particular class of idioms and a particular class of problems that, say, dominate CRUD application design.
Unsynchronized writes are also detectable at runtime, I would guess (having done no production Erlang work) with around the same order of false negatives as a stuck mailbox. TSan will only find the race if it occurs relatively close temporally; mailbox scanning will only find the stuck message if the send rate is sufficiently high. TSan is a lot more expensive but can be productively used during test suites. Mailbox scanning seems likely to fire false positives and only works in a complete running system.
In the end, only simpler program design can save us.
Interesting - in other words you are saying that message passing concurrency bugs tend to result in liveness violations whereas locking bugs tend to violate safety?
Yeah, that's a nice restatement. Additionally, it's a lot easier to find liveness violations than safety violations. If your system deadlocks, it's obvious (not so much if it livelocks), and you can see which parts are deadlocked. If your system corrupts data, it's not always obvious, and even when it is obvious that there was corruption, it's not always obvious where it happened.
the only model I use with threads is either pre-static (start a function on each thread with unique parameters, result is returned to a driving function at the end) or message-queue (each thread is blocking, waiting for a message). For the former, I use shared memory, but it's read only and guaranteed to live longer than any of the worker threads that use it (passing a consted ref_ptr). For the latter, I don't share any memory at all; the message itself contains the parameters, fully materialized.
I can ensure all my queues are loop-free (a directed acyclic graph of workers reading from their parent nodes and writing to their children nodes). IIRC one of the complaints about the Petri net model is that it was unprovable that any problem could finish, even trivial ones.
This has worked pretty well for me and I usually don't have to debug any truly nasty thread problems. My biggest problem tends to be thread pools where one of the workers gets wedged, and there's no way to cleanly kill the worker thread and steal the work onto another thread.
> I favor a multi-threading environment of threads dedicated to queues, and all the queues visible to the debugger. That kind of setup has never done me dirty.
This is a basic description of Erlang, and by extension Elixir, Gleam, Caramel, etc.
Plus at least some of them get, by default, implementations of backpressure, deadlock timeouts, etc. that are useful for 80% of use cases.
This sounds interesting and I've implemented simple ideas similar to the pattern you describe before, however I haven't read about its use in depth. Do you happen to know of an article/book/resource that describes this along with real world experiences? If not, would you mind writing a blog post or article on it please?
I had the benefit of cutting my teeth at my first job, working on a message-passing OS. CTOS, based on RMX/86, used messages for everything. It was a very early networked computing system: diskless workstations where file messages simply got forwarded to a server, etc. And all the messages and queues were visible in the debugger!
So I learned good thread hygiene right out of school.
The top-level comment's strategy with queues seems similar, in that it isn't always possible. It requires that the queues form a DAG, which is another word for a partial ordering. Thus, requiring the locks to form a partial ordering doesn't seem like a comparatively excessive restriction.
> doesn't seem like a comparatively excessive restriction
I think there are some applications and algorithms for which we just don't know a way to create an ordering over realistically fine-grained locks.
For example airline seat reservation. Someone clicking on seats to reserve them. You can't apply locks in a total order because... you can't read the person's mind on what seat they're going to click next.
> I favor a multi-threading environment of threads dedicated to queues, and all the queues visible to the debugger. That kind of setup has never done me dirty.
You mean like Go does with channels? (Not sure how good their visibility is to the debugger though.)
Probably more like Elixir/Erlang, if we're talking programming language. Though even there, BEAM processes are the unit of execution, and are multiplexed on threads. Parent references OS development elsewhere.
Go has a couple of deviations. Channels aren't queues unless you are careful to size them larger than the number of things you'll ever put in them (otherwise the thing trying to put something on a channel can deadlock; you can of course time out on it, but that introduces tight coupling, since the goroutine sending has to care about the state of the thing receiving). Goroutines aren't dedicated to channels (that is, there is an m-n relationship between channels and goroutines, which can lead to a lot of incidental complexity). And you can see what is on a channel if you have a breakpoint, but that assumes you can get the code to break inside of something containing a reference to the channel.
It's based on Tony Hoare's Communicating Sequential Processes, and is far more comprehensible and possible to reason about than primitives like mutexes and semaphores that are too close to the underlying hardware implementation.
It's a _little_ more costly, depending, but otherwise I agree with you. It also clearly delineates what data is shared, and everything else can be assumed private... so it helps make the overall architecture more explicit.
In fact, Valgrind is generally my favorite debugging tool: memcheck (default) for memory errors and leaks, callgrind for profiling, massif for memory profiling and finding "stealth" leaks, helgrind for multi-threading and a few others that I use less often.
The great thing about Valgrind is that it doesn't require any instrumentation at build time (though debug symbols are recommended), just run "valgrind your_program". Behind the scenes, it is actually a VM, that's how it can work with binaries directly. In theory it works with any language, and no need to do anything funny with libraries, kernel stuff, etc...
The biggest problem is that the performance penalty is huge: 10x is typical, but 100x is not uncommon. ThreadSanitizer is slow but not that slow. I don't know which one is the best at finding issues, I think they are on par, but when you are on a particularly hairy problem, it is good to have both.
In practice what this means is that CPU-bound threads can monopolize the single thread, and we've had thread starvation issues as a result in some tests. Setting the "--fair-sched" option resolved those issues, but made Valgrind run even slower ...
TIL that, so predictably, "[Go's] race detector is based on the C/C++ ThreadSanitizer runtime library". So, I can definitely confirm its usefulness. https://go.dev/blog/race-detector
> In this approach, threads own their data, and communicate with message-passing. This is easier said than done, because the language constructs, primitives, and design patterns for building system software this way are still in their infancy.
OP does say system software, and although Erlang can have amazing uptime, I'm not so sure about its raw performance story (on the other hand, I do remember seeing some insane benchmarks from an Elixir framework, so maybe I'm wrong).
Also, other than Erlang and Elixir, which other reasonably mainstream language has this as first class features? Even Rust doesn't really put queues and message passing front and center, it just makes it easier to get away with doing traditional thread programming.
Things can be old in years since discovery/invention, but still in their infancy in usability and adoption.
Go has channels (`chan`), which it considers more native than e.g. Mutex. The latter is in the stdlib; the former is a feature of the language proper.
Alas, the Go ecosystem could use more channels. I'd say that the adoption is still at the infancy stage. I wonder whether there are any practical shortcomings or whether it is just a readability tradeoff (a Mutex tends to be super-readable, whereas channels, not so much).
Truth be told, it is trivially easy to create deadlocks even in a pure message-passing environment such as Erlang. Message-passing is way oversold and doesn't solve as many problems as people think.
Most of us work in information systems, which is to say, doing database stuff all day, usually with an RDBMS like MySQL, Postgres etc. And what are we doing? Grabbing a bunch of transactional locks. Against what? Shared mutable state, and not in-memory state, but database state counting into the thousands, millions, billions or more records. And if we do it wrong? Deadlock, often deadlock across multiple processes. And what do we do about that? We deal with it, because what else can you do?
You could of course throw transactions away and declare them more trouble than they're worth, but that probably won't go over well. I've found it helps to have a "consistent first lock", so for example in a consumer-based system you'd lock the customer record first, because most transactional operations happen in that context. If you always have that consistent first lock then deadlock can't happen.
My point is that if I assert "multithreading is unacceptable!", most of business information systems goes out the window on the same principle, because the locking is even more dastardly - multi-process and persistent state instead of in-process in-memory state. I don't think you could throw actors/schmactors/coroutines/goroutines/etc at such usefully either. And if you said "Well only the best programmers need apply," well BIS is not exactly hotshot heaven, by and large. It's a bunch of grunts.
So I agree multithreading is hard when you are locking many things at once, which I try to avoid. But I just don't get this thing of trying to banish it or make it the exclusive domain of geniuses.
Databases (and I would argue runtimes, compilers, etc.) are written in a much much higher quality than your regular CRUD webapp, by usually much more experienced developers, with proper architectural plans and they are sometimes even model checked by TLA+ or the like.
That's beside the point, as I'm not talking about RDBMS bugs. I can corrupt database data when the RDBMS is working exactly as it's supposed to. Strong types and various constraints like foreign key, unique, etc. will mitigate, but there's infinite ways to make a mess beyond that.
While doing anything creative with threading is in fact challenging, in Java it’s easy to safely boost performance with threading using parallel streams and concurrent data structures.
Even in other languages like C/C++, it’s mostly worth it to design something up-front that is naturally parallelizable and find a safe and correct way to do it.
And the incentive to parallelize is only increasing with the increasing thread counts of newer hardware. Sure, it’s scary, but there’s really no excuse to avoid running with threading, multiple processes, or multiple services.
In my experience I saw a bunch of novice devs using parallel streams for N<=100 lists and such, for much worse performance. It's certainly not foolproof either, or it would just be the default.
To be fair, whether parallel streams is useful isn't a function of N alone, but the time each operation takes. If it's a nontrivial calculation, then by all means.
Although parallel streams are seldom the best option in terms of performance even for large N. They may often be faster than sequential processing for large N, but a more thought out approach is almost always faster still, often significantly so.
Low-level thread programming is a minefield. I've been doing IRQ/concurrent programming since the 1980s, and still hate it. Thread bugs are a nightmare.
Concurrency needs to be "baked into" higher-level languages, development APIs, and standard libraries. Basically, covered in bubble-wrap. That's starting to happen. It won't be as efficient as low-level coding, but it is likely to still have a significant impact on the performance of most code.
And folks that are good at low-level threading would be well-served to work to support this.
I studied distributed computing in college, and I spent a lot of time not quite internalizing the fact that since this was an elective at one of the highest ranked state schools in the nation, probably most other people didn't have the same information I did.
I ended up doing a lot of concurrency work early on because I was good at it, but over time the evidence just kept piling up that nobody else could really follow it, and so I've used it less and less over time, and in particular try very hard to keep it away from the main flow of the application. It's more something I pull out for special occasions, or to explain bugs.
Where a lot of frameworks fail is that while re-entrance and concurrency are different problems, they share a lot of qualities, both computationally and with the limits of human reasoning. Recursive code with side effects looks and acts a lot like concurrent code, because here I am looking at some piece of data and the moment I look away some other asshole changed it out from under me. Most frameworks end up being a bad study in recursion, usually in the name of reducing code duplication.
Pure functional people love recursive code, but it's the pure part that avoids the problems with trying to interlace multiple activities at once. Without that, you're trying to eat your dessert without eating your vegetables first.
Idk I think it is not as difficult as it is sometimes framed. I think the key is to design your program/separation of concerns in such a way as to make it easy to reason about concurrency, and minimize the need for synchronization points.
I think a lot of the problems arise when you just try to write normal synchronous code, or to parallelize code which was originally synchronous, and don't realize the implicit constraints you had been relying on which no longer hold when concurrency is introduced.
Based on a non-scientific study, I think the spatial thinkers do great with concurrency, the visual thinkers do okay, and everyone else is in real trouble. Which reminds me, I need to interview my developer friend with aphantasia about how he feels about concurrency.
Concurrency should be the sizzle and not the steak, otherwise you're reducing the diversity of your team rather substantially. Good people are hard enough to find. Driving the people you have away doesn't generally end well.
I wouldn't say I have aphantasia, but I don't know that I think of computers in a visual way, although kind of maybe.
I'm not great at writing concurrent code with locks and shared memory (messaging passing is much easier for me to do well; but message passing is almost exclusively built upon locks and shared memory, so my hands are forced). But I'm pretty good at debugging concurrency problems of either type. It's all about thinking about the sequence of events of an individual thread of execution and not making assumptions that things would necessarily stay the same between steps... and then trying to line up how another thread could make those things happen. I don't line the threads up visually, but I guess you could?
FP is good for adapting to threading, but it has difficulties, when applied to things like GUI programming, async communications, or device control (places that need threading).
A lot of languages are getting things like async/await/actor, etc., but even that is still too low-level for a lot of programmers. It can easily turn into a lot of single-threaded code.
It needs to be completely under the surface. Swift does that with generics (I suspect other languages do it, as well). For example, you can say Array<Int>, or [Int]. They mean the same thing, but one does not have the generic syntax.
If we can do similar stuff with things like mutexes and syncing, then it will go a long way towards highly performant, safe, code.
It's interesting that you choose Swift as an example, because I think in some ways Swift does a poor job of accommodating async using the "bubble wrap" approach.
Specifically I am talking about the default of using atomic reference counting on all references. This makes it somewhat idiot proof at least with respect to avoiding NPE's, but it comes at a huge performance cost (especially on x86 which I guess is becoming less of a concern for Swift code).
I think this is a fundamental issue: a big part of the benefit of concurrency and parallelism is supposed to be increased performance. However, a lot of the ways to make concurrency "generally safe" tend to hamstring performance, because you have to protect against all the different types of errors programmers can make, which basically means a lot of expensive synchronization.
Maybe there is a way to hide this completely, as you say, in a performant way. To do this, I think you would almost have to have either some kind of really clever risk-free data structures, or else a very smart compiler.
Another approach might be to keep concurrency at arms length from the average programmer. I.e. hide 90% of it inside of libraries, and expose safe interfaces which the programmer interacts with in a mostly synchronous way.
> I think in some ways Swift does a poor job of accommodating async using the "bubble wrap" approach.
I agree. Async/await has a "kludgy" feel to it (to me). It's "added on," not "embedded in." But I'm not a language writer, so I don't have alternatives to offer.
I was talking about how it does generics. I think it does that quite well.
I agree that compilers and schedulers should be enforcing partial orders of concurrency. I talk of various multiversion concurrency control ideas in my ideas4 page.
As an example I think Thread yield should take an argument for a matching data structure that when the state of that data structure changes you trigger rescheduling at that point and avoid a busy wait.
In audio code, I'd rather use a properly written wait-free SPSC queue, than a least-common-denominator messaging mechanism provided by the standard library like postMessage() (where both the Win32 and JavaScript version suffer from contention and cause audio stuttering, see https://github.com/Dn-Programming-Core-Management/Dn-FamiTra... and https://blog.paul.cx/post/a-wait-free-spsc-ringbuffer-for-th...), though I'm not sure if generic channel/queue objects are as bad in practice. And message-passing (with anything other than primitive types) is a pattern for sharing memory that, if properly implemented and utilized (you don't send a pointer through a channel and access from both threads/goroutines), ensures no more than 1 thread accesses the object in a message at a time.
I think most but not all code can be constructed using primitives like (regular or wait-free) message queues and (RwLock or triple buffer) value cells, but I think all concurrent code which communicates with other threads of execution needs concurrent reasoning to design and write correctly. In my experience, Rust marking data as exclusive or shared is quite helpful for concurrent design and reasoning, whereas prohibiting shared memory altogether reduces performance drastically but is no better at correctness. I think message-passing merely shifts data race conditions into messaging race conditions (but perhaps Go is easier to reason about in practice than I expect). In fact, building programs heavily reliant on message passing between separate OS processes per service (like PipeWire) doesn't suffer from multithreading but rather multiprocessing and distributed systems, making it harder to establish consistent execution states or data snapshots at any point in time, or reason about invariants.
And even code not intended to communicate between threads needs to take care that no state is shared and mutated by another thread on accident (I concede this is easier with JS Web Workers or Rust which restrict shared mutability, than C++, Java, or Go which don't).
I think you need shared memory concurrency for performance. There are papers that argue for optimistic concurrency control and blocking concurrency control.
"Concurrency Control Performance Modeling: Alternatives and Implications"
Pony has concurrency baked in via a high-perf actor implementation. It's really nice. I believe Go has this as well, but it also has a low-level concurrency API?
The problem with concurrency and parallelism is ownership, mainly write access rather than coherence. I have seen again and again the bad habit of people bypassing well-defined setups of data ownership because they feel they are smarter, think it is unnecessary, figure we'll work it out down the road, and, my favourite: it doesn't solve deadlocks, so why use anything if some bug families remain.
The issue is not of frameworks or type systems but of companies and people. Engineers know this stuff by training, but either are new and commit hubris or management makes them (or maybe some of them are simply untrained — and somehow never went to a bank (queues)).
P.S. For fun instance, I have seen unsafe used to bypass Rust’s type and ownership system to keep track of db snapshots in a “nice setup.” Rust was not wrong.
I don't know, maybe it's the sadist in me but I enjoy the challenge of writing fast and correct multi-threaded code. It's a valuable skill if you can master it.
This is the way. That code breaks when someone new comes along and makes several modifications that have implications which take a long time to rear their ugly head.
Broken multi-threaded code behaves a lot like perfectly coded multi-threaded code....until you get a lot of traffic through that code, you know, like on a busy holiday weekend when lots of revenue is flowing through the business!
Exactly. And you don't always need a lot of traffic, so even if your code survives load testing doesn't mean it's correct. Just last week I ran into a serious problem with a piece of code that handled infrequent events from several different threads, and an almost unrelated refactoring changed timing so that events arrived from two different threads exactly at the same time, on exactly one family of Android phone models. The original mistake had been in production for a long time without anyone noticing.
I'm old but not too old; I was lucky (unlucky?) enough to still have to write C++ multi-threaded code in my first job.
No matter how good I thought I was (spoiler: I wasn't), many a bug could be traced back to my, erm, "understanding" of what safe and proper threaded code is.
If you have to do things in parallel or concurrently (lol, calm down now HN, I know there is a difference), I'd say just go look at Go!
Damn, once you get the hang of channels and the CSP concept, it's so beautiful and elegant! :) I'm sure there are many counterpoints and it doesn't fit every problem, but as a professional CS practitioner, you owe it to yourself to give Go and its paradigms a fair and open-minded go!
Oh, and the built-in race detector is a glorious tool! The "feeling" you get when you've got a zillion fibers/channels going and the race detector reports no issues is like "levelling up"!
Sorta like how, if you're serious about CS/coding, you have to at least install a LISP once in your lifetime, even if you do nothing with it but swear and squint :) #duck!
I think warnings like this are good and necessary. A lot of people are going to ignore this warning simply because they wrongly believe they're good enough to ignore it. And they're going to try and write this massive project simply to prove that the warning doesn't apply to them. And they will serve as great examples of why that warning is there in the first place.
But there's another group. This group heeds the warning. They do their best to not violate the warning. But then they run into an issue. They exhaust every possible option trying to resolve the issue. But eventually, they realize that they can violate the warning, just a little bit. And they're careful. And they violate the warning in the least way possible and take every action deliberately. And it works.
Because the warning doesn't mean it's impossible, it just means that it is incredibly difficult and should be approached with the utmost care.
Eventually, we do learn the pitfalls, internalize them, abstract them away, etc. And then we can even do away with the warning. Because it's no longer the edge of the universe, it's just another stop along the way.
I'm glad that Rust has changed this, at least in my very specific case: I wrote a command-line application in PHP and adding parallelisation was a long journey of trial and error and concurrency bugs that took over a week to get right.
I rewrote the same application in Rust. I followed the Arc/Mutex documentation and parallelised it over a few hours. It worked perfectly.
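For anyone curious, a minimal sketch of the pattern I mean; the work items and the "processing" here are obviously placeholders for the real thing:

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Placeholder work items; the real app did actual I/O and parsing.
    let jobs = vec!["a.txt", "b.txt", "c.txt", "d.txt"];
    let results = Arc::new(Mutex::new(Vec::new())); // shared state, only reachable through the Mutex

    let mut handles = Vec::new();
    for job in jobs {
        let results = Arc::clone(&results);
        handles.push(thread::spawn(move || {
            let processed = format!("processed {job}"); // stand-in for the real work
            results.lock().unwrap().push(processed);    // keep the critical section short
        }));
    }
    for handle in handles {
        handle.join().unwrap();
    }

    println!("{:?}", results.lock().unwrap());
}

The compiler simply refuses to build it if you forget the Arc or the Mutex, which is what made the port so quick.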
> Buggy multi-threaded code creates race conditions, which are the most dangerous and time-consuming class of bugs in software.
> safe and scalable parallelism is achievable by minimizing or eliminating concurrent access to shared mutable state.
> Rust is designed from the ground up to facilitate this
Does async code with mutable state sometimes result in race conditions in rust?
I have had that experience with JavaScript. For example, old-school short timeouts with a function written as a closure implicitly expect the state to remain the same, but a race condition happens asynchronously to change the state, then the closure is called, and then the closure code fails. A problem of unintuitive surprise to the programmer that can happen with async code (even without thread concurrency).
No, you can't accidentally modify shared state in Rust. Unique ownership requires making sharing explicit, and protections against data races force shared mutable data to explicitly use synchronization primitives like mutexes or atomics. This applies to everything, including closures and async code.
It's not entirely foolproof. You can still have higher-level logic bugs, race conditions when accessing external resources (files, network, database), and deadlocks. However, these tend to be easier to track down and reproduce than heisenbugs caused by data races from unsynchronized memory access.
> it’s critical to explore ways to incrementally add safe concurrency in C++.
I don't buy it. Some things are foundational and attempts to add them incrementally are doomed because your foundations won't support the work. After building them easy things seem easy - but then hard things tear out your inadequately founded "incremental addition" and make a huge mess everywhere.
Google has a team doing this sort of thing (indeed David Baron might even be on it for all I know), for exactly the same reason, they wrote a web browser in C++ and it has threading problems and they wish they could fix that. So Mozilla isn't exploring terra incognita here, and if there's something to be found I hope they find it, but I expect nothing.
>So Mozilla isn't exploring terra incognita here, and if there's something to be found I hope they find it, but I expect nothing.
This post was written in 2015. Looking back I think we can infer:
1. They found a solution (create a new language)
2. From the success of that language, it seems to deliver on its promise (although it has its warts)
3. They failed to incorporate that into the web browser, because rewriting a web browser is a momentous task with little measurable business impact, and decided to lay off the engineers instead.
Google is a wealthier company, but they may decide that trying to "fix" C++ is a more realistic endeavor.
Firefox has a whole bunch of Rust in it, much more than it did when that blog post was written. Mozilla let go some core Rust people, but of course it still employs numerous Rust programmers.
Rust is encouraged to be used by all developers who are writing native code (it's not supposed to be just the domain of a select few). As I recall, the message was, "Not knowing Rust must not be an excuse for why you're not using Rust." Of course developers must still contend with the practicalities of integrating Rust into an existing C++ codebase, so some guidelines were developed:
"If the fastest chips have N cores, a mostly-single-threaded program can only harness (1/N)th of the available resources. As N grows (and it is growing), parallel programs will win."
This win will be constrained by Amdahl's law: a single multithreaded program/process can effectively use only a finite number of cores, because its speedup is limited by its serial fraction. With parallel fraction p, the speedup can never exceed 1/(1-p) no matter how many cores you add (e.g., a 95% parallel program tops out at 20x).
To use more resources, the task must be broken into multiple processes.
It can be faster to run $(nproc) gzip processes in parallel than to use pigz, for example.
> This win will be constrained by Amdahl's law. A single program/process that is multithreaded can effectively use only a finite number of cores.
This assumes that the workload is fixed. A classical example would be compiling source code: you have a fixed number of source files and the compiler is only able to parallelize so much.
On the other hand, there are applications where more threads allow you to increase the workload itself. For example, a modern DAW (digital audio workstation) is able to render tracks in parallel; if you have more threads, you can use more tracks with more plugins. In a computer game, more threads might mean more complex simulations, more objects, better graphics, etc.
Also, the major change with having so many cores is the ability to do lots of different things at once on a single machine. You don't need any single process to utilize all the cores on a machine; you take advantage of all those cores by having a single machine do a lot of things.
It remains important not to buy cores that you cannot use, especially if there are other constrained components that would have a larger performance impact.
Sophie Wilson (creator of the ARM instruction set) spends some time on these issues, also stressing core throttling and shutdown due to heat.
I read this recently and found it weird that it ends by saying Rust enables message-passing. It sort of seems like the opposite. Rust will fight you to the death to ensure you don't have concurrency bugs even in single threaded applications or applications in which threads only communicate by copying data into and out of queues. When you do copy the data, it will insist that you use its global allocator containing synchronization primitives instead of a per-thread allocator.
I think it's gotten easier over time, actually, and even when this was written it was theoretically out of date.
There is a famous paper about how bad threads are from the 1990s. I believe it is valid on its own terms but misdiagnoses the underlying issue. The underlying issue is this: Taking multiple locks at once is not a sane operation at scale. Any system built on this will fail.
There is no magic bullet that solves all concurrency problems, but a combination of techniques, used with discipline, take it from insanely impossible to merely challenging:
1. Assign data to a thread, with only that thread able to access it. Offer concurrency via message passing to that thread alone, and don't allow references to escape from the thread (pointers, whatever you want to call it). Language support for this is nice (Erlang and Pony do it one way, Haskell another, Rust yet a third) but not technically necessary. (See the sketch after this list.)
2. Using locks is valid, and even quite helpful and useful, but never let yourself take more than one. If you need transactions, see #1. Don't write code that encourages people to take more than one. Don't forget about things that may take locks without you thinking about it (e.g., in Go, taking a mutex and then trying to send on a channel is taking two locks); this is one thing that can be a bit of a challenge.
3. For more complicated things, use libraries as much as possible that have solved the super complicated cases already. For instance, databases that offer transactional security. You can do this even if you don't otherwise "need" such databases. Or libraries that offer a "parallel map" within your language, or other canned solutions to parallelism/concurrency.
4. Use whatever support your language provides for "linting" your conditions. The Go race condition checker is theoretically insufficient but practically useful (passing it doesn't mean that you have a safe program but it's a good start). Your language's mileage may vary but take advantage of what you can.
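To illustrate #1, a rough sketch (the message type and channel layout are just one way to do it): a single thread owns the map, and everything else talks to it through a channel without ever holding a reference to the data.

use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

// Messages are the only way in or out; the owning thread never hands out references.
enum Msg {
    Put(String, u64),
    Get(String, mpsc::Sender<Option<u64>>), // a reply channel travels with the request
}

fn main() {
    let (tx, rx) = mpsc::channel::<Msg>();

    // This thread exclusively owns `counts`.
    let owner = thread::spawn(move || {
        let mut counts: HashMap<String, u64> = HashMap::new();
        for msg in rx {
            match msg {
                Msg::Put(key, value) => {
                    counts.insert(key, value);
                }
                Msg::Get(key, reply) => {
                    let _ = reply.send(counts.get(&key).copied());
                }
            }
        }
        // rx closed: all senders gone, thread exits cleanly.
    });

    tx.send(Msg::Put("hits".into(), 42)).unwrap();

    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send(Msg::Get("hits".into(), reply_tx)).unwrap();
    println!("hits = {:?}", reply_rx.recv().unwrap()); // hits = Some(42)

    drop(tx); // closing the channel is the shutdown signal
    owner.join().unwrap();
}

The owning thread serializes all access, so there is nothing to lock; the only "transaction" primitive you need is putting everything that must happen atomically into one message.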
Honestly, just these things taken together can take you quite far. Again, these aren't magical solutions. There is an irreducible sense in which threading is harder than not threading, and I don't expect that to ever change. There will always be programmers that are more expert in it than others. But it doesn't have to be an impossible challenge anymore. And to a large degree, the challenge becomes less "solving the multithread problem" and more just figuring out how to architect systems under the restrictions of the patterns that work and compose together.
Although all that said, at a browser scale concurrency is always going to be a problem. They gotta go fast, but that fights the safe patterns pretty hard because they do involve a certain amount of runtime slowdown (e.g., passing a message is always going to be slower than just not passing a message). Rust is the only practical solution at that scale. But just as You Are Not Google, You Are Not Writing A Browser. More normal programs can go a long way with just what I laid out above.
This is an excellent comment that clearly comes from a lot of experience. Can you say any more about why having multiple locks is such a red flag? Is it because you start getting into deadlock/hungry philosophers territory? Are there rules of thumb for using multiple locks that make it more tractable or is it just always a bad idea?
Not OP, but that might be an implication. Another, more general reason is that this kind of concurrency requires design. It can’t be thought of as a defensive programming technique. Locks are too powerful and they are by definition a point of contention.
I think providing a lock is like giving up control to someone else. It’s inherent coupling. To contrast, when I first learned about them I thought of them as a protection mechanism of an isolated part, but they are an execution contract that everyone participates in.
This matches well with one of the beliefs I've been polishing up over the past few years which I don't think is well understood, which is that structured programming tends to create more coupling than we realize in deep call stacks. When you end up with a 2000-line call stack in Java, that's 2000 function invocations coupled together by the way structured programming works; for instance, at a bare minimum, that's 1999 functions that can't progress until the innermost one does, and that's not the only form of coupling.
I think the coupling induced by structured programming is fairly light per stack frame on average, but they tend to add up quickly because they're just so darned easy to add more of.
Locks are another thing that makes this really come to light. The more stack frames between some function and something above it in the stack that has taken a lock, the more likely it is for execution to eventually wander back into something trying to take a lock inadvisably.
I mean, I've had a number of deadlocks just within an object where it has some internal lock, and a method takes that lock, then tries to call a method that takes the same lock. Whoops. Easy to fix, of course, but that's the easy case. The cases get harder in real code.
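For concreteness, the easy case looks like this (made-up names); std's Mutex isn't reentrant, so the second lock() never succeeds:

use std::collections::HashMap;
use std::sync::Mutex;

struct Cache {
    inner: Mutex<HashMap<String, String>>,
}

impl Cache {
    fn get(&self, key: &str) -> Option<String> {
        self.inner.lock().unwrap().get(key).cloned()
    }

    fn get_or_default(&self, key: &str) -> String {
        let _guard = self.inner.lock().unwrap(); // takes the lock...
        // ...then calls a method that takes the same, non-reentrant lock:
        // the inner lock() deadlocks or panics, it never returns a value.
        self.get(key).unwrap_or_else(|| "default".to_string())
    }
}

fn main() {
    let cache = Cache { inner: Mutex::new(HashMap::new()) };
    println!("{:?}", cache.get("k"));                // fine
    // println!("{}", cache.get_or_default("k"));    // uncomment to watch it wedge
}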
So the stack is a 'lie'? Or the 'wrong' data structure? It isn't just caller knows about callee, there's cross frame and cross stack coupling going on. There is something interesting in your idea.
I don't think it's a lie. It's very, very useful. It did not survive as long as it did without providing value. It killed its competition stone dead for a reason.
I just think we underestimate the coupling it introduces. Doesn't mean the solution is to throw it all out. I'm not even sure there is "a solution". Just something to keep in mind.
The rule of thumb for multiple locks is always acquire them in the same order. Lock A, lock B, unlock B, unlock A. Otherwise you risk a deadlock. If another thread locks B, then tries to lock A, you're in trouble.
And keeping that discipline over many developers, over many years, is in practice impossible, which is why multiple locks seem attractive but are actually a major problem for complex systems.
You can, of course, encapsulate such a lock order with dedicated classes (or functions/closures) modeling linear types, enforcing the sequence lock A > lock B > unlock B > unlock A. The resulting objects can still be misapplied, but should fail immediately with a runtime error, making it much harder to inadvertently misuse.
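A rough sketch of the closure flavor in Rust (my own construction, nothing standard): callers can only reach both values through a helper that always takes A before B, and the guards release in the enforced reverse order.

use std::sync::Mutex;

struct Accounts {
    a: Mutex<i64>,
    b: Mutex<i64>,
}

impl Accounts {
    // The only way to hold both balances at once: always A first, then B,
    // so no two callers can ever wait on each other in opposite orders.
    fn with_both<R>(&self, f: impl FnOnce(&mut i64, &mut i64) -> R) -> R {
        let mut a = self.a.lock().unwrap();
        let mut b = self.b.lock().unwrap();
        f(&mut *a, &mut *b)
        // guards drop here in reverse order: unlock B, then unlock A
    }
}

fn main() {
    let accounts = Accounts { a: Mutex::new(100), b: Mutex::new(0) };
    accounts.with_both(|a, b| {
        *a -= 25;
        *b += 25;
    });
    accounts.with_both(|a, b| println!("a = {a}, b = {b}"));
}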
But that still doesn't scale up to more locks of that size very well, because you start trying to take even more of them, and in dynamic orders, and etc etc.
I know about the theoretical solution of taking locks always in a defined order, but I don't consider it a practical solution. If taking locks willy-nilly lets you scale to program size X, this might give you 2X or 3X, but that's not nearly large enough in practice.
& dandelany, the community has chipped in and answered for me. I endorse this subthread, at least, as I write this comment. :)
IME there are plenty of problems where nested locks will lead to a simpler solution than trying to avoid them. It's important to avoid them when reasonably doable, but it's also not uncommon to take it too far and end up with a slower and way more complicated solution.
The problem here is when you have more than 2, and especially as it combinatorially expands as you get more and more locks. Now I need Lock A, Lock C, and Lock F, but then this other critical section needs Lock A, Lock D, and Lock F, and if you are encapsulating then you start needing a whole lot of different functions for all the different combinations, and have to make sure that all are consistent.
I've got multi threaded C++ in production right now, and it is because the code is small and avoids multiple locks that I have any faith in it. As a code base expands, in kloc and in time, this gets harder and harder to reason about.
> Language support for this is nice (Erlang and Pony do it one way, Haskell another, Rust yet a third) but not technically necessary.
At one point I did some name mangling magic with the C preprocessor for "this can only be accessed from the thread (statically) named foo" for a project where that was relevant. It was pleasant, although limited in how much it could generalize - all the threads very much needed to be statically known.
If you want to see some of the most complex multithreading systems, take a look inside modern browsers. That's what he means by "browser scale".
I have some experience working with Chromium's internals, and it is crazy what's going on inside there. Several different job and message queue systems interlocking with each other, tasks spread across multiple processes getting split and regrouped and synced and recombined, practically everything running async, just so a few pixels are rendered a little faster when your multicore machine updates a small section of a website.
Browsers are indeed idling a lot, but if you want them to do something, they must be as crazy fast as possible. The competition in that space is ridiculous.
In the tens of millions to hundreds of millions of lines of code. Browsers are some of the largest things out there. See also desktop office suites, multi-decade CAD programs, a few other things.
Few programs are at that scale. IIRC, even the Linux kernel isn't really at that scale, once you clean away a few files in the drivers that are just literally millions of #defines or something.
If you're writing at that scale, by all means copy the necessary practices, but don't just casually assume your need is super large and you've just gotta use the hard-core lock-based concurrency because it's the only option for you. You can get a long way with much nicer architecture that is only slightly slower.
For instance, I write a lot of Go. I've sometimes gotten a bit nervous about writing a goroutine (/thread) that owns some particular data structure because I'm like "wow, a lot of things are going to access this, is this going to become a bottleneck in the system?" But every time I've run the analysis carefully, the answer generally comes back not just "no" but "heck no, not even close". Your intuition can deceive you very easily. Locking a single structure behind the limit that "we can only dedicate at most 1 CPU full time to maintaining this structure" is, in practice, far less scary than it sounds. 1 CPU is a lot of power and you generally need a lot of other CPUs trying to access it simultaneously to overwhelm it.
Amdahl's law kinda works in your favor on this matter (for once); for something like this to be the bottleneck, it needs to approach accounting for (1/(CPUs - 1)) of your execution time, and for that to be the case, those other processes hammering this 1-CPU-locked data structure need to be doing whatever else they are doing very quickly. As with anything in discussions like this, it absolutely can happen! But always check your intuition concretely before going crazy adopting something more complicated and "higher performance"; odds are good this isn't a problem for a given "random" data structure.
In my world of embedded single core processors, we use "multi-threaded" code all the time. There is a single core so nothing truly ever executes in parallel and that makes it easier: Someone else already wrote the RTOS, and figured out how to prevent priority inversion, etc. So long as one follows a few rules, and the general pattern is tasks and ISRs communicating via the primitives provided by the RTOS, there are never any issues.
Assuming you don't need bare-level performance and Go is okay, I love how easy and approachable parallel code is with Go. Locks and channels each offer trade offs you can easily benchmark using the built in go test.
I've written so many different concurrent approaches to workers, pipelines, and multi-stage processing systems and found it always easy to reason through ways to limit the number of workers, prevent duplicate work, and implement backoffs.
One issue is that the pthread model is too low level (think AND and OR gates), but you must include it for completeness.
I agree with the comments here that say higher-level multithreading support needs to be "baked in" to the language. Grand Central Dispatch did a good job of raising the bar here, but has its own problems, and is arguably too low level.
Another issue is that existing, widespread languages are too old to have cooked higher-level threading support in. You really have to do it from the start. C++ bolt-on parallelism might as well be syntactic sugar. Newer languages, like Rust, have attempted to solve concurrency problems but failed - it's not clear that the designers genuinely understand concurrency at all. Goroutines solve one limited issue of many and seem about as difficult as asking programmers to write threaded programs.
To do this correctly you need a new language that cooks it in from the start. There is a laundry list of things that have to be done precisely. It might be possible to do this by rewriting a reference implementation of an existing popular language, but the development overhead and performance concerns involved in backwards-compatibility make that probably infeasible. (Python can't even escape its global interpreter lock.) So likely the language that solves this problem is either unwritten or unknown.
Other mentions in the comments here:
ThreadSanitizer: It's not very good, nor can it be.
Erlang: solves the problem, but the actor formulation is too awkward for all but Erlang-specific use-cases.
>> Erlang: solves the problem, but the actor formulation is too awkward for all but Erlang-specific use-cases.
Err? I mean, it requires rethinking a bunch of the lessons you learn when concurrency is hard and expensive and to be minimized, sure. But when concurrency is easy and cheap and to be embraced, a bunch of things become a lot easier and more natural to write.
My go to example, from a production system, was a complicated task scheduling system (with user and hardware interrupts to handle). Sure, we could have written it using threads and a priority queue or something...instead we just had one Erlang process per task, with interrupts routed to that process as messages (and linked timer processes that fired messages to those processes for the scheduled stuff, essentially just another type of interrupt/message). Super simple to write, super simple to reason about, never had any sort of race condition or locking issues, across years of development and production use.
The problem with concurrent computing is not a lack of Erlang awareness or message passing awareness. If either was the solution we wouldn't be having this discussion since they both pre-date the 90s.
There ought to be a codeword where you can invoke the shortcomings of Erlang without getting dragged into a conversation with someone desperate to talk about Erlang. Next time I will rot13 it.
-You- said "But the actor formulation is too awkward for all but Erlang-specific use-cases". I was simply commenting that it's not awkward at all, for all sorts of use cases that aren't Erlang specific. Perhaps you'd like to define your terms better if you want to avoid people saying "hey, that statement as it comes across is flawed". Maybe you meant Erlang isn't the right tool to generalize to all use cases - no one would disagree. Maybe you meant what it sounds like, it's too awkward to use except for certain degenerate cases, in which case I strongly disagree; it's quite easy, and switching back to running N isolated units of execution on < N threads, and sharing memory between them, feels like a chore. Or maybe you meant something else entirely.
Maybe if you don't want to "get dragged into" a conversation on something...don't mention it on a public forum with such a badly worded statement?
>The problem with concurrent computing is not a lack of Erlang awareness or message passing awareness
It most certainly is.
We have been so thoroughly fed/brainwashed with Shared-Memory concurrency that most people don't even think beyond it unless forced by circumstances (or from curiosity). That is where Message Oriented Concurrency languages like Erlang can help. Just reading Joe Armstrong's thesis Making reliable distributed systems in the presence of software errors will teach one a simple approach to robust Concurrency that works and can be emulated in another language if you don't want to learn Erlang (the difficulty may vary based on language support).
PS: Folks may find Foundations of Multithreaded, Parallel, and Distributed Programming by Gregory Andrews a good reference for most major forms of Concurrency.
> It's completely feasible, from a programming standpoint, to imagine an architecture where each core has its own private RAM, the cores communicate through perfectly reliable IPC, and the device runs applications written in a successor to the distributed computing technology being developed now.
It is called processing in memory. What you have described was actually inverted. Instead of giving each core private RAM, you give each RAM chip a CPU core.
I implemented multiversion concurrency control and a poor but safe lock free algorithm.
I am still coming up with rules. As the Serializable Snapshot Isolation paper attests, you can avoid dangerous combinations of reads and writes to the same data structure with an algorithm (in the paper) that uses timestamp ordering and tries to detect cycles in the conflict graph with a heuristic based on two booleans, inConflict and outConflict.
The answer to concurrency bugs is that languages implement MVCC.
See my ideas4 page for concurrency control ideas.
Most software doesn't scale with threads and doesn't offer true parallelism. In other words programs don't use threads to their full potential.
I have an idea for LMAX disruptors with a multiproducer multiconsumer ringbuffer in ideas4. Don't block the event loop!
I like process calculus and pi calculus and CSP. But I think we can combine them altogether.
Universal constructors and work stealing and compare and set is a powerful primitive for concurrency.
I recently wrote some multi-threaded code (for the first time) for a game I'm working on. Luckily the data I need to process is in a (C++) vector, so I was able to break the data up by dividing the vector size by available cores. I certainly felt 10 ft. tall when it worked!
Another area of my code that would greatly benefit from concurrency requires updating an unordered_map. Since it's position-based and not unit_id-based, I'll need to figure out a way to prevent data races. The best option I've thought of so far is assigning thread numbers to each edge in my graph, and storing the units in each edge in a queue, which should prevent data races and car collisions. (edit: ha)
That being said, I'm the sole developer on this project and I cannot imagine working on a multi-threaded application with dozens of other developers. It is nice dipping my toes in the water though.
Games had to adapt to threading before I even joined. I got lucky to work with some very efficient threaded systems. They were never correct or safe xD.
But I can share that if you see a comment like "these variables are protected by this lock" more than a few times per million lines of code, performance is not the reason why it is threaded.
From public videos, I think the best explanation available is Sean Parent's "Better Code: Concurrency" [1].
I really would like to see high-level or safer languages or systems come to future game platforms. It should be possible. It just needs economic factors to kick in, like people finally noticing that slow and safe is not going to be any faster next year with new hardware.
This debate doesn’t mention software transactional memory which as a programming model, in languages like Clojure and Scala and Haskell has a much lower cognitive overhead for multi-thread scenarios. Ie it is much easier to ensure you have avoided deadlocks using STM.
Here is a technique I use to write multi-threaded code without too many bugs.
I write sequential code, where most of the heavy work is arranged to fit in arrays and processed with a parallel_foreach function receiving a container and a worker, dispatched by the job system that is part of my framework.
The multiple cores are used mostly to accelerate the processing of heavy and linear tasks while the control flow is still purely sequential.
This is not optimal, but I think this is a reasonable tradeoff between simplicity and performance.
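A rough Rust sketch of the same shape using only scoped threads from the standard library, rather than my actual job system (names, chunk size, and the sqrt "work" are arbitrary):

use std::thread;

// Run `work` over every element, splitting the slice across threads.
// The caller's control flow stays sequential: this call only returns
// once every chunk has been processed.
fn parallel_foreach<T, F>(items: &mut [T], threads: usize, work: F)
where
    T: Send,
    F: Fn(&mut T) + Sync,
{
    let threads = threads.max(1);
    let chunk_len = ((items.len() + threads - 1) / threads).max(1);
    thread::scope(|s| {
        let work = &work;
        for chunk in items.chunks_mut(chunk_len) {
            s.spawn(move || {
                for item in chunk {
                    work(item);
                }
            });
        }
    });
}

fn main() {
    let mut samples: Vec<f32> = (0..16).map(|i| i as f32).collect();
    parallel_foreach(&mut samples, 4, |x| *x = x.sqrt()); // the "heavy" work
    println!("{samples:?}");
}

A real job system reuses worker threads instead of spawning per call, but the control-flow shape, and the reasoning burden, is the same.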
I write concurrent systems code for a living. Rust is much easier to get right than go, because it uses the borrow checker to statically check for correct synchronization. (Check out tokio.)
Amazon somewhat quietly replaced the S3 service implementation with a rust reimplementation, then announced it after the transition landed without making any bad headlines.
I've heard Microsoft has moved to Rust for all new projects (there is an exception process for C++, of course).
Parts of Chrome use it now.
Support for using Rust in the mainline Linux kernel should land soon.
I would change that to editing MT code. It usually starts with a clear picture in mind... but even one's own code quickly becomes cryptic, full of hidden assumptions, without lots of comments. I used to have this picture printed and put above a server in a small startup-like team. The whole thing ended up not being related to any lock or interlocked call though.
Writing multi-threaded code is easy if you use semaphore protected queues between threads as the only way for threads to exchange data. Basically emulating how Erlang does it.