Hacker News new | past | comments | ask | show | jobs | submit login
An introduction to lockless algorithms (2021) (lwn.net)
216 points by signa11 on April 24, 2023 | hide | past | favorite | 69 comments



"Lockless" sounds very positive, and it's tempting to think that everything should be made lockless. However, it's less great than it sounds.

Mutexes are very cheap in the uncontended case, and in the contended case they have some nice properties. You should only consider going for a lockless algorithm if you can prove it is better than just using a mutex. Sometimes lockless algorithms are in fact slower than algorithms using locks, as things like atomic compare-and-exchange instructions can have a significant cost.

Another problem with lockless is that you can't turn every algorithm into a lockless one, it's actually only a handful of them that are practical to implement locklessly. Of course, the ones that are practical, like some queue and linked list algorithms, can give very nice benefits if used correctly.


To add to this, the superset of these ideas is referred to as "non-blocking", and lock-freedom is one flavor. There is also obstruction-freedom, starvation-freedom, wait-freedom, etc.. All of which have their own trade-offs and complexities. If you're interested in some literature, McKenney's book[0] is considered a great practical introduction to these ideas. (also discussed before on HN [1][2])

In applications where you're willing to trade throughput for latency (like RTOS schedulers, drivers, audio processing), it can often make sense to go for a wait-free data structure.

[0] https://arxiv.org/pdf/1701.00854.pdf

[1] https://news.ycombinator.com/item?id=34859102

[2] https://news.ycombinator.com/item?id=26537298


I'd like to add some context as to why lockless algorithms are popular in the Linux kernel. Roughly speaking, there are two possible scenarios in which kernel code is executed: Either in context of a process (e.g. while handling a syscall) or while handling an interrupt [1]. Now, a mutex usually sets the current thread to sleep if it tries to lock a currently locked mutex. Even in kernel land, this is perfectly fine as long as the kernel is in process context. However, while handling an interrupt there is no current thread that can be put to sleep. Therefore you can't use a mutex for kernel data structures that will be accessed during interrupts, and that applies to a lot of data structures.

This is why some common locking advice does not really apply to kernel code. Another example would be "avoid using spinlocks, prefer a mutex" - spinlocks are widely used in the kernel since they're the most straightforward alternative to mutexes.

Not disagreeing with the parent post, just wanted to add some context why LWN likes to talk about lockless algorithms.

[1] Actually, the kernel differentiates between hardirqs and softirqs, but that's not important here.


You're describing the top-half and bottom-half driver model that's been part of Unix since forever. Indeed, bottom-half code can't sleep or block in any way, and that's a problem even in single-CPU systems like the ones Unix grew up on. There's ways to deal with this other than lockless data structures, though lockless data structures help, naturally.

The reason lockless data structures are popular is performance and scalability, and they're popular in user-land as much as in kernel-land, and in the Linux kernel as much as in any other OS.

The idea is to make sure that you don't block because context switches are expensive, but also not to spin for a long time either because that can be even worse than blocking. Along the way you want to make sure that contention is not a problem and that the algorithm scales to many CPUs and lots of racing, and that it's free of race condition bugs.


Indeed. Also, spinlocks are not that great in userlang where one thread can preempt another thread that is holding a spinlock, which means threads on other cores that also want that lock would spin for a whole timeslice and burn a huge amount of cycles.

In the kernel, spinlocks are typically taken at the same time IRQs for the current core are disabled, which avoids this issue (and deadlocks of course).


You don't have to disable IRQs if the spinlock is never taken by an interrupt handler. Disabling preemption would be enough in such a case, and that reduces latency issues related to disabling IRQs.


> Mutexes are very cheap in the uncontended case

It was a while ago I was deep into this mess so forgive any ignorance–but–iirc the thread-mutex dogma[1] has many pitfalls despite being so widely used. Primarily they’re easy to misuse (deadlocks, holding a lock across a suspend point), and have unpredictable performance because they span so far into compiler, OS and CPU territory (instruction reordering, cache line invalidation, mode switches etc). Also on Arm it’s unclear if mutices are as cheap because of the relaxed memory order(?). Finally code with mutices are hard to test exhaustively, and are prone to heisenbugs.

Now, many if not most of the above apply to anything with atomics, so lock-free/wait-free won’t help either. There’s a reason why a lot of concurrency is ~phd level on the theoretical side, as well as deeply coupled with the gritty realities of hardware/compilers/os on the engineering side.

That said, I still think there’s room for a slightly expanded concurrency toolbox for mortals. For instance, a well implemented concurrent queue can be a significant improvement for many workflows, perhaps even with native OS support (io_uring style)?. Another exciting example is concurrency permutation test frameworks[2] for atomics that reorder operations in order to synthetically trigger rare logical race conditions. I’ve also personally had great experience with the Golang race detector. I hope we see some convergence on some of this stuff within a few years. Concurrency is still incredibly hard to get right.

[1]: I say this only because CS degrees has preached mutices to as the silver bullet for decades.

[2]: https://github.com/tokio-rs/loom


I think the issues with concurrency are at this point greatly overblown. Is it hard? Maybe, so it are a lot of things in our field. Is it phd level? Not at all, unless you are literally breaking new ground; most problems have very well known solutions.

Mutexes are a solution to the mutual exclusion problem, no more no less. Sometimes this problem can be solved by a queue, but that opens other large cans of worms like asynchronicity.

I write more lock-less [1] code than most, but I will often fall back to a mutex as the right solution for a problem given a complexity and performance budget.

[1] by that I mean code without standard use of mutexes, not necessarily non-blocking.


The nasty part of concurrency bugs is that normally you can expect the code is almost correct and needs some tweaks to fix oversights, but concurrency bugs in incompletely thought-out code more often requires throwing out the design to build it right. Additionally, concurrency issues usually arise from multiple components interacting in nonlocal ways, requiring a global knowledge of the system to diagnose and fix. And finding the presence of a bug or reproducing it is usually probabilistic, making concurrency heisenbugs hard to discover, locate, and prove they're fixed.

It can be made somewhat tractable to experienced concurrency wizards; I think Rust's "cross-thread shared ^ mutable" rule is a good starting point, but have less experience with Go, JS, or Erlang-style approaches.


Aren’t there formal analysis techniques that can help ferret those out? My MSCS had a course called “science of programming”. The focus used to be proofs for correctness of programs, which included concurrent programs.


I've certainly heard of such things. If there's a tool to prove that my 135,000 lines of C++ are formally correct, I'm up for it. :-)


On Intel processors, which have strong memory order, concurrency is fairly easy to reason about. On processors with weak memory ordering (ARM notably), things get really treacherous.

I'm still settling in to the horrors of memory ordering on ARM. The one thing I know for sure, is that my oeuvre contains a trail of code that will work fine on Intel processors, but won't work on ARM (or any other processor with weak memory ordering). :-/


If you are using mutexes or other well defined abstractions, you need to care little about ordering.

If you are writing atomic based code, you should be using something like C++11 and C11 atomics that will take care of fencing on all platforms.


You should always use memory fences on Intel when using atomics imo


Explicit fences are extremely rarely needed on x86.


From my point of view, the issues with concurrency (specifically the shared memory "threadlike"-kind) are absolutely NOT overblown, because misuse of that mechanism can completely RUIN a codebase in a way that few other things can; "goto" abuse would be a good analogy, or massive amounts of partly redundant global state.

My perception is that other comparably dangerous mechanisms are taught with appropriate caveats-- you WILL typically be admonished to keep variables as local as possible, learn how to encapsulate state, and probably be instructed to ALWAYS use proper loops, conditionals and function calls instead of wild gotos-- it does not even matter if you learn programming in university or on your own.

But with threading this is not the case, you'll typically get handed threading primitives without much prejudice, with predictable results.


Concurrency, like distributed systems, is only "hard" in so much it's easier to sell your product, that features it, if you can convinice others of how arduous solving such a CS101 topic would otherwise be.


If I were younger you would have triggered my impostor syndrome, but as a man of wisdom I know that you (a) have super-human skills[1] nobody else has, or (b) you’re talking about something different.

Note I’m not talking about an isolated case of analyzing a concurrent snippet of code for happens-before relationships, but rather how to write highly concurrent code within large scale applications, without runaway complexity.

[1]: https://bholley.net/blog/2015/must-be-this-tall-to-write-mul...


In addition to lock-free queues, I think another important lock-free primitive is a cell where writing a value drops previous values, even if they haven't been read, used for eg. time of day, volume levels passed from an audio thread to a GUI VU meter, and the like. Implementations include word-sized relaxed atomics (bigger tears), SPSC triple buffers (doesn't generalize to >2 threads), or other days structures I don't understand as well, like RCU or hazard pointers.


> another important lock-free primitive is a cell where writing a value drops previous values, even if they haven't been read

More importantly, if you've read a value then that value should stay alive (not be destructed/collected) until you explicitly release it or re-read that "cell".

Then what you do is publish immutable data to the "cell".

Unlike rw locks there's neither the risk of readers blocking behind writers nor writers being starved, and it's highly concurrent.

Clojure has a `ref` for this. Linux has RCU, which is a bit like this.

I've implemented just that in a lockless way twice (see elsewhere in this thread) with this C API:

  typedef struct thread_safe_var *thread_safe_var; /* TSV */
  typedef void (*thread_safe_var_dtor_f)(void *); /* Value destructor */

  /* Initialize a TSV with a given value destructor */
  int  thread_safe_var_init(thread_safe_var *, thread_safe_var_dtor_f);

  /* Destroy a TSV */
  void thread_safe_var_destroy(thread_safe_var);

  /* Get the current value of the TSV and a version number for it */
  int  thread_safe_var_get(thread_safe_var, void **, uint64_t *);

  /* Release the reference to the last value read by this thread from the TSV */
  void thread_safe_var_release(thread_safe_var);

  /* Wait for a value to be set on the TSV */
  int  thread_safe_var_wait(thread_safe_var);

  /* Set a new value on the TSV (outputs the new version) */
  int  thread_safe_var_set(thread_safe_var, void *, uint64_t *);


Yes! I’ve used atomics for metric counters (MPMC) with reasonable success. But for anything larger than an atomic uint64, you’d need good library support. Fortunately this can be solved without OS support in an optimistic fashion (atomic retry loop), right?


By an atomic retry loop, do you mean reading a generation, reading a set of values, then checking the generation is consistent? These kinds of seqlocks have interesting codegen challenges with the C++11 threading memory model (https://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf).


Another problem with lockless is that you can't turn every algorithm into a lockless one, it's actually only a handful of them that are practical to implement locklessly.

You can make every data structure lock-free, even wait-free, there are universal constructions [1]. The resulting performance may however be prohibitive.

[1] https://www.researchgate.net/publication/221343511_Impossibi...


Well, yes, but the appeal of lockless data structures is that they generally perform very well because they don't require sleeping, and sleeping is bad because context switching is expensive. A lockless data structure that doesn't perform or scale better than a traditional blocking data structure is just not worth using (though it might be worth talking about).


You are totally right. Here's my rough personal guideline: prefer mutexes over lockfree code - unless you can't. For example, you cannot lock a mutex from the audio callback because it might put the thread to sleep and you would miss your deadline. (For anyone interested in this topic: http://www.rossbencina.com/code/real-time-audio-programming-...)


If putting the thread to sleep would be the problem, simple try_to_lock instead of blocking would be the solution. In practice, it looks like releasing the lock and potentially waking up threads that were blocking on it is another problem (https://timur.audio/using-locks-in-real-time-audio-processin...)

For what it's worth, on macOS profiling audio apps shows that the OS itself is using mutexes on audio thread and it doesn't seem to be a problem. The popular JUCE libraries add another set of mutexes.


try_lock fails. Then what?


Citing TFA,

> If you’re less lucky, you’ll have to switch your audio effect to bypass mode, or perhaps fade out, or some similar fallback strategy. It’s probably not ideal, but in many cases still better than a glitch.


or if the lock is on some smaller part you might be able to still render the audio but e.g. not update the settings that changed yet, and try again next time.


Similar issues with locking mutexes from signal handlers or from high priority threads (if you want to avoid prio-inversion).


Mutexes are implemented with (among other) cmpxchg instructions under the hood. It’s sort of a false dichotomy to divide the world into locks and cmpxchgs. Also inconsistent to claim they are slow but uncontended mutexes are fast. It is certainly true that you can write slower and harder to understand algorithms using atomic primitives instead of mutexes.


This is a common misunderstanding but just because you use cmpxchgs and other similar instructions has nothing to do with being lockfree. Lockfree really means *dead*lock free since it guarantees that some thread always makes progress. This is far more difficult than it sounds since you can't assume that some particular thread (like the one holding a lock) is ever scheduled. Some models relax this a bit though and do assume a thread requesting to execute will eventually execute in which case systems with mutexes can be lockfree.


I was responding specifically to this claim in the original comment:

> Sometimes lockless algorithms are in fact slower than algorithms using locks, as things like atomic compare-and-exchange instructions can have a significant cost.

Not interested in getting into a pedantic argument about the definition of "lockless" or "lockfree."


> You should only consider going for a lockless algorithm if you can prove it is better than just using a mutex.

I'm with you. If you aren't benchmarking your datastructure changes, you probably shouldn't be messing around with lock-free code to begin with.


Even an uncontended mutex can create a lot of cache-line bouncing, making it less cheap than it seems.


But a lockless algorithm would have the same amount of cache-line bouncing in the uncontended case.

In the LWN series about lockless algorithms, there is a lot of detail about the atomic primitives used in them, but apart from ringbuffers, linked lists and RCU, I actually didn't see any mention of higher level lockless algorithms. For single-producer, single-consumer queues and for linked lists, lockless algorithms are known and performant, but many others are either not that performant, or they come with severe restrictions. For example, RCU sounds great, but you have to worry about grace periods, and you should only use it when you read much more often than you modify.


I've an RCU-like scheme that's performant and doesn't have anything like grace periods, and it works in user-land. https://github.com/cryptonector/ctp


Just looking at this initially, thread_safe_var_init is not correct. There is no guarantee whatsoever that "*vpp = vp" actually does anything, the compiler is perfectly free to just never allocate vp and just use the memory at vpp (thus allowing partial initialization) and the situation is even worse on ARM. You need some sort of memory barrier between that last assignment and it should probably be a volatile/std::atomic pointer.


I've added use of Helgrind's `ANNOTATE_HAPPENS_BEFORE()` and `ANNOTATE_HAPPENS_AFTER()` macros to the functions in `atomics.c` and now this is Helgrind-clean as well as TSAN-clean.

The only bugs found by TSAN (as noted in another reply) were the one you found and two assert()s that didn't use atomics. Those two asserts could be a big problem if one ran a non-NDEBUG build in production.

So now this is TSAN- and Helgrind-clean.

Thanks for prompting me to do this extra work!


I fixed that. TSAN also found a pair of data races involving assert()s (oof). All three are fixed.


Hmmm, good point, the init function does need a memory barrier. Thanks for the review!


> O(N log(N)) where N is the maximum number of live threads that have read the variable and M is the number of values that have been set and possibly released).

If you reference N twice and M zero times, why did you define M? Did you make a typo in the O() or is this intentional?

> Most threads only ever need to call thread_safe_var_get().

> reference count the data.

Does getting a new value automatically release a reference to the old one (so it's no longer safe to read)?


> If you reference N twice and M zero times, why did you define M? Did you make a typo in the O() or is this intentional?

That was a typo. It should have read `O(N log(M))`.

> Does getting a new value automatically release a reference to the old one (so it's no longer safe to read)?

Correct. Every read (get) of the variable causes the previous value to no longer be safe to use in that thread.


> Sometimes lockless algorithms are in fact slower than algorithms using locks, as things like atomic compare-and-exchange instructions can have a significant cost.

You'll need to expand on this, since mutex acquisition also requires similar atomic read-modify-write operations. I think what you may be trying to say is that lockless algorithms require more of these operations as they scale to more processors?


Yes, only a few lockless algorithms require only a single atomic operation. And as soon as you need to do cmpxchg in a loop and there is contention, you can burn a lot of cycles while cache lines bounce between cores, potentially an unbounded number of cycles if it is not a wait-free algorithm. While a mutex would have to pay the price of a futex call in the contended case, it is at least bounded.


Yeah. Now what about cases where the max contention is bounded (say, at most 2 threads waiting on a lock)? And what about factors other than running time? How does, say, fairness affect this picture?


If multiple cores are trying to do atomic operations on the same value, the hardware typically is not fair. For mutexes, it depends on whether the operating system wakes up waiters in a fair way. Often that is not fully fair either.

If there are only two threads at most, you can sometimes use a different algorithm than if you have an arbitrary number of threads (for example, for single-producer, single-consumer queues, lockless is typically better than with a mutex).

You should of course benchmark your particular problem to see what solution is actually faster.


> You should only consider going for a lockless algorithm if you can prove it is better than just using a mutex.

I would, in fact, give precisely the reverse advice for one reason:

Mutexs don't compose.

Mutexs are really good at putting subtle bugs into your code that are ridiculously difficult to figure out.

Lockless algorithms may be slower or cause performance issues, but they don't malfunction because you got the release order backwards.


> Mutexs don't compose.

The don't compose because (as you hinted) lock taking/releasing order has to be the same for all callers, but if mutexes are being taken behind the scenes then API usage order becomes critical, and it is trivial to screw that up.

> Lockless algorithms may be slower or cause performance issues, but they don't malfunction because you got the release order backwards.

And they may scale better. Scalability is important.

For me the classic case is rw locks vs. RCU-ish schemes. I think there's no case where rw locks are ever appropriate if you have an RCU-ish alternative. For example, OpenSSL uses only rw locks -- nuts!


I see your point, but as a counterpoint, consider that you have two or more distinct data structures, for example several separate linked lists. What if you need to update multiple lists in one atomic operation? With mutexes, you can do this rather easily: either you have one mutex that guards all lists, or you have one mutex for each list and you ensure you always take them in the same order. I would think that qualifies as composing. However, you can't easily compose a lockless algorithm that works on one list to make it work on multiple at the same time.


I think even this muddies the waters - oftentimes people think of performance in terms of metrics (latency, bandwidth, memory, etc). Lock-free algorithms are about performance guarantees.

Specifically, a lock-free algorithm is appropriate if it would be considered an error for one thread to stall and cause any other threads to stall.

An example of this would be a concurrent garbage collector. These are used when you do not want garbage collection to stop a program from executing. The way that the GC thread communicates with the rest of the program therefore must be lock-free.

That's not to say a stop-the-world GC could be faster: just that its lack of performance guarantees would be considered a bug.


> You should only consider going for a lockless algorithm if you can prove it is better than just using a mutex.

This is true for all optimizations: optimizations should be chosen based on profiling, not based on reasoning about the programs. Reasoning can give you ideas of what to profile, but modern programs, including the environments in which they run, are complex enough that you can never really be sure that the assumptions you're basing your reasoning on are true.

So I fully agree with this, but I have some questions about the rest of your post.

> Mutexes are very cheap in the uncontended case, and in the contended case they have some nice properties. Sometimes lockless algorithms are in fact slower than algorithms using locks, as things like atomic compare-and-exchange instructions can have a significant cost.

My (very limited) experience in this area is mostly with optimistic lock-free algorithms, so maybe there's something that I'm missing. But my understanding of optimistic lock-free algorithms is that they make use of the idea that the uncontended case is the most common case, and try to optimize that case heavily. On Intel processors, a compare-and-exchange operation should literally be a single instruction (CMPXCHG[1]). So I'm wondering how the uncontested case using a mutex could be seen as a positive?

> Another problem with lockless is that you can't turn every algorithm into a lockless one, it's actually only a handful of them that are practical to implement locklessly.

How could "you can't turn every algorithm into a lockless one" could be true? It seems to me that you can "simulate" an atomic compare and swap using mutexes--although I can't find a reference easily, I believe this is what GCC's __sync_X_compare_and_swap does on architectures which don't have a compare-and-exchange instruction. Likewise you can implement a mutex fairly easily using a compare-and-swap operation.

Whether this is practical would of course depend on the situation--implementing a mutex in terms of compare-and-exchange or vice-versa are likely not the optimal ways to implement either. But it seems to me that if compare-and-exchange and mutexes can be implemented in terms of each other, isn't that a trivial proof that any mutex algorithm can be implemented as a lock-free algorithm and vice-versa?

[1] https://www.felixcloutier.com/x86/cmpxchg


What you're describing about writing a mutex with cmpxchg is a spinlock. If you just use that operation it's also a very inefficient spinlock.

The point of a lock free algorithm is to avoid having a thread wait on a resource and not do any work. It may do useless work that needs to be reverted, but it's not sleeping or spinning until something happens. It's not guaranteed that you can do that for every algorithm in a way that makes sense performance wise.

Here's a nice resource I used on my thesis for an adequate but not perfect spinlock with atomic instructions: https://rigtorp.se/spinlock/


> The point of a lock free algorithm is to avoid having a thread wait on a resource and not do any work. It may do useless work that needs to be reverted, but it's not sleeping or spinning until something happens. It's not guaranteed that you can do that for every algorithm in a way that makes sense performance wise.

Sure, but that seems like a much narrower statement than "you can't turn every algorithm into a lockless one".

Maybe I'm just being pedantic and the person intended something more like what you said, though.


You're too focused on "mutex". There are many more interesting constructs that one can build in a lockless manner: lists, hash tables, RCU, etc.

Here's some of mine: https://github.com/cryptonector/ctp/


The problem with typical OS-provided mutex is that it is almost always in some way distinctly non-cheap either because of the particular implementation or because of limitations imposed by ABI/API compatibility (and in some case both).


I am the author of liblfds, a portable, license-free, lock-free data structure library written in C.

I've been working on other projects for the last number of years, but I'm finally back to liblfds and making progress toward the next release.

https://www.liblfds.org/slblog/2023-04.html


I might also point people at ConcurrencyKit.


I know Samy. CK is extremely well made.


I like the code of this ringbuffer

https://www.linuxjournal.com/content/lock-free-multi-produce...

I liked this whitepaper https://www.cs.technion.ac.il/~erez/Papers/wfquque-ppopp.pdf

I am a beginner at this kind of thing but I created an array of integers that each thread owns an index. They write to the array at their index that they want access to the critical section.

We scan the array forwards and backwards to see if there is any thread that has claim to the critical section.

I even TRIED to write a model checker https://github.com/samsquire/multithreaded-model-checker

This is inspired by left-right concurrency control whitepaper Left-Right: A Concurrency Control Technique with Wait-Free Population Oblivious Reads


The article in the first link is pretty bad, and the entire code is undefined behavior. First of all, using inline asm + volatile to emulate well-ordered release & acquire on x86 is unnecessary. That's exactly what atomics are supposed to provide for you.

> In this article, I concentrate on x86, as it is the most widespread architecture rather than write generic (but slower) code.

Any reasonable C++ compiler will generate simple loads and stores for atomic r&a. There's no penalty for writing generic code.


I too am a beginner with concurrent programming—it's a difficult topic to get your head around at first. Some resources that I've found super helpful:

- Dmitry Vyukov's Lockless Algorithms: https://www.1024cores.net/home/lock-free-algorithms

- Jeff Preshing's blog (worth exploring all adjacent articles): https://preshing.com/20120612/an-introduction-to-lock-free-p...

- Bartosz Milewski: https://bartoszmilewski.com/2008/12/01/c-atomics-and-memory-...

- Memory Barriers: a Hardware View for Software Hackers—P.McKenney: https://www.researchgate.net/publication/228824849_Memory_Ba...

I hope you find some of these useful. I've re-read Paul McKenney's paper every 2-3 months in an attempt to get this stuff to stick! :)


The Art of Multiprocessor Programming is really good


Thanks for linking all these references!

I’m also new to a lot of these topics but recently read Rust Atomics and Locks which has helped my understanding quite a bit - at least keeping me above water reading all the discussion in this post. The examples are in Rust but the concepts are at the kernel/cpu level and apply beyond the language. I recommend it.

https://marabos.nl/atomics/


Ringbuffers are one of my favorite things to use. Thanks for the article.


I am a noob here, so maybe this is obvious, but how exactly is a "lock" defined? it seems to me the acquire/release semantics are very similar to locks? I.e. I'm struggling to understand what is "lockless" about the primitives.


I only had time to skim, but the primitives described here (acquire, release, happens before) sound almost exactly like the model that was standardized in C++11 and C11.


This is not a coincidence. :-)


When “lockless” algorithms rely on CMPXCHG aren’t they just shifting the lock from software to hardware?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: