A fast multi-producer, multi-consumer lock-free concurrent queue for C++11 (github.com/cameron314)
210 points by setra on Dec 5, 2016 | 81 comments



Maybe someone can educate me. I've studied them for a while, but I don't get the fascination with lock- or wait-free queues.

If you actually care about real-time guarantees enough to look into lock-free solutions, you already ought to have non-preemptive workers (i.e. cooperative multitasking). This then allows you to have only one queue per CPU (since exclusive access is guaranteed), simplifying the problem.

With one queue per CPU, multi-producer becomes trivial (atomic doorbell + round robin), and multi-consumer is easy (CAS to acquire a work item). You don't have fairness guarantees for consumers, but who cares: consumers can only be unfair when they aren't saturated, and in that case the unfairness doesn't matter.
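
To make that concrete, the CAS-to-acquire step could look roughly like this (a sketch only; the names and the fixed-size ring are made up, and wraparound/ABA handling is omitted):

    // Sketch: consumers race to claim the next item from a per-CPU ring.
    #include <atomic>
    #include <cstddef>

    struct WorkItem;

    struct PerCpuQueue {
        WorkItem* slots[256];
        std::atomic<size_t> head{0};   // next item to consume
        std::atomic<size_t> tail{0};   // one past the last produced item

        // Try to claim one work item; returns nullptr if empty.
        WorkItem* try_acquire() {
            size_t h = head.load(std::memory_order_acquire);
            while (h != tail.load(std::memory_order_acquire)) {
                // Whichever consumer advances head owns slots[h].
                if (head.compare_exchange_weak(h, h + 1,
                                               std::memory_order_acq_rel))
                    return slots[h % 256];
                // CAS failure reloads h; retry against the new head.
            }
            return nullptr;            // queue drained
        }
    };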

Application stalls don't happen (one non-preemptible queue per CPU), application crashes are managed at a higher level, and IRQs can be disabled for the 5 cycles you spend in the critical sections.

I assume I'm missing something, since I'm somewhat new at this, and lock-freedom is clearly in vogue, but I don't know what it is.


Your architecture proposal seems sound assuming that you have that much control over the whole system, and you don't need to integrate with any legacy components. In the common case, you need to deal with limited control (e.g. in a userspace app) and you have legacy components (e.g. a consumer OS, a Java VM, etc).

If you're doing best-effort soft-real-time on a consumer OS then you probably don't have control over thread affinity to _guarantee_ one thread per CPU, and you certainly can't disable IRQs.

Even on servers, you can get Linux to set up your thread-per-CPU thing, but still no IRQ masking. Also, you'd need your whole stack to do cooperative multitasking in a fully composable way before you could do as you propose. Maybe that's plausible in Go, but if you're in Java (as a lot of the HFT people seem to be, for reasons I don't understand) then probably only _some_ of your system is structured as one-thread-per-CPU cooperative multitasking, and you may still need to offload high-latency work using lock-free queues.

In my domain (real-time audio on desktop OSes), neither Windows nor macOS have working deterministic priority inversion mitigation. So using lock-free queues between priority domains (e.g. between GUI thread and real-time audio thread) avoids priority inversion risk (compared to using mutexes). Granted, this is a rather specialised use-case.

Another reason people are interested in these things is that there's a perception that lock-free algorithms perform better for inter-thread queuing than the alternatives. In particular, there are claims that they scale better when there are a large number of producer and consumer threads/cores. I think this is the most controversial area, since the queue algorithms may scale better than e.g. a single mutex on a single queue, but it's not necessarily the case that an application architecture that uses a MPMC queue is the best fit for purpose.


     If you're doing best-effort soft-real-time on a consumer OS then 
     you probably don't have control over thread affinity to _guarantee_ one thread per CPU
Which consumer OS? FreeBSD, OpenBSD, Linux, and Windows all offer the functionality. To my knowledge macOS is the only consumer OS that doesn't.

     and you certainly can't disable IRQs
This has been in Linux for over a decade. You can modify IRQ affinity from the root shell via the proc-fs [1].

[1] https://www.kernel.org/doc/Documentation/IRQ-affinity.txt
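
For example, pinning an IRQ to one CPU is just a write of a hex CPU mask to that procfs file (a sketch; the IRQ number 24 is made up, and this requires root):

    // Sketch: pin IRQ 24 to CPU 1 only (mask 0x2). Requires root;
    // the IRQ number is illustrative.
    #include <fstream>

    int main() {
        std::ofstream aff("/proc/irq/24/smp_affinity");
        if (!aff) return 1;   // not root, or no such IRQ
        aff << "2\n";         // hex bitmask: bit 1 set => CPU 1
        return aff.good() ? 0 : 1;
    }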


Hmm. Interesting. I would have never thought to use that within a program during runtime.

I don't think that the IRQ affinity facility is meant to be used like that ("that" being the masking of interrupts for a given CPU to create critical regions within a program).

For starters, it seems like you'll need to be running as root to be able to change it. That could be an issue for some applications. It also says that you can’t disable interrupts for all cores.

It also seems very non-portable.


    I would have never thought to use that within a program during runtime.
Using it during runtime is unlikely. It does have clear benefits though, especially when dealing with 10GbE NICs: with N cores handling two NICs, cache thrashing can bottleneck your maximum bandwidth, while dedicating NIC -> core -> IRQ 1:1:1 solves this.

Generally, setting up IRQ masking should be part of your pre-startup configuration, not something you set dynamically at run-time.

    It also says that you can’t disable interrupts for all cores
Yes/No.

One can disable all interrupts for a core via separate mechanisms. But this means you need to implement your own scheduling on that core and remove that core from the kernel's scheduler. You also lose the ability to make syscalls, and lose kernel-level mutexes, futexes, IO, etc. It gets very complicated very fast. Having some interrupts is generally a good thing.

Lastly, you'll still be subject to memory-bus stalls (they aren't called interrupts, but they are non-trivial stalls) needed to maintain cache coherency and R/W ordering (on AMD64 and ARMv8). You can never opt out of these.

If you want to disable all interrupts on all cores why are you even running an OS?

    It also seems very non portable.
AMD64 Linux is literally the most used OS in data centers, but you are nonetheless correct.


Thanks. Lots of good information there.

As for disabling interrupts on all cores, the specific case which I was thinking of was when you’re running a preemptive OS and you have a bunch of threads (== num CPUs) which all need to temporarily disable interrupts to create a critical region at the same time. I could be misunderstanding the original goal though. My understanding of the parent comment was that disabling IRQs would be used to prevent a thread from being scheduled out in the middle of an operation (like a kernel spinlock on a UP system), so that you can avoid the lock even on a non-cooperative multitasking system. I guess by temporarily disabling IRQs, the system becomes temporarily cooperative.


     My understanding of the parent comment was that disabling
     IRQs would be used to prevent a thread from being scheduled
     out in the middle of an operation
1. The scheduler already respects this. It won't suspend during the body of a function. Only at entry/exit.

2. `chrt --sched-deadline X` can ensure your thread will run for `X` ns without interruption (higher priority than all other tasks), and ensure you always get the same time budget consistently. [1] [2] [3]

3. The original goal of decoding audio can mostly be handled with DEADLINE + setting CPU affinity + masking interrupts on that CPU. Modifying IRQ masks dynamically hits a lot of internal locking and hurts your cache coherency; any gains in one process will come at a massive cost to the entire system.

4. This really seems like overkill. Deadline scheduling will do 99% of what it seems you/the parent need/want. Basically your process can't be interrupted for TIME_WILL_RUN_NS nanoseconds, and will run every TIME_BETWEEN_RUNS_NS nanoseconds. I suggest:

     chrt --sched-deadline [TIME_BETWEEN_RUNS_NS] --sched-runtime [TIME_WILL_RUN_NS] -p [PID]
(I may have the Runtime/Time between backwards).

Most of my experience is in networking for HFT, not audio decoding, FYI.

[1] https://en.wikipedia.org/wiki/SCHED_DEADLINE

[2] http://man7.org/linux/man-pages/man1/chrt.1.html

[3] Consistently means as close to absolutely perfect as possible. Your OS is a dynamic system, so it'll be +/- a few nanoseconds. Generally the delta is very small. Caches stall, red-black trees are re-balanced, work is stolen, etc.
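
(For completeness: to do this from inside the process rather than via chrt, there's the raw sched_setattr() syscall. glibc has no wrapper, so the struct is declared by hand per the sched_setattr(2) man page; note that very new kernel headers may already define it. A sketch, with made-up budget numbers:)

    // Sketch: put the calling thread under SCHED_DEADLINE.
    // The runtime/deadline/period values below are illustrative only.
    #include <linux/sched.h>    // SCHED_DEADLINE
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstring>

    struct sched_attr {          // layout per sched_setattr(2)
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;  // ns of CPU guaranteed per period
        uint64_t sched_deadline; // ns by which that budget must be usable
        uint64_t sched_period;   // ns between activations
    };

    int main() {
        sched_attr attr;
        std::memset(&attr, 0, sizeof(attr));
        attr.size           = sizeof(attr);
        attr.sched_policy   = SCHED_DEADLINE;
        attr.sched_runtime  =  5 * 1000 * 1000;   //  5 ms budget...
        attr.sched_deadline = 10 * 1000 * 1000;   // ...within 10 ms...
        attr.sched_period   = 10 * 1000 * 1000;   // ...every 10 ms
        // 0 = this thread; needs CAP_SYS_NICE / root.
        return syscall(SYS_sched_setattr, 0, &attr, 0);
    }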


I stand corrected on the Linux IRQ point.

In the past I have not found Windows thread affinity APIs to provide hard guarantees. Perhaps this has changed since I last tried it.


Thank you, that was very informative.


If you can make your system lock-free, it will have a bunch of nice properties:

- deadlock-free

- obstruction-free (one thread getting scheduled out in the middle of a critical section doesn't block the whole system)

- OS-independent, so the same code can run in kernel space or user space, regardless of OS or lack thereof (basically any lock besides a spin-lock needs OS support)

These properties are especially nice if you are writing a library that you want to be as unassuming and unimposing as possible about the surrounding environment. Given two libraries, one that depends on locks and one that is lock-free, I'll go for the lock-free one if at all possible.


Are you sure that you get rid of deadlocks? Instead of a lock it loops on an atomic variable if there is some contention. Granted, you have some more control over the priority of tasks and it will be more efficient - if you are smart enough (still it gets very tricky), but the essence is that you are rolling your own locks with atomic variables.


There's a slight difference between rolling your own spinlocks and simply doing something like a CAS loop to change a variable. They’re usually both built using similar atomic primitives, but…

The difference is that in a true lockless algorithm, no thread "holds" the lock (because there is no lock). Therefore, if a thread gets scheduled out while in the middle of any operation, every other running thread can still continue working. That’s in contrast to a locking implementation where it’s very possible for a thread to get swapped out while holding a lock, thus causing every other thread to spin endlessly until the scheduler ends up swapping the original thread back in so it can release the lock.

This is one key advantage of using the OS supplied mutex facility rather than rolling your own spinlocks. The OS mutex has the ability to know that if a thread is blocking on the mutex and the thread which holds it is currently not running, it might as well start running the original thread again so that it can release the lock (as nothing else can get done until that happens anyway).

I believe this is why (in my personal experience) userspace spinlocks perform better than mutexes up until about the point where you have many more threads than CPUs. My guess is that the probability of a thread getting scheduled out while holding the lock increases with both the number of threads and the overall contention for the lock.
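
To make the distinction concrete, both of these use the same atomic primitives, but only the second has a "held" state that a descheduled thread can strand everyone else on (a sketch):

    #include <atomic>

    // Lock-free: a preempted thread leaves nothing "held"; other
    // threads' CAS attempts keep succeeding.
    std::atomic<int> counter{0};
    void add_lockfree(int n) {
        int old = counter.load(std::memory_order_relaxed);
        while (!counter.compare_exchange_weak(old, old + n,
                                              std::memory_order_relaxed))
            ;   // failure reloads old; just retry
    }

    // Hand-rolled spinlock: if the holder is scheduled out between
    // test_and_set and clear, every other thread spins until it returns.
    std::atomic_flag lock = ATOMIC_FLAG_INIT;
    int counter2 = 0;
    void add_spinlock(int n) {
        while (lock.test_and_set(std::memory_order_acquire))
            ;                        // spin while "held"
        counter2 += n;               // critical section
        lock.clear(std::memory_order_release);
    }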


> Are you sure that you get rid of deadlocks?

By definition, a lock-free algorithm guarantees at least one thread makes forward progress at each time step. So yes, there is no chance of deadlock.

If you want to guarantee that all threads make progress per unit time, you need wait-free.


There is the lock-free circular buffer, where the buffer is linear and the indexes are atomic variables. Once the buffer is full, the indexes act like locks: the producer can't insert anything unless the consumer removes entries, so the 'locks' play a crucial role in maintaining the data structure (see the sketch below).
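
For concreteness, a sketch of the structure I mean (single producer, single consumer; when full, push simply fails rather than blocking):

    #include <atomic>
    #include <cstddef>

    // SPSC ring buffer: atomic indices, no locks. N should be a power
    // of two for cheap wraparound, though % works generally.
    template <typename T, size_t N>
    struct SpscRing {
        T buf[N];
        std::atomic<size_t> head{0};   // consumer's index
        std::atomic<size_t> tail{0};   // producer's index

        bool push(const T& v) {        // called by the one producer
            size_t t = tail.load(std::memory_order_relaxed);
            if (t - head.load(std::memory_order_acquire) == N)
                return false;          // full: caller retries later
            buf[t % N] = v;
            tail.store(t + 1, std::memory_order_release);
            return true;
        }
        bool pop(T& out) {             // called by the one consumer
            size_t h = head.load(std::memory_order_relaxed);
            if (h == tail.load(std::memory_order_acquire))
                return false;          // empty
            out = buf[h % N];
            head.store(h + 1, std::memory_order_release);
            return true;
        }
    };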

What am I missing?


I implemented an IPC message queue for a router company 15 years ago.

Passing messages between different threads/processes is not "rolling your own locks with atomic variables".

Basically, each thread/process that needs to receive messages creates a MsgQ with msgQLen and gets back a Q_id.

The msgQ_Send() API uses atomic_add() to claim the next slot on the msgQ and places the value and/or MsgPtr there.

There is no lock needed, only atomic_add().

Performance-wise it works very well in SMP environments, and I ported the library to vxWorks, the Linux kernel, user space, and QNX.

I have tons of validation tests for the correctness of message ordering and content, and I constantly run them on multiple 4-core Xeon machines to check correctness and benchmark performance.
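
In C++11 terms the slot-claiming step would be roughly this (a sketch; the original predates std::atomic, the names are made up, and the full-queue check and consumer handshake are omitted):

    #include <atomic>
    #include <cstddef>

    struct Msg;

    // Sketch: each sender reserves a unique slot with an atomic add,
    // then writes its message pointer there. No lock is taken.
    struct MsgQ {
        static const size_t kLen = 1024;   // msgQLen, fixed for the sketch
        Msg* slots[kLen];
        std::atomic<size_t> next{0};

        void send(Msg* m) {
            size_t slot = next.fetch_add(1, std::memory_order_acq_rel);
            slots[slot % kLen] = m;        // this sender owns that slot
        }
    };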


> but the essence is that you are rolling your own locks with atomic variables.

Such an algorithm wouldn't even be obstruction-free.


Yes, I understand the guarantees lock-freedom gives. My point is that I don't see how the same benefits aren't achievable with a simpler mechanism merely given a CPU that doesn't go out to lunch.

I suppose environmental portability is a good enough reason, even if I find it silly that high-performance multicore apps are forced into such a scenario.


> If you actually care about real-time guarantees enough to look into lock-free solutions,

Lock-free does not give any real-time guarantee for any specific task (and neither does wait-free, a much stronger and harder-to-achieve constraint).

The reason I use it when I do, is that it guarantees that a problem in one thread/worker/process, e.g. a stall due to a database commit, does not make others stall as well.

> (CAS to acquire a work item)

Most likely, you've just described a lock free implementation.

> I assume I'm missing something, since I'm somewhat new at this, and lock-freedom is clearly in vogue, but I don't know what it is.

Lock freedom is the property that, at any instant, at least one thread/worker/process is making some progress independently of the others. It is a useful definition because it doesn't care whether locks are explicit (e.g. mutex, semaphore, critical section) or implicit (polling or spinning or whatever).


Go is a language exactly like that, but I still use lock free algorithms. The reason is even with a private per thread data structure you don't have atomic higher level operations like enqueue. So you need to disable interrupts around the data structure methods, both for the OS and the Go runtime. That will cost you more than a lock would, and afaik it's not possible from userspace on Linux (if it is I would love to know how!)


> Maybe someone can educate me. I've studied them for a while, but I don't get the fascination with lock- or wait-free queues.

Let's say you're running a multi-threaded program. The threads (at some point) need to talk to each other. Perhaps to exchange data.

The simplest solution is to use a mutex. It's simple. Just take your existing single-threaded queue / list / tree code, and slap a mutex around it. Maybe 10 lines of code changed. It will work.

Then you do profiling... you discover that when 30 threads exchange tens of thousands of messages per second, mutex contention is a huge problem. You can easily have 30% of wall-clock time spent waiting for contended mutexes.

When you move to lock-free algorithms, you give up a little, and gain a lot. If you're willing to relax some requirements, even better. e.g. I care that the other thread gets a message... but it doesn't have to be now. It could be a millisecond in the future.

A lock-free queue means that you can just dump messages to another thread, and they'll show up in the other thread. (Eventually, when it gets around to looking at the queue).

And mutex contention goes way down. Especially because you're no longer using mutexes!

There can still be some contention on CAS in lock-free algorithms, especially for MPMC queues. The solution there (as is done in the linked project) is to move to single-producer single-consumer queues as much as possible.

The contention goes down to pretty much zero, and performance goes up.

I've spent a good chunk of the last few months re-designing the core of FreeRADIUS to avoid mutexes, shared memory, etc. The old code would get ~40K packets/s with multiple highly contended mutexes. The new test code is getting 3-4M packets/s with a lock-free queue / message passing system.

i.e. mutexes were killing performance in the old system. They're gone in the new system. And performance is substantially higher.
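
For reference, the hand-off with the queue in the OP really is just a couple of calls (roughly as in its README):

    #include "concurrentqueue.h"
    #include <cassert>

    int main() {
        moodycamel::ConcurrentQueue<int> q;
        q.enqueue(25);                      // any thread may produce

        int item;
        bool found = q.try_dequeue(item);   // any thread may consume
        assert(found && item == 25);
    }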


One lock-free queue is just simpler to work with than multiple queues. And the one queue nicely decouples the producers and the consumers, so that they don't need to know each other's load and scheduling.

> multi-producer becomes trivial (atomic doorbell + round robin)

This assumes all the work units impose the same load on the consumers. Otherwise, the producer needs more logic to figure out which queue to enqueue to, and accessing the consumers' statistics again becomes a shared-data problem.


One benefit of wait-free algorithms is that they allow realtime threads to communicate with non-realtime threads without risking unbounded waits. Similarly, priority inversion is no longer an issue.


A real-time OS/app wouldn't benefit even with a 5-cycle critical section. Most, if not all, "true" real-time OSes require "real-time" data coming in from IRQs. Sure, there can be some complex synchronization if an IRQ line requires a certain number of cycles to read/write data, which is where some queue logic can happen.


Real time systems with preemption exist. Non-preemption is not always a feasible choice - its disadvantages are too many.


Cooperative multi-tasking doesn't always mean a per-worker queue is feasible. Really the only time I could see that being feasible is if the work queue is constantly saturated. Often times you just don't know in advance which worker to queue up with which task.


I mean to choose a queue by producer, not consumer.


New algorithms are always welcome. But it isn't a general-purpose MPMC queue (nor does it claim to be). The following constraints are listed in the "Reasons not to use" section:

- not linearizable wrt multiple producers

- not NUMA aware

- not sequentially consistent, quote: "things (such as pumping the queue until it's empty) require more thought to get right in all eventualities"

These are subtleties (especially the first and the last) that may bite if you don't know what you're doing and just want to "plug in a queue." It's going to depend on how you use the queue.

Side note: There are other examples of novel lock-free algorithms that have only been published via blog, PowerPoint, or GitHub (e.g. Dmitry Vyukov's well-known work, Cliff Click's concurrent hash table, Jeff Preshing's hash table). However, in general, lock-free algorithms are widely known to be very difficult to get correct (not too dissimilar to getting a distributed consensus algorithm correct). I can't help thinking that we need a higher bar of correctness than the author's claims and some unit tests. Would you use a distributed algorithm that didn't come with a correctness proof? Personally I'd like to see a formal proof, peer review, and a Spin model. Peer review need not be via academic channels, just something more than self-publication.


> Would you use a distributed algorithm that didn't come with a correctness proof?

That happens all the time in practice. Prior to Kyle Kingsbury's influential blogs on noSQL darlings, for example, it was not even on the geek-pop radar.

And then, quite frankly, there is the gap between implementation and formal description of an algorithm. Incorrect implementation will obviate the guarantees your formal proof is asserting.


> there is the gap between implementation and formal description of an algorithm. Incorrect implementation will obviate the guarantees your formal proof is asserting.

Agree. One aspect of this wrt implementing lock-free algorithms in C++: Usually the academic papers assume a sequentially consistent memory model, so when you're implementing the algorithm you have to work out how to place the memory barriers correctly (a potentially non-trivial task, as other comments on this page demonstrate).


> I can't help thinking that we need a higher bar of correctness than the author's claims and some unit tests. Would you use a distributed algorithm that didn't come with a correctness proof?

Not a chance. Use any multi-party algorithm without a formal proof of any kind, and you just know that there is something that will not work as expected.


There's a reason that there aren't many lock free implementations. If you don't get one from your CPU vendor then there's a high chance that it's not correct.


I would agree, but I'd add that it has been made a little bit easier to handle with the addition of `atomic` in C11 and C++11, since it means you can write atomic code without having to drop to inline assembly to ensure you use the right instructions to make it lock-free. That said, that's only one piece of the puzzle, and you really have to know what you're doing to ensure you write it correctly.

That said, after looking at this code the author appears to know what they're doing. I'd have to read it a lot closer to really make sure though.


`atomic` may require a lock, depending on the CPU.


True. In that event, though, there are few other options. Locking those variables individually like that is probably worse than other locking schemes, so it is worth keeping in mind. But while it would be slower, it would still be just as 'correct' as using actual atomic variables.


You wouldn't use this kind of queue on an architecture like that anyway, just use a lock and a regular queue (the lock is not avoidable anyway)


As far as atomic 'types' are concerned, they rely on CPU hardware for atomicity. I wouldn't be surprised if generated code for unsupported widths uses a lock. http://en.cppreference.com/w/cpp/atomic/atomic


Don't get me wrong, I understand that atomic operations require CPU support. My point was that `atomic` allows you to get access to these operations in a more standard way across different platforms (and of course, it will fall back to generic locking in the event you use an unsupported width or operation, but that's arguably better than simply not working at all).


I've been to a C++11 presentation on multithreading recently, and x86 CPUs guarantee that all reads/writes on std::atomic are indeed atomic and that they happen in the order they were issued. Code generated for ARM guarantees that even multi-byte reads/writes happen in one instruction, but there's no guarantee on the order.


> x86 CPUs guarantee that all reads/writes on from std::atomic are indeed, atomic

... for sizeof(T) <= 64 bits on amd64.

128-bit is technically supported (thanks to the availability of DCAS), but stores and especially loads are significantly more expensive (as there are no native atomic 128-bit loads and stores), even in relaxed modes. Also, as even loads perform write cycles, they can't be used on read-only memory.

Anything larger uses, IIRC, a spinlock pool and is not lock-free.

/pedantic
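
You can check what you actually got (a sketch; whether the 16-byte case reports lock-free depends on the compiler and flags such as GCC's -mcx16, and may need -latomic to link):

    #include <atomic>
    #include <cstdio>

    struct Pair { void* a; void* b; };   // 16 bytes on amd64

    int main() {
        std::atomic<long> x;   // 8 bytes: lock-free on amd64
        std::atomic<Pair> p;   // 16 bytes: cmpxchg16b or a lock fallback
        std::printf("8-byte lock-free: %d, 16-byte lock-free: %d\n",
                    (int)x.is_lock_free(), (int)p.is_lock_free());
    }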


Are you saying this one isn't correct? Confused what point you're trying to make.


The atomic compare and swap is only one part of getting it right. You also have to deal with memory and instruction fences.

Certain architectures have different semantics when it comes to that ordering, so something that works on x86 for instance might explode on PPC or ARM. If you want to verify that the algorithms are correct you basically have to do static analysis, and the only way you'll do that is if you have access to the microcode and register models (i.e. you make CPUs).
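
The classic illustration is publication through a flag (a sketch): with the orderings below it is correct everywhere; with memory_order_relaxed on both sides it happens to work on x86's strong memory model but can read garbage on PPC/ARM.

    #include <atomic>

    int payload;                         // ordinary, non-atomic data
    std::atomic<bool> ready{false};

    void producer() {
        payload = 42;
        // release: the payload write may not be reordered after this store
        ready.store(true, std::memory_order_release);
    }

    int consumer() {
        // acquire: the payload read may not be reordered before this load
        while (!ready.load(std::memory_order_acquire))
            ;                            // spin until published
        return payload;                  // guaranteed to observe 42
    }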

[edit]

To put this into practical terms: I spent some time working with a popular game engine that used a lockfree queue in its core rendering path. It wasn't until 2 or 3 titles had shipped on this engine that it was discovered that there was a bug in the lockfree algo. We're talking trillions of operations under many different threading and context-switching loads, on at least 3 different CPU architectures.


I don't get the point you're trying to make. Aren't std::memory_order's guarantees supposed to work on all supported archs?

The code in the OP does use memory fences. Are you implying that their implementation is incorrect?


I'm not implying their implementation is incorrect. Just that these types of things are very easy to get wrong, and when you do, it's usually the type of bug that takes months to track down after you've eliminated every other subsystem involved.

Generally if there's not a huge organization putting their reputation(and $$$) on the line there is going to be bugs.

Most of the time, if you're going lockfree for performance reasons, there are much larger gains to be found in your cache usage or overall architecture.


    Generally if there's not a huge organization putting their reputation(and $$$) on the line there is going to be bugs.

This argument applies to any hard problem, so it doesn't seem valid. Whether there's an important bug in a project depends on someone's skill and on how much time they've dedicated to it, and it's hard to know how skilled or dedicated someone is.


When the prior against any particular implementation being correct is so high, I think it's correct to not trust any new implementation without strong evidence that it is correct, even if one is not aware of any specific issues. Personally I wouldn't adopt a new lock-free structure implementation without at least one of established backing or a formal proof of correctness.


Yeah, the more I think about it the parallels with hand-rolled crypto are pretty strong, although if you get crypto wrong you don't just crash.


You hit it, but it's not a side point.

The entire problem of rolling your own security code is that you will never know if you messed it up.

For functional code this is only an issue with silently corrupted data.


This is less about "skill" and more about awareness of how the different CPUs are implemented and of where the algorithm stops behaving correctly in conjunction with the CPU spec.

In addition, the error class is a mean one: it doesn't happen often statistically and is difficult to reproduce, and as such can be very expensive to track down.


The specs are quite clear about memory fences. Just because something has a failure mode that's hard to detect doesn't mean that luck has anything to do with implementing it correctly. And if luck isn't a factor, then that leaves skill and dedication.


Specs/hw can have bugs too and he never said anything about luck.

I have no issue with hard problems but the accountability for concurrency issues is gnarly. I've had driver issues look like concurrency bugs and concurrency bugs look like driver issues. If you feel the need to take on concurrency you better have the schedule budget for it or be willing to throw it away.


Assuming I don't have access to CPU specs, how could I debug constructs like this queue? Should I set aside a machine and have it pound the queue with random data for the next six months, alerting me each time the error rate rises above cosmic ray threshold?


> a machine

In the absence of formal proof, think 50+ machines. You basically set up a lab to run 24/7 and A/B test with the hope that you repro.

Same technique works for really gnarly intermittent driver bugs.

It's still no guarantee but at least you get in the same range assuming the execution profile is diverse and robust.


How was the bug discovered? Sounds like a fun story.


What was the solution?


More fences :) (and the overhead that came with them)


Can you cite the name of the tech or any of the games?


Not true at all - the core atomic operations are very well understood by now and are supported on any modern CPU.


This is a little optimistic: being able to tractably reason about atomicity in weak memory models like C11's is still at the cutting edge of CS research. It is really hard to prove nontrivial data structures correct.


Could someone please explain for the uninitiated what lock-free actually means and why it matters.

After reading http://www.drdobbs.com/lock-free-data-structures/184401865 I got this:

* Normal locking means that the process which holds the lock can hold it arbitrarily long thereby locking out all other processes. Also a live-lock and dead-lock can occur if there is a conflict between processes which try to acquire the same set of resources but cannot acquire all of them at once.

* Wait-free means that no algorithm working with the data structure will be delayed arbitrarily. This is pretty strong. A simple example would be a ring-buffer with a single reader and writer.

* Lock-free means that no process can block the resource for longer than it takes to read/write it. There will always be at least one process that can make progress while the others may have to wait (weaker than wait-free, but stronger than ordinary locked).

Normal processes on most operating systems can be interrupted at any instruction. This would make it impossible to carry out a multiple-instruction sequence to lock-modify-unlock the data structure because it could leave the data structure locked. Does this in turn mean that there must be a "commit" instruction that is uninterruptible?


Lock-freedom and wait-freedom are formal terms in concurrent algorithm theory. Best to check out Herlihy and Shavit, "The Art of Multiprocessor Programming," which I have unfortunately temporarily misplaced.

Informally:

"Lock-free" means that at each time step, at least one thread that is interacting with a shared object makes progress towards completing the operation (e.g. an enqueue or dequeue operation). Neither deadlock nor livelock (where no thread makes progress) can happen (hence "lock free"). This does not guarantee fairness or starvation-freedom (a fast thread could in-theory DoS a slow thread).

"Wait-free" means that at each time step, every thread will make progress. This might e.g. involve an algorithm where fast threads help slow threads complete their operations.

Wait-freedom is indeed a stronger condition, but it's usually more expensive to implement (although not always).

> Does this in turn mean that there must be a "commit" instruction that is uninterruptible?

Yes, lock-free algorithms make use of atomic instructions such as CAS (compare-and-swap). Sometimes it's used as a "commit" but there might be multiple atomic operations depending on the algorithm and the data structure (so I guess, a kind-of multi-phase commit sequence). This is a nice intro paper by one of the giants of the field:

Maged M. Michael, "The Balancing Act of Choosing Nonblocking Features" ACM Queue vol. 11, no. 7 http://queue.acm.org/detail.cfm?id=2513575

Update: I should add that "lock free" hasn't always been a formally defined technical term, and some people use it informally to mean "doesn't use mutexes," or even "only uses atomic operations." Under such a relaxed definition a hand-coded spin-lock might be considered "lock free," but it really isn't -- if a thread holding the spinlock crashes, the spinlock would never be unlocked; such a situation could not arise with a formally lock-free algorithm.


It's interesting that many people consider lock-free algorithms to be appropriate for real-time programming, but by the formal definitions, lock-free doesn't guarantee that a particular thread won't be starved or that an operation would finish by a certain amount of time. Maybe in these cases, wait-free would be more appropriate...


> Maybe in these cases, wait-free would be more appropriate...

In some cases wait-free algorithms are used (e.g. real-time Java queues).

There's ongoing research into whether lock-free queues are wait-free in practice.[0] For example, under some reasonable scheduling assumptions, lock-free operations have been shown to have bounded time execution.[1] That's a result for uniprocessors. I'm not aware of a corresponding result for multiprocessors, but I haven't gone looking for a couple of years. There are some leads in the first citation.

[0] "Are Lock-Free Algorithms Practically Wait-Free?" http://tce.technion.ac.il/wp-content/uploads/sites/8/2015/06...

[1] Anderson, J. H. et al. 1997. “Real-Time Computing with LockFree Shared Objects.” ACM Transactions on Computer Systems. 15(2):134–165 https://cs.unc.edu/~anderson/papers/tocs97.pdf


> Does this in turn mean that there must be a "commit" instruction that is uninterruptible?

Yes, in a way.

Some CPUs provide atomic instructions that perform multiple operations in a bounded time. These instructions cannot be interrupted (by other threads, the OS, interrupts) and appear all-or-nothing (i.e. the intermediate effects are not visible). The most common such instruction is Compare And Swap (aka CAS); CAS is often used as the 'commit' action in lock-free algorithms.

Some other architectures instead have a more general 'transactional' feature (known as Load Linked/Store Conditional, or LL/SC) which provides for very limited transactions of a single cacheline. A naive implementation could live-lock, but in practice server-class architectures provide stronger guarantees.

It can be proven that LL/SC and CAS are equally powerful (i.e. the same set of lock-free/wait-free algorithms can be implemented with both).

LL/SC is more natural for RISC machines, while CAS is common in CISC, but there are plenty of counterexamples.
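
The canonical shape of that 'commit' is a lock-free stack push, where the CAS is the single atomic commit point (a sketch; pop is trickier because of ABA):

    #include <atomic>

    struct Node { int value; Node* next; };
    std::atomic<Node*> top{nullptr};

    void push(Node* n) {
        n->next = top.load(std::memory_order_relaxed);
        // Commit: succeeds only if top is still n->next, i.e. nobody
        // pushed in between. On failure, compare_exchange_weak reloads
        // n->next with the current top and we simply retry.
        while (!top.compare_exchange_weak(n->next, n,
                                          std::memory_order_release,
                                          std::memory_order_relaxed))
            ;
    }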


From the documentation, it doesn't seem clear that there's any guarantee that a particular queued item will ever eventually be dequeued (in a program that runs forever).

Consider the case where each producer thread queues N items, and then waits until at least one of its N items is dequeued before immediately topping back up; while the consuming thread dequeues at a slower rate than the producers are able to produce. Maybe no item from producer number 1 ever gets dequeued? Or did I miss something in the documentation?


The documentation says it's not serializable or linearizable in some cases, nor does it have preemption or fairness guarantees, so you are correct.


If you like this, you may also like http://concurrencykit.org/ .



Here is one more:

https://github.com/pzhivkov/SPConcurrency

(Mostly targeting iOS / macOS).


Another useful lockless library - http://liblfds.org

It's written in C though.


For what it's worth, the base algorithm is extremely similar in concept to an Erlang queue implementation I saw recently [1]. But I do like doing these language implementation comparisons. The Erlang one definitely suffers from the base language being copy-on-write, though.

[1] - https://github.com/dstar4138/libemp/tree/develop/src/buffers...


I'd like to raise my hat to some absolutely fabulous documentation. That's one of the best READMEs I've seen in a very long time.


The README is really good for users of the library. I was looking for a description of the algorithm, though, and I couldn't find any. Does anyone know what algorithm this library implements? (e.g. a literature reference would be helpful). I'm familiar with a couple of PQ implementations based on skip-lists: Sundell & Tsigas and Linden & Jonsson--but this library doesn't seem to be based on any of them.


I've used this and relied on it, it works very well.


I have also used this for queuing audio buffers in my game. It works great and the author was nice enough to do a quick adjustment for me. How I used it: http://www.catnapgames.com/2016/01/19/new-sound-mixer/


Facebook's folly library contains a general-purpose, multi-producer multi-consumer queue [0]. It's linearizable, and it's used extensively inside Facebook.

Disclaimer: I work at Facebook.

[0] https://github.com/facebook/folly/blob/master/folly/MPMCQueu...


I found this earlier this year and was using it for a little toy game project to communicate between the rendering thread and everything else. It's super easy to use :)

I didn't develop the project to the point where I could comment on its performance, though, and most of my processing was happening in shaders anyway.


OK, I have also been looking into concurrent vectors lately, to get something like this running between two threads.

Does someone know of, or can point to, a header-only implementation of a vector/queue with locks and really simple code with an explanation? Something written as beautifully as Herb Sutter's or Scott Meyers' examples; those are easy to play around with and understand :)

*An implementation without a vendor concurrency library. Looking for an implementation on Linux/ARM, although I do have access to libpthread threads and locks.


A really simple templated queue that uses std::mutex.

https://gist.github.com/AlexanderDzhoganov/cffa9aaae5a325cdf...
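
The whole idea fits in a few lines; a minimal sketch along the same lines (not the gist's exact code):

    #include <condition_variable>
    #include <mutex>
    #include <queue>

    // Minimal locked queue: every operation takes the one mutex; pop
    // blocks until an item arrives. Simple and correct, but all threads
    // contend on a single lock.
    template <typename T>
    class LockedQueue {
        std::queue<T> q_;
        std::mutex m_;
        std::condition_variable cv_;
    public:
        void push(T v) {
            { std::lock_guard<std::mutex> g(m_); q_.push(std::move(v)); }
            cv_.notify_one();
        }
        T pop() {
            std::unique_lock<std::mutex> g(m_);
            cv_.wait(g, [this] { return !q_.empty(); });
            T v = std::move(q_.front());
            q_.pop();
            return v;
        }
    };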


Thanks, I will give it a try.


Can't wait to see the aphyr teardown.


I thought that was for distributed stuff?



