Maybe someone can educate me. I've studied them for a while, but I don't get the fascination with lock- or wait-free queues.
If you actually care about real-time guarantees enough to look into lock-free solutions, you already ought to have non-preemptive workers (i.e. cooperative multitasking). This then allows you to have only one queue per CPU (since exclusive access is guaranteed), simplifying the problem.
With one queue per CPU, multi-producer becomes trivial (atomic doorbell + round robin), and multi-consumer is easy (CAS to acquire a work item). You don't have fairness guarantees for consumers, but who cares? Consumers can only be unfair when they aren't saturated, and in that case the unfairness is harmless (arguably a good thing).
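For concreteness, here's roughly what I mean by "CAS to acquire a work item", as a C11 sketch (the fixed-size slot array and all the names are just illustrative, not from any real system):

    #include <stdatomic.h>
    #include <stddef.h>

    struct work { void (*fn)(void *); void *arg; };

    /* One CPU's queue: a fixed array of item slots (illustrative). */
    static _Atomic(struct work *) slots[64];

    struct work *try_acquire(void) {
        for (size_t i = 0; i < 64; i++) {
            struct work *w = atomic_load(&slots[i]);
            /* CAS the slot to NULL: whichever consumer wins owns it. */
            if (w && atomic_compare_exchange_strong(&slots[i], &w, NULL))
                return w;
        }
        return NULL;  /* nothing available right now */
    }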
Application stalls don't happen (one non-preemptible queue per CPU), application crashes are managed at a higher level, and IRQs can be disabled for the 5 cycles you spend in the critical sections.
I assume I'm missing something, since I'm somewhat new at this, and lock-freedom is clearly in vogue, but I don't know what it is.
Your architecture proposal seems sound assuming that you have that much control over the whole system, and you don't need to integrate with any legacy components. In the common case, you need to deal with limited control (e.g. in a userspace app) and you have legacy components (e.g. a consumer OS, a Java VM, etc).
If you're doing best-effort soft-real-time on a consumer OS then you probably don't have control over thread affinity to _guarantee_ one thread per CPU, and you certainly can't disable IRQs.
Even on servers, you can get Linux to set up your thread-per-CPU thing, but still no IRQ masking. Also, you'd need your whole stack to do cooperative multitasking in a fully composable way before you could do as you propose. Maybe that's plausible in Go, but if you're in Java (as a lot of the HFT people seem to be, for reasons I don't understand) then probably only _some_ of your system is structured as one-thread-per-CPU cooperative multitasking, and you may still need to offload high-latency work using lock-free queues.
In my domain (real-time audio on desktop OSes), neither Windows nor macOS has working deterministic priority inversion mitigation. So using lock-free queues between priority domains (e.g. between the GUI thread and the real-time audio thread) avoids priority inversion risk (compared to using mutexes). Granted, this is a rather specialised use-case.
Another reason people are interested in these things is that there's a perception that lock-free algorithms perform better for inter-thread queuing than the alternatives. In particular, there are claims that they scale better when there are a large number of producer and consumer threads/cores. I think this is the most controversial area, since the queue algorithms may scale better than e.g. a single mutex on a single queue, but it's not necessarily the case that an application architecture that uses an MPMC queue is the best fit for purpose.
> If you're doing best-effort soft-real-time on a consumer OS then you probably don't have control over thread affinity to _guarantee_ one thread per CPU
Which consumer OS? FreeBSD, OpenBSD, Linux, and Windows all offer the functionality. To my knowledge macOS is the only consumer OS that doesn't.
> and you certainly can't disable IRQs
This has been in Linux for over a decade. You can modify IRQ affinity from the root shell via the proc-fs [1].
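For example (my own sketch, not from [1]): steering a hypothetical IRQ 24 to CPU 2 by writing a bitmask to its procfs entry, which needs root:

    #include <stdio.h>

    int main(void) {
        /* IRQ number 24 is made up for illustration; needs root. */
        FILE *f = fopen("/proc/irq/24/smp_affinity", "w");
        if (!f) { perror("fopen"); return 1; }
        fprintf(f, "4\n");  /* hex CPU bitmask: bit 2 => CPU 2 */
        fclose(f);
        return 0;
    }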
Hmm. Interesting. I would have never thought to use that within a program during runtime.
I don't think that the IRQ affinity facility is meant to be used like that ("that" being the masking of interrupts for a given CPU to create critical regions within a program).
For starters, it seems like you'll need to be running as root to be able to change it. That could be an issue for some applications. It also says that you can’t disable interrupts for all cores.
> I would have never thought to use that within a program during runtime.
Doing it during runtime is unlikely. It does have clear benefits, though, especially when dealing with 10GbE NICs: when N cores handle 2 NICs, cache thrashing can bottleneck your maximum bandwidth, while dedicating NIC -> core -> IRQ 1:1:1 solves this.
Generally, setting up IRQ masking should be part of your pre-startup configuration, not something you set dynamically at run-time.
> It also says that you can’t disable interrupts for all cores
Yes/No.
One can disable all interrupts for a core via separate mechanisms. But this means you need to implement your own scheduler on that core, and remove the core from the kernel's scheduler. You also lose the ability to make syscalls, and lose kernel-level mutexes, futexes, IO, etc. It gets very complicated very fast. Having some interrupts is generally a good thing.
Lastly, you'll still be subject to memory bus stalls (they aren't called interrupts, but they're non-trivial stalls) that maintain cache coherency and read/write ordering (on AMD64 and ARMv8). You can never opt out of these.
If you want to disable all interrupts on all cores why are you even running an OS?
It also seems very non-portable.
AMD64 Linux is literally the most used OS in data centers, but you are nonetheless correct.
As for disabling interrupts on all cores, the specific case which I was thinking of was when you’re running a preemptive OS and you have a bunch of threads (== num CPUs) which all need to temporarily disable interrupts to create a critical region at the same time. I could be misunderstanding the original goal though. My understanding of the parent comment was that disabling IRQs would be used to prevent a thread from being scheduled out in the middle of an operation (like a kernel spinlock on a UP system), so that you can avoid the lock even on a non-cooperative multitasking system. I guess by temporarily disabling IRQs, the system becomes temporarily cooperative.
> My understanding of the parent comment was that disabling IRQs would be used to prevent a thread from being scheduled out in the middle of an operation
1. The scheduler already respects this. It won't suspend during the body of a function. Only at entry/exit.
2. `chrt --sched-deadline X` can ensure your thread will run for `X` ns without interruption (higher priority than all other tasks), and ensures you always get the same time budget consistently. [1] [2] [3]
3. The original goal of decoding audio can mostly be handled with DEADLINE + setting CPU affinity + masking interrupts on that CPU. Modifying IRQ masks dynamically hits a lot of internal locking and hurts your cache coherency; any gains in one process will come at a massive cost to the entire system.
4. This really seems like overkill. Deadline scheduling will do 99% of what you/parent seem to need/want: basically your process can't be interrupted during its runtime budget (in nanoseconds), and will run every period (in nanoseconds). I suggest trying that first; a minimal sketch follows the footnote below.
[3] Consistently means as close to absolutely perfect as possible. Your OS is a dynamic system so it'll be +/- a few nanoseconds. Generally the delta is very small. Caches stall, red-black trees are re-balanced, work is stolen, etc.
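Roughly like this (my own minimal sketch, Linux >= 3.14, needs root or CAP_SYS_NICE; the 1 ms runtime / 10 ms period budget is made up). There's no glibc wrapper for sched_setattr(), so it goes through syscall():

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef SCHED_DEADLINE
    #define SCHED_DEADLINE 6             /* value from <linux/sched.h> */
    #endif

    struct sched_attr {                  /* UAPI layout (v3.14 subset) */
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;          /* ns of CPU time per period  */
        uint64_t sched_deadline;         /* ns: runtime must fit here  */
        uint64_t sched_period;           /* ns between activations     */
    };

    int main(void) {
        struct sched_attr attr = {
            .size           = sizeof(attr),
            .sched_policy   = SCHED_DEADLINE,
            .sched_runtime  =  1 * 1000 * 1000,   /*  1 ms */
            .sched_deadline = 10 * 1000 * 1000,   /* 10 ms */
            .sched_period   = 10 * 1000 * 1000,   /* 10 ms */
        };
        if (syscall(SYS_sched_setattr, 0 /* self */, &attr, 0) != 0) {
            perror("sched_setattr");
            return 1;
        }
        /* ... periodic real-time work loop here ... */
        return 0;
    }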
If you can make your system lock-free, it will have a bunch of nice properties:
- deadlock-free
- obstruction-free (one thread getting scheduled out in the middle of a critical section doesn't block the whole system)
- OS-independent, so the same code can run in kernel space or user space, regardless of OS or lack thereof (basically any lock besides a spin-lock needs OS support)
These properties are especially nice if you are writing a library that you want to be as unassuming and unimposing as possible about the surrounding environment. Given two libraries, one that depends on locks and one that is lock-free, I'll go for the lock-free one if at all possible.
Are you sure you get rid of deadlocks? Instead of taking a lock, it loops on an atomic variable when there is contention. Granted, you have more control over task priorities and it can be more efficient, if you are smart enough (it still gets very tricky), but in essence you are rolling your own locks with atomic variables.
There's a slight difference between rolling your own spinlocks and simply doing something like a CAS loop to change a variable. They’re usually both built using similar atomic primitives, but…
The difference is that in a true lockless algorithm, no thread "holds" the lock (because there is no lock). Therefore, if a thread gets scheduled out while in the middle of any operation, every other running thread can still continue working. That’s in contrast to a locking implementation where it’s very possible for a thread to get swapped out while holding a lock, thus causing every other thread to spin endlessly until the scheduler ends up swapping the original thread back in so it can release the lock.
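To make the contrast concrete, here's a minimal C11 sketch (mine, not from anyone's real code). Both loop on an atomic, but only the spinlock has an owner whose preemption blocks everyone else:

    #include <stdatomic.h>
    #include <stddef.h>

    /* Hand-rolled spinlock: if the owner is scheduled out between
     * spin_lock() and spin_unlock(), all other threads spin uselessly. */
    static atomic_flag lock = ATOMIC_FLAG_INIT;
    void spin_lock(void)   { while (atomic_flag_test_and_set(&lock)) ; }
    void spin_unlock(void) { atomic_flag_clear(&lock); }

    /* Lock-free stack push (Treiber style): a thread preempted
     * mid-loop holds nothing; any other thread's CAS can still win. */
    struct node { struct node *next; int value; };
    static _Atomic(struct node *) top = NULL;

    void push(struct node *n) {
        struct node *old = atomic_load(&top);
        do {
            n->next = old;  /* 'old' is refreshed on each CAS failure */
        } while (!atomic_compare_exchange_weak(&top, &old, n));
    }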
This is one key advantage of using the OS supplied mutex facility rather than rolling your own spinlocks. The OS mutex has the ability to know that if a thread is blocking on the mutex and the thread which holds it is currently not running, it might as well start running the original thread again so that it can release the lock (as nothing else can get done until that happens anyway).
I believe this is why (in my personal experience) userspace spinlocks perform better than mutexes up until about the point where you have many more threads than CPUs. My guess is that the probability of a thread getting scheduled out while holding the lock increases with both the number of threads and the overall contention for the lock.
There is the lock-free circular buffer, where the buffer is linear and the indexes are atomic variables. Once the buffer is full, the indexes act like locks: the producer can't insert anything unless the consumer removes entries, so the 'locks' play a crucial role in maintaining the data structure.
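A minimal C11 sketch of that shape (my illustration): the indexes grow monotonically and wrap via masking, and a full buffer makes push() fail until the consumer catches up:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define CAP 1024                     /* must be a power of two */

    static int buf[CAP];
    static atomic_size_t head = 0;       /* advanced by the consumer */
    static atomic_size_t tail = 0;       /* advanced by the producer */

    bool push(int v) {                   /* single producer only */
        size_t t = atomic_load_explicit(&tail, memory_order_relaxed);
        size_t h = atomic_load_explicit(&head, memory_order_acquire);
        if (t - h == CAP) return false;  /* full: "locked" on consumer */
        buf[t & (CAP - 1)] = v;
        atomic_store_explicit(&tail, t + 1, memory_order_release);
        return true;
    }

    bool pop(int *v) {                   /* single consumer only */
        size_t h = atomic_load_explicit(&head, memory_order_relaxed);
        size_t t = atomic_load_explicit(&tail, memory_order_acquire);
        if (t == h) return false;        /* empty */
        *v = buf[h & (CAP - 1)];
        atomic_store_explicit(&head, h + 1, memory_order_release);
        return true;
    }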
I implemented an IPC msg Q for a router company 15 years ago.
Passing msgs between different threads/processes is not "rolling your own locks with atomic variables".
Basically, each thread/process that needs to receive msgs will create a MsgQ with msgQLen and get back a Q_id.
The msgQ_Send() API uses atomic_add() to claim the next slot in the msgQ, where it places the value and/or MsgPtr.
There is no lock needed, only atomic_add().
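I'd guess the core looked something like this (my reconstruction, not the actual code; it assumes senders never lap the receiver): atomic_add hands each sender a unique slot, and a per-slot flag publishes the slot to the receiver:

    #include <stdatomic.h>

    #define QLEN 256   /* msgQLen; must be sized so senders never lap */

    struct slot { void *msg_ptr; atomic_int ready; };
    static struct slot q[QLEN];
    static atomic_uint next = 0;

    void msgQ_Send(void *msg) {
        unsigned i = atomic_fetch_add(&next, 1) % QLEN;  /* claim slot */
        q[i].msg_ptr = msg;
        /* release: the receiver's acquire-load of 'ready' sees msg_ptr */
        atomic_store_explicit(&q[i].ready, 1, memory_order_release);
    }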
Performance-wise, it works very well in SMP environments, and I ported the library to VxWorks, the Linux kernel, user space, and QNX.
I have tons of validation tests on the correctness of the message order and content, and I constantly run them on multiple 4-core Xeon machines to check correctness and benchmark performance.
Yes, I understand the guarantees lock-freedom gives. My point is that I don't see how the same benefits aren't achievable with a simpler mechanism merely given a CPU that doesn't go out to lunch.
I suppose environmental portability is a good enough reason, even if I find it silly that high-performance multicore apps are forced into such a scenario.
> If you actually care about real-time guarantees enough to look into lock-free solutions,
Lock-free does not give any real-time guarantee for any specific task (and neither does wait-free, a much stronger and harder-to-achieve constraint).
The reason I use it when I do, is that it guarantees that a problem in one thread/worker/process, e.g. a stall due to a database commit, does not make others stall as well.
> (CAS to acquire a work item)
Most likely, you've just described a lock-free implementation.
> I assume I'm missing something, since I'm somewhat new at this, and lock-freedom is clearly in vogue, but I don't know what it is.
Lock freedom is the property that, at any instant, at least one thread/worker/process is making some progress independently of the others. It is a useful definition because it doesn't care whether locks are explicit (e.g. mutex, semaphore, critical section) or implicit (polling or spinning or whatever).
Go is a language exactly like that, but I still use lock-free algorithms. The reason is that even with a private per-thread data structure, you don't have atomic higher-level operations like enqueue. So you'd need to disable interrupts around the data structure methods, both for the OS and for the Go runtime. That will cost you more than a lock would, and AFAIK it's not possible from userspace on Linux (if it is, I would love to know how!).
> Maybe someone can educate me. I've studied them for a while, but I don't get the fascination with lock- or wait-free queues.
Let's say you're running a multi-threaded program. The threads (at some point) need to talk to each other. Perhaps to exchange data.
The simplest solution is to use a mutex. It's simple. Just take your existing single-threaded queue / list / tree code, and slap a mutex around it. Maybe 10 lines of code changed. It will work.
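Something like this, with pthreads (queue_push() is a stand-in for your existing non-thread-safe code):

    #include <pthread.h>

    struct queue;                          /* your existing type */
    void queue_push(struct queue *, int);  /* existing single-threaded code */

    static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;

    /* The entire "make it thread-safe" change: */
    void queue_push_locked(struct queue *q, int v) {
        pthread_mutex_lock(&q_lock);
        queue_push(q, v);                  /* internals untouched */
        pthread_mutex_unlock(&q_lock);
    }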
Then you do profiling... you discover that when 30 threads exchange tens of thousands of messages per second, mutex contention is a huge problem. You can easily have 30% of wall-clock time spent waiting for contended mutexes.
When you move to lock-free algorithms, you give up a little, and gain a lot. If you're willing to relax some requirements, even better. e.g. I care that the other thread gets a message... but it doesn't have to be now. It could be a millisecond in the future.
A lock-free queue means that you can just dump messages to another thread, and they'll show up in the other thread. (Eventually, when it gets around to looking at the queue).
And mutex contention goes way down. Especially because you're no longer using mutexes!
There can still be some contention on CAS in lock-free algorithms, especially for MPMC queues. The solution there (as is done in the linked project) is to move to single-producer single-consumer queues as much as possible.
The contention goes down to pretty much zero, and performance goes up.
I've spent a good chunk of the last few months re-designing the core of FreeRADIUS to avoid mutexes, shared memory, etc. The old code would get ~40K packets/s with multiple highly contended mutexes. The new test code is getting 3-4M packets/s with a lock-free queue / message passing system.
i.e. mutexes were killing performance in the old system. They're gone in the new system. And performance is substantially higher.
One lock-free queue is just simpler to work with than multiple queues. And the one queue nicely decouples the producers and the consumers, so they don't need to know each other's load and scheduling.
It also evens the load across the consumers regardless of individual work units. Otherwise, the producer needs more logic to figure out which queue to enqueue to, and accessing the consumers' statistics again becomes the problem of accessing shared data.
One benefit of wait-free algorithms is that they allow real-time threads to communicate with non-real-time threads without risking unbounded waits. Similarly, priority inversion is no longer an issue.
A real-time OS/app wouldn't benefit even with a 5-cycle critical section. Most, if not all, "true" real-time OSes require "real-time" data coming in from IRQs. Sure, there can be some complex synchronization if an IRQ line requires a certain number of cycles to read/write data, where some queue logic can happen.
Cooperative multitasking doesn't always mean a per-worker queue is feasible. Really, the only time I could see that being feasible is if the work queue is constantly saturated. Oftentimes you just don't know in advance which worker to queue which task with.