> The spin until success strategy seen above is employed in many lock-free algorithms and is called spinlock: a simple loop where the thread repeatedly tries to perform something until successful. It's a form of gentle lock where the thread is up and running — no sleep forced by the operating system, although no progress is made until the loop is over. Regular locks employed in mutexes or semaphores are way more expensive, as the suspend/wakeup cycle requires a lot of work under the hood.
A compare-and-swap loop is not a spinlock. (It is the primitive used to implement a spinlock, but that's different.)
Pure spinlocks are almost always a bad idea, at least in userland, because threads spinning trying to acquire the lock look "busy" to the scheduler, so the scheduler may run them instead of the thread that owns the lock. It does make sense to spin a limited number of times before going through the slow path to acquire a mutex, but most OS mutex implementations have that functionality built in, so there's usually no need to do that manually.
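To make the distinction concrete, here is a minimal sketch (not production code) of a spinlock built on top of the compare-and-swap primitive; the type and member names are made up for illustration:

    #include <atomic>

    struct SpinLock {
        std::atomic<bool> locked{false};

        void lock() {
            bool expected = false;
            // The CAS is the primitive; the spinlock is the loop wrapped around it.
            while (!locked.compare_exchange_weak(expected, true,
                                                 std::memory_order_acquire)) {
                expected = false;  // the failed CAS overwrote 'expected' with the current value
            }
        }

        void unlock() {
            locked.store(false, std::memory_order_release);
        }
    };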
True, but they might still be appropriate for very tiny critical sections (a handful of instructions) protected by finely distributed locks, where the probability of both contention and preemption is very low.
Also, some applications have the luxury of exclusively dedicating one thread per core.
> Also, some applications have the luxury of exclusively dedicating one thread per core.
If the latency requirements and CPU architecture allow, be nice and put the spinning CPU into a lower power state. That saves power and thermal headroom, which lets other cores doing useful work run at higher frequencies.
Someone wrote a comment and then quickly deleted it. Since I already typed a reply, I'll include it here anyways:
---
> Does this mean all spinlock operations are pinned to the low-power core?
That wouldn't make sense. Perhaps for example in a case a core is allocated to immediately operate on some data once the spinlock is released.
> If yes, how is this usually supported by OS/language runtimes?
That really depends as MONITOR and MWAIT require ring 0. That means executing those instructions in userland always causes an exception. Yeah, I develop kernel drivers... :-)
PAUSE [0] works at any ring level, so it works fine in usermode code. It's encoded just as "REP NOP".
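A userland spin-wait built on it is just a loop around a load (a sketch; the flag is a made-up example):

    #include <atomic>
    #include <immintrin.h>   // _mm_pause

    void spin_until_ready(const std::atomic<bool>& ready) {
        while (!ready.load(std::memory_order_acquire)) {
            _mm_pause();     // emits PAUSE (REP NOP): tells the CPU we're busy-waiting
        }
    }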
> And as soon as the spinlock operations succeed, the work will be moved out of the low-power core, which would involve the usual context switch costs. So I'm assuming this strategy would need a lot of workload-specific benchmarking before you decide to use it.
If you're doing this, you're actively avoiding context switches. That's the whole point in this kind of spinning. Kernel is not involved for usermode code.
Why would it make more sense to "spin a limited number of times before going through the slow path to acquire a mutex"? Does a mutex have more overhead in the short term than a spinlock but after a few cycles become more efficient?
Off topic, but I remember reading that the Linux kernel prefers spinlocks to mutexes. Is there a good technical reason for that?
> Does a mutex have more overhead in the short term than a spinlock but after a few cycles become more efficient?
Think about what a true mutex does. A true mutex switches into kernel mode (i.e. your program is no longer running; Linux is running). That means Spectre/Meltdown guards are executed (your TLB may be flushed, as well as various other memory guards to prevent Spectre from leaking data).
Once the guards are executed, kernel-mode has to find more work to do. Traversing the kernel-data structures can take 1 to 10 microseconds, depending on how cold the cache is. Finally, since another thread may be running (before your thread comes back), you probably lost all your data from L1 cache (and at minimum: your branch-predictor state because of Spectre).
A spinlock without any contention takes less than 10 nanoseconds to run, up to maybe 50 nanoseconds with a bit of contention (!!). You're basically reading/writing data in L1 cache, maybe L3 cache under contention.
However, a scheduler invocation will be on the order of 5000 nanoseconds (~5 microseconds) or so, due to all of the work that the scheduler has to do.
--------
Windows' default spin count is something on the order of 4000 cycles. If the wait is 4000 cycles (or less), a spinlock-like approach comes out ahead. (4000 cycles at 4 GHz == 1 microsecond.) Just to give you an idea of the orders of magnitude being discussed here.
Depends on the mutex implementation. Many mutexes will do their own spinning internally - for short critical sections, you can avoid sleeping and waking your thread, a (relatively) expensive operation. Obviously spinning for an unlimited amount of time is less efficient - in the case of a priority inversion, you'll effectively deadlock. [1]
There's also tools like [2], where you can tell a mutex to spin if the current lock holder is actively running (as opposed to blocked or preempted), to get the best of both worlds.
> Why would it make more sense to "spin a limited number of times before going through the slow path to acquire a mutex"?
Have you ever waited at a door for someone, and after a long enough time, sat down on the floor? It's basically the same logic.
First, the "slow path" for acquiring a true mutex involves calling into the kernel. This is relatively expensive. At first, you hope that your waiting time will be very small. So you optimistically just spin in user-space waiting for the lock to be released. But this spinning is expensive: you're consuming the CPU doing nothing productive. Eventually, there's a basic principle to consider: the longer you wait, the more likely you are to wait longer. After along enough wait, you assume you're going to wait even longer, so it's worth it to pay the cost of calling into the kernel so that someone else who has productive work to do can do that while you're waiting.
Kernels will use spinlocks in various places because the kernel implementors know that the critical sections will be short, or they know that scheduling another task over the current one would be problematic.
Yet another article discussing atomics without discussing the memory model or memory fences.
Atomics are the easy part to understand. Memory fences and memory ordering are the hard part that needs far more discussion. If you use atomics with the wrong memory fence, everything will break.
Memory fences are necessary to make the compiler, CPU, and caches commit the data to main memory in the correct order.
----------
Do NOT write atomics unless you understand fences. It is absolutely essential, even on x86, to put memory fences in the correct spot. Even though x86 automatically has acquire/release consistency... the compiler may still reorder your loads and stores. Even with "volatile" semantics, your data may be committed to memory in the wrong order.
Fedor Pikus has an amusing example where you can have a properly coded lock but the compiler may still mess you up: https://youtu.be/lVBvHbJsg5Y?t=811
----------
EDIT: if you use locks (spinlocks, mutex, condition variables, etc. etc.), the author of the library would have put memory fences in the correct place for you. Only if you use synchronization "manually" (and... you probably are doing that if you are writing atomics), do you have to think about memory fences.
> Topics like sequential consistency and memory barriers are critical pieces of the puzzle and can't be overlooked if you want to get the best out of your lock-free algorithms. I will cover them all in the next episode.
Not every article has to be an exhaustive description of everything you need to know.
Anywho, sequential consistency is the default for C++11 and C11 atomics, so you pretty much can use them without a complete understanding of memory consistency models.
Ah, you're right. They call them "barriers" instead of "fences", so I didn't pick up on that paragraph.
Still, C11 and C++ atomics aren't widely implemented yet in the GPU world. OpenCL 1.2 doesn't have them (and OpenCL 2.0 is not commonly implemented), CUDA doesn't have them yet, AMD ROCm doesn't have them yet.
And anyone with older compilers will probably be working with fence intrinsics, instead of "innate" barriers per atomic instruction.
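For example, the same release-style publication can be written with a per-operation ordering or with a standalone fence (a sketch; the data/ready names are made up):

    #include <atomic>

    int data;
    std::atomic<bool> ready{false};

    void publish_with_operation_order() {
        data = 42;
        ready.store(true, std::memory_order_release);          // ordering attached to the store
    }

    void publish_with_explicit_fence() {
        data = 42;
        std::atomic_thread_fence(std::memory_order_release);   // standalone fence...
        ready.store(true, std::memory_order_relaxed);          // ...followed by a relaxed store
    }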
> Only if you use synchronization "manually" (and... you probably are doing that if you are writing atomics), do you have to think about memory fences.
I was thinking about adding atomic pointer references to a partially persistent (1) AABB tree, then allowing multiple actors in multiple threads each implementing a gameloop to query the AABB tree. Do I need to think about memory fences in that application?
I've also been thinking about implementing the above system without atomic pointers. Instead, the system would have an update phase where all writes to the AABB tree happen at once, then all of the actors are allowed to run concurrently for one game loop tick.
In both cases, I'm thinking of placing the AABB tree into shared memory, but my brief experience with using shared memory makes me wary of performance problems.
> I was thinking about adding atomic pointer references to a partially persistent (1) AABB tree, then allowing multiple actors in multiple threads each implementing a gameloop to query the AABB tree. Do I need to think about memory fences in that application?
Probably?
Let me give an attempt at what a memory-fence actually is, and why I think you may need to think about memory fences.
Atomics solve one PART of the synchronization problem. The second problem, which is far more complex, is that the compiler, CPU, and Memory / Cache system reorders your reads-and-writes in a non-obvious way. These reorderings can lead to obscure bugs.
These reorderings happen because of single-threaded optimization. Programmers expect the compiler, CPU, and L1 cache to reorder statements to go as fast as possible. Ever wonder what the -O2 and -O3 compiler flags do? They'll eliminate variables and change your code in subtle ways.
Even if you run at -O0 optimizations, these conceptual reorderings exist at the CPU (store buffer) and L1 cache levels. So you cannot escape this unfortunate, complex reality.
Therefore: it is the programmer's responsibility to define a "Happens-before" relationship through memory fences / memory barriers. A "happens-before" relationship will kill optimizations (be it a compiler-optimization, CPU store-buffer optimization, or L1 cache optimization) in just the right way to make your code correct.
------------
With that out of the way, let's go into your AABB tree question.
Assume "node" is a node of a binary AABB tree, with "node->left" and "node->right" referring to its two children. Let's assume both "left" and "right" are atomic<> pointers (std::atomic<AABB_Node*>).
Now think of the code:
    AABB_Node* tmp = (AABB_Node*)malloc(sizeof(AABB_Node)); // fresh, uninitialized memory
    node->left.store(tmp);    // This is a race condition: published before initialization!
    tmp->initialize();
The above code is wrong. You can't "publish" the data before it is initialized! "tmp" would contain garbage data (it's "fresh" from the malloc). Therefore, you need to write the following code:
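    AABB_Node* tmp = (AABB_Node*)malloc(sizeof(AABB_Node));
    tmp->initialize();        // initialize first...
    node->left.store(tmp);    // ...then publish (no explicit memory order yet)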
Okay, now you "think" you've solved the multithreaded issue, but you have NOT. Because the compiler says "tmp is an unnecessary variable" and optimizes it away. So the compiler may decide to emit:
Uh oh, the compiler "moved" the code around, which is logical in a single-threaded setting. But this reordering destroys your multithreaded logic. Your other threads may get the malloc() before the initialize.
------------
"Happens-before" is the key concept. What we want to say is:
Good news: modern C++11 allows you to specify a "happens before" relationship through memory fences.
    tmp->initialize();
    node->left.store(tmp, std::memory_order_release);
"Memory_order_release" is the minimum happens-before relationship we want. The C++11 standard states: "A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store."
The compiler will emit the low-level code needed to ensure the happens-before relationship holds, using lower-level memory fence mechanisms. GPUs will have to clear their L1 cache (due to incoherent memory), while CPUs will only have to wait for the store buffer to be flushed and MESI messages to be passed around (ARM has the "dmb ish" instruction, for example).
Regardless of the low-level details, the job of the programmer with memory fences remains the same. Specifying important "happens-before" relationships.
-------------
In the GPU world, where C++11 atomics do not exist yet, the happens-before relationship is created with a threadfence instruction. In CUDA, you would do:
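    tmp->initialize();
    __threadfence();      // device-wide fence: make the initialized data visible...
    node->left = tmp;     // ...before publishing the pointer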
__threadfence_block() is a fence that only works within a CUDA thread block (and is therefore far faster than __threadfence()). Effectively, you only wait for the L1 cache (while __threadfence() waits for an L2 cache commit).
-------------
I'll butcher the explanations a bit (real code doesn't work like this, but it may help you to understand things). EDIT: Oh gosh, I got these backwards again in my first draft. I'm pretty sure I put them in the correct order now.
An "acquire-read" operation basically compiles from:
Thanks. That was indeed helpful. A part of the problem is that I'm most familiar with barriers used in GC algorithms, and in sandboxing. So seeing the term "memory fence" was confusing. (I should've just read the Wikipedia page!)
"lock free" regards to being free from any process/thread/device holding a lock for an undetermined amount of time.
The formal definition IIRC is "a system is lock free, at least one concurrent process makes progress towards finishing in a unit of time" (There's a hard to achieve version called "wait free" which means "every process makes amortized progress towards finishing")
The important property of a lock free system is that pausing any thread/process does not stop the others from progressing; whereas with classical locks (mutex, semaphore, spin, whatever), if you pause a lock-holding process you risk starving the entire system.
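A trivial illustration of that property (a sketch, not anyone's production code): with a lock, a suspended holder blocks everyone else; with an atomic read-modify-write, each operation completes on its own:

    #include <atomic>
    #include <mutex>

    std::atomic<long> hits{0};
    long hits_locked = 0;
    std::mutex m;

    void record_lock_free() {
        hits.fetch_add(1, std::memory_order_relaxed);   // no thread can block another here
    }

    void record_with_lock() {
        std::lock_guard<std::mutex> g(m);   // if the holder is paused here, everyone else waits
        ++hits_locked;
    }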
> The important property of a lock free system is that pausing any thread/process does not stop the others from progressing; whereas with classical locks (mutex, semaphore, spin, whatever), if you pause a lock-holding process you risk starving the entire system.
With that being said, stopping any thread/process is rare in high-performance circumstances. In many cases, it is easier to avoid getting your thread stopped (as simple as spinning up one thread per logical core) than to switch to a lock-free algorithm.
Besides, your typical OS mutex (CriticalSection in Windows) handles the pause situation just fine. The OS will keep track of which threads are waiting on which locks, and restart the threads as resources become available. It's the "known problem", well studied by pretty much all programmers.
Lock-free data structures are exponentially more difficult to write, especially when you are handling "out of memory" edge cases. The general recommendation is to write "mostly lock free", but use a lock for corner-cases.
CAS-loop with pointer swapping gets you lock free in most cases, and is very easy to understand. The only problem is that it doesn't work in all cases or data-structures, so you inevitably end up using locks to finish your program.
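As a concrete example of that pattern, here is a sketch of a lock-free stack push (types are made up; pop is where the real trouble such as ABA shows up):

    #include <atomic>

    struct Node { int value; Node* next; };
    std::atomic<Node*> head{nullptr};

    void push(int v) {
        Node* n = new Node{v, head.load(std::memory_order_relaxed)};
        // On failure, compare_exchange_weak reloads the current head into n->next; just retry.
        while (!head.compare_exchange_weak(n->next, n,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
        }
    }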
EDIT: And it should be noted that locks can be faster. So if you're going for absolute performance, you should write both implementations and measure.
> With that being said, stopping any thread/process is rare in high-performance circumstances.
While that's true from a performance perspective, it is not true from a reliability perspective - it's always possible that the thread holding a lock has faulted, was swapped out, or otherwise had a bug (e.g. in a library you called beyond your control) that caused it to stall. Also, if more than one lock is involved, a deadlock is possible as well (not all of which can be detected by an OS, and most OSes suck at detecting deadlocks if they even try).
> The OS will keep track of which threads are waiting on which locks, and restart the threads as resources become available.
But causes for a pause are not necessarily under your control or the OSes privilege; You can take a critical section, write to a local file .... and wait 20 seconds because the drive has gone to sleep, someone competing for the disk, swapping in/out, a recoverable disk error, or the fact that it is not actually local but rather virtualized SAN across the internet.
Also, someone might have paused a thread from the debug API for whatever reason (or kill -STOP on Linux), which might be expected to only pause that thread -- but in a system that isn't lock free, that could pause anything and everything, in a non-reproducible way.
> Lock-free data structures are exponentially more difficult to write
For sure. Personally, I find that they keep me on my toes "the right way", that is - they force me to consider what and where actually needs synchronization and can be contended (whereas a lock allows me to be lazy. which is sometimes good).
> The only problem is that it doesn't work in all cases or data-structures,
Yes. ABA problems are not solved by CAS, which is why I personally prefer LL/SC, but ... whatever, CAS is harder but still doable.
> So if you're going for absolute performance, you should write both implementations and measure.
Upon a quick review of history, my use of lock free structures and algorithms is dominated by reliability requirements ("nothing must unexpectedly pause the system" - mostly seqlocks), and only a few cases for performance - mostly shared lists and shared (often reference) counters.
> But causes for a pause are not necessarily under your control or the OSes privilege; You can take a critical section, write to a local file .... and wait 20 seconds because the drive has gone to sleep, someone competing for the disk, swapping in/out, a recoverable disk error, or the fact that it is not actually local but rather virtualized SAN across the internet.
If the write to the local file needs to be protected by a critical section (seems unlikely to me, but... I can imagine some situations like that), the alternative is performing the write inside of a CAS loop instead.
When the CAS loop fails (because it's been 20 seconds and the value is now grossly different from the one you were comparing against), you have to perform a 2nd I/O operation and restart the entire computation.
What is more performant? Locking your other threads? Or potentially doing extremely expensive work repeatedly?
Locking your other threads (especially if there are some other threads that have work to do) could very much be the more performant answer. In fact, I would argue that the locking implementation is more likely to be work efficient.
> Upon a quick review of history, my use of lock free structures and algorithms is dominated by reliability requirements ("nothing must unexpectedly pause the system" - mostly seqlocks), and only a few cases for performance - mostly shared lists and shared (often reference) counters.
I think that is reasonable and typical. I think beginners think of "lock free" as a potential improvement to efficiency. But the advantages of lock free are entirely independent of performance; in most cases it has more to do with requirements than with raw speed.
Not to take anything away from the parent's thoughtful comment. That important property is, in fact, the definition of lock-free. Lock-free and wait-free (which has a stronger guarantee) algorithms are sub-classifications of non-blocking algorithms.
If I could recommend one book on this topic it'd be "The Art of Multiprocessor Programming". The authors have contributed to this subject through original research and make the topic quite approachable.
> Not to take anything away from the parent's thoughtful comment. That important property is, in fact, the definition of lock-free.
According to Wikipedia, it's the definition of non-blocking, not the definition of lock-free... which I guess is a correction for both the parent comment and yours. [1]
As far as I recall ever seeing, people always talk about it in terms of guaranteed systemwide progress, which I don't find as enlightening as what happens when you pause a thread.
I was about to say - purely atomic operations are neither guaranteed to be limited to a single operation, nor are they guaranteed to be free of hardware locks. Yes, such a lock is "faster" than a kernel-level locking object (mutex), but you can have unexpected side effects. On the AMD Jaguar architecture (so on PS4 and XB1) you need to be careful because the cores are split into two blocks, and cores in one block don't really know what the other block is doing - so issuing an atomic add, for instance, has the unexpected downside of locking the entire CPU while the caches are synchronized, so that you don't end up with two separate cores running on different blocks incrementing the same variable.
I think thinking about write barriers as locks is slightly misleading. Rather, a barrier brings the machine back into a state that looks like what a program executing from top to bottom would produce: results of speculative execution are discarded, and the differing views of main memory held by different processor caches are made consistent.
If you think context switches are expensive, then this is also expensive, but in a different way. Nothing is ever locked, but certainly the computer slows down from its maximum performance potential. (But there are many similar effects that happen even without atomic operations, like mis-predicting a branch.)
My experience in using both is that sometimes profiling dictates that a lock is too expensive. I used to maintain a program that incremented a counter from hundreds of threads. Using an OS-level mutex to achieve this resulted in way too many context-switches, to the point where we were only using about 30% of the CPU to run our program and the rest of the CPU time was spent on OS-level housekeeping. Using atomic load/store was much faster. Enqueuing all the writes in a thread-local counter and flushing them to the global structure in an OS lock was the fastest, but only provided eventual consistency.
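A sketch of that last approach (the flush threshold and names are made up; the trade-off is that the global total lags behind):

    #include <atomic>

    std::atomic<long> global_count{0};
    thread_local long local_count = 0;

    void increment() {
        // Hot path touches only thread-local state: no contention, no kernel.
        if (++local_count == 1024) {   // arbitrary flush threshold
            global_count.fetch_add(local_count, std::memory_order_relaxed);
            local_count = 0;           // global total is only eventually consistent
        }
    }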
At the hardware level, on x86 for example, some of these atomics translate into instructions with the LOCK prefix, which locks the bus. That's different from the usual "lock" concept in concurrent programming.
The correct term to use here is blocking. The hardware bus locks finish in a bounded amount of time and allow the other processors to eventually make progress.
That is a good distinction, but I'm not sure if it's always true. A peripheral device on a shared bus could still hold the lock for an undetermined time.
Yes. Even better than that, atomic instructions are usually completely local to a core. I think that the only interaction with the coherency protocol is that a core is guaranteed to be able to hold a cache line in exclusive mode long enough to execute an RMW (and even that is not really required, but it is useful to guarantee forward progress).
Since NVLink2 and POWER9, even a GPU can issue atomics over the bus, which will be executed local to the CPU that owns this cacheline.
This is very useful in high-contention write-heavy workloads, like atomic counters or accumulators.
Yes, and the cache hierarchy ultimately depends on the memory bus. I suppose this bus, which may be shared with many other devices, doesn't always have bounded-time guarantee.
Even to main memory there is not necessarily a single memory bus. Intracore or even intrasocket synchronization need not (and usually doesn't) go through main memory anyway.
True, but some atomic instructions may need to access main memory to complete their operation. Whether shortcuts can be taken in most cases is not relevant for worst-case considerations.
They may need to access main memory, but the RMW operation doesn't happen over the memory bus. The processor appropriates the cache line just like any other memory access, and then operates atomically on the cache line.
The cache coherency protocol takes care of that. In other words the first part is just a memory load and can vary from 0 to a few hundred clock cycles, the second is local to the processor and has a more or less fixed cost. The worst-case execution time is completely dominated by the first part, the best case instead is dominated by the second.
It is different. The lock prefix in x86 is only relevant in a multiprocessing (i.e. parallel) environment. But the general concept of locking is applicable to concurrent environments as well. For example, when there are multiple processes on a uniprocessor machine, you still need to use locks (mutexes) which are implemented using, say, cmpxchg8b (atomically compare and exchange eight bytes), but in the absence of a second processor there's no need for the lock prefix. Here, at the machine level, there's never really any locking, but at the application level you are using locks.
x86 uses MESI-derivative, directory-based cache coherency protocols at the cache-line level (when synchronizing across sockets); bus locking was probably a 486/Pentium 1-era thing, when bus snooping was used.
I read about one recently where an evil user program was able to bring a 28 core Xeon to its knees. Apparently a LOCK INC on an unaligned value that hangs over a page boundary forces this super-advanced CPU to lock everything in every cache as if it was a 386.
Back in the day, I wanted to implement a cache in our software product that used reference counting to update objects and when the object went to 0, free it from the cache. Pretty straightforward, no rocket science, but this was also cross-platform including Solaris, HP-UX, etc.
This was easy to do on Windows/Intel because they implemented a native CAS at the time (almost 20 years ago now), but the other systems didn't. The only way would be to add a mutex per object, which was a non-starter because it used real kernel memory and resources, so I had to abandon the feature entirely.
The above is truly lock-free because it doesn't require any additional resources that have to be tracked or freed.
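For reference, a minimal sketch of that kind of atomic reference counting in modern C++ (the CacheEntry type is made up):

    #include <atomic>

    struct CacheEntry {
        std::atomic<int> refcount{1};
        // ... cached object payload ...
    };

    void release(CacheEntry* e) {
        // fetch_sub returns the previous value; whoever drops it to zero frees the entry.
        if (e->refcount.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete e;
        }
    }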
I found it surprising that atomic operations in Java are converted into locking when compiling for some versions of Android. The guarantees of Java (before 9) on ordering are slightly too strong and result in locks being used under the hood.
Interesting. This is probably because of hardware limitations on some older Android platforms.
The java.util.concurrent package was added in Java 5 and before that existed as a third-party library (by Doug Lea). It contains a lot of concurrency primitives and makes use of a lot of techniques, including lock-free instructions and optimistic locking: https://en.wikipedia.org/wiki/Java_concurrency
That would open the door to thread-starvation, right? Unlikely in practice, but the point being that it does affect the formal correctness of the algorithm.
I imagine it also impacts real-world performance, but lock-free algorithms aren't assured to be faster anyway.
Annoyingly I recall reading a good article on exactly this question - the importance of hardware atomics vs locking - but can't recall what it was. These locks still aren't as bad as ad-hoc locking, in terms of deadlock risk, as you're guaranteed that they are 'leaf locks', with no order-of-acquisition hierarchies/orderings to worry about.
Based on some experience on non-X86 systems, lock-free is where you start finding bugs in compilers and runtime libraries, and also hardware.
How much do you trust random ARM-based SoCs to get this right at all of the necessary levels of cache consistency and memory access queues? Lots? Great, I am happy for you. Now, extend your confidence to some other chips, like PowerPC (various versions), MIPS and maybe a couple of others.
Eventually you are going to hit some very odd bugs, the really difficult bugs that will make you tear your hair out, and eventually that lock-free stuff, too.
"But we'll never run our stuff on a MIPS, or a Fthaghn-V1000." That's good . . . but the chipset you trust might get an update with slightly different memory behavior and you'll be sunk.
The projects I've been on that have shipped lock-free structures have done so only when using them was (a) high value, (b) there was a way to choose an alternate method (usually at compile time), and (c) the use was very limited (e.g., just a handful of critical places).
> It is also possible to move stores from before an Acquire load or read-modify-write operation to after it, and move non-Acquire loads from before an Acquire operation to after it.
> If multi-threaded code is too easy for you, try multi-threaded code which is executed in a different order than written!
Most multi-threaded code works this way. Even safe languages like Java permit this. I believe the exceptions tend to be interpreted languages like Python, using green threads.
At the assembly level, most modern CPUs are permitted to perform out-of-order execution. I believe the exceptions are pretty rare these days. The custom PowerPC chip in the Xbox 360 guaranteed in-order execution, for instance [0]. GPUs are a different beast.
The difference is that reordering is not observable with just one thread. Both language optimizations and CPU OOO don't change semantics within a single thread. If you use higher-level primitives like mutexes, they come with fences that also preserve that illusion.
It does become observable when you try to do synchronization yourself with atomics, without setting appropriate ordering requirements. Lock-free algorithms are usually tricky, and implementing them with minimum ordering requirements makes them even more puzzling.