Pro tip: if you really do know that contention is unlikely, and uncontended acquisition is super important, then it's theoretically impossible to do better than a spinlock.
Reason: locks that have the ability to put the thread to sleep on a queue must do compare-and-swap (or at least an atomic RMW) on `unlock`. But spinlocks can get away with just doing a store-release (or just a store with a compiler fence on X86) to `unlock`.
Spinlocks also have excellent throughput under most contention scenarios, though at the cost of power and being unkind to other apps on the system. If you want your spinlock to be hella fast on contention just make sure you `sched_yield` before each retry (or `SwitchToThread` on Windows, and on Darwin you can do a bit better with `thread_switch(MACH_PORT_NULL, SWITCH_OPTION_DEPRESS, 1)`).
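For illustration, a minimal sketch of that distinction with C11 atomics (not any particular library's implementation): the lock path needs an atomic RMW, but the unlock path is just a store-release.

#include <stdatomic.h>

static void spin_lock(atomic_int *lock)
{
    /* acquiring needs an atomic RMW */
    while (atomic_exchange_explicit(lock, 1, memory_order_acquire))
        ;  /* spin; see the rest of the thread for how to spin politely */
}

static void spin_unlock(atomic_int *lock)
{
    /* releasing is just a store-release: on x86 this compiles to an
       ordinary store plus a compiler barrier, no locked instruction */
    atomic_store_explicit(lock, 0, memory_order_release);
}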
> just make sure you `sched_yield` before each retry
Assuming `sched_yield` does something.
There's a futex congestion problem inside Wine's memory allocator. There are several levels of locks. If you're growing a buffer, in the sense of C's "realloc", and no buffer is available, memory allocation is locked during the allocation of a bigger buffer, copying of the contents, and release of the old buffer. "Push" type operations can force this. Two orders of magnitude performance drops ensue when multi-threaded programs are contending for that lock.[1]
Inside one of the lock loops is a call to "YieldProcessor".
static void spin_lock( LONG *lock )
{
    while (InterlockedCompareExchange( lock, -1, 0 ))
        YieldProcessor();
}
But the actual code for YieldProcessor is a NOP on x86:[2]
Wine devs are aware of this, but the mess is bad enough that no one has tackled it.
This is down in the core of what "malloc" calls, so changes there could have unexpected effects on many programs.
Needs attention from someone really into mutexes.
Just to clarify: so spinning on the CMPXCHG itself is the bad idea, the YieldProcessor is correct (a pause), but inside the loop it should be spinning on a pure unlocked load. Is that correct?
No, you should spin on a read. Once you see the value you want, you then try the CMPXCHG. If that succeeds, you exit. If it fails, you go back to spinning on the read.
Read and load mean the same thing. (I think GP just missed the end of your comment.)
You care about exchange vs read/load because of cache line ownership. Every time you try to do the exchange, the attempting CPU must take exclusive ownership of the cacheline (stealing it from the lock owner). To unlock, the lock owner must take it back.
If the attempting CPU instead only reads, the line ownership stays with the lock holder and unlock is cheaper. In general you want cache line ownership to change hands as few times as possible.
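A sketch of that "spin on a read, then CMPXCHG" shape with C11 atomics (illustrative only, not the Wine code):

#include <stdatomic.h>

static void spin_lock_ttas(atomic_int *lock)
{
    for (;;) {
        int expected = 0;
        /* only this CAS takes exclusive ownership of the cache line */
        if (atomic_compare_exchange_weak_explicit(lock, &expected, 1,
                memory_order_acquire, memory_order_relaxed))
            return;
        /* lock is held: wait on a plain read, which leaves the line
           shared and keeps the owner's unlock cheap */
        while (atomic_load_explicit(lock, memory_order_relaxed) != 0)
            ;  /* ideally with a pause instruction in here */
    }
}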
For what it's worth, it seems the library in question does both: it uses an exponential retry loop, busy-read looping 2^i times for the first 7 attempts before yielding. It seems like there must be some threshold where latency is improved by retrying before yielding, but I don’t test these things for a living.
I've tried to figure out where that goes wrong. Read the bug report linked. I've caught this situation in gdb with 20 threads in spinlocks inside realloc, performance in a game down to 1 frame every 2 seconds, and very little getting done.
But I don't understand how the three levels of locking involved cause that. Nor do the Wine people.
Works fine on Microsoft Windows. Only fails on Wine.
It seems like the loop around InterlockedCompareExchange is a bad idea, since that's a bus lock in a tight loop. Rather, the inner spinning loop that yields should just be reading the value that the cmpxchg operates on. As for whether sched_yield should just be called in the inner loop, or a short nop/pause loop should be attempted first for micro-contention reasons, the expert opinion here is don't bother with the nop loop. However, while the nop loop might not be real-world optimal, I doubt that would be causing a catastrophic performance issue.
My data says that always yielding is better than any amount of busy spinning, except on microbenchmarks, where depending on the microbenchmark you can get any answer your heart desires.
Has WINE considered using mremap() for its realloc() implementation? It allows you to make a realloc() of a sufficient size basically cost nothing. See this screenshot: https://x.com/JustineTunney/status/1837663502619889875 On other platforms like macOS you can achieve the same thing as mremap() by mapping at random addresses and then using MAP_FIXED to append.
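For reference, the Linux-only growth path looks roughly like this (a sketch; `grow_mapping` is a made-up name, MREMAP_MAYMOVE lets the kernel move the mapping instead of copying bytes, and this only applies to page-granular, mmap-backed allocations):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stddef.h>

/* grow an mmap-backed block in place if possible, otherwise let the kernel
   remap the pages to a new address -- either way no user data is copied */
static void *grow_mapping(void *old, size_t old_size, size_t new_size)
{
    void *p = mremap(old, old_size, new_size, MREMAP_MAYMOVE);
    return p == MAP_FAILED ? NULL : p;
}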
`YieldProcessor` is just named confusingly. It is not equivalent to `sched_yield` or `std::thread::yield()`; it's rather a macro that issues a pause instruction (which is indeed what you typically want in a spin loop). The actual Windows equivalent would be `SwitchToThread`.
It might look like a NOP, but "REP NOP" is not a real NOP, it's the PAUSE instruction (which very old processors from the 1990s treated as a normal NOP).
On Darwin, it's possible for a pure spinlock to produce a priority inversion deadlock, because Darwin has a quality of service implementation in the kernel that differs from how everyone else handles thread priority. In other kernels, a low-priority thread will still eventually be guaranteed a cpu slice, so if it's holding a spinlock, it will eventually make progress and unlock. On Darwin with Quality of Service, it's possible for higher-QoS threads to preempt lower-QoS threads indefinitely.
For this reason, on Darwin you want to avoid spinlocks unless you know that all threads touching the spinlock always run at the same QoS. Instead of spinlocks, your go-to low-overhead lock there is os_unfair_lock, which donates the priority of the blocked threads to the thread holding the lock.
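Usage is about as simple as a spinlock (a sketch; `update_shared_state` is just a stand-in name):

#include <os/lock.h>

static os_unfair_lock g_lock = OS_UNFAIR_LOCK_INIT;

static void update_shared_state(void)
{
    /* contended waiters block in the kernel and donate their priority
       to whichever thread currently holds the lock */
    os_unfair_lock_lock(&g_lock);
    /* ... touch the shared state ... */
    os_unfair_lock_unlock(&g_lock);
}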
On Linux you can still easily go extremely slow with plain sched_yield(), as there's no guarantee the thread that holds the lock will get to run, especially in NUMA situations, due to a strong bias against migrating threads across NUMA boundaries.
You have to use something like the futex API to tell the kernel what you're waiting for.
Bare spinlocks are very bad in this way, unless you have dedicated hardware threads (with SMT you can pin all but one of the threads with the remaining thread being general purpose, though you may want a way to inform the scheduler which cores to currently prefer due to their pinned tasks being more idle).
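Concretely, the waiter names the address and the value it last saw, so the kernel can park it until that word changes (a sketch of the raw syscall; `wait_on_lock_word` is a made-up wrapper, and real code would use a library or the full futex protocol):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

/* sleep until *lock is (probably) no longer `seen`; the kernel rechecks
   the value under its own lock, so a concurrent unlock is not missed */
static void wait_on_lock_word(atomic_int *lock, int seen)
{
    syscall(SYS_futex, lock, FUTEX_WAIT_PRIVATE, seen, NULL, NULL, 0);
}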
I also worked at Apple, for 13 years: there are a lot (and I mean a LOT) of priority inversions and abuses of locks in the codebase. Breaking out the internal profiler and actually having a look at whether you were running at the priority you thought can be very illuminating.
One particular app had a particularly nasty habit of losing priority because most of its work was done in a set of daemons and not in the UI process itself. The foreground app is allowed to use Performance cores and receive USER_INTERACTIVE priority for operations if it wants, but daemons cannot unless they have a mach IPC voucher stating that the UI is blocked on their work. But if you do the wrong thing with a semaphore or a spinlock, you get booted out of the thread group and now you’re running on Efficiency cores at USER_INITIATED priority and suddenly your latency numbers spike 10x on systems with asymmetric cores. I remember solving a particularly bad one which was basically doubling our request latency on phones with E cores.
The company had a pretty widely spread internal article trying to whack people over the head to stop abusing lock primitives in this way, but we just kept shipping code with this issue.
(Most of the inversion issues came from the use of semaphores though. The problem got 1000x worse in swift concurrency because using semaphores blocks an entire SC OS thread and you only get as many of those as you have cores. Most Apple Watches and HomePod only have 2 of them so you can easily deadlock. Which most people fixed by just… adding a timeout to their locks. It was horrifying.)
> The company had a pretty widely spread internal article trying to whack people over the head to stop abusing lock primitives in this way, but we just kept shipping code with this issue.
It sounds like this is relevant when developing user app code, not just the kernel or core libraries. Is there an external version of this article? I would be very interested to read more.
import Dispatch

var valueIWant: String? = nil
let sem = DispatchSemaphore(value: 0)
someAsyncFunction() { result in
    // this is called on another thread
    valueIWant = result
    sem.signal()
}
sem.wait()
assert(valueIWant != nil)
Note, this is mostly about semaphores, so it’s a bit offtopic from the original discussion here which is about spinlocks. But basically, never lock something in expectation of another thread unlocking it. Since semaphores are ownerless, the scheduler has no idea what thread will eventually signal the semaphore, so it can’t boost the priority of any threads you’re waiting on.
Not to mention the thread explosion issue: If the task running the completion (ie. signal()) is submitted to a libdispatch queue, the thread pool implementation will spawn more threads if all of them are starved, to eliminate deadlocks, which can lead to thread explosion in addition to the priority inversion issue. This advice dates back to when libdispatch was invented (and the internal article was called The Semaphore Antipattern…)
But with Swift Concurrency it’s far, far worse, because the SC thread pool does not spawn new threads if all of them are blocked. Meaning the dispatch block that’s supposed to call the callback may never get run, and you deadlock. Most importantly, this is true even of plain ObjC libdispatch code written 15 years ago: If some ancient framework you’re calling into does the semaphore antipattern, libdispatch is allowed to assume that all Swift Concurrency threads will eventually make forward progress, and so it doesn’t spawn new threads to handle the work, even if all threads are blocked. So ancient code that uses this antipattern suddenly blows up if someone calls it (transitively!) from a swift concurrency context. I’ve squashed countless bugs on this issue alone. (I wrote my own article on this internally too, titled something like “Semaphore abuse: How to deadlock your process and require a device reboot”)
IMHO, the semaphore should be officially marked deprecated and trigger warnings… there are no appropriate uses for it. If you’re in a synchronous context and need to block on the result of async code, the best bet is to stop what you’re doing and file a bug to the people writing the async code and tell them to give you a synchronous API (or refactor your code to be async.) Never ever use semaphores to block on async code. Not even once.
It applies to any concurrency primitive that can be used across threads, yes. NSLock and os_unfair_lock can only be unlocked by the thread that locked them, so you can’t abuse them to turn async code into sync code in this manner in the first place.
Basically people recommend os_unfair_lock because of what it can’t do.
We use spinlocks where appropriate. In the 90s I recall that the general rule of thumb was if the lock is held for <10x the context switch time, spinlocks are generally a better choice. Not sure if that's still true of contemporary architectures.
The more common pattern in rt/audio code is "try to take the lock, but have an alternate code path if that fails". It's not that there is never going to be contention, but it will be extremely rare, and when it occurs, it probably matters. RWLocks are also a common pattern, with the RT thread(s) being read-only (and still being required to fall back on an alternate code path if the read lock cannot be taken).
These days, fast lock implementations use the following rough idiom, or some idiom that is demonstrably not any slower even for short critical sections.
if (LIKELY(CAS(&lock, UNLOCKED, LOCKED))) return;
for (unsigned i = 0; i < 40; ++i) {
    /* spin for a bounded amount of time */
    if (LIKELY(CAS(&lock, UNLOCKED, LOCKED))) return;
    sched_yield();
}
... /* actual full lock algo goes here */
So, the reason to use spinlocks isn't that they are faster for short critical sections, but that they don't have to CAS on unlock - and so they are faster especially in the uncontended case (and in all other cases too).
In other words, the modern rule of thumb is something like: if you're going to grab the lock so frequently that the uncontended lock/unlock time shows up as a significant percentage of your execution time, then use a spinlock. That probably implies that you are holding the lock for <100x or even <1000x context switch time, or something around there.
> if you're going to grab the lock so frequently that the uncontended lock/unlock time shows up as a significant percentage of your execution time, then use a spinlock.
Yeah, and maybe also consider changing your design because usually this isn't needed.
This is (in my experience) a byproduct of good design, so changing the design wouldn't be a great idea.
Every time I've seen this happen it's in code that scales really well to lots of CPUs while having a lot of shared state, and the way it gets there is that it's using very fine-grained locks combined with smart load balancing. The idea is that the load balancer makes it improbable-but-not-impossible that two processors would ever want to touch the same state. And to achieve that, locks are scattered throughout, usually protecting tiny critical sections and tiny amounts of state.
Whenever I've gotten into that situation or found code that was in that situation, the code performed better than if it had coarser locks and better than if it used lock-free algorithms (because those usually carry their own baggage and "tax"). They performed better than the serial version of the code that had no locks.
So, the situation is: you've got code that performs great, but does in fact spend maybe ~2% of its time in the CAS to lock and unlock locks. So... you can get a tiny bit faster if you use a spinlock, because then unlocking isn't a CAS, and you run 1% faster.
Lots of code is 'lock free' and uses the same pattern (more or less). Attempt the change optimistically, if it fails spin it, if it keeps failing (Massive Contention), back off.
Spin locks prevent forward-progress guarantees: e.g. when the thread holding the lock is scheduled out (or serves an interrupt), no other thread can make progress. Lock-free allows progress, as there is no exclusivity. Of course, if you're working in the kernel and controlling the scheduler, etc., the restriction does not apply as directly.
Realistically though, lots of lock-free code employs copy-on-write or RCU.
I have written quite a fair share of lock-free code (back when it was new), and generally enjoy writing it (to this day). Yet for most folks it's just too hard to reason about, understand, and maintain.
Just a note, lock free is a lesser requirement than wait-free.
Keep in mind that while try_lock() is realtime-safe, the following unlock() may not be, as it may need to wake threads that have blocked in the meantime!
So I would only use this pattern for situations where the very fact that a NRT thread tries to acquire the lock already means that RT safety is not a concern anymore (e.g. a device has been disconnected)
Excellent reminder. Yes, this is both a design defect of try/unlock that can't really be solved, and an implementation defect on several platforms that makes it worse than it actually needs to be (Windows, I'm looking at you)
One thing I've been using is a spinlock where the RT thread only uses try_lock() + fallback path and the NRT thread(s) call try_lock() in a loop with pause instructions and occasional thread yielding (or even sleeping). This might waste lots of CPU cycles when a NRT thread tries to acquire a lock that is already held by the RT thread, but assuming this only happens occasionally/rarely, it's a reasonable trade-off.
What type of scenarios come up in RT/audio code where an alternative path is available if you can't get a lock? In most of the concurrent code I've written the options available if I can't get a lock is some subset of try again, try again with backoff, or fail.
Let's say you are doing audio processing in the real-time thread. You try to lock to read new parameters, and if you fail to get the lock you do the processing with the old parameters.
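A sketch of that pattern with pthreads (the parameter struct and the callback signature are made up for illustration):

#include <pthread.h>

typedef struct { float gain; } params_t;

static pthread_mutex_t params_lock = PTHREAD_MUTEX_INITIALIZER;
static params_t shared_params;   /* written by the non-RT thread under the lock */
static params_t rt_params;       /* private copy owned by the audio thread */

static void audio_callback(float *buf, int nframes)
{
    if (pthread_mutex_trylock(&params_lock) == 0) {
        rt_params = shared_params;          /* picked up fresh parameters */
        pthread_mutex_unlock(&params_lock); /* see the caveat above about unlock() */
    }
    /* else: contended, keep rendering with the parameters we already have */
    for (int i = 0; i < nframes; i++)
        buf[i] *= rt_params.gain;
}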
> Reason: locks that have the ability to put the thread to sleep on a queue must do compare-and-swap (or at least an atomic RMW) on `unlock`. But spinlocks can get away with just doing a store-release (or just a store with a compiler fence on X86) to `unlock`.
This is something I've been thinking about a lot over time: the CAS is only there to atomically determine, on unlock, whether there are any sleeping waiters and you have to do a futex_wake. I would really want some way to get away with non-fenced operations (at least on x86), but I don't know if it's just that nobody has figured out how, or whether there is a fundamental reason why that's not possible.
You do need a fence in the unlock path though (at least a release fence).
I think the issue is that if you ask the CPU to just store something (like in a spin lock), whether or not there’s a fence, it’s an operation with limited data flow dependencies so it’s easy for the CPU to execute. Even the fence can be handled using wacky speculation tricks.
But if you want to do something like, “store this value but only if the old value satisfies some predicate”, then there’s a load and the whole thing is dependent on the load. So you’re asking the CPU to load, then run a predicate, then store, and for that to be fenced, and atomic.
Strictly more work. I don’t think there’s any trick to make it faster than just the store release.
> You do need a fence in the unlock path though (at least a release fence).
Well yes but on x86 that comes for free. The overhead of the full fence brought in by lock cmpxchg or lock xchg is in the order of ~10ns, which for an uncontended lock means that a mutex is almost 2x as slow as a spinlock.
A load acquire + store release would be a couple of ns (assuming everything in L1 etc...)
As far as I know it is a fundamental limitation. You need to release the mutex and then check that there were no waiters, in that order, so as not to miss wakeups. Since the mutex release must be globally visible before the load, release ordering on the mutex is not sufficient (the load could be reordered before the unlock); hence you need a StoreLoad fence, which is always the most expensive barrier.
Consider the implementation of Dekker's algorithm for example.
As a more practical argument, let's suppose x86 had an atomic CAS that guarantees the store is a release but the load is relaxed (unlike normal x86 loads, but like non-temporal loads, it would have no implied LoadLoad or LoadStore ordering).
This relaxed-load CAS, coupled with some form of control or data dependency, would be sufficient to implement your mutex. But would such a CAS be significantly cheaper than the existing CAS? If it were, you would be able to approximate the strong CAS semantics by adding lfence+sfence after the relaxed CAS. These fences are cheap, so if the strong CAS could be implemented this way with a significant improvement, Intel would have already done it.
Finally, it is practically possible to implement the sequence store(A);#StoreLoad;load (B) without an explicit fence by using the colocation trick: have A and B be adjacent in memory (on the same cacheline), store to A, then do a wider load on both A and B. Intel does not give multithread guarantees on this, but my understanding is that it works: the wider load fails to be store-forwarded and stalls the pipeline waiting for the previous store to be flushed. In practice this costs about as much as a fence, so in addition to being undefined, is not cheaper.
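For concreteness, the colocation trick would look something like this (a hypothetical layout and a made-up helper name; as noted above, Intel doesn't document this behaviour, so it's shown only to illustrate the idea):

#include <stdint.h>

typedef union {
    struct {
        uint8_t locked;    /* A: stored to on unlock */
        uint8_t waiters;   /* B: must be read after that store */
    } b;
    uint16_t wide;         /* both bytes, same cache line */
} lockword_t;

static int unlock_and_check_waiters(lockword_t *w)
{
    /* store to A ... */
    __atomic_store_n(&w->b.locked, 0, __ATOMIC_RELEASE);
    /* ... then a wider load overlapping A: store-forwarding fails, so the
       load stalls until the store drains, acting like #StoreLoad in
       practice (and costing about as much as an explicit fence) */
    uint16_t both = __atomic_load_n(&w->wide, __ATOMIC_RELAXED);
    return (both >> 8) != 0;   /* high byte is `waiters` on little-endian x86 */
}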
- On the wait side, do the CAS to set the "waiter present" bit. On unlock, do a (relaxed) read of the lock word, and if "waiter present" isn't set, just do a release store to unlock (and go down some slow CAS-y wake path if a waiter is present). On the wait side, never do an un-timed futex wait; just do a series of timed waits with increasing wait times (so that you still eventually fix things if you hit the unlucky race with the previous holder's check+store sequence). (You can also do some tricks with counters to let waiters do an unconditional sleep once they get their wait acknowledged.)
- Split out the "waiter present" bit into its own byte, do a store-load sequence (with just a compiler reordering fence) to check for waiters, and have waiters either do a membarrier() syscall or wait "long enough" that they're sure they've gotten the same effect. (This gets tricky w.r.t. mutex lifetime though; you either need out of band lifetime knowledge or to use RCU or whatever and indirect through pointers).
Practically, neither was ever "better enough" for anything but microbenchmarks to be worth the complexity.
> Split out the "waiter present" bit into its own byte, do a store-load sequence (with just a compiler reordering fence) to check for waiters, and have waiters either do a membarrier() syscall or wait "long enough" that they're sure they've gotten the same effect. (This gets tricky w.r.t. mutex lifetime though; you either need out of band lifetime knowledge or to use RCU or whatever and indirect through pointers).
If you are doing this, given the cost of membarrier, you are optimizing for almost always uncontended locks. At this point you can make your lock default-owned by the first thread to lock it and have the owner lock and unlock be basically free until it is contended. This is basically the biased locking optimization that Java implements (or used to).
It kinda depends; you only do the membarrier when you're about to sleep anyways, and the non-expedited membarrier() call is just a synchronize_rcu(), so it's not that drastically more expensive than a futex wait.
You don't necessarily want a biased lock for all this kind of stuff, because "sparsely contended" doesn't necessarily imply thread-associated. E.g. one place I was looking at this for was locks for pages of virtual memory in a heap; no thread "owns" any given heap page, but it was very uncommon to get unlucky and have two threads touching adjacent pages at the exact same time. These kind of "sloppy mutexes" get half the fast-path speedup of biased locks but without the heavily asymmetric performance costs. (At least, that was the theory; like I said it didn't really pan out to be that useful in practice).
> If you want your spinlock to be hella fast on contention just make sure you `sched_yield` before each retry
Shouldn't you rather use pause instructions (_mm_pause, __yield, ISB, etc.) instead of thread switching? That's what I have seen in most spin lock implementations. Maybe it depends on the use case?
No, you shouldn’t be calling CAS in a tight loop, but rather a relaxed load to check if a CAS might be successful, with PAUSE equivalent after each load. After spinning for ~context switch latency you should back off with sched_yield before retrying the spin loop (and eventually start sleeping on backoff with exponential jitter etc).
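Putting that recipe together, roughly (a sketch: the spin count of 1024 is an arbitrary placeholder rather than a tuned value, `_mm_pause` assumes x86, and the sleep/backoff stage is left out):

#include <stdatomic.h>
#include <sched.h>
#include <immintrin.h>   /* _mm_pause(), x86 only */

static void lock_with_backoff(atomic_int *lock)
{
    for (;;) {
        int expected = 0;
        if (atomic_compare_exchange_weak_explicit(lock, &expected, 1,
                memory_order_acquire, memory_order_relaxed))
            return;
        /* spin on a relaxed load with PAUSE for a bounded time */
        for (int i = 0; i < 1024 &&
                 atomic_load_explicit(lock, memory_order_relaxed); ++i)
            _mm_pause();
        /* still held after the bounded spin: give the CPU away before retrying
           (real code would eventually sleep with exponential backoff + jitter) */
        if (atomic_load_explicit(lock, memory_order_relaxed))
            sched_yield();
    }
}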
Atomic RMW is all you need. Imagine a 3-state futex lock with states taken, sleeping, and unlocked. The unlocking thread just needs to see the old value, but writes unlocked unconditionally.
I don't think this is more expensive than a release store, certainly not because of cache line traffic, since in most cache coherency protocols the unlocking core has to own the line it's writing to anyway.
Granted, this lock does wake up too many waiters, but you did say usually uncontended.
So you want uncontended unlocks to make a syscall to wake the threads that aren’t there?
That syscall will be worse than a CAS just because it’s a syscall. Not to mention it’ll have to lock some kernel data structure just to figure out that there aren’t any threads to wake.
I think I'm not doing a great job of explaining it.
There are three states of the mutex: locked=1, unlocked=0, and sleeping=2. A thread comes along to a locked mutex, atomically moves it to sleeping, and sleeps via futex. The freeing thread unconditionally writes 0 and atomically sees the old value: if it was 2, wake all; if it was 1, there's no need to wake anybody.
Maybe that is equally expensive to a CAS, but I don't think it ultimately is.
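A minimal sketch of that design (close to the simple futex mutexes in Drepper's "Futexes Are Tricky", assuming Linux; the wake-all on unlock is the "too many wakeups" trade-off mentioned above):

#include <stdatomic.h>
#include <limits.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* states: 0 = unlocked, 1 = locked, 2 = locked with possible sleepers */

static void futex_wait(atomic_int *a, int val)
{
    syscall(SYS_futex, a, FUTEX_WAIT_PRIVATE, val, NULL, NULL, 0);
}

static void futex_wake_all(atomic_int *a)
{
    syscall(SYS_futex, a, FUTEX_WAKE_PRIVATE, INT_MAX, NULL, NULL, 0);
}

static void mutex_lock(atomic_int *m)
{
    int c = 0;
    if (atomic_compare_exchange_strong(m, &c, 1))
        return;                              /* uncontended fast path */
    /* contended: take the lock in state 2 so the unlocker knows to wake us */
    while (atomic_exchange(m, 2) != 0)
        futex_wait(m, 2);
}

static void mutex_unlock(atomic_int *m)
{
    /* one unconditional RMW: write unlocked, look at the old value */
    if (atomic_exchange(m, 0) == 2)
        futex_wake_all(m);                   /* may wake more threads than needed */
}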