Hrm. Big fan of Justine and their work. However, this is probably the least interesting benchmark test case for a mutex. You should never have a bunch of threads constantly spamming the same mutex. So which mutex implementation best handles this case isn't particularly interesting, imho.
Very large multithreaded programs with lots of diverse uses of locks, including:
- uncontended, mildly contended, and horrifically contended
- short and long critical sections
- contention among small numbers of threads and large numbers of threads
- contention that happens while other locks are held and those are also contended on (like, thread A wants a lock held by thread B, but thread B is itself waiting on a lock held by thread C)
Different lock algos work well or badly depending on how the locks are used, so it’s important to pick real programs as your benchmark rather than trying to cook a synthetic benchmark.
I'd say the vast majority of cases where I use a lock/semaphore is around very expensive resources, where the utilization of that resource vastly outweighs any performance overhead of the lock.
This is how it should be. IIRC (apologies, I can't find a source), Ulrich Drepper wrote somewhere about NPTL that its mutexes were not particularly lightweight, but that you should design your program for low contention anyway.
For highly contended data structures, spinlocks (and nowadays explicit atomics) are likely better.
What else would you measure? Certainly the uncontended case is important and a baseline, but otherwise this is kind of a weak point for mutexes: if you don't handle contention well, you end up with idle hardware, lots of additional scheduler work, or kernel crossings.
[edit - I forgot to even mention one of the most important things: locks that perform poorly under contention can have really negative systemic effects, like hot-spotting the memory network, and that would show up here too]
Uncontended is crucial. If you want to benchmark other things that's excellent, but if MutexA has crap uncontended performance then I'm on a loser if we pick MutexA unless I am absolutely sure we will have a lot of contention. Since contention is never desirable, that's rare.
Think of this like the random input case for a sort benchmark. Do I want stuff like all-ascending, all-descending, zig-zag and so on? Sure, those are nice. But without the random input case the benchmark is not very helpful. I might sort a zig-zag, I might sort data that's already in ascending order, but I will sort random data, that's going to happen or else I would not need a sort function.
Uncontended is uninteresting, because all mutex implementations perform roughly the same here, give or take a nanosecond or two. If you're truly uncontended then a naïve spin lock will actually seem fastest, because xchg is faster than cmpxchg which is needed for good locks.
On x86 you can. When xchg is used with a memory parameter it locks the bus. This is true even in the absence of a lock prefix. I included a spinlock implementation in the blog post. If you see any errors with it, then please let me know!
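For anyone who doesn't want to click through, the kind of naive exchange-based spin lock being discussed looks roughly like this (a minimal sketch, not the blog post's actual code):

```cpp
#include <atomic>

// Naive spin lock: lock() is a plain atomic exchange, which compiles to
// xchg on x86 and is implicitly locked even without a lock prefix.
struct NaiveSpinLock {
    std::atomic<bool> locked{false};

    void lock() {
        // Spin until we observe the flag as previously unlocked.
        while (locked.exchange(true, std::memory_order_acquire)) {
            // Busy-wait; a real implementation would at least execute a
            // pause instruction here to be kinder to the sibling hyperthread.
        }
    }

    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};
```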
> You should never have a bunch of threads constantly spamming the same mutex.
I'm not sure I agree with this assessment. I can think of a few cases where you might end up with a bunch of threads contending for the same mutex.
A simple example would be something like concurrently populating some data structure (list/dict/etc). Yes, you could accomplish this with message passing, but that uses more memory and would be slower than just having everything wait to write to a shared location.
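Concretely, I mean something like this: several threads all appending into one shared container behind one mutex (a sketch only; the names and counts are made up):

```cpp
#include <mutex>
#include <thread>
#include <vector>

std::vector<int> results;   // shared location everyone writes into
std::mutex results_mutex;   // the mutex all threads contend on

void worker(int thread_id) {
    for (int i = 0; i < 1000; ++i) {
        int value = thread_id * 1000 + i;  // stand-in for real work
        std::lock_guard<std::mutex> guard(results_mutex);
        results.push_back(value);
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 8; ++t) threads.emplace_back(worker, t);
    for (auto& th : threads) th.join();
}
```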
If you’re trying to mutate a dictionary many times from many threads you’re going to have a bad time. The fix isn’t a faster mutex, it’s don’t do that.
Depends on the dictionary implementation. There are a number of thread-safe dictionaries in the wild with varying degrees of parallel performance. Pretty much all of them benefit from faster mutexes.
For example, some thread-safe dictionaries segment their underlying key/value pairs, which lets reads and writes proceed concurrently as long as they land in different segments, and that significantly improves performance.
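The striping idea is simple enough to sketch (illustrative only, not any particular library's implementation): hash the key to pick a segment, and give each segment its own lock, so threads touching different segments never contend.

```cpp
#include <array>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

class SegmentedMap {
    static constexpr size_t kSegments = 16;

    struct Segment {
        std::mutex mu;
        std::unordered_map<std::string, int> map;
    };
    std::array<Segment, kSegments> segments_;

    Segment& segment_for(const std::string& key) {
        // The key's hash picks the segment; only that segment gets locked.
        return segments_[std::hash<std::string>{}(key) % kSegments];
    }

public:
    void put(const std::string& key, int value) {
        Segment& s = segment_for(key);
        std::lock_guard<std::mutex> guard(s.mu);
        s.map[key] = value;
    }

    std::optional<int> get(const std::string& key) {
        Segment& s = segment_for(key);
        std::lock_guard<std::mutex> guard(s.mu);
        auto it = s.map.find(key);
        if (it == s.map.end()) return std::nullopt;
        return it->second;
    }
};
```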
Message passing has allocation pressure and cache-coherency pressure not present when using a shared memory location. Especially as the amount of memory in question goes up, the benefit of a shared location increases in terms of performance impact.
Sure, for something silly like writing to an int, there is negative benefit in a shared location, but when you start talking about a dictionary with 1 million entries, the shared location becomes much more of a benefit vs all the copying and allocating you'd have to do if you tried to do the same thing with message passing.
For some data structures, it's the optimal way to move data around. For example, the LMAX Disruptor is about the fastest way to pass messages because of the shared memory and well-tuned locks.
You are talking about writes to a data structure such as a list or a dictionary from multiple threads. Nobody uses advanced message passing techniques for that. A list is basically the poster child of why you avoid mutexes: each thread writes to its own sublist, with no use of mutexes, and then at the end of the processing every thread's list is merged together. Merging a list takes O(1) time by manipulating a few pointers, so you avoid mutexes with zero drawback.

I don't know why you are talking about copying or allocating: a list requires no copying to be merged, and in this case all the allocations still happen across multiple threads, so there's no change in allocation pressure. If your allocator is bad, there could be internal mutexes inside the allocator, but that's beside the point.
With dictionaries, you may have to do a bit of resizing and rehashing at the end. But in my benchmarks, it is still worthwhile.
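To make the list case concrete, the pattern I mean looks roughly like this (a sketch; it assumes `std::list`, which is the easy case since `splice` just relinks nodes):

```cpp
#include <list>
#include <thread>
#include <vector>

int main() {
    constexpr int kThreads = 4;
    std::vector<std::list<int>> partials(kThreads);  // one private sublist per thread
    std::vector<std::thread> threads;

    for (int t = 0; t < kThreads; ++t) {
        threads.emplace_back([t, &partials] {
            for (int i = 0; i < 1000; ++i)
                partials[t].push_back(t * 1000 + i);  // stand-in for real work, no locking
        });
    }
    for (auto& th : threads) th.join();

    // Merge at the end: splice just relinks pointers, O(1) per sublist, no copies.
    std::list<int> merged;
    for (auto& part : partials)
        merged.splice(merged.end(), part);
}
```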
Depends on the list implementation. I assume we are talking about C++ lists? In that case, yeah, I can see how you'd instead do a `splice` of sublists.
I was thinking more in terms of something like a `vector`, in which case merging the sublists requires copying values into the destination vector. That (especially if done wrong) can involve allocating more than once just to drop the results in.
But, for the record, there are lock free algorithms that don't require a mutex to add values into a list. You basically CAS the tail pointer with your new value.
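Roughly, the CAS idea looks like this (sketched on the list head to keep it short and with made-up names; appending at the tail works the same way but also has to swing the old tail's next pointer):

```cpp
#include <atomic>

struct Node {
    int value;
    Node* next;
};

std::atomic<Node*> head{nullptr};

void push(int value) {
    Node* node = new Node{value, head.load(std::memory_order_relaxed)};
    // Retry until we swap ourselves in as the new head. On failure,
    // compare_exchange_weak reloads the current head into node->next.
    while (!head.compare_exchange_weak(node->next, node,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
    }
}
```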
That being said, I concede that for a list a mutex is probably the wrong choice. It's a better choice in Java, which has constraints that prevent (easily) splicing together two lists.
For the dictionary, implementation is key. A good segmented dictionary is going to be hard to beat.