On mutex performance and WTF::Lock

saagarjha · on April 2, 2017

For those wondering, WTF stands for Web Template Framework: https://webkit.org/blog/6161/locking-in-webkit/

cperciva · on April 2, 2017

My favourite mutex name is the "sex lock" -- also known as the shared/exclusive or sx lock.

cpeterso · on April 3, 2017

Windows NT has a kernel mutex type called a "mutant" that was added for compatibility with OS/2's mutex (back when the NT team was still pretending to be building a new OS/2 kernel). Dave Cutler thought the OS/2 mutex semantics were "brain-damaged" and called NT's version "mutant". If an OS/2 thread exited while holding a mutex, the mutex would remain inaccessible forever.

https://blogs.msdn.microsoft.com/larryosterman/2004/09/24/cl...

pjmlp · on April 3, 2017

They would have, if a Microsoft employee hadn't figured out a viable way to use extended mode in Windows 3.x.

0xFFC · on April 3, 2017

Like i am gonna believe that ;)

(I am kidding)

pizlonator · on April 2, 2017

WTF locks now use a policy we call eventual fairness. You can barge but the amount of time anyone loses due to the barging race is bounded. https://trac.webkit.org/changeset/203350/webkit

obstinate · on April 2, 2017

Hard fairness guarantees for locking are very expensive and only situationally needful, so it's hard to justify paying a high cost to have them always on. It seems like the best thing to do is to use the fastest locking primitive you can find, and implement fairness on top of that in the rare case that you need it.

pizlonator · on April 2, 2017

Eventual fairness costs nothing.

obstinate · on April 2, 2017

Right, I'm referring to the idea of immediate fairness. I'm agreeing w/ your approach.

bogomipz · on April 3, 2017

Might you have any good resources about hard fairness vs eventual fairness locking?

pizlonator · on April 3, 2017

It's all in the commit.

ajross · on April 2, 2017

So the takeaway here is apparently that POSIX mutexes on macOS/iOS are really slow due to an oddball design decision (fairness), and the browser vendors are racing to see who can reimplment them best. Windows and Linux do this just fine AFAICT.

Seems like it would be more effective for the WebKit folks to just work on their Darwin compatriots to fix the C library locks, no?

rbehrends · on April 2, 2017

It's not as simple as that. What makes locks expensive (no matter their policy regarding fairness) is blocking. Blocking leads to context switches, which are extremely expensive. Linux not using fairness does not eliminate that problem, it mitigates it (which is a good thing, just not a panacea).

I have a simple benchmark that loses performance the more you parallelize it beyond 4-6 cores, until eventually it becomes slower than the single-threaded case [1]. And contention on that is not actually that high. That's still better than the macOS case, where performance goes to hell for even two threads, of course.

Here's the output of time for 32 threads (running on a 48-core system):

  ./a.out -p 32  7.72s user 75.04s system 2603% cpu 3.179 total

as opposed to the sequential case:

  ./a.out -p 0  1.94s user 0.00s system 99% cpu 1.942 total

As you can see, the program spends the vast majority of the time in the kernel due to blocking constantly. (Note that other factors add to the overhead, too.)

In short, you do not want a high degree of contention, anyway. The macOS locks just make contention hurt much more.

[1] https://gist.github.com/rbehrends/27a1ca27ddec7016cd42428b9c...

Jweb_Guru · on April 2, 2017

Fairness isn't an oddball design decision, it's an important assumption that many realtime systems rely on.

ajross · on April 2, 2017

It's oddball in the sense that literally no one else does it. So I don't know what systems you mean, since by definition they don't run on Linux (glibc or bionic) or Windows.

I'm not saying there's no use case for a fair mutex anywhere (though I'll come really close and say that if you want fairness a lock is the wrong abstraction), I'm saying that stuffing a context switch into every contended call to pthread_mutex_unlock() is a terrible design decision.

Jweb_Guru · on April 2, 2017

If you don't need fairness, don't use a fair lock. The default assumption should be fairness because it doesn't require you to think (just like the default transaction isolation level should be serializable). As the blog post points out, OS X provides unfair mutexes too.

Windows locks are fair by default too, BTW (IIRC anyway). People just use the unfair version when they need performance.

striking · on April 2, 2017

http://www.pcdoctor-community.com/wp/posts/2007/10/fairness-...

Windows XP tried to implement some amount of fairness in its locks but did not guarantee it. Windows Vista gave up on fairness entirely.

You should just be using unfair locks. Thread starvation is rare in most OSes at this point.

lilyball · on April 2, 2017

Thread starvation is still very much a thing on a lot of devices. macOS and iOS even have a Quality Of Service implementation in the kernel specifically to ensure that high-priority threads get to run when they'd otherwise be starved, and this can lead to low-priority threads being suspended for many seconds at a time under high workload. In fact, it's now really dangerous to use a regular spinlock on iOS and macOS because you can get into a priority inversion livelock if threads from different QoS levels are contending the same lock. The Obj-C runtime was seeing delays of tens of seconds in some cases when using spinlocks, so it now uses a private "os_lock" which is a spinlock that uses SPI to "donate" the QoS of the waiting thread to the one that holds the spinlock.

LeoNatan25 · on April 2, 2017

As of iOS 10 and macOS 10.12, it is no longer private: `os_unfair_lock`.

https://developer.apple.com/reference/os/os_unfair_lock

lilyball · on April 3, 2017

Oh good! I'll have to do some research at some point to find out if this is literally the same thing as the os_lock that the Obj-C runtime uses, but either way, I'm glad it's available now.

LeoNatan25 · on April 3, 2017

It uses the same private API (or at least, what you described in your previous post).

quinnftw · on April 2, 2017

I'd be interested in seeing the rational behind the fairshare policy. I suppose theoretically the firstfit policy could result in starvation via one thread constantly preempting another during the lock contention, but I don't imagine that would happen very often in practise.

alain94040 · on April 2, 2017

Starvation is a serious problem. Imagine two cores both trying to get a lock. The lock is held by a cache line in one of the cores. If both cores try to get the lock very often, the core further away may very well never get the lock.

The synthetic benchmark results don't show that because they behave nicely: each thread waits politely in between each attempt to get the lock. Real code doesn't always do that.

ajross · on April 2, 2017

> Imagine two cores both trying to get a lock. The lock is held by a cache line in one of the cores. If both cores try to get the lock very often, the core further away may very well never get the lock.

If that were true, then your problem is L2-cache-bound and trying to fit your solution into SMP is the Wrong Thing. In fact, the single-threaded behavior you end up with is going to be faster (by definition!) than the "fair" architecture you seem to want.

No one serious thinks this behavior of the default locking primitive is a good thing. Maybe (maybe) it's a good fit for some particular problem somewhere, but I'd want to see a benchmark. It's definitely not a consensus opinion among people who do serious thought about synchronization.

lilyball · on April 2, 2017

> No one serious thinks this behavior of the default locking primitive is a good thing.

Clearly some people do or OS X wouldn't do it.

59nadir · on April 2, 2017

Would it be the first badly implemented thing in OS X?

lilyball · on April 3, 2017

Mutexes are heavily used and get a lot of scrutiny, and work was done to speed them up when 10.11 (or was it 10.10? Whichever one added QoS) came out. I'm positive the people who maintain it are aware of the tradeoffs between fair and unfair locks and made the default fair intentionally.

Or more generally, just because Linux and Windows both behave in a certain manner doesn't make that de facto correct. It's all tradeoffs, and different people value different things. Correct-by-default (i.e. fair locks) is valuable, the only question is whether it's worth the performance hit (although you can always opt in to unfair locks to get better performance).

btschaegg · on April 3, 2017

> I'm positive the people who maintain it are aware of the tradeoffs between fair and unfair locks and made the default fair intentionally.

I think you're neglecting a couple of less-technical factors here. Yes, fairness was certainly made the default for reasons at some point (but still, even then, one might argue that an opt-in solution might have been better). On the other hand, there's the likely possibility that those reasons don't really hold true anymore. Think of Scene Graphs vs. Entity Component Systems in high-performance videogame design - in this example the rise of caching made a whole architecture out-dated.

On the other hand, like removing the GIL in Python, such decisions are not to be taken lightly because of the things you will break. It's very likely that there are applications that would still have problems with starving threads, and just switching from opt-out to opt-in will make them break for no apparent reason in the strangest of circumstances. I know, Apple likes to break things more often than MS, but I'd guess that´s not a risk they're willing to take. Imagine you're updating your OS and a dozen apps that worked for a decade and don´t get updates anymore start behaving strangely.

So, it's not unreasonable to settle for a less-than-optimal solution that still keeps things working and only makes things slower in the worst-case scenario. That doesn't mean it's not open to criticism, though.

qb45 · on April 3, 2017

> Yes, fairness was certainly made the default for reasons at some point (but still, even then, one might argue that an opt-in solution might have been better).

I think an unfair or opt-in-fair solution was deemed worse than fair-by-default for the simple reason that the most straightforward way to implement mutexes is

  enter kernel mode
  maybe grab a spinlock if on SMP
  add yourself to the waiting list
  sleep until woken up

  do critical section

  enter kernel mode
  spinlock
  wake the first waiting thread

That's a functionality every mutex must have and it happens to give fair semantics for free. Making semantics weaker for the purpose of optimization (and not just for the sake of making application developer's life harder or future flexibility which may happen to never be needed) actually takes additional work on top of that.

Since we are talking about a uniprocessor desktop OS developed in the nineties, it's plausible they didn't care about mutex performance as much as today and giving this extra guarantee afforded by their simple implementation seemed reasonable at the time.

gpderetta · on April 3, 2017

note that every half decent mutex implementation will not enter the kernel (in neither lock or unlock) unless the mutex is contended.

qb45 · on April 3, 2017

Sure, but OP's benchmark clearly shows that OS X's default mutex is at most 5% decent ;)

lilyball · on April 3, 2017

I'd love to see them perform the same benchmark on macOS 10.12. Pthread mutexes were sped up in "the new OSes" according to a tweet made last year (sorry no link), though I'm not sure if that was referring to 10.11 or 10.12. Either way, the benchmark should perform better now, and I'm curious how much. I'd run it myself except those numbers wouldn't be comparable to OP's Mac Mini benchmark. Of course, OP's Linux benchmark numbers aren't either, because it's a different (and probably much more powerful) machine.

btschaegg · on April 3, 2017

...which comes back to my point that old truths are more likely just not as true anymore.

lilyball · on April 3, 2017

Apple can and does update APIs in ways that preserve old behavior for apps linked against older SDKs, specifically so old apps continue to work. They could do the same here with versioned symbols, if they wanted to switch mutexes to unfair-by-default. I believe they intentionally go with fair-by-default because it's the safe choice, the choice that guarantees program correctness, since most people don't think about this sort of thing and probably won't be able to figure out themselves when their locks should be fair or unfair. It's the same reason C11 and C++11 atomics default to sequential consistency when not otherwise specified; that's the slowest memory ordering, but it's the one that's guaranteed to be correct for all cases. If you know what you're doing you can override the default and pick something else.

btschaegg · on April 3, 2017

> Apple can and does update APIs in ways that preserve old behavior for apps linked against older SDKs, specifically so old apps continue to work.

Fair point, although that still won't rule out applications that get updates (and thus still would have to be completely re-evaluated on the basis of such a case).

Also, wouldn't e.g. Homebrew also get those problems if you compile against the new SDK? (Non-Mac user here, so maybe I've got the wrong impressions on that...)

> I believe they intentionally go with fair-by-default because it's the safe choice [...]

Safety might be a big concern, but on the other hand, pthreads is a standard - if the article's right, and POSIX doesn't mandate fairness, you might still argue that this "addon" was better put into a custom solution than the other way around.

Then again, I've always suspected that strange implementations (or a total lack thereof) of POSIX must be one of the main reasons why Boost exists...

lilyball · on April 3, 2017

> Also, wouldn't e.g. Homebrew also get those problems if you compile against the new SDK?

Yes, but nearly all Homebrew software is cross-platform unix software, so that software will already have to deal with having unfair mutexes on Linux.

> … pthreads is a standard - if the article's right, and POSIX doesn't mandate fairness …

It doesn't mandate fairness, but it doesn't mandate unfairness either. Apple platforms defaulting to fair is no less conforming than Linux defaulting to unfair.

btschaegg · on April 4, 2017

> Apple platforms defaulting to fair is no less conforming than Linux defaulting to unfair.

Of course, but it makes it less performant in many situations, while you can't count on that feature if you're coding against the standard (i.e. if for cross platform development; hence OP's blogpost). Which - in turn - explains why people are complaining about the performance, since they need to guarantee the fairness themselves anyway if they really need it.

On the other hand: If you're only coding against OSX, having the fairness in an extension shouldn't be a problem anyway.

Of course, this then facilitates other problems, if the standard is not evolving fast enough (the whole OpenGL extension mess would be a good example).

lilyball · on April 5, 2017

I don't think Apple makes decisions about their OS with the mindset of "what will someone who's writing a cross-platform CLI tool that happens to work on OS X expect?". Making mutexes unfair by default would make it behave more like other platforms that cross-platform tools expect, but at the cost of making software written specifically for OS X not be correct-by-default.

In general, when picking defaults and choosing between "correct" or "performant", the decision is usually made in favor of "correct", because incorrectness should always be an opt-in thing. I'm actually kind of surprised Linux and Windows default to unfair locks.

btschaegg · on April 5, 2017

That's the point though, if you're Writing threading code specifically for Windows, you're not writing pthreads code. So all MS has to care about is that their custom threading abstraction is safe.

POSIX is more of a common denominator than a good choice in many usecases (you'll never get far if you're dealing with complicated file handling for example[1]), but it is an essential baseline for cross-platform development. If you're changing too much about it, you're burdening cross-platform devs to do system-specific implementations anyway, so there's not much point to it.

As a flipside, you'll also get devs who see a guarantee on one system, assume the standard guarantees it, and then make statements about their library that are untrue (i.e. "WTF::Lock is fair").

[1]: https://youtu.be/uhRWMGBjlO8

lilyball · on April 5, 2017

Just because you're using a POSIX API doesn't mean you're writing cross-platform code.

btschaegg · on April 5, 2017

No, but nothing is stopping you from using another New&Better™ Apple API if you're not concerned by cross-platform compatibility. Having tons of vendor-specific code in your cross-platform software, on the other hand, is a pain and one of the main reason why standards like POSIX exist.

Conformance with a standard doesn't just happen on a whim - if you negate the benefits that come with having a standard in the first place, any criticism you'll get as the provider of said API you brought on yourself.

Edit: Also, I have to say: I really don't like the attitude of "there are other use cases, so the one you are referring to when voicing criticisms is invalid". This has never been a constructive use of anyone's time. As I said, there are very valid reasons for why the API works as it does, but that doesn't invalidate the argument that the basic decision might not have been the best one to begin with.

lilyball · on April 5, 2017

What New&Better™ Apple API are you thinking of when it comes to mutexes? There's no reason for Apple to duplicate the functionality already provided by POSIX just for the sake of having a proprietary API.

> Conformance with a standard doesn't just happen on a whim

I genuinely don't understand what argument you're trying to make here. Apple is in compliance with POSIX. You seem to still be arguing from the position of "Windows and Linux behave in a certain way, therefore that way is correct", which is, quite frankly, bullshit.

btschaegg · on April 6, 2017

It's not a point on the implentation being incorrect. I honestly don't know if you read any of this - somehow I doubt that more and more.

The original point was: Apple uses fair mutexes by default, and that slows down any algorithm not needing that feature. Your point was: But it is still correct, and yes it is. But it's also still slow, and anyone pointing that out is as much in the right as you are when you refer to correctness.

Apple then introducing an out-of-standard solution (os_unfair_lock) for that (which is the exact duplication you talk about) is also just as suboptimal and worthy of criticism, since it torpedoes the whole reason for the existence of the standard (i.e. you still get #ifdefs or unnecessary abstractions all over the place). And - as I said - I get that no-one would ever want to change an implemented default behavior in their OS, but that doesn't make the situation any better.

Something similar happened in Microsoft's C++11-implementation of std::regex: The library uses strongly recursive function calls and might even cause stack overflows on various (not all that complicated) patterns in some scenarios. That doesn't make the implementation not correct (every regex implementation has its limits, and it never produces invalid results), but it makes it a hell of a lot less usable for non-trivial problems.

So, you can harp on on the correctness thing as long as you want, but you might consider not ignoring valid arguments before throwing words like BS around. "This implementation might be better" is not BS. Aggressively misinterpreting statements can be, though.

lilyball · on April 6, 2017

Accusing me of not reading your comment is kind of offensive, and the implication (that I'm only disagreeing because I'm not reading your arguments) is also offensive.

Given that POSIX does not define a way to select between fair and unfair locks, having any way to do that is by definition not a standard. Your argument against Apple having a way to get unfair locks is just as much an argument against Linux/Windows having a way of getting fair locks (assuming they even have that! I sure hope they do). And in any case, you don't need to #ifdef all your code everywhere to use os_unfair_lock, you just need to #ifdef a single line where you call pthread_mutexattr_setpolicy_np().

> I get that no-one would ever want to change an implemented default behavior in their OS, but that doesn't make the situation any better.

I recognize that having different behaviors on different OSes is suboptimal. The problem is you continue to imply that macOS's behavior is wrong, with the only justification being that Linux and Windows both behave differently. And I do not accept that argument. I'm also not convinced that this is really a big problem anyway. Fairness in locks is probably a little more important on Apple platforms than other platforms because of the QoS system. If you have an unfair lock that both the main thread and a "utility" thread use all the time, the utility thread will probably get starved since the main thread has a much higher QoS. So it's good that Apple platforms default to fair locks, because thread starvation is pretty bad. Personally, I think Linux and Windows platforms should default to fair locks too, with unfair-but-fast being opt-in, but that's an argument I've already made.

> but you might consider not ignoring valid arguments

Once again you are being offensive.

> Aggressively misinterpreting statements can be, though.

And even more offensive.

Seriously, if this is how you're going to argue, I don't see much point in continuing. You're providing fairly weak arguments in favor of all platforms should be the same, without recognizing that the platforms may have reasons for behaving differently, and constantly trying to imply that Apple platforms should be the ones to change while providing literally no reason whatsoever why that change should happen instead of Linux/Windows changing. This is not a compelling argument. And you keep insulting me at the same time, and I genuinely don't understand why you think that's going to change my mind.

btschaegg · on April 6, 2017

> And even more offensive.

Seems like we get the same vibes from each other; I have to point out though that I'm actually trying to be civil here. I honestly can't see where you are getting various of your implications from - I can't agree with half of what you state I'm implying here. Intended offensive statements would look different than that one.

So be it - we most likely both have more productive things to do. Maybe we'll have a more constructive discussion on something else someday.

wbl · on April 2, 2017

Why are two threads trying to get the same lock very often? This seems like a design flaw: you want threads to mostly be uncontended, with lots of time spent outside locks.

socmag · on April 3, 2017

absolutely, I totally agree.

socmag · on April 3, 2017

In my personal experiences I've seen quite the opposite.

Specifically, this kind of starvation mostly occurs only in synthetic lock benchmarks.

In the real world, cores should be actually doing work in between spinning on cache-line locks; and the time before they get back to locking should be fairly random by quite a reasonable number of cycles - which tends to largely mitigate the starvation issue you talk about.

Basically, if you have code that is structured to by default sit on spin-locks, there is likely an architectural issue in play.

I'd go with the other contributor who said something along the lines of "the best policy is to go with the fastest lock possible", unless there is some very niche reason not to.

eternalban · on April 2, 2017

Even if they get the lock, task latencies will be all over the place with seriously high latencies in the high percentiles.

hyperpape · on April 2, 2017

He doesn't give a lot of details, but somewhere in the first 20 minutes of this talk (probably 15 minutes in), Cliff Click says that Hotspot had to create fair locks in user-space because kernel provided locks weren't adequate.

https://youtu.be/Hqw57GJSrac?t=341

foepys · on April 2, 2017

Thank you for the video, very interesting. The time for what you mentioned is here: https://youtu.be/Hqw57GJSrac?t=1104 (from 18:24 to 24:00)

taeric · on April 2, 2017

Its funny that he throws out the "painful" learning of "don't trust the OS." I don't disagree, but it seems that the main learning should be "know your assumptions."

Consider, this just moved up the stack. A good mantra for developers on the JVM is "don't trust the VM."

Obviously, at some level you are trusting them, but you should also have measurements in place that would let you know which key assumption is failing. Yes, this is easier said than done.

dom0 · on April 2, 2017

Starvation and similar problems have plagued software using locks for a long time. A prominent example which is well documented may be the various problems of the GIL implementation when certain workloads are applied (easiest example is one I/O bound thread and one CPU bound thread, which is also a typical real world scenario - ouch!)

[It should be noted that it is commonly rejected that the GIL is a scheduler, acts like a scheduler or should be a proper scheduler; all of which are untrue. Many locks are used as schedulers as well, which is especially true for a lock that controls execution of any and all code.]

cat199 · on April 2, 2017

Which GIL are you talking about?

dom0 · on April 2, 2017

Sorry, I meant the Python GIL. Should've written that.

rogerbinns · on April 2, 2017

Do remember that current CPUs can dynamically run cores/hyper-threads at different clock speeds at the same time. Not to mention significantly mismatched performance ones (eg ARM big.LITTLE).

If you didn't have fairness, and the developer hadn't carefully assigned threads to cpus based on performance / priority of the work the thread is doing (and locked the CPU clock speeds), then the higher performing ones will starve out the others.

Fairness at least ensures that the developers writing simple straightforward code would get simple straightforward behaviour.

gpderetta · on April 3, 2017

Then again, if you are writing simple straightforward code you shouldn't be mucking around with thread priorities. In fact probably you shouldn't be dealing with raw threads and mutices in the first place but using some higher level abstraction.

qb45 · on April 2, 2017

It's probably a very old decision, from times when most desktop computers had one CPU and threads were mostly used to keep GUI responsive while the application was churning data in the background, so there was little interaction between them.

ricardobeat · on April 2, 2017

> I don’t know how firstfit policy locks compare to something like WTF::Lock on my Mac mini

As a layperson, it would have been nice to see at least some simple benchmarks substantiating what is being said, after several paragraphs trying to explain why WTF::Lock is not a good candidate, and given it's in the title of the post.

leksak · on April 2, 2017

> But fairness is not guaranteed for all OS mutexes; in fact, fairness isn’t even guaranteed in the pthreads standard

Is this true? I do not want to have to pay to read the standard

gpderetta · on April 3, 2017

Good thing it is available for free then!

leksak · on April 9, 2017

Where? I didn't find it