It's important to remember that futex is not a lock; it is a conditional sleeping operation. Its name unfortunately looks similar to mutex but it's not a lock.
A thread calls futex(&flag, FUTEX_WAIT) to sleep. The kernel wakes up the caller thread when the futex flag's value might have been changed, by another thread calling futex(&flag, FUTEX_WAKE). When futex() returns, you don't get a lock. It just means the value of the flag might have been changed.
Since it's not a lock, the value of the flag is not guaranteed to be certain value when futex() returns. The flag can be changed again by another thread between the futex's return and the next instruction of the caller thread. It's important to put the futex() call inside a loop calling compare_and_set() repeatedly to check the expected value of the flag.
It's not really different than with condition variables having to re-check the condition after being signaled. Learn the rules of the API, hew to them, and you'll be fine.
Thanks for the creation! But sorry man, the historic description unfortunately added to the confusion. When I read the original description, I thought it’s a user mode lock and was thoroughly confused. Only after I read the implementation of futex in kernel and its usage in the implementation of pthread_mutex_lock in user mode, that I realized futex was not a lock. It’s a waiting and signaling operation.
Lock doesn't have to sleep. The semantic of a lock is that on acquisition it blocks until it's available, and upon unblocking the caller thread has acquired the lock. The blocking can be sleep or be doing busy-spin.
Futex is the lower level mechanism to build higher level locks like mutex/semphore/conditional-variable/etc.
I think what the parent was saying was that, for a futex, the kernel wakes you up when the condition might have been satisfied, but spurious wakes are possible too. (I'm counting the simultaneous waking of multiple threads as spurious wakes here.) That's not the case with locks, which only wake up when the condition is satisfied.
Good question. It's for performance reason. Calling into kernel is expensive. Lock acquired in the user mode is much faster than lock acquired in kernel.
A typical lock acquisition using futex looks like:
(a) while !compare_and_swap(&lock, 0, 1) // see if flag is free as 0, lock it as 1
(b) futex(&lock, WAIT, 1) // sleep until flag changes from 1
(a) runs in user mode and it's very fast. It's just one CMPXCHG assembly instruction. If the lock is free, it's acquired in one instruction as 1 (locked).
If the lock is not free, then do the expensive call into the kernel to sleep via futex at (b). Futex() helps in detecting the change of the value while putting the thread to sleep, to avoid hogging the CPU.
Importantly, prior to this (and, hell, even since) state of the art was to try atomically, then yield (or sleep(0)) then try usleep.
The kernel had no idea what was going on, so had no idea how to schedule such a thing. It particularly didn't know to wake you when the lock (which it has no idea about) became available.
It is worth noting here that this "one assembly instruction" is not that cheap. The hardware on a multicore system does have to perform some locking under the hood to execute that instruction. But yes, it still has enough of an advantage over calling into the kernel to justify the additional usage complexity.
And on ccnuma systems, you end up getting memory contention and huuuuuge memory latencies, as well for the data also residing in the same cacheline. Often we would align and isolate these locks within a cacheline blocke... which also wastes a lot of space if you have a lot of these (I was consulting on an enterprise app that had millions I think. It made a big difference in the software's memory footprint.)
You are right. XCHG can do the job to acquire the lock. At least on x86, XCHG does lock the cache line on the address of the variable, so it should be ok.
The gettimeofday(3) vDSO is pure-userspace code. Why not, then, a futex(3) vDSO that does a while + compare_and_swap(2) in userspace, but then contains a real call to the futex(3) syscall?
This part of the optimisation is well-known and had been for years before futex was invented, so there's no need to provide it as a vDSO or any other special work, userspace can (and was ready to) do that part anyway.
The futex is clever because previously the heavyweight locking is in the kernel, but all you actually needed was this very flimsy wait feature. BeOS for example, has a "Benaphore" construction where you initialise an expensive heavyweight lock but you try the userspace trick before needing to actually take it. So that looks superficially just as good as a futex...
... and then you realise oh, the underlying locks need kernel RAM to store their state, so the OS can only allow each process to have so many of them and thus you can't have so many Benaphores, they were expensive after all. But a futex doesn't use any kernel resources when you aren't calling the OS, Linux doesn't mind if your program uses a billion futexes, that's fine, any particular thread can only be waiting on one of them, and Linux is only tracking the waiting.
> This part of the optimisation is well-known and had been for years before futex was invented, so there's no need to provide it as a vDSO or any other special work, userspace can (and was ready to) do that part anyway.
I guess, to me, the semantics of futex(3) on their own seem ill-formed, without the while loop + cmpxchg being part of them. It feels like the user shouldn't have access to raw "low-level" calls to futex(..., FUTEX_WAIT), since there's only one useful way to use those calls, and it's in combination with other code, where that code is theoretically possible to get wrong. It's an API that doesn't cleanly encapsulate its own concerns.
I suppose I'm used to thinking with "building reusable crypto"-tinted glasses: with crypto libraries, you never want to expose a primitive that needs to be used a certain way to be sound; because people are inevitably going to use it wrong, and that's inevitably going to result in exploitation. Instead, you can just expose a higher-level primitive, that inherently can only be used that one correct way, with no means to ask it to do anything else.
Of course, there's nothing inherently dangerous about calling futex(..., FUTEX_WAIT) without the associated code. (You just get a race condition. But a race condition in userspace doesn't corrupt the kernel or anything.) So I suppose this kind of thinking is meaningless here.
For a long time the user in fact did not have (easy) access to it. glibc has started to provide a syscall wrapper only very recently, before you had to use syscall directly.
The reason is that the CAS is not part of the interface but it is left to the implementor is that the specific logic is very much application specific and in fact there is not only one way to use it. As discussed elsethread mutexes are only one of the many application of futexes.
What you described is what the phread_mutex_lock() does exactly, which is in user space. Application programmers don't deal with futex directly, they call phread_mutex_lock/phread_mutex_unlock.
Futex puts the thread to sleep and wakes it up. Accessing the OS scheduler requires kernel access anyway. The memory access of the flag can be done in user or kernel space.
Incorporating thread-safety into individual calls is not sufficient, because thread-safety is not composable. A sequence of calls to thread-safe operations is not itself thread-safe. Since you need locks in the outermost application level, it would be wasteful and unnecessary to also lock inside each individual call.
This non-composability is why multithreading is so difficult. A library can't abstract away these concerns for you, unless the entire multithreaded operation is one call to the library, which requires the library to know and design for your exact use case. If you want to do something a library didn't plan for, you are required to handle the thread synchronization yourself.
A thread calls futex(&flag, FUTEX_WAIT) to sleep. The kernel wakes up the caller thread when the futex flag's value might have been changed, by another thread calling futex(&flag, FUTEX_WAKE). When futex() returns, you don't get a lock. It just means the value of the flag might have been changed.
Since it's not a lock, the value of the flag is not guaranteed to be certain value when futex() returns. The flag can be changed again by another thread between the futex's return and the next instruction of the caller thread. It's important to put the futex() call inside a loop calling compare_and_set() repeatedly to check the expected value of the flag.