It's important to remember that futex is not a lock; it is a conditional sleepin...

cryptonector · on Feb 10, 2022

It's not really different than with condition variables having to re-check the condition after being signaled. Learn the rules of the API, hew to them, and you'll be fine.

RustyRussell · on Feb 9, 2022

> Its name unfortunately looks similar to mutex but it's not a lock.

Hmm, looking at the latest source code, my original quote has been split, but:

* Fast Userspace Mutexes (which I call "Futexes!").

* (C) Rusty Russell, IBM 2002

...

* "The futexes are also cursed."

* "But they come in a choice of three flavours!"

https://github.com/torvalds/linux/blob/master/kernel/futex/c...

ww520 · on Feb 9, 2022

Thanks for the creation! But sorry man, the historic description unfortunately added to the confusion. When I read the original description, I thought it’s a user mode lock and was thoroughly confused. Only after I read the implementation of futex in kernel and its usage in the implementation of pthread_mutex_lock in user mode, that I realized futex was not a lock. It’s a waiting and signaling operation.

l33t2328 · on Feb 9, 2022

I’ve always imagined a lock as a conditional sleeping operation. What’s the difference?

ww520 · on Feb 10, 2022

Lock doesn't have to sleep. The semantic of a lock is that on acquisition it blocks until it's available, and upon unblocking the caller thread has acquired the lock. The blocking can be sleep or be doing busy-spin.

Futex is the lower level mechanism to build higher level locks like mutex/semphore/conditional-variable/etc.

dataflow · on Feb 9, 2022

I think what the parent was saying was that, for a futex, the kernel wakes you up when the condition might have been satisfied, but spurious wakes are possible too. (I'm counting the simultaneous waking of multiple threads as spurious wakes here.) That's not the case with locks, which only wake up when the condition is satisfied.

dpratt71 · on Feb 9, 2022

Stupid question: Why isn't such logic incorporated into the call itself?

ww520 · on Feb 9, 2022

Good question. It's for performance reason. Calling into kernel is expensive. Lock acquired in the user mode is much faster than lock acquired in kernel.

A typical lock acquisition using futex looks like:

  (a) while !compare_and_swap(&lock, 0, 1) // see if flag is free as 0, lock it as 1  
  (b)     futex(&lock, WAIT, 1)            // sleep until flag changes from 1

(a) runs in user mode and it's very fast. It's just one CMPXCHG assembly instruction. If the lock is free, it's acquired in one instruction as 1 (locked).

If the lock is not free, then do the expensive call into the kernel to sleep via futex at (b). Futex() helps in detecting the change of the value while putting the thread to sleep, to avoid hogging the CPU.

RustyRussell · on Feb 9, 2022

Importantly, prior to this (and, hell, even since) state of the art was to try atomically, then yield (or sleep(0)) then try usleep.

The kernel had no idea what was going on, so had no idea how to schedule such a thing. It particularly didn't know to wake you when the lock (which it has no idea about) became available.

chrchang523 · on Feb 9, 2022

It is worth noting here that this "one assembly instruction" is not that cheap. The hardware on a multicore system does have to perform some locking under the hood to execute that instruction. But yes, it still has enough of an advantage over calling into the kernel to justify the additional usage complexity.

rrauenza · on Feb 10, 2022

And on ccnuma systems, you end up getting memory contention and huuuuuge memory latencies, as well for the data also residing in the same cacheline. Often we would align and isolate these locks within a cacheline blocke... which also wastes a lot of space if you have a lot of these (I was consulting on an enterprise app that had millions I think. It made a big difference in the software's memory footprint.)

chacham15 · on Feb 9, 2022

Im pretty sure the same construct can be implemented without the compare:

  int lock = 0;

  void AcquireLock(int *lock){
    while (ATOMIC_SWAP(lock, 1)){
      sleep(10); //or futex or w/e
    }
  }

  void ReleaseLock(int *lock){
    ATOMIC_SWAP(lock, 0);
  }

ww520 · on Feb 9, 2022

The 'lock' variable is shared among threads. Compare is needed to avoid stomping on the lock acquired by another thread.

dragontamer · on Feb 10, 2022

No it isn't.

If lock = 1, you set the lock to 1 (aka do nothing).

If the lock is 0, you know it is unlocked and know you succeeded in acquiring the lock.

If many threads try to atomic swap, only one of them gets the zero.

------

The real issue here is the subtle memory barrier bug in that code.

ww520 · on Feb 10, 2022

You are right. XCHG can do the job to acquire the lock. At least on x86, XCHG does lock the cache line on the address of the variable, so it should be ok.

ww520 · on Feb 9, 2022

Yes, CMPXCHG is a relatively more expensive instruction. In x86 it is a memory barrier instruction.

derefr · on Feb 9, 2022

The gettimeofday(3) vDSO is pure-userspace code. Why not, then, a futex(3) vDSO that does a while + compare_and_swap(2) in userspace, but then contains a real call to the futex(3) syscall?

tialaramex · on Feb 9, 2022

This part of the optimisation is well-known and had been for years before futex was invented, so there's no need to provide it as a vDSO or any other special work, userspace can (and was ready to) do that part anyway.

The futex is clever because previously the heavyweight locking is in the kernel, but all you actually needed was this very flimsy wait feature. BeOS for example, has a "Benaphore" construction where you initialise an expensive heavyweight lock but you try the userspace trick before needing to actually take it. So that looks superficially just as good as a futex...

... and then you realise oh, the underlying locks need kernel RAM to store their state, so the OS can only allow each process to have so many of them and thus you can't have so many Benaphores, they were expensive after all. But a futex doesn't use any kernel resources when you aren't calling the OS, Linux doesn't mind if your program uses a billion futexes, that's fine, any particular thread can only be waiting on one of them, and Linux is only tracking the waiting.

derefr · on Feb 9, 2022

> This part of the optimisation is well-known and had been for years before futex was invented, so there's no need to provide it as a vDSO or any other special work, userspace can (and was ready to) do that part anyway.

I guess, to me, the semantics of futex(3) on their own seem ill-formed, without the while loop + cmpxchg being part of them. It feels like the user shouldn't have access to raw "low-level" calls to futex(..., FUTEX_WAIT), since there's only one useful way to use those calls, and it's in combination with other code, where that code is theoretically possible to get wrong. It's an API that doesn't cleanly encapsulate its own concerns.

I suppose I'm used to thinking with "building reusable crypto"-tinted glasses: with crypto libraries, you never want to expose a primitive that needs to be used a certain way to be sound; because people are inevitably going to use it wrong, and that's inevitably going to result in exploitation. Instead, you can just expose a higher-level primitive, that inherently can only be used that one correct way, with no means to ask it to do anything else.

Of course, there's nothing inherently dangerous about calling futex(..., FUTEX_WAIT) without the associated code. (You just get a race condition. But a race condition in userspace doesn't corrupt the kernel or anything.) So I suppose this kind of thinking is meaningless here.

gpderetta · on Feb 10, 2022

For a long time the user in fact did not have (easy) access to it. glibc has started to provide a syscall wrapper only very recently, before you had to use syscall directly.

The reason is that the CAS is not part of the interface but it is left to the implementor is that the specific logic is very much application specific and in fact there is not only one way to use it. As discussed elsethread mutexes are only one of the many application of futexes.

ww520 · on Feb 9, 2022

What you described is what the phread_mutex_lock() does exactly, which is in user space. Application programmers don't deal with futex directly, they call phread_mutex_lock/phread_mutex_unlock.

yencabulator · on Feb 11, 2022

The non-syscall parts of futex can be inlined, for great benefit.

gettimeofday calls code in the vDSO to avoid codifying in the kernel ABI the much more complex, architecture-dependent, data format.

ww520 · on Feb 9, 2022

Futex puts the thread to sleep and wakes it up. Accessing the OS scheduler requires kernel access anyway. The memory access of the flag can be done in user or kernel space.

oofabz · on Feb 10, 2022

Incorporating thread-safety into individual calls is not sufficient, because thread-safety is not composable. A sequence of calls to thread-safe operations is not itself thread-safe. Since you need locks in the outermost application level, it would be wasteful and unnecessary to also lock inside each individual call.

This non-composability is why multithreading is so difficult. A library can't abstract away these concerns for you, unless the entire multithreaded operation is one call to the library, which requires the library to know and design for your exact use case. If you want to do something a library didn't plan for, you are required to handle the thread synchronization yourself.

58028641 · on Feb 10, 2022

Haskell’s STM offers composable concurrency