Landing a new syscall, part 1: What is futex? (collabora.com)
134 points by sph on Feb 9, 2022 | 82 comments



It's important to remember that futex is not a lock; it is a conditional sleeping operation. Its name unfortunately looks similar to mutex, but it does not itself provide locking.

A thread calls futex(&flag, FUTEX_WAIT) to sleep. The kernel wakes the caller thread when the futex flag's value might have changed, i.e. when another thread calls futex(&flag, FUTEX_WAKE). When futex() returns, you don't hold a lock; it just means the value of the flag might have changed.

Since it's not a lock, the flag is not guaranteed to hold any particular value when futex() returns. The flag can be changed again by another thread between futex's return and the caller's next instruction. It's important to put the futex() call inside a loop that re-checks the flag, e.g. with compare_and_swap(), until it holds the expected value.
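As a minimal sketch of that pattern (Linux-specific, using C11 atomics and the raw syscall; the helper names wait_for_flag/set_flag are my own, not a real API):

```c
#include <limits.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex(atomic_uint *addr, int op, unsigned int val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

/* Sleep until *flag becomes nonzero. FUTEX_WAIT can return whenever the
 * value *might* have changed (spurious wakeups included), so the check
 * lives in a loop. */
void wait_for_flag(atomic_uint *flag)
{
    while (atomic_load(flag) == 0)
        futex(flag, FUTEX_WAIT, 0);   /* kernel sleeps only if still 0 */
}

void set_flag(atomic_uint *flag)
{
    atomic_store(flag, 1);
    futex(flag, FUTEX_WAKE, INT_MAX); /* wake every waiter */
}
```

Note that the kernel itself re-checks the value against the third argument before sleeping, which closes the race between the userspace load and the syscall.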


It's not really different than with condition variables having to re-check the condition after being signaled. Learn the rules of the API, hew to them, and you'll be fine.


> Its name unfortunately looks similar to mutex but it's not a lock.

Hmm, looking at the latest source code, my original quote has been split, but:

* Fast Userspace Mutexes (which I call "Futexes!").

* (C) Rusty Russell, IBM 2002

...

* "The futexes are also cursed."

* "But they come in a choice of three flavours!"

https://github.com/torvalds/linux/blob/master/kernel/futex/c...


Thanks for the creation! But sorry man, the historic description unfortunately added to the confusion. When I read the original description, I thought it was a user-mode lock and was thoroughly confused. Only after reading the implementation of futex in the kernel, and its usage in the user-mode implementation of pthread_mutex_lock, did I realize futex was not a lock. It's a waiting and signaling operation.


I’ve always imagined a lock as a conditional sleeping operation. What’s the difference?


A lock doesn't have to sleep. The semantics of a lock are that acquisition blocks until the lock is available, and upon unblocking the caller thread has acquired it. The blocking can be sleeping or busy-spinning.

Futex is the lower-level mechanism used to build higher-level locks like mutexes, semaphores, condition variables, etc.


I think what the parent was saying was that, for a futex, the kernel wakes you up when the condition might have been satisfied, but spurious wakes are possible too. (I'm counting the simultaneous waking of multiple threads as spurious wakes here.) That's not the case with locks, which only wake up when the condition is satisfied.


Stupid question: Why isn't such logic incorporated into the call itself?


Good question. It's for performance reasons. Calling into the kernel is expensive. A lock acquired in user mode is much faster than one acquired in the kernel.

A typical lock acquisition using futex looks like:

  (a) while !compare_and_swap(&lock, 0, 1) // see if flag is free as 0, lock it as 1  
  (b)     futex(&lock, WAIT, 1)            // sleep until flag changes from 1  
(a) runs in user mode and is very fast: it's just one CMPXCHG assembly instruction. If the lock is free, it's acquired in that single instruction by setting the flag to 1 (locked).

If the lock is not free, then do the expensive call into the kernel to sleep via futex at (b). Futex() helps in detecting the change of the value while putting the thread to sleep, to avoid hogging the CPU.


Importantly, prior to this (and, hell, even since) the state of the art was to try atomically, then yield (or sleep(0)), then usleep.

The kernel had no idea what was going on, so had no idea how to schedule such a thing. It particularly didn't know to wake you when the lock (which it has no idea about) became available.


It is worth noting here that this "one assembly instruction" is not that cheap. The hardware on a multicore system does have to perform some locking under the hood to execute that instruction. But yes, it still has enough of an advantage over calling into the kernel to justify the additional usage complexity.


And on ccNUMA systems, you end up getting memory contention and huuuuuge memory latencies, as well as for the data residing in the same cacheline. Often we would align and isolate these locks within a cacheline block... which also wastes a lot of space if you have a lot of them (I was consulting on an enterprise app that had millions, I think; it made a big difference in the software's memory footprint).


I'm pretty sure the same construct can be implemented without the compare:

  int lock = 0;

  void AcquireLock(int *lock){
    while (ATOMIC_SWAP(lock, 1)){
      sleep(10); //or futex or w/e
    }
  }

  void ReleaseLock(int *lock){
    ATOMIC_SWAP(lock, 0);
  }


The 'lock' variable is shared among threads. Compare is needed to avoid stomping on the lock acquired by another thread.


No it isn't.

If lock = 1, you set the lock to 1 (aka do nothing).

If the lock is 0, you know it is unlocked and know you succeeded in acquiring the lock.

If many threads try to atomic swap, only one of them gets the zero.

------

The real issue here is the subtle memory barrier bug in that code.


You are right. XCHG can do the job to acquire the lock. At least on x86, XCHG does lock the cache line on the address of the variable, so it should be ok.


Yes, CMPXCHG is a relatively expensive instruction. On x86 it acts as a full memory barrier.


The gettimeofday(2) vDSO is pure-userspace code. Why not, then, a futex vDSO entry that does the while + compare_and_swap in userspace, but then contains a real call to the futex(2) syscall?


This part of the optimisation is well-known and had been for years before futex was invented, so there's no need to provide it as a vDSO or any other special work, userspace can (and was ready to) do that part anyway.

The futex is clever because previously the heavyweight locking was in the kernel, but all you actually needed was this very flimsy wait feature. BeOS, for example, had a "Benaphore" construction where you initialise an expensive heavyweight lock but try the userspace trick before needing to actually take it. So that looks superficially just as good as a futex...

... and then you realise oh, the underlying locks need kernel RAM to store their state, so the OS can only allow each process to have so many of them and thus you can't have so many Benaphores, they were expensive after all. But a futex doesn't use any kernel resources when you aren't calling the OS, Linux doesn't mind if your program uses a billion futexes, that's fine, any particular thread can only be waiting on one of them, and Linux is only tracking the waiting.


> This part of the optimisation is well-known and had been for years before futex was invented, so there's no need to provide it as a vDSO or any other special work, userspace can (and was ready to) do that part anyway.

I guess, to me, the semantics of futex(3) on their own seem ill-formed, without the while loop + cmpxchg being part of them. It feels like the user shouldn't have access to raw "low-level" calls to futex(..., FUTEX_WAIT), since there's only one useful way to use those calls, and it's in combination with other code, where that code is theoretically possible to get wrong. It's an API that doesn't cleanly encapsulate its own concerns.

I suppose I'm used to thinking with "building reusable crypto"-tinted glasses: with crypto libraries, you never want to expose a primitive that needs to be used a certain way to be sound; because people are inevitably going to use it wrong, and that's inevitably going to result in exploitation. Instead, you can just expose a higher-level primitive, that inherently can only be used that one correct way, with no means to ask it to do anything else.

Of course, there's nothing inherently dangerous about calling futex(..., FUTEX_WAIT) without the associated code. (You just get a race condition. But a race condition in userspace doesn't corrupt the kernel or anything.) So I suppose this kind of thinking is meaningless here.


For a long time the user in fact did not have (easy) access to it. glibc started to provide a syscall wrapper only very recently; before that, you had to use syscall() directly.

The reason the CAS is not part of the interface but is left to the implementor is that the specific logic is very much application-specific, and in fact there is not only one way to use it. As discussed elsethread, mutexes are only one of the many applications of futexes.


What you described is exactly what pthread_mutex_lock() does, and it is in user space. Application programmers don't deal with futex directly; they call pthread_mutex_lock/pthread_mutex_unlock.


The non-syscall parts of futex can be inlined, for great benefit.

gettimeofday calls code in the vDSO to avoid codifying in the kernel ABI the much more complex, architecture-dependent, data format.


Futex puts the thread to sleep and wakes it up. Accessing the OS scheduler requires kernel access anyway. The memory access of the flag can be done in user or kernel space.


Incorporating thread-safety into individual calls is not sufficient, because thread-safety is not composable. A sequence of calls to thread-safe operations is not itself thread-safe. Since you need locks in the outermost application level, it would be wasteful and unnecessary to also lock inside each individual call.

This non-composability is why multithreading is so difficult. A library can't abstract away these concerns for you, unless the entire multithreaded operation is one call to the library, which requires the library to know and design for your exact use case. If you want to do something a library didn't plan for, you are required to handle the thread synchronization yourself.


Haskell’s STM offers composable concurrency


I know that account balances were just a "motivating example" here, but please, I beg you: don't handle accounts this way.

Double-entry accounting predates computers by many centuries, and solves a problem which computers still have. Use double-entry accounting internally if you're handling money! This means both the liability and the asset side of the equation are balanced when written, and can be (should be, must be) balanced across both entries as well.


In my mental model of a digital financial ledger, it just looks like a table of transfer events, e.g. {timestamp, from_acct=A, to_acct=B, balance_delta=V}.

Which, yes, tracks the counterparty balances as well as your own. But it's not double entry per se. Each transfer only has its data in one place; there's no second corresponding entry in the table looking like {timestamp, from_acct=B, to_acct=A, balance_delta=-V}. You don't need that second entry; you can already do everything you need with just the one. (E.g. someone's current balance is the sum of the balance_deltas for which they're the receiver, minus the sum of the balance_deltas for which they're the sender.)
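For concreteness, a sketch of the single-row-per-transfer model described above, in C (the struct fields and function are my own hypothetical illustration):

```c
#include <stddef.h>

/* One row per transfer, as described above: no mirrored second row. */
struct transfer {
    long timestamp;
    int  from_acct;
    int  to_acct;
    long delta_cents;   /* amount moved from from_acct to to_acct */
};

/* Balance = sum of deltas received minus sum of deltas sent. */
long balance(int acct, const struct transfer *log, size_t n)
{
    long bal = 0;
    for (size_t i = 0; i < n; i++) {
        if (log[i].to_acct == acct)
            bal += log[i].delta_cents;
        if (log[i].from_acct == acct)
            bal -= log[i].delta_cents;
    }
    return bal;
}
```

A useful invariant of this model is that the balances across all accounts always sum to zero, since every delta appears once as a credit and once as a debit in the computation.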

Are you suggesting there's a point in having rows in corresponding pairs like this? If so, what is it? (If it's data integrity, IMHO that should be guaranteed at a lower level, with cryptographic checksums et al.)


I'm not sure which of us is confused and it doesn't matter: let us say that there is a confusion about double entry, which I used to have and no longer do.

It's exactly having from: and to: in a single entry which makes for double entry accounting. Or one from: and several to: but that's effectively an elision.

By implication there is a second double-entry on the other side, but that's often their business, not mine: as an example, the Income statements in my personal ledger are balanced by an Expenses:Payroll on the other side... spiritually, I neither know nor care about the details of how that accounting is done.

In the special case that both accounts are internal to the business, then yes, expense on one account becomes asset on the other. Whether that is one or two (double) entries depends on a bunch of things.

In terms of specific data modeling, I would urge you that yes, the rows corresponding to any debit or credit should include a 'from' field. The reasons are numerous, and have more to do with audit than data integrity per se.

The sufficient reason is that there might not be a counterparty inside your database.


> Are you suggesting there's a point in having rows in corresponding pairs like this? If so, what is it?

If you've ever read a balance sheet (e.g. on a 10k), those things are much easier to compile with double-entry bookkeeping because everything is already neatly categorized.

But another thing to think about is that accounting is a very old profession, and when you're totaling everything by hand (or using an abacus, adding machine, etc.), mistakes are easy to make. If you notice your debits not matching your credits, you immediately know you messed something up. Making the same mistake twice (and thus not catching it) is much less likely, so double-entry makes your work more reliable.


Yes, I understand the point of the physical equivalent. I was asking specifically if there was a specific point in doing "double-entry bookkeeping" digitally, in the sense of having two transfer-event rows to represent each transaction in your database, rather than one. (Which is something that people do do — I've seen it in some ledger data-models — but it's always felt redundant to me.)

My understanding aligns with the GP's sibling comment: in a digital ledger, having one {txid, sender, recipient, value} row for each transfer of value, is strictly equivalent to having two separate complementary {txid, party, value_delta} rows.


One of those great moments where domain knowledge results in a substantially different system to the intuitive approach :)


Naive question: how does double-entry accounting help computer accounting systems?


Everything is accounted for: you can't accidentally (or 'accidentally') have money come from nowhere or go nowhere, since it must balance. account1 going up by a pound is matched by account2 going down a pound.


(Serious question)

Why don't we see money-out-of-nowhere bugs in banks? How do they ensure such bugs don't happen, even malicious ones?


Well we've done the (more direct) question and answer in reverse here really!

Balancing is one part of the answer (I think it would be fair to say it doesn't necessarily have to be - or at least not obviously look like - double-entry).

You could argue immutable (append-only) records help too - no amendments only additional corrective records.

I've never worked in a bank btw so pinch of salt etc., just have a bit of an interest and fiddle on and off with a related project.


It’s the original CRDT.



While futex_waitv is great, I would have also liked a futex_wakev.


As an optimization? Would futex support in io_uring be enough? https://lwn.net/Articles/857835/


Possibly yes! Edit: but I'm having a deja vu. Did we already have this conversation?!?


Not that I remember :)

I went through that LKML thread and it looks like that particular patch isn't going anywhere, but given the general direction of supporting all syscalls in io_uring, I'm confident it will happen eventually.

Support for wait, in particular, would be strictly more powerful than waitv: you could wait for futexes and IO (or any other completions) at the same time!


It would. On the other hand, I never had a need to wait for futexes and IO at the same time. If I need an internal wakeup while blocked on a poll call, I would use an eventfd. YMMV.

In any case generalized support for all or most syscalls in io_uring can only be a good thing. Especially if we get eBPF support for chaining.

I'm also quite excited for the new intel instruction to send IPIs from userspace. That would be great for non blocking io_uring.


Windows has had this one for a while.


Relevant that Valve is sponsoring this for Wine's benefit


I don't think Windows has the equivalent of futex2, waiting on multiple addresses. WaitOnAddress() in Windows is the equivalent of futex, waiting on a single address.

WaitForMultipleObjects() in Windows is for acquiring multiple locks, not for waiting on arbitrary addresses. The parameters to WaitForMultipleObjects() are mutex/semaphore/event/etc. handles. When WaitForMultipleObjects() returns, ownership of the locks has actually been acquired. E.g. for a semaphore, the count is decremented by WaitForMultipleObjects().

Futex2 is more generic where it just waits on multiple arbitrary addresses. The locking is done separately.


The windows equivalent of futex_wait is WaitOnAddress. I don't think it supports waiting on multiple addresses, but I'm not a WIN32 programmer.


Yeah from the 10 mins I spent in the docs I do not think there is one baked in. You would probably have to roll your own. It would not be pretty. https://docs.microsoft.com/en-us/windows/win32/api/synchapi/...


WaitForMultipleObjects. And it isn't from the Windows heritage but from VMS.


WFMO is equivalent to select. This is different.


The article is nice, aside from the buggy and suboptimal implementations of the mutex.

But I would like to stress that futexes are not just useful for implementing mutexes (nor is that even their most interesting feature). They are a generalized waitqueue abstraction that can be used for all kinds of signaling primitives. Also, the nice thing about futexes is that they do not consume resources when they are not in use: you could implement a fast-pathed mutex on top of, say, an eventfd, but allocating a file descriptor for each mutex can potentially run out of resources quickly.


The mutex example was purely for didactic purposes, and definitely not a complete mutex. It was meant to help demonstrate the basics of how a mutex works. Blog post has been updated to clarify this.


I didn't really get the example, futex is itself a syscall, so how does implementing mutex in terms of futex help you create a "fast userspace" mutex? The example presented looks to do almost nothing in userspace. Am I missing something?


mutex_lock makes 0 system calls in the fast case (when cmpxchg succeeds on the first try). mutex_unlock makes 1 system call always, but TFA mentions that it can be optimized by adding some user-space complexity to check if signalling might be required (if you can demonstrate zero waiters, then the FUTEX_WAKE is not needed).


The example really should have fleshed this out, because most of the point of the feature is that you can have a pathway that involves no kernel operations in the common case. The rest of its point is that if you do end up waiting, the kernel scheduler knows about it.


Yes, usually you keep a (saturating) waiter count, at the limit just one additional bit.
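That waiter-tracking scheme is the one popularized by Drepper's "Futexes Are Tricky" paper: a third state means "locked, possibly contended", and unlock only pays for the FUTEX_WAKE syscall when that state was seen. A minimal Linux/C11 rendering (my own sketch, not the article's code):

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* States: 0 = unlocked, 1 = locked (no waiters), 2 = locked, maybe waiters. */
static long futex(atomic_uint *addr, int op, unsigned int val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

void mutex_lock(atomic_uint *m)
{
    unsigned int c = 0;
    if (atomic_compare_exchange_strong(m, &c, 1))
        return;                        /* fast path: no syscall at all */
    if (c != 2)
        c = atomic_exchange(m, 2);     /* announce possible waiters */
    while (c != 0) {
        futex(m, FUTEX_WAIT, 2);       /* sleep while state is still 2 */
        c = atomic_exchange(m, 2);
    }
}

void mutex_unlock(atomic_uint *m)
{
    if (atomic_exchange(m, 0) == 2)    /* saw "maybe waiters"? */
        futex(m, FUTEX_WAKE, 1);       /* only then pay for the syscall */
}
```

In the uncontended case, both lock and unlock complete with a single atomic instruction and zero syscalls.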


When the CAS (compare-and-swap) succeeds in user space, the subsequent futex call to wait will not be made, I suppose.


What are some use cases for having so many futexes that would exceed the number of available file descriptors? Are you assuming a low FD limit, or do you really mean millions/billions of futexes?


The more futexes you have, the less contention there ought to be for each. In principle. In practice, of course, you often end up with contention for certain items anyway, but it is hard to know in advance which. What matters for this case is that all the unlocked ones hardly cost anything, especially if you have some use for the other bytes in the cache line each is in (or if "false sharing", with multiple of them in the same cache line, is OK). The greater the ratio of threads or cores to locks, the less you probably worry about false sharing or contention.

But it is not a very good design for throughput, because any given lock is in general very unlikely to live in the cache of the core that wants it, and getting it there takes a long time. If, as is common, a tiny subset of the data set is "hot" and could fit in the sum of the caches available, you would like for any hot item to only ever be in one cache, with only the core that cache belongs to ever looking at it. If you can ensure this, then you don't need locks for those. If you can mostly ensure it, the locks will anyway already live in the right cache and will mostly not be contended for.

So, if you do have a hot subset of big data, it may be better to treat the hot subset specially, falling back to simple, general, maybe slow methods for other items until they become hot and are moved to the hot section and assigned to a specific core.

If there is no hot subset, you want to anyway avoid having cores waiting around for items (with their locks) to be fetched into cache so you can lock and use them. You can issue "pre-fetch" instructions for a series of items a core is interested in, while it works on others, and hope that by the time you get back to each it has been fetched into cache and is ready to be worked on immediately; and that no other core will want it right away.
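The prefetch idea in the last paragraph might look something like this (a hypothetical sketch using the GCC/Clang __builtin_prefetch intrinsic; the lookahead distance is a made-up tunable):

```c
#include <stddef.h>

struct item { unsigned int lock; long payload; };

/* Walk a list of pointers to scattered items, prefetching a few items
 * ahead so each is (hopefully) already in cache when we reach it. */
long process_all(struct item *const *items, size_t n)
{
    const size_t ahead = 8;           /* hypothetical lookahead distance */
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + ahead < n)
            __builtin_prefetch(items[i + ahead]);
        sum += items[i]->payload;     /* stand-in for locking + real work */
    }
    return sum;
}
```

The prefetch is a hint only: if another core grabs the item first, correctness is unaffected, you just pay the cache miss after all.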


I'm pretty sure the mutex_lock and mutex_unlock are bugged, as they take a mutex int by value. Would be unfortunate if someone copy-pasted those.


As mentioned above, the mutex example was purely for didactic purposes, and definitely not a complete mutex. It was meant to help demonstrate the basics of how a mutex works. Blog post has been updated to clarify this.


care to explain more?


Not GP, but the code in the article passes 'unsigned int mutex' by-value, and then operates on its address (which will potentially be at a different stack address on each invocation, and will certainly be different for different threads.) It won't serve its purpose at all.

To my eye, the code is fundamentally broken in a way that's surprising for an article trying to teach threading, and it's likely to confuse someone unfamiliar with these concepts.


The articles code for mutex_lock is this

  mutex_lock(unsigned int mutex)
  {
    while (!cmpxchg(&mutex, UNLOCKED, LOCKED))
      futex(&mutex, FUTEX_WAIT, LOCKED);
  }
Now consider what happens if someone calls it

  volatile unsigned int my_mutex = UNLOCKED;
  mutex_lock(my_mutex);
  ...
  mutex_unlock(my_mutex);
my_mutex is copied into the mutex parameter. The mutex parameter is updated to locked. mutex_lock exits, which destroys the mutex variable. my_mutex still has the value UNLOCKED.
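The fix is to take the mutex by pointer, so every caller and every thread operates on the same shared word. A sketch with C11 atomics and the raw syscall (my own rendering, since the article's cmpxchg/futex helpers aren't shown here):

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

#define UNLOCKED 0u
#define LOCKED   1u

static long futex(atomic_uint *addr, int op, unsigned int val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

void mutex_lock(atomic_uint *mutex)        /* by pointer, not by value */
{
    unsigned int expected = UNLOCKED;
    while (!atomic_compare_exchange_strong(mutex, &expected, LOCKED)) {
        futex(mutex, FUTEX_WAIT, LOCKED);  /* sleep while still LOCKED */
        expected = UNLOCKED;               /* CAS clobbered expected */
    }
}

void mutex_unlock(atomic_uint *mutex)
{
    atomic_store(mutex, UNLOCKED);
    futex(mutex, FUTEX_WAKE, 1);
}
```

Now all threads share one flag in memory, and the CAS/wait loop behaves as intended.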


Good catch, corrected!


One more...

> We use &mutex as the first argument


ops, fixed, thanks!


Without having looked into any details I assume the parent is saying that the integer passed by value is a flag that is checked periodically and used to trigger a next step or to release a lock.

If it's passed by value it will never change inside the function and whatever is supposed to happen when it changes will never get executed. The integers should be passed by reference so when they change the function receiving them is aware of the update.


Most importantly, wake and wait are keyed by address. The waker must provide to futex_wake the same address that the waiter passed to futex_wait in order to wake it.


Neat timing, was just running gdb on MongoDB and saw it uses futexes.

Was debugging why the instances were hanging... sigh


That doesn't imply MongoDB mentions futexes anywhere in its source code. Applications that use pthreads for their multithreading (which is the vast majority of them) will also show futex syscalls.


Good point!


Wow so poorly written. Hope author does a proof read for part 2. Mutual exclusion is tricky to follow and the bad grammer doesn't help.


> and the bad grammer

I'm sorry but I have to ask; was this intentional?


That's not bad grammar but bad spelling.


There's a law about this. Can't recall the name.


That web site name looked familiar, it's one I used to get tons of recruiter spam from. I added it to my email server's block list some time ago.


You're probably thinking of https://www.collabera.com/

Note the 'e' instead of the 'o'..."bera", not "bora".


You're right thank you.


That's grossly untrue. We never send recruiting spam.


tyingq pointed out my mistake, I was thinking of a very similar web site name.


Thanks.



