Lock elision in the GNU C library (lwn.net) | 71 points by AndreyKarpov on Feb 6, 2013 | 15 comments



I think TSX is one of the most exciting architectural advancements to come around in a long time. There are lots of situations where memory could theoretically be modified by multiple threads simultaneously, but where conflicts will almost never occur in practice. Applications tend to be coded to avoid writes to the same memory from multiple cores not just for correctness reasons but because such writes cause cache lines to bounce between cores. However, because of the theoretical possibility of simultaneous access, you still have to use a relatively heavyweight lock to ensure correctness.

TSX promises to lower the overhead dramatically in such cases. I don't think it's clear yet how it will be implemented, but Dave Kanter over at RWT has speculated it will leverage the cache coherence protocol: http://www.realworldtech.com/haswell-tm.

To support cache coherency, cores already communicate with each other about reads and writes to cache lines. Multiple cores can have a line in their cache, but if any core modifies its copy, it has to send a message to the other cores signaling that their old copies are invalid (somewhat more technical description: http://www.di.unipi.it/~vannesch/SPA%202010-11/Silvia.pdf).

Kanter speculates that TSX uses this existing mechanism to cheaply support transactions. During a transaction, each core keeps track of the cache lines it has read from or written to. If it gets a message that another core has written to a line it read from, or has read from a line it wrote to, it aborts the transaction. Otherwise, the transaction commits successfully. This theoretically allows transactions that don't conflict to be handled completely locally to the core.


This only seems to help the uncontended case though; under contention it sounds like things will be just as expensive as before (or even more so).

Correct me if I'm wrong but this sounds like "shared-state multithreading will be faster, as long as you don't have any parallelism."

There is no free lunch in shared-memory architectures. But there are lots of techniques for addressing the problem. One of my favorites these days is to keep a set of states, one per CPU, that the code on each CPU writes to in its fast path, and that can then be periodically merged into a global state.
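
A minimal C sketch of that pattern, assuming a fixed number of slots and some way to map the calling thread to one (NCPUS and the cpu argument are placeholders here, e.g. filled in from sched_getcpu()):

  /* Each slot gets its own cache line so fast-path writes never bounce
     lines between cores; a reader merges the slots on demand. */
  #include <stdatomic.h>

  #define NCPUS 64

  struct percpu_counter {
      _Alignas(64) atomic_long count;   /* padded to a cache line */
  };

  static struct percpu_counter counters[NCPUS];

  /* Fast path: touch only this CPU's line. */
  void local_increment(int cpu)
  {
      atomic_fetch_add_explicit(&counters[cpu].count, 1,
                                memory_order_relaxed);
  }

  /* Slow path, run periodically: merge the per-CPU states. */
  long global_total(void)
  {
      long total = 0;
      for (int i = 0; i < NCPUS; i++)
          total += atomic_load_explicit(&counters[i].count,
                                        memory_order_relaxed);
      return total;
  }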


That's exactly right. But the contended case is inherently non-parallel anyway. The goal here isn't to magically make serial-or-near-serial algorithms into parallel ones. It's to eliminate the needless synchronization that happens in correctly-implemented parallel algorithms.

And I agree with the grandparent: this is hugely exciting.


Rather, it's "shared-state multi-threading will be faster as long as you don't have any mutations to shared memory."

Shared state isn't the problem--mutable shared state is the problem. That's what kills you from a cache point of view. But the mere possibility of mutations to shared state shouldn't needlessly make things more expensive when those mutations are rare or non-existent in practice. That's where TSX comes in.


Ah, I think I see. This would be useful for shared state that is read much more often than it is written. That seems to position it mainly as a competitor to RCU, which is likewise designed to have cheap reads and more expensive writes. But compared to RCU, both reads and writes would be faster with TSX (and in particular, uncontended writes would be especially faster).


The idea behind TSX was originally called speculative lock elision, and comes from this MICRO 2001 paper [1]. The paper provides a more coherent motivation of SLE/TSX than the Intel marketing literature, and contains some concrete (simulated) performance results.

[1] http://pages.cs.wisc.edu/~rajwar/papers/micro01.pdf


I see. So if I may paraphrase, TSX can also help write workloads by eliminating locks that weren't actually needed because no actual contention over the same data occurred. This can happen if the lock was too coarse-grained and the writes were actually to different memory, or if the dynamically-executed path didn't actually mutate shared memory.

Both of these are problems you can code around, but doing that would likely require greater effort and complexity.

It's unclear to me whether recovering from mis-speculation is more expensive than acquiring the lock up-front. In other words, is there overhead in the case that you almost always encounter actual contention?

It's also unclear to me how this can be done without instruction set support, as the paper claims. Without having the critical sections explicitly marked how can this be implemented?


> I see. So if I may paraphrase, TSX can also help write workloads by eliminating locks that weren't actually needed because no actual contention over the same data occurred. This can happen if the lock was too coarse-grained and the writes were actually to different memory, or if the dynamically-executed path didn't actually mutate shared memory.

Yes, those two cases are what I'd consider secondary benefits, that could be achieved straightforwardly (but not easily, as you note) without TSX. The primary case -- when the lock is fine-grained, and thread A does mutate a value that thread B does eventually read, but B's read doesn't come until A is already out of the critical section -- is the big win for well-designed software, IMO.

> It's unclear to me whether recovering from mis-speculation is more expensive than acquiring the lock up-front. In other words, is there overhead in the case that you almost always encounter actual contention?

On big x86 cores, there shouldn't be much performance overhead. x86 already has mechanisms for holding tons of in-flight memory operations in the core without releasing and/or committing them to the rest of the memory system. TSX simply specifies a new reason for holding them ("we're in the middle of an SLE transaction"). If your transaction is aborted because another core touched the lock, you can flush that buffer of memory ops and roll back to a checkpointed register file almost instantly. The paper describes a hardware predictor that decides whether or not to initiate a transaction on a given store; in any smart implementation, this predictor would incorporate information like "have I incurred expensive rollbacks due to transaction aborts on this store?" In that way, a smart implementation should tune itself to yield arbitrarily low average- or worst-case overhead over the non-TSX case.

> It's also unclear to me how this can be done without instruction set support, as the paper claims. Without having the critical sections explicitly marked how can this be implemented?

(No pedantry intended, I'm sure you already read this) It's explained in Section 3.3. They look for this pattern:

  Initially, MEM[x] = A
  STORE MEM[x] = B
  ...
  ...
  STORE MEM[x] = A (original value)
They consider anything between these "silent store pairs" to be a potential critical section, and the predictor can choose to try to elide the two stores and form a transaction, if it wishes. It is always OK to elide these two stores, whether it's a "real" critical section or not, as long as nobody tries to read MEM[x] in the middle (in the memory model they assume, anyway).

You definitely require ISA support if:

1) Your silent stores may have other side effects (I/O?)

2) You have synchronization primitives that don't fit the silent store pattern (ticket lock?)

3) You have a strong business interest in making the behavior of existing software, even at the microarchitectural level, as predictable as possible (Intel?)


Lock elision changes the definition of "uncontended", though. Before, "uncontended" meant "you don't have any parallelism", as you suggested. With lock elision, "uncontended" means "doesn't have multiple accesses to the same data, or they're all reads"; that allows quite a bit of parallelism on most data structures.


Am I the only one with a weird enough mind to have pictured Ellison (Larry) locked in the GNU C Library? Perhaps chained in a buffer with a never ending loop of "join us now and share the software"...

Very interesting article in any case.


I did that too. Weird.


HLE is a pretty nifty implementation (note: I work for Intel) and an interesting use of CISC.

IA instructions can optionally carry prefixes, such as LOCK, address- and operand-size overrides (use 16, 32, or 64 bits), and repeat (REP). For example, in Linux, a single "REP STOS" (repeat store string) instruction is used to zero out the kernel's BSS region. REP can also be added to instructions that don't use it, and the CPU will simply ignore it--this is defined behavior.

For HLE, using REP in front of certain instructions on a spinlock will begin or end a memory transaction without actually writing the spinlock value. Of course, on older CPUs the prefix will be ignored and the spinlock read/write will occur. On CPUs that do support HLE, if another memory agent accesses the spinlock, the transaction will be aborted.
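
As a rough sketch of how that looks from C (assuming GCC 4.8+ and -mhle, which expose the prefixes as HLE flags on the atomic builtins; on older CPUs the prefixes are ignored and this degrades to an ordinary spinlock):

  #include <immintrin.h>   /* _mm_pause() */

  static int lockvar = 0;

  static void hle_lock(void)
  {
      /* XACQUIRE-prefixed exchange: on HLE-capable CPUs this starts a
         transaction and elides the write of 1 to the lock word. */
      while (__atomic_exchange_n(&lockvar, 1,
                                 __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
          _mm_pause();   /* lock looked taken, or our transaction aborted */
  }

  static void hle_unlock(void)
  {
      /* XRELEASE-prefixed store of the original value commits the
         transaction; if we really held the lock it's a normal store. */
      __atomic_store_n(&lockvar, 0,
                       __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
  }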


This is about using the Intel "Transactional Synchronization Extensions" (TSX) from the upcoming Haswell processors to provide a "fast path" to lock code. The code first attempts to just do what it wants, but within a memory transaction. If the memory transaction cannot commit because there is a conflict from another core or thread, it falls back to the regular slow path with the locks. If applied at the right granularity, almost all of your atomic operations should succeed on the TSX path and you will rarely invoke the heavier lock-based code.
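
A rough sketch of that shape using the RTM half of TSX (the _xbegin/_xend intrinsics from immintrin.h, compiled with -mrtm; the toy spinlock and counter here stand in for whatever the lock actually protects):

  #include <immintrin.h>   /* _xbegin, _xend, _xabort, _XBEGIN_STARTED */
  #include <stdatomic.h>

  static atomic_int lock;        /* 0 = free, 1 = held (toy spinlock) */
  static long shared_counter;

  void increment(void)
  {
      if (_xbegin() == _XBEGIN_STARTED) {
          /* Fast path: inside a transaction. Reading the lock puts it in
             our read set, so a real lock-taker will abort us. */
          if (atomic_load(&lock))
              _xabort(0xff);
          shared_counter++;
          _xend();               /* commit: nobody touched our lines */
          return;
      }
      /* Slow path: the transaction aborted or couldn't start, so take
         the real lock. */
      while (atomic_exchange(&lock, 1))
          ;
      shared_counter++;
      atomic_store(&lock, 0);
  }

(A real implementation would retry the transactional path a few times before falling back to the lock.)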

There is talk of performance measurement, but no numbers in the article, probably because Haswell hasn't been released yet.

http://software.intel.com/en-us/blogs/2012/11/06/exploring-i...


Interesting. Haskell (or at least GHC) uses software transactional memory pretty heavily. They currently use LLVM as the back end, though, so it won't benefit from these changes in glibc. I wonder if whatever library is used there by default will start to incorporate these features anytime soon?


The libc one uses is orthogonal to the compiler, modulo support for any language extensions the libc might use. LLVM supports glibc just fine, and will use it by default on Linux systems.

That said, whether GHC can make more use of glibc's TSX support (as opposed to generating TSX instructions itself) than any arbitrary C program depends on any impedance mismatch that may exist between the elided lock wrappers and GHC's own STM semantics. No idea if GHC has a way to reason about libc-managed locks.



