“Once” one-time concurrent initialization with an integer

mtklein · on Aug 2, 2023

My only major objection to C++ magic statics is that you can only use them for statics. Sometimes you find yourself with a task that needs to be done once per object instead of globally, and you've got to roll your own, and then you might as well use that same mechanism for globals too.

My minor objection to magic statics is that you don't have control over the size or packing of the control word, and it usually ends up being something awkwardly mid-sized like a size_t where normally you'd want it either as small as possible or cache-line sized.

I like to handle this something like this:

     void once(char _Atomic *flag, void (*fn)(void *ctx), void *ctx) {
         char st = atomic_load_explicit(flag, memory_order_acquire);
         if (st == 0 && atomic_compare_exchange_strong_explicit(flag, &st, 1,
                                                                memory_order_acquire,
                                                                memory_order_acquire)) {
             fn(ctx);
             atomic_store_explicit(flag, st=2, memory_order_release);
         }
         while (st != 2) { st = atomic_load_explicit(flag, memory_order_acquire); }
      }

There's a commented version here: https://github.com/mtklein/best/blob/main/once.c

Arnavion · on Aug 2, 2023

You can use std::once_flag and std::call_once from <mutex> instead of rolling your own atomic solution.

memefrog · on Aug 2, 2023

What does C++ have to do with this submission?

latchkey · on Aug 1, 2023

This immediately made me think of 'double checked locking', which was badly broken in early Java and then fixed later. Reading the wikipedia again after many years, it mentions a lot of the same things in this article.

https://en.wikipedia.org/wiki/Double-checked_locking

Arnavion · on Aug 2, 2023

Yes, double-checked locking is broken if the code does not use fences to enforce the order of checking the initialized flag and getting the value. TFA uses atomic operations to do that.

Groxx · on Aug 2, 2023

To spell it out a bit more for other readers, this article is doing effectively this:

    get() {
      if needsinit_or_wait_if_initting() { // lock (atomics) then check
         thing = init_thing()
      }
      return thing
    }

Which is fine because the check always occurs before reads or writes.

The classic double-checked lock mistake looks something like this, when following this example:

    get() {
      // "fast" path: get thing if it exists, because
      // the code looks like that means it's done, right?
      // no check because checks are slow and we move fast!
      if thing != nil { // first check
        return thing
      }

      // slow path: use slow stuff which is slow :(
      if needsinit_or_wait_if_initting() { // lock then second check
         thing = init_thing()
      }
      return thing
    }

It's that un-guarded see-if-nil-and-return that makes it unsafe, because the assignment is not guaranteed to be "complete" in many cases. There is no explicit relationship between the initialization, assignment, and read, so the compiler can do stuff that doesn't match what the code looks like, e.g. reorder "save the new memory address to `thing`" before "init the thing" (so you temporarily have an uninitialized non-nil object), and at runtime you can get stuff like "only the first half of the memory synced to that CPU core so you got a busted object".

gpderetta · on Aug 2, 2023

The double checked locking is perfectly fine as long as the fast path 'thing' access has the right semantics (in C and C++ you need an atomic load with consume ordering).

comex · on Aug 2, 2023

> The obvious choice is a compare-and-swap atomic, which will fail if another thread has already made the transition. However, with a more careful selection of state representation, we can do this with just an atomic increment!

But why? Atomic increments and compare-and-swap have the same performance on most hardware, and this is a cold path so it wouldn't matter if they didn't. Using a compare-and-swap you would only need one byte to store your "once" object rather than a whole int.

BeeOnRope · on Aug 2, 2023

Yes because of the "it's the cold of path" aspect. That said, CAS is often considerably slower than an atomic increment replacement even if the usual sources report equal throughout because the dependency chains to both the output value and flags (i.e., the "succeeded?" indicator) usually involves a load of the value, and an operation to manipulate it prior to using it in the CAS.

So even if a CAS never fails it can end up half a dozen cycles slower if this chain matters (and it often does). That aspect fails to be captured by say uops.info which just tries to isolate the effect of the instruction alone.

Of, course, the even more important reason to use unconditional ops on a contended path if you can is that they don't fail: CAS retries will be significant in that case (but this doesn't apply here because "one off cold path").

inetknght · on Aug 2, 2023

> Using a compare-and-swap you would only need one byte to store your "once" object rather than a whole int.

But after alignment requirements of anything around it, it will likely be the size of the platform's natural int anyway

dataflow · on Aug 1, 2023

I think this assumes the first thread will actually succeed. Don't forget complications if you need failure/exception handling...

jeffbee · on Aug 2, 2023

An implementation of call_once that accommodates callbacks that throw: https://github.com/abseil/abseil-cpp/blob/master/absl/base/c...

gpderetta · on Aug 1, 2023

See also N2660[1],a.k.a. "Dynamic Initialization and Destruction with Concurrency", a.k.a. C++ Magic Statics, for a similar feature built-in the language.

The paper has in the appendix a description of a possible implementation (which I believe is used by all major compilers).

[1] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n26...

memefrog · on Aug 2, 2023

A similar feature built into a completely different language.

gpderetta · on Aug 2, 2023

C and C++ are hardly completely different.

And C could adopt magic statics without any particular issue.

memefrog · on Aug 2, 2023

Java and JavaScript are more closely related than C and C++ today. They are completely different.

gpderetta · on Aug 2, 2023

Aside for sharing a memory model (concurrent and whatnot), the vast majority of C++ specific constructs have a straightforward if not trivial mapping to C.

C and C++ are very similar in all the aspects that count.

KerrAvon · on Aug 1, 2023

I would be wary of copying this code and using it in my own project. Generally speaking, it's best to avoid rolling your own thread primitives. Even the smartest professionals sometimes get it subtly wrong (and sometimes it only becomes clear at scale via analytics and studying the data, luxuries you probably don't have). Use the battle-tested primitives: pthread_once, whatever the C++ equivalent is, or dispatch_once on the Mac.

cjensen · on Aug 2, 2023

The C++ equivalent is... it just works. Static variables in a function are initialized exactly once the first time a function is entered per the language definition.

This is one of the many great improvements of C++11

variadix · on Aug 2, 2023

Except you pay this cost (locks + codegen) even for PODs with known data at compile time.

cjensen · on Aug 2, 2023

No. PODs that are initialized with no side-effects can be optimized away into compile-time initialization. E.g. if you have an integer that is static inside a function, it can just be initialized old-school.

memefrog · on Aug 2, 2023

What a pointless comment. If you are doing anything with threads it is far more likely you have a bug in the way you use mutexes than that there is a bug in this code.

rcme · on Aug 2, 2023

What about simply making pthread_once take a void *opaque parameter that is passed in to the function? You could use this to manually construct a closure.

kelnos · on Aug 2, 2023

You can't "simply" change the POSIX threads specification...

rcme · on Aug 2, 2023

You could implement my_pthread_once pretty easily, though. Should be around 5 lines of C?