nsync has wrapper macros for all the various atomics libraries, and those wrappers prevented it from using two things.
1. Weak CAS. nsync upstream always uses strong CAS to keep the portability abstraction simple. Being able to use weak CAS where appropriate avoids generating an additional retry loop.
2. Updating the &expected parameter. nsync upstream always performs another relaxed load manually when a CAS fails. That isn't necessary with the C11 atomics API, because on failure it stores the observed value into *expected for free.
Being able to exploit those two features resulted in a considerable improvement in nsync's mu_test.c benchmark for the contended mutex case, which I measured on a Raspberry Pi 5.
C++ insists on providing a generic std::atomic type wrapper. So despite my type Goose being almost four kilobytes, std::atomic<Goose> works in C++.
Of course your CPU doesn't actually have four-kilobyte atomics, so this feature just wraps the type with a mutex. In that sense you're correct that atomics "use pthreads": they grab a mutex to wrap the type.
C++ also provides specializations, and indeed it guarantees specializations with certain features for the built-in integer types. On real computers you can buy today, std::atomic<int> is a specialization that is not in fact a mutex wrapper around an ordinary int; it uses the CPU's atomic instructions to provide an actual atomic integer.
In principle, C++ only requires a single type, std::atomic_flag, to be an actual bona fide atomic rather than just a mutex wrapper -- all the integers and so on might be guarded by a mutex. In practice, on real hardware, the obvious atomic types are "lock free"; std::atomic<Goose> and other ludicrous things are not.
I am also curious about this, and about the ambiguity of "AARCH64". There are 64-bit ARM ISA versions without single-instruction atomic primitives; on those, what looks like an atomic op is actually a retry loop with potentially unbounded runtime. The original AWS Graviton CPU had this behavior. The version of the ISA that you target can have a significant performance impact.
It depends on which atomics. In principle most of them should map to an underlying CPU primitive, falling back to a mutex only if the platform doesn't support that operation.
> At least in Linux, C++11 atomics use pthreads (not the other way around).
I have no idea what you can possibly mean here.
Edit: Oh, you must have meant the stupid default for large atomic objects, which just hashes them to an opaque mutex somewhere. An invisible performance cliff like that isn't a useful feature; it's a useless footgun. I can't imagine anyone serious about performance using this thing (that's why I always static_assert on is_always_lock_free for my atomic types).
Curious about this -- so what do C11 atomics use under the hood? At least on Linux, C++11 atomics use pthreads (not the other way around).