Yes, I'm probably wrong on the semantics of lock-free. I should read up on that. My point was that maybe you can gain a factor of 2 by using atomic ops in a "lockless" fashion compared to mutexes, but it still scales extremely poorly compared to what you could do if the hardware had real async mechanisms that did not rely on cache coherency. The cache line that is subjected to an atomic operation gets invalidated in every other core's cache, so it bounces back and forth between cores.

A simple example is just letting all cores in a multi-core CPU do an atomic add on the same memory address. The speedup will be about 1x no matter how many cores are available, because every add has to wait its turn for the same cache line. I realize this is very off topic with respect to this cool article and implementation in Zig. It's just that this problem can't really be solved in an efficient manner with today's hardware.
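
For anyone who wants to try the atomic add experiment, here is a minimal sketch in Rust (not from the article; the function name `contended_adds` and the thread and iteration counts are just my assumptions for illustration). Every thread hammers one shared counter, and the program prints the total adds per second for each thread count. The timing includes thread startup, so treat the numbers as rough, and the result will vary by hardware.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: spawn `threads` workers that each perform `iters`
/// atomic adds on a single shared counter. Returns the final counter value
/// and the elapsed wall time (thread spawn and join included).
fn contended_adds(threads: usize, iters: u64) -> (u64, Duration) {
    let counter = Arc::new(AtomicU64::new(0));
    let start = Instant::now();

    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..iters {
                    // All threads target the same address, so the cache line
                    // holding `counter` is contended by every core.
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    (counter.load(Ordering::Relaxed), start.elapsed())
}

fn main() {
    let iters = 1_000_000; // assumed workload size per thread
    for threads in [1, 2, 4, 8] {
        let (total, elapsed) = contended_adds(threads, iters);
        let mops = total as f64 / elapsed.as_secs_f64() / 1e6;
        println!("{threads} threads: {total} adds in {elapsed:?} ({mops:.1} M adds/s)");
    }
}
```

If the claim above holds on your machine, the adds-per-second figure should stay roughly flat (or drop) as the thread count goes up, instead of growing with it.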