Are you sure that gets rid of deadlocks? Instead of taking a lock, it loops on an atomic variable when there is contention. Granted, you have more control over task priorities and it can be more efficient, if you are smart enough (it still gets very tricky), but in essence you are rolling your own locks with atomic variables.
There's a slight difference between rolling your own spinlocks and simply doing something like a CAS loop to change a variable. They’re usually both built using similar atomic primitives, but…
The difference is that in a true lockless algorithm, no thread "holds" the lock (because there is no lock). Therefore, if a thread gets scheduled out while in the middle of any operation, every other running thread can still continue working. That’s in contrast to a locking implementation where it’s very possible for a thread to get swapped out while holding a lock, thus causing every other thread to spin endlessly until the scheduler ends up swapping the original thread back in so it can release the lock.
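To make that concrete, here is a minimal C++ sketch (illustrative names only, plain std::atomic) contrasting the two on a shared counter; the spinlock's progress depends on the flag's current owner staying scheduled, while the CAS loop has no owner at all:

    #include <atomic>

    // Hand-rolled spinlock: one thread "owns" the flag. If that thread is
    // descheduled while it owns the flag, every waiter spins uselessly
    // until the owner runs again and clears it.
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
    long guarded_counter = 0;

    void increment_with_spinlock() {
        while (flag.test_and_set(std::memory_order_acquire)) {
            // busy-wait: progress depends on the current owner being scheduled
        }
        ++guarded_counter;                 // critical section
        flag.clear(std::memory_order_release);
    }

    // Lock-free CAS loop: nothing is ever "held". A thread descheduled in
    // the middle of the loop blocks nobody; it just retries when it runs.
    std::atomic<long> lockfree_counter{0};

    void increment_lock_free() {
        long expected = lockfree_counter.load(std::memory_order_relaxed);
        while (!lockfree_counter.compare_exchange_weak(
                    expected, expected + 1, std::memory_order_relaxed)) {
            // compare_exchange_weak refreshed 'expected'; try again
        }
    }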
Avoiding that pathology is one key advantage of using the OS-supplied mutex facility rather than rolling your own spinlocks. The OS mutex facility can see that a thread is blocked on the mutex while the thread holding it is not currently running, so it might as well schedule the holding thread again so that it can release the lock (nothing else can get done until that happens anyway).
I believe this is why, in my personal experience, userspace spinlocks perform better than mutexes up until roughly the point where you have many more threads than CPUs. My guess is that the probability of a thread getting scheduled out while holding the lock increases with both the number of threads and the overall contention for the lock.
There is also the lock-free circular buffer, where the storage is a linear array and the indexes are atomic variables. Once the buffer is full, the indexes act like locks: the producer can't insert anything until the consumer removes entries, so the 'locks' end up playing the crucial role in maintaining the data structure.
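For example, a minimal single-producer/single-consumer sketch of that idea in C++ (names and layout are illustrative, not from any particular library):

    #include <atomic>
    #include <cstddef>

    // Single-producer/single-consumer ring: the two atomic indexes are the
    // only shared state. When the buffer is full, push() fails until the
    // consumer advances head_, so the indexes gate progress as described.
    template <typename T, std::size_t N>
    class SpscRing {
        T buf_[N];
        std::atomic<std::size_t> head_{0};  // next slot to read  (consumer side)
        std::atomic<std::size_t> tail_{0};  // next slot to write (producer side)

    public:
        bool push(const T& v) {             // called by the producer only
            std::size_t t    = tail_.load(std::memory_order_relaxed);
            std::size_t next = (t + 1) % N;
            if (next == head_.load(std::memory_order_acquire))
                return false;               // full: wait for the consumer
            buf_[t] = v;
            tail_.store(next, std::memory_order_release);
            return true;
        }

        bool pop(T& out) {                  // called by the consumer only
            std::size_t h = head_.load(std::memory_order_relaxed);
            if (h == tail_.load(std::memory_order_acquire))
                return false;               // empty: nothing to remove yet
            out = buf_[h];
            head_.store((h + 1) % N, std::memory_order_release);
            return true;
        }
    };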
I implemented an IPC message queue for a router company 15 years ago.
Passing messages between different threads/processes is not "rolling your own locks with atomic variables".
Basically, each thread/process that needs to receive messages creates a MsgQ with a given msgQLen and gets back a Q_id.
The msgQ_Send() API uses atomic_add() to get the next slot in the msgQ where the value and/or MsgPtr is placed.
There is no lock needed, only atomic_add().
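A simplified sketch of that slot-claim idea in C++ (not the actual library code; queue-full handling and receiver wake-up are omitted here):

    #include <atomic>
    #include <cstdint>

    // Sketch only: fetch_add hands each sender a unique slot, so no lock is
    // taken. Full-queue handling and notifying the receiver are left out.
    struct Msg {
        void*             msgPtr = nullptr;
        uint32_t          value  = 0;
        std::atomic<bool> ready{false};   // lets the receiver see a completed write
    };

    struct MsgQ {
        Msg*                  slots   = nullptr;  // ring of msgQLen entries, one queue per receiver
        uint32_t              msgQLen = 0;
        std::atomic<uint32_t> writeIdx{0};

        void send(void* msgPtr, uint32_t value) {
            uint32_t idx = writeIdx.fetch_add(1, std::memory_order_relaxed) % msgQLen;
            slots[idx].msgPtr = msgPtr;
            slots[idx].value  = value;
            slots[idx].ready.store(true, std::memory_order_release);  // publish the slot
        }
    };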
Performance-wise, it works very well in SMP environments, and I ported the library to vxWorks, the Linux kernel, Linux user space, and QNX.
I have tons of validation tests on the correctness of message order and content, and I constantly run them on multiple 4-core Xeon machines to check correctness and benchmark performance.