All CPUs commit in order and take exceptions precisely, because most other options are insane, or would drive you to it. However: single-thread commit order =/= observability order.
The observability order of memory operations --- which are the only operations that matter --- is governed by the memory consistency model of the architecture.
x86 has what's generally referred to as strong ordering on memory operations.
On x86, part of what that means is that stores from the same core cannot be observed out of order with respect to each other, and neither can loads.
So assuming the compiler does not move the `tail++` up or hoist the assignment out of the if-statement (both of which can be prevented by marking the variables `volatile`), the code should actually work on x86.
The `tail++` change cannot be observed before the write to the queue, and the read from the queue cannot be observed before the reads of the `tail` and `head` variables.
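(The article's code isn't quoted in this thread, so purely as a sketch of the shape under discussion, with names guessed by me: a classic single-producer single-consumer ring buffer with `volatile` indices.)

```cpp
#include <cstddef>

// Hypothetical reconstruction of the queue under discussion; the names and
// shape are guesses, not the article's actual code.
constexpr size_t N = 256;          // capacity, power of two (my choice)
volatile int buf[N];
volatile size_t head = 0;          // advanced only by the consumer
volatile size_t tail = 0;          // advanced only by the producer

bool push(int v) {                 // producer thread only
    if (tail - head == N) return false;  // full
    buf[tail % N] = v;             // (1) write the element...
    tail = tail + 1;               // (2) ...then publish it (the thread's `tail++`,
                                   //     spelled out since ++ on volatile is
                                   //     deprecated in C++20)
    return true;                   // x86 TSO: (1) cannot be observed after (2)
}

bool pop(int &v) {                 // consumer thread only
    if (tail - head == 0) return false;  // empty
    v = buf[head % N];             // x86 TSO: not observed before the loads above
    head = head + 1;
    return true;
}
```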
On RISC-V and Arm, you need more than that, as they have substantially weaker memory consistency. The RISC-V specs have some examples of the interesting outcomes you can get; some of them involve time-travel.
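For a concrete flavour, here's the classic message-passing litmus test (not an example lifted from the RISC-V spec; the names are mine), written with relaxed atomics so it's at least race-free C++ while still requesting no ordering:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// Message-passing litmus test. Relaxed atomics request no ordering, so the
// observable outcomes mirror what the raw hardware is allowed to do.
std::atomic<int> data{0}, flag{0};
int r1, r2;

int main() {
    std::thread producer([] {
        data.store(1, std::memory_order_relaxed);   // (a)
        flag.store(1, std::memory_order_relaxed);   // (b)
    });
    std::thread consumer([] {
        r1 = flag.load(std::memory_order_relaxed);  // (c)
        r2 = data.load(std::memory_order_relaxed);  // (d)
    });
    producer.join();
    consumer.join();
    // "r1=1 r2=0" is forbidden on x86 (TSO keeps (a) before (b) and (c)
    // before (d) as observed) but allowed on Arm and RISC-V (RVWMO), where
    // the consumer can see the flag set yet still read stale data.
    std::printf("r1=%d r2=%d\n", r1, r2);
}
```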
But in the end: yes, the reordering done by the CPU is the issue. The compiler can and does reorder stuff when it thinks that will unlock more instruction-level parallelism, but no amount of `volatile` is going to make that queue universally usable on RISC-V, no matter what the compiler does. Even if it perfectly preserves the single-thread semantics of the code and doesn't reorder a single instruction, the CPU can still move stuff around in terms of observability.
The alternative is that the compiler inserts a barrier/fence after every instruction.
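The saner version is targeted fences, only where ordering actually matters. A sketch of where they'd sit in that kind of queue, using relaxed atomics plus explicit C++ fences as stand-ins for the hardware barriers (RISC-V `fence`, Arm `dmb`); the shape and names are mine:

```cpp
#include <atomic>
#include <cstddef>

// Same hypothetical queue, with fences placed exactly where ordering matters
// rather than after every instruction.
constexpr size_t N = 256;
int buf[N];
std::atomic<size_t> head{0}, tail{0};   // free-running indices

bool push(int v) {                       // producer thread only
    size_t t = tail.load(std::memory_order_relaxed);
    if (t - head.load(std::memory_order_relaxed) == N) return false;  // full
    std::atomic_thread_fence(std::memory_order_acquire);  // pairs with the consumer's
                                                          // release: the slot is free
    buf[t % N] = v;                                       // write the element
    std::atomic_thread_fence(std::memory_order_release);  // element before the publish
    tail.store(t + 1, std::memory_order_relaxed);         // publish
    return true;
}

bool pop(int &v) {                       // consumer thread only
    size_t h = head.load(std::memory_order_relaxed);
    if (tail.load(std::memory_order_relaxed) - h == 0) return false;  // empty
    std::atomic_thread_fence(std::memory_order_acquire);  // publish before the read
    v = buf[h % N];                                       // read the element
    std::atomic_thread_fence(std::memory_order_release);  // read before freeing the slot
    head.store(h + 1, std::memory_order_relaxed);
    return true;
}
```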
There are trade-offs. Poorly written code for x86 can absolutely tank performance because ordering violations require code to be replayed, though that can sometimes be a problem in weaker consistency models as well.
Valid points, although I have another perspective on this bit:
> But in the end: yes the reordering done by the CPU is the issue
I think from a programmer perspective, the CPU side of things is mostly beside the point (unless you're writing assembly), and this contributes to the misunderstanding and air of mystery surrounding thread safety.
At the end of the day the CPU can do anything, really. I'd argue this doesn't matter because the compiler is generating machine code, not us. What does matter is the contract between us and the compiler / language spec.
Without language-level synchronisation the code is not valid C/C++ and we will likely observe unexpected behaviour - whether that's due to CPU reordering or compiler optimisations doesn't matter.
I think the article somewhat misses the point by presenting the case as if the compiler were not part of the equation.
It seems like often people think they know how to do thread safety because they know, e.g. what reorderings the CPU may do. "Just need to add volatile here and we're good!" (probably wrong). In reality they need to understand how the language models concurrency.
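In C++, for instance, that contract is `std::atomic` with explicit memory orders. A sketch of the same hypothetical queue written against the language model rather than against x86; the compiler then emits plain moves on x86, `ldar`/`stlr` on Arm, fences on RISC-V, and is itself forbidden from reordering across them:

```cpp
#include <atomic>
#include <cstddef>

// Portable SPSC ring buffer: acquire/release on the indices carries both the
// compiler contract and the hardware ordering. Names are mine, not the article's.
constexpr size_t N = 256;
int buf[N];
std::atomic<size_t> head{0}, tail{0};

bool push(int v) {                                   // producer thread only
    size_t t = tail.load(std::memory_order_relaxed); // only we write tail
    if (t - head.load(std::memory_order_acquire) == N) return false;  // full
    buf[t % N] = v;                                  // write the element
    tail.store(t + 1, std::memory_order_release);    // publish: the write above
                                                     // cannot pass this store
    return true;
}

bool pop(int &v) {                                   // consumer thread only
    size_t h = head.load(std::memory_order_relaxed); // only we write head
    if (tail.load(std::memory_order_acquire) - h == 0) return false;  // empty
    v = buf[h % N];                                  // cannot pass the tail load
    head.store(h + 1, std::memory_order_release);    // free the slot
    return true;
}
```

On x86 this compiles to essentially the `volatile` version; the difference is that the ordering is now a promise the compiler has to keep on every target.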
We could translate that queue code into another language with a different concurrency model - e.g. Python - and now the behaviour is different despite the CPU doing the same fundamental reorderings.
This is true, but in practice it's pretty common to find that this sort of code seems to work fine on x64, because the compiler doesn't actually reorder things, and then it sometimes blows up on ARM (or PowerPC, though that's less commonly encountered in the wild these days).