As a more practical argument, let's suppose x86 had a an atomic CAS that guarant...

As a more practical argument, let's suppose x86 had a an atomic CAS that guarantees that the store is released but the load is relaxed (unlike other x86 normal loads, but like non-temporal loads, it has no implied LoadLoad or LoadStore like other x86 loads).

This relaxed-load-CAS, coupled with some form of control or data dependency would be sufficient to implement your mutex. But would such a CAS be significantly cheaper than the existing CAS? If it was, you would be able to approximate the strong CAS semantics by adding lfence+sfence after the relaxed CAS. These fences are cheap, so if strong CAS was possible to be implemented this way with significant improvements, intel would have already done it.

Finally, it is practically possible to implement the sequence store(A);#StoreLoad;load (B) without an explicit fence by using the colocation trick: have A and B be adjacent in memory (on the same cacheline), store to A, then do a wider load on both A and B. Intel does not give multithread guarantees on this, but my understanding is that it works: the wider load fails to be store-forwarded and stalls the pipeline waiting for the previous store to be flushed. In practice this costs about as much as a fence, so in addition to being undefined, is not cheaper.