I'm surprised that GCC generates that code. Using read-modify-write instructions that operate directly on memory has been better on x86 processors for quite a while now, since both AMD and Intel support fusing the memory and arithmetic operations into a single micro-op.
Curiously enough, using separate reads and writes, with appropriate nop padding, seems to do better than any read-modify-write variation I tried.
By the way, a read-modify-write instruction is not fully fused: only the write part is micro-fused, so it generates two unfused uops (the read and the add), plus the fused uop for the write and its address generation.
EDIT: Intel's compiler also generates read-modify-write code, but it ends up being slower than the code from both gcc and clang.
I have also gotten better results with separate read, modify, and write instructions than with a combined instruction. I think this is because the separate assembly statements allow for more explicit scheduling.
I was studying this difference between the two compilers for a different issue, actually. Looking at the standard, I don't see anything that prevents an increment of a volatile from being performed in a single instruction if the target permits it. Splitting it into a separate load+binop+store, as gcc does, seems like an over-pessimization.