I'm surprised that GCC generates that code. Using read-modify-write instructions that operate directly on memory has been better on x86 processors for quite a while now, since both AMD and Intel support fusing the memory and arithmetic operations into a single micro-op.
Curiously enough, using separate reads and writes, with appropriate nop padding, seems to do better than any read-modify-write variation I tried.
By the way, a read-modify-write instruction is not fully fused: only the write part is micro-fused, so it generates two unfused uops (the read and the add), plus the fused uop for the write and its address generation.
EDIT: Intel's compiler also generates read-modify-write code, but it ends up being slower than the code from both gcc and clang.
I have also gotten better results with separate read, modify, and write instructions than with a combined instruction. I think this is because the separate assembly statements allow for more explicit scheduling.
I was studying this difference between the two compilers for a different issue, actually. Looking at the standard, I don't see anything that prevents an increment of a volatile from being performed in a single instruction if the target permits it. Splitting it into a separate load+binop+store, as gcc does, seems like an over-pessimization.