This begs for a comparison of the assembly dumps from both compilers.



gcc acts RISCy -- load memory into a register, add registers, store register back to memory.

clang acts CISCy -- use a read-modify-write instruction to add directly in memory.


I'm surprised that GCC generates that code. Using read-modify-write instructions directly on memory has been the better choice on x86 processors for quite a while now, since both AMD and Intel support fusing the memory and arithmetic operations into a single micro-op.


Curiously enough, using separate reads and writes, with appropriate nop padding, seems to do better than any read-modify-write variation I tried.

By the way, read-modify-write is not fully fused: only the write part is fused, so it generates 2 unfused uops (read, add), plus the fused uop for the write + address generation.

EDIT: Intel's compiler also generates read-modify-write code, but it ends up being slower than both gcc and clang.


I have also gotten better results with separate read, modify, and write instructions than with the combined form. I think this is because the separate instructions allow for more explicit scheduling.


I was studying this difference between the two compilers for a different issue, actually. As far as I can tell, nothing in the standard prevents an increment of a volatile from being performed in a single instruction if the target permits it. Splitting it into a separate load+binop+store, as gcc does, seems like an over-pessimization.
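For anyone who wants to reproduce the comparison, here is a minimal sketch of the volatile-increment case. The names `counter` and `bump` are made up for illustration, and the assembly in the comment only sketches the two instruction shapes described in this thread (register choice and addressing are assumptions, not actual compiler output).

```c
#include <stdint.h>

/* Hypothetical counter for illustration; not taken from the thread. */
static volatile uint32_t counter;

/* Reset the volatile counter, increment it n times, and return its value.
 * Each counter++ is one volatile read followed by one volatile write.
 *
 * Load/add/store shape (what gcc is described as emitting):
 *     mov  eax, [counter]
 *     add  eax, 1
 *     mov  [counter], eax
 *
 * Read-modify-write shape (what clang is described as emitting):
 *     add  dword ptr [counter], 1
 */
uint32_t bump(uint32_t n)
{
    counter = 0;
    for (uint32_t i = 0; i < n; i++) {
        counter++;
    }
    return counter;
}
```

Compiling this with `gcc -O2 -S` and `clang -O2 -S` and diffing the loop bodies should show which shape each compiler picks; the result may well vary with compiler version and target.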



