Looks like he's not writing to the "reserved" field, so the CPU will definitely need to read the cache line holding the current log entry before the write (RFO, read for ownership).
I'm not sure whether current CPUs are smart enough, but in theory writing all 64 bytes at once to a cache-line-aligned address could avoid the RFO. If the line misses in L1, L2 and L3, that could save 100-200 cycles.
It might also have been better to do RDTSCP first. Otherwise it'll also prevent instruction reordering and sabotage any attempt to avoid the RFO.
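To make that concrete, here's a rough sketch of what I mean, not the original code. It assumes x86-64 with GCC or Clang, and the entry layout, field names and `log_write` are all made up for illustration. It reads the timestamp first, builds the entry in a local, then writes the whole 64-byte line (including the reserved bytes) with non-temporal stores. That's a technique I'm swapping in: it's one documented way to write a full line without reading it first, but whether it actually wins here is exactly the thing that would need measuring.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && defined(__SSE2__)
#include <x86intrin.h>          /* __rdtscp, _mm_stream_si128, _mm_sfence */
#define LOG_HAVE_X86 1
#else
#include <time.h>               /* portable fallback so the sketch still builds */
#define LOG_HAVE_X86 0
#endif

/* Hypothetical layout: exactly one 64-byte cache line per entry. */
struct log_entry {
    alignas(64) uint64_t tsc;   /* timestamp counter value */
    uint32_t cpu;               /* raw IA32_TSC_AUX value from RDTSCP */
    uint32_t event;             /* made-up payload field */
    uint8_t  reserved[48];      /* padding, written too so the line is complete */
};

_Static_assert(sizeof(struct log_entry) == 64, "entry must fill one cache line");

static inline void log_write(struct log_entry *slot, uint32_t event)
{
    /* The slot must start on a cache-line boundary. */
    assert(((uintptr_t)slot & 63u) == 0);

    struct log_entry e;
    memset(&e, 0, sizeof e);    /* zero everything, reserved bytes included */

#if LOG_HAVE_X86
    /* Take the timestamp first, before any store to the destination line. */
    unsigned int aux;
    e.tsc = __rdtscp(&aux);
    e.cpu = aux;                /* OS-defined; on Linux it encodes CPU and node */
#else
    e.tsc = (uint64_t)clock();
    e.cpu = 0;
#endif
    e.event = event;

#if LOG_HAVE_X86
    /* Four back-to-back 16-byte non-temporal stores cover the full line. */
    const __m128i *src = (const __m128i *)&e;
    __m128i *dst = (__m128i *)slot;
    for (int i = 0; i < 4; i++)
        _mm_stream_si128(dst + i, _mm_load_si128(src + i));
    _mm_sfence();               /* order the streaming stores before later stores */
#else
    memcpy(slot, &e, sizeof e);
#endif
}
```

You'd call it as `log_write(&ring[i], event)` on an array of `struct log_entry`, which gets the 64-byte alignment from the type. Keep in mind non-temporal stores bypass the cache, so if something reads the entry back right away this could easily be slower.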
Anyways, not sure, didn't profile. (And by profiling I mean using those countless CPU performance counters to figure out what's going on in the mysterious black box.)
It's aging a bit, but Ulrich Drepper's seminal paper on memory comes to mind: https://www.akkadia.org/drepper/cpumemory.pdf
Say what you want about Drepper's style and personality, but that paper told me (a) he's an incredibly knowledgeable dude and (b) I'll always have more to learn, especially about cache.