
Isn't this usually where someone sufficiently versed in assembly would look at the generated assembly and figure out what changed?



You need to look at the disassembly of the generated binary to make sense of this sort of performance variation (paying attention to cache-line boundaries for code and data), and even then it is highly non-trivial. The performance counters found in modern processors sometimes help (https://en.wikipedia.org/wiki/Hardware_performance_counter ).

https://www.agner.org/optimize/microarchitecture.pdf contains the sort of information you need to have absorbed before you even start investigating. In most cases, it's not worth acquiring that expertise for 5% one way or the other in micro-benchmarks. If you care about those 5%, you shouldn't be programming in C in the first place.

And then there is this anecdote:

My job is to make tools to detect subtle undefined behaviors in C programs. I once had the opportunity to report a signed arithmetic overflow in a library that its authors considered, rightly or wrongly, to be performance-critical. My suggestion was:

… this is not one of the subtle undefined behaviors that only we detect; UBSan would also have told you that the library was doing something wrong with “x + y” where x and y are ints. The good news is that you can write “(int)((unsigned)x + y)”: this is defined, and it behaves exactly the way you expected “x + y” to behave (but had no right to).

And the answer was “Ah, no, sorry, we can't apply this change, I ran the benchmarks and the library was 2% slower with it. It's a no, I'm afraid”.

The thing is, I am pretty sure that any modern optimizing C compiler (the interlocutor was using Clang) has been generating the exact same binary code for the two constructs for years (unless it applies an optimization that relies on the addition not overflowing in the “x + y” case, but then the authors would have noticed). I would bet a house that the binary that was 2% slower in benchmarks was byte-identical to the reference one.
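For concreteness, here is a minimal sketch of the two forms in question (the function names are mine, for illustration only):

    /* Undefined behavior whenever the mathematical result of
       x + y does not fit in an int (signed overflow). */
    int add_ub(int x, int y) {
        return x + y;
    }

    /* No undefined behavior: unsigned addition wraps modulo 2^N,
       and converting an out-of-range result back to int is
       implementation-defined (a modulo wrap on every mainstream
       compiler) rather than undefined. */
    int add_wrap(int x, int y) {
        return (int)((unsigned)x + y);
    }

On current GCC and Clang at -O2, the two typically compile to identical machine code (unless, as noted above, an optimization exploits the no-overflow assumption in the first form), which is why a reproducible 2% difference between them would be so surprising.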


If I may ask, what was the use case for this code that they cared so much about a 2% difference in benchmarks? Aerospace? Game engine? Packet routing?


I wouldn't expect aerospace, since I have been told embedded programmers in that field routinely disable compiler optimizations, on the chance that a compiler bug or overzealous exploitation of undefined behavior might introduce a bug into previously working code. Hard real-time requirements demand code that is fast enough to meet its deadlines, not necessarily code that is maximally efficient.


I am guessing your tool was source-based, to even detect this; it certainly couldn't have known that the code change would have produced identical binary code.


Performance counters are vital, and you don't need to grovel through the disassembly yourself in conjunction with profiling, even when that's feasible. Get a tool to do it for you; MAQAO is one in that area (x86-specific and, unfortunately, partly proprietary).

Anyway, yes: measurements are what you need, rather than guesses, along with correctness checks.


I've had this exact situation happen to me as well :/ It's frustrating.


The presence (or absence) of a global variable has no effect on the rest of the generated code; nothing changed. You could possibly look at the alignment of the in-memory representation produced by the relocator/runtime dynamic linker. Likely, some bits of code now share a cache line that were previously separate, and vice versa.
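To illustrate the kind of layout sensitivity being described, a sketch only: the 64-byte line size is the usual x86 value, and the variable names are invented.

    #include <stdalign.h>  /* C11 alignas */

    /* The linker decides where these globals land, so adding or
       removing an unrelated variable elsewhere can shift which
       64-byte cache line each one occupies, changing performance
       with no change to the code itself. Forcing each onto its
       own line makes the layout insensitive to such edits. */
    alignas(64) long counter_a;
    alignas(64) long counter_b;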



