Lots of performance mysteries, particularly in micro-benchmarks, are explained by how either code or data aligns to memory cache lines. A seemingly unrelated change will cause memory locations to move. No idea if that is the case here or not but it looks like a possibility.
This is one of the reasons why I think microbenchmarks are either not very useful or completely misleading --- they'll favour things like insanely unrolled loops or aligned data with lots of padding, when in fact on a macro-scale those things will certainly affect performance negatively due to cache misses.
My preferred approach when optimising is to go for size first; then if more speed is desired, carefully apply size-increasing optimisations to areas which may benefit, testing with a macro benchmark to judge any improvement.
If the macro scale is dominated by the code you're optimizing, which is often the case (say, the inner loops of a codec, signal processing, etc.), then the cache misses that may follow aren't really a concern. Obviously(?) taking some random code that doesn't run frequently and trying to optimize it won't make a difference. Microbenchmarks are pretty useful in this context since they let you iterate faster. More generally, the question is what the bottleneck is: if the bottleneck is fetching stuff into caches, then that's what you optimize for; if the bottleneck is some computation that runs from caches, then that's what you optimize...
The big cautionary tale of microbenchmarking is memcpy. If you look at the GNU memcpy, it's ridiculously long and has obviously been guided by extensive benchmarking. And memcpy often shows up in profiles, so it is a big target. But in real systems a small simple loop will perform better due to icache pressure.
Memcpy might not be a good example, however, because it's often replaced by a loop (or a move, or eliminated altogether) by the compiler. It's such a common function that it's understood by the compiler, and the trade-offs of the real function are also understood and taken into account (of course I'm sure there are bad optimizations).
> in real systems a small simple loop will perform better due to icache pressure
Do you have a source for this? Seems surprising GCC would leave such low-hanging fruit. G++ makes the effort to reduce std::copy to a memmove call when it can, or at least some of the time when it can (or at least, it did so in 2011). [0]
Related to this: does GCC treat memcpy differently when it can determine at compile-time that it's just a small copy?
Problem is, with superscalar processors the correspondence between number of instructions and speed breaks down. Partly because the processor does its own optimization on the fly and can do multiple things in parallel.
A programmer should be careful about second guessing the compiler. And a compiler should be careful about second guessing the processor.
I'm not sure if you're implying this is premature optimisation. It isn't.
It's a performance-sensitive standard-library function, the kind of thing that deserves optimisation in assembly. It's also the kind of problem that can be accelerated with SIMD, but that necessarily means more complex code. That's why the standard library implementations aren't always dead simple.
Here's a pretty in-depth discussion [0]. They discuss CPU throttling, caches, and being memory-bound.
It's amazing to me that every CPU doesn't have specialized memory copying instructions given how common an operation this is and how it can be a bottleneck. We have hundreds of vector instructions to optimize all kinds of special cases but little for this most general thing?
Intel has `rep movs`, which is recommended for general use; you can beat it for sizes < 128B using a specialized loop. A very good memcpy for general purposes would just branch on the size to either a specialized small copy or `rep movs` for larger sizes.
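A rough sketch of that shape, assuming x86-64 and GCC/Clang inline asm; the 128-byte cutoff is just the number from the comment above, and a real implementation would need measuring:

    #include <stddef.h>

    void *sketch_memcpy(void *dst, const void *src, size_t n) {
        unsigned char *d = dst;
        const unsigned char *s = src;
        if (n < 128) {
            /* small copies: a plain byte loop, cheap to decode and predict */
            for (size_t i = 0; i < n; i++) {
                d[i] = s[i];
            }
        } else {
            /* rep movsb copies RCX bytes from [RSI] to [RDI] */
            __asm__ volatile("rep movsb"
                             : "+D"(d), "+S"(s), "+c"(n)
                             :
                             : "memory");
        }
        return dst;
    }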
x86 does, it's called REP MOVS, and it copies entire cachelines at a time (if the size of the copy lets it.) AFAIK it has done this since the P6 family (Pentium II), and has only gotten better over time. As a bonus, it is only 2 bytes long. I believe the Linux kernel uses it as its default memcpy()/memmove().
Funnily enough, the REP family of instructions has existed at least since the original 8086/8088. It's very much a relic of the CISC heritage of the x86 ISA, and it was probably meant to make writing assembly code easier and to save space, rather than to improve performance.
I remember it was actually quite common in early PC programs, but I don't remember well enough whether it was also generated by compilers, or just existed in hand-optimized assembly (which was of course extremely common back then).
As I tried to indicate, the microbenchmark-guided unrolled loops look better than the simple `rep` instruction because the benchmarks aren't realistic. They don't suffer from mispredicted branches, they don't experience icache pressure. Basically the microarchitectural reality is absent in a small benchmark.
There's also the fact that `rep` startup cost was higher in the past than it is now. I think it started to get really fast around Ivy Bridge.
This is all discussed in Intel's optimization manual, by the way.
Yes, some compilers emit AVX move instructions on memcpy or compiler-generated moves (e.g. pass by value struct). (In addition to the rep move mentioned in the sibling posts)
> My preferred approach when optimising is to go for size first
Agreed; in my experience, more often than not, size is speed. Smaller memory usage means fewer cache misses, smaller memcpys, more chance of having data fit in a single word, etc...
That only tends to be the case for tiny loops where loop iterator updates can be folded into addressing displacements and that frees up an execution port needed for something else.
If you don't know what you are doing unrolling is just as likely to hurt performance because you don't fit into uOP cache and get less decode bandwidth as a result. Or you increase ICache pressure on macro benchmarks and hurt real world performance. Modern cores are really good at hiding loop accounting overhead.
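For concreteness, here's the kind of trade-off being described: the same reduction written plainly and hand-unrolled. Whether the unrolled version actually wins depends on the core and on how much else is competing for the icache/uop cache, which is exactly what a microbenchmark won't show:

    #include <stddef.h>

    long sum_plain(const long *a, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++) {
            s += a[i];
        }
        return s;
    }

    long sum_unrolled4(const long *a, size_t n) {
        /* 4x unrolled by hand: fewer loop-control instructions per element,
         * but roughly 4x the code for this loop body */
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        long s = s0 + s1 + s2 + s3;
        for (; i < n; i++) {
            s += a[i];  /* leftover elements */
        }
        return s;
    }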
In a world where all the electronic devices are composed of x86 cores and/or have a data and/or instruction cache, it could be true.
But there are plenty of architectures out there where space-time tradeoffs (by optimized compilers) are still a thing (my PoV is from the embedded industry).
I've found an interesting converse with macro benchmarks: small improvements to any of the top functions shown in your profiler output will have a larger-than-expected effect on the overall speed. I mean you make a change that improves just one function by a little bit on its own (within the benchmark), but the overall result improves by more than that. I think you end up freeing resources used by everything else.
The trick is to optimize the right macro benchmark- one that matches your customer's key use case I suppose.
I watched a great talk about this, and how to avoid problems that result from it. The presenter built a framework that made random changes to alignment of various blocks of code with two different versions of the same microbenchmark until it had high statistical confidence that either one was faster than the other, or it would eject and say that the difference between the two was statistically insignificant. Or something along those lines.
I'm looking for the video now, but I can't find it. Pretty sure it was CppCon but I'm not positive.
My team watched this several months ago now. "Eyeball statistics", a phrase he mentions at one stage, has particularly stuck with us from it. So many metrics dashboards etc rely on it to a disturbing degree, if you step back and really think about what they're _actually_ showing you. I almost want to put a warning banner on most of ours. "Beware Eyeball Statistics!"
The other one is 99th percentile latency. Lots of people think of 99th percentile latency as the latency that 1% of their users will encounter. But if your site loads 200 assets, almost every site load will hit 99th percentile latency.
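Back-of-the-envelope, assuming the 200 asset fetches are independent: the chance that a page load sees at least one asset at or beyond the 99th-percentile latency is 1 - 0.99^200, roughly 87%, so "almost every" load is about right.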
This[1] talk from Emery Berger at last year's Strange Loop does a good job discussing the impact of alignment and other 'small' changes on performance, and how to estimate the actual impact of changes on performance
Yep, I'm not even a hobbyist in this area, and immediately wondered if there was an alignment change because of it.
With that said, it sounds like at least with GCC, a global constant would go in the TEXT section of the binary and I can't think of why that would affect performance, especially since the variable was seemingly unused. Neat, I hope someone finds the thread to pull here :)
> it sounds like at least with GCC, a global constant would go in the TEXT section of the binary
I don't know why you think it sounds like that, but GCC (on Linux) puts global constants into the .rodata section (read-only data) and other global variables into the .data section.
> I can't think of why that would affect performance, especially since the variable was seemingly unused.
I'm guessing data alignment and corresponding changes in cache hits and misses. But I wonder if there's a scenario where this change changes the offsets of other variables from the base of the data section, which changes the encodings and therefore the sizes of the instructions accessing those variables, which overall changes code size and might cause code alignment issues. Probably too far-fetched.
On macOS, "gcc" puts strings that don't include a null byte in the TEXT section. I put quotes around gcc, since it seems that by default macOS aliases gcc to clang.
Interesting. And if you put a null byte at the end of that character array, it would put it in a different section? I don't have access to MacOS, but I would be interested in seeing the output of clang -S on that code (with and without a null byte), maybe you could add it to the repository?
If I'm remembering correctly (this was from 3 years ago), the string was moved to a different section if it had a null byte in the middle of the string. That's why coming up with the assembly I used was kinda tricky.
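A hypothetical test file for anyone who wants to reproduce this; compile with `clang -S test.c` (or inspect the object with `objdump`/`otool`) and compare where each array ends up. The names are made up, and the exact section choice is the thing being tested, not something asserted here:

    const char no_nul[]  = {'h', 'i'};              /* no NUL at all */
    const char end_nul[] = "hi";                    /* NUL only at the end */
    const char mid_nul[] = {'h', '\0', 'i', '\0'};  /* NUL in the middle */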
Thanks for the correction, hopefully my statement that I'm hardly a hobbyist makes it clear I'm not super sure :)
I guess I don't understand enough about how this variable sitting in memory might affect hit/miss/alignment especially if it's not used in the program as it's running.
Would it be that some variable that is used gets pushed off of a page in memory or something by the unused variable, so accessing it is slower? I guess that makes sense.
I have worked on algorithmic trading, where you basically stare at instructions and try to ensure that the CPU does not stall on branches, that memory is accessed in a strided pattern recognizable to the CPU, that there are no more concurrent access patterns than the CPU can coordinate, that memory being written does not by accident belong to another core, etc.
What is a "mystery" in the article is actually bread and butter for anybody tasked with seriously optimizing a piece of non-trivial code.
I'm pretty sure this is it. Changing the number of pointers shouldn't matter as they're already aligned but the string is an odd number of characters, which would change the offset of pretty much every other constant string and thus the performance of the respective lookups. They were probably just very lucky that it was aligned so well before removing it.
That was my first thought. Not sure what compiler OP is using but I would put "#pragma align" all over the place and see if the absence still makes a difference.
That wouldn't help if they're generating ELF files as .rodata is a string table, meaning it's tightly packed and as far as I know neither clang nor gcc pads it.
I wrote that commit on my laptop. I was unable to reproduce this on my desktop.
Re "why does this happen?", it certainly looks like an alignment thing. A more interesting question for me is "what should I do about it?" If future non-trivial changes affect the micro-benchmark numbers, how do I ensure that it's signal and not noise? Do I sprinkle some alignment directives throughout my code and hope for the best? Once the immediate symptoms are gone, how do I know that I've added enough?
Somebody suggested (off-list): "use compiler flags like the combo `-ffunction-sections -falign-functions=N` for values like 16, 32, 64 to help diagnose these issues quickly. You can also look at perf counters to find the problems, but each problem has a different counter so that can be hard. Once you know you have a problem, you can usually write code defensively against the issue. But it requires knowing a lot about the micro-architecture. Things like minimizing branch density, data dependency graph height, etc."
That's all very well (and better suggestions than nothing), but I'm hesitant to hill-climb using different compiler flags from what my users generally do. I also want to avoid over-fitting to my primary (day-to-day) machine or to a particular version of a particular C compiler.
I'm no expert, but if it is a memory alignment issue, isn't this undefined behavior?
If the string is unused, isn't the compiler free to allocate or not allocate it (unless it has some 'volatile' directive or something)? This means that doing things like turning off/on optimization could yield inconsistent results across machines, across compilers and even across versions of the same compiler.
Is this used for security so that side channel timing attacks can be used to glean information about private data? Is the speed differential noticeable such that someone has created an issue for it? Is exact consistency in speed across different compilers, architectures, etc. a priority of the project?
If the answer is no to all of those, then I find it difficult to justify 'fixing' the issue.
From what I understand, isn't this what GUIX is trying to do? Create consistent byte-for-byte compiled programs? There has to be a trade-off between consistency and speed. I also don't know how usable GUIX is or how valuable it is for your use case.
The compiler knows how it's aligning variables. It's a pure performance issue: the location of things in memory affects alignment to caching unit boundaries (accessing one more cache line for the same data) and cache pressure (data ends up cached in the same location as something else, causing unnecessary evictions while other cache locations remain underused).
> I wrote that commit on my laptop. I was unable to reproduce this on my desktop.
Then please modify the comment in the source to state that.
The described behavior of the measurements (the speed results unpredictably changing slightly in both directions) is actually "normal" on notebooks under most possible thermal configurations, and is not something that should even be "fixed" until the observed effects are consistent and big enough to be repeatable without tight thermal controls.
Edit: of course the pushed commit can't be changed. But the comment (if it is visible in the source -- haven't checked that) can. There should be some kind of a visible "resolution" of the question in the repo.
I haven't written a comment in the source yet, as I haven't further investigated and implemented a work-around yet to hang the comment on. It's low on the priority list.
It's mentioned elsewhere in this HN discussion, but I think I'd encourage watching this talk: https://www.youtube.com/watch?v=r-TLSBdHe1A. It's particularly relevant to what you're seeing.
Yeah, I've watched that video. That's the source of my "I've also been pointed to https://github.com/ccurtsinger/stabilizer but it sounds tied to LLVM 3.1 and hasn't had any substantial updates since 2013."
When weird performance things like this happens, it can be helpful to test in other environments. If the performance change does not happen in other operating systems or hardware then it suggests that the weird performance you are observing could be due to an unusual coincidence in your particular system.
That said, if you do want to figure out why deleting that global variable made a difference, then using Linux's perf tool might give you more information to work with. One time I had a weird program where inserting a NOP instruction in a certain location made it run twice as fast. After investigation we found out that the difference was branch prediction. The presence or absence of that NOP instruction affected the addresses of the jump targets of the inner loop's switch statement. For some reason, in the version without the NOP instruction those addresses resulted in lots of branch mispredictions. Perhaps because of a collision in the branch predictor's hash tables.
Are you sure it wasn't just instruction alignment? Inserting nops before loop jump targets to align the first loop body instruction to 8 or 16 bytes is a very common x86 thing most compilers do. See e.g. https://reverseengineering.stackexchange.com/a/2930.
Removing `const char* wuffs_base__note__i_o_redirect = "@base: I/O redirect";` removes not only the literal string, but also a global non-constant pointer. This perhaps affects various optimizations.
They should have used `const char* const wuffs_base__note__i_o_redirect =` or (preferably) `const char wuffs_base__note__i_o_redirect[] =`.
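For what it's worth, a minimal sketch of the three flavours (hypothetical names; the section placement is what GCC/Clang typically do on Linux, not a guarantee):

    const char *msg_ptr        = "@base: I/O redirect"; /* a writable pointer (in .data
                                                            or .data.rel.ro) plus the
                                                            string bytes in .rodata */
    const char *const msg_cptr = "@base: I/O redirect"; /* const pointer: the pointer
                                                            itself can also live in
                                                            read-only data */
    const char msg_arr[]       = "@base: I/O redirect"; /* no pointer at all: the array
                                                            is the constant, in .rodata */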
Sure, but that's not a compiler optimization issue. The compiler will emit the same instructions, they will just execute at a different speed if the data is elsewhere.
I remember changing a _constant_ int into a volatile global variable, and my code became 4 times faster. It was on Ubuntu 16.04. I might actually find code and post here later.
I worked on some code once for an app that would crash if you added a comment in a specific place. We removed the comment and then added a comment asking developers not to re-add the comment.
I was thinking the same thing. It would be very odd if it had an effect on C or C++ (even more than the effect in the original article) because comments are removed even before preprocessing. They do affect the __LINE__ macro though, so it is vaguely conceivable for it to have an effect on program behaviour.
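A contrived sketch of the __LINE__ point (nothing to do with the app in the story above):

    #include <stdio.h>

    int main(void) {
        /* deleting this comment line changes what gets printed below,
         * because __LINE__ expands to the current source line number */
        printf("%d\n", __LINE__);
        return 0;
    }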
Interesting. Were you writing a kernel driver? I've only ever used 'volatile' in embedded code where an ISR is touching a variable outside the scope, to prevent the compiler from optimizing the code away. What were you using 'volatile' for?
No, it was some random debugging of a LeetCode-style exercise, some useless math computation. I was just throwing every possible change at it until one worked. I knew that code could be faster than it was, because with PGO it was very fast.
As far as I see, the source code of the library is not plain C but .go, which is then transpiled to .c
Anyway, even if what we see in the resulting .c is "just some missing string", it's not necessarily obvious how the strings are supposed to be handled in the whole .c domain. Specifically, it can be that some other constants (not necessarily string constants, but constants in the same segment where the string constants are!) are used very frequently by the code being measured (let's call them "hotter" constants), and the removal of that one string constant simply affected the placement of the "hotter" constants.
In short, I suspect that the deletion of the constant is not the only way to get different results, but that the different results would also happen even when changing the order of the definition of constants, without removing any of them.
So the way I would attack that problem, if fixing the performance issues were desired, is: I'd measure the accesses of all the constants in the same segment and identify which ones are used during the whole run of these benchmarks -- those are "hot." The solution is then to modify the code to be less dependent on such constants inside the "hot" loops. Often the number of these that have to be fixed is low, but it can also turn out otherwise.
So I believe it's not as much of a mystery as it appears to be. (I have even more specific experiences and also suggestions. And I'm also looking for a new dream remote job. Anybody need this kind of expertise, for some reasonable longer term, or shorter term but seriously paid?)
Watching used constants and trimming them to fit the cache seems like a good idea. I'm a total noob PC-architecture wise so apologies beforehand, but doesn't the CPU handle the caching of most used memory locations itself? If there are an equal number of hot constants, how does changing the order of them help performance? CPU is going to cache the most used locations anyway. In this example, the constant was never used, so there is no reason for the CPU to cache it.
BTW, adding some contact info to your HN profile could help in the job search. Best of luck!
> doesn't the CPU handle the caching of most used memory locations itself?
Yes, but caches aren't "never failing magic". You can think of them as mechanisms that save some work in some scenarios. If the work thrown at them runs up against their limitations, worse results aren't surprising.
> If there are an equal number of hot constants, how does changing the order of them help performance? CPU is going to cache the most used locations anyway.
The caches have some limitations by design. It's easy to construct the code which stresses the caches more, and sometimes just the position of elements accessed can influence the "congestion" points due to the changed mapping between the addresses and the elements accessed.
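A toy example of that kind of congestion, assuming a typical 32 KiB, 8-way L1D with 64-byte lines (so 64 sets): addresses 4096 bytes apart all map to the same set, while a 4160-byte stride spreads the same number of lines across many sets. The cache numbers are assumptions about one common shape, not about the hardware in the thread:

    #include <stddef.h>

    /* Touch `lines` cache lines, `reps` times, at a given stride.
     * With stride 4096 and lines > 8, every access fights over one set;
     * with stride 4160 the same lines land in different sets. */
    long touch(const char *buf, size_t stride, int lines, int reps) {
        long s = 0;
        for (int r = 0; r < reps; r++) {
            for (int i = 0; i < lines; i++) {
                s += buf[(size_t)i * stride];
            }
        }
        return s;
    }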
> adding some contact info to your HN profile
Thanks. I still hope that if there's real interest, contact will happen in spite of it not being exceptionally easy -- which is also a kind of filter.
This reminds me of when I made a program faster by adding a print statement to it.
I wasn't able to find the original program, but I created a new one as similar to it as I remember, and it seems that this weird trick works in this one too.
When I uncomment the puts("hi") line in the code below, the time it takes to run the program consistently changes from 5.6 to 5.4 seconds on my machine if I compile without optimizations.
    #include <stdio.h>

    int main() {
        long tri_side = 1;
        long tri_area = 1;
        long sq_side = 1;
        long sq_area = 1;
        while (sq_side < 1000000000) {
            if (sq_area == tri_area) {
                printf("tri:%ld sq:%ld area:%ld\n", tri_side, sq_side, sq_area);
                //puts("hi");
            }
            if (sq_area < tri_area) {
                sq_area += sq_side * 2 + 1;
                sq_side += 1;
            } else {
                tri_area += tri_side + 1;
                tri_side += 1;
            }
        }
    }
Looks like other people are onto the same thing - alignment.
Especially if this is threaded code, what's probably happened is something (likely with some sort of locking primitive) that fit on one cache-line, now straddles two. The reverse is also possible where two items were landing on different cache lines and are now creating a false sharing problem.
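A minimal sketch of the false-sharing case (64 bytes is an assumed cache-line size):

    #include <stdalign.h>

    /* Two counters bumped by two different threads. */
    struct counters_packed {
        long a;             /* thread 1 increments this */
        long b;             /* thread 2 increments this: same cache line as 'a',
                               so the line ping-pongs between the cores */
    };

    struct counters_padded {
        alignas(64) long a;
        alignas(64) long b; /* forced onto its own 64-byte line */
    };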
It's likely a global and likely in the .bss (which comes immediately after .data, which is why it has alignment troubles when static strings change). It's usually pretty easy to binary search your way to the problematic module and variable.
I'm not sure I understand why they revert the change. You get random small +/- changes in performance with the change, but you get slightly cleaner code. What's the reason not to want that?
Letting "foo" denote "wuffs_base__note__i_o_redirect", it wasn't really that "foo was unused", it was that "foo was unused, after the previous (artificial) commit renamed all-but-one of the foo references to another existing string constant". That previous commit set it up so that this commit, which exhibits the performance difference, has a trivially small diff. But the two commits combined to remove a "foo" that was actually in use, so the combination was rolled back.
I feel like there's a (sometimes blurry) line between being hesitant about dangerous changes (good thing) and hoping that if you ignore something you don't understand, it will keep working as you expect (not a good thing).
Purely leaving an unused variable in place due to weird impact on performance is in the second category for me. But maybe there are other aspects - that's why I'm curious and asking about it
He should experiment with shortening his ENV vars one byte at a time, to see similar benchmarking effects. The environment affects the process's memory layout and alignment, and if you're unlucky, bad alignment can cost a few percent.
I've never heard of this "wuffs" project before; apparently it's a programming language. It looks like a wild project, does anyone have any details on what this is?
You need to look at the disassembly of the generated binary to make sense of this sort of performance variation (paying attention to line cache boundaries for code and data), and even so, it is highly non-trivial. The performance counters found in modern processors sometimes help (https://en.wikipedia.org/wiki/Hardware_performance_counter ).
https://www.agner.org/optimize/microarchitecture.pdf contains the sort of information you need to have absorbed before you even start investigating. In most cases, it's not worth acquiring the expertise for 5% one way or the other in micro-benchmarks. If you care about these 5%, you shouldn't be programming in C in the first place.
And then there is this anecdote:
My job is to make tools to detect subtle undefined behaviors in C programs. I once had the opportunity to report a signed arithmetic overflow in a library that its authors considered, rightly or wrongly, to be performance-critical. My suggestion was:
… this is not one of the subtle undefined behaviors that we are the only ones to detect, UBSan would also have told you that the library was doing something wrong with “x + y” where x and y are ints. The good news is that you can write “(int)((unsigned)x + y)”, this is defined and it behaves exactly like you expected “x + y” to behave (but had no right to).
And the answer was “Ah, no, sorry, we can't apply this change, I ran the benchmarks and the library was 2% slower with it. It's a no, I'm afraid”.
The thing is, I am pretty sure that any modern optimizing C compiler (the interlocutor was using Clang) has been generating the exact same binary code for the two constructs for years (unless it applies an optimization that relies on the addition not overflowing in the “x + y” case, but then the authors would have noticed). I would bet a house that the binary that was 2% slower in benchmarks was byte-identical to the reference one.
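For reference, the two spellings being compared, written as hypothetical wrappers rather than the actual library code:

    int add_signed(int x, int y) {
        return x + y;                    /* undefined behaviour if the sum overflows int */
    }

    int add_via_unsigned(int x, int y) {
        /* unsigned addition wraps by definition; converting the result back to
         * int is implementation-defined rather than undefined, and wraps on
         * mainstream compilers, which is the behaviour people usually expect */
        return (int)((unsigned)x + y);
    }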
I wouldn't expect aerospace, since I have been told embedded programmers in that field routinely disable compiler optimization, in the chance that a compiler bug or overzealous UB exploitation might introduce a bug into previously working code. Hard realtime requirements demand fast code, but not necessarily efficient code.
Performance counters are vital, and you don't need to grovel the disassembly yourself in association with profiling, even if it's feasible. Get a tool to do it for you; MAQAO is one in that area (x86-specific, and unfortunately part-proprietary).
Anyway, yes, measurements are what you need, rather than guesses, along with correctness checks, indeed.
The presence (or absence) of a global variable has no effect on the rest of the generated code. Nothing changed. You could possibly look at the alignment of the in-memory representation generated by the relocator / runtime dynamic linker. Likely, some bits of code are now on the same cache line that were previously separate, and vice versa.
> deleting this one line of code can have a dramatic effect on seemingly unrelated performance micro-benchmarks. Some numbers get better (e.g. +5%), some numbers get worse (e.g. -10%). The same micro-benchmark can get faster on one C compiler but slower on another
There's a fundamental disconnect that makes it difficult for humans to reason about performance in computer programs. Because the speed of light is so slow, computer architecture as we know it will always rely on cache and OoO to be fast. The human brain does seem to work out of order, but it's only used to thinking about a world that runs in order. When we use theory of mind, we don't model other people's minds, we use our own as a model for theirs; see mirror neurons[1].
Because of this, standard code benchmarks are not very useful, unless they can demonstrate order-of-magnitude speedups. Even something like a causal profiler[2][3][4], which attempts to control for the volatile aspects of performance, is of limited use; it cannot control for all variables and its results will likely be invalidated by the same architectural variation it tries to control for. Instead (with respect to performance) we should focus on three factors:
I have to admit... I get a "geek on" whenever I read C programmer commit messages. I don't know what it is but those people know how to write detailed, interesting and often entertaining commit messages. This one was a perfect example of all 3.