Lots of performance mysteries, particularly in micro-benchmarks, are explained by how either code or data aligns to memory cache lines. A seemingly unrelated change will cause memory locations to move. No idea if that is the case here or not but it looks like a possibility.
This is one of the reasons why I think microbenchmarks are either not very useful or completely misleading --- they'll favour things like insanely unrolled loops or aligned data with lots of padding, when in fact on a macro-scale those things will certainly affect performance negatively due to cache misses.
My preferred approach when optimising is to go for size first; then if more speed is desired, carefully apply size-increasing optimisations to areas which may benefit, testing with a macro benchmark to judge any improvement.
If the macro scale is dominated by the code you're optimizing, which is often the case (say, the inner loops of a codec, signal processing, etc.), then the cache misses that may follow aren't really a concern. Obviously(?) taking some random code that doesn't run frequently and trying to optimize it won't make a difference. Microbenchmarks are pretty useful in this context since they let you iterate faster. More generally, the question is what the bottleneck is: if the bottleneck is fetching stuff into caches, then that's what you optimize for; if the bottleneck is some computation that runs from caches, then that's what you optimize...
The big cautionary tale of microbenchmarking is memcpy. If you look at the GNU memcpy, it's ridiculously long and has obviously been guided by extensive benchmarking. And memcpy often shows up in profiles, so it is a big target. But in real systems a small simple loop will perform better due to icache pressure.
Memcpy might not be a good example, however, because it's often replaced by a loop (or a move, or eliminated altogether) by the compiler. It's such a common function that it's understood by the compiler, and the trade-offs of the real function are also understood and taken into account (of course I'm sure there are bad optimizations).
> in real systems a small simple loop will perform better due to icache pressure
Do you have a source for this? Seems surprising GCC would leave such low-hanging fruit. G++ makes the effort to reduce std::copy to a memmove call when it can, or at least some of the time when it can (or at least, it did so in 2011). [0]
Related to this: does GCC treat memcpy differently when it can determine at compile-time that it's just a small copy?
Problem is, with superscalar processors the correspondence between number of instructions and speed breaks down. Partly because the processor does its own optimization on the fly and can do multiple things in parallel.
A programmer should be careful about second guessing the compiler. And a compiler should be careful about second guessing the processor.
I'm not sure if you're implying this is premature optimisation. It isn't.
It's a performance-sensitive standard-library function, the kind of thing that deserves optimisation in assembly. It's also the kind of problem that can be accelerated with SIMD, but that necessarily means more complex code. That's why the standard library implementations aren't always dead simple.
Here's a pretty in-depth discussion [0]. They discuss CPU throttling, caches, and being memory-bound.
It's amazing to me that every CPU doesn't have specialized memory copying instructions given how common an operation this is and how it can be a bottleneck. We have hundreds of vector instructions to optimize all kinds of special cases but little for this most general thing?
Intel has `rep movs`, which is recommended for general use; you can beat it for sizes < 128B using a specialized loop. A very good memcpy for general purposes would just branch on the size to either a specialized small copy or `rep movs` for larger sizes.
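A rough sketch of that shape, assuming x86-64 and GCC/Clang inline asm; the 128-byte cutoff is just the number from the comment above, and a real implementation would need measuring:

    #include <stddef.h>

    void *sketch_memcpy(void *dst, const void *src, size_t n) {
        unsigned char *d = dst;
        const unsigned char *s = src;
        if (n < 128) {
            /* small copies: a plain byte loop, cheap to decode and predict */
            for (size_t i = 0; i < n; i++) {
                d[i] = s[i];
            }
        } else {
            /* rep movsb copies RCX bytes from [RSI] to [RDI] */
            __asm__ volatile("rep movsb"
                             : "+D"(d), "+S"(s), "+c"(n)
                             :
                             : "memory");
        }
        return dst;
    }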
x86 does, it's called REP MOVS, and it copies entire cachelines at a time (if the size of the copy lets it.) AFAIK it has done this since the P6 family (Pentium II), and has only gotten better over time. As a bonus, it is only 2 bytes long. I believe the Linux kernel uses it as its default memcpy()/memmove().
Funnily enough, the REP family of instructions has existed at least since the original 8086/8088. It's very much a relic of the CISC heritage of the x86 ISA, and it was probably meant to make writing assembly code easier and to save space, rather than to improve performance.
I remember it was actually quite common in early PC programs, but I don't remember well enough whether it was also generated by compilers, or just existed in hand-optimized assembly (which was of course extremely common back then).
As I tried to indicate, the microbenchmark-guided unrolled loops look better than the simple `rep` instruction because the benchmarks aren't realistic. They don't suffer from mispredicted branches, they don't experience icache pressure. Basically the microarchitectural reality is absent in a small benchmark.
There's also the fact that `rep` startup cost was higher in the past than it is now. I think it started to get really fast around Ivy Bridge.
This is all discussed in Intel's optimization manual, by the way.
Yes, some compilers emit AVX move instructions on memcpy or compiler-generated moves (e.g. pass by value struct). (In addition to the rep move mentioned in the sibling posts)
> My preferred approach when optimising is to go for size first
Agreed; in my experience, more often than not, size is speed. Smaller memory usage means fewer cache misses, smaller memcpys, more chance of having data fit in a single word, etc...
That only tends to be the case for tiny loops where loop iterator updates can be folded into addressing displacements and that frees up an execution port needed for something else.
If you don't know what you are doing unrolling is just as likely to hurt performance because you don't fit into uOP cache and get less decode bandwidth as a result. Or you increase ICache pressure on macro benchmarks and hurt real world performance. Modern cores are really good at hiding loop accounting overhead.
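For concreteness, here's the kind of trade-off being described: the same reduction written plainly and hand-unrolled. Whether the unrolled version actually wins depends on the core and on how much else is competing for the icache/uop cache, which is exactly what a microbenchmark won't show:

    #include <stddef.h>

    long sum_plain(const long *a, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++) {
            s += a[i];
        }
        return s;
    }

    long sum_unrolled4(const long *a, size_t n) {
        /* 4x unrolled by hand: fewer loop-control instructions per element,
         * but roughly 4x the code for this loop body */
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        long s = s0 + s1 + s2 + s3;
        for (; i < n; i++) {
            s += a[i];  /* leftover elements */
        }
        return s;
    }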
In a world where all the electronic devices are composed of x86 cores and/or have a data and/or instruction cache, it could be true.
But there are plenty of architectures out there where space-time tradeoffs (by optimized compilers) are still a thing (my PoV is from the embedded industry).
I've found an interesting converse with macro benchmarks: small improvements to any of the top functions shown in your profiler output will have a larger-than-expected effect on the overall speed. I mean you make a change that improves just one function by a little bit on its own (within the benchmark), but the overall result improves by more than that. I think you end up freeing resources used by everything else.
The trick is to optimize the right macro benchmark- one that matches your customer's key use case I suppose.
I watched a great talk about this, and how to avoid problems that result from it. The presenter built a framework that made random changes to alignment of various blocks of code with two different versions of the same microbenchmark until it had high statistical confidence that either one was faster than the other, or it would eject and say that the difference between the two was statistically insignificant. Or something along those lines.
I'm looking for the video now, but I can't find it. Pretty sure it was CppCon but I'm not positive.
My team watched this several months ago now. "Eyeball statistics", a phrase he mentions at one stage, has particularly stuck with us from it. So many metrics dashboards etc rely on it to a disturbing degree, if you step back and really think about what they're _actually_ showing you. I almost want to put a warning banner on most of ours. "Beware Eyeball Statistics!"
The other one is 99th percentile latency. Lots of people think of 99th percentile latency as the latency that 1% of their users will encounter. But if your site loads 200 assets, almost every site load will hit 99th percentile latency.
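Back-of-the-envelope, assuming the 200 asset fetches are independent: the chance that a page load sees at least one asset at or beyond the 99th-percentile latency is 1 - 0.99^200, roughly 87%, so "almost every" load is about right.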
This[1] talk from Emery Berger at last year's Strange Loop does a good job discussing the impact of alignment and other 'small' changes on performance, and how to estimate the actual impact of changes on performance
Yep, I'm not even a hobbyist in this area, and immediately wondered if there was an alignment change because of it.
With that said, it sounds like at least with GCC, a global constant would go in the TEXT section of the binary and I can't think of why that would affect performance, especially since the variable was seemingly unused. Neat, I hope someone finds the thread to pull here :)
> it sounds like at least with GCC, a global constant would go in the TEXT section of the binary
I don't know why you think it sounds like that, but GCC (on Linux) puts global constants into the .rodata section (read-only data) and other global variables into the .data section.
> I can't think of why that would affect performance, especially since the variable was seemingly unused.
I'm guessing data alignment and corresponding changes in cache hits and misses. But I wonder if there's a scenario where this change changes the offsets of other variables from the base of the data section, which changes the encodings and therefore the sizes of the instructions accessing those variables, which overall changes code size and might cause code alignment issues. Probably too far-fetched.
On macOS, "gcc" puts strings that don't include a null byte in the TEXT section. I put quotes around gcc, since it seems that by default macOS aliases gcc to clang.
Interesting. And if you put a null byte at the end of that character array, it would put it in a different section? I don't have access to MacOS, but I would be interested in seeing the output of clang -S on that code (with and without a null byte), maybe you could add it to the repository?
If I'm remembering correctly (this was from 3 years ago), the string was moved to a different section if it had a null byte in the middle of the string. That's why coming up with the assembly I used was kinda tricky.
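A hypothetical test file for anyone who wants to reproduce this; compile with `clang -S test.c` (or inspect the object with `objdump`/`otool`) and compare where each array ends up. The names are made up, and the exact section choice is the thing being tested, not something asserted here:

    const char no_nul[]  = {'h', 'i'};              /* no NUL at all */
    const char end_nul[] = "hi";                    /* NUL only at the end */
    const char mid_nul[] = {'h', '\0', 'i', '\0'};  /* NUL in the middle */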
Thanks for the correction, hopefully my statement that I'm hardly a hobbyist makes it clear I'm not super sure :)
I guess I don't understand enough about how this variable sitting in memory might affect hit/miss/alignment especially if it's not used in the program as it's running.
Would it be that some variable that is used gets pushed off of a page in memory or something by the unused variable, so accessing it is slower? I guess that makes sense.
I have worked on algorithmic trading, where you basically stare at instructions and try to ensure that the CPU does not stall on branches, that memory is accessed in a strided pattern recognizable to the CPU, that there are no more concurrent access patterns than the CPU can coordinate, that memory being written does not by accident belong to another core, etc.
What is a "mystery" in the article is actually bread and butter for anybody tasked with seriously optimizing a piece of non-trivial code.
I'm pretty sure this is it. Changing the number of pointers shouldn't matter as they're already aligned but the string is an odd number of characters, which would change the offset of pretty much every other constant string and thus the performance of the respective lookups. They were probably just very lucky that it was aligned so well before removing it.
That was my first thought. Not sure what compiler OP is using but I would put "#pragma align" all over the place and see if the absence still makes a difference.
That wouldn't help if they're generating ELF files as .rodata is a string table, meaning it's tightly packed and as far as I know neither clang nor gcc pads it.
I wrote that commit on my laptop. I was unable to reproduce this on my desktop.
Re "why does this happen?", it certainly looks like an alignment thing. A more interesting question for me is "what should I do about it?" If future non-trivial changes affect the micro-benchmark numbers, how do I ensure that it's signal and not noise? Do I sprinkle some alignment directives throughout my code and hope for the best? Once the immediate symptoms are gone, how do I know that I've added enough?
Somebody suggested (off-list): "use compiler flags like the combo `-ffunction-sections -falign-functions=N` for values like 16, 32, 64 to help diagnose these issues quickly. You can also look at perf counters to find the problems, but each problem has a different counter so that can be hard. Once you know you have a problem, you can usually write code defensively against the issue. But it requires knowing a lot about the micro-architecture. Things like minimizing branch density, data dependency graph height, etc."
That's all very well (and better suggestions than nothing), but I'm hesitant to hill-climb using different compiler flags from what my users generally do. I also want to avoid over-fitting to my primary (day-to-day) machine or to a particular version of a particular C compiler.
I'm no expert, but if it is a memory alignment issue, isn't this undefined behavior?
If the string is unused, isn't the compiler free to allocate or not allocate it (unless it has some 'volatile' directive or something)? This means that doing things like turning off/on optimization could yield inconsistent results across machines, across compilers and even across versions of the same compiler.
Is this used for security so that side channel timing attacks can be used to glean information about private data? Is the speed differential noticeable such that someone has created an issue for it? Is exact consistency in speed across different compilers, architectures, etc. a priority of the project?
If the answer is no to all of those, then I find it difficult to justify 'fixing' the issue.
From what I understand, isn't this what GUIX is trying to do? Create consistent byte-for-byte compiled programs? There has to be a trade-off between consistency and speed. I also don't know how usable GUIX is or how valuable it is for your use case.
The compiler knows how it's aligning variables. It's a pure performance issue: the location of things in memory affects alignment to caching unit boundaries (accessing one more cache line for the same data) and cache pressure (data ends up cached in the same location as something else, causing unnecessary evictions while other cache locations remain underused).
> I wrote that commit on my laptop. I was unable to reproduce this on my desktop.
Then please modify the comment in the source to state that.
The described behavior of the measurements (the speed results unpredictably changing slightly in both directions) is actually "normal" on notebooks under most possible thermal configurations, and is not something that should even be "fixed" until the observed effects are consistent and big enough to be repeatable without tight thermal controls.
Edit: of course the pushed commit can't be changed. But the comment (if it is visible in the source -- haven't checked that) can. There should be some kind of a visible "resolution" of the question in the repo.
I haven't written a comment in the source yet, as I haven't further investigated and implemented a work-around yet to hang the comment on. It's low on the priority list.
It's mentioned elsewhere in this HN discussion, but I think I'd encourage watching this talk: https://www.youtube.com/watch?v=r-TLSBdHe1A. It's particularly relevant to what you're seeing.
Yeah, I've watched that video. That's the source of my "I've also been pointed to https://github.com/ccurtsinger/stabilizer but it sounds tied to LLVM 3.1 and hasn't had any substantial updates since 2013."
When weird performance things like this happens, it can be helpful to test in other environments. If the performance change does not happen in other operating systems or hardware then it suggests that the weird performance you are observing could be due to an unusual coincidence in your particular system.
That said, if you do want to figure out why deleting that global variable made a difference, then using Linux's perf tool might give you more information to work with. One time I had a weird program where inserting a NOP instruction in a certain location made it run twice as fast. After investigation we found out that the difference was branch prediction. The presence or absence of that NOP instruction affected the addresses of the jump targets of the inner loop's switch statement. For some reason, in the version without the NOP instruction those addresses resulted in lots of branch mispredictions. Perhaps because of a collision in the branch predictor's hash tables.
Are you sure it wasn't just instruction alignment? Inserting nops before loop jump targets to align the first loop body instruction to 8 or 16 bytes is a very common x86 thing most compilers do. See e.g. https://reverseengineering.stackexchange.com/a/2930.
Removing `const char* wuffs_base__note__i_o_redirect = "@base: I/O redirect";` removes not only the literal string, but also a global non-constant pointer. This perhaps affects various optimizations.
They should have used `const char* const wuffs_base__note__i_o_redirect =` or (preferably) `const char wuffs_base__note__i_o_redirect[] =`.
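For what it's worth, a minimal sketch of the three flavours (hypothetical names; the section placement is what GCC/Clang typically do on Linux, not a guarantee):

    const char *msg_ptr        = "@base: I/O redirect"; /* a writable pointer (in .data
                                                            or .data.rel.ro) plus the
                                                            string bytes in .rodata */
    const char *const msg_cptr = "@base: I/O redirect"; /* const pointer: the pointer
                                                            itself can also live in
                                                            read-only data */
    const char msg_arr[]       = "@base: I/O redirect"; /* no pointer at all: the array
                                                            is the constant, in .rodata */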
Sure, but that's not a compiler optimization issue. The compiler will emit the same instructions, they will just execute at a different speed if the data is elsewhere.
I remember changing a _constant_ int into a volatile global variable, and my code became 4 times faster. It was on Ubuntu 16.04. I might actually find code and post here later.
I worked on some code once for an app that would crash if you added a comment in a specific place. We removed the comment and then added a comment asking developers not to re-add the comment.
I was thinking the same thing. It would be very odd if it had an effect on C or C++ (even more than the effect in the original article) because comments are removed even before preprocessing. They do affect the __LINE__ macro though, so it is vaguely conceivable for it to have an effect on program behaviour.
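A contrived sketch of the __LINE__ point (nothing to do with the app in the story above):

    #include <stdio.h>

    int main(void) {
        /* deleting this comment line changes what gets printed below,
         * because __LINE__ expands to the current source line number */
        printf("%d\n", __LINE__);
        return 0;
    }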
Interesting. Were you writing a kernel driver? I've only ever used 'volatile' in embedded code where an ISR is touching a variable outside the scope, to prevent the compiler from optimizing the code away. What were you using 'volatile' for?
No, it was some random debugging of a LeetCode-style exercise, some useless math computation. I was just throwing every possible change at it until one worked. I knew that code could be faster than it was, because with PGO it was very fast.
As far as I see, the source code of the library is not plain C but .go, which is then transpiled to .c
Anyway, even if what we see in the resulting .c is "just some missing string", it's not necessarily obvious how the strings are supposed to be handled in the whole .c domain. Specifically, it can be that some other constants (not necessarily string constants, but constants in the same segment where the string constants are!) are used very frequently by the code being measured (let's call them "hotter" constants), and the removal of that one string constant simply affected the placement of the "hotter" constants.
In short, I suspect that the deletion of the constant is not the only way to get different results, but that the different results would also happen even when changing the order of the definition of constants, without removing any of them.
So the way I would attack that problem, if fixing the performance issues were desired, is: I'd measure the accesses of all the constants in the same segment and identify which ones are used during the whole run of these benchmarks -- those are "hot." The solution is then to modify the code to be less dependent on such constants inside the "hot" loops. Often the number of these that have to be fixed is low, but it can also turn out otherwise.
So I believe it's not as much of a mystery as it appears to be. (I have even more specific experiences and also suggestions. And I'm also looking for a new dream remote job. Anybody need this kind of expertise, for some reasonable longer term, or shorter term but seriously paid?)
Watching used constants and trimming them to fit the cache seems like a good idea. I'm a total noob PC-architecture wise so apologies beforehand, but doesn't the CPU handle the caching of most used memory locations itself? If there are an equal number of hot constants, how does changing the order of them help performance? CPU is going to cache the most used locations anyway. In this example, the constant was never used, so there is no reason for the CPU to cache it.
BTW, adding some contact info to your HN profile could help in the job search. Best of luck!
> doesn't the CPU handle the caching of most used memory locations itself?
Yes, but caches aren't "never failing magic". You can think of them as mechanisms that save some work in some scenarios. If the work thrown at them runs up against their limitations, worse results aren't surprising.
> If there are an equal number of hot constants, how does changing the order of them help performance? CPU is going to cache the most used locations anyway.
The caches have some limitations by design. It's easy to construct the code which stresses the caches more, and sometimes just the position of elements accessed can influence the "congestion" points due to the changed mapping between the addresses and the elements accessed.
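A toy example of that kind of congestion, assuming a typical 32 KiB, 8-way L1D with 64-byte lines (so 64 sets): addresses 4096 bytes apart all map to the same set, while a 4160-byte stride spreads the same number of lines across many sets. The cache numbers are assumptions about one common shape, not about the hardware in the thread:

    #include <stddef.h>

    /* Touch `lines` cache lines, `reps` times, at a given stride.
     * With stride 4096 and lines > 8, every access fights over one set;
     * with stride 4160 the same lines land in different sets. */
    long touch(const char *buf, size_t stride, int lines, int reps) {
        long s = 0;
        for (int r = 0; r < reps; r++) {
            for (int i = 0; i < lines; i++) {
                s += buf[(size_t)i * stride];
            }
        }
        return s;
    }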
> adding some contact info to your HN profile
Thanks. I still hope that if there's real interest, contact will happen in spite of it not being exceptionally easy -- which is also a kind of filter.
This reminds me of when I made a program faster by adding a print statement to it.
I wasn't able to find the original program, but I created a new one as similar to it as I remember, and it seems that this weird trick works in this one too.
When I uncomment the puts("hi") line in the code below, the time it takes to run the program consistently changes from 5.6 to 5.4 seconds on my machine if I compile without optimizations.
    #include <stdio.h>

    int main() {
        long tri_side = 1;
        long tri_area = 1;
        long sq_side = 1;
        long sq_area = 1;
        while (sq_side < 1000000000) {
            if (sq_area == tri_area) {
                printf("tri:%ld sq:%ld area:%ld\n", tri_side, sq_side, sq_area);
                //puts("hi");
            }
            if (sq_area < tri_area) {
                sq_area += sq_side * 2 + 1;
                sq_side += 1;
            } else {
                tri_area += tri_side + 1;
                tri_side += 1;
            }
        }
    }
Looks like other people are onto the same thing - alignment.
Especially if this is threaded code, what's probably happened is something (likely with some sort of locking primitive) that fit on one cache-line, now straddles two. The reverse is also possible where two items were landing on different cache lines and are now creating a false sharing problem.
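A minimal sketch of the false-sharing case (64 bytes is an assumed cache-line size):

    #include <stdalign.h>

    /* Two counters bumped by two different threads. */
    struct counters_packed {
        long a;             /* thread 1 increments this */
        long b;             /* thread 2 increments this: same cache line as 'a',
                               so the line ping-pongs between the cores */
    };

    struct counters_padded {
        alignas(64) long a;
        alignas(64) long b; /* forced onto its own 64-byte line */
    };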
It's likely a global and likely in the .bss (which comes immediately after .data, which is why it has alignment troubles when static strings change). It's usually pretty easy to binary search your way to the problematic module and variable.
I'm not sure I understand why they revert the change. You get random small +/- changes in performance with the change, but you get slightly cleaner code. What's the reason not to want that?
Letting "foo" denote "wuffs_base__note__i_o_redirect", it wasn't really that "foo was unused", it was that "foo was unused, after the previous (artificial) commit renamed all-but-one of the foo references to another existing string constant". That previous commit set it up so that this commit, which exhibits the performance difference, has a trivially small diff. But the two commits combined to remove a "foo" that was actually in use, so the combination was rolled back.
I feel like there's a (sometimes blurry) line between being hesitant about dangerous changes (good thing) and hoping that if you ignore something you don't understand, it will keep working as you expect (not a good thing).
Purely leaving an unused variable in place due to weird impact on performance is in the second category for me. But maybe there are other aspects - that's why I'm curious and asking about it
He should experiment with shortening his ENV vars one byte at a time, to see similar benchmarking effects. The environment affects the process's memory layout and alignment, and if you're unlucky, bad alignment can cost a few percent.
I've never heard of this "wuffs" project before; apparently it's a programming language. It looks like a wild project, does anyone have any details on what this is?
You need to look at the disassembly of the generated binary to make sense of this sort of performance variation (paying attention to line cache boundaries for code and data), and even so, it is highly non-trivial. The performance counters found in modern processors sometimes help (https://en.wikipedia.org/wiki/Hardware_performance_counter ).
https://www.agner.org/optimize/microarchitecture.pdf contains the sort of information you need to have absorbed before you even start investigating. In most cases, it's not worth acquiring the expertise for 5% one way or the other in micro-benchmarks. If you care about these 5%, you shouldn't be programming in C in the first place.
And then there is this anecdote:
My job is to make tools to detect subtle undefined behaviors in C programs. I once had the opportunity to report a signed arithmetic overflow in a library that its authors considered, rightly or wrongly, to be performance-critical. My suggestion was:
… this is not one of the subtle undefined behaviors that we are the only ones to detect, UBSan would also have told you that the library was doing something wrong with “x + y” where x and y are ints. The good news is that you can write “(int)((unsigned)x + y)”, this is defined and it behaves exactly like you expected “x + y” to behave (but had no right to).
And the answer was “Ah, no, sorry, we can't apply this change, I ran the benchmarks and the library was 2% slower with it. It's a no, I'm afraid”.
The thing is, I am pretty sure that any modern optimizing C compiler (the interlocutor was using Clang) has been generating the exact same binary code for the two constructs for years (unless it applies an optimization that relies on the addition not overflowing in the “x + y” case, but then the authors would have noticed). I would bet a house that the binary that was 2% slower in benchmarks was byte-identical to the reference one.
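For reference, the two spellings being compared, written as hypothetical wrappers rather than the actual library code:

    int add_signed(int x, int y) {
        return x + y;                    /* undefined behaviour if the sum overflows int */
    }

    int add_via_unsigned(int x, int y) {
        /* unsigned addition wraps by definition; converting the result back to
         * int is implementation-defined rather than undefined, and wraps on
         * mainstream compilers, which is the behaviour people usually expect */
        return (int)((unsigned)x + y);
    }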
I wouldn't expect aerospace, since I have been told embedded programmers in that field routinely disable compiler optimization, in the chance that a compiler bug or overzealous UB exploitation might introduce a bug into previously working code. Hard realtime requirements demand fast code, but not necessarily efficient code.
Performance counters are vital, and you don't need to grovel the disassembly yourself in association with profiling, even if it's feasible. Get a tool to do it for you; MAQAO is one in that area (x86-specific, and unfortunately part-proprietary).
Anyway, yes, measurements are what you need, rather than guesses, along with correctness checks, indeed.
The presence (or absence) of a global variable has no effect on the rest of the generated code. Nothing changed. You could possibly look at the alignment of the in-memory representation generated by the relocator / runtime dynamic linker. Likely, some bits of code are now on the same cache line that were previously separate, and vice versa.
> deleting this one line of code can have a dramatic effect on seemingly unrelated performance micro-benchmarks. Some numbers get better (e.g. +5%), some numbers get worse (e.g. -10%). The same micro-benchmark can get faster on one C compiler but slower on another
There's a fundamental disconnect that makes it difficult for humans to reason about performance in computer programs. Because the speed of light is so slow, computer architecture as we know it will always rely on cache and OoO to be fast. The human brain does seem to work out of order, but it's only used to thinking about a world that runs in order. When we use theory of mind, we don't model other people's minds, we use our own as a model for theirs; see mirror neurons[1].
Because of this, standard code benchmarks are not very useful, unless they can demonstrate order-of-magnitude speedups. Even something like a causal profiler[2][3][4], which attempts to control for the volatile aspects of performance, is of limited use; it cannot control for all variables and its results will likely be invalidated by the same architectural variation it tries to control for. Instead (with respect to performance) we should focus on three factors:
I have to admit... I get a "geek on" whenever I read C programmer commit messages. I don't know what it is but those people know how to write detailed, interesting and often entertaining commit messages. This one was a perfect example of all 3.