The fact that GCC and Clang can identify a manual implementation of count-bits-set and substitute it with an invocation of popcount is fascinating. What other idioms of similar or higher complexity can these compilers recognize and convert to instructions?
It should be pointed out that they can identify some implementations but miss many common idiomatic ones that may be superior if you do not have a popcount() instruction. I am often surprised by the algorithms they can't see, so you will need to check the code generation. This is true for many of the bit-twiddling instructions available on modern CPUs.
As for why someone might care: most bit-twiddling intrinsics cannot be used in `constexpr` functions in C++ (though this is starting to change). Finding a plain C++ implementation of a bit-twiddling operation that is reliably converted to the underlying instruction allows you to use the same constexpr implementation in both compile-time and run-time contexts with correct code generation.
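As a concrete illustration, here is the classic SWAR popcount from Hacker's Delight; whether a given compiler version collapses it to a single instruction varies, so treat this as a sketch and check the codegen on Godbolt:

```cpp
#include <cstdint>

// Classic SWAR popcount (Hacker's Delight). Being plain C++, it works
// in constant expressions; whether a given GCC/Clang version collapses
// it to a single popcnt (with e.g. -mpopcnt) should be verified on
// Godbolt rather than assumed.
constexpr int popcount32(std::uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return static_cast<int>((x * 0x01010101u) >> 24);
}

static_assert(popcount32(0xFFu) == 8, "usable at compile time");
```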
Does this imply that it might be better to use the "worse" or "naïve" implementation, if you know that the compiler will identify it and do something architecture-appropriate?
Yes. Surprisingly, many naive implementations you might assume would be identified aren't. You are far more likely to write a naive implementation that is not identified than one that is. I've spent a few weekends testing many different implementations to better understand compilers' limitations in identifying bit-twiddling idioms, and it still doesn't work nearly as well as you might hope.
Also, not every compiler will identify a given bit-twiddling implementation. Clang seems to identify more of these idiomatic implementations than GCC in my experience, but that is anecdotal. The set of code patterns they can identify is not documented anywhere that I've seen, though it must be in the compiler source code somewhere.
I strongly disagree with the sibling comments. It is better and clearer to call the compiler built-in than to use some magic incantation that gets the compiler to optimize it and hope that future compilers don't regress.
This does not address the real-world case where built-ins will not compile in some valid code contexts. It is not "better" to do something that literally doesn't work. No one is doing this for fun; it is working around a well-known limitation of current compilers.
It is an unfortunate limitation of built-ins that reduces their utility, which is why fixing it is on the roadmap for most popular compilers.
On top of that, the popcount built-in will always compile even if there is no instruction for it, since the compiler can generate fallback code. It actually does so for a plain invocation of gcc or clang targeting x86-64, as the original x86-64 instruction set did not include popcnt. You need to pass an -march that supports the instruction, or -mpopcnt to enable it explicitly. Handling all of those built-ins properly is tedious.
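A minimal sketch of that behavior (the exact fallback differs between compilers and versions, so worth confirming on Godbolt):

```cpp
#include <cstdint>

// With plain `gcc -O2` on x86-64 this typically becomes a libgcc call
// or inline fallback code, because baseline x86-64 lacks popcnt; with
// -mpopcnt (or an -march that implies it) it becomes one instruction.
int count_bits(std::uint32_t x) {
    return __builtin_popcount(x);
}
```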
Assuming they are correct that the 'worse on paper' implementation is the only one detected (I don't know whether that's true), that would be the logical conclusion, yes.
> What other idioms of similar or higher complexity can these compilers recognize and convert to instructions?
Both GCC and clang (really LLVM) will try to autovectorize your loops, replacing operations on individual words with SIMD instructions (provided the compilation target supports them).
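For example (assuming an x86 target with SSE/AVX; check the assembly to confirm your compiler actually vectorizes it):

```cpp
#include <cstddef>

// A textbook autovectorization candidate: at -O2/-O3, GCC and Clang
// will usually turn this scalar loop into SIMD adds when the target
// supports them.
int sum(const int* a, std::size_t n) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += a[i];
    return total;
}
```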
Similarly, both should replace `popcnt`-style functions (e.g., count-leading-zeros) with their instruction equivalents, provided the implementation is idiomatic enough. The overarching challenge is that users who don't know about instructions like `popcnt` and `clz` are also less likely to write implementations that are easy to detect and swap.
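One such idiom, as a sketch (LLVM's loop-idiom recognition handles variants of this shape; GCC is less consistent, so verify the codegen):

```cpp
// Counting leading zeros by shifting: LLVM can recognize loops of this
// shape and emit ctlz/lzcnt, but recognition is fragile, so check the
// generated assembly for your compiler and flags.
int count_leading_zeros32(unsigned x) {
    int n = 32;
    while (x) {
        x >>= 1;
        --n;
    }
    return n;
}
```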
I don't recall seeing bit counting in the LLVM idiom recognition, but it's been years since I pored over it. Could you maybe Godbolt an example where this substitution happens? I'd be happy to be wrong in this case :)
It could be a bug; I don't see an immediate reason to zero eax in this fragment. It's easy to miss the reasons why compilers zero apparently unused things, though: it can be about breaking data dependencies or making sure flags are in a known state.
It is a bug, but in the CPU. Until recently, on Intel CPUs, the popcnt instruction had a false dependency on its destination register, which could be a performance issue. Explicitly clearing the register (which is very fast, almost free) is an easy way to work around it.
A lot of things. Even somewhat complicated arithmetic will become a constant if possible. Certain formulas are hardcoded (e.g. the constant-time sum of the first n positive integers). It’ll perform inlining and tail-recursion elimination. Certain loops can be vectorized automatically. And a lot of library functions are “intrinsics” that the compiler knows a lot about: if you implement your own, it might replace it with the standard one, or it may specialize your call to the standard library function to suit your specific use (e.g. a 16-byte memcpy for local variables might just be a load into the XMM registers). Godbolt is a really great resource for seeing what the compiler will and will not do for you.
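The sum-of-first-n case is easy to try yourself (Clang in particular does this via its scalar-evolution machinery; results vary by compiler and version):

```cpp
// Clang typically replaces this loop with the closed form n*(n+1)/2.
// The signed induction variable helps: signed overflow is UB, so the
// compiler may assume the loop terminates.
int sum_to_n(int n) {
    int total = 0;
    for (int i = 1; i <= n; ++i)
        total += i;
    return total;
}
```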
> Even somewhat complicated arithmetic will become a constant if possible.
That's one of the simplest optimisations to implement even in a toy compiler. It can be done at the same time as parsing and comes practically "for free": if all operands of an operator are constant, perform the operation and replace the operator and its operands with the constant result, and repeat this recursively.
Everything else is done based on pattern-matching the AST and performing the appropriate transformations.
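A minimal sketch of folding at AST-construction time (all names here are illustrative, not from any real compiler):

```cpp
#include <memory>
#include <variant>

// Hypothetical toy AST: an expression is either an integer literal or
// a binary operation over two subexpressions.
struct Expr;
using ExprPtr = std::unique_ptr<Expr>;
struct BinOp { char op; ExprPtr lhs, rhs; };
struct Expr { std::variant<long, BinOp> v; };

// Fold while building the tree: if both operands are already literals,
// evaluate now and return a literal node instead of a BinOp node.
ExprPtr make_binop(char op, ExprPtr lhs, ExprPtr rhs) {
    auto* l = std::get_if<long>(&lhs->v);
    auto* r = std::get_if<long>(&rhs->v);
    if (l && r) {
        long result = op == '+' ? *l + *r
                    : op == '-' ? *l - *r
                    :             *l * *r;  // only +, -, * in this toy
        return std::make_unique<Expr>(Expr{result});
    }
    return std::make_unique<Expr>(Expr{BinOp{op, std::move(lhs), std::move(rhs)}});
}
```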
Hm, the last time I wrote constant folding I did it in an optimizer pass over the AST (as part of a lecture, we built a custom C frontend for LLVM and added some optimization passes). What you describe replaces e.g. "x = 3 + 4;", but you get that for free once your optimizer is able to replace more complicated things (e.g. "a = 3; b = 4; x = a + b;").
Also, your parser becomes more complicated; when writing a toy compiler, I think having a clean parser is worth a lot.
The question of whether the CPU can identify an implementation of it and perform it quicker in dedicated hardware is more interesting, and leads to the argument for CISC: if POPCNT had been there since early x86 but microcoded, it would simply get faster once the microarchitecture provided dedicated hardware, with no need to recompile or detect CPU features. On older CPUs without the hardware, it wouldn't be any slower, since the CPU would internally execute the equivalent of a manual implementation, only without needing to fetch and decode each individual macro-instruction.
The argument for RISC was that the CPU was simpler but could be made much faster. That proved to be overly simplistic. Like the rest of computing it was solved with another layer of abstraction, where CISC processors break instructions into µ-ops.
It also turned out that memory latency would not scale, so CISC can be thought of as an instruction-stream compression mechanism. Specifically, x86 decode costs a more or less fixed-size block of silicon, and we have ever more transistors available to spend on it.
Then RISC ended up adding a bunch of the complexity it was supposedly going to avoid, because in the real world we just need to make the damn thing fast, purity be damned. It didn't help that some architectures made really bad design decisions (hello, branch delay slots). Also, many RISC CPUs have microcode and break down some instructions, though not nearly to the degree a classic CISC CPU does.
It's all very messy and all the armchair theorizing in the world turns out not to matter. What matters is taking the silicon and measuring your workload on it.
It's not that fast; IIRC the microcode implementation is either just a loop over all bits or a nibble/byte-wide lookup table for the result. In my experience it doesn't compare well with doing it manually.
Byte (endian) swapping is another one. GCC and Clang can recognize (some) implementations of that and replace them with a single-instruction version (like bswap).
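For instance, the classic shift-and-mask form (worth double-checking on Godbolt for your exact compiler version):

```cpp
#include <cstdint>

// GCC and Clang both recognize this pattern and usually emit a single
// bswap on x86 (rev on ARM).
std::uint32_t byteswap32(std::uint32_t x) {
    return ((x & 0x000000FFu) << 24) |
           ((x & 0x0000FF00u) <<  8) |
           ((x & 0x00FF0000u) >>  8) |
           ((x & 0xFF000000u) >> 24);
}
```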
On the other hand, until relatively recently they were bad at recognizing loads/stores of larger integers implemented using byte loads and shifts, and at optimizing those into larger loads (for architectures where unaligned loads are allowed). Writing it that way lets you delegate all little/big/weird-endian concerns to the compiler.
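A sketch of the pattern in question (newer GCC/Clang usually collapse it into a single wider load, plus a bswap on big-endian targets; older ones often didn't):

```cpp
#include <cstdint>

// Portable little-endian 32-bit load: correct on any endianness, and
// modern compilers typically merge the four byte loads into one.
std::uint32_t load_le32(const unsigned char* p) {
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}
```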
It's not recognizing a loop, but I've always liked the "turn integer division by a constant into a multiplication by a magic number" optimization: https://godbolt.org/z/Z0FgPW
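In source form it's just this; the compiler replaces the slow divide with a multiply by a fixed-point reciprocal plus shifts (the Godbolt link above shows the exact sequence):

```cpp
#include <cstdint>

// No div instruction is emitted here: the compiler multiplies by a
// precomputed "magic" constant and shifts, which is much faster.
std::uint32_t div_by_7(std::uint32_t x) {
    return x / 7u;
}
```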
This and many many other bit-twiddling hacks are described in Hacker's Delight: https://hackersdelight.org/. Compiler developers tend to go through this book and implement all the ones that seem interesting to them.
Yes; in fact memcpy/memmove are intrinsics in LLVM IR and are among the more common targets of its idiom recognizer. Fun fact: many loop replacements and vectorizations depend on a signed induction variable, so that the compiler doesn't have to prove it doesn't overflow. (Unsigned overflow is defined; signed overflow is UB and thus can be ignored.)
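A sketch of why the signedness matters (hypothetical example; exact behavior varies by compiler and target):

```cpp
// With a signed index the compiler may assume i never wraps (signed
// overflow is UB), so it can widen i to a 64-bit register and
// vectorize freely; with a 32-bit unsigned index it must preserve
// wraparound semantics, which can complicate or block the transform.
void scale(float* a, int n) {
    for (int i = 0; i < n; ++i)
        a[i] *= 2.0f;
}
```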
That kind of thing is actually fully general, and is called scalar evolution (SCEV) analysis. LLVM for example has a full engine [1] for solving all sorts of polynomial recurrences, so you can solve, say, sum of i^2 with clang too: https://godbolt.org/z/-nHe78
I assume this is advantageous because then they can unroll the loops and parallelize instructions across loop 'boundaries'? I wonder if there are other advantages.