The fact that GCC and Clang can identify a manual implementation of count-bits-set and substitute it with an invocation of popcount is fascinating. What other idioms of similar or higher complexity can these compilers recognize and convert to instructions?
It should be pointed out that they can identify some implementations but miss many common idiomatic ones that may be superior if you do not have a popcount() instruction. I am often surprised by the algorithms they can't see, so you will need to check the code generation. This is true for many of the bit-twiddling instructions available on modern CPUs.
As for why someone might care: most bit-twiddling intrinsics cannot be used in `constexpr` functions in C++ (though this is starting to change). Finding a plain C++ implementation of a bit-twiddling operation that is reliably converted to the underlying instruction allows you to use the same constexpr implementation in both compile-time and run-time contexts with correct code generation.
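As a concrete illustration, here is the classic SWAR popcount from Hacker's Delight; whether a given compiler version collapses it to a single instruction varies, so treat this as a sketch and check the codegen on Godbolt:

```cpp
#include <cstdint>

// Classic SWAR popcount (Hacker's Delight). Being plain C++, it works
// in constant expressions; whether a given GCC/Clang version collapses
// it to a single popcnt (with e.g. -mpopcnt) should be verified on
// Godbolt rather than assumed.
constexpr int popcount32(std::uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return static_cast<int>((x * 0x01010101u) >> 24);
}

static_assert(popcount32(0xFFu) == 8, "usable at compile time");
```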
Does this imply that it might be better to use the "worse" or "naïve" implementation, if you know that the compiler will identify it and do something architecture-appropriate?
Yes. Surprisingly, many naive implementations you might assume would be identified aren't. You are far more likely to write a naive implementation that is not identified than one that is. I've spent a few weekends testing many different implementations to better understand compilers' limitations in identifying bit-twiddling idioms, and it still doesn't work nearly as well as you might hope.
Also, not every compiler will identify a given bit-twiddling implementation. Clang seems to identify more of these idiomatic implementations than GCC in my experience, but that is anecdotal. The set of code patterns they can identify is not documented anywhere that I've seen, though it must be in the compiler source code somewhere.
I strongly disagree with the sibling comments. It is better and clearer to call the compiler built-in than to use some magic incantation that gets the compiler to optimize it and hope that future compilers don't regress.
This does not address the real-world case where built-ins will not compile in some valid code contexts. It is not "better" to do something that literally doesn't work. No one is doing this for fun; it is working around a well-known limitation of current compilers.
It is an unfortunate limitation of built-ins that reduces their utility, which is why fixing it is on the roadmap for most popular compilers.
On top of that, the popcount built-in will always compile even if there is no instruction for it, since the compiler can generate fallback code. It actually does so for a plain invocation of gcc or clang targeting x86-64, as the original x86-64 instruction set did not include popcnt. You need to pass an -march that supports the instruction, or -mpopcnt to enable it explicitly. Handling all of those built-ins properly is tedious.
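A minimal sketch of that behavior (the exact fallback differs between compilers and versions, so worth confirming on Godbolt):

```cpp
#include <cstdint>

// With plain `gcc -O2` on x86-64 this typically becomes a libgcc call
// or inline fallback code, because baseline x86-64 lacks popcnt; with
// -mpopcnt (or an -march that implies it) it becomes one instruction.
int count_bits(std::uint32_t x) {
    return __builtin_popcount(x);
}
```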
Assuming they are correct that the 'worse on paper' implementation is the only one detected (I don't know whether that's true), that would be the logical conclusion, yes.
> What other idioms of similar or higher complexity can these compilers recognize and convert to instructions?
Both GCC and clang (really LLVM) will try to autovectorize your loops, replacing operations on individual words with SIMD instructions (provided the compilation target supports them).
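For example (assuming an x86 target with SSE/AVX; check the assembly to confirm your compiler actually vectorizes it):

```cpp
#include <cstddef>

// A textbook autovectorization candidate: at -O2/-O3, GCC and Clang
// will usually turn this scalar loop into SIMD adds when the target
// supports them.
int sum(const int* a, std::size_t n) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += a[i];
    return total;
}
```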
Similarly, both should replace `popcnt`-style functions (e.g., count-leading-zeros) with their instruction equivalents, provided the implementation is idiomatic enough. The overarching challenge is that users who don't know about instructions like `popcnt` and `clz` are also less likely to write implementations that are easy to detect and swap.
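One such idiom, as a sketch (LLVM's loop-idiom recognition handles variants of this shape; GCC is less consistent, so verify the codegen):

```cpp
// Counting leading zeros by shifting: LLVM can recognize loops of this
// shape and emit ctlz/lzcnt, but recognition is fragile, so check the
// generated assembly for your compiler and flags.
int count_leading_zeros32(unsigned x) {
    int n = 32;
    while (x) {
        x >>= 1;
        --n;
    }
    return n;
}
```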
I don't recall seeing bit counting in the LLVM idiom recognition, but it's been years since I pored over it. Could you maybe Godbolt an example where this substitution happens? I'd be happy to be wrong in this case :)
It could be a bug; I don't see an immediate reason to zero eax in this fragment. It's easy to miss the reasons why compilers zero apparently unused things, though: it can be about breaking data dependencies or making sure flags are in a known state.
It is a bug, but in the CPU. Until recently, on Intel CPUs, the popcnt instruction had a false dependency on its destination register, which could be a performance issue. Explicitly clearing the register (which is very fast, almost free) is an easy way to work around it.
A lot of things. Even somewhat complicated arithmetic will become a constant if possible. Certain formulas are hardcoded (e.g. the constant-time sum of the first n positive integers). It’ll perform inlining and tail-recursion elimination. Certain loops can be vectorized automatically. And a lot of library functions are “intrinsics” that the compiler knows a lot about: if you implement your own, it might replace it with the standard one, or it may specialize your call to the standard library function to suit your specific use (e.g. a 16-byte memcpy for local variables might just be a load into the XMM registers). Godbolt is a really great resource for seeing what the compiler will and will not do for you.
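The sum-of-first-n case is easy to try yourself (Clang in particular does this via its scalar-evolution machinery; results vary by compiler and version):

```cpp
// Clang typically replaces this loop with the closed form n*(n+1)/2.
// The signed induction variable helps: signed overflow is UB, so the
// compiler may assume the loop terminates.
int sum_to_n(int n) {
    int total = 0;
    for (int i = 1; i <= n; ++i)
        total += i;
    return total;
}
```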
> Even somewhat complicated arithmetic will become a constant if possible.
That's one of the simplest optimisations to implement even in a toy compiler. It can be done at the same time as parsing and comes practically "for free": if all operands of an operator are constant, perform the operation and replace the operator and its operands with the constant result, and repeat this recursively.
Everything else is done based on pattern-matching the AST and performing the appropriate transformations.
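A minimal sketch of folding at AST-construction time (all names here are illustrative, not from any real compiler):

```cpp
#include <memory>
#include <variant>

// Hypothetical toy AST: an expression is either an integer literal or
// a binary operation over two subexpressions.
struct Expr;
using ExprPtr = std::unique_ptr<Expr>;
struct BinOp { char op; ExprPtr lhs, rhs; };
struct Expr { std::variant<long, BinOp> v; };

// Fold while building the tree: if both operands are already literals,
// evaluate now and return a literal node instead of a BinOp node.
ExprPtr make_binop(char op, ExprPtr lhs, ExprPtr rhs) {
    auto* l = std::get_if<long>(&lhs->v);
    auto* r = std::get_if<long>(&rhs->v);
    if (l && r) {
        long result = op == '+' ? *l + *r
                    : op == '-' ? *l - *r
                    :             *l * *r;  // only +, -, * in this toy
        return std::make_unique<Expr>(Expr{result});
    }
    return std::make_unique<Expr>(Expr{BinOp{op, std::move(lhs), std::move(rhs)}});
}
```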
Hm, the last time I wrote constant folding I did it in an optimizer pass over the AST (as part of a lecture, we built a custom C frontend for LLVM and added some optimization passes). What you describe replaces e.g. "x = 3 + 4;", but you get that for free once your optimizer is able to replace more complicated things (e.g. "a = 3; b = 4; x = a + b;").
Also, your parser becomes more complicated; when writing a toy compiler, I think having a clean parser is worth a lot.
The question of whether the CPU can identify an implementation of it and perform it quicker in dedicated hardware is more interesting, and leads to the argument for CISC: if POPCNT had been there since early x86 but microcoded, it would simply get faster once the microarchitecture provided dedicated hardware, with no need to recompile or detect CPU features. On older CPUs without the hardware, it wouldn't be any slower, since the CPU would internally execute the equivalent of a manual implementation, only without needing to fetch and decode each individual macro-instruction.
The argument for RISC was that the CPU was simpler but could be made much faster. That proved to be overly simplistic. Like the rest of computing it was solved with another layer of abstraction, where CISC processors break instructions into µ-ops.
It also turned out that memory latency would not scale, so CISC can be thought of as an instruction-stream compression mechanism. Specifically, x86 decode costs a more or less fixed-size block of silicon, and we have ever more transistors available to spend on it.
Then RISC ended up adding a bunch of the complexity it was supposedly going to avoid, because in the real world we just need to make the damn thing fast, purity be damned. It didn't help that some architectures made really bad design decisions (hello, branch delay slots). Also, many RISC CPUs have microcode and break down some instructions, though not nearly to the degree a classic CISC CPU does.
It's all very messy and all the armchair theorizing in the world turns out not to matter. What matters is taking the silicon and measuring your workload on it.
It's not that fast; IIRC the microcode implementation is either just a loop over all bits or a nibble/byte-wide lookup table for the result. In my experience it doesn't compare well with doing it manually.
Byte (endian) swapping is another one. GCC and Clang can recognize (some) implementations of that and replace them with a single-instruction version (like bswap).
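For instance, the classic shift-and-mask form (worth double-checking on Godbolt for your exact compiler version):

```cpp
#include <cstdint>

// GCC and Clang both recognize this pattern and usually emit a single
// bswap on x86 (rev on ARM).
std::uint32_t byteswap32(std::uint32_t x) {
    return ((x & 0x000000FFu) << 24) |
           ((x & 0x0000FF00u) <<  8) |
           ((x & 0x00FF0000u) >>  8) |
           ((x & 0xFF000000u) >> 24);
}
```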
On the other hand, until relatively recently they were bad at recognizing loads/stores of larger integers implemented using byte loads and shifts, and at optimizing those into larger loads (for architectures where unaligned loads are allowed). Writing it that way lets you delegate all little/big/weird-endian concerns to the compiler.
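A sketch of the pattern in question (newer GCC/Clang usually collapse it into a single wider load, plus a bswap on big-endian targets; older ones often didn't):

```cpp
#include <cstdint>

// Portable little-endian 32-bit load: correct on any endianness, and
// modern compilers typically merge the four byte loads into one.
std::uint32_t load_le32(const unsigned char* p) {
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}
```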
It's not recognizing a loop, but I've always liked the "turn integer division by a constant into a multiplication by a magic number" optimization: https://godbolt.org/z/Z0FgPW
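In source form it's just this; the compiler replaces the slow divide with a multiply by a fixed-point reciprocal plus shifts (the Godbolt link above shows the exact sequence):

```cpp
#include <cstdint>

// No div instruction is emitted here: the compiler multiplies by a
// precomputed "magic" constant and shifts, which is much faster.
std::uint32_t div_by_7(std::uint32_t x) {
    return x / 7u;
}
```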
This and many many other bit-twiddling hacks are described in Hacker's Delight: https://hackersdelight.org/. Compiler developers tend to go through this book and implement all the ones that seem interesting to them.
Yes; in fact memcpy/memmove are intrinsics in LLVM IR and are among the more common targets of its idiom recognizer. Fun fact: many loop replacements and vectorizations depend on a signed induction variable, so that the compiler doesn't have to prove it doesn't overflow. (Unsigned overflow is defined; signed overflow is UB and thus can be ignored.)
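A sketch of why the signedness matters (hypothetical example; exact behavior varies by compiler and target):

```cpp
// With a signed index the compiler may assume i never wraps (signed
// overflow is UB), so it can widen i to a 64-bit register and
// vectorize freely; with a 32-bit unsigned index it must preserve
// wraparound semantics, which can complicate or block the transform.
void scale(float* a, int n) {
    for (int i = 0; i < n; ++i)
        a[i] *= 2.0f;
}
```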
That kind of thing is actually fully general, and is called scalar evolution (SCEV) analysis. LLVM for example has a full engine [1] for solving all sorts of polynomial recurrences, so you can solve, say, sum of i^2 with clang too: https://godbolt.org/z/-nHe78
I assume this is advantageous because then they can unroll the loops and parallelize instructions across loop 'boundaries'? I wonder if there are other advantages.