Popcount CPU instruction (vaibhavsagar.com)
350 points by mmastrac on Sept 9, 2019 | 162 comments



.NET Core 3 (in preview, due out Sep 23) includes support for popcount as part of the BitOperations class: https://docs.microsoft.com/en-us/dotnet/api/system.numerics..... There's a recent post describing hardware intrinsics in .NET Core: https://devblogs.microsoft.com/dotnet/hardware-intrinsics-in...

There are a lot of potential use cases, but one of the early discussions was around quickly scanning for known HTTP headers. You can see that in use here: https://github.com/aspnet/AspNetCore/blob/caa910ceeba5f2b2c0...

If you really need help falling asleep, here's the discussion going back to 2015: https://github.com/dotnet/corefx/issues/2209

Disclaimers: Microsoft employee, Nazgûl


GHC also includes a `popCount` function that is implemented in terms of the built-in instruction: https://haskell-works.github.io/posts/2018-08-22-pdep-and-pe...


Cool!

It would be nice to get bitscanforward/reverse too.

I created this library long ago to get popcnt on desktop .net: https://github.com/omgtehlion/netintrinsics

Blog post with more details (in Russian): https://m.habr.com/ru/post/239619/

EDIT: found lzcnt/tzcnt on a linked page, thanks! But what about bswap, or am I asking too much? ))


.NET is getting some really cool stuff lately. Sadly, it'll likely be irrelevant to me for some time to come.


I used to use "implement popcount" as an interview question (for C programmers). I'm not super convinced that it was a good question, although it definitely tells you quite quickly how comfortable a candidate is with bit manipulation.

There are three or four reasonable ways to implement it in C, including one weird trick, but which is most efficient tends to be very CPU-dependent (at the time I worked on a project that targeted four or five esoteric CPUs, so I had the luxury of being able to verify this).


I remember in one interview I was asked to write a function that returns true iff x is a power of two, so I wrote return 1 == __builtin_popcount(x). They liked that.


I did the same thing!

But they didn't like it. So I wrote the naivest possible popcount and returned 1 == naive_bitcount(n).

They didn't like that. I explained to them that the compiler will optimize out the loop so this will be fast and that they should check it on godbolt. (Did I mention that I was coding on a whiteboard?) There's a loop in the source code, but there isn't a loop in the machine code, and it's like three instructions.

They didn't like that either, so I worked out the n & (n-1) method from first principles. I didn't do it optimally; there was some redundancy in my code, which the interviewer smugly explained to me. I can't remember if I told him "that's nice, but it's still an extra instruction compared to the trivial, obvious, easy to maintain method I showed you ten minutes ago." I'm pretty sure I didn't, too afraid to make a bad impression.

I didn't get the job. In hindsight, I'm glad I didn't. Those guys are dicks.


> I explained to them that the compiler will optimize out the loop so this will be fast and that they should check it on godbolt. (Did I mention that I was coding on a whiteboard?) There's a loop in the source code, but there isn't a loop in the machine code, and it's like three instructions.

Did you check this yourself? These sorts of "grand" pattern-matches tend to be low-productivity optimizations for compilers (the speedups you get are not worth the extra compile-time). LLVM tends to have a higher tolerance for this than GCC. But even looking at LLVM's source code, the "naivest possible popcount" [1] is not going to pass the loop idiom recognizer for a popcount.

> I can't remember if I told him "that's nice, but it's still an extra instruction compared to the trivial, obvious, easy to maintain method I showed you ten minutes ago."

n & (n - 1) should come out to 2 cycles of latency. 1 == popcount(n) would come out to 4 cycles of latency (3 latency for the POPCNT, 1 for the comparison at the end).

[1] I'm assuming by this that you mean iterating over all bits and individually checking each one for set/unset. Both gcc (since version 9) and LLVM handle the case where you iterate through set bits in a for (; x; x = x & (x - 1)) loop.


I would not be surprised if the compiler detected the second version as well and replaced it with a popcnt. In any case, it seems like you ended up with the stereotypical “bad interviewer” so I doubt it would have mattered to point this out.


That was you interviewing them. They failed.


The interviewer felt smug, so their mission was accomplished.


If that was their mission (rather than to hire someone), they succeeded in their personal mission, at the expense of the company's intended mission for them.


popcnt is generally going to take multiple cycles though; I'd expect the n&(n-1) method to be more performant. It's also vectorizable and more portable.


Intel AVX-512 has a vector popcount, VPOPCNTDQ.


builtin popcount when compiled for an architecture with hardware popcount should be 4 bytes per 1 cycle.

The n&(n-1) method to test for a power of 2 cannot be faster than that.


POPCNT has a latency of 3 cycles on most x86 hardware.


You are right. I was thinking of the number of operations.

Is the popcnt test slower than the n&(n-1) test? (Edit: Ooops! I see you already addressed that at https://news.ycombinator.com/item?id=20918136 .)


A popcnt/test/jmp sequence should have roughly 4 cycles of latency: a test+jmp can be fused into one uop, and popcnt would be 3 cycles of latency. n&(n-1) == 0 would compile down into a DEC, TEST, JMP, which is 2 cycles of latency (again, test+jmp are fused into one uop), as dec is just 1 cycle of latency.

So n&(n-1) is faster for checking if it's a power of two.


In the following I confirm that n&(n-1) is faster than popcount (compiled with "cc -march=haswell check_pow2.c -O3"), at 699 vs. 1016 microseconds, respectively:

  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/time.h>
  
  #define N (2*1000*1000)  /* compile-time constant, so the array in main can be static */
  
  int count_popcnt(unsigned int *data) {
    int sum = 0;
    int i;
    for (i=0; i<N; i+=8) {
      sum += (
              (__builtin_popcount((data+i)[0]) == 1) +
              (__builtin_popcount((data+i)[1]) == 1) +
              (__builtin_popcount((data+i)[2]) == 1) +
              (__builtin_popcount((data+i)[3]) == 1) +
              (__builtin_popcount((data+i)[4]) == 1) +
              (__builtin_popcount((data+i)[5]) == 1) +
              (__builtin_popcount((data+i)[6]) == 1) +
              (__builtin_popcount((data+i)[7]) == 1)
              );
    }
    return sum;
  }
  
  int count_wegner(unsigned int *data) {
    int sum = 0;
    int i;
    for (i=0; i<N; i+=8) {
      sum += (
              ((((data+i)[0]) & ((data+i)[0]-1)) == 0) +
              ((((data+i)[1]) & ((data+i)[1]-1)) == 0) +
              ((((data+i)[2]) & ((data+i)[2]-1)) == 0) +
              ((((data+i)[3]) & ((data+i)[3]-1)) == 0) +
              ((((data+i)[4]) & ((data+i)[4]-1)) == 0) +
              ((((data+i)[5]) & ((data+i)[5]-1)) == 0) +
              ((((data+i)[6]) & ((data+i)[6]-1)) == 0) +
              ((((data+i)[7]) & ((data+i)[7]-1)) == 0)
              );
    }
    return sum;
  }
  
  
  int main(void) {
    static unsigned int data[N];  /* static: an 8 MB array would overflow a typical stack */
    FILE *f = fopen("/dev/urandom", "rb");
    if (!f) {
      perror("Cannot open /dev/urandom");
      exit(1);
    }
    if (fread(data, sizeof(unsigned int), N, f) != N) {
      perror("Cannot read enough items");
      exit(1);
    }
    /* Use only a few bits */
    for (int i=0; i<N; i++) {
      data[i] = data[i] & 31;
      /* Don't worry about handling 0 */
      if (data[i] == 0) {
        data[i] = i + 1;
      }
    }
  
    struct timeval time1, time2;
    int n;
    
    gettimeofday(&time1, NULL);
    n = count_popcnt(data);
    gettimeofday(&time2, NULL);
    
    printf(" popcnt: %d in %8ld us\n",
           n,
           (time2.tv_sec-time1.tv_sec) * 1000*1000 +
           (time2.tv_usec-time1.tv_usec));
  
    gettimeofday(&time1, NULL);
    n = count_wegner(data);
    gettimeofday(&time2, NULL);
  
    printf("n&(n-1): %d in %8ld us\n",
           n,
           (time2.tv_sec-time1.tv_sec) * 1000*1000 +
           (time2.tv_usec-time1.tv_usec));
           
    return 0;
  }
Thank you for teaching me something new!


Besides just timing it (excellent!) did you take a look at what it compiles to?

https://godbolt.org/z/C0ZeE5

It's faster because both GCC and Clang now optimize loops with n&(n-1) to use AVX2 SIMD! I haven't looked closely to confirm, but I think they may in fact even do Harley-Seal for "naive popcount" loops.

C is no longer portable assembly. If you want to test whether a particular algorithm is faster than another, you probably need to write assembly --- or at least confirm that the compiler did what you thought it did.


I think I'll stick with paying people like you to deal with these sorts of issues in my code. ;)


Maybe they wanted something that worked with non integer types.


Both methods only work on integers.


Another quick way that doesn’t rely on builtins is x && !(x & (x - 1)).


I'd first ask the interviewer whether they're ok with a function is_zero_or_two_power...


Isn't that what the "x &&" is for? For x = 0, (x - 1) is 0xffffffff, x & (x-1) is 0, !(x & (x-1)) is 1, and x && !(x & (x-1)) is 0.


Yes, that's exactly what it's for. The answer I hope for is that they don't care about the case of x==0, so that I can leave out the "x &&" part, leaving the noticeably simpler answer

(x & x-1) == 0


This fails for the most negative int, assuming two's complement, right?


The GNU __builtin_popcount takes an unsigned int.

https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html


Right. If he naively casts, it will cast to 2^(int bit width - 1) and return 1, but that's wrong. It's not a complete solution.


Can you share the different ways it was done? Everything in your post sounds quite interesting.


I can. About 10 years ago I put together a series of blog posts on the topic. It's even referenced from the linked-to article.

Most of the algorithms are described at http://www.dalkescientific.com/writings/diary/archive/2008/0... .

http://dalkescientific.com/writings/diary/archive/2011/11/02... has a link to http://dalkescientific.com/writings/diary/popcnt.cpp containing implementations in a benchmark.

That code only goes up to the POPCNT instruction. Faster implementations use the AVX2 instruction set to implement the Harley-Seal algorithm, and there's VPOPCNTDQ in AVX-512 which does 512 bits at a time.

The table in my preprint at https://chemrxiv.org/articles/The_Chemfp_Project/7877846 shows a comparison of the most reasonable algorithms (for my definition of 'reasonable') on relatively modern hardware. For my case, the AVX2 implementation, fully optimized, is 11x faster than a lookup table and 30% faster than POPCNT.

Note that my case handles (typically) multiword popcounts of 512-2048 bits. Some of the implementations sum the popcount of each word; a few of them use techniques to work on multiple words at a time.

(For example, the Gillies-Miller method for "sideways addition" - an old term for popcount - dates from the 1950s. https://archive.org/details/programsforelect00wilk/page/190 . )


Sure!

The simplest is a loop that increments a counter if the least significant bit is 1 and right shifts the word by 1 each iteration. Bonus points for early termination

One weird trick is to consider (w & (w - 1)) which unsets the lowest set bit

The other major approach is to use lookup tables for some segment of the word, usually a nybble or byte. If this is done with a divide-and-conquer approach you can short-circuit parts of the word that are zero.
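
To make these concrete, here's a rough sketch of all three (assuming 32-bit words; untested, but close to what I'd write):

  #include <stdint.h>

  /* 1. Naive: test and shift; the loop exits early once no set bits remain */
  int popcount_naive(uint32_t w) {
    int n = 0;
    for (; w != 0; w >>= 1)
      n += w & 1;
    return n;
  }

  /* 2. The weird trick: w & (w - 1) clears the lowest set bit,
     so this loops once per set bit */
  int popcount_wegner(uint32_t w) {
    int n = 0;
    for (; w != 0; w &= w - 1)
      n++;
    return n;
  }

  /* 3. Byte-wide lookup table; fill once with table[i] = popcount_naive(i) */
  static uint8_t table[256];

  int popcount_lut(uint32_t w) {
    return table[w & 0xff] + table[(w >> 8) & 0xff] +
           table[(w >> 16) & 0xff] + table[w >> 24];
  }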

[Edit] dalke's answer is much more detailed! I last thought about this stuff in about 2001 or so


Btw, depending on application, sometimes you don't want early termination.

Eg for some cryptographic applications you don't want to leak information about what data you are working on in the runtime of your algorithm.

Another (more far-fetched?) scenario might come up with either branch prediction and CPU pipelining, or your compiler's optimizer recognising the simple loop but not the one that has early abort.


> either branch prediction and CPU pipelining

Not that far-fetched; the non-early-aborting loop can be branch-predicted perfectly by a smart enough CPU.


Yeah, creating constant time code is its own challenge, if a candidate were to bring that up it could take the interview down a very different path



Why? It's the sort of thing that is used infrequently and you can look it up online. From what I remember the quickest way to do it would be with a lookup table.


I agree that it's a relatively low-signal question; it's basically a fizzbuzz for bit twiddling.

That being said, "It's the sort of thing that is used infrequently and you can look it up online" pretty much describes interview coding questions in general.

I was definitely just using it as a warm-up/screener question. At the time other popular questions were:

• Implement malloc

• Pack this nested data structure into contiguous memory

• Implement a splay tree

Those weren't warm-up questions


>"There are three or four reasonable ways to implement it in c, including one weird trick,..."

Might you or anyone else have any links or resources you could share on these three or four ways?


Several have already been posted/mentioned. There's also https://en.wikipedia.org/wiki/Hamming_weight with a few implementations.


One use of popcount not mentioned in the OP is graph analysis. For example, you can use the following

std::bitset<N>* graph = new std::bitset<N> [N];

to store the adjacency matrix of a graph with up to N vertices in N^2 bits of memory. The nice thing is that std::bitset::count uses popcount to compute the number of bits set to one. This makes some graph operations extremely fast even for pretty large N. For example, graph[i].count() will produce the degree of a vertex and (graph[i] & graph[j]).count() will produce the number of vertices adjacent to both i and j.
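
(The same idea works in plain C with raw words and the GCC/Clang builtin; a minimal sketch, with N and the layout made up:)

  #include <stdint.h>

  #define N 1024                          /* max vertices (made-up size) */
  #define WORDS ((N + 63) / 64)

  static uint64_t graph[N][WORDS];        /* one bit per potential edge */

  /* degree of vertex i: popcount of its row */
  int degree(int i) {
    int sum = 0;
    for (int w = 0; w < WORDS; w++)
      sum += __builtin_popcountll(graph[i][w]);
    return sum;
  }

  /* vertices adjacent to both i and j: AND the rows, then popcount */
  int common(int i, int j) {
    int sum = 0;
    for (int w = 0; w < WORDS; w++)
      sum += __builtin_popcountll(graph[i][w] & graph[j][w]);
    return sum;
  }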


Careful: std::bitset::count() doesn't use popcnt on MSVC, only GCC and Clang.


Popcount was not supported on the original AMD64 ISA. You have to tell your compiler to generate code for a recent chip before it will generate the instruction.


That still doesn't work. Here is an example with codegen for AVX2, and an intrinsic call to popcnt, that still doesn't use popcnt for std::bitset::count() with a bitset size of 32 bits. Compare it to the GCC-generated code.

https://godbolt.org/z/27TmQY


I guess you're talking about MSVC. I find it hard to care much what that does. Code where performance matters much is not built with it.

But you're right, their std lib not using their own intrinsic is pathetic.


Sometimes we don't have a choice of toolchain used due to distribution targets or other dependencies :-( I'd prefer to be using GCC / Clang for everything too...


I wrote to a maintainer of the MSVC lib. He says their lib has to work on all amd64s, but some (specifically, AMD before K10, and Intel before SSSE3) have no popcount. He says their intrinsics are defined to emit exactly the instruction named, unlike Gcc's, so they can't use that in their library.

No explanation why they use the loop form, except that the code hasn't been touched in a long, long time.


Surely they can check if AVX is enabled and use the intrinsic if so?


That would involve changing code not touched since before AVX or even SSSE3 existed. Probably not even since before amd64 existed.

But it's hard to switch on use of a single instruction. Checking at the use site consumes a branch predictor slot. Switching in a function pointer interferes with inlining. Self-modifying code would have been the old way. The modern way might be rewriting in the linker or loader, or JIT compiling.

I have discovered that compilers are extremely bad at recognizing hand-coded byte-order swapping and dropping in movbe or bswap instructions. That Gcc and Clang recognized ham-handed pop counting loops seems miraculous now.


Thanks for the warning. Fortunately, I'm using it with gcc.


Succinct datastructures (mentioned in the article) are used for storing graphs, using a similar principle. Popcount and its count leading/trailing zeroes siblings are used heavily in those rank/select datastructures. His link to it is worth a read.


The Bitmanip extension definition document for RISC-V explains a wide variety of what were, for me, unfamiliar but terribly useful bitwise operations. Well worth reading even if you have no particular interest in RISC-V. Many of the instructions detailed have analogs in AVX, NEON, and recent POWER instruction sets. Often these instructions are poorly explained in the official manuals.


MMIX has a "SADD" instruction, which does popcount(x&~y). (You could use this to count trailing zeros with two instructions, each taking only one clock cycle.)
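
(In C, the same two-step trick looks something like this, assuming a popcount builtin:)

  /* trailing zeros of n, for n != 0: the bits set in n-1 but clear in n
     are exactly the trailing-zero positions, so count them */
  int ctz(unsigned n) {
    return __builtin_popcount((n - 1) & ~n);
  }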

I thought JavaScript ought to have a built-in popcount function for the new integer type, but last I checked, it doesn't. (It would be faster than writing JavaScript code to emulate it.) (I can think of uses for popcount on arbitrary-length integers. Of course this won't work with negative numbers; I don't care what it does if the input is negative, although they should define what it does in that case. For nonnegative numbers it would be good to have.)

I have also used the __builtin_popcount function in GNU C. I did not know that it can detect an implementation of popcount and replace it; I have just used the built-in function.

I also did not know of all of the uses mentioned in the linked document (but I did know of some other uses).


WASM has ctz and it's honestly the better place for such a low level instruction. BigInt is getting some of the bitwise ops (https://github.com/tc39/proposal-bigint/blob/master/ADVANCED...) but just the higher level ones at this time.

JavaScript does have clz32, from back before WASM was the new target for compiling systems languages to the browser, but not ctz32, and I doubt it'd be added as that's no longer the focus.


I like popcount for converting a 2^N-bit uniformly-distributed random number into a N-bit binomially-distributed one. Each bit of the input simulates a random coin flip.
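
(A minimal sketch; rand64() stands in for whatever uniform 64-bit source you have:)

  #include <stdint.h>

  extern uint64_t rand64(void);   /* hypothetical uniform 64-bit source */

  /* counts 64 fair coin flips: Binomial(64, 1/2) */
  int binomial64(void) {
    return __builtin_popcountll(rand64());
  }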


You are wasting a lot of random bits this way, aren't you?


Not if you already have 2^n bits at hand. In fact, if you have 2^n bits of entropy, popcount is probably more efficient than generating n more bits randomly.


Sure, but generating random bits is fast with e.g. AES-NI, RdRand or a software implementation of e.g. ChaCha.


It's also an instruction in WebAssembly, which has a pretty sparse instruction set. There are only 29 integer operations, and this is one of them. There are also "clz" and "ctz" which count leading and trailing zeroes. Other than that all the instructions are run-of-the-mill arithmetic, comparison, bitmasking and shifting.


The fact that GCC and Clang can identify a manual implementation of count-bits-set and substitute it with an invocation of popcount is fascinating. What other idioms of similar or higher complexity can these compilers recognize and convert to instructions?


It should be pointed out that they can identify some implementations but miss many common idiomatic implementations that may be superior if you do not have a popcount() instruction. I am often surprised by the algorithms they can't see. You will need to check the code generation. This is true for many of the bit-twiddling instructions available on modern CPUs.

As for why someone might care, most bit-twiddling intrinsics cannot be used in "constexpr" functions in C++ (though this is starting to change). Finding a C++ code implementation of bit-twiddling that is reliably converted to the underlying instruction allows you to use the same constexpr implementation in both compile-time and run-time contexts with correct code generation.


Does this imply that it might be better to use the "worse" or "naïve" implementation, if you know that the compiler will identify it and do something architecture-appropriate?


Yes. Surprisingly, many naive implementations you might assume would be identified aren't. You are far more likely to write a naive implementation that is not identified than one that is. I've spent a few weekends testing many different implementations to better understand the limitations of the compiler at identifying bit-twiddling instructions and it still doesn't work nearly as well as you might hope.

Also, not every compiler will identify a specific bit-twiddling implementation, Clang seems to identify more of these idiomatic implementations than GCC in my experience, but that is anecdotal. The set of code patterns they can identify is not documented anywhere that I've seen, though it must be in the compiler source code somewhere.


I strongly disagree with the sibling comments. It is better and more clear to call the compiler built-in than do some magic incantation that gets the compiler to optimize it and hope that future compilers don’t regress.


This does not address the real-world case where built-ins will not compile in some valid code contexts. It is not "better" to do something that literally doesn't work. No one is doing this for fun, it is working around a well-known limitation in current compilers.

Because it is an unfortunate limitation on built-ins that reduces their utility, fixing this is on the roadmap for most popular compilers.


On top of that, the popcount built-in will always compile even if there is no instruction for it, as it can generate fallback code. It actually does so for a naive invocation of gcc or clang for x64, as the original x64 instruction set did not contain popcount. You need to target an arch that supports the instruction, or pass -mpopcnt to explicitly enable it. Handling all those builtins properly is tedious.


Assuming they are correct that the 'worse on paper' is the only one detected (I don't know if this is true or not) that would be the logical conclusion, yes.


in C++20 you can use

  // note: a plain if, not "if constexpr": inside if constexpr,
  // std::is_constant_evaluated() is evaluated in a constant context
  // and so would always take the first branch
  if (std::is_constant_evaluated()) {
       // hand coded implementation
  } else {
       // use builtin
  }
but yes, it is annoying.


> What other idioms of similar or higher complexity can these compilers recognize and convert to instructions?

Both GCC and clang (really LLVM) will try to autovectorize your loops, replacing operations on individual words with SIMD instructions (provided the compilation target supports them).

Similarly, both should replace `popcnt`-style functions (e.g., count-leading-zeros) with their instruction equivalents, provided the implementation is idiomatic enough. The overarching challenge is that users who don't know about instructions like `popcnt` and `clz` are also less likely to write implementations that are easy to detect and swap.


I don't recall seeing bit counting in the LLVM idiom recognition, but it's been years since I pored over it. Could you maybe Godbolt an example where this substitution happens? I'd be happy to be wrong in this case :)



Any idea why GCC zeroes out eax before overwriting it?

It never fails to amaze me how smart compilers can be while also being so stupid sometimes.


It could be a bug; I don't see an immediate reason to zero eax in this fragment. Sometimes it's easy to miss the reasons why compilers zero apparently unused things though: it can be about breaking data dependencies or making sure flags are in a known state.


It is a bug, but in the CPU. Until recently, on Intel CPUs, the popcnt instruction had an implicit dependency on its output register, which can be a performance issue. Explicitly clearing the register (which is very fast and almost free) is an easy way to work around it.


Oh, nice info. It's gone when using e.g. -march=znver1.


Indeed, thanks! Skimmed a bit too quickly.


A lot of things. Even somewhat complicated arithmetic will become a constant if possible. Certain formulas are hardcoded (e.g. the constant-time sum of the first n positive integers). It'll perform inlining and tail-call elimination. Certain loops can be vectorized automatically. And a lot of library functions are "intrinsics" that the compiler knows a lot about: if you implement your own it might replace it with the standard one, or it may specialize your call to the standard library function to suit your specific use (e.g. a 16-byte memcpy for local variables might just be a load into the XMM registers). Godbolt is a really great resource to see what the compiler will and will not do for you.


> Even somewhat complicated arithmetic will become a constant if possible.

That's one of the simplest optimisations to implement even in a toy compiler, can be done at the same time as parsing, and comes practically "for free": if all operands of an operator are constant, perform the operation and replace the operator and its operands with the constant result, and repeat this recursively.

Everything else is done based on pattern-matching the AST and performing the appropriate transformations.


Hm, the last time I wrote constant folding I did it in the optimizer pass over the AST (as part of a lecture, we built a custom C frontend for LLVM and added some optimization passes). What you describe replaces e.g. "x = 3 + 4;", but you get that for free once your optimizer is able to replace more complicated things (e.g. "a = 3; b = 4; x = a + b;").

Also, your parser becomes more complicated; when writing a toy compiler, I think having a clean parser is worth a lot.

But of course it can be done ;-)


Sorry, by “arithmetic” I really meant “arithmetic in a loop or behind other kinds of control flow”. My bad.


The question of whether the CPU can identify an implementation of it and perform it quicker in dedicated hardware is more interesting, and leads to the argument for CISC --- if POPCNT was there since early x86 but microcoded, it would only get faster once the microarchitecture provides dedicated hardware, and there would be no need to recompile or detect CPU features. On older CPUs without the hardware, it wouldn't be any slower, since the CPU would internally execute the equivalent of a manual implementation, only without needing to fetch and decode each individual macro-instruction of it.


The argument for RISC was that the CPU was simpler but could be made much faster. That proved to be overly simplistic. Like the rest of computing it was solved with another layer of abstraction, where CISC processors break instructions into µ-ops.

It also turned out that memory latency would not scale so CISC can be thought of as an instruction stream compression mechanism. Specifically x86 decode is more or less a fixed size block but we have ever more transistors available.

Then RISC ended up adding in a bunch of the complexity it was supposedly going to avoid because in the real world we just need to make the damn thing fast, purity be damned. It didn't help that some architectures made really bad design decisions (hello branch delay slots). Also many RISC cpus have microcode and break down some instructions, though not nearly to the degree a classic CISC cpu does.

It's all very messy and all the armchair theorizing in the world turns out not to matter. What matters is taking the silicon and measuring your workload on it.


It's not that fast; IIRC the microcode implementation is either just a loop over all bits or a nibble/byte-wide lookup table for the result. In my experience it's not much faster than doing it manually, at least.


I about fell out of my chair when I saw gcc and clang both turn (u>>s)|(u<<(32-s)) into a single "rorl" (rotate right logical) instruction.


Even pcc (on version 7 pdp-11 unix) knew that one.


That's a really easy one for compilers to detect. They start with the OR as the root and look backward to recognize the shifts and subtract.


That I expected, actually. (It seems common enough to me, that it would be implemented like that.)


Byte (endian) swapping is another one. GCC and Clang can recognize (some) implementations of that and replace with single instruction version (like bswap).
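
For example, something like this shift-and-mask form gets turned into a single bswap, at least in the compiler versions I tried (worth double-checking on godbolt):

  #include <stdint.h>

  uint32_t my_bswap32(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
  }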


On the other hand, until relatively recently they were bad at recognizing loads/stores of larger integers implemented using byte loads and shifts, and at optimizing those into larger loads (for architectures where unaligned loads are allowed). Doing it this way allows you to delegate any little/big/weird-endian concerns entirely to the compiler.


There's a neat example in [1] where clang replaces a loop summing integers 0..i with the closed form version.

[1] https://youtu.be/bSkpMdDe4g4?t=2533


The cynical answer is "anything that's in the SPECcpu benchmarks".


SPEC is full of undefined behavior, so as far as I know many compiler engineers dislike it because it “breaks” when compilers get better.


That’s the wrong attitude. If it has undefined behavior, then a valid response is to inject malware to cause the benchmark to return a fast result.


that would be amazing :)


Fortran compilers recognise naive three-loop matrix multiplication.


It's not recognizing a loop, but I've always liked the "turn integer division by a constant into a multiplication by a magic number" optimization: https://godbolt.org/z/Z0FgPW

This and many many other bit-twiddling hacks are described in Hacker's Delight: https://hackersdelight.org/. Compiler developers tend to go through this book and implement all the ones that seem interesting to them.


I'm pretty sure many compilers can replace naive implementations of memcpy with one of the compiler's built-in optimized versions.


Yes; in fact memcpy/memmove are intrinsics in LLVM IR and are one of the more common targets of its idiom recognizer. Fun fact: many loop replacements and vectorizations depend on signed induction variables, so that the compiler doesn't have to prove they don't overflow. (Unsigned overflow is defined; signed overflow is UB and thus can be ignored.)


If only you could specify overflow behavior and value range separately.


replacing sum 1..n with n*(n+1)/2: https://godbolt.org/z/nLV1vb


That kind of thing is actually fully general, and called the scalar evolution (SCEV) analysis. LLVM for example has a full engine [1] for solving all sort of polynomial recurrences, so you can solve, say, sum i^2 with clang too: https://godbolt.org/z/-nHe78

[1] https://github.com/llvm-mirror/llvm/blob/master/lib/Analysis...


Both gcc and clang can often detect tail recursive functions and implement them with plain loops with no recursive call.
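
For example, something like this compiles to a plain counting loop at -O2, with no call instruction in sight (easy to confirm on godbolt):

  /* tail-recursive sum of 1..n; the tail call becomes a jump */
  unsigned sum_to(unsigned n, unsigned acc) {
    if (n == 0)
      return acc;
    return sum_to(n - 1, acc + n);
  }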


I assume this is advantageous because then they can unroll the loops and parallelize instructions across loop 'boundaries'? I wonder if there are other advantages.


Bounded stack usage is the main advantage.


For whatever it's worth, this article is a good example of why HN's (only theoretical, by now) policy of keeping the original headline intact makes sense. I wouldn't have clicked on an article with the title "You Won’t Believe This One Weird CPU Instruction!", and I would not feel regret.


> For whatever it's worth, this article is a good example of why HN's (only theoretical, by now) policy of keeping the original headline intact makes sense.

Did you mean doesn't make sense?


There is more! You can be _faster_ than __builtin_popcount()!

Wojciech Muła did an excellent SSSE3 implementation:

http://0x80.pl/articles/sse-popcount.html

My understanding is that the set-up time is slower than the POPCNT CPU instruction, but throughput is higher. Useful if you need to count bits not in a single register, but across a whole array.


It looks like the conclusion is that the POPCNT instruction ("cpu") is faster than any of the SSE implementations. Only AVX2 outperforms POPCNT, and only for large enough bitstrings.


On i7:

> AVX2 code is faster than the dedicated instruction for input size 512 bytes and larger

The difference is indeed tiny. Still, it's very cool that generic AVX2 code can beat the instruction burned into the silicon!


Yikes I remember reading that thread on usenet. John Nagle, Rob Warnock and Roger Shepherd. Good times. Roger oddly didn't mention the story about guys in black suits and mirrored sunglasses showing up as the reason the T414 lacked popcount but the T800 added it.


Popcount is also very valuable in search (bitmap) indexes, for example roaring bitmap by Lemire uses it quite a bit.


Though, interestingly, if AVX2 is available then Lemire's CRoaring package uses a Harley-Seal popcount implementation, at https://github.com/RoaringBitmap/CRoaring/blob/master/includ... , rather than the POPCNT64 instruction.

Because it's faster - https://arxiv.org/abs/1611.07612 .


From the arxiv pdf: "We disabled Turbo Boost and set the processor to run at its highest clock speed" for their comparison.

Doesn't their setup bias towards the AVX2 solution?


For what it's worth, I've tested my AVX2 copy of their code on a couple of different machines, and found it faster than my POPCNT-based implementation. (See my comments elsewhere here.)

Here are my laptop numbers for my benchmark:

  2048 Tanimoto (RDKit Morgan radius=2) 1000 queries
  chemfp search using threshold Tanimoto arena, index, single threaded (popcnt_128_128)
    threshold T=0.4 popcnt 19.33 ms/query  (T2: 19.81 ms check: 189655.09 run-time: 39.2 s) (allow specialized)
  chemfp search using threshold Tanimoto arena, index, single threaded (avx2_256)
    threshold T=0.4 avx2 12.63 ms/query  (T2: 14.04 ms check: 189655.09 run-time: 26.7 s) (allow specialized)
That's 19.33 ms/query on 2048-bit (256 byte) bitstrings using POPCNT unrolled to two loops of 128 bytes each, and 12.63 ms/query using AVX2 specialized for 256 bytes.

My testing showed there was no advantage for a fully unrolled POPCNT implementation. One thing to know is that there is only one execution port for POPCNT on my Intel processor. An AMD processor with 4 execution ports (Ryzen, I think?) may be faster. I don't have that processor, and my limited understanding of the gcc-generated assembly suggests it isn't optimized for that case.


> An AMD processor with 4 execution ports (Ryzen, I think?) may be faster.

FYI m0zg agreed: "AMD can retire 4 (!) popcounts per cycle per core, if your code is able to feed it": https://news.ycombinator.com/item?id=20916023


I'm one of the authors on the paper. This is a good point. I think the answer is "it might". In practice (as 'dalke' points out) the AVX2 approach still ends up faster on common Intel hardware with Turbo enabled, but there might be cases where it doesn't, and the measured degree of difference certainly might change with Turbo on or off.

Complicating things, some but not all AVX2 operations benefit from Turbo frequencies as well. It depends among other things on the exact mix of instructions. The technical term is "license", and regardless of the setting of Turbo your processor chooses a frequency depending on the license that the instruction mix allows. Daniel has a blog post (co-written with Travis) on how this affects AVX512, but it also affects AVX2: https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-us.... The specifics are not well documented, and Travis probably has the best understanding of the exact behavior of anyone I know.

Anyway, we turn Turbo off and try to run the processor at a constant frequency because it makes some of our other measurements more reliable, but yes, when making claims that a SIMD solution is faster than a scalar solution, we should probably be testing whether the conclusion holds even when we turn Turbo back on. I think it's still best practice to do most measurements with it off, but I'll try to remember to do an extra test with it turned on if this comes up in future papers.


Thanks for your answer - great blog link - most relevant:

"Intel made more aggressive use of AVX512 instructions in earlier versions of the icc compiler, but has since removed most use unless the user asks for it with a special command line option.".

For other readers, short summary copied from the link with very light editing:

* There are heavy and light AVX instructions. "Heavy" instructions roughly are those involving floating point operations or integer multiplications operating on 512 bits.

* Intel cores can run in one of three modes: license 0 (L0) [normal], license 1 (L1) is slower, and license 2 (L2) is the slowest. To get into license 2, you need sustained use of heavy 512-bit instructions, where sustained means approximately one such instruction every cycle.

* The processor does not immediately move to a higher license when encountering heavy instructions: it will first execute these instructions with reduced performance (say 4x slower) and only when there are many of them will the processor change its frequency. Otherwise, any other 512-bit instructions will move the core to L1.

* Downclocking is per core and for a short time after you have used particular instructions (e.g., ~2ms).

* The downclocking of a core is based on: the current license level of that core, and also the total number of active cores on the same CPU socket (irrespective of the license level of the other cores).

The constraints and suggested workarounds are complex. Also the word choice "license" by Intel is bizarre (since it implies the throttling is to reduce performance for business reasons rather than technical reasons?).


Be aware though that Intel POPCNT instruction throughput is pretty low by today's standards. When you need to popcount a lot of bits, on Intel it might make sense to do it without the use of POPCNT. This oldie but goodie still holds, on Intel: https://danluu.com/assembly-intrinsics/

In contrast, AMD can retire 4 (!) popcounts per cycle per core, if your code is able to feed it.

There's vectorized popcount in some of the optional AVX512 instruction sets, but for those to make it into consumer CPUs AMD would really have to hurt Intel pretty bad, marketshare-wise. :-)


We used a “popcount” style operation + bitwise AND to determine how similar two people were by interest. We had a bitmap where each bit represented an interest and 1 meant they had that interest and 0 meant they did not. Bitwise AND filters out the non overlapping interests and popcount to measure the “intensity” of their similarity (to tease out people who had almost every or almost no interests selected matching others like that — bell curve kind of thing).


generally, popcnt is useful to implement the Jaccard index [1]:

    J(X,Y) = |X∩Y| / |X∪Y|

So, implementing the sets X and Y as bitsets it becomes:

    J(X,Y) = popcnt(X&Y) / popcnt(X|Y)

[1]: https://en.wikipedia.org/wiki/Jaccard_index
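
In C, over multiword bitsets, a minimal sketch (names are mine):

  #include <stdint.h>
  #include <stddef.h>

  /* Jaccard index of two bitsets stored as arrays of 64-bit words */
  double jaccard(const uint64_t *x, const uint64_t *y, size_t nwords) {
    long both = 0, either = 0;
    for (size_t i = 0; i < nwords; i++) {
      both   += __builtin_popcountll(x[i] & y[i]);
      either += __builtin_popcountll(x[i] | y[i]);
    }
    return either ? (double)both / either : 1.0;  /* treat J(empty,empty) as 1 */
  }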


If anyone needs that to go fast, for fixed-length (up to ~4k bits) try out my code, chemfp at http://chemfp.com/ . It's designed for cheminformatics, but can work with any data set which can be described by a fixed-length byte string and an identifier.


https://haskell-works.github.io/posts/2018-08-22-pdep-and-pe... from the series https://haskell-works.github.io/ shows how to use popcnt (and rank-select) in a Haskell CSV parser


The same computer architect who first told me about popcount's desirability to the NSA also believed that the massive simultaneous multithreading first introduced to the HPC world commercially by Tera—now owner of the "Cray" name—was also a big win for the spooks.


Revisiting POPCOUNT Operations in CPUs/GPUs https://homes.cs.washington.edu/~cdel/papers/spost106s1-pape...


The coolest part of this for me is that compilers can recognize this stuff and replace it with the machine instruction.

I know that optimizing compilers do some amazing things, but detecting this particular case is pretty cool.


I highly recommend Matt Godbolt's talk "What has my compiler done for me lately?"[0] and its follow-up[1]. Both are great presentations going into detail on some of the crazy optimizations compilers will do. As someone who knows very little about the inner workings of compilers, my mind was blown when I saw some of them.

[0] https://www.youtube.com/watch?v=bSkpMdDe4g4

[1] https://www.youtube.com/watch?v=nAbCKa0FzjQ


Fantastic. Adding to my watch list.


If you're building a binary tree incrementally, comparing the length of your stack of partially completed subtrees to the popcnt of the number of leaves so far tells you how many subtrees you need to merge :)
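
(Roughly, with hypothetical stack helpers:)

  struct node;
  struct stack;
  void push(struct stack *s, struct node *n);      /* hypothetical helpers */
  struct node *pop(struct stack *s);
  unsigned depth(const struct stack *s);
  struct node *merge(struct node *l, struct node *r);

  /* leaves = number of leaves added so far, including this one; the stack
     should end up exactly popcount(leaves) deep: one subtree per set bit */
  void add_leaf(struct stack *s, struct node *leaf, unsigned leaves) {
    push(s, leaf);
    while (depth(s) > (unsigned)__builtin_popcount(leaves)) {
      struct node *right = pop(s);
      struct node *left = pop(s);
      push(s, merge(left, right));
    }
  }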


The popcount instruction seems to be a meme on Hacker News. You're cool if you know about it and cooler if you've used it. That aside, this article is a good one for applications of it.


Knowing when to use bitmap representation, ops, and popcount is a kind of superpower. When it is the right thing, it makes your program 10x faster.


This can be said about almost all CPU instructions. x64 has tons of SIMD intrinsics. You can do absolutely crazy stuff with them if you are willing to invest the time.


True enough, but bitmaps are very easy to use portably, with just &, |, ~, ^, and ++ operations. The only exotic operation needed at all frequently is popcount, but you get that portably from std::bitset (except, evidently, in MSVC).

The compilers are perfectly happy to keep an unsigned and a bitset variable in the same register, so that assigning from one to the other is a no-op.


It's kind of pathetic that the RISC-V core instruction set lacks popcount.

Besides all its immediate uses, it is essential to a fast integer logarithm.


It is part of the Bit Manipulation extension. So is count leading zeros. Unfortunately it is still in draft.

The math extensions are not in draft.


Being in an extension means you can't count on it being there.


That's a feature (for hardware designers), not a bug.

RISC-V is not really designed to be a high-performance ISA. It's aimed squarely at the embedded market.


It's a feature for hardware designers working with 250 nm transistors.

Modern designers working with modern processes have the problem of discovering what they can throw a million transistors at that will actually make real programs faster. Hint: bitmanips!


Popcount comes up in chess engine programming, where boards are regularly represented as a series of bits for each type of piece. The Chess Programming Wiki has a lot of information about both usage and implementation:

https://www.chessprogramming.org/Population_Count


I think the most probable reason for this instruction is for calculating parity bits. This would need to be done fast so it makes sense that there would be a CPU instruction to do most of the work.


Parity is much easier than counting. It's so easy that you get it for free in the x86 flags register. (… and because the 8008 was designed to run a terminal.)
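
(GCC and Clang even expose this directly; parity is just the low bit of the popcount:)

  /* 1 if x has an odd number of set bits; same as __builtin_parity(x) */
  int parity(unsigned x) {
    return __builtin_popcount(x) & 1;
  }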


>"(… and because the 8008 was designed to run a terminal.)"

Could you elaborate on this? How does the 8008 being designed to run a terminal relate to the parity and the flags register?


Each iteration from the 8080 through x64 has a parity bit in the flags register for backwards compatibility with the previous generation. The 8008 was a microprocessor implementation of the Datapoint 2200 architecture.


Excellent, thanks for the insights. Cheers.


Early protocols didn't have error correction in the lower layers. The parity flag was the equivalent of a CRC instruction today. Presumably if parity was incorrect, that would mean the byte was transmitted incorrectly.


I've used _BitScanReverse (popcount-ish MSVC intrinsic) with a custom octree indexing scheme, where higher level voxels have more unset bits from the left. Neat stuff.


MySQL includes a native popcount function; Postgres didn't when I checked a few years back. I actually needed it at the time.


It's surprisingly easy to implement a popcount user-defined function in C for Postgres. It would be a problem for hosted Postgres with extension restrictions, though.


Power ISA has it too. I'm hard pressed to think of a modern architecture that doesn't have it.


RISC-V delegates pcnt to the “B” Standard Extension for Bit Manipulation, and the status of B version 0.90 is Open, meaning it hasn't been finalized. So realistically, RISC-V doesn't have the pcnt instruction. Versions supporting B will have it but that will be at some point in the future.


I don’t think ARM does, but there might be some NEON thing that you can use to do something equivalent.



Huh, it looks like that only works on 1-byte values? That’s an interesting choice.


Worse, it's a fertile ground for "interesting" bugs, because VADDV (which sum-reduces the result) reduces into an 8 bit uint. So if you e.g. accumulate two or more quadword VCNTs into a uint8x16_t and then VADDV it, you could end up with something other than the actual overall bit count (because 2 quadwords can have _256_ bits set). Same with accumulating 8 or more VADDVs, except now individual bytes could wrap around if you don't widen in between.


Speaking of ARM, it has an rbit instruction which for some reason doesn't exist on x86.


Apparently x86 isn't "web scale" enough to have to deal with endianness ;)


It is not really about endianness. The instruction reverses bits, not bytes. IIRC it is needed as a step of the FFT (the bit-reversal permutation).


I've used this in an effort to calculate Hamming distances between encoded equal-length strings of DNA.


Popcount is not at all unusual in ordinary bit-twiddling.


popcount reminds me of my first Map-Reduce program. It seems that I "reduced" by popcounting.


Peculiar name. I would have named that "bitsum".


The original name was "sideways addition". There are many other names, including Hamming weight. The Wikipedia entry for Hamming weight says the HP-16C developers agree with you:

> Some programmable scientific pocket calculators feature special commands to calculate the number of set bits, e.g. #B on the HP-16C[3][17] and WP 43S,[18][19] #BITS[20][21] or BITSUM[22][23] on HP-16C emulators, and nBITS on the WP 34S.[24][25]



