Popcount CPU instruction (vaibhavsagar.com)
350 points by mmastrac on Sept 9, 2019 | 162 comments



.NET Core 3 (in preview, due out Sep 23) includes support for popcount as part of the BitOperations class: https://docs.microsoft.com/en-us/dotnet/api/system.numerics..... There's a recent post describing hardware intrinsics in .NET Core: https://devblogs.microsoft.com/dotnet/hardware-intrinsics-in...

There are a lot of potential use cases, but one of the early discussions was around quickly scanning for known HTTP headers. You can see that in use here: https://github.com/aspnet/AspNetCore/blob/caa910ceeba5f2b2c0...

If you really need help falling asleep, here's the discussion going back to 2015: https://github.com/dotnet/corefx/issues/2209

Disclaimers: Microsoft employee, Nazgûl


GHC also includes a `popCount` function that is implemented in terms of the built-in instruction: https://haskell-works.github.io/posts/2018-08-22-pdep-and-pe...


Cool!

It would be nice to get bitscanforward/reverse too.

I created this library long ago to get popcnt on desktop .net: https://github.com/omgtehlion/netintrinsics

Blog post with more details (in Russian): https://m.habr.com/ru/post/239619/

EDIT: found lzcnt/tzcnt on a linked page, thanks! But what about bswap, or am I asking too much? ))


.NET is getting some really cool stuff lately. Sadly, it'll likely be irrelevant to me for some time to come.


I used to use "implement popcount" as an interview question (for C programmers). I'm not super convinced that it was a good question, although it definitely tells you quite quickly how comfortable a candidate is with bit manipulation.

There are three or four reasonable ways to implement it in C, including one weird trick, but which is most efficient tends to be very CPU-dependent (at the time I worked on a project that targeted four or five esoteric CPUs, so I had the luxury of being able to verify this).


I remember in one interview I was asked to write a function that returns true iff x is a power of two, so I wrote return 1 == __builtin_popcount(x). They liked that.


I did the same thing!

But they didn't like it. So I wrote the naivest possible popcount and returned 1 == naive_bitcount(n).

They didn't like that. I explained to them that the compiler will optimize out the loop so this will be fast and that they should check it on godbolt. (Did I mention that I was coding on a whiteboard?) There's a loop in the source code, but there isn't a loop in the machine code, and it's like three instructions.

They didn't like that either, so I worked out the n & (n-1) method from first principles. I didn't do it optimally; there was some redundancy in my code, which the interviewer smugly explained to me. I can't remember if I told him "that's nice, but it's still an extra instruction compared to the trivial, obvious, easy to maintain method I showed you ten minutes ago." I'm pretty sure I didn't, too afraid to make a bad impression.

I didn't get the job. In hindsight, I'm glad I didn't. Those guys are dicks.


> I explained to them that the compiler will optimize out the loop so this will be fast and that they should check it on godbolt. (Did I mention that I was coding on a whiteboard?) There's a loop in the source code, but there isn't a loop in the machine code, and it's like three instructions.

Did you check this yourself? These sorts of "grand" pattern-matches tend to be low-productivity optimizations for compilers (the speedups you get are not worth the extra compile-time). LLVM tends to have a higher tolerance for this than GCC. But even looking at LLVM's source code, the "naivest possible popcount" [1] is not going to pass the loop idiom recognizer for a popcount.

> I can't remember if I told him "that's nice, but it's still an extra instruction compared to the trivial, obvious, easy to maintain method I showed you ten minutes ago."

n & (n - 1) should come out to 2 cycles of latency. 1 == popcount(n) would come out to 4 cycles of latency (3 latency for the POPCNT, 1 for the comparison at the end).

[1] I'm assuming by this that you mean iterating over all bits and individually checking each one for set/unset. Both gcc (since version 9) and LLVM handle the case where you iterate through set bits in a for (; x; x = x & (x - 1)) loop.


I would not be surprised if the compiler detected the second version as well and replaced it with a popcnt. In any case, it seems like you ended up with the stereotypical “bad interviewer” so I doubt it would have mattered to point this out.


That was you interviewing them. They failed.


The interviewer felt smug, so their mission was accomplished.


If that was their mission (rather than to hire someone), they succeeded in their personal mission, at the expense of the company's intended mission for them.


popcnt is generally going to take multiple cycles though; I'd expect the n&(n-1) method to be more performant. It's also vectorizable and more portable.


Intel AVX-512 has a vector popcount, VPOPCNTDQ.


builtin popcount when compiled for an architecture with hardware popcount should be 4 bytes per 1 cycle.

The n&(n-1) method to test for a power of 2 cannot be faster than that.


POPCNT has a latency of 3 cycles on most x86 hardware.


You are right. I was thinking of the number of operations.

Is the popcnt test slower than the n&(n-1) test? (Edit: Ooops! I see you already addressed that at https://news.ycombinator.com/item?id=20918136 .)


A popcnt/test/jmp sequence should have roughly 4 cycles of latency: a test+jmp can be fused into one uop, and popcnt would be 3 cycles of latency. n&(n-1) == 0 would compile down into a DEC, TEST, JMP, which is 2 cycles of latency (again, test+jmp are fused into one uop), as dec is just 1 cycle of latency.

So n&(n-1) is faster for checking if it's a power of two.


In the following I confirm that n&(n-1) is faster than popcount (compiled with "cc -march=haswell check_pow2.c -O3"), at 699 vs. 1016 microseconds, respectively:

  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/time.h>
  
  #define N (2*1000*1000)  /* compile-time constant, so the array in main can be static */
  
  int count_popcnt(unsigned int *data) {
    int sum = 0;
    int i;
    for (i=0; i<N; i+=8) {
      sum += (
              (__builtin_popcount((data+i)[0]) == 1) +
              (__builtin_popcount((data+i)[1]) == 1) +
              (__builtin_popcount((data+i)[2]) == 1) +
              (__builtin_popcount((data+i)[3]) == 1) +
              (__builtin_popcount((data+i)[4]) == 1) +
              (__builtin_popcount((data+i)[5]) == 1) +
              (__builtin_popcount((data+i)[6]) == 1) +
              (__builtin_popcount((data+i)[7]) == 1)
              );
    }
    return sum;
  }
  
  int count_wegner(unsigned int *data) {
    int sum = 0;
    int i;
    for (i=0; i<N; i+=8) {
      sum += (
              ((((data+i)[0]) & ((data+i)[0]-1)) == 0) +
              ((((data+i)[1]) & ((data+i)[1]-1)) == 0) +
              ((((data+i)[2]) & ((data+i)[2]-1)) == 0) +
              ((((data+i)[3]) & ((data+i)[3]-1)) == 0) +
              ((((data+i)[4]) & ((data+i)[4]-1)) == 0) +
              ((((data+i)[5]) & ((data+i)[5]-1)) == 0) +
              ((((data+i)[6]) & ((data+i)[6]-1)) == 0) +
              ((((data+i)[7]) & ((data+i)[7]-1)) == 0)
              );
    }
    return sum;
  }
  
  
  int main(void) {
    static unsigned int data[N];  /* static: an 8 MB array would overflow a typical stack */
    FILE *f = fopen("/dev/urandom", "rb");
    if (!f) {
      perror("Cannot open /dev/urandom");
      exit(1);
    }
    if (fread(data, sizeof(unsigned int), N, f) != N) {
      perror("Cannot read enough items");
      exit(1);
    }
    /* Use only a few bits */
    for (int i=0; i<N; i++) {
      data[i] = data[i] & 31;
      /* Don't worry about handling 0 */
      if (data[i] == 0) {
        data[i] = i + 1;
      }
    }
  
    struct timeval time1, time2;
    int n;
    
    gettimeofday(&time1, NULL);
    n = count_popcnt(data);
    gettimeofday(&time2, NULL);
    
    printf(" popcnt: %d in %8ld us\n",
           n,
           (time2.tv_sec-time1.tv_sec) * 1000*1000 +
           (time2.tv_usec-time1.tv_usec));
  
    gettimeofday(&time1, NULL);
    n = count_wegner(data);
    gettimeofday(&time2, NULL);
  
    printf("n&(n-1): %d in %8ld us\n",
           n,
           (time2.tv_sec-time1.tv_sec) * 1000*1000 +
           (time2.tv_usec-time1.tv_usec));
           
    return 0;
  }
Thank you for teaching me something new!


Besides just timing it (excellent!) did you take a look at what it compiles to?

https://godbolt.org/z/C0ZeE5

It's faster because both GCC and Clang now optimize loops with n&(n-1) to use AVX2 SIMD! I haven't looked closely to confirm, but I think they may in fact even do Harley-Seal for "naive popcount" loops.

C is no longer portable assembly. If you want to test whether a particular algorithm is faster than another, you probably need to write assembly --- or at least confirm that the compiler did what you thought it did.


I think I'll stick with paying people like you to deal with these sorts of issues in my code. ;)


Maybe they wanted something that worked with non integer types.


Both methods only work on integers.


Another quick way that doesn’t rely on builtins is x && !(x & (x - 1)).


I'd first ask the interviewer whether they're ok with a function is_zero_or_two_power...


Isn't that what the "x &&" is for? For x = 0, (x - 1) is 0xffffffff, x & (x-1) is 0, !(x & (x-1)) is 1, and x && !(x & (x-1)) is 0.


Yes, that's exactly what it's for. The answer I hope for is that they don't care about the case of x==0, so that I can leave out the "x &&" part, leaving the noticeably simpler answer

(x & x-1) == 0


This fails for the most negative int, assuming two's complement, right?


The GNU __builtin_popcount takes an unsigned int.

https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html


Right. If he naively casts, it will cast to 2^(int bit width - 1) and return 1, but that's wrong. It's not a complete solution.


Can you share the different ways it was done? Everything in your post sounds quite interesting.


I can. About 10 years ago I put together a series of blog posts on the topic. It's even referenced from the linked-to article.

Most of the algorithms are described at http://www.dalkescientific.com/writings/diary/archive/2008/0... .

http://dalkescientific.com/writings/diary/archive/2011/11/02... has a link to http://dalkescientific.com/writings/diary/popcnt.cpp containing implementations in a benchmark.

That code only goes up to the POPCNT instruction. Faster implementations use the AVX2 instruction set to implement the Harley-Seal algorithm, and there's VPOPCNTDQ in AVX-512 which does 512 bits at a time.

The table in my preprint at https://chemrxiv.org/articles/The_Chemfp_Project/7877846 shows a comparison of the most reasonable algorithms (for my definition of 'reasonable') on relatively modern hardware. For my case, the AVX2 implementation, fully optimized, is 11x faster than a lookup table and 30% faster than POPCNT.

Note that my case handles (typically) multiword popcounts of 512-2048 bits. Some of the implementations sum the popcount of each word; a few of them use techniques to work on multiple words at a time.

(For example, the Gillies-Miller method for "sideways addition" - an old term for popcount - dates from the 1950s. https://archive.org/details/programsforelect00wilk/page/190 . )


Sure!

The simplest is a loop that increments a counter if the least significant bit is 1 and right shifts the word by 1 each iteration. Bonus points for early termination

One weird trick is to consider (w & (w - 1)) which unsets the lowest set bit

The other major approach is to use lookup tables for some segment of the word, usually a nybble or byte. If this is done with a divide-and-conquer approach you can short-circuit parts of the word that are zero.
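
To make these concrete, here's a rough sketch of all three (assuming 32-bit words; untested, but close to what I'd write):

  #include <stdint.h>

  /* 1. Naive: test and shift; the loop exits early once no set bits remain */
  int popcount_naive(uint32_t w) {
    int n = 0;
    for (; w != 0; w >>= 1)
      n += w & 1;
    return n;
  }

  /* 2. The weird trick: w & (w - 1) clears the lowest set bit,
     so this loops once per set bit */
  int popcount_wegner(uint32_t w) {
    int n = 0;
    for (; w != 0; w &= w - 1)
      n++;
    return n;
  }

  /* 3. Byte-wide lookup table; fill once with table[i] = popcount_naive(i) */
  static uint8_t table[256];

  int popcount_lut(uint32_t w) {
    return table[w & 0xff] + table[(w >> 8) & 0xff] +
           table[(w >> 16) & 0xff] + table[w >> 24];
  }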

[Edit] dalke's answer is much more detailed! I last thought about this stuff in about 2001 or so


Btw, depending on application, sometimes you don't want early termination.

Eg for some cryptographic applications you don't want to leak information about what data you are working on in the runtime of your algorithm.

Another (more far-fetched?) scenario might come up with either branch prediction and CPU pipelining, or your compiler's optimizer recognising the simple loop but not the one that has early abort.


> either branch prediction and CPU pipelining

Not that far-fetched; the non-early-aborting loop can be branch-predicted perfectly by a smart enough CPU.


Yeah, creating constant time code is its own challenge, if a candidate were to bring that up it could take the interview down a very different path



Why? It's the sort of thing that is used infrequently and you can look it up online. From what I remember the quickest way to do it would be with a lookup table.


I agree that it's a relatively low-signal question; it's basically a fizzbuzz for bit twiddling.

That being said, "It's the sort of thing that is used infrequently and you can look it up online" pretty much describes interview coding questions in general.

I was definitely just using it as a warm-up/screener question. At the time other popular questions were:

• Implement malloc

• Pack this nested data structure into contiguous memory

• Implement a splay tree

Those weren't warm-up questions


>"There are three or four reasonable ways to implement it in c, including one weird trick,..."

Might you or anyone else have any links or resources you could share on these three or four ways?


Several have already been posted/mentioned. There's also https://en.wikipedia.org/wiki/Hamming_weight with a few implementations.


One use of popcount not mentioned in the OP is graph analysis. For example, you can use the following

std::bitset<N>* graph = new std::bitset<N> [N];

to store the adjacency matrix of a graph with up to N vertices in N^2 bits of memory. The nice thing is that std::bitset::count uses popcount to compute the number of bits set to one. This makes some graph operations extremely fast even for pretty large N. For example, graph[i].count() will produce the degree of a vertex and (graph[i] & graph[j]).count() will produce the number of vertices adjacent to both i and j.
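
(The same idea works in plain C with raw words and the GCC/Clang builtin; a minimal sketch, with N and the layout made up:)

  #include <stdint.h>

  #define N 1024                          /* max vertices (made-up size) */
  #define WORDS ((N + 63) / 64)

  static uint64_t graph[N][WORDS];        /* one bit per potential edge */

  /* degree of vertex i: popcount of its row */
  int degree(int i) {
    int sum = 0;
    for (int w = 0; w < WORDS; w++)
      sum += __builtin_popcountll(graph[i][w]);
    return sum;
  }

  /* vertices adjacent to both i and j: AND the rows, then popcount */
  int common(int i, int j) {
    int sum = 0;
    for (int w = 0; w < WORDS; w++)
      sum += __builtin_popcountll(graph[i][w] & graph[j][w]);
    return sum;
  }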


Careful: std::bitset::count() doesn't use popcnt on MSVC, only GCC and Clang.


Popcount was not supported on the original AMD64 ISA. You have to tell your compiler to generate code for a recent chip before it will generate the instruction.


That still doesn't work. Here is an example with codegen for AVX2, and an intrinsic call to popcnt, that still doesn't use popcnt for std::bitset::count() with a bitset size of 32 bits. Compare it to the GCC-generated code.

https://godbolt.org/z/27TmQY


I guess you're talking about MSVC. I find it hard to care much what that does. Code where performance matters much is not built with it.

But you're right, their std lib not using their own intrinsic is pathetic.


Sometimes we don't have a choice of toolchain used due to distribution targets or other dependencies :-( I'd prefer to be using GCC / Clang for everything too...


I wrote to a maintainer of the MSVC lib. He says their lib has to work on all amd64s, but some (specifically, AMD before K10, and Intel before SSSE3) have no popcount. He says their intrinsics are defined to emit exactly the instruction named, unlike Gcc's, so they can't use that in their library.

No explanation why they use the loop form, except that the code hasn't been touched in a long, long time.


Surely they can check if AVX is enabled and use the intrinsic if so?


That would involve changing code not touched since before AVX or even SSSE3 existed. Probably not even since before amd64 existed.

But it's hard to switch on use of a single instruction. Checking at the use site consumes a branch predictor slot. Switching in a function pointer interferes with inlining. Self-modifying code would have been the old way. The modern way might be rewriting in the linker or loader, or JIT compiling.

I have discovered that compilers are extremely bad at recognizing hand-coded byte-order swapping and dropping in movbe or bswap instructions. That Gcc and Clang recognized ham-handed pop counting loops seems miraculous now.


Thanks for the warning. Fortunately, I'm using it with gcc.


Succinct datastructures (mentioned in the article) are used for storing graphs, using a similar principle. Popcount and its count leading/trailing zeroes siblings are used heavily in those rank/select datastructures. His link to it is worth a read.


The Bitmanip extension definition document for RISC-V explains a wide variety of what were, for me, unfamiliar but terribly useful bitwise operations. Well worth reading even if you have no particular interest in RISC-V. Many of the instructions detailed have analogs in AVX, NEON, and recent POWER instruction sets. Often these instructions are poorly explained in the official manuals.


MMIX has a "SADD" instruction, which does popcount(x&~y). (You could use this to count trailing zeros with two instructions, each taking only one clock cycle.)
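
(In C, the same two-step trick looks something like this, assuming a popcount builtin:)

  /* trailing zeros of n, for n != 0: the bits set in n-1 but clear in n
     are exactly the trailing-zero positions, so count them */
  int ctz(unsigned n) {
    return __builtin_popcount((n - 1) & ~n);
  }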

I thought JavaScript ought to have a built-in popcount function for the new integer type, but last I checked, it doesn't. (It would be faster than writing JavaScript code to emulate it.) (I can think of uses for popcount on arbitrary-length integers. Of course this won't work with negative numbers; I don't care what it does if the input is negative, although they should define what it does in that case. For nonnegative numbers it would be good to have.)

I have also used the __builtin_popcount function in GNU C. I did not know that it can detect an implementation of popcount and replace it; I have just used the built-in function.

I also did not know of all of the uses mentioned in the linked document (but I did know of some other uses).


WASM has ctz and it's honestly the better place for such a low level instruction. BigInt is getting some of the bitwise ops (https://github.com/tc39/proposal-bigint/blob/master/ADVANCED...) but just the higher level ones at this time.

JavaScript does have clz32, from back before WASM was the new target for compiling systems languages to the browser, but not ctz32, and I doubt it'd be added as that's no longer the focus.


I like popcount for converting a 2^N-bit uniformly-distributed random number into a N-bit binomially-distributed one. Each bit of the input simulates a random coin flip.
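
(A minimal sketch; rand64() stands in for whatever uniform 64-bit source you have:)

  #include <stdint.h>

  extern uint64_t rand64(void);   /* hypothetical uniform 64-bit source */

  /* counts 64 fair coin flips: Binomial(64, 1/2) */
  int binomial64(void) {
    return __builtin_popcountll(rand64());
  }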


You are wasting a lot of random bits this way, aren't you?


Not if you already have 2^n bits at hand. In fact, if you have 2^n bits of entropy, popcount is probably more efficient than generating n more bits randomly.


Sure, but generating random bits is fast with e.g. AES-NI, RdRand or a software implementation of e.g. ChaCha.


It's also an instruction in WebAssembly, which has a pretty sparse instruction set. There are only 29 integer operations, and this is one of them. There are also "clz" and "ctz" which count leading and trailing zeroes. Other than that all the instructions are run-of-the-mill arithmetic, comparison, bitmasking and shifting.


The fact that GCC and Clang can identify a manual implementation of count-bits-set and substitute it with an invocation of popcount is fascinating. What other idioms of similar or higher complexity can these compilers recognize and convert to instructions?


It should be pointed out that they can identify some implementations but miss many common idiomatic implementations that may be superior if you do not have a popcount() instruction. I am often surprised by the algorithms they can't see. You will need to check the code generation. This is true for many of the bit-twiddling instructions available on modern CPUs.

As for why someone might care, most bit-twiddling intrinsics cannot be used in "constexpr" functions in C++ (though this is starting to change). Finding a C++ code implementation of bit-twiddling that is reliably converted to the underlying instruction allows you to use the same constexpr implementation in both compile-time and run-time contexts with correct code generation.


Does this imply that it might be better to use the "worse" or "naïve" implementation, if you know that the compiler will identify it and do something architecture-appropriate?


Yes. Surprisingly, many naive implementations you might assume would be identified aren't. You are far more likely to write a naive implementation that is not identified than one that is. I've spent a few weekends testing many different implementations to better understand the limitations of the compiler at identifying bit-twiddling instructions and it still doesn't work nearly as well as you might hope.

Also, not every compiler will identify a specific bit-twiddling implementation, Clang seems to identify more of these idiomatic implementations than GCC in my experience, but that is anecdotal. The set of code patterns they can identify is not documented anywhere that I've seen, though it must be in the compiler source code somewhere.


I strongly disagree with the sibling comments. It is better and more clear to call the compiler built-in than do some magic incantation that gets the compiler to optimize it and hope that future compilers don’t regress.


This does not address the real-world case where built-ins will not compile in some valid code contexts. It is not "better" to do something that literally doesn't work. No one is doing this for fun, it is working around a well-known limitation in current compilers.

Because it is an unfortunate limitation on built-ins that reduces their utility, fixing this is on the roadmap for most popular compilers.


On top of that, the popcount built-in will always compile even if there is no instruction for it, as it can generate fallback code. It actually does so for a naive invocation of gcc or clang for x64, as the original x64 instruction set did not contain popcount. You need to target an arch that supports the instruction, or pass -mpopcnt to explicitly enable it. Handling all those builtins properly is tedious.


Assuming they are correct that the 'worse on paper' is the only one detected (I don't know if this is true or not) that would be the logical conclusion, yes.


in C++20 you can use

  // note: a plain if, not "if constexpr": inside if constexpr,
  // std::is_constant_evaluated() is evaluated in a constant context
  // and so would always take the first branch
  if (std::is_constant_evaluated()) {
       // hand coded implementation
  } else {
       // use builtin
  }
but yes, it is annoying.


> What other idioms of similar or higher complexity can these compilers recognize and convert to instructions?

Both GCC and clang (really LLVM) will try to autovectorize your loops, replacing operations on individual words with SIMD instructions (provided the compilation target supports them).

Similarly, both should replace `popcnt`-style functions (e.g., count-leading-zeros) with their instruction equivalents, provided the implementation is idiomatic enough. The overarching challenge is that users who don't know about instructions like `popcnt` and `clz` are also less likely to write implementations that are easy to detect and swap.


I don't recall seeing bit counting in the LLVM idiom recognition, but it's been years since I pored over it. Could you maybe Godbolt an example where this substitution happens? I'd be happy to be wrong in this case :)



Any idea why GCC zeroes out eax before overwriting it?

It never fails to amaze me how smart compilers can be while also being so stupid sometimes.


It could be a bug; I don't see an immediate reason to zero eax in this fragment. Sometimes it's easy to miss the reasons why compilers zero apparently unused things though: it can be about breaking data dependencies or making sure flags are in a known state.


It is a bug, but in the CPU. Until recently, on Intel CPUs, the popcnt instruction had an implicit dependency on its output register, which can be a performance issue. Explicitly clearing the register (which is very fast and almost free) is an easy way to work around it.


Oh, nice info. It's gone when using e.g. -march=znver1.


Indeed, thanks! Skimmed a bit too quickly.


A lot of things. Even somewhat complicated arithmetic will become a constant if possible. Certain formulas are hardcoded (e.g. the constant-time sum of the first n positive integers). It'll perform inlining and tail-call elimination. Certain loops can be vectorized automatically. And a lot of library functions are "intrinsics" that the compiler knows a lot about: if you implement your own it might replace it with the standard one, or it may specialize your call to the standard library function to suit your specific use (e.g. a 16-byte memcpy for local variables might just be a load into the XMM registers). Godbolt is a really great resource to see what the compiler will and will not do for you.


> Even somewhat complicated arithmetic will become a constant if possible.

That's one of the simplest optimisations to implement even in a toy compiler, can be done at the same time as parsing, and comes practically "for free": if all operands of an operator are constant, perform the operation and replace the operator and its operands with the constant result, and repeat this recursively.

Everything else is done based on pattern-matching the AST and performing the appropriate transformations.


Hm, the last time I wrote constant folding I did it in the optimizer pass over the AST (as part of a lecture, we built a custom C frontend for LLVM and added some optimization passes). What you describe replaces e.g. "x = 3 + 4;", but you get that for free once your optimizer is able to replace more complicated things (e.g. "a = 3; b = 4; x = a + b;").

Also, your parser becomes more complicated; when writing a toy compiler, I think having a clean parser is worth a lot.

But of course it can be done ;-)


Sorry, by “arithmetic” I really meant “arithmetic in a loop or behind other kinds of control flow”. My bad.


The question of whether the CPU can identify an implementation of it and perform it quicker in dedicated hardware is more interesting, and leads to the argument for CISC --- if POPCNT was there since early x86 but microcoded, it would only get faster once the microarchitecture provides dedicated hardware, and there would be no need to recompile or detect CPU features. On older CPUs without the hardware, it wouldn't be any slower, since the CPU would internally execute the equivalent of a manual implementation, only without needing to fetch and decode each individual macro-instruction of it.


The argument for RISC was that the CPU was simpler but could be made much faster. That proved to be overly simplistic. Like the rest of computing it was solved with another layer of abstraction, where CISC processors break instructions into µ-ops.

It also turned out that memory latency would not scale so CISC can be thought of as an instruction stream compression mechanism. Specifically x86 decode is more or less a fixed size block but we have ever more transistors available.

Then RISC ended up adding in a bunch of the complexity it was supposedly going to avoid because in the real world we just need to make the damn thing fast, purity be damned. It didn't help that some architectures made really bad design decisions (hello branch delay slots). Also many RISC cpus have microcode and break down some instructions, though not nearly to the degree a classic CISC cpu does.

It's all very messy and all the armchair theorizing in the world turns out not to matter. What matters is taking the silicon and measuring your workload on it.


It's not that fast; IIRC the microcode implementation is either just a loop over all bits or a nibble/byte-wide lookup table for the result. In my experience it's not much faster than doing it manually, at least.


I about fell out of my chair when I saw gcc and clang both turn (u>>s)|(u<<(32-s)) into a single "rorl" (rotate right logical) instruction.


Even pcc (on version 7 pdp-11 unix) knew that one.


That's a really easy one for compilers to detect. They start with the OR as the root and look backward to recognize the shifts and subtract.


That I expected, actually. (It seems common enough to me, that it would be implemented like that.)


Byte (endian) swapping is another one. GCC and Clang can recognize (some) implementations of that and replace with single instruction version (like bswap).
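
For example, something like this shift-and-mask form gets turned into a single bswap, at least in the compiler versions I tried (worth double-checking on godbolt):

  #include <stdint.h>

  uint32_t my_bswap32(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
  }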


On the other hand, until relatively recently they were bad at recognizing loads/stores of larger integers implemented using byte loads and shifts, and at optimizing those into larger loads (for architectures where unaligned loads are allowed). Doing it this way allows you to delegate any little/big/weird-endian concerns entirely to the compiler.


There's a neat example in [1] where clang replaces a loop summing integers 0..i with the closed form version.

[1] https://youtu.be/bSkpMdDe4g4?t=2533


The cynical answer is "anything that's in the SPECcpu benchmarks".


SPEC is full of undefined behavior, so as far as I know many compiler engineers dislike it because it “breaks” when compilers get better.


That’s the wrong attitude. If it has undefined behavior, then a valid response is to inject malware to cause the benchmark to return a fast result.


that would be amazing :)


Fortran compilers recognise naive three-loop matrix multiplication.


It's not recognizing a loop, but I've always liked the "turn integer division by a constant into a multiplication by a magic number" optimization: https://godbolt.org/z/Z0FgPW

This and many many other bit-twiddling hacks are described in Hacker's Delight: https://hackersdelight.org/. Compiler developers tend to go through this book and implement all the ones that seem interesting to them.


I'm pretty sure many compilers can replace naive implementations of memcpy with one of the compiler's built-in optimized versions.


Yes; in fact memcpy/memmove are intrinsics in LLVM IR and are one of the more common targets of its idiom recognizer. Fun fact: many loop replacements and vectorizations depend on signed induction variables, so that the compiler doesn't have to prove they don't overflow. (Unsigned overflow is defined; signed overflow is UB and thus can be ignored.)


If only you could specify overflow behavior and value range separately.


replacing sum 1..n with n*(n+1)/2: https://godbolt.org/z/nLV1vb


That kind of thing is actually fully general, and called the scalar evolution (SCEV) analysis. LLVM for example has a full engine [1] for solving all sort of polynomial recurrences, so you can solve, say, sum i^2 with clang too: https://godbolt.org/z/-nHe78

[1] https://github.com/llvm-mirror/llvm/blob/master/lib/Analysis...


Both gcc and clang can often detect tail recursive functions and implement them with plain loops with no recursive call.
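
For example, something like this compiles to a plain counting loop at -O2, with no call instruction in sight (easy to confirm on godbolt):

  /* tail-recursive sum of 1..n; the tail call becomes a jump */
  unsigned sum_to(unsigned n, unsigned acc) {
    if (n == 0)
      return acc;
    return sum_to(n - 1, acc + n);
  }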


I assume this is advantageous because then they can unroll the loops and parallelize instructions across loop 'boundaries'? I wonder if there are other advantages.


Bounded stack usage is the main advantage.


For whatever it's worth, this article is a good example of why HN's (only theoretical, by now) policy of keeping the original headline intact makes sense. I wouldn't have clicked on an article with the title "You Won’t Believe This One Weird CPU Instruction!", and I would not feel regret.


> For whatever it's worth, this article is a good example of why HN's (only theoretical, by now) policy of keeping the original headline intact makes sense.

Did you mean doesn't make sense?


There is more! You can be _faster_ than __builtin_popcount()!

Wojciech Muła did an excellent SSSE3 implementation:

http://0x80.pl/articles/sse-popcount.html

My understanding is that the set-up time is slower than the POPCNT CPU instruction, but throughput is higher. Useful if you need to count bits not in a single register, but across a whole array.


It looks like the conclusion is that the POPCNT instruction ("cpu") is faster than any of the SSE implementations. Only AVX2 outperforms POPCNT, and only for large enough bitstrings.


On i7:

> AVX2 code is faster than the dedicated instruction for input size 512 bytes and larger

The difference is indeed tiny. Still, it's very cool that generic AVX2 code can beat the instruction burned into the silicon!


Yikes I remember reading that thread on usenet. John Nagle, Rob Warnock and Roger Shepherd. Good times. Roger oddly didn't mention the story about guys in black suits and mirrored sunglasses showing up as the reason the T414 lacked popcount but the T800 added it.


Popcount is also very valuable in search (bitmap) indexes, for example roaring bitmap by Lemire uses it quite a bit.


Though, interestingly, if AVX2 is available then Lemire's CRoaring package uses a Harley-Seal popcount implementation, at https://github.com/RoaringBitmap/CRoaring/blob/master/includ... , rather than the POPCNT64 instruction.

Because it's faster - https://arxiv.org/abs/1611.07612 .


From the arxiv pdf: "We disabled Turbo Boost and set the processor to run at its highest clock speed" for their comparison.

Doesn't their setup bias towards the AVX2 solution?


For what it's worth, I've tested my AVX2 copy of their code on a couple of different machines, and found it faster than my POPCNT-based implementation. (See my comments elsewhere here.)

Here are my laptop numbers for my benchmark:

  2048 Tanimoto (RDKit Morgan radius=2) 1000 queries
  chemfp search using threshold Tanimoto arena, index, single threaded (popcnt_128_128)
    threshold T=0.4 popcnt 19.33 ms/query  (T2: 19.81 ms check: 189655.09 run-time: 39.2 s) (allow specialized)
  chemfp search using threshold Tanimoto arena, index, single threaded (avx2_256)
    threshold T=0.4 avx2 12.63 ms/query  (T2: 14.04 ms check: 189655.09 run-time: 26.7 s) (allow specialized)
That's 19.33 ms/query on 2048-bit (256 byte) bitstrings using POPCNT unrolled to two loops of 128 bytes each, and 12.63 ms/query using AVX2 specialized for 256 bytes.

My testing showed there was no advantage for a fully unrolled POPCNT implementation. One thing to know is that there is only one execution port for POPCNT on my Intel processor. An AMD processor with 4 execution ports (Ryzen, I think?) may be faster. I don't have that processor, and my limited understanding of the gcc-generated assembly suggests it isn't optimized for that case.


> An AMD processor with 4 execution ports (Ryzen, I think?) may be faster.

FYI m0zg agreed: "AMD can retire 4 (!) popcounts per cycle per core, if your code is able to feed it": https://news.ycombinator.com/item?id=20916023


I'm one of the authors on the paper. This is a good point. I think the answer is "it might". In practice (as 'dalke' points out) the AVX2 approach still ends up faster on common Intel hardware with Turbo enabled, but there might be cases where it doesn't, and the measured degree of difference certainly might change with Turbo on or off.

Complicating things, some but not all AVX2 operations benefit from Turbo frequencies as well. It depends among other things on the exact mix of instructions. The technical term is "license", and regardless of the setting of Turbo your processor chooses a frequency depending on the license that the instruction mix allows. Daniel has a blog post (co-written with Travis) on how this affects AVX512, but it also affects AVX2: https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-us.... The specifics are not well documented, and Travis probably has the best understanding of the exact behavior of anyone I know.

Anyway, we turn Turbo off and try to run the processor at a constant frequency because it makes some of our other measurements more reliable, but yes, when making claims that a SIMD solution is faster than a scalar solution, we should probably be testing whether the conclusion holds even when we turn Turbo back on. I think it's still best practice to do most measurements with it off, but I'll try to remember to do an extra test with it turned on if this comes up in future papers.


Thanks for your answer - great blog link - most relevant:

"Intel made more aggressive use of AVX512 instructions in earlier versions of the icc compiler, but has since removed most use unless the user asks for it with a special command line option.".

For other readers, short summary copied from the link with very light editing:

* There are heavy and light AVX instructions. "Heavy" instructions roughly are those involving floating point operations or integer multiplications operating on 512 bits.

* Intel cores can run in one of three modes: license 0 (L0) [normal], license 1 (L1) is slower, and license 2 (L2) is the slowest. To get into license 2, you need sustained use of heavy 512-bit instructions, where sustained means approximately one such instruction every cycle.

* The processor does not immediately move to a higher license when encountering heavy instructions: it will first execute these instructions with reduced performance (say 4x slower) and only when there are many of them will the processor change its frequency. Otherwise, any other 512-bit instructions will move the core to L1.

* Downclocking is per core and for a short time after you have used particular instructions (e.g., ~2ms).

* The downclocking of a core is based on: the current license level of that core, and also the total number of active cores on the same CPU socket (irrespective of the license level of the other cores).

The constraints and suggested workarounds are complex. Also the word choice "license" by Intel is bizarre (since it implies the throttling is to reduce performance for business reasons rather than technical reasons?).


Be aware though that Intel POPCNT instruction throughput is pretty low by today's standards. When you need to popcount a lot of bits, on Intel it might make sense to do it without the use of POPCNT. This oldie but goodie still holds, on Intel: https://danluu.com/assembly-intrinsics/

In contrast, AMD can retire 4 (!) popcounts per cycle per core, if your code is able to feed it.

There's vectorized popcount in some of the optional AVX512 instruction sets, but for those to make it into consumer CPUs AMD would really have to hurt Intel pretty bad, marketshare-wise. :-)


We used a “popcount” style operation + bitwise AND to determine how similar two people were by interest. We had a bitmap where each bit represented an interest and 1 meant they had that interest and 0 meant they did not. Bitwise AND filters out the non overlapping interests and popcount to measure the “intensity” of their similarity (to tease out people who had almost every or almost no interests selected matching others like that — bell curve kind of thing).


generally, popcnt is useful to implement the Jaccard index [1]:

    J(X,Y) = |X∩Y| / |X∪Y|

So, implementing the sets X and Y as bitsets it becomes:

    J(X,Y) = popcnt(X&Y) / popcnt(X|Y)

[1]: https://en.wikipedia.org/wiki/Jaccard_index
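
In C, over multiword bitsets, a minimal sketch (names are mine):

  #include <stdint.h>
  #include <stddef.h>

  /* Jaccard index of two bitsets stored as arrays of 64-bit words */
  double jaccard(const uint64_t *x, const uint64_t *y, size_t nwords) {
    long both = 0, either = 0;
    for (size_t i = 0; i < nwords; i++) {
      both   += __builtin_popcountll(x[i] & y[i]);
      either += __builtin_popcountll(x[i] | y[i]);
    }
    return either ? (double)both / either : 1.0;  /* treat J(empty,empty) as 1 */
  }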


If anyone needs that to go fast, for fixed-length (up to ~4k bits) try out my code, chemfp at http://chemfp.com/ . It's designed for cheminformatics, but can work with any data set which can be described by a fixed-length byte string and an identifier.


https://haskell-works.github.io/posts/2018-08-22-pdep-and-pe... from the series https://haskell-works.github.io/ shows how to use popcnt (and rank-select) in a Haskell CSV parser


The same computer architect who first told me about popcount's desirability to the NSA also believed that the massive simultaneous multithreading first introduced to the HPC world commercially by Tera—now owner of the "Cray" name—was also a big win for the spooks.


Revisiting POPCOUNT Operations in CPUs/GPUs https://homes.cs.washington.edu/~cdel/papers/spost106s1-pape...


The coolest part of this for me is that compilers can recognize this stuff and replace it with the machine instruction.

I know that optimizing compilers do some amazing things, but detecting this particular case is pretty cool.


I highly recommend Matt Godbolt's talk "What has my compiler done for me lately?"[0] and its follow-up[1]. Both are great presentations going into detail on some of the crazy optimizations compilers will do. As someone who knows very little about the inner workings of compilers, my mind was blown when I saw some of them.

[0] https://www.youtube.com/watch?v=bSkpMdDe4g4

[1] https://www.youtube.com/watch?v=nAbCKa0FzjQ


Fantastic. Adding to my watch list.


If you're building a binary tree incrementally, comparing the length of your stack of partially completed subtrees to the popcnt of the number of leaves so far tells you how many subtrees you need to merge :)
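
(Roughly, with hypothetical stack helpers:)

  struct node;
  struct stack;
  void push(struct stack *s, struct node *n);      /* hypothetical helpers */
  struct node *pop(struct stack *s);
  unsigned depth(const struct stack *s);
  struct node *merge(struct node *l, struct node *r);

  /* leaves = number of leaves added so far, including this one; the stack
     should end up exactly popcount(leaves) deep: one subtree per set bit */
  void add_leaf(struct stack *s, struct node *leaf, unsigned leaves) {
    push(s, leaf);
    while (depth(s) > (unsigned)__builtin_popcount(leaves)) {
      struct node *right = pop(s);
      struct node *left = pop(s);
      push(s, merge(left, right));
    }
  }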


The popcount instruction seems to be a meme on Hacker News. You're cool if you know about it and cooler if you've used it. That aside, this article is a good one for applications of it.


Knowing when to use bitmap representation, ops, and popcount is a kind of superpower. When it is the right thing, it makes your program 10x faster.


This can be said about almost all CPU instructions. x64 has tons of SIMD intrinsics. You can do absolutely crazy stuff with them if you are willing to invest the time.


True enough, but bitmaps are very easy to use portably, with just &, |, ~, ^, and ++ operations. The only exotic operation needed at all frequently is popcount, but you get that portably from std::bitset (except, evidently, in MSVC).

The compilers are perfectly happy to keep an unsigned and a bitset variable in the same register, so that assigning from one to the other is a no-op.


It's kind of pathetic that the RISC-V core instruction set lacks popcount.

Besides all its immediate uses, it is essential to a fast integer logarithm.


It is part of the Bit Manipulation extension. So is count leading zeros. Unfortunately it is still in draft.

The math extensions are not in draft.


Being in an extension means you can't count on it being there.


That's a feature (for hardware designers), not a bug.

RISC-V is not really designed to be a high-performance ISA. It's aimed squarely at the embedded market.


It's a feature for hardware designers working with 250 nm transistors.

Modern designers working with modern processes have the problem of discovering what they can throw a million transistors at that will actually make real programs faster. Hint: bitmanips!


Popcount comes up in chess engine programming, where boards are regularly represented as a series of bits for each type of piece. The Chess Programming Wiki has a lot of information about both usage and implementation:

https://www.chessprogramming.org/Population_Count


I think the most probable reason for this instruction is for calculating parity bits. This would need to be done fast so it makes sense that there would be a CPU instruction to do most of the work.


Parity is much easier than counting. It's so easy that you get it for free in the x86 flags register. (… and because the 8008 was designed to run a terminal.)
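
(GCC and Clang even expose this directly; parity is just the low bit of the popcount:)

  /* 1 if x has an odd number of set bits; same as __builtin_parity(x) */
  int parity(unsigned x) {
    return __builtin_popcount(x) & 1;
  }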


>"(… and because the 8008 was designed to run a terminal.)"

Could you elaborate on this? How does the 8008 being designed to run a terminal relate to the parity and the flags register?


Each iteration from the 8080 through x64 has a parity bit in the flags register for backwards compatibility with the previous generation. The 8008 was a microprocessor implementation of the Datapoint 2200 architecture.


Excellent, thanks for the insights. Cheers.


Early protocols didn't have error correction in the lower layers. The parity flag was the equivalent of a CRC instruction today. Presumably if parity was incorrect, that would mean the byte was transmitted incorrectly.


I've used _BitScanReverse (popcount-ish MSVC intrinsic) with a custom octree indexing scheme, where higher level voxels have more unset bits from the left. Neat stuff.


MySQL includes a native popcount function; Postgres didn't when I checked a few years back. I actually needed it at the time.


It's surprisingly easy to implement a popcount user-defined function in C for Postgres. It would be a problem for hosted Postgres with extension restrictions, though.


Power ISA has it too. I'm hard pressed to think of a modern architecture that doesn't have it.


RISC-V delegates pcnt to the “B” Standard Extension for Bit Manipulation, and the status of B version 0.90 is Open, meaning it hasn't been finalized. So realistically, RISC-V doesn't have the pcnt instruction. Versions supporting B will have it but that will be at some point in the future.


I don’t think ARM does, but there might be some NEON thing that you can use to do something equivalent.



Huh, it looks like that only works on 1-byte values? That’s an interesting choice.


Worse, it's a fertile ground for "interesting" bugs, because VADDV (which sum-reduces the result) reduces into an 8 bit uint. So if you e.g. accumulate two or more quadword VCNTs into a uint8x16_t and then VADDV it, you could end up with something other than the actual overall bit count (because 2 quadwords can have _256_ bits set). Same with accumulating 8 or more VADDVs, except now individual bytes could wrap around if you don't widen in between.


Speaking of ARM, it has an rbit instruction which for some reason doesn't exist on x86.


Apparently x86 isn't "web scale" enough to have to deal with endianness ;)


It is not really about endianness. The instruction reverses bits, not bytes. IIRC it is needed as a step of the FFT (the bit-reversal permutation).


I've used this in an effort to calculate Hamming distances between encoded equal-length strings of DNA.


Popcount is not at all unusual in ordinary bit-twiddling.


popcount reminds me of my first Map-Reduce program. It seems that I "reduced" by popcounting.


Peculiar name. I would have named that "bitsum".


The original name was "sideways addition". There are many other names, including Hamming weight. The Wikipedia entry for Hamming weight says the HP-16C developers agree with you:

> Some programmable scientific pocket calculators feature special commands to calculate the number of set bits, e.g. #B on the HP-16C[3][17] and WP 43S,[18][19] #BITS[20][21] or BITSUM[22][23] on HP-16C emulators, and nBITS on the WP 34S.[24][25]



