Hacker News new | past | comments | ask | show | jobs | submit login

Obviously using a dedicated instruction is fastest in normal cases.

But if you need to implement popcount or many other bit manipulation algorithms in software, a good book to look at is "Hacker's Delight" by Henry S. Warren, Jr, 2003.

"Hacker's Delight' page 65+ discuss "Counting 1-bits" (population counts). There are a lot of software algorithms to do this.

One approach is to set each 2-bit field to the count of 2 1-bit fields, then each 4-bit field to the count of 2 2-bit fields, etc., like this:

    x = (x & 0x55555555) + ((x >> 1) & 0x55555555);
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    x = (x & 0x0f0f0f0f) + ((x >> 4) & 0x0f0f0f0f);
    x = (x & 0x00ff00ff) + ((x >> 8) & 0x00ff00ff);
    x = (x & 0x0000ffff) + (x >> 16);
assuming x is 32 bits.

I think this approach is a classic divide-and-conquer solution.




A neat one for large, sparse integers is:

  for(i=0; !x; ++i) {
    x = (x-1)&x
  }
  return i;
Which runs only a number of iterations equal to the number of 1 bits. This works because (x-1) actually just flips the rightmost 1 bit and all zeroes to its right, then the & zeroes all of those.

It's not that fast unless your integer is really sparse (since it has a branch), but I've always liked the bit hack.


When the integer is expected to be dense, you have the corresponding trick

    size_t count = sizeof(x) * 8;
    while(x != -1) {
        x |= x+1;
        --count;
    }
    return count;


This is essentially equivalent to feeding the input through bitwise-NOT first. Unfortunately, there are far more integers that are neither sparse nor dense than integers that are sparse or dense.


But I can certainly imagine there could be problem domains where most of some collection of integers being manipulated are expected to be sparse or dense.


You can also go faster using SIMD instructions if you need to compute wider population counts (beyond 64 bits):

Faster Population Counts Using AVX2 Instructions, Computer Journal, Volume 61, Issue 1, 2018 https://arxiv.org/abs/1611.07612


> But if you need to implement popcount or many other bit manipulation algorithms in software

Power9, ARM, x86 BMI, Nvidia PTX, AMD GCN, and AMD RDNA all have a popcount instruction.

Yeah, all mainstream CPUs and GPUs made in the past decade...

Unfortunately, there's no system I can think of where you'd need the software solution anymore... Maybe if you wanted popcount on an Arduino??


Yet, practically all software running on 64-bit x86 machines is compiled without, because the original amd64 released in 2003 lacked it, and distributions still target that. Likewise, MSVC. There would be good reasons for Apple XCode not to, but that doesn't mean they don't.

If you tell MSVC to issue a popcount instruction with "__popcnt64()" (etc.), it will. If you ask Gcc to issue a popcount instruction with "__builtin_popcount()", it will only do it if you have also told it to target an ISA that has one; otherwise it emulates.

The only portable way to get a popcount instruction, thus far, is to use C++'s std::bitset::count() in circumstances where the compiler believes the instruction would work. Pleasingly, Gcc and Clang are both happy to hold an integer type and its std::bitset representation in the same register at the same time, so there is no runtime penalty for a round-trip through std::bitset.

MSVC's standard library implementation of std::bitset does not use the popcount instruction.


Try the Cosmopolitan Libc implementation of popcnt(). It uses CPUID checks for compatibility which get hoisted out of a tight loop by the optimizer so there's no performance loss when building for -march=k8. If you build for -march=native then they get DCE'd entirely. See https://github.com/jart/cosmopolitan/blob/master/libc/bits/p...


Because of a bug in a bunch of Intel chips, something like the following is probably better.

      asm(
  "mov\t%1,%0\n"
  "popcnt\t%1,%1"
    : "=r"(Res) : "r"(Pop) : "cc");
(assuming intel syntax)


Thanks you just helped tons of my code go faster. Turns out false output dependency is a much bigger issue for bsr and bsf with 32-bit operands since it impacts all models and GCC builtins like __builtin_clz(x)^31 won't xor the output unless you use -march=native. So once again I find myself replacing compiler apis with stallman notation. For what it's worth:

    asm("popcnt\t%0,%0" : "=r"(res) : "0"(pop) : "cc");
Can generate the mov statement automatically.


C++20 added std::popcount() in the new header <bit>

If you haven't told it that the target ISA has the POPCNT instruction, std::popcount() on MSVC will use runtime feature detection: https://godbolt.org/z/qab4Mjv1v


Curiously, for std::bitset<32>(i).count(), MSVC (VS 16.9) still generates a loop, regardless.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: