Obviously using a dedicated instruction is fastest in normal cases. But if you n...

amalcon · on June 11, 2021

A neat one for large, sparse integers is:

  for(i=0; !x; ++i) {
    x = (x-1)&x
  }
  return i;

Which runs only a number of iterations equal to the number of 1 bits. This works because (x-1) actually just flips the rightmost 1 bit and all zeroes to its right, then the & zeroes all of those.

It's not that fast unless your integer is really sparse (since it has a branch), but I've always liked the bit hack.

pbsd · on June 11, 2021

When the integer is expected to be dense, you have the corresponding trick

    size_t count = sizeof(x) * 8;
    while(x != -1) {
        x |= x+1;
        --count;
    }
    return count;

throw5away · on June 11, 2021

This is essentially equivalent to feeding the input through bitwise-NOT first. Unfortunately, there are far more integers that are neither sparse nor dense than integers that are sparse or dense.

jffry · on June 12, 2021

But I can certainly imagine there could be problem domains where most of some collection of integers being manipulated are expected to be sparse or dense.

dlemire · on June 11, 2021

You can also go faster using SIMD instructions if you need to compute wider population counts (beyond 64 bits):

Faster Population Counts Using AVX2 Instructions, Computer Journal, Volume 61, Issue 1, 2018 https://arxiv.org/abs/1611.07612

dragontamer · on June 11, 2021

> But if you need to implement popcount or many other bit manipulation algorithms in software

Power9, ARM, x86 BMI, Nvidia PTX, AMD GCN, and AMD RDNA all have a popcount instruction.

Yeah, all mainstream CPUs and GPUs made in the past decade...

Unfortunately, there's no system I can think of where you'd need the software solution anymore... Maybe if you wanted popcount on an Arduino??

ncmncm · on June 11, 2021

Yet, practically all software running on 64-bit x86 machines is compiled without, because the original amd64 released in 2003 lacked it, and distributions still target that. Likewise, MSVC. There would be good reasons for Apple XCode not to, but that doesn't mean they don't.

If you tell MSVC to issue a popcount instruction with "__popcnt64()" (etc.), it will. If you ask Gcc to issue a popcount instruction with "__builtin_popcount()", it will only do it if you have also told it to target an ISA that has one; otherwise it emulates.

The only portable way to get a popcount instruction, thus far, is to use C++'s std::bitset::count() in circumstances where the compiler believes the instruction would work. Pleasingly, Gcc and Clang are both happy to hold an integer type and its std::bitset representation in the same register at the same time, so there is no runtime penalty for a round-trip through std::bitset.

MSVC's standard library implementation of std::bitset does not use the popcount instruction.

jart · on June 12, 2021

Try the Cosmopolitan Libc implementation of popcnt(). It uses CPUID checks for compatibility which get hoisted out of a tight loop by the optimizer so there's no performance loss when building for -march=k8. If you build for -march=native then they get DCE'd entirely. See https://github.com/jart/cosmopolitan/blob/master/libc/bits/p...

ncmncm · on June 13, 2021

Because of a bug in a bunch of Intel chips, something like the following is probably better.

      asm(
  "mov\t%1,%0\n"
  "popcnt\t%1,%1"
    : "=r"(Res) : "r"(Pop) : "cc");

(assuming intel syntax)

jart · on June 15, 2021

Thanks you just helped tons of my code go faster. Turns out false output dependency is a much bigger issue for bsr and bsf with 32-bit operands since it impacts all models and GCC builtins like __builtin_clz(x)^31 won't xor the output unless you use -march=native. So once again I find myself replacing compiler apis with stallman notation. For what it's worth:

    asm("popcnt\t%0,%0" : "=r"(res) : "0"(pop) : "cc");

Can generate the mov statement automatically.

ynik · on June 14, 2021

C++20 added std::popcount() in the new header <bit>

If you haven't told it that the target ISA has the POPCNT instruction, std::popcount() on MSVC will use runtime feature detection: https://godbolt.org/z/qab4Mjv1v

ncmncm · on June 14, 2021

Curiously, for std::bitset<32>(i).count(), MSVC (VS 16.9) still generates a loop, regardless.