Hacker News new | past | comments | ask | show | jobs | submit login

> But if you need to implement popcount or many other bit manipulation algorithms in software

Power9, ARM, x86 BMI, Nvidia PTX, AMD GCN, and AMD RDNA all have a popcount instruction.

Yeah, all mainstream CPUs and GPUs made in the past decade...

Unfortunately, there's no system I can think of where you'd need the software solution anymore... Maybe if you wanted popcount on an Arduino??




Yet, practically all software running on 64-bit x86 machines is compiled without, because the original amd64 released in 2003 lacked it, and distributions still target that. Likewise, MSVC. There would be good reasons for Apple XCode not to, but that doesn't mean they don't.

If you tell MSVC to issue a popcount instruction with "__popcnt64()" (etc.), it will. If you ask Gcc to issue a popcount instruction with "__builtin_popcount()", it will only do it if you have also told it to target an ISA that has one; otherwise it emulates.

The only portable way to get a popcount instruction, thus far, is to use C++'s std::bitset::count() in circumstances where the compiler believes the instruction would work. Pleasingly, Gcc and Clang are both happy to hold an integer type and its std::bitset representation in the same register at the same time, so there is no runtime penalty for a round-trip through std::bitset.

MSVC's standard library implementation of std::bitset does not use the popcount instruction.


Try the Cosmopolitan Libc implementation of popcnt(). It uses CPUID checks for compatibility which get hoisted out of a tight loop by the optimizer so there's no performance loss when building for -march=k8. If you build for -march=native then they get DCE'd entirely. See https://github.com/jart/cosmopolitan/blob/master/libc/bits/p...


Because of a bug in a bunch of Intel chips, something like the following is probably better.

      asm(
  "mov\t%1,%0\n"
  "popcnt\t%1,%1"
    : "=r"(Res) : "r"(Pop) : "cc");
(assuming intel syntax)


Thanks you just helped tons of my code go faster. Turns out false output dependency is a much bigger issue for bsr and bsf with 32-bit operands since it impacts all models and GCC builtins like __builtin_clz(x)^31 won't xor the output unless you use -march=native. So once again I find myself replacing compiler apis with stallman notation. For what it's worth:

    asm("popcnt\t%0,%0" : "=r"(res) : "0"(pop) : "cc");
Can generate the mov statement automatically.


C++20 added std::popcount() in the new header <bit>

If you haven't told it that the target ISA has the POPCNT instruction, std::popcount() on MSVC will use runtime feature detection: https://godbolt.org/z/qab4Mjv1v


Curiously, for std::bitset<32>(i).count(), MSVC (VS 16.9) still generates a loop, regardless.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: