Obviously using a dedicated instruction is fastest in normal cases.
But if you need to implement popcount or many other bit manipulation algorithms in software, a good book to look at is "Hacker's Delight" by Henry S. Warren, Jr, 2003.
"Hacker's Delight' page 65+ discuss "Counting 1-bits" (population counts). There are a lot of software algorithms to do this.
One approach is to set each 2-bit field to the count of 2 1-bit fields, then each 4-bit field to the count of 2 2-bit fields, etc., like this:
Which runs only a number of iterations equal to the number of 1 bits. This works because (x-1) actually just flips the rightmost 1 bit and all zeroes to its right, then the & zeroes all of those.
It's not that fast unless your integer is really sparse (since it has a branch), but I've always liked the bit hack.
This is essentially equivalent to feeding the input through bitwise-NOT first. Unfortunately, there are far more integers that are neither sparse nor dense than integers that are sparse or dense.
But I can certainly imagine there could be problem domains where most of some collection of integers being manipulated are expected to be sparse or dense.
Yet, practically all software running on 64-bit x86 machines is compiled without, because the original amd64 released in 2003 lacked it, and distributions still target that. Likewise, MSVC. There would be good reasons for Apple XCode not to, but that doesn't mean they don't.
If you tell MSVC to issue a popcount instruction with "__popcnt64()" (etc.), it will. If you ask Gcc to issue a popcount instruction with "__builtin_popcount()", it will only do it if you have also told it to target an ISA that has one; otherwise it emulates.
The only portable way to get a popcount instruction, thus far, is to use C++'s std::bitset::count() in circumstances where the compiler believes the instruction would work. Pleasingly, Gcc and Clang are both happy to hold an integer type and its std::bitset representation in the same register at the same time, so there is no runtime penalty for a round-trip through std::bitset.
MSVC's standard library implementation of std::bitset does not use the popcount instruction.
Try the Cosmopolitan Libc implementation of popcnt(). It uses CPUID checks for compatibility which get hoisted out of a tight loop by the optimizer so there's no performance loss when building for -march=k8. If you build for -march=native then they get DCE'd entirely. See https://github.com/jart/cosmopolitan/blob/master/libc/bits/p...
Thanks you just helped tons of my code go faster. Turns out false output dependency is a much bigger issue for bsr and bsf with 32-bit operands since it impacts all models and GCC builtins like __builtin_clz(x)^31 won't xor the output unless you use -march=native. So once again I find myself replacing compiler apis with stallman notation. For what it's worth:
C++20 added std::popcount() in the new header <bit>
If you haven't told it that the target ISA has the POPCNT instruction, std::popcount() on MSVC will use runtime feature detection: https://godbolt.org/z/qab4Mjv1v
But if you need to implement popcount or many other bit manipulation algorithms in software, a good book to look at is "Hacker's Delight" by Henry S. Warren, Jr, 2003.
"Hacker's Delight' page 65+ discuss "Counting 1-bits" (population counts). There are a lot of software algorithms to do this.
One approach is to set each 2-bit field to the count of 2 1-bit fields, then each 4-bit field to the count of 2 2-bit fields, etc., like this:
assuming x is 32 bits.I think this approach is a classic divide-and-conquer solution.