The NSA Instruction (2019) (vaibhavsagar.com)
244 points by cjg on June 11, 2021 | 86 comments



Back in the mid-2000s I worked at a company that made their own (MIPS-based) chips. NSA was one of our customers - supposedly the "defense" side, which could be considered the good side of NSA compared to the 10x larger "offense", but still. As we were planning for our second generation, they offered quite a bit of money if we'd implement a "sheep and goats" instruction. It would take two operands: an input and a mask. The masked-in bits of the input (the "sheep") would be packed toward the MSB of the output, while the masked-out bits (the "goats") would be packed toward the LSB. We had a lot of people on staff with serious chops in all sorts of math including cryptography, but none of them could identify an algorithm that would benefit from having such an instruction (as distinct from more conventional range-based bitfield instructions). Since the company went under shortly afterward, it remained a mystery. I still wonder about it.


Half of this instruction is present in AMD64's BMI2 extension as PEXT, and the reverse operation as PDEP. Unlike "sheep and goats", PEXT just extracts the sheep into the LSB and ignores the goats.
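
For what it's worth, the full "sheep and goats" operation falls out of two PEXTs plus a popcount, assuming the semantics described above (a minimal sketch, not anything the GP's customer specified):

    #include <immintrin.h>   // _pext_u64; requires BMI2 (-mbmi2)
    #include <stdint.h>

    // Masked-in bits ("sheep") packed toward the MSB,
    // masked-out bits ("goats") packed toward the LSB.
    uint64_t sheep_and_goats(uint64_t x, uint64_t mask) {
        uint64_t sheep = _pext_u64(x, mask);    // sheep, packed low
        uint64_t goats = _pext_u64(x, ~mask);   // goats, packed low
        int n = __builtin_popcountll(mask);     // number of sheep
        // guard n == 0: shifting a 64-bit value by 64 is undefined
        return (n ? sheep << (64 - n) : 0) | goats;
    }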

If I recall the Knuth lecture correctly, given a "sheep and goats" instruction where one of the sets is packed in reverse order, you can implement any n-bit permutation in something like log2(n) instructions. I don't remember if this is true if they're both packed in forward order. But it would be nice for some hardware crypto designs, like DES or more recently GIFT.

PEXT has at least two additional use cases I know of: manipulating bit indices for databases, and binary (GF2) matrix manipulation. I've used it in a (non-crypto) project to select a subset of columns from a binary matrix, to convert it to systematic form. This subroutine also used popcount.

What I really wanted in that project was another "NSA instruction": bit matrix multiply. Cray supercomputers can multiply two 64x64 binary matrices in one instruction, though I have no idea how many cycles it takes. With AVX2, the best I could do was 6 instructions plus precomputation for 8x8 x 8x32, which is 1/128th of the work.


Succinct (space-saving) data structures often need "rank" and "select" operations. Rank(n) is the number of 1 bits up to position n. Select(n) is the reverse: at which position is the n-th 1 bit.

For "rank", the "popcount" instruction can be used. Interestingly, for "select", the "PDEP" instruction can be used: you can put the data array in the PDEP mask, and 1 << n in the value; basically flip the operands. I found this quite fascinating. For details, there is a short paper on this: "A Fast x86 Implementation of Select".

I wonder if those succinct data structures are in any way related to what NSA is doing. I think not, but who knows.


I've seen it used for large-scale genomics. Saving a few bits when you're dealing with billions of a thing is very useful. They're also vital for being able to pack as much of a data structure (e.g. a graph) as possible on a single node. Some graph algorithms, e.g. random walks, are latency bound and scale really badly in a distributed system.


The paper, if anyone wants to save some clicks: https://arxiv.org/abs/1706.00990


Indeed, Cray famously said "If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?"

Unfortunately we only have 1024 chickens in modern computers.


Yes, but those chickens now are as powerful as Cray's oxen were then.


so, do you want 2 modern oxen or 1024 modern chickens?


These days, all the oxen are made up of chickens. The biggest one is 7,630,848 chickens.

https://www.top500.org/lists/top500/2020/11/


Gimme dem modern wide superscalar OOO cached chickens, please. Cray was right back then but he is no longer right now. If he were, the market would say so.


Cray is still right.

Today we know how to put 16 4 GHz CPUs on a single die. If we want, we can hook chips together to build a computer with 16,384 CPUs.

But we can't build a single chip running usefully at 16x4 GHz. We can't build a single system running at 16384x4 GHz.

If we could build that fast chip or system, all else being equal, the market would choose the single fast CPU over the pile of slow CPUs.

Right now "the market can't say so". It's impossible to provide such a system to the market. We're forced to buy computers with so many CPUs because we've pretty much hit the wall in terms of frequency scaling.

Intel Pentium 4, circa 2001, ran at about 1.4 GHz.

Intel Core i9, circa 2021, runs at about 3.5 GHz, with turbo boost to about 5.2 GHz.

That's about a 3x improvement in clock speed in 20 years. We simply can't make CPUs that run faster than that.

(I had to use x to represent multiplication, HN formatting gets funny with asterisks).


Thanks for the post, but you can't in all fairness keep your "Cray is still right" opening statement when you go on to agree that in reality, which is what matters, we have to settle for lots of chickens.


That's a good point. The "strong oxen" Cray was originally talking about are now totally unachievable, compared to settling for lots of chickens.


If you had to digest a million grains, which would you rather use?


It's out of my depth, but my guess is on something DES-related. Here's a link to some possibly relevant discussion about it.

http://www.icodeguru.com/Embedded/Hacker's-Delight/050.htm


Heh, so it's an instruction for INTERCAL's "select" (~) operator...


That sounds like a decent primitive to accelerate arbitrary bit permutations in software. It's known as GRP in, e.g., [1].

[1] http://palms.ee.princeton.edu/PALMSopen/shi00bit.pdf


Very interesting. This was published in 2000, and the people we worked with were near Princeton, so this result - the specific utility of such an instruction, if not the semantics themselves - might very well be something that was known to some relevant people but not yet widely enough for any of our people to recognize it. Thanks!


I also worked at a company that made CPUs which initially had no bit-count instruction. Then at some point the instruction was added. At the time I heard that "men in black with mirrored sunglasses" had shown up and demanded that the instruction be added. Whether or not this was an accurate description of events, you can see the note on page 74 of this document (section 8.2) recording that the instruction was added: http://www.transputer.net/iset/pdf/tis-acwg.pdf

Edit: I see Roger Shepherd (one of the people in the know at the above-mentioned company) commented in the comp.arch thread (which I vaguely remember reading at the time), but with no mention of MIB...


The 'and goats' part leads me to conceptualize the instruction more like:

Bit Scrambler / Chutes - shuffle bits around in a way that divides a stream.

This might also be useful in pre-filters for compression (entropy reduction) if you knew the content of the message. E.g., for ASCII text, the upper 2-3 bits of each letter could be packed off to one side for better compression and a reduction in message size.

As others have pointed out, modern CPUs ended up with 'half' that instruction, so I wonder if there were any other reasons for the full instruction.


Compress and expand (from Hacker's Delight) are like this, but only the selected bits are kept. These are quite useful instructions. One use is the hash function for perfect hash tables: the hash table includes a mask which picks out the key bits that actually differ between keys and compresses them all to the right to form the hash index.
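
A sketch of that idea (the mask value below is purely illustrative):

    #include <immintrin.h>   // _pext_u64; requires BMI2
    #include <stdint.h>

    // Compress the distinguishing key bits into a dense table index.
    unsigned phash(uint64_t key, uint64_t mask) {
        return (unsigned)_pext_u64(key, mask);
    }
    // e.g. with mask = 0x101, keys 0x001, 0x100, 0x101 hash to 1, 2, 3.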

Disclaimer: I contributed the "expand" algorithm shown in Hacker's Delight.


Is it possible that by some mistake the NSA your company was working with was National Sheepfarmers Association?

Did representatives of the “NSA” have a New Zealand accent?


That's equivalent to ~0 << popcount(~(x & mask)) (where the left shift must saturate and not truncate the shift count, otherwise you need to special-case x = 0) and seems much less useful than popcount.


I think you're misunderstanding what the instruction (or similar ones that others have mentioned) would do. It's a specialized permutation function; every bit in the input is preserved, just in a different position. Your version doesn't have that property at all, and would indeed not be very useful.


Some years back, I got myself a copy of Andrew Hodges' "Alan Turing: The Enigma", a biography and IMO generally a good read, but also one with some gems of very early computing history in it.

Specifically, after WWII, Turing worked on the ACE 1 (later reduced to Pilot ACE) project to build an electronic computer, which didn't really progress due to management and bureaucracy overhead. He eventually went to Manchester, once they got their Manchester Mark 1 off the ground, which they tried to commercialize as "Ferranti Mark 1" (https://en.wikipedia.org/wiki/Ferranti_Mark_1).

While employed by the university, Turing IIRC continued to work on the side as an external consultant for whatever became of G.C. & C.S. According to the book, he convinced them to buy such a machine (presumably for cryptanalysis?) and, on the Manchester side of things, insisted on some modifications, including a "horizontal adder", so it could count the number of bits set in a word with a single instruction, i.e. a popcount instruction. This would pre-date the IBM Stretch mentioned in the article.


The consensus on the 1992 thread (including a really great comment from 'Animats) seems to be that `popcount` was generally not added to architectures at NSA's request --- that people familiar with those archs knew the actual reason `popcount` wound up in the ISA, and it preceded NSA purchases.

https://groups.google.com/g/comp.arch/c/UXEi7G6WHuU/m/Z2z7fC...


The striking thing is that the IBM System/360 didn't have it. Nor did the System/370. Those were the standard mainframes for a generation.

IBM Z-series machines do have population count, finally.


Counting bits was the bottleneck in the genomic scan I co-authored (Kanoungi et al. 2020). popcnt resulted in insane performance gains compared to all other methods.

However, we re-discovered the fact that some Intel CPUs, including the Nehalem mentioned in the article, have a bug that severely affects popcnt's performance, see for example here: https://github.com/komrad36/LATCH/issues/3#issuecomment-2671...


It is possible that the "population count" instruction was included in the instruction sets of most American supercomputers at the request of the NSA, which was an important customer for them.

Nevertheless, the first computer having this instruction was a British computer, the Ferranti Mark I (February 1951).

The name used by Ferranti Mark I for this instruction was "sideways add".

Also notable is that the Ferranti Mark I had the equivalent of LZCNT (count leading zeros) too.

Both instructions are very useful and they are standard now for modern instruction sets, but they were omitted in most computers after Ferranti Mark I, except in expensive supercomputers.


Moreover, Ferranti Mark I included a hardware random number generator, another feature useful for cryptography, which was reintroduced only recently in modern CPUs.


Hardware random number generators do have some security issues though. Linux devs were opposed to solely relying on them, because they can be compromised by the vendor [1]. So they are at best used in algorithms that they can not compromise (still in [1], but lower, in the comments).

[1] https://web.archive.org/web/20180611180213/https://plus.goog...


The security issues are not with hardware random number generators in general, but with those that are included inside complex devices like monolithic CPUs or TPMs, so that the owners of those devices cannot verify that the RNGs really do what they are claimed to do.

Discrete hardware RNGs, like that of the Ferranti Mark I, are perfectly secure.

For a modern device, the best way to implement a hardware RNG is to just include an ADC (analog-digital converter) input. Then you may connect externally on the PCB some noisy analog amplifier, e.g. one which has a noisy resistor or diode at its input. Digitizing the noise with the ADC will provide the random numbers and the ADC input can be isolated and tested separately at any time, so the user can verify that there is no hidden functionality.

Most microcontrollers have ADC inputs, so it is easy to add a secure hardware RNG for them. The same could be done for a personal computer by making a noisy amplifier that can be plugged in the microphone input, or by making a USB device with a microcontroller.


I remembered something, and I want to say as an aside, for anybody reading who at some point has to design a toy RNG from an ADC, as I had to some years ago: you should not take the last bits as they are (as was my first thought); you should pass them through something like the von Neumann corrector [1].

[1] https://everything2.com/title/von+Neumann+corrector
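
The corrector itself is tiny. A sketch, where get_raw_bit() stands in for a hypothetical source of biased ADC bits:

    // Von Neumann corrector: read raw bits in pairs; emit 0 for a 01 pair,
    // 1 for a 10 pair, discard 00 and 11. Output is unbiased as long as
    // the raw bits are independent.
    extern int get_raw_bit(void);   // hypothetical biased bit source

    int debiased_bit(void) {
        for (;;) {
            int a = get_raw_bit();
            int b = get_raw_bit();
            if (a != b) return a;
        }
    }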


Indeed, AMD has more than once shipped CPUs in which the random-number instruction would always yield the same value, and which had to be monkey-patched to yield apparently random numbers. A valuable hint.


I commented on that earlier, including that it probably also has a cryptanalysis background: https://news.ycombinator.com/item?id=27472900

But yes, it definitely pre-dates the 1961 IBM machine in the article.


Obviously using a dedicated instruction is fastest in normal cases.

But if you need to implement popcount or many other bit manipulation algorithms in software, a good book to look at is "Hacker's Delight" by Henry S. Warren, Jr, 2003.

"Hacker's Delight' page 65+ discuss "Counting 1-bits" (population counts). There are a lot of software algorithms to do this.

One approach is to set each 2-bit field to the sum of its 2 1-bit fields, then each 4-bit field to the sum of its 2 2-bit fields, etc., like this:

    x = (x & 0x55555555) + ((x >> 1) & 0x55555555);
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    x = (x & 0x0f0f0f0f) + ((x >> 4) & 0x0f0f0f0f);
    x = (x & 0x00ff00ff) + ((x >> 8) & 0x00ff00ff);
    x = (x & 0x0000ffff) + (x >> 16);
assuming x is 32 bits.

I think this approach is a classic divide-and-conquer solution.


A neat one for large, sparse integers is:

  for(i=0; x; ++i) {
    x = (x-1)&x;
  }
  return i;
Which runs only a number of iterations equal to the number of 1 bits. This works because (x-1) actually just flips the rightmost 1 bit and all zeroes to its right, then the & zeroes all of those.

It's not that fast unless your integer is really sparse (since it has a branch), but I've always liked the bit hack.


When the integer is expected to be dense, you have the corresponding trick

    size_t count = sizeof(x) * 8;
    while(x != -1) {     // until every bit is set
        x |= x+1;        // set the lowest clear bit
        --count;         // one fewer zero bit
    }
    return count;        // width minus number of zeros == popcount


This is essentially equivalent to feeding the input through bitwise-NOT first. Unfortunately, there are far more integers that are neither sparse nor dense than integers that are sparse or dense.


But I can certainly imagine there could be problem domains where most of some collection of integers being manipulated are expected to be sparse or dense.


You can also go faster using SIMD instructions if you need to compute wider population counts (beyond 64 bits):

Faster Population Counts Using AVX2 Instructions, Computer Journal, Volume 61, Issue 1, 2018 https://arxiv.org/abs/1611.07612


> But if you need to implement popcount or many other bit manipulation algorithms in software

Power9, ARM, x86 BMI, Nvidia PTX, AMD GCN, and AMD RDNA all have a popcount instruction.

Yeah, all mainstream CPUs and GPUs made in the past decade...

Unfortunately, there's no system I can think of where you'd need the software solution anymore... Maybe if you wanted popcount on an Arduino??


Yet, practically all software running on 64-bit x86 machines is compiled without it, because the original amd64 released in 2003 lacked it, and distributions still target that. Likewise, MSVC. Apple's Xcode would have good reasons not to (it targets a much narrower range of CPUs), but that doesn't mean it doesn't.

If you tell MSVC to issue a popcount instruction with "__popcnt64()" (etc.), it will. If you ask Gcc to issue a popcount instruction with "__builtin_popcount()", it will only do it if you have also told it to target an ISA that has one; otherwise it emulates.

The only portable way to get a popcount instruction, thus far, is to use C++'s std::bitset::count() in circumstances where the compiler believes the instruction would work. Pleasingly, Gcc and Clang are both happy to hold an integer type and its std::bitset representation in the same register at the same time, so there is no runtime penalty for a round-trip through std::bitset.
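
Concretely, something like this; under GCC/Clang with -mpopcnt (or -march=native) the whole function should compile to a single popcnt:

    #include <bitset>
    #include <cstdint>

    int popcount64(std::uint64_t x) {
        return static_cast<int>(std::bitset<64>(x).count());
    }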

MSVC's standard library implementation of std::bitset does not use the popcount instruction.


Try the Cosmopolitan Libc implementation of popcnt(). It uses CPUID checks for compatibility which get hoisted out of a tight loop by the optimizer so there's no performance loss when building for -march=k8. If you build for -march=native then they get DCE'd entirely. See https://github.com/jart/cosmopolitan/blob/master/libc/bits/p...


Because of a bug in a bunch of Intel chips, something like the following is probably better.

      asm(
          "mov\t%0,%1\n\t"
          "popcnt\t%0,%0"
          : "=r"(Res) : "r"(Pop) : "cc");
(assuming Intel syntax; the mov writes the destination register first, so popcnt's false dependency on it is already satisfied)


Thanks, you just helped tons of my code go faster. Turns out the false output dependency is a much bigger issue for bsr and bsf with 32-bit operands, since it impacts all models, and GCC builtins like __builtin_clz(x)^31 won't xor the output unless you use -march=native. So once again I find myself replacing compiler APIs with stallman notation. For what it's worth:

    asm("popcnt\t%0,%0" : "=r"(res) : "0"(pop) : "cc");
Which generates the mov statement automatically.


C++20 added std::popcount() in the new header <bit>
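
Usage is as simple as it looks; it is constexpr and accepts only unsigned types:

    #include <bit>
    #include <cstdint>

    static_assert(std::popcount(std::uint32_t{0x55}) == 4);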

If you haven't told it that the target ISA has the POPCNT instruction, std::popcount() on MSVC will use runtime feature detection: https://godbolt.org/z/qab4Mjv1v


Curiously, for std::bitset<32>(i).count(), MSVC (VS 16.9) still generates a loop, regardless.


GPU-programmers use popcount-based programming all the time these days, but the abstractions are built on top and are hardware accelerated.

CUDA's __activemask() returns the 32-bit value of your current 32-wide EXEC mask. That is to say, if your current warp is:

    int foo = 0;
    if(threadIdx.x % 2 == 0){
      foo = __activemask();
    }
foo will be "0b01010101...." or 0x55555555. This __activemask() has a number of useful properties should you use __popc with it.

popcount(__activemask()) returns the number of threads executing.

lanemask_lt() returns 0b0000000000000000 for the 0th lane, 0b0000000000000001 for the 1st lane, 0b0000000000000011 for the 2nd lane... and 0b0111111...111 for the last, 31st lane.

popcount(__activemask() & lanemask_lt()) returns the current thread's rank among the active lanes. All together now, we can make a parallel SIMD-stack that can push/pop together in parallel.

    __shared__ int head;             // shared so every lane sees the same head
    __shared__ char buffer[0x1000];  // the stack itself, also shared
    if (threadIdx.x == 0) head = 0;
    __syncthreads();

    while(fooBar()){ // Dynamic! We don't know who is, or is not active anymore
        int localPrefix = __popc(__activemask() & __lanemask_lt()); // rank among active lanes
        int totalWarpActive = __popc(__activemask());               // active-lane count
        buffer[head + localPrefix] = generateValueThisThread();
        if(localPrefix == 0){
            head += totalWarpActive; // Move the head forward, much like a "push" operation in single-thread land
            // Only one thread should move the head
        }
        __syncthreads(); // Thread barrier, make sure everyone is waiting on activeThread#0 before continuing.
    }
------------

As such, you can dynamically load-balance between GPU threads (!!!) from a shared stack with minimal overheads.

If you want to extend this larger than one 32-wide CUDA-warp, you'll need to use __shared__ memory to share the prefix with the rest of the block.

It is a bad idea (too much overhead) to extend this much larger than a block, as there's no quick way to communicate outside of your block. Still though, having chunks of up to 1024 threads synchronized through a shared data-structure that only has nanoseconds of overhead is a nifty trick.

-----------

EDIT: Oh right, and this concept is now replicated very, very quickly in the dedicated __ballot_sync(...) function (which compiles down to just a few assembly instructions).

Playing with the "Exec-mask" is a hugely efficient way to synchronously, and dynamically gather information across your warp. So lots of little tricks have been built around this.


Occasionally I am humbled to realise that even in IT there are vast fields of knowledge that are so far removed from my ordinary knowledge as to appear almost like magic.

This is one of those moments!

PS: I once wrote a 3D engine, but clearly my knowledge is now so out of date as to be practically stone age compared to this kind of thing...


Another interesting application of popcount is in computer vision, namely in matching keypoints that use binary descriptors for 3D reconstruction in SLAM/TRN, etc.


Yep, I've used __builtin_popcountll for ORB from OpenCV (256 bit binary descriptors).
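
For anyone unfamiliar: matching binary descriptors boils down to Hamming distance, i.e. XOR plus popcount. A sketch for 256-bit descriptors stored as four 64-bit words:

    #include <cstdint>

    // Hamming distance between two 256-bit descriptors (e.g. ORB).
    int hamming256(const std::uint64_t a[4], const std::uint64_t b[4]) {
        int d = 0;
        for (int i = 0; i < 4; ++i)
            d += __builtin_popcountll(a[i] ^ b[i]);  // differing bits per word
        return d;
    }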


Looks like we've done similar things :)

Horror story: I was once developing a TRN system for a spacecraft instrument which used an ancient x86 processor that does not have popcnt; I ended up using a look-up table instead...


Did you know about HAKMEM 169? I guess it was/is not widely known since many people mention lookup tables as the only fast alternative to the popcnt instruction.

http://www.hakmem.org/#item169
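
As usually rendered in C (quoting from memory, so check it against the source):

    // HAKMEM item 169: count the bits in each octal digit, then sum the
    // digit counts with a final modulo 63. Valid for 32-bit words.
    unsigned bitcount(unsigned n) {
        unsigned tmp = n - ((n >> 1) & 033333333333)
                         - ((n >> 2) & 011111111111);
        return ((tmp + (tmp >> 3)) & 030707070707) % 63;
    }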


That's interesting, thanks.

It seems that look-up tables are still faster when memory allows it, though (http://www.dalkescientific.com/writings/diary/archive/2008/0...)



In this discussion someone offers:

“I remember in one interview I was asked to write a function that returns true iff x is a power of two, so I wrote return 1 == __builtin_popcount(x). They liked that.”

I’m no longer a programmer, but I wondered why it wasn’t “return (__builtin_popcount(x) == 1)” - just out of interest.


If you accidentally write "return x = 1" when x is a variable, you always return true. If you return "1 = x", you cause a syntax error. So some people have gotten into the habit of writing constants on the left, even if the return value of __builtin_popcount is not assignable.


Thanks for explaining - that makes sense.


> I’m no longer a programmer, but I wondered why it wasn’t “return (__builtin_popcount(x) == 1)” - just out of interest.

== is a Boolean comparison operator, so we return true or false depending on how the expression evaluates. __builtin_popcount(x) returns the number of set '1' bits in the binary int x. Since a power of 2 in binary is always a single 1 followed by 0's, this expression checks whether the number of 1's in the binary representation of x is equal to 1, returning true if this is the case. Otherwise, if there is more than one '1', x is not a power of 2, and it returns false.

Example: 15 in binary is 01111. There are 4 '1's, so the 1 == 4 comparison returns false. Increment to 16, or 10000, and there is exactly 1 '1', and so the comparison 1 == 1 returns true.

Hope this helped. :)


It is appalling that, after every other general-computing architecture in common use either started out with a popcount instruction, or had one added later at substantial expense, RISC-V came out without one.

It still doesn't have one. The proposed B ("bitmanip") extension has it (along with a raft of trivial variations: count leading zeroes, count trailing ones, yada yada) but that is not ratified and not implemented in any chip I know of. Since B is a huge extension, we can expect it will be routinely omitted even after it's ratified, and compilers will need special prodding to produce any such instructions.

It should have been in the base instruction set. We probably can blame its lack on the academic origins of the design. CS professors probably think of it as a thing not needed to implement Lisp, therefore not worth class time.

(Some people say, "Oh, but you can trap and emulate it", which adds insult to injury. Trapping and emulating eliminates all the value the instruction offers.)


Indeed, the lack of CLZ is also pretty horrible, making a lot of (de)compression and signal processing code needlessly slower.

Bitmanip includes some other nice stuff that other CPUs lack, but I'd happily give it up to be able to count on popcount and clz being there.

I'm doubtful about your academic-origins speculation. Even MMIX has SADD. It may be more the case that CLZs and popcounts are relatively rare, but in essentially every case where they're used, they're in some performance-critical inner loop. They're not even necessarily obvious in source code, because modern compilers are smart enough to detect obvious constructions and substitute the instruction.

The world of ARM is full of important extensions that are optional, but the world gets by. Hopefully at some point someone will standardize a RISC-V edition that turns a number of optional things mandatory, and after it becomes popular enough, that's what people will target.


The habit of assessing the importance of operations according only to how frequently they appear in static object files, or even how frequently they are executed, yields a badly distorted picture.

An instruction executed relatively rarely in the course of running a program may achieve critical importance by reducing the latency of the most important result, or by each replacing what would otherwise be dozens of other instructions. Among candidates for such a distinction, popcount takes honors second only to multiplication.

Count-leading-zeroes and other variations rely on the same circuitry and can be emulated by preceding popcount with one or two conventional ALU operations, so they are conveniences; popcount is the fully general, indispensable primitive.
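
Count-trailing-zeros, for instance, really is two ALU operations in front of popcount (count-leading-zeros needs a few more to smear the top bit down first). A sketch:

    #include <stdint.h>

    // ctz via popcount: x & -x isolates the lowest set bit; subtracting 1
    // turns it into a mask of the zeros below it. ctz32(0) == 32.
    int ctz32(uint32_t x) {
        return __builtin_popcount((x & -x) - 1);
    }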


> It should have been in the base instruction set.

No, it should not. Popcount is not necessary for all microcontrollers, and it does not fully share its circuitry with the other instructions of the base ISA.

Popcount must be in an extension (like bitmanip) and it will be ratified and integrated into chips soon. The 1.0 version of the B extension is currently under review and is much simpler than the previous versions.


This is a great piece of writing. I wish more blogs tied together so many technical perspectives like this. Bravo auteur!


(I just want to add that this is the best thread on HN i've read in a while. Y'all bringing a little nerdy tear to my eye. <3 )


My first thought was "How else do you quickly count pieces on a bitboard?". Chess programming definitely caused me to never second-guess the usefulness of `popcount`.


Surely the Cray BMM (bit matrix multiplication) instructions have a better claim to that nickname.


Here's a dumb question. If someone asked me to do it I'd probably write code like:

while(x != 0) { c += x&1; x >>= 1; }

Is this something that should be added to LLVM?

Edit: flip the order


Popcount is easily recognized by llvm (and it’s actually mentioned in the article...)

In the case of the code you’ve posted, you’re shifting out the LSB before you check the bit, so it’s not quite right, but (in general) popcount is recognized and used when possible.


Yep my bad! I think flipping the order should work still though.

The two links in the article:

https://lemire.me/blog/2016/05/23/the-surprising-cleverness-...

And the LLVM source indicate to me that it only picks up on the x&(x-1) pattern, which would miss the popcount optimization on code like mine.


Flipping the order works, except if the LSB on x is set.

https://godbolt.org/z/qdWhxMPsf

Note the run output under clang.

edit:

> And the LLVM source indicate to me it only picks up on x&(x-1) pattern, which would miss the popcount optimization on code like mine.

Thanks for teaching me something this morning. That's annoying.

I think the portable solution is std::popcount in C++ (or equivalent in Rust).


While it seems to be true that gcc and clang don't recognize this pattern even when implemented correctly, your program becomes an infinite loop if the highest bit is set (negative), because the variable ('i' in the example below) will never become 0.

Example with int8_t:

  int8_t i = -127; // 0b10000001
  i >>= 1; // 0b11000000
  i >>= 1; // 0b11100000
  i >>= 1; // 0b11110000
  i >>= 1; // 0b11111000
  i >>= 1; // 0b11111100
  i >>= 1; // 0b11111110
  i >>= 1; // 0b11111111
  i >>= 1; // 0b11111111 ad infinitum
One needs to be careful when using >> (shift right) with signed integers.

So your program is not equivalent to popcount.


> or equivalent in Rust

https://doc.rust-lang.org/std/?search=count_ones

Internally Rust actually just staples LLVM's implementation into your code via an intrinsic, but if that were ever to change, the standard library's count_ones() methods will do whatever replaces it, so you should use those.


I came across this long ago, but it shows some very nice ways to fiddle bits, including a few different ways to do popcount, which would be handy on systems that do not have one.

https://graphics.stanford.edu/~seander/bithacks.html


Both clang and gcc have __builtin_popcount variants.


But both will issue actual popcount instructions only if they have been assured the program will be run on a machine that implements the instruction, which is not the default on, in particular, amd64/x86_64.


A bit off-topic, but I want to know: is binary code (01 etc.) still used today in programming/coding? And for what applications?


Maybe not for general-purpose computing. I've used it for on-the-fly code generation (hacking display rotation into the Windows 3x BitBlt engine) and programming special-purpose media accelerators. In both cases you end up creating a bunch of convenience #defines or macros that generate the bits, which immediately takes you back into tiny language territory rather than pure machine code. The relative ease of creating new programmable hardware in FPGAs is another place this might occur.


This seems like a strange question. All computing today uses binary code. We often write things in other bases for convenience, but at the hardware level, it is nothing but ones and zeroes.

But maybe you are asking about uses of independent bits. A maximally efficient representation for a set with a fixed universe of members uses each bit position in a large-enough integer type to represent presence or absence of a possible member. Then, AND and OR operations correspond to set intersection and union operations. C++ provides std::bitset for this use. Useful such sets include days in the week or month, and letters in the alphabet.
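
For instance (a sketch; the names are made up):

    // One bit per day; AND/OR act as set intersection/union.
    enum {
        MON = 1 << 0, TUE = 1 << 1, WED = 1 << 2, THU = 1 << 3,
        FRI = 1 << 4, SAT = 1 << 5, SUN = 1 << 6
    };

    unsigned work = MON | TUE | WED | THU | FRI;
    unsigned gym  = MON | WED | FRI | SAT;
    unsigned both = work & gym;                // MON | WED | FRI
    int      days = __builtin_popcount(both);  // 3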

The article mentions chess, where you might have a 64-bit word to represent the positions of (say) all the pawns on the board. Fairly simple bitwise operations identify all the positions those pawns threaten.
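
For example, with the common a1 = bit 0 layout, the set of squares white pawns attack is two shifts and some edge masking (a sketch):

    #include <stdint.h>

    // Squares attacked by white pawns (a1 = bit 0, h8 = bit 63).
    uint64_t white_pawn_attacks(uint64_t pawns) {
        const uint64_t not_a = 0xfefefefefefefefeULL;  // exclude file a
        const uint64_t not_h = 0x7f7f7f7f7f7f7f7fULL;  // exclude file h
        return ((pawns & not_a) << 7)    // capture up and to the left
             | ((pawns & not_h) << 9);   // capture up and to the right
    }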

Modern symmetric cryptographic primitives often use principally bitwise operations, including shifts.


It is also incredibly useful for doing string scanning - look at strlen/strchr in various libc implementations.


Nitpick: it's the related "count trailing zeros" operation that is useful (in combination with movemask) there, not popcount itself.
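
A sketch of that pattern with SSE2 (assuming p is 16-byte aligned, so the wide loads cannot fault past the terminator):

    #include <emmintrin.h>   // SSE2
    #include <stddef.h>

    // Compare 16 bytes at a time against zero, movemask the result,
    // and count trailing zeros to locate the first NUL byte.
    size_t strlen16(const char *p) {
        const char *s = p;
        for (;;) {
            __m128i v = _mm_load_si128((const __m128i *)s);
            int     m = _mm_movemask_epi8(
                            _mm_cmpeq_epi8(v, _mm_setzero_si128()));
            if (m) return (size_t)(s - p) + (size_t)__builtin_ctz(m);
            s += 16;
        }
    }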


[flagged]


[flagged]


Simply because I found what I quoted from the article to be a weird comment. What I mean is that there probably isn't very much overlap in the two-circle Venn diagram of 1) people who already are low-level enough to care about native CPU instructions and 2) those who can't think of good uses for a population count.

I wasn't trying to be a jerk to the author. I just found it strange and would be happy to find out that I'm an idiot and forgot about ____ programmers who fall into that category. My comment was inviting such comments. (After all, if the author meant what they wrote there, apparently they fall into that camp...)


[flagged]


But taken in the context of the whole article, it adds nothing of any value. The author literally spends the rest of the post describing how it is useful. Saying 'doesn't seem useful, does it?' at the beginning is a rhetorical device. The author is assuming that the reader probably doesn't have experience using bit manipulations for complex problems.

I am so tired of HN pedantry.


[flagged]


Put a little more bluntly than I would have, but there's nothing wrong there. Rarely does a submission go by where the first comments aren't people racing in to argue with the author.

There's an adversarial air that exists around dissecting, criticizing, nit-picking etc ideas presented, both here and in the tech world at large. As if one's most valuable contribution to a conversation is assuming the role of smug contrarian.

It's frankly tiring and obnoxious. I used to think non-geeks were just bad at communicating with us; that the fault was on their side somehow. But now I think we're just dicks.


That’s not the way ‘literally’ should be used.



