But they didn't like it. So I wrote the naivest possible popcount and returned 1 == naive_bitcount(n).
They didn't like that. I explained to them that the compiler will optimize out the loop, so this will be fast, and that they should check it on godbolt. (Did I mention that I was coding on a whiteboard?) There's a loop in the source code, but there isn't a loop in the machine code, and it's like three instructions.
They didn't like that either, so I worked out the n & (n-1) method from first principles. I didn't do it optimally; there was some redundancy in my code, which the interviewer smugly explained to me. I can't remember if I told him "that's nice, but it's still an extra instruction compared to the trivial, obvious, easy to maintain method I showed you ten minutes ago." I'm pretty sure I didn't; I was too afraid of making a bad impression.
I didn't get the job. In hindsight, I'm glad I didn't. Those guys are dicks.
> I explained to them that the compiler will optimize out the loop, so this will be fast, and that they should check it on godbolt. (Did I mention that I was coding on a whiteboard?) There's a loop in the source code, but there isn't a loop in the machine code, and it's like three instructions.
Did you check this yourself? These sorts of "grand" pattern-matches tend to be low-productivity optimizations for compilers (the speedups you get are not worth the extra compile-time). LLVM tends to have a higher tolerance for this than GCC. But even looking at LLVM's source code, the "naivest possible popcount" [1] is not going to pass the loop idiom recognizer for a popcount.
> I can't remember if I told him "that's nice, but it's still an extra instruction compared to the trivial, obvious, easy to maintain method I showed you ten minutes ago."
n & (n - 1) should come out to 2 cycles of latency. 1 == popcount(n) would come out to 4 cycles of latency (3 cycles for the POPCNT, 1 for the comparison at the end).
[1] I'm assuming by this that you mean iterating over all bits and individually checking each one for set/unset. Both gcc (since version 9) and LLVM handle the case where you iterate through set bits in a for (; x; x = x & (x - 1)) loop.
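To make the distinction concrete, here is a sketch of the two loop shapes (my own illustration, not code from either compiler's test suite):

/* Bit-by-bit popcount: checks every bit position. As far as I can tell,
   this shape does NOT pass the loop idiom recognizer. */
int popcount_all_bits(unsigned x) {
    int count = 0;
    for (unsigned i = 0; i < 32; i++)   /* assuming 32-bit unsigned */
        count += (x >> i) & 1;
    return count;
}

/* Set-bits-only loop: clears the lowest set bit each iteration. Per the
   footnote above, GCC (since 9) and LLVM recognize this idiom and can
   turn it into a single popcnt. */
int popcount_set_bits(unsigned x) {
    int count = 0;
    for (; x; x = x & (x - 1))
        count++;
    return count;
}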
I would not be surprised if the compiler detected the second version as well and replaced it with a popcnt. In any case, it seems like you ended up with the stereotypical “bad interviewer” so I doubt it would have mattered to point this out.
If that was their mission (rather than to hire someone), they succeeded in their personal mission, at the expense of the company's intended mission for them.
popcnt is generally going to take multiple cycles, though; I'd expect the n&(n-1) method to be faster. It's also vectorizable and more portable.
A popcnt/test/jcc sequence should have roughly 4 cycles of latency: test+jcc can be fused into one uop, and popcnt has 3 cycles of latency. n&(n-1) == 0 compiles down to something like a DEC (or LEA), TEST, and Jcc, which is 2 cycles of latency (again, test+jcc fuse into one uop), since dec is just 1 cycle of latency.
So n&(n-1) is faster for checking if it's a power of two.
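Spelling out the two checks being compared (the instruction sequences in the comments are what the latency estimates above assume; actual codegen depends on the compiler and target):

/* popcount version: expected to compile to roughly
       popcnt eax, edi ; cmp eax, 1 ; je ...
   i.e. about 4 cycles of latency on the critical path. */
int is_pow2_popcnt(unsigned n) {
    return __builtin_popcount(n) == 1;
}

/* n & (n-1) version: expected to compile to roughly
       lea eax, [rdi-1] ; test eax, edi ; je ...
   i.e. about 2 cycles of latency. Note this variant also accepts n == 0. */
int is_pow2_wegner(unsigned n) {
    return (n & (n - 1)) == 0;
}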
In the following I confirm that n&(n-1) is faster than popcount (compiled with "cc -march=haswell check_pow2.c -O3"), at 699 vs. 1016 microseconds, respectively:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

const int N = 2*1000*1000;

/* Count how many entries are powers of two, using popcount(n) == 1. */
int count_popcnt(const unsigned int *data) {
    int sum = 0;
    for (int i = 0; i < N; i += 8) {
        sum += (
            (__builtin_popcount(data[i+0]) == 1) +
            (__builtin_popcount(data[i+1]) == 1) +
            (__builtin_popcount(data[i+2]) == 1) +
            (__builtin_popcount(data[i+3]) == 1) +
            (__builtin_popcount(data[i+4]) == 1) +
            (__builtin_popcount(data[i+5]) == 1) +
            (__builtin_popcount(data[i+6]) == 1) +
            (__builtin_popcount(data[i+7]) == 1)
        );
    }
    return sum;
}

/* Count how many entries are powers of two, using (n & (n-1)) == 0. */
int count_wegner(const unsigned int *data) {
    int sum = 0;
    for (int i = 0; i < N; i += 8) {
        sum += (
            ((data[i+0] & (data[i+0] - 1)) == 0) +
            ((data[i+1] & (data[i+1] - 1)) == 0) +
            ((data[i+2] & (data[i+2] - 1)) == 0) +
            ((data[i+3] & (data[i+3] - 1)) == 0) +
            ((data[i+4] & (data[i+4] - 1)) == 0) +
            ((data[i+5] & (data[i+5] - 1)) == 0) +
            ((data[i+6] & (data[i+6] - 1)) == 0) +
            ((data[i+7] & (data[i+7] - 1)) == 0)
        );
    }
    return sum;
}

int main(void) {
    /* ~8 MB of input: allocate on the heap rather than risking a stack overflow. */
    unsigned int *data = malloc(N * sizeof *data);
    if (!data) {
        perror("Cannot allocate data");
        exit(1);
    }
    FILE *f = fopen("/dev/urandom", "rb");
    if (!f) {
        perror("Cannot open /dev/urandom");
        exit(1);
    }
    if (fread(data, sizeof(unsigned int), N, f) != (size_t)N) {
        perror("Cannot read enough items");
        exit(1);
    }
    fclose(f);

    /* Use only a few bits */
    for (int i = 0; i < N; i++) {
        data[i] = data[i] & 31;
        /* Don't worry about handling 0 */
        if (data[i] == 0) {
            data[i] = i + 1;
        }
    }

    struct timeval time1, time2;
    int n;

    gettimeofday(&time1, NULL);
    n = count_popcnt(data);
    gettimeofday(&time2, NULL);
    printf(" popcnt: %d in %8ld us\n",
           n,
           (time2.tv_sec - time1.tv_sec) * 1000*1000 +
           (time2.tv_usec - time1.tv_usec));

    gettimeofday(&time1, NULL);
    n = count_wegner(data);
    gettimeofday(&time2, NULL);
    printf("n&(n-1): %d in %8ld us\n",
           n,
           (time2.tv_sec - time1.tv_sec) * 1000*1000 +
           (time2.tv_usec - time1.tv_usec));

    free(data);
    return 0;
}
It's faster because both GCC and Clang now optimize loops with n&(n-1) to use AVX2 SIMD! I haven't looked closely to confirm, but I think they may in fact even do Harley-Seal for "naive popcount" loops.
C is no longer portable assembly. If you want to test whether a particular algorithm is faster than another, you probably need to write assembly, or at least confirm that the compiler did what you thought it did.
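For example, one way to do that check (assuming gcc or clang on x86-64) is to compile a minimal scalar version of the loop and read the assembly:

/* check_vec.c -- minimal loop for inspecting codegen. Compile with e.g.
       cc -O3 -march=haswell -S check_vec.c
   and look in check_vec.s for packed AVX2 instructions such as vpand and
   vpcmpeqd rather than a scalar loop. */
int count_pow2(const unsigned *data, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += (data[i] & (data[i] - 1)) == 0;
    return sum;
}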