Hacker News new | past | comments | ask | show | jobs | submit login

I did the same thing!

But they didn't like it. So I wrote the naivest possible popcount and returned 1 == naive_bitcount(n).

They didn't like that. I explained to them that the compiler will optimize or the loop so this will be fast and that they should check it on godbolt. (did I mention that I was coding on the white board?) There's a loop in the source code, but there isn't a loop in the machine code, and it's like three instructions.

They didn't like that either, so I worked out the n & (n-1) method from first principals. I didn't do it optimally, there was some redundancy in my code, which the interviewer smugly explained to me. I can't remember if I told him "that's nice, but it's still an extra instruction compared to the trivial, obvious, easy to maintain method I showed you ten minutes ago." I'm pretty sure I didn't, too afraid to make a bad impression.

I didn't get the job. In hindsight, I'm glad I didn't. Those guys are dicks.




> I explained to them that the compiler will optimize or the loop so this will be fast and that they should check it on godbolt. (did I mention that I was coding on the white board?) There's a loop in the source code, but there isn't a loop in the machine code, and it's like three instructions.

Did you check this yourself? These sorts of "grand" pattern-matches tend to be low-productivity optimizations for compilers (the speedups you get are not worth the extra compile-time). LLVM tends to have a higher tolerance for this than GCC. But even looking at LLVM's source code, the "naivest possible popcount" [1] is not going to pass the loop idiom recognizer for a popcount.

> I can't remember if I told him "that's nice, but it's still an extra instruction compared to the trivial, obvious, easy to maintain method I showed you ten minutes ago."

n & (n - 1) should come out to 2 cycles of latency. 1 == popcount(n) would come out to 4 cycles of latency (3 latency for the POPCNT, 1 for the comparison at the end).

[1] I'm assuming by this that you mean iterating over all bits and individually checking each one for set/unset. Both gcc (since version 9) and LLVM handle the case where you iterate through set bits in a for (; x; x = x & (x - 1)) loop.


I would not be surprised if the compiler detected the second version as well and replaced it with a popcnt. In any case, it seems like you ended up with the stereotypical “bad interviewer” so I doubt it would have mattered to point this out.


That was you interviewing them. They failed.


The interviewer felt smug, so their mission was accomplished.


If that was their mission (rather than to hire someone), they succeeded in their personal mission, at the expense of the company's intended mission for them.


popcnt is generally going to take multiple cycles though, I'd expect the n&(n-1) method to be more performant. It's also vectorizeable and more portable.


Intel AVX-512 has a vector popcount, VPOPCNTDQ.


builtin popcount when compiled for an architecture with hardware popcount should be 4 bytes per 1 cycle.

The n&(n-1) method to test for a power of 2 cannot be faster than that.


POPCNT has a latency of 3 cycles on most x86 hardware.


You are right. I was thinking of the number of operations.

Is the popcnt test slower than the n&(n-1) test? (Edit: Ooops! I see you already addressed that at https://news.ycombinator.com/item?id=20918136 .)


A popcnt/test/jmp cycle should have roughly 4 cycles of latency: a test+jmp can be fused into one uop, and popcnt would be 3 cycles of latency. n&(n-1) == 0 would compile down into a DEC, TEST, JMP, which is 2 cycles of latency (again, test+jmp are fused into one uop), as dec is just 1 cycle of latency.

So n&(n-1) is faster for checking if it's a power of two.


In the following I confirm that n&(n-1) is faster than popcount (compiled with "cc -march=haswell check_pow2.c -O3"), at 699 vs. 1016 microseconds, respectively:

  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/time.h>
  
  const int N = 2*1000*1000;
  
  int count_popcnt(unsigned int *data) {
    int sum = 0;
    int i;
    for (i=0; i<N; i+=8) {
      sum += (
              (__builtin_popcount((data+i)[0]) == 1) +
              (__builtin_popcount((data+i)[1]) == 1) +
              (__builtin_popcount((data+i)[2]) == 1) +
              (__builtin_popcount((data+i)[3]) == 1) +
              (__builtin_popcount((data+i)[4]) == 1) +
              (__builtin_popcount((data+i)[5]) == 1) +
              (__builtin_popcount((data+i)[6]) == 1) +
              (__builtin_popcount((data+i)[7]) == 1)
              );
    }
    return sum;
  }
  
  int count_wegner(unsigned int *data) {
    int sum = 0;
    int i;
    for (i=0; i<N; i+=8) {
      sum += (
              ((((data+i)[0]) & ((data+i)[0]-1)) == 0) +
              ((((data+i)[1]) & ((data+i)[1]-1)) == 0) +
              ((((data+i)[2]) & ((data+i)[2]-1)) == 0) +
              ((((data+i)[3]) & ((data+i)[3]-1)) == 0) +
              ((((data+i)[4]) & ((data+i)[4]-1)) == 0) +
              ((((data+i)[5]) & ((data+i)[5]-1)) == 0) +
              ((((data+i)[6]) & ((data+i)[6]-1)) == 0) +
              ((((data+i)[7]) & ((data+i)[7]-1)) == 0)
              );
    }
    return sum;
  }
  
  
  int main(void) {
    unsigned int data[N];
    FILE *f = fopen("/dev/urandom", "rb");
    if (!f) {
      perror("Cannot open /dev/urandom");
      exit(1);
    }
    if (fread(data, sizeof(unsigned int), N, f) != N) {
      perror("Cannot read enough items");
      exit(1);
    }
    /* Use only a few bits */
    for (int i=0; i<N; i++) {
      data[i] = data[i] & 31;
      /* Don't worry about handling 0 */
      if (data[i] == 0) {
        data[i] = i + 1;
      }
    }
  
    struct timeval time1, time2;
    int n;
    
    gettimeofday(&time1, NULL);
    n = count_popcnt(data);
    gettimeofday(&time2, NULL);
    
    printf(" popcnt: %d in %8ld us\n",
           n,
           (time2.tv_sec-time1.tv_sec) * 1000*1000 +
           (time2.tv_usec-time1.tv_usec));
  
    gettimeofday(&time1, NULL);
    n = count_wegner(data);
    gettimeofday(&time2, NULL);
  
    printf("n&(n-1): %d in %8ld us\n",
           n,
           (time2.tv_sec-time1.tv_sec) * 1000*1000 +
           (time2.tv_usec-time1.tv_usec));
           
    return 0;
  }
Thank you for teaching me something new!


Besides just timing it (excellent!) did you take a look at what it compiles to?

https://godbolt.org/z/C0ZeE5

It's faster because both GCC and Clang now optimize loops with n&(n-1) to use AVX2 SIMD! I haven't looked closely to confirm, but I think they may in fact even do Harley-Seal for "naive popcount" loops.

C is no longer portable assembly. If you want to test whether a particular algorithm is faster than another, you probably need to write assembly --- or at least confirm that the compiler did what you thought it did.


I think I'll stick with paying people like you to deal with these sorts of issues in my code. ;)


Maybe they wanted something that worked with non integer types.


Both methods only work on integers.




Consider applying for YC's W25 batch! Applications are open till Nov 12.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: