Hacker News new | past | comments | ask | show | jobs | submit login

Apologies if this is somewhat off-topic for the thread, but I suspect this will be a fun puzzle for fans of low-level optimization. The theme is "optimized fizzbuzz".

The classic fizzbuzz will use %3 and %5 operations to test divisibility. As we know from the same source as OP, these are horrifically slow. In addition, the usual approach to fizzbuzz has an annoying duplication, either of the strings or of the predicates.

So, the challenge is, write an optimized fizzbuzz with the following properties: the state for the divisibility testing is a function with a period of 15, which can be calculated in 2 C operations. There are 3 tests for printing, each of the form 'if (...) printf("...");' where each if test is one C operation.

Good luck and have fun!




One common trick is to handle the %3 and %5 with down-counters that you reset to 3 or 5 when they reach zero. Or better, unroll the loop by 3 so you only need one down-counter.

(Also, no, `%5` isn't "horrifically" slow when it's a compile-time-constant modulus: you or the compiler can do it with a multiply and shift for the integer division, and then x%5 = x - (x/5 * 5). https://godbolt.org/g/3HwBrF. It still sucks though.)

I wrote an x86 assembly FizzBuzz for fun a while ago, intended to be an example of how to optimize (https://stackoverflow.com/a/37494090/224132). Some parts of it kind of suck, though. I made some parts efficient (like appending "buzz\n" to a buffer with a mov [rdx], r13 / add rdx, 5), but left in too many conditional branches and a function call instead of inlining everything when unrolling.

I'm glad so many people like my post that the OP linked. It was fun to write. :)


Indeed, llvm is smart enough not to use a DIV instruction for modulo of a constant.

Turns out you can compute x%3 == 0 with even fewer steps, it's (x*0xaaaaaaaab) < 0x55555556 (assuming wrapping unsigned 32 bit arithmetic).


Nice. It seems gcc/clang don't know that trick. They still divide and then check that x/3 * 3 == x. https://godbolt.org/g/V4kfuu. (So do ICC17 and MSVC CL19).


Well, there is this approach:

    #include <stdio.h>
    #include <stdint.h>
    
    #define FIZZ(x)         (x & 0x0924)
    #define BUZZ(x)         (x & 0x0210)
    #define FIZZBUZZ(x)     (x & 0x4000)
    #define NONE(x)         (x & 0x34cb)
    
    void fizzbuzz(size_t n_max)
    {
        uint32_t x = 1;
        for (size_t i = 1; i <= n_max; i++)
        {
            if (FIZZ(x))
              printf("fizz\n");
            if (BUZZ(x))
              printf("buzz\n");
            if (FIZZBUZZ(x))
              printf("fizzbuzz\n");
            if (NONE(x))
              printf("%d\n", i);
            x <<= 1;
            x |= (x >> 15);
        }
    }
    
    int main(int argc, char **argv)
    {
        fizzbuzz(100);
    }
It's less costly than integer division, and I'm not sure it's what you had in mind. The output here still duplicates the strings.


Tada! That's very close to what I had in mind. The only difference: you can fizz on 4924, buzz on 4210, none on 34cb, then always \n. Also, you have 3 ops to calculate the rotation, and that can be replaced by (0x8001 * x) >> 1 (for uint16_t x), but with tweaking of the masks because it's rotating right.

Congrats!


Good nerdsnipe.

I decided to reject your conditions and replace them with my own, for no apparent reason. Mine has some of the redundancy you wish to eliminate, but the loop body is completely branchless, aside from the call to printf, of course, which we're apparently ignoring.

    #include <stdio.h>
    
    #define NUM "\x06" "%zu\n\0"
    #define FIZZ "\x07" "Fizz\n\0"
    #define BUZZ "\x07" "Buzz\n\0"
    
    const char *x =
        NUM
        NUM
        FIZZ
        NUM
        BUZZ
        FIZZ
        NUM
        NUM
        FIZZ
        BUZZ
        NUM
        FIZZ
        NUM
        NUM
        "\xa6" "FizzBuzz\n";
    
    void fizzbuzz(size_t upTo) {
        for(size_t i = 1; i <= upTo; i++) {
            printf(x + 1, i);
            x += *x;
        }
    }
    
    int main(int argc, char **argv) {
        fizzbuzz(100);
    }


This is almost exactly the same as dmitryg's answer, using a state-machine the same way. It would compile to nearly the same code, but with an extra instruction to sign-extend the first char into a register before adding.

You did remove a level of indirection for the format-strings, though. You could have done that with

    struct state {
        int next;
        char fmt[6];
    };
Anyway, this has probably a 5 cycle loop-carried dependency chain on Skylake, from x += *x; compiling into a 4-cycle latency movsx rax, byte [rdi], then a 1-cycle add rdi, rax. (Or whatever registers the compiler picks).

If you'd stored pointers, you could have got it down to 4 cycles on Skylake for the load-use latency of a simple addressing mode ([reg + disp] where displacement is 0..2047).


Good candidate for most evil use of the assumption that char is signed. Incidentally, that's not going to be true on arm[0].

[0]: http://blog.cdleary.com/2012/11/arm-chars-are-unsigned-by-de...


Haha, whoops! At least it's easy to fix.


Let's optimize the if statements away, too (no extra memory accesses or branches, either).

  #include <stdio.h>
  #include <stdint.h>
  
  #define F(n) (1 << (2*(n)-2))
  #define B(n) (1 << (2*(n)-1))
  
  int main() {
    uint32_t m = F(3) | F(6) | F(9) | F(12) | F(15) | B(5) | B(10) | B(15);
    int i;
    static char t[4][16]= { "%d\n", "Fizz\n", "Buzz\n", "FizzBuzz\n" };
    for (i = 1; i <= 100; i++) {
      printf(t[m&3], i);
      m = (m >> 2) | (m << 28);
    }
    return 0;
  }
Transforming that into an if-based version is left as an exercise for the reader. :)


This probably wouldn't be a very good optimization puzzle as the cost of printing would so dominate the arithmetic operations. :(


This is true, and I've thought about it a bit. If you were really optimizing, then maybe the goal would be stated as putting the completed answer into a memory buffer. But for now, shave as many nanoseconds as you can from the logic and pretend we don't have to worry about printing.


  #define FIZZ 1
  #define BUZZ 2
  static const uint8_t divs[] = 
 {FIZZ+BUZZ,0,0,FIZZ,0,BUZZ,FIZZ,0,0,FIZZ,BUZZ,0,FIZZ,0,0};
  static const char* strs[] = {
    [0] = "%u\n",
    [FIZZ] = "fizz\n",
    [BUZZ] = "buzz\n",
    [FIZZ+BUZZ] = "fizzbuzz\n"
  };

  void fizzbuzz(uint32_t upTo)
  {
    uint32_t i, state;

    for(i = state = 0; i < upTo; i++) {
      printf(strs[divs[state]], i);
      if(++state == sizeof(divs) / sizeof(*divs))
          state = 0;
  }


No divisibility tests at all

One level of arrays may be skipped at cost of extra .rodata size (store string pointers directly in divs). But in a modern cpu cost of the extra load is small


I should have stated extra constraints; you're not allowed to use either additional memory (memory accesses are slow) or an if statement to update your state (branch mispredicts are slow). Yes, these are arbitrary (and I know the total time would be dominated by the cost of printf), but the point is to shave nanoseconds. Also, you've still got the duplication of strings.


Memory accesses are not slow on modern CPUs. With speculative loads and large L1d caches these loads won't stall even a single cycle

String duplication helps speed here.

Branch misprediction will happen once every fifteen times worst case.

Plus, if shaving nanoseconds, printing is a bad idea


I see your points; maybe I need to work on the storytelling in posing the puzzle. I do stand by the principle that avoiding branch mispredicts is worthwhile if you can replace the branch with 1 or 2 logic/arith C operations. In any case, the point of the puzzle is how to get a not-quite-trivial result from extremely efficient sequence of pure logic/arith operations.


  #define FIZZ 1
  #define BUZZ 2


  struct {
    uint8_t flags;
    uint8_t next;
  } static const nfo[] = {
    {FIZZ+BUZZ,1},
    {0,2},
    {0,3},
    {FIZZ,4},
    {0,5},
    {BUZZ,6},
    {FIZZ,7},
    {0,8},
    {0,9},
    {FIZZ,10},
    {BUZZ,11},
    {0,12},
    {FIZZ,13},
    {0,14},
    {0,0}
  } __attribute__((align(YOUR_L1D_LINE_SIZE)));
  static const char* strs[] = {
    [0] = "%u\n",
    [FIZZ] = "fizz\n",
    [BUZZ] = "buzz\n",
    [FIZZ+BUZZ] = "fizzbuzz\n"
  };

  void fizzbuzz(uint32_t upTo)
  {
    uint32_t i, state;

    for(i = state = 0; i < upTo; i++) {
      printf(strs[nfo[state].flags], i);
      state = nfo[state].next;
    }
  }


No divisibility tests at all. No branches besides loop and printf call. Space can be saved by using a bitfield, but masking it will add speed costs.

  .data: 0
  .rodata: 30 + 4 * sizeof(void*) + strings
  .text: depending on arch, but not much
Assuming call to printf has no cost and the caches are hot, a modern x86 cpu could execute one iteration of this loop in 1 cycle (issuing 2 loads, one add, one cmp, one branch)

I have no compiler and am typing this on a phone so please forgive typos, if any


It's pretty ridiculous to pretend that printf is free. The crux of the matter is to concat string constants and string-representations of integers into a buffer. "buzz\n" is only 5 bytes, so you can store it in a uint64_t.

Also, no, a typical x86 CPU would take about 4 or 5 cycles per iteration even if printf (and the cost of moving its arguments into the right register) was free.

state = nfo[state].next is a pointer-chasing loop-carried dependency chain, so you will bottleneck on L1D load-use latency. (For Skylake, 5 cycles for a complex addressing mode: http://www.7-cpu.com/cpu/Skylake.html).

If out-of-order execution could overlap many of these loops then the throughput could be close to 1 iter per clock.


@raphlinus: well, if you can get it better than one iteration per cycle, I would love to see how (truly)


I see your point regarding pure cycle counting, but as I posed the puzzle it's still open :)


Don't want to be _that_ guy but your loop starts with 0 instead of 1


Hm. I don't quite see how to do it, but I think I know one of the tricks you have in mind. If you have the book Hacker's Delight handy, open it up to 10-17, "Test for Zero Remainder after Division by a Constant". With n<=100 you can get an intermediate result with a single 32-bit multiply and mask that will allow testing divisibility by 3 and 5 each with a single operation (and a bit of union trickery to access high bits without counting as an operation). I just don't see how to make the test 'not a multiple of 3 or 5' come out of that in one C operation yet. So maybe this isn't the right technique.


and having allowed myself to read the rest of the replies .. fizz bzzzzt!


Assuming 32-bit unsigned integers (but similarly for other integer sizes), multiplying by 0x11111111 and shifting right 28 bits would give a function of period 15, at least for inputs between 1 and 0x1000000e inclusive. Faster than dividing by 15.


You're absolutely on the right track with a (wrapping) multiply and a shift right, but the constants aren't quite right.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: