Apologies if this is somewhat off-topic for the thread, but I suspect this will ...

pcordes · on Aug 22, 2017

One common trick is to handle the %3 and %5 with down-counters that you reset to 3 or 5 when they reach zero. Or better, unroll the loop by 3 so you only need one down-counter.

(Also, no, `%5` isn't "horrifically" slow when it's a compile-time-constant modulus: you or the compiler can do it with a multiply and shift for the integer division, and then x%5 = x - (x/5 * 5). https://godbolt.org/g/3HwBrF. It still sucks though.)

I wrote an x86 assembly FizzBuzz for fun a while ago, intended to be an example of how to optimize (https://stackoverflow.com/a/37494090/224132). Some parts of it kind of suck, though. I made some parts efficient (like appending "buzz\n" to a buffer with a mov [rdx], r13 / add rdx, 5), but left in too many conditional branches and a function call instead of inlining everything when unrolling.

I'm glad so many people like my post that the OP linked. It was fun to write. :)

raphlinus · on Aug 22, 2017

Indeed, llvm is smart enough not to use a DIV instruction for modulo of a constant.

Turns out you can compute x%3 == 0 with even fewer steps, it's (x*0xaaaaaaaab) < 0x55555556 (assuming wrapping unsigned 32 bit arithmetic).

pcordes · on Aug 22, 2017

Nice. It seems gcc/clang don't know that trick. They still divide and then check that x/3 * 3 == x. https://godbolt.org/g/V4kfuu. (So do ICC17 and MSVC CL19).

showdead · on Aug 22, 2017

Well, there is this approach:

    #include <stdio.h>
    #include <stdint.h>
    
    #define FIZZ(x)         (x & 0x0924)
    #define BUZZ(x)         (x & 0x0210)
    #define FIZZBUZZ(x)     (x & 0x4000)
    #define NONE(x)         (x & 0x34cb)
    
    void fizzbuzz(size_t n_max)
    {
        uint32_t x = 1;
        for (size_t i = 1; i <= n_max; i++)
        {
            if (FIZZ(x))
              printf("fizz\n");
            if (BUZZ(x))
              printf("buzz\n");
            if (FIZZBUZZ(x))
              printf("fizzbuzz\n");
            if (NONE(x))
              printf("%d\n", i);
            x <<= 1;
            x |= (x >> 15);
        }
    }
    
    int main(int argc, char **argv)
    {
        fizzbuzz(100);
    }

It's less costly than integer division, and I'm not sure it's what you had in mind. The output here still duplicates the strings.

raphlinus · on Aug 22, 2017

Tada! That's very close to what I had in mind. The only difference: you can fizz on 4924, buzz on 4210, none on 34cb, then always \n. Also, you have 3 ops to calculate the rotation, and that can be replaced by (0x8001 * x) >> 1 (for uint16_t x), but with tweaking of the masks because it's rotating right.

Congrats!

mikeash · on Aug 22, 2017

Good nerdsnipe.

I decided to reject your conditions and replace them with my own, for no apparent reason. Mine has some of the redundancy you wish to eliminate, but the loop body is completely branchless, aside from the call to printf, of course, which we're apparently ignoring.

    #include <stdio.h>
    
    #define NUM "\x06" "%zu\n\0"
    #define FIZZ "\x07" "Fizz\n\0"
    #define BUZZ "\x07" "Buzz\n\0"
    
    const char *x =
        NUM
        NUM
        FIZZ
        NUM
        BUZZ
        FIZZ
        NUM
        NUM
        FIZZ
        BUZZ
        NUM
        FIZZ
        NUM
        NUM
        "\xa6" "FizzBuzz\n";
    
    void fizzbuzz(size_t upTo) {
        for(size_t i = 1; i <= upTo; i++) {
            printf(x + 1, i);
            x += *x;
        }
    }
    
    int main(int argc, char **argv) {
        fizzbuzz(100);
    }

pcordes · on Aug 22, 2017

This is almost exactly the same as dmitryg's answer, using a state-machine the same way. It would compile to nearly the same code, but with an extra instruction to sign-extend the first char into a register before adding.

You did remove a level of indirection for the format-strings, though. You could have done that with

    struct state {
        int next;
        char fmt[6];
    };

Anyway, this has probably a 5 cycle loop-carried dependency chain on Skylake, from x += *x; compiling into a 4-cycle latency movsx rax, byte [rdi], then a 1-cycle add rdi, rax. (Or whatever registers the compiler picks).

If you'd stored pointers, you could have got it down to 4 cycles on Skylake for the load-use latency of a simple addressing mode ([reg + disp] where displacement is 0..2047).

raphlinus · on Aug 22, 2017

Good candidate for most evil use of the assumption that char is signed. Incidentally, that's not going to be true on arm[0].

[0]: http://blog.cdleary.com/2012/11/arm-chars-are-unsigned-by-de...

mikeash · on Aug 22, 2017

Haha, whoops! At least it's easy to fix.

rbehrends · on Aug 22, 2017

Let's optimize the if statements away, too (no extra memory accesses or branches, either).

  #include <stdio.h>
  #include <stdint.h>
  
  #define F(n) (1 << (2*(n)-2))
  #define B(n) (1 << (2*(n)-1))
  
  int main() {
    uint32_t m = F(3) | F(6) | F(9) | F(12) | F(15) | B(5) | B(10) | B(15);
    int i;
    static char t[4][16]= { "%d\n", "Fizz\n", "Buzz\n", "FizzBuzz\n" };
    for (i = 1; i <= 100; i++) {
      printf(t[m&3], i);
      m = (m >> 2) | (m << 28);
    }
    return 0;
  }

Transforming that into an if-based version is left as an exercise for the reader. :)

ghettoimp · on Aug 22, 2017

This probably wouldn't be a very good optimization puzzle as the cost of printing would so dominate the arithmetic operations. :(

raphlinus · on Aug 22, 2017

This is true, and I've thought about it a bit. If you were really optimizing, then maybe the goal would be stated as putting the completed answer into a memory buffer. But for now, shave as many nanoseconds as you can from the logic and pretend we don't have to worry about printing.

dmitrygr · on Aug 22, 2017

  #define FIZZ 1
  #define BUZZ 2
  static const uint8_t divs[] = 
 {FIZZ+BUZZ,0,0,FIZZ,0,BUZZ,FIZZ,0,0,FIZZ,BUZZ,0,FIZZ,0,0};
  static const char* strs[] = {
    [0] = "%u\n",
    [FIZZ] = "fizz\n",
    [BUZZ] = "buzz\n",
    [FIZZ+BUZZ] = "fizzbuzz\n"
  };

  void fizzbuzz(uint32_t upTo)
  {
    uint32_t i, state;

    for(i = state = 0; i < upTo; i++) {
      printf(strs[divs[state]], i);
      if(++state == sizeof(divs) / sizeof(*divs))
          state = 0;
  }

No divisibility tests at all

One level of arrays may be skipped at cost of extra .rodata size (store string pointers directly in divs). But in a modern cpu cost of the extra load is small

raphlinus · on Aug 22, 2017

I should have stated extra constraints; you're not allowed to use either additional memory (memory accesses are slow) or an if statement to update your state (branch mispredicts are slow). Yes, these are arbitrary (and I know the total time would be dominated by the cost of printf), but the point is to shave nanoseconds. Also, you've still got the duplication of strings.

dmitrygr · on Aug 22, 2017

Memory accesses are not slow on modern CPUs. With speculative loads and large L1d caches these loads won't stall even a single cycle

String duplication helps speed here.

Branch misprediction will happen once every fifteen times worst case.

Plus, if shaving nanoseconds, printing is a bad idea

raphlinus · on Aug 22, 2017

I see your points; maybe I need to work on the storytelling in posing the puzzle. I do stand by the principle that avoiding branch mispredicts is worthwhile if you can replace the branch with 1 or 2 logic/arith C operations. In any case, the point of the puzzle is how to get a not-quite-trivial result from extremely efficient sequence of pure logic/arith operations.

dmitrygr · on Aug 22, 2017

  #define FIZZ 1
  #define BUZZ 2


  struct {
    uint8_t flags;
    uint8_t next;
  } static const nfo[] = {
    {FIZZ+BUZZ,1},
    {0,2},
    {0,3},
    {FIZZ,4},
    {0,5},
    {BUZZ,6},
    {FIZZ,7},
    {0,8},
    {0,9},
    {FIZZ,10},
    {BUZZ,11},
    {0,12},
    {FIZZ,13},
    {0,14},
    {0,0}
  } __attribute__((align(YOUR_L1D_LINE_SIZE)));
  static const char* strs[] = {
    [0] = "%u\n",
    [FIZZ] = "fizz\n",
    [BUZZ] = "buzz\n",
    [FIZZ+BUZZ] = "fizzbuzz\n"
  };

  void fizzbuzz(uint32_t upTo)
  {
    uint32_t i, state;

    for(i = state = 0; i < upTo; i++) {
      printf(strs[nfo[state].flags], i);
      state = nfo[state].next;
    }
  }

No divisibility tests at all. No branches besides loop and printf call. Space can be saved by using a bitfield, but masking it will add speed costs.

  .data: 0
  .rodata: 30 + 4 * sizeof(void*) + strings
  .text: depending on arch, but not much

Assuming call to printf has no cost and the caches are hot, a modern x86 cpu could execute one iteration of this loop in 1 cycle (issuing 2 loads, one add, one cmp, one branch)

I have no compiler and am typing this on a phone so please forgive typos, if any

pcordes · on Aug 22, 2017

It's pretty ridiculous to pretend that printf is free. The crux of the matter is to concat string constants and string-representations of integers into a buffer. "buzz\n" is only 5 bytes, so you can store it in a uint64_t.

Also, no, a typical x86 CPU would take about 4 or 5 cycles per iteration even if printf (and the cost of moving its arguments into the right register) was free.

state = nfo[state].next is a pointer-chasing loop-carried dependency chain, so you will bottleneck on L1D load-use latency. (For Skylake, 5 cycles for a complex addressing mode: http://www.7-cpu.com/cpu/Skylake.html).

If out-of-order execution could overlap many of these loops then the throughput could be close to 1 iter per clock.

dmitrygr · on Aug 22, 2017

@raphlinus: well, if you can get it better than one iteration per cycle, I would love to see how (truly)

raphlinus · on Aug 22, 2017

I see your point regarding pure cycle counting, but as I posed the puzzle it's still open :)

beeswax · on Aug 22, 2017

Don't want to be _that_ guy but your loop starts with 0 instead of 1

jepler · on Aug 22, 2017

Hm. I don't quite see how to do it, but I think I know one of the tricks you have in mind. If you have the book Hacker's Delight handy, open it up to 10-17, "Test for Zero Remainder after Division by a Constant". With n<=100 you can get an intermediate result with a single 32-bit multiply and mask that will allow testing divisibility by 3 and 5 each with a single operation (and a bit of union trickery to access high bits without counting as an operation). I just don't see how to make the test 'not a multiple of 3 or 5' come out of that in one C operation yet. So maybe this isn't the right technique.

jepler · on Aug 22, 2017

and having allowed myself to read the rest of the replies .. fizz bzzzzt!

besselheim · on Aug 22, 2017

Assuming 32-bit unsigned integers (but similarly for other integer sizes), multiplying by 0x11111111 and shifting right 28 bits would give a function of period 15, at least for inputs between 1 and 0x1000000e inclusive. Faster than dividing by 15.

raphlinus · on Aug 22, 2017

You're absolutely on the right track with a (wrapping) multiply and a shift right, but the constants aren't quite right.