Apologies if this is somewhat off-topic for the thread, but I suspect this will be a fun puzzle for fans of low-level optimization. The theme is "optimized fizzbuzz".
The classic fizzbuzz will use %3 and %5 operations to test divisibility. As we know from the same source as OP, these are horrifically slow. In addition, the usual approach to fizzbuzz has an annoying duplication, either of the strings or of the predicates.
So, the challenge is, write an optimized fizzbuzz with the following properties: the state for the divisibility testing is a function with a period of 15, which can be calculated in 2 C operations. There are 3 tests for printing, each of the form 'if (...) printf("...");' where each if test is one C operation.
One common trick is to handle the %3 and %5 with down-counters that you reset to 3 or 5 when they reach zero. Or better, unroll the loop by 3 so you only need one down-counter.
(Also, no, `%5` isn't "horrifically" slow when it's a compile-time-constant modulus: you or the compiler can do it with a multiply and shift for the integer division, and then x%5 = x - (x/5 * 5). https://godbolt.org/g/3HwBrF. It still sucks though.)
I wrote an x86 assembly FizzBuzz for fun a while ago, intended to be an example of how to optimize (https://stackoverflow.com/a/37494090/224132). Some parts of it kind of suck, though. I made some parts efficient (like appending "buzz\n" to a buffer with a mov [rdx], r13 / add rdx, 5), but left in too many conditional branches and a function call instead of inlining everything when unrolling.
I'm glad so many people like my post that the OP linked. It was fun to write. :)
Nice. It seems gcc/clang don't know that trick. They still divide and then check that x/3 * 3 == x. https://godbolt.org/g/V4kfuu. (So do ICC17 and MSVC CL19).
Tada! That's very close to what I had in mind. The only difference: you can fizz on 4924, buzz on 4210, none on 34cb, then always \n. Also, you have 3 ops to calculate the rotation, and that can be replaced by (0x8001 * x) >> 1 (for uint16_t x), but with tweaking of the masks because it's rotating right.
I decided to reject your conditions and replace them with my own, for no apparent reason. Mine has some of the redundancy you wish to eliminate, but the loop body is completely branchless, aside from the call to printf, of course, which we're apparently ignoring.
#include <stdio.h>
#define NUM "\x06" "%zu\n\0"
#define FIZZ "\x07" "Fizz\n\0"
#define BUZZ "\x07" "Buzz\n\0"
const char *x =
NUM
NUM
FIZZ
NUM
BUZZ
FIZZ
NUM
NUM
FIZZ
BUZZ
NUM
FIZZ
NUM
NUM
"\xa6" "FizzBuzz\n";
void fizzbuzz(size_t upTo) {
for(size_t i = 1; i <= upTo; i++) {
printf(x + 1, i);
x += *x;
}
}
int main(int argc, char **argv) {
fizzbuzz(100);
}
This is almost exactly the same as dmitryg's answer, using a state-machine the same way. It would compile to nearly the same code, but with an extra instruction to sign-extend the first char into a register before adding.
You did remove a level of indirection for the format-strings, though. You could have done that with
struct state {
int next;
char fmt[6];
};
Anyway, this has probably a 5 cycle loop-carried dependency chain on Skylake, from x += *x; compiling into a 4-cycle latency movsx rax, byte [rdi], then a 1-cycle add rdi, rax. (Or whatever registers the compiler picks).
If you'd stored pointers, you could have got it down to 4 cycles on Skylake for the load-use latency of a simple addressing mode ([reg + disp] where displacement is 0..2047).
This is true, and I've thought about it a bit. If you were really optimizing, then maybe the goal would be stated as putting the completed answer into a memory buffer. But for now, shave as many nanoseconds as you can from the logic and pretend we don't have to worry about printing.
One level of arrays may be skipped at cost of extra .rodata size (store string pointers directly in divs). But in a modern cpu cost of the extra load is small
I should have stated extra constraints; you're not allowed to use either additional memory (memory accesses are slow) or an if statement to update your state (branch mispredicts are slow). Yes, these are arbitrary (and I know the total time would be dominated by the cost of printf), but the point is to shave nanoseconds. Also, you've still got the duplication of strings.
I see your points; maybe I need to work on the storytelling in posing the puzzle. I do stand by the principle that avoiding branch mispredicts is worthwhile if you can replace the branch with 1 or 2 logic/arith C operations. In any case, the point of the puzzle is how to get a not-quite-trivial result from extremely efficient sequence of pure logic/arith operations.
No divisibility tests at all. No branches besides loop and printf call. Space can be saved by using a bitfield, but masking it will add speed costs.
.data: 0
.rodata: 30 + 4 * sizeof(void*) + strings
.text: depending on arch, but not much
Assuming call to printf has no cost and the caches are hot, a modern x86 cpu could execute one iteration of this loop in 1 cycle (issuing 2 loads, one add, one cmp, one branch)
I have no compiler and am typing this on a phone so please forgive typos, if any
It's pretty ridiculous to pretend that printf is free. The crux of the matter is to concat string constants and string-representations of integers into a buffer. "buzz\n" is only 5 bytes, so you can store it in a uint64_t.
Also, no, a typical x86 CPU would take about 4 or 5 cycles per iteration even if printf (and the cost of moving its arguments into the right register) was free.
state = nfo[state].next is a pointer-chasing loop-carried dependency chain, so you will bottleneck on L1D load-use latency. (For Skylake, 5 cycles for a complex addressing mode: http://www.7-cpu.com/cpu/Skylake.html).
If out-of-order execution could overlap many of these loops then the throughput could be close to 1 iter per clock.
Hm. I don't quite see how to do it, but I think I know one of the tricks you have in mind. If you have the book Hacker's Delight handy, open it up to 10-17, "Test for Zero Remainder after Division by a Constant". With n<=100 you can get an intermediate result with a single 32-bit multiply and mask that will allow testing divisibility by 3 and 5 each with a single operation (and a bit of union trickery to access high bits without counting as an operation). I just don't see how to make the test 'not a multiple of 3 or 5' come out of that in one C operation yet. So maybe this isn't the right technique.
Assuming 32-bit unsigned integers (but similarly for other integer sizes), multiplying by 0x11111111 and shifting right 28 bits would give a function of period 15, at least for inputs between 1 and 0x1000000e inclusive. Faster than dividing by 15.
The classic fizzbuzz will use %3 and %5 operations to test divisibility. As we know from the same source as OP, these are horrifically slow. In addition, the usual approach to fizzbuzz has an annoying duplication, either of the strings or of the predicates.
So, the challenge is, write an optimized fizzbuzz with the following properties: the state for the divisibility testing is a function with a period of 15, which can be calculated in 2 C operations. There are 3 tests for printing, each of the form 'if (...) printf("...");' where each if test is one C operation.
Good luck and have fun!