It's pretty ridiculous to pretend that printf is free. The crux of the matter is concatenating string constants and string representations of integers into a buffer. "buzz\n" is only 5 bytes, so it fits in a uint64_t.
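A rough sketch of the uint64_t idea, in case it helps. The helper names `pack8` and `append_packed` are made up for illustration, and the sketch assumes the output buffer always has at least 8 bytes of slack past the write position, so a full 8-byte store is safe even though only 5 bytes are kept.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: pack up to 8 bytes of a string into a uint64_t.
 * memcpy keeps the bytes in string order in memory on any endianness. */
static uint64_t pack8(const char *s, size_t len) {
    uint64_t v = 0;
    assert(len <= sizeof v);
    memcpy(&v, s, len);
    return v;
}

/* Hypothetical helper: store all 8 bytes of `word` at dst, then advance
 * by only `len` bytes. The next append overwrites the padding.
 * Assumes dst has at least 8 writable bytes. */
static char *append_packed(char *dst, uint64_t word, size_t len) {
    memcpy(dst, &word, sizeof word);
    return dst + len;
}
```

Usage would be something like `p = append_packed(p, pack8("buzz\n", 5), 5);` in the loop, with one write() or fwrite() of the whole buffer at the end. Compilers typically turn a fixed-size memcpy like this into a single 8-byte store, but that is worth checking in the generated asm.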
Also, no, a typical x86 CPU would take about 4 or 5 cycles per iteration even if printf (and the cost of moving its arguments into the right registers) were free.
state = nfo[state].next is a pointer-chasing loop-carried dependency chain, so you will bottleneck on L1D load-use latency. (For Skylake, 5 cycles for a complex addressing mode: http://www.7-cpu.com/cpu/Skylake.html).
If out-of-order execution could overlap many independent instances of this loop, then throughput could be close to 1 iteration per clock.
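To make the dependency chain concrete, here is a minimal sketch. Only `nfo`, `state` and `.next` come from the code under discussion; the `struct entry` layout and the function names are assumptions for illustration. In the first function every load address depends on the previous load's result, so the loads cannot overlap. The second runs four independent chains in one loop, which is the kind of overlap out-of-order execution would need to find.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry; the real one presumably has more fields. */
struct entry {
    uint8_t next;
};

/* One chain: iteration i+1 cannot start its load until iteration i's
 * load has delivered `state`, so this runs at about one iteration per
 * L1D load-use latency. */
static unsigned run_one_chain(const struct entry *nfo, unsigned state,
                              size_t iters) {
    for (size_t i = 0; i < iters; i++)
        state = nfo[state].next;
    return state;
}

/* Four independent chains: a, b, c and d never depend on each other,
 * so their load latencies can overlap. */
static void run_four_chains(const struct entry *nfo, unsigned state[4],
                            size_t iters) {
    unsigned a = state[0], b = state[1], c = state[2], d = state[3];
    for (size_t i = 0; i < iters; i++) {
        a = nfo[a].next;
        b = nfo[b].next;
        c = nfo[c].next;
        d = nfo[d].next;
    }
    state[0] = a; state[1] = b; state[2] = c; state[3] = d;
}
```

Timing both versions on the same table should show whether latency or throughput is the limit on a given machine. The actual numbers depend on the microarchitecture and on what the compiler does with the loop.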