Yes, there is a lot of optimization specific to languages, OSes, CPUs... but the takeaway for me isn't that you have to go to that extent to build an order-of-magnitude improvement. You can still achieve many orders of magnitude of improvement just by understanding what is happening when you use something like the modulo operation, and recognizing that if your requirements let you avoid it, that's the win.
For FizzBuzz you are told it "iterates from 1 to <some number>"... and modulo would be perfect for "take a single number and produce Fizz, Buzz or FizzBuzz if it's divisible by 3, 5, or both 3 and 5"... but as we're iterating, counters are better suited to the task.
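A minimal sketch of the counter approach in C (my own illustration, assuming the plain 1-to-100 version of the task):

    #include <stdio.h>

    int main(void) {
        int fizz = 0, buzz = 0;  /* counters replace i % 3 and i % 5 */
        for (int i = 1; i <= 100; i++) {
            if (++fizz == 3) { printf("Fizz"); fizz = 0; }
            if (++buzz == 5) { printf("Buzz"); buzz = 0; }
            if (fizz && buzz) printf("%d", i);  /* neither counter wrapped */
            printf("\n");
        }
        return 0;
    }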
I love seeing epic code golf (and the solution outlined in the article is it), but the takeaway for the majority of engineers is that significant improvements can be made with a readable and maintainable solution just by understanding the requirements and comp sci fundamentals.
When I saw this a few weeks ago it set off a lot of thinking about compilers and ML.
Even the unoptimized C version is 10x slower than the one that is painstakingly optimized.
Some of this performance gain will certainly be achievable by ML/AI-assisted compilers and translators. In that case, even a 2x speedup would be game changing. Is anyone using ML in compilers yet? The part where you verify the code still does what you wanted it to do seems trickiest.
It seems there's a huge gap between the machine code compilers can generate and hand-written, optimized assembler. There's a lot of room left for compilers to improve.
Back when I was employed to work on a compiler, our approach to such things was to implement our run-time libraries in C and fix the compiler so it was able to produce the desired output.
In this case, though, the biggest differences aren't coming from C vs assembler but from walking away from platform conventions -- ABI, memory layout, CPU compatibility. It might be easier to write assembly than to teach the compiler how to generate code that does this, but that's largely because you don't need these optimisations in most cases. If the methods used provided a general speedup, it would be worth implementing them in compilers.
Most programs aren't FizzBuzz. This is basically hand-optimizing a microbenchmark. While I'm sure compilers can always be better, I'm not sure this particular example generalizes.
But hand optimizing a few microbenchmarks like this tells us how far we might be able to improve compilers.
And from the looks of it, at least 100x is possible. It's a shame therefore that even a 1% performance increase in a compiler release is considered pretty noteworthy.
This is a very specific problem that lends itself to some fantastically CPU-friendly optimizations. Doing a single hash-map lookup here on the hot path would kill performance, and let's not even get started on any serious I/O (disk or network), as that would be multiple orders of magnitude slower. Not many problems are just "cat something extremely predictable into /dev/null".
And so what? That's what this code does, and so a compiler should be able to generate the optimal assembly. Why would a compiler need to consider something that the code in question is not doing?
How would the compiler know that? This code for example is only optimal if you are piping it somewhere instead of displaying on the terminal. There is no way, even in principle, for the compiler to know that this program is always run with stdout connected to a pipe.
Moreover, if you mean optimal in the absolute sense, pretty sure that is equivalent to solving the halting problem.
If what you're asking is "why can't computers take all the surrounding fuzzy context into account and do precisely the right thing?", the answer is: because we haven't invented strong AI yet.
Ah, sorry, I misunderstood you. I thought you were placing the emphasis on what the code was doing, rather than where it was being piped. But why do you think that most of the possible optimisations would only be relevant if the output were being piped to somewhere other than a terminal?
Also, why would making it absolutely optimal be "equivalent to solving the halting problem"? (Though either way, no, I'm not talking about producing 100.0000% optimal code - clearly there's plenty of daylight between the current degree of optimality and that one.)
> Ah, sorry, I misunderstood you. I thought you were placing the emphasis on what the code was doing, rather than where it was being piped. But why do you think that most of the possible optimisations would only be relevant if the output were being piped to somewhere other than a terminal?
One of the primary optimizations is using vmsplice to avoid memory copies. vmsplice only works if the file descriptor is a pipe.
> Also, why would making it absolutely optimal be "equivalent to solving the halting problem"?
Or more specifically: if a program always halts (without outputting anything), then the best optimization would be to have it halt immediately; but to know that, you have to have solved the halting problem.
You're right, of course; absolutes are a bit of a distraction, and we just need something good for 99% of programs. However, that's still really hard, and an optimization applied in the wrong context will slow a program down. There's still lots of room for compilers to grow, but the types of optimizations used in this example are probably at best NP-complete to determine whether they apply (I don't have any evidence for that, though).
Ah, thank you for a really interesting reply. I appreciate it.
> One of the primary optimizations is using vmsplice to avoid memory copies. vmsplice only works if the file descriptor is a pipe.
Thanks - I hadn't actually read the code in question. I now realise I should have. That's a totally fair point.
That being said, in the general case, there are clearly far more optimisations to be made. Do you think GCC-from-C is really theoretically limited to producing assembly code which is two thirds as performant as handwritten assembly?
Ahhh, I getchu. Yes, that technically means that a compiler can't produce 100% optimal code in 100% of cases. However, it doesn't mean it can't produce 100% optimal code in 99.99999% of cases, or 99.99999% optimal code in 100% of cases, as you say yourself.
This is more of an academic issue about formal properties of a compiler, rather than a practical issue with the optimisations it could make in the overwhelming majority of real-world cases.
That being said, I'm very sympathetic to your point. The (mostly toy) compilers I've written have been extremely weak in the optimisation department. I have a lot of respect for people who are able to do that, and a lot of respect for the challenge it poses.
(I do hope that the growth of big data, and developers' increasing indifference to privacy, might have the salutary effect that compiler authors gain access to more information about the programs their compilers are compiling, out in the real world ... which, at least going by my own experience, would be invaluable in making decisions about particular optimisations. Sadly 99% of optimisations are not peephole optimisations.)
Yeah, you could optimize GCC to work really well for those problems; there's a great ROI for the "write something trivial really quickly into /dev/null" class of problems.
Let us know how this goes if you embark on this adventure.
If you genuinely understood me as saying that, then I should clarify, to be absolutely foolproof:
That is not what I'm saying. I'm talking about having a compiler which could produce more optimal code in any given case, or most given cases. I obviously am not suggesting that compilers optimise this and only this case.
> I'm talking about having a compiler which could produce more optimal code in any given case, or most given cases
Funny coincidence, that's exactly what the vast majority of compiler code is for, with tens (or hundreds) of thousands of man-years of work spent on that.
My point, again, is about having a compiler that can produce more optimal code than existing compilers (and whether there's a theoretical upper bound preventing them from more closely approaching the handwritten assembly in this case).
I'm not quite sure what else you could have thought the word 'more' was referring to?
But nearly all problems boil down to nearly zero actual work...
Take booting up a computer, for example. All that actually needs to be done is to display the login screen, and we know from 60 fps video playback that the hardware is capable of displaying a frame inside 16 milliseconds.
You can create an operating system that just displays a static image within milliseconds yourself (it shouldn't take you more than a few weeks), and it would work even when compiled at -O0.
The most interesting thing is that it took one day to write the C version and several months for the assembly version, because the C version uses a simpler algorithm. Talk about diminishing returns. :)
Asm 56 GB/s (you are reading Sanskrit)
C 41 GB/s (you are reading US Tax Code)
Rust 30 GB/s (you are reading English)
Python (why bother) (you are speed reading)
So many tools, so many languages, life is great.
I don't understand the FizzBuzz thingy. I mean, isn't it just about who has the fastest print method? Correct me please, but this problem has period 15, so if you just write:
    for (i = 1; i < n; i += 15) {
        print(i);
        print(i + 1);
        print("fizz");
        print(i + 3);
        ...
    }
You have the fastest, don't you? (If the ending n is important, one can just write 15 such loops/endgames depending on the modulus.) Since there is no 'if', the branch predictor should be on your side.
The % 15 is because they're using 'elif' or 'else if'; they only process each entry once and they exit the function when a successful comparison is made.
It's true that you could do this with something like:
    def fizzbuzz(x):
        result = ''
        if x % 3 == 0: result += 'fizz'
        if x % 5 == 0: result += 'buzz'
        if result == '': result += str(x)
        return result
to append fizz for 15 and also continue processing and append buzz, but this requires an extra comparison to check that you did neither of those before printing the number.
But note that if you do 3 comparisons, or 2 comparisons and an "extra" comparison, you've done the same number of comparisons; it's not really "extra".
Then go ahead and keep adding factors and see which approach results in more comparisons, as the genuinely extra comparisons rack up across various combinations of factors.
btw and ftw, your modulo-and-string-test example, more correct than the usual implementation, is among the most efficient.
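For instance, here's a hypothetical table-driven C variant (my illustration, not anyone's submission) that makes it easy to add factors and see how the comparison counts change:

    #include <stdio.h>

    struct rule { int divisor; const char *word; };

    int main(void) {
        /* add more (divisor, word) pairs here to experiment */
        struct rule rules[] = { {3, "Fizz"}, {5, "Buzz"} };
        int nrules = sizeof rules / sizeof rules[0];
        for (int i = 1; i <= 100; i++) {
            int matched = 0;
            for (int r = 0; r < nrules; r++) {
                if (i % rules[r].divisor == 0) {
                    printf("%s", rules[r].word);
                    matched = 1;
                }
            }
            if (!matched) printf("%d", i);
            printf("\n");
        }
        return 0;
    }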
The Firestorm cores are capable of issuing two 128-bit stores per cycle [1], giving a bandwidth of 102.4 GB/s at the 3.2 GHz peak clock (2 stores × 16 B × 3.2 GHz). This matches the experiments by Andrei with mixed reads/writes [2], and I have personally verified that it is attainable for pure write workloads to L1 as well as L2.
This does not mean 102.4 GB/s is achievable on a single core, but if L2 bandwidth is the limiting factor, it very well could be.
Before running the code, make sure your CPU does support AVX2. Most 64-bit Intel and AMD CPUs should. ARM CPUs, like those found in newer Apple computers, Smartphones or Raspberry Pis, won't support AVX2.
That sentence seems silly. ARM assembly and x86-64 assembly are two different beasts. Running the x86-64 code using Rosetta 2 emulation seems counter-intuitive as well if you want to get the fastest possible result on that machine.
ARM CPUs have NEON. I can't say whether NEON has equivalents for all the instructions; however, the thread starter is talking about a complete rewrite, not just running it as is.
> As of this writing, the 3rd fastest submission is written in Rust and produces output at a rate of 3 GB/s, the 2nd is written in C and produces output at a rate of 41 GB/s, and the fastest is written in Assembler and produces output at a rate of 56 GB/s.
This makes me happy. The order is exactly as it should be, much to the disappointment of the Rustaceans.
It's a nice reminder that for the ultimate performance, there is no substitute for knowing exactly what the assembler is doing.
It's also a nice reminder that most of us will never want to do that.
EDIT: Whoa. It's late, and I thought Rust was at 30 GB/s instead of 3, i.e. roughly neck and neck with C. I meant to make a good-natured joke. There must be something wrong with the Rust implementation to be 10x slower; I wonder what it is.
As I mentioned further down in the comments, I'll donate $500 to Watsi if someone manages to displace the C implementation in Rust or any other language besides C/C++/ASM. Be sure to DM me on Twitter (https://twitter.com/theshawwn) if you do, both so I'll see it and so that I can congratulate you. I'm sure it's possible, but it's sort of interesting that the activation energy is so high.
You can basically take any LLVM IR that Clang would produce and produce Rust that compiles down to it (modulo irreducible control flow, I think, but the C++ doesn't have that). Because of that, C++ vs. Rust language (as opposed to library) performance comparisons are uninteresting.
> It's a nice reminder that for the ultimate performance, there is no substitute for knowing exactly what the assembler is doing.
C and C++ do not let you know exactly what the assembler is doing. I work on compilers for a living and despite that--perhaps because of that--I have no confidence that I can predict what instructions the compiler will generate ahead of time. If I need to know the actual instructions, I run my code through Godbolt and check. C, C++, and Rust are all millions of lines of LLVM optimization passes away from the metal.
Based on some other comments, I suspect that all the time is being spent in locking stdout, which you can amortize by using Stdout::lock [1]. As is so often the case when microbenchmarks are discussed online, this has nothing to do with the language.
Also, it shouldn't need to be said, but FizzBuzz is so totally unrepresentative of real workloads that you shouldn't draw any conclusions about language performance from it.
It is very interesting if you have properly verified invariants and you can guarantee (with manual checking) that your code is correct. It's an order of magnitude harder than just letting the borrow checker do its job, but it's entirely possible (just like in any other language).
The second one (written by me) is actually JIT-compiled assembly. So little of the runtime is spent in the C code that -O0 vs -O3 makes no difference at all; and even if you wrote the glue code in Python it would not be much slower than the C code.
Also the third entry is not the one in Rust. It's just that the Rust entry came later and thus was not included in the plot. The third entry is https://codegolf.stackexchange.com/a/215236/56694; it is also C.
> displace the C implementation in Rust or any other language besides C/C++/ASM
This is not a language contest, it's an algorithmic contest: that is, it's the best algorithm that wins, not the best compiler. The top entry, converted into C or Rust by mostly menial work, would probably be only a few % slower and would displace the C implementation.
Now I'm kind of hoping that someone will rewrite the glue in Python so that Python is the second fastest implementation of them all, just for the laughs.
From the article, "The application produced 64 bytes of FizzBuzz for every 4 CPU clock cycles. The author states the ultimate bottleneck of performance is based on the throughput of the CPU's L2 cache."
For comparison, that time is on par with a single integer multiplication, which typically has a latency of 3-4 cycles.
It's actually really nice the way the author commented it. It's a cool insight into their mental process. I've never coded in Assembly and I can't normally follow it. But reading these comments I can understand how the author is surgically managing registers and operating off a complete mental map of what's there, the same way a coder would with, say, a very large database they have a complete mental model of.
All I mean is that I appreciate the author making their thought process accessible. It certainly looks like virtuosity, but I'm not competent enough to judge.
// The bytecode generation loop is unrolled to depth 30, allowing the
// units digits to be hardcoded. The tens digit is stored in %al, and
// incremented every ten lines of output. The parity of the hundreds
// digit is stored in %ymm2: one half predicts the hundreds digit to
// be even, the other to be odd, and the halves are swapped every time
// the tens digit carries (ensuring the predictions are correct).
These aren't CPU predictions: it means the code computes in parallel one result that is correct and one that is wrong, and picks the right one by swapping the two halves every now and then.
I mean, this is importantly true as a generalisation, but it's worth knowing that modern compilers know about idioms.
If you show a modern compiler the typical idiomatic blob of code for counting the set bits in a word in a language like C, the compiler goes "Oh, popcount I know this" and when targeting modern CPUs it emits a single instruction, the popcount instruction for that CPU.
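For example, here is the classic Kernighan bit-counting loop in C; recent GCC and Clang can recognize this pattern and emit a single popcnt instruction when the target CPU has one (whether a particular compiler version actually does is easy to check on Godbolt):

    /* Clears the lowest set bit each iteration; optimizing
       compilers can match this idiom and emit hardware popcount. */
    unsigned popcount(unsigned x) {
        unsigned n = 0;
        while (x) {
            x &= x - 1;
            n++;
        }
        return n;
    }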
In APL idiom recognition was essential to making even a halfway decent compiler. It's easy, and idiomatic, to write APL which would be tremendously wasteful if implemented naively, while recognising common idioms could remove lots of that waste without needing the level of sophistication modern compilers have during optimisation.
Because the compiler knows about idioms you should not try to optimise your programs by writing non-idiomatic code until you've determined you have a performance problem and the non-idiomatic code is identified as a way to fix it. Otherwise it may be that the compiler improves and now your code is not only harder to maintain it also has worse performance than the idiomatic solution.
Not yet. But I have no doubt in my mind that once we have smarter-than-human AI, there will be compilers that absolutely can turn bubble sort into quicksort.
Seems fairly unlikely. The best sorting algorithm is a function of the data, which the compiler can't know. There are pathological data sets where quicksort is O(n^2). For small n, quicksort is also a bad choice.
The programmer can understand these domain characteristics but the compiler certainly can't. If code isn't running qsort(), this could be a mistake, but may also be an optimization.
You almost always know something about the data that a sorting function will operate on. Unless of course you are implementing a generic sorting algorithm. In practice you could train your compiler on real world data to allow it to optimize better for your environment. This is already happening with profile guided optimizations.
I think it's unfair that people downvoted this to gray. This person didn't claim that this was about to happen, and it seems likely to me that when/if AI surpasses human intelligence that something like this would be possible. It might require some additional way to describe expectations about the input data, and it might be many years away, but I would be disappointed if compilers couldn't do this in the far distant future. Look at the progress we've made in the last seventy years and try to extrapolate that out a thousand years forward. I obviously can't guarantee what will happen, but is it really that unreasonable to suggest progress like this as a strong possibility?
I would love to see a Python implementation that approaches the C code. It would be educational on so many levels. Would anyone be interested in trying?
Also, why is the ASM version so much faster if the glue code doesn't make much difference? I suppose it matters at the margins...
> why is the ASM version so much faster if the glue code doesn't make much difference
The remark on the glue code only applies to my entry (the second). The assembly entry uses a completely different (and much more complicated) algorithm than the C entry.
If you read the description you'll see that it's optimizing around certain syscalls like memcpy (which is otherwise limiting and reduces speed by 5x). I think it's less language choice and more low-level optimization that is making the difference here.
(To check, I wrote a Go program that does nothing but fmt.Print($largebody) in a for loop, and it tops out at around 10 GB/s.)
Note that memcpy is not a syscall. Common syscalls do copies. The fast implementation uses the vmsplice syscall as a clever way to pump a pipe with data, because it doesn't copy.
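A minimal sketch of the idea (mine, not the actual submission): instead of write(), hand the kernel a reference to a user-space buffer with vmsplice. This only works when stdout is a pipe (e.g. ./a.out | cat), and the buffer must not be reused until the pipe has drained it:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <string.h>
    #include <sys/uio.h>

    static char buf[1 << 16];

    int main(void) {
        memset(buf, 'x', sizeof buf);
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof buf };
        /* gifts page references to the pipe instead of copying the data */
        for (int i = 0; i < 1000; i++)
            vmsplice(1, &iov, 1, 0);
        return 0;
    }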
Indeed. Today for most systems memcpy is in fact a compiler intrinsic. C's standard library offers a memcpy but the reality is that your C standard library was likely supplied by your C compiler vendor, and when you ask for memcmp() in C you get their preferred intrinsic inlined into your program unless you've been very firm in declining that optimisation.
In Rust this memcpy intrinsic is required (for the core language, not just the standard library) because Rust clearly has arbitrarily large data structures you can literally copy, so, somebody has to take responsibility for doing that work, Rust says it's the compiler back-end's job. On a toy CPU the memcpy might really just be a trivial read->register->write loop because that's adequate.
Oops, I notice hours too late to edit it, that this says memcmp() in one place where it mostly says memcpy(). In fact your C compiler likely has intrinsics for both, but I did mean memcpy() everywhere in the comment.
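To make the intrinsic point concrete, a small sketch: with optimizations on, compilers typically lower a fixed-size memcpy like this to a couple of register moves rather than a call into libc (easy to confirm on Godbolt):

    #include <string.h>

    typedef struct { char bytes[16]; } Block;

    /* With -O2, the 16-byte copy is usually emitted inline as one
       or two vector/register moves; no memcpy call appears. */
    void copy_block(Block *dst, const Block *src) {
        memcpy(dst, src, sizeof *dst);
    }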
It's not just knowing exactly what the assembler is doing, but also what the kernel is doing (but maybe you also meant this). "This implementation only supports Linux", and then continues about the use of a bunch of system calls:
* "asks the Kernel to defragment the physical memory it's using via the madvise system call"
* "use 2 MB of RAM each even though neither need 2 MB. This keeps the page table orderly"
* "Most competing FizzBuzz implementations call the write system call which can lead to data needing to be copied between user and Kernel space. So vmsplice is used instead."
And then the most fascinating bit: "The CPU's L2 cache is split into two halves, one for calculating the output of FizzBuzz, the other storing the result of the last calculation batch. This means there is no need to use any slower L3 memory operations."
"Slower L3 memory operations", I've not read that sentence before :D
It is really fascinating, but I disagree a bit with your pick :)
To me, the most fascinating part is the use of a bytecode interpreter. It makes perfect sense; it's basically optimizing the code size so that it fits in the cache. It's also a bit like defining a whole new instruction set, dedicated to and optimized for FizzBuzz!
Most of the state of the Fizzbuzz program is encoded in this bytecode.
> "Slower L3 memory operations", I've not read that sentence before
You hear it all the time when trying to optimize a hot loop for speed. Keep your data structures in a shape that fits in the cache. Align your data with cache-line boundaries. Code needs to be small too, hence the bytecode approach; this is also the reason why you can't unroll loops too much.
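For example (a hypothetical C11 snippet), giving each hot counter its own 64-byte cache line, so two cores hammering on adjacent counters never share a line:

    #include <stdalign.h>

    /* alignas(64) pads each counter out to a full cache line,
       avoiding false sharing between concurrent writers. */
    struct padded_counter {
        alignas(64) unsigned long value;
    };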
If your implementation does not need to fetch (or write) data from (or to) higher levels of the memory hierarchy, you'll see tremendous gains. There's even this bit in the original source: "while waiting for writes to go through to the L2 cache" :)
There are a few more very interesting hacks, such as the "high-decimal" format that skips the "need to do binary/decimal conversions."
You still assume I care. As stated above, I don't. It's only in your head that every Rust developer writes Rust because she thinks it's faster than C.
I think we'll have to agree to disagree. It's illustrative that a seemingly simple C program can't be easily ported to Rust. If it could be, someone would have already done it -- it's been a few hours.
It's also illustrative to examine exactly what the hangups are. You might even get more Rust converts if you acknowledge that there are serious engineering burdens to overcome when deciding to use Rust.
For the record, it's not the anti-Rust sentiment which people find annoying, it's the treatment of programming languages as though they were football teams.
If you want to make a case against Rust, then let's get concrete: why is one Turing-complete compiled language that can make arbitrary syscalls not capable of compiling down to the same assembly as another?
That would be an interesting discussion; this is not.
You can dig into it, but the details are mostly only relevant to people designing computer languages and tooling.
Aka compilers: as long as the machine code is equivalent to the source code, their output can be wildly different. However, someone needs to actually write them, which means both linguistic differences and the sheer effort put into the compiler are significant. You shouldn't assume that every language is operating on an even playing field, because they're simply not. The obvious solution of compiling to a second language is never quite as capable as using the language directly.
> compilers: as long as the machine code is equivalent to the source code, their output can be wildly different
I don't understand what this means: the output of a compiler is the machine code from the source code, no?
Also, Rust uses LLVM as a backend, which has benefited from lots of work given its use for C, C++, ObjC, Swift, &c. Compiling the C with Clang/LLVM might give a more interesting insight into the differences between Rust and C.
> The obvious solution of compiling to a second language is never quite as capable as using the language directly.
I'm not sure what you mean by this, or rather how it's related to what we're talking about?
You get different machine code using different compilers on identical pieces of source code. In theory the results of running that source code should be equivalent, even if one is more efficient than another. If, for example, one compiler uses an ASM instruction that another never uses, then nothing the programmer does can get the second compiler to be as efficient as the first. The same is true of various optimizations, etc.; sometimes a compiler is just better.
Bring a different language into the picture and things get even more divergent.
> how it’s related
It’s specifically in reference to rust use of LLVM not being sufficient to guarantee equivalence. Many languages target the JVM, that does not make them equivalent.
> You get different machine code using different compilers on identical pieces of source code.
Ah, right - well I'm certainly not denying that. (I'm familiar with the topic, to be clear: I've written compilers. I just wasn't clear what exactly you were saying.)
> It’s specifically in reference to rust use of LLVM not being sufficient to guarantee equivalence.
Formally, no. In practice, given that fact, given the experience of the maintainers of both languages, and given the 'compilability' of both languages, it would be surprising if there were a 10x gulf between them.
Not necessarily. Language evangelists can be incredibly annoying... actually, evangelists in general can be really annoying. It is cathartic to prove annoying people wrong every once in a while.
When there's a memory-related bug (or CVE) in a program written in C or C++, there's almost always some comment about how the Rust compiler would've caught it.
On HN there is often: a comment about how Rust would catch this (sometimes sarcastic), an explanation of how the bug isn't "really" about memory safety and so Rust wouldn't catch it, followed by a comment explaining that actually Rust does catch it.
The recent Linux mutex security bug is an example. That's not a memory safety bug, so surely Rust doesn't help? But Rust's mutual-exclusion design relies on the type system to give us safer locks: in Rust, "lock the mutex" is how you get the reference to the protected object. Thus, when the faulty code locks the wrong mutex in Rust, it gets the wrong object, and it either doesn't compile (you lock variable A and then try to change B, but you don't even have B) or doesn't work (you always lock variable A, then change variable A, when your code should change B; it does not work, a trivially reproducible bug in your code).
Rust is a memory safe systems programming language. As a systems programming language, it can be used in the same areas that C can (see Redox OS). I've seen quite a few Rust users online that have claimed that, because of that, it can be just as fast as C.
But it can be just as fast as C. The C program that's so much faster here isn't "really" a C program; it's just a fancy way to shovel machine code, which you could of course also do (unsafely) in Rust if you chose to. But why would you?
However, what you were asked about is the claim that it's faster, and there micro-benchmarks like this are not revealing. At scale you're obliged to limit yourself to what you can write that actually works, and in Rust the safety lets you reach higher, which is how it achieves better performance. In fact, that's what Mozilla used Rust for: they'd failed more than once to ship a concurrent CSS speed-up, finding it too hard to get the bugs out and so returning to a single-threaded CSS implementation, but the attempt in Rust (Quantum) worked just fine.
This is the same reason C++ is faster: a C++ iterator isn't faster than the naive C for loop (in fact they emit identical code), but the iterator is easier to think about, which means you can confidently write more complicated software using the abstraction. Rust just provides the safety to go higher still.
This is really cool. It should be obvious that a few thousand hand-rolled assembly opcodes targeting a fixed architecture would win the race. If the size of the solution grew a couple of orders of magnitude larger though, I'd expect that the micro optimization challenge would overwhelm the author and compilers would be more competitive.
Is the Table of Contents of your upcoming book public?
You're totally right. I think a lot of rust developers are going to be absolutely distraught that rust isn't the fastest language for writing fizzbuzz programs.
I guess they're just going to have to settle with rust being the best language to write safety-critical high performance concurrent low-level code.
>This makes me happy. The order is exactly as it should be, much to the disappointment of the Rustaceans.
Why does it make you happy? It's a reminder of how much performance and energy we're losing when we write in safe/expressive/easy-to-use languages like Java, C#, Python, yada yada.
FYI, on my machine a naive Rust implementation (modulo plus println!) runs at the same speed as a naive Python implementation (16-17 MB/s), even compiled with --release. Naive C manages 160-170 MB/s. (And I fully admit the naive implementations are pretty much as far as I'm going with this.)
I would guess it's the println! macro eating the time.
Calling stdout.lock() by itself makes little difference for me. Using BufWriter as you have I get ~80 MiB/s (slightly ahead of PyPy).
My implementation also has/had a String allocation in the loop, kicking that out got me to 120 MiB/s. I tend to write it as though the numbers/strings could be changed by a product manager at any time.
Final edit/errata: I get up to 140 MiB/s by appending .to_string() in the integer case instead of format!(), which makes some sense.
I didn't bother saving my original C implementation, but in retrospect I think I didn't use string concatenation, as I did with other languages. Redoing it with string concatenation, C is slightly slower than Rust, at 110 MiB/s. That also makes sense, as concatenation in C is slow.
edit: weirder, naive Julia (1.7 on Linux) is terrible. 45 KB/s.
Ruby is slightly slower than Python or Rust, 11 MB/s. I was expecting Python to be slowest among the interpreted languages; apparently not. PyPy yields a speedup to 70 MB/s.
Summing up the naive implementations (for each i: start with an empty string, append fizz/buzz based on the modulus of each, then conditionally append i and print; see the C sketch after this list):
Julia 45 KB/s
Ruby 11 MB/s
CPython 15 MB/s
Rust 16 MB/s
PyPy 70 MB/s
C 160 MB/s
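For reference, here's a naive C version along the lines described, piped through pv to measure throughput (my reconstruction, not the poster's exact code):

    #include <stdio.h>
    #include <string.h>

    /* run as: ./a.out | pv > /dev/null */
    int main(void) {
        char buf[32];
        for (long i = 1; ; i++) {
            buf[0] = '\0';
            if (i % 3 == 0) strcat(buf, "fizz");
            if (i % 5 == 0) strcat(buf, "buzz");
            if (buf[0] == '\0') sprintf(buf, "%ld", i);
            puts(buf);
        }
    }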
By default, Julia uses locking for writes to guarantee thread safety when multiple writes are happening. That said, it's totally possible to go way faster using buffered IO.
Wow. Perhaps I should be embarrassed, but I mentally filled in "30 GB/s" instead of 3 GB/s. I thought Rust was mostly neck and neck with C. I see now why a lot of fellows responded in a defensive way.
What's it doing to be so much slower? That's shocking. I may not enjoy Rust, but I do respect its ability to approach C++ performance.
At the very least, the itoap crate being used there does divisions by and comparisons to powers of 10 as part of stringifying integers, which the C and ASM versions don't seem to be doing. A crate with a function for stringifying arbitrary integers cannot take advantage of the fact that it's going to be used for stringifying consecutive integers, as an inlined solution can.
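A sketch of what an inlined solution can exploit (my illustration of the general idea, not the submission's code): keep the number as ASCII digits and increment it in place, so consecutive integers never need a binary-to-decimal conversion:

    #include <stdio.h>
    #include <string.h>

    /* Increment a decimal number stored as ASCII digits, in place.
       Returns the new length (it grows when e.g. 99 -> 100). */
    static int inc_ascii(char *d, int len) {
        for (int i = len - 1; i >= 0; i--) {
            if (d[i] != '9') { d[i]++; return len; }
            d[i] = '0';  /* carry into the next digit */
        }
        memmove(d + 1, d, len);  /* all nines: prepend a 1 */
        d[0] = '1';
        return len + 1;
    }

    int main(void) {
        char num[24] = "1";
        int len = 1;
        for (int i = 0; i < 200; i++) {
            printf("%.*s\n", len, num);
            len = inc_ascii(num, len);
        }
        return 0;
    }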
Not really. This is a hyper optimised microbenchmark. The actual work being done is so trivial that any "overhead" is really significant and you can get order of magnitude speedups by doing arcane things with caches, eliminating tiny inefficiencies and so on.
Complete speculation, but I think you'd find that's true for most problems. Very little CPU time is spent on the business logic, and the rest is overhead.
I'm curious, how did Rust affect you so badly that you hate it so much? If it's just silly people suggesting it everywhere, I think you have some personal issues to resolve; language choice isn't something to get emotional about.
>It's a nice reminder that for the ultimate performance, there is no substitute for knowing exactly what the assembler is doing.
I didn't realize viewing the assembly of a compiled binary was something reserved for C. Someone should tell rust.godbolt.org that it's not supposed to exist.
I think their point is that C is sufficiently close to assembler that you know exactly what assembly any given line of C will produce. This is an old canard which is not really true (https://zenhack.net/2015/08/02/c-is-not-portable-assembly.ht...), but it's common among people who derive a sense of identity from writing C.