Yes, there is a lot of optimization specific to languages, OSes, CPUs... but the takeaway for me isn't that you have to go to that extent to build an order-of-magnitude improvement. You can still achieve many orders of magnitude of improvement just by understanding what is happening when you use something like the modulo operation, and recognizing that if your requirements let you avoid it, that's the win.
For FizzBuzz you are told it "iterates from 1 to <some number>"... and modulo would be perfect for "take a single number and produce Fizz, Buzz or FizzBuzz if it's divisible by 3, 5, or both 3 and 5"... but as we're iterating, counters are better suited to the task.
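A minimal sketch of the counter approach in C (my own illustration, assuming the plain 1-to-100 version of the task):

    #include <stdio.h>

    int main(void) {
        int fizz = 0, buzz = 0;  /* counters replace i % 3 and i % 5 */
        for (int i = 1; i <= 100; i++) {
            if (++fizz == 3) { printf("Fizz"); fizz = 0; }
            if (++buzz == 5) { printf("Buzz"); buzz = 0; }
            if (fizz && buzz) printf("%d", i);  /* neither counter wrapped */
            printf("\n");
        }
        return 0;
    }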
I love seeing epic code golf (and the solution outlined in the article is it), but the takeaway for the majority of engineers is that significant improvements can be made with a readable and maintainable solution just by understanding the requirements and comp sci fundamentals.
When I saw this a few weeks ago it set off a lot of thinking about compilers and ML.
Even the unoptimized C version is 10x slower than the one that is painstakingly optimized.
Some of this performance gain will certainly be achievable by ML/AI-assisted compilers and translators. In that case, even a 2x speedup would be game changing. Is anyone using ML in compilers yet? The part where you verify the code still does what you wanted it to do seems trickiest.
It seems there's a huge gap between the machine code compilers can generate and hand-written, optimized assembler. There's a lot of room left for compilers to improve.
Back when I was employed to work on a compiler, our approach to such things was to implement our run-time libraries in C and fix the compiler so it was able to produce the desired output.
In this case, though, the biggest differences aren't coming from C vs assembler but from walking away from platform conventions -- ABI, memory layout, CPU compatibility. It might be easier to write assembly than to teach the compiler how to generate code that does this, but that's largely because you don't need these optimisations in most cases. If the methods used provided a general speedup, it would be worth implementing them in compilers.
Most programs aren't FizzBuzz. This is basically hand-optimizing a microbenchmark. While I'm sure compilers can always be better, I'm not sure this particular example generalizes.
But hand optimizing a few microbenchmarks like this tells us how far we might be able to improve compilers.
And from the looks of it, at least 100x is possible. It's a shame therefore that even a 1% performance increase in a compiler release is considered pretty noteworthy.
This is a very specific problem that lends itself to some fantastically CPU-friendly optimizations. Doing a single hash-map lookup here on the hot path would kill performance, and let's not even get started on any serious I/O (disk or network), as that would be multiple orders of magnitude slower. Not many problems are just "cat something extremely predictable into /dev/null".
And so what? That's what this code does, and so a compiler should be able to generate the optimal assembly. Why would a compiler need to consider something that the code in question is not doing?
How would the compiler know that? This code for example is only optimal if you are piping it somewhere instead of displaying on the terminal. There is no way, even in principle, for the compiler to know that this program is always run with stdout connected to a pipe.
Moreover, if you mean optimal in the absolute sense, pretty sure that is equivalent to solving the halting problem.
If what you're asking is "why can't computers take all the surrounding fuzzy context into account and do precisely the right thing?", the answer is: because we haven't invented strong AI yet.
Ah, sorry, I misunderstood you. I thought you were placing the emphasis on what the code was doing, rather than where it was being piped. But why do you think that most of the possible optimisations would only be relevant if the output were being piped to somewhere other than a terminal?
Also, why would making it absolutely optimal be "equivalent to solving the halting problem"? (Though either way, no, I'm not talking about producing 100.0000% optimal code - clearly there's plenty of daylight between the current degree of optimality and that one.)
> Ah, sorry, I misunderstood you. I thought you were placing the emphasis on what the code was doing, rather than where it was being piped. But why do you think that most of the possible optimisations would only be relevant if the output were being piped to somewhere other than a terminal?
One of the primary optimizations is using vmsplice to avoid memory copies. vmsplice only works if the file descriptor is a pipe.
> Also, why would making it absolutely optimal be "equivalent to solving the halting problem"?
Or more specifically: if a program always halts (without outputting anything), then the best optimization would be to have it halt immediately; but to know that, you have to have solved the halting problem.
You're right, of course; absolutes are a bit of a distraction, and we just need something good for 99% of programs. However, that's still really hard, and an optimization applied in the wrong context will slow a program down. There's still lots of room for compilers to grow, but the types of optimizations used in this example are probably at best NP-complete to determine whether they apply (I don't have any evidence for that, though).
Ah, thank you for a really interesting reply. I appreciate it.
> One of the primary optimizations is using vmsplice to avoid memory copies. vmsplice only works if the file descriptor is a pipe.
Thanks - I hadn't actually read the code in question. I now realise I should have. That's a totally fair point.
That being said, in the general case, there are clearly far more optimisations to be made. Do you think GCC-from-C is really theoretically limited to producing assembly code which is two thirds as performant as handwritten assembly?
Ahhh, I getchu. Yes, that technically means that a compiler can't produce 100% optimal code in 100% of cases. However, it doesn't mean it can't produce 100% optimal code in 99.99999% of cases, or 99.99999% optimal code in 100% of cases, as you say yourself.
This is more of an academic issue about formal properties of a compiler, rather than a practical issue with the optimisations it could make in the overwhelming majority of real-world cases.
That being said, I'm very sympathetic to your point. The (mostly toy) compilers I've written have been extremely weak in the optimisation department. I have a lot of respect for people who are able to do that, and a lot of respect for the challenge it poses.
(I do hope that the growth of big data, and developers' increasing indifference to privacy, might have the salutary effect that compiler authors gain access to more information about the programs their compilers are compiling, out in the real world ... which, at least going by my own experience, would be invaluable in making decisions about particular optimisations. Sadly 99% of optimisations are not peephole optimisations.)
Yeah, you could optimize GCC to work really well for those problems; there's a great ROI for the "write something trivial really quickly into /dev/null" class of problems.
Let us know how this goes if you embark on this adventure.
If you genuinely understood me as saying that, then I should clarify, to be absolutely foolproof:
That is not what I'm saying. I'm talking about having a compiler which could produce more optimal code in any given case, or most given cases. I obviously am not suggesting that compilers optimise this and only this case.
> I'm talking about having a compiler which could produce more optimal code in any given case, or most given cases
Funny coincidence, that's exactly what the vast majority of compiler code is for, with tens (or hundreds) of thousands of man-years of work spent on that.
My point, again, is about having a compiler that can produce more optimal code than existing compilers (and whether there's a theoretical upper bound preventing them from more closely approaching the handwritten assembly in this case).
I'm not quite sure what else you could have thought the word 'more' was referring to?
But nearly all problems boil down to nearly zero actual work...
Take booting up a computer, for example. All that actually needs to be done is to display the login screen, and we know from 60 fps video playback that the hardware is capable of displaying a frame inside 16 milliseconds.
You can create an operating system that just displays a static image within milliseconds yourself (it shouldn't take you more than a few weeks), and it would work even when compiled at -O0.
The most interesting thing is that it took one day to write the C version and several months for the assembly version, because the C version uses a simpler algorithm. Talk about diminishing returns. :)
Asm 56 GB/s (you are reading Sanskrit)
C 41 GB/s (you are reading US Tax Code)
Rust 30 GB/s (you are reading English)
Python (why bother) (you are speed reading)
So many tools, so many languages, life is great.
I don't understand the FizzBuzz thingy. I mean, isn't it just about who has the fastest print method? Correct me please, but this problem has period 15, so if you just write:
    for (i = 1; i < n; i += 15) {
        print(i);
        print(i + 1);
        print("fizz");
        print(i + 3);
        ...
    }
You have the fastest, don't you? (If the ending n is important, one can just write 15 such loops/endgames depending on the modulus.) Since there is no 'if', the branch predictor should be on your side.
The % 15 is because they're using 'elif' or 'else if'; they only process each entry once and they exit the function when a successful comparison is made.
It's true that you could do this with something like:
    def fizzbuzz(x):
        result = ''
        if x % 3 == 0: result += 'fizz'
        if x % 5 == 0: result += 'buzz'
        if result == '': result += str(x)
        return result
to append fizz for 15 and also continue processing and append buzz, but this requires an extra comparison to check that you did neither of those before printing the number.
But note that if you do 3 comparisons, or 2 comparisons and an "extra" comparison, you've done the same number of comparisons; it's not really "extra".
Then go ahead and keep adding factors and see which approach results in more comparisons, as the genuinely extra comparisons rack up across various combinations of factors.
btw and ftw, your modulo-and-string-test example, more correct than the usual implementation, is among the most efficient.
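For instance, here's a hypothetical table-driven C variant (my illustration, not anyone's submission) that makes it easy to add factors and see how the comparison counts change:

    #include <stdio.h>

    struct rule { int divisor; const char *word; };

    int main(void) {
        /* add more (divisor, word) pairs here to experiment */
        struct rule rules[] = { {3, "Fizz"}, {5, "Buzz"} };
        int nrules = sizeof rules / sizeof rules[0];
        for (int i = 1; i <= 100; i++) {
            int matched = 0;
            for (int r = 0; r < nrules; r++) {
                if (i % rules[r].divisor == 0) {
                    printf("%s", rules[r].word);
                    matched = 1;
                }
            }
            if (!matched) printf("%d", i);
            printf("\n");
        }
        return 0;
    }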
The Firestorm cores are capable of issuing two 128-bit stores per cycle [1], giving a bandwidth of 102.4 GB/s at the 3.2 GHz peak clock (2 stores × 16 B × 3.2 GHz). This matches the experiments by Andrei with mixed reads/writes [2], and I have personally verified that it is attainable for pure write workloads to L1 as well as L2.
This does not mean 102.4 GB/s is achievable on a single core, but if L2 bandwidth is the limiting factor, it very well could be.
Before running the code, make sure your CPU does support AVX2. Most 64-bit Intel and AMD CPUs should. ARM CPUs, like those found in newer Apple computers, Smartphones or Raspberry Pis, won't support AVX2.
That sentence seems silly. ARM assembly and x86-64 assembly are two different beasts. Running the x86-64 code using Rosetta 2 emulation seems counter-intuitive as well if you want to get the fastest possible result on that machine.
ARM CPUs have NEON. I can't say whether NEON has equivalents for all the instructions; however, the thread starter is talking about a complete rewrite, not just running it as is.
> As of this writing, the 3rd fastest submission is written in Rust and produces output at a rate of 3 GB/s, the 2nd is written in C and produces output at a rate of 41 GB/s, and the fastest is written in Assembler and produces output at a rate of 56 GB/s.
This makes me happy. The order is exactly as it should be, much to the disappointment of the Rustaceans.
It's a nice reminder that for the ultimate performance, there is no substitute for knowing exactly what the assembler is doing.
It's also a nice reminder that most of us will never want to do that.
EDIT: Whoa. It's late, and I thought Rust was at 30 GB/s instead of 3, i.e. roughly neck and neck with C. I meant to make a good-natured joke. There must be something wrong with the Rust implementation to be 10x slower; I wonder what it is.
As I mentioned further down in the comments, I'll donate $500 to Watsi if someone manages to displace the C implementation in Rust or any other language besides C/C++/ASM. Be sure to DM me on Twitter (https://twitter.com/theshawwn) if you do, both so I'll see it and so that I can congratulate you. I'm sure it's possible, but it's sort of interesting that the activation energy is so high.
You can basically take any LLVM IR that Clang would produce and produce Rust that compiles down to it (modulo irreducible control flow, I think, but the C++ doesn't have that). Because of that, C++ vs. Rust language (as opposed to library) performance comparisons are uninteresting.
> It's a nice reminder that for the ultimate performance, there is no substitute for knowing exactly what the assembler is doing.
C and C++ do not let you know exactly what the assembler is doing. I work on compilers for a living and despite that--perhaps because of that--I have no confidence that I can predict what instructions the compiler will generate ahead of time. If I need to know the actual instructions, I run my code through Godbolt and check. C, C++, and Rust are all millions of lines of LLVM optimization passes away from the metal.
Based on some other comments, I suspect that all the time is being spent in locking stdout, which you can amortize by using Stdout::lock [1]. As is so often the case when microbenchmarks are discussed online, this has nothing to do with the language.
Also, it shouldn't need to be said, but FizzBuzz is so totally unrepresentative of real workloads that you shouldn't draw any conclusions about language performance from it.
It is very interesting if you have properly verified invariants and you can guarantee (with manual checking) that your code is correct. It's an order of magnitude harder than just letting the borrow checker do its job, but it's entirely possible (just like in any other language).
The second one (written by me) is actually JIT-compiled assembly. So little of the runtime is spent in the C code that -O0 vs -O3 makes no difference at all; and even if you wrote the glue code in Python it would not be much slower than the C code.
Also the third entry is not the one in Rust. It's just that the Rust entry came later and thus was not included in the plot. The third entry is https://codegolf.stackexchange.com/a/215236/56694; it is also C.
> displace the C implementation in Rust or any other language besides C/C++/ASM
This is not a language contest, it's an algorithmic contest: that is, it's the best algorithm that wins, not the best compiler. The top entry, converted into C or Rust by mostly menial work, would probably be only a few % slower and would displace the C implementation.
Now I'm kind of hoping that someone will rewrite the glue in Python so that Python is the second fastest implementation of them all, just for the laughs.
From the article, "The application produced 64 bytes of FizzBuzz for every 4 CPU clock cycles. The author states the ultimate bottleneck of performance is based on the throughput of the CPU's L2 cache."
For comparison, that time is on par with a single integer multiplication, which typically has a latency of 3-4 cycles.
It's actually really nice the way the author commented it. It's a cool insight into their mental process. I've never coded in Assembly and I can't normally follow it. But reading these comments I can understand how the author is surgically managing registers and operating off a complete mental map of what's there, the same way a coder would with, say, a very large database they have a complete mental model of.
All I mean is that I appreciate the author making their thought process accessible. It certainly looks like virtuosity, but I'm not competent enough to judge.
// The bytecode generation loop is unrolled to depth 30, allowing the
// units digits to be hardcoded. The tens digit is stored in %al, and
// incremented every ten lines of output. The parity of the hundreds
// digit is stored in %ymm2: one half predicts the hundreds digit to
// be even, the other to be odd, and the halves are swapped every time
// the tens digit carries (ensuring the predictions are correct).
These aren't CPU predictions: it means the code computes in parallel one result that is correct and one that is wrong, and picks the right one by swapping the two halves every now and then.
I mean, this is importantly true as a generalisation, but it's worth knowing that modern compilers know about idioms.
If you show a modern compiler the typical idiomatic blob of code for counting the set bits in a word in a language like C, the compiler goes "Oh, popcount I know this" and when targeting modern CPUs it emits a single instruction, the popcount instruction for that CPU.
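For example, here is the classic Kernighan bit-counting loop in C; recent GCC and Clang can recognize this pattern and emit a single popcnt instruction when the target CPU has one (whether a particular compiler version actually does is easy to check on Godbolt):

    /* Clears the lowest set bit each iteration; optimizing
       compilers can match this idiom and emit hardware popcount. */
    unsigned popcount(unsigned x) {
        unsigned n = 0;
        while (x) {
            x &= x - 1;
            n++;
        }
        return n;
    }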
In APL idiom recognition was essential to making even a halfway decent compiler. It's easy, and idiomatic, to write APL which would be tremendously wasteful if implemented naively, while recognising common idioms could remove lots of that waste without needing the level of sophistication modern compilers have during optimisation.
Because the compiler knows about idioms you should not try to optimise your programs by writing non-idiomatic code until you've determined you have a performance problem and the non-idiomatic code is identified as a way to fix it. Otherwise it may be that the compiler improves and now your code is not only harder to maintain it also has worse performance than the idiomatic solution.
Not yet. But I have no doubt in my mind that once we have smarter-than-human AI, there will be compilers that absolutely can turn bubble sort into quicksort.
Seems fairly unlikely. The best sorting algorithm is a function of the data, which the compiler can't know. There are pathological data sets where quicksort is O(n^2). For small n, quicksort is also a bad choice.
The programmer can understand these domain characteristics but the compiler certainly can't. If code isn't running qsort(), this could be a mistake, but may also be an optimization.
You almost always know something about the data that a sorting function will operate on. Unless of course you are implementing a generic sorting algorithm. In practice you could train your compiler on real world data to allow it to optimize better for your environment. This is already happening with profile guided optimizations.
I think it's unfair that people downvoted this to gray. This person didn't claim that this was about to happen, and it seems likely to me that when/if AI surpasses human intelligence that something like this would be possible. It might require some additional way to describe expectations about the input data, and it might be many years away, but I would be disappointed if compilers couldn't do this in the far distant future. Look at the progress we've made in the last seventy years and try to extrapolate that out a thousand years forward. I obviously can't guarantee what will happen, but is it really that unreasonable to suggest progress like this as a strong possibility?
I would love to see a Python implementation that approaches the C code. It would be educational on so many levels. Would anyone be interested in trying?
Also, why is the ASM version so much faster if the glue code doesn't make much difference? I suppose it matters at the margins...
> why is the ASM version so much faster if the glue code doesn't make much difference
The remark on the glue code only applies to my entry (the second). The assembly entry uses a completely different (and much more complicated) algorithm than the C entry.
If you read the description you'll see that it's optimizing around certain syscalls like memcpy (which is otherwise limiting and reduces speed by 5x). I think it's less language choice and more low-level optimization that is making the difference here.
(To check, I wrote a Go program that does nothing but fmt.Print($largebody) in a for loop, and it tops out at around 10 GB/s.)
Note that memcpy is not a syscall. Common syscalls do copies. The fast implementation uses the vmsplice syscall as a clever way to pump a pipe with data, because it doesn't copy.
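A minimal sketch of the idea (mine, not the actual submission): instead of write(), hand the kernel a reference to a user-space buffer with vmsplice. This only works when stdout is a pipe (e.g. ./a.out | cat), and the buffer must not be reused until the pipe has drained it:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <string.h>
    #include <sys/uio.h>

    static char buf[1 << 16];

    int main(void) {
        memset(buf, 'x', sizeof buf);
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof buf };
        /* gifts page references to the pipe instead of copying the data */
        for (int i = 0; i < 1000; i++)
            vmsplice(1, &iov, 1, 0);
        return 0;
    }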
Indeed. Today for most systems memcpy is in fact a compiler intrinsic. C's standard library offers a memcpy but the reality is that your C standard library was likely supplied by your C compiler vendor, and when you ask for memcmp() in C you get their preferred intrinsic inlined into your program unless you've been very firm in declining that optimisation.
In Rust this memcpy intrinsic is required (for the core language, not just the standard library) because Rust clearly has arbitrarily large data structures you can literally copy, so, somebody has to take responsibility for doing that work, Rust says it's the compiler back-end's job. On a toy CPU the memcpy might really just be a trivial read->register->write loop because that's adequate.
Oops, I notice hours too late to edit it, that this says memcmp() in one place where it mostly says memcpy(). In fact your C compiler likely has intrinsics for both, but I did mean memcpy() everywhere in the comment.
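To make the intrinsic point concrete, a small sketch: with optimizations on, compilers typically lower a fixed-size memcpy like this to a couple of register moves rather than a call into libc (easy to confirm on Godbolt):

    #include <string.h>

    typedef struct { char bytes[16]; } Block;

    /* With -O2, the 16-byte copy is usually emitted inline as one
       or two vector/register moves; no memcpy call appears. */
    void copy_block(Block *dst, const Block *src) {
        memcpy(dst, src, sizeof *dst);
    }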
It's not just knowing exactly what the assembler is doing, but also what the kernel is doing (but maybe you also meant this). "This implementation only supports Linux", and then continues about the use of a bunch of system calls:
* "asks the Kernel to defragment the physical memory it's using via the madvise system call"
* "use 2 MB of RAM each even though neither need 2 MB. This keeps the page table orderly"
* "Most competing FizzBuzz implementations call the write system call which can lead to data needing to be copied between user and Kernel space. So vmsplice is used instead."
And then the most fascinating bit: "The CPU's L2 cache is split into two halves, one for calculating the output of FizzBuzz, the other storing the result of the last calculation batch. This means there is no need to use any slower L3 memory operations."
"Slower L3 memory operations", I've not read that sentence before :D
It is really fascinating, but I disagree a bit with your pick :)
To me, the most fascinating part is the use of a bytecode interpreter. It makes perfect sense; it's basically optimizing the code size so that it fits in the cache. It's also a bit like defining a whole new instruction set, dedicated to and optimized for FizzBuzz!
Most of the state of the Fizzbuzz program is encoded in this bytecode.
> "Slower L3 memory operations", I've not read that sentence before
You hear it all the time when trying to optimize a hot loop for speed. Keep your data structures in a shape that fits in the cache. Align your data with cache-line boundaries. Code needs to be small too, hence the bytecode approach; this is also the reason why you can't unroll loops too much.
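For example (a hypothetical C11 snippet), giving each hot counter its own 64-byte cache line, so two cores hammering on adjacent counters never share a line:

    #include <stdalign.h>

    /* alignas(64) pads each counter out to a full cache line,
       avoiding false sharing between concurrent writers. */
    struct padded_counter {
        alignas(64) unsigned long value;
    };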
If your implementation does not need to fetch (or write) data from (or to) higher levels of the memory hierarchy, you'll see tremendous gains. There's even this bit in the original source: "while waiting for writes to go through to the L2 cache" :)
There are a few more very interesting hacks, such as the "high-decimal" format that skips the "need to do binary/decimal conversions."
You still assume I care. As stated above, I don't. It's only in your head that every Rust developer writes Rust because she thinks it's faster than C.
I think we'll have to agree to disagree. It's illustrative that a seemingly simple C program can't be easily ported to Rust. If it could be, someone would have already done it -- it's been a few hours.
It's also illustrative to examine exactly what the hangups are. You might even get more Rust converts if you acknowledge that there are serious engineering burdens to overcome when deciding to use Rust.
For the record, it's not the anti-Rust sentiment which people find annoying, it's the treatment of programming languages as though they were football teams.
If you want to make a case against Rust, then let's get concrete: why is one Turing-complete compiled language that can make arbitrary syscalls not capable of compiling down to the same assembly as another?
That would be an interesting discussion; this is not.
You can dig into it, but the details are mostly only relevant to people designing computer languages and tooling.
Aka compilers: as long as the machine code is equivalent to the source code, their output can be wildly different. However, someone needs to actually write them, which means both linguistic differences and the sheer effort put into the compiler are significant. You shouldn't assume that every language is operating on an even playing field, because they're simply not. The obvious solution of compiling to a second language is never quite as capable as using the language directly.
> compilers: as long as the machine code is equivalent to the source code, their output can be wildly different
I don't understand what this means: the output of a compiler is the machine code from the source code, no?
Also, Rust uses LLVM as a backend, which has benefited from lots of work given its use for C, C++, ObjC, Swift, &c. Compiling the C with Clang/LLVM might give a more interesting insight into the differences between Rust and C.
> The obvious solution of compiling to a second language is never quite as capable as using the language directly.
I'm not sure what you mean by this, or rather how it's related to what we're talking about?
You get different machine code using different compilers on identical pieces of source code. In theory the results of running that source code should be equivalent, even if one is more efficient than another. If, for example, one compiler uses an ASM instruction that another never uses, then nothing the programmer does can get the second compiler to be as efficient as the first. The same is true of various optimizations, etc.; sometimes a compiler is just better.
Bring a different language into the picture and things get even more divergent.
> how it’s related
It’s specifically in reference to rust use of LLVM not being sufficient to guarantee equivalence. Many languages target the JVM, that does not make them equivalent.
> You get different machine code using different compilers on identical pieces of source code.
Ah, right - well I'm certainly not denying that. (I'm familiar with the topic, to be clear: I've written compilers. I just wasn't clear what exactly you were saying.)
> It’s specifically in reference to rust use of LLVM not being sufficient to guarantee equivalence.
Formally, no. In practice, given that fact, given the experience of the maintainers of both languages, and given the 'compilability' of both languages, it would be surprising if there were a 10x gulf between them.
Not necessarily. Language evangelists can be incredibly annoying... actually, evangelists in general can be really annoying. It is cathartic to prove annoying people wrong every once in a while.
When there's a memory-related bug (or CVE) in a program written in C or C++, there's almost always some comment about how the Rust compiler would've caught it.
On HN there is often: a comment about how Rust would catch this (sometimes sarcastic), an explanation of how the bug isn't "really" about memory safety and so Rust wouldn't catch it, followed by a comment explaining that actually Rust does catch it.
The recent Linux mutex security bug is an example. That's not a memory safety bug, so surely Rust doesn't help? But Rust's mutual-exclusion design relies on the type system to give us safer locks: in Rust, "lock the mutex" is how you get the reference to the protected object. Thus, when the faulty code locks the wrong mutex in Rust, it gets the wrong object, and it either doesn't compile (you lock variable A and then try to change B, but you don't even have B) or doesn't work (you always lock variable A, then change variable A, when your code should change B; it does not work, a trivially reproducible bug in your code).
Rust is a memory safe systems programming language. As a systems programming language, it can be used in the same areas that C can (see Redox OS). I've seen quite a few Rust users online that have claimed that, because of that, it can be just as fast as C.
But it can be just as fast as C. The C program that's so much faster here isn't "really" a C program; it's just a fancy way to shovel machine code, which you could of course also do (unsafely) in Rust if you chose to. But why would you?
However, what you were asked about is the claim that it's faster, and there micro-benchmarks like this are not revealing. At scale you're obliged to limit yourself to what you can write that actually works, and in Rust the safety lets you reach higher, which is how it achieves better performance. In fact, that's what Mozilla used Rust for: they'd failed more than once to ship a concurrent CSS speed-up, finding it too hard to get the bugs out and so returning to a single-threaded CSS implementation, but the attempt in Rust (Quantum) worked just fine.
This is the same reason C++ is faster: a C++ iterator isn't faster than the naive C for loop (in fact they emit identical code), but the iterator is easier to think about, which means you can confidently write more complicated software using the abstraction. Rust just provides the safety to go higher still.
This is really cool. It should be obvious that a few thousand hand-rolled assembly opcodes targeting a fixed architecture would win the race. If the size of the solution grew a couple of orders of magnitude larger though, I'd expect that the micro optimization challenge would overwhelm the author and compilers would be more competitive.
Is the Table of Contents of your upcoming book public?
You're totally right. I think a lot of rust developers are going to be absolutely distraught that rust isn't the fastest language for writing fizzbuzz programs.
I guess they're just going to have to settle with rust being the best language to write safety-critical high performance concurrent low-level code.
>This makes me happy. The order is exactly as it should be, much to the disappointment of the Rustaceans.
Why does it make you happy? It's a reminder of how much performance and energy we're losing when we write in safe/expressive/easy-to-use languages like Java, C#, Python, yada yada.
FYI, on my machine a naive Rust implementation (modulo plus println!) runs at the same speed as a naive Python implementation (16-17 MB/s), even compiled with --release. Naive C manages 160-170 MB/s. (And I fully admit the naive implementations are pretty much as far as I'm going with this.)
I would guess it's the println! macro eating the time.
Calling stdout.lock() by itself makes little difference for me. Using BufWriter as you have I get ~80 MiB/s (slightly ahead of PyPy).
My implementation also has/had a String allocation in the loop, kicking that out got me to 120 MiB/s. I tend to write it as though the numbers/strings could be changed by a product manager at any time.
Final edit/errata: I get up to 140 MiB/s by appending .to_string() in the integer case instead of format!(), which makes some sense.
I didn't bother saving my original C implementation, but in retrospect I think I didn't use string concatenation, as I did with other languages. Redoing it with string concatenation, C is slightly slower than Rust, at 110 MiB/s. That also makes sense, as concatenation in C is slow.
edit: weirder, naive Julia (1.7 on Linux) is terrible. 45 KB/s.
Ruby is slightly slower than Python or Rust, 11 MB/s. I was expecting Python to be slowest among the interpreted languages; apparently not. PyPy yields a speedup to 70 MB/s.
Summing up the naive implementations (for each i: start with an empty string, append fizz/buzz based on the modulus of each, then conditionally append i and print; see the C sketch after this list):
Julia 45 KB/s
Ruby 11 MB/s
CPython 15 MB/s
Rust 16 MB/s
PyPy 70 MB/s
C 160 MB/s
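For reference, here's a naive C version along the lines described, piped through pv to measure throughput (my reconstruction, not the poster's exact code):

    #include <stdio.h>
    #include <string.h>

    /* run as: ./a.out | pv > /dev/null */
    int main(void) {
        char buf[32];
        for (long i = 1; ; i++) {
            buf[0] = '\0';
            if (i % 3 == 0) strcat(buf, "fizz");
            if (i % 5 == 0) strcat(buf, "buzz");
            if (buf[0] == '\0') sprintf(buf, "%ld", i);
            puts(buf);
        }
    }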
By default, Julia uses locking for writes to guarantee thread safety when multiple writes are happening. That said, it's totally possible to go way faster using buffered IO.
Wow. Perhaps I should be embarrassed, but I mentally filled in "30 GB/s" instead of 3 GB/s. I thought Rust was mostly neck and neck with C. I see now why a lot of fellows responded in a defensive way.
What's it doing to be so much slower? That's shocking. I may not enjoy Rust, but I do respect its ability to approach C++ performance.
At the very least, the itoap crate being used there does divisions by and comparisons to powers of 10 as part of stringifying integers, which the C and ASM versions don't seem to be doing. A crate with a function for stringifying arbitrary integers cannot take advantage of the fact that it's going to be used for stringifying consecutive integers, as an inlined solution can.
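A sketch of what an inlined solution can exploit (my illustration of the general idea, not the submission's code): keep the number as ASCII digits and increment it in place, so consecutive integers never need a binary-to-decimal conversion:

    #include <stdio.h>
    #include <string.h>

    /* Increment a decimal number stored as ASCII digits, in place.
       Returns the new length (it grows when e.g. 99 -> 100). */
    static int inc_ascii(char *d, int len) {
        for (int i = len - 1; i >= 0; i--) {
            if (d[i] != '9') { d[i]++; return len; }
            d[i] = '0';  /* carry into the next digit */
        }
        memmove(d + 1, d, len);  /* all nines: prepend a 1 */
        d[0] = '1';
        return len + 1;
    }

    int main(void) {
        char num[24] = "1";
        int len = 1;
        for (int i = 0; i < 200; i++) {
            printf("%.*s\n", len, num);
            len = inc_ascii(num, len);
        }
        return 0;
    }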
Not really. This is a hyper optimised microbenchmark. The actual work being done is so trivial that any "overhead" is really significant and you can get order of magnitude speedups by doing arcane things with caches, eliminating tiny inefficiencies and so on.
Complete speculation, but I think you'd find that's true for most problems. Very little CPU time is spent on the business logic, and the rest is overhead.
I'm curious, how did Rust affect you so badly that you hate it so much? If it's just silly people suggesting it everywhere, I think you have some personal issues to resolve; language choice isn't something to get emotional about.
>It's a nice reminder that for the ultimate performance, there is no substitute for knowing exactly what the assembler is doing.
I didn't realize viewing the assembly of a compiled binary was something reserved for C. Someone should tell rust.godbolt.org that it's not supposed to exist.
I think their point is that C is sufficiently close to assembler that you know exactly what assembly any given line of C will produce. This is an old canard which is not really true (https://zenhack.net/2015/08/02/c-is-not-portable-assembly.ht...), but it's common among people who derive a sense of identity from writing C.