Last time I wrote assembly, and it was a long while ago, it was way faster. But let's be honest: 95% of that came from doing manual buffering on top of OS APIs rather than using the C stdlib. The other 5% came from skipping itoa calls by doing arithmetic directly on the string representation.
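Roughly the kind of thing I mean by skipping itoa (a throwaway sketch, names made up): keep the counter as ASCII digits and carry by hand, so the buffer can be handed straight to write() with no conversion step.

    #include <stddef.h>

    /* Increment a fixed-width decimal counter stored as ASCII digits,
     * e.g. "000042" -> "000043". Overflow past the width is ignored. */
    static void ascii_increment(char *buf, size_t len)
    {
        for (size_t i = len; i-- > 0; ) {
            if (buf[i] != '9') {
                buf[i]++;          /* no carry needed, done */
                return;
            }
            buf[i] = '0';          /* carry into the next digit */
        }
    }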
I think this is why assembly can often be faster. Not because I'm better than a compiler, but because the structure of the language nudges you into faster approaches.
I've always been able to beat the compiler, though usually only after first trying to optimize in C. Admittedly, it's a whole lot harder to understand what's fast than it used to be. Access to SSE has its own benefits.
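For example, explicit intrinsics for something the optimizer may or may not vectorize on its own (just a sketch, and it assumes n is a multiple of 4):

    #include <stddef.h>
    #include <xmmintrin.h>   /* SSE */

    /* Add two float arrays four lanes at a time with SSE intrinsics. */
    static void add_floats_sse(float *dst, const float *a, const float *b, size_t n)
    {
        for (size_t i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
        }
    }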
It's been a problem (optimizing) for some time, though. I remember it being some work to beat the compiler on the i960CA. OTOH, I seem to remember the i860 being not-so-great, and the TI C80 C compiler was downright awful (per usual for DSPs).
One should never lose to the compiler; after all, you can see its output and it can't see yours.
Also, the programmer can "cheat" by doing things the compiler would consider invalid but that are known to be OK given the larger context of the application.
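For instance (a hypothetical sketch, not anyone's real code): copying in word-sized chunks and deliberately over-reading past the requested length, because the application guarantees every buffer has slack at the end. The compiler can never assume that on its own, and strictly speaking C doesn't allow it either, which is exactly the point.

    #include <stddef.h>
    #include <stdint.h>

    /* Copy n bytes in 8-byte chunks, touching up to 7 bytes past n.
     * Only valid because of an application-wide rule that all buffers
     * are allocated with at least 8 bytes of slack and are suitably
     * aligned -- assumptions no compiler could make on its own. */
    static void copy_with_slack(void *dst, const void *src, size_t n)
    {
        uint64_t *d = dst;
        const uint64_t *s = src;
        for (size_t i = 0; i < (n + 7) / 8; i++)
            d[i] = s[i];
    }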
The problem is that the ROI is usually pretty bad: in my experience these assumptions rarely hold as the code evolves, and the optimization usually only lasts a finite (sometimes shockingly short) amount of time, e.g. the OS changes, the hardware changes, memory changes, etc.
Back in the Pentium 1 and earlier days I could beat the compiler. But then it got hard.
And it changes so often: instructions that are fast on one CPU are not so fast on the next one, and vice versa.
Not to mention branch prediction and out-of-order execution make it very difficult to benchmark meaningfully. Is my code really faster, or does it just seem that way because some address happened to be better aligned, or similar?
I've gotten significant speed gains in certain projects by simply replacing certain hand-optimized assembly in libraries (i.e. not my code) with the plain C code equivalent. The assembly was probably faster 10-15 years ago, but not anymore...
>I've gotten significant speed gains in certain projects by simply replacing certain hand-optimized assembly in libraries (i.e. not my code) with the plain C code equivalent.
That's an interesting point, plus there's the portability issue.
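A classic illustration of the kind of thing that's better left as plain C these days (made up here, not from any particular library): a rotate written as portable C, which modern compilers recognize and turn into a single rotate instruction and can inline, whereas a separate hand-written asm routine can't be inlined or further optimized.

    #include <stdint.h>

    /* Portable 32-bit left rotate; compilers recognize this idiom and
     * emit a single rotate instruction. Masking keeps r == 0 well-defined. */
    static inline uint32_t rotl32(uint32_t x, unsigned r)
    {
        r &= 31;
        return (x << r) | (x >> ((32 - r) & 31));
    }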
My own way of leaving breadcrumbs in legacy code for this kind of inner-loop stuff has been to write a straightforward 'C' implementation (and time it), an optimized 'C' version (which itself can depend on the processor used), and a hand-tuned assembly version where really needed.
It allows you to back out of the tricky stuff, and it also acts as a form of documentation.
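Something like this, structurally (a sketch with made-up names; a build flag picks which version actually gets used):

    #include <stddef.h>
    #include <stdint.h>

    /* 1. Straightforward reference version: always kept, used for timing
     *    and as documentation of what the routine is supposed to do. */
    static uint32_t checksum_ref(const uint8_t *p, size_t n)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += p[i];
        return sum;
    }

    #if defined(USE_ASM_CHECKSUM)
    /* 3. Hand-tuned assembly, only where really needed. */
    extern uint32_t checksum_asm(const uint8_t *p, size_t n);
    #define checksum checksum_asm
    #elif defined(USE_OPT_CHECKSUM)
    /* 2. Processor-specific optimized C (a simple unroll as a stand-in). */
    static uint32_t checksum_opt(const uint8_t *p, size_t n)
    {
        uint32_t s0 = 0, s1 = 0;
        size_t i = 0;
        for (; i + 2 <= n; i += 2) {
            s0 += p[i];
            s1 += p[i + 1];
        }
        for (; i < n; i++)
            s0 += p[i];
        return s0 + s1;
    }
    #define checksum checksum_opt
    #else
    #define checksum checksum_ref
    #endif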