Last time I wrote assembly, and it was a long while ago, it was way faster. But let's be honest: 95% of that came from doing manual buffering on top of OS APIs rather than using the C stdlib. The other 5% came from skipping itoa calls by doing arithmetic directly on the string representation.
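Roughly the kind of thing I mean by skipping itoa (a throwaway sketch, names made up): keep the counter as ASCII digits and carry by hand, so the buffer can be handed straight to write() with no conversion step.

    #include <stddef.h>

    /* Increment a fixed-width decimal counter stored as ASCII digits,
     * e.g. "000042" -> "000043". Overflow past the width is ignored. */
    static void ascii_increment(char *buf, size_t len)
    {
        for (size_t i = len; i-- > 0; ) {
            if (buf[i] != '9') {
                buf[i]++;          /* no carry needed, done */
                return;
            }
            buf[i] = '0';          /* carry into the next digit */
        }
    }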
I think this is why assembly can often be faster. Not because I'm better than a compiler, but because the structure of the language nudges you into faster approaches.
I've always been able to beat the compiler, though usually only after first trying to optimize in C. Admittedly, it's a whole lot harder to understand what's fast than it used to be. Access to SSE has its own benefits.
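For example, explicit intrinsics for something the optimizer may or may not vectorize on its own (just a sketch, and it assumes n is a multiple of 4):

    #include <stddef.h>
    #include <xmmintrin.h>   /* SSE */

    /* Add two float arrays four lanes at a time with SSE intrinsics. */
    static void add_floats_sse(float *dst, const float *a, const float *b, size_t n)
    {
        for (size_t i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
        }
    }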
It's been a problem (optimizing) for some time, though. I remember it being some work to beat the compiler on the i960CA. OTOH, I seem to remember the i860 being not-so-great, and the TI C80 C compiler was downright awful (per usual for DSPs).
One should never lose to the compiler; after all, you can see its output and it can't see yours.
Also, the programmer can "cheat" by doing things the compiler would consider invalid but that are known to be OK given the larger context of the application.
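For instance (a hypothetical sketch, not anyone's real code): copying in word-sized chunks and deliberately over-reading past the requested length, because the application guarantees every buffer has slack at the end. The compiler can never assume that on its own, and strictly speaking C doesn't allow it either, which is exactly the point.

    #include <stddef.h>
    #include <stdint.h>

    /* Copy n bytes in 8-byte chunks, touching up to 7 bytes past n.
     * Only valid because of an application-wide rule that all buffers
     * are allocated with at least 8 bytes of slack and are suitably
     * aligned -- assumptions no compiler could make on its own. */
    static void copy_with_slack(void *dst, const void *src, size_t n)
    {
        uint64_t *d = dst;
        const uint64_t *s = src;
        for (size_t i = 0; i < (n + 7) / 8; i++)
            d[i] = s[i];
    }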
The problem is that the ROI is usually pretty bad: in my experience these assumptions rarely hold as the code evolves, and the optimization usually only lasts a finite (sometimes shockingly short) amount of time, e.g. the OS changes, the hardware changes, memory changes, etc.
Back in the Pentium 1 and earlier days I could beat the compiler. But then it got hard.
And it changes so often: instructions that are fast on one CPU are not so fast on the next one, and vice versa.
Not to mention branch prediction and out-of-order execution make it very difficult to benchmark meaningfully. Is my code really faster, or does it just seem that way because some address happened to be better aligned, or similar?
I've gotten significant speed gains in certain projects by simply replacing certain hand-optimized assembly in libraries (i.e. not my code) with the plain C code equivalent. The assembly was probably faster 10-15 years ago, but not anymore...
>I've gotten significant speed gains in certain projects by simply replacing certain hand-optimized assembly in libraries (i.e. not my code) with the plain C code equivalent.
That's an interesting point, plus there's the portability issue.
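A classic illustration of the kind of thing that's better left as plain C these days (made up here, not from any particular library): a rotate written as portable C, which modern compilers recognize and turn into a single rotate instruction and can inline, whereas a separate hand-written asm routine can't be inlined or further optimized.

    #include <stdint.h>

    /* Portable 32-bit left rotate; compilers recognize this idiom and
     * emit a single rotate instruction. Masking keeps r == 0 well-defined. */
    static inline uint32_t rotl32(uint32_t x, unsigned r)
    {
        r &= 31;
        return (x << r) | (x >> ((32 - r) & 31));
    }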
My own way of leaving breadcrumbs in legacy code for this kind of inner-loop stuff has been to write a straightforward 'C' implementation (and time it), an optimized 'C' version (which itself can depend on the processor used), and a hand-tuned assembly version where really needed.
It allows you to back out of the tricky stuff, and it also acts as a form of documentation.
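Something like this, structurally (a sketch with made-up names; a build flag picks which version actually gets used):

    #include <stddef.h>
    #include <stdint.h>

    /* 1. Straightforward reference version: always kept, used for timing
     *    and as documentation of what the routine is supposed to do. */
    static uint32_t checksum_ref(const uint8_t *p, size_t n)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += p[i];
        return sum;
    }

    #if defined(USE_ASM_CHECKSUM)
    /* 3. Hand-tuned assembly, only where really needed. */
    extern uint32_t checksum_asm(const uint8_t *p, size_t n);
    #define checksum checksum_asm
    #elif defined(USE_OPT_CHECKSUM)
    /* 2. Processor-specific optimized C (a simple unroll as a stand-in). */
    static uint32_t checksum_opt(const uint8_t *p, size_t n)
    {
        uint32_t s0 = 0, s1 = 0;
        size_t i = 0;
        for (; i + 2 <= n; i += 2) {
            s0 += p[i];
            s1 += p[i + 1];
        }
        for (; i < n; i++)
            s0 += p[i];
        return s0 + s1;
    }
    #define checksum checksum_opt
    #else
    #define checksum checksum_ref
    #endif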