This seems quite ridiculous to me: I have seldom seen "modern compilers are always faster than you", but rather "they are good enough that it is not worth it". The post draws a very over-confident "conclusion" from a single dubious test.
The main advantage of compilers is that the optimizations scale across a large codebase through inlining for example.
Also, just moving from Sandy Bridge to Haswell, for example, can produce significant performance swings (in both directions). The maintenance cost of the assembly is again a scaling issue.
If you have a single function that takes a significant amount of time in your program, and performance is critical, of course you can try to go lower level. But it is likely more profitable to start with 1) pre-optimized libraries (i.e. don't write your own "sort"); 2) the CPU vendors' optimization guidelines regarding memory layout, etc.; and 3) C-level vector intrinsics if you can benefit from vectorization.
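For point 3, a minimal sketch of what C-level vector intrinsics look like in practice (SSE from <xmmintrin.h>; the function and its alignment assumptions are made up for illustration):

    #include <xmmintrin.h>  // SSE intrinsics

    // Add two float arrays four lanes at a time.
    // Assumes n is a multiple of 4 and both pointers are 16-byte aligned.
    void add_arrays(float* dst, const float* a, const float* b, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(a + i);
            __m128 vb = _mm_load_ps(b + i);
            _mm_store_ps(dst + i, _mm_add_ps(va, vb));
        }
    }

You stay in C/C++, the compiler still handles register allocation and scheduling, and you only drop to assembly if this still isn't enough.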
On the contrary, I've often seen the "you can't beat the compiler" statement. This[1] recent reddit thread has it in the top comment, which is what prompted me to test it out.
And while all those other points are fine points (and I mention all that in the conclusion), it doesn't change the fact that beating the compiler isn't always the rocket science it's made out to be.
Fair. As a compiler engineer (full-time on clang/LLVM), I wouldn't make such a ridiculous claim (even though the compiler is capable of nasty tricks that "normal" humans wouldn't be able to pull off).
Someone who claims that it is not possible to beat the compiler should start by taking something like a GEMM routine (for example from https://github.com/flame/blis/tree/master/kernels/x86_64 ) and reimplementing it in C/C++ or Fortran to show how a mainstream optimizing compiler can do better.
You should always be comparing best case for this kind of thing. Slower cases are most likely "your thread got switched out by the OS to let something else run", and that's not really a fair test.
If you want to guard against context switches by the OS, don't use stuff that "most likely" works. Measure with perf and let it count the context switches.
It's totally ridiculous - I spent several years coding almost exclusively in x86 assembly, and the claim that a compiler can best a human at optimizing a specific function is ignorant. The only exception I can think of is so-called 'superoptimization', where the optimal sequence of instructions is determined by exhaustively testing every combination. And that's not an effective strategy for anything beyond a handful of instructions.
Technically you could imagine some intelligent-agent based search algorithm that finds a near-optimal sequence of instructions for a given statement of some intermediate language, employing heuristics derived from thousands of years of deep learning to get the search time down to something reasonable (say, hours or days instead of eons). But with the pressure on compilers today leaning towards everything being JIT'ed on the fly, I don't think we're likely to ever see it. It's just the old "sufficiently smart compiler" myth.
Depends entirely on the domain. Sorry, but I have never seen a programmer come up with the kinds of parallelizing transforms, cache blocking, iteration reordering, etc., that most polyhedral optimizers do.
Now, if you aren't doing these kinds of things, and are just trying to optimize the hot loop of some simple program, yeah, you can definitely win given enough time, because you'll just sit there with IACA or whatever, and superoptimize it by hand.
But you are also often starting with the output of a good compiler. If you had to start with nothing, I doubt you would do as well.
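For context, a minimal sketch of the kind of cache-blocking / iteration-reordering transform meant, on a made-up matrix transpose (polyhedral tools derive this sort of thing, along with the legality analysis, automatically):

    // Naive: for large n, every store in the inner loop lands on a new cache line.
    void transpose_naive(double* out, const double* in, int n) {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                out[j * n + i] = in[i * n + j];
    }

    // Blocked: work on B x B tiles small enough to stay resident in cache.
    void transpose_blocked(double* out, const double* in, int n) {
        const int B = 32;  // tile size, tuned to the target cache
        for (int ii = 0; ii < n; ii += B)
            for (int jj = 0; jj < n; jj += B)
                for (int i = ii; i < ii + B && i < n; ++i)
                    for (int j = jj; j < jj + B && j < n; ++j)
                        out[j * n + i] = in[i * n + j];
    }

Doing this (and proving it legal) across deep loop nests and multiple architectures is where the machinery earns its keep.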
You speak as though all low-level optimization is alignment, unrolling, pairing, etc. These are basically micro-optimizations that yield negligible gains unless applied over a large code base. That Pluto output is just a wall of code because it has been unrolled several times; it's not any sort of impressive optimization achievement.
A human would probably convert this to fixed point, convert the entire inner-most loop into a couple of address operations and a single fused multiply add, process the array 64 bytes per iteration without unrolling anything. At that point it's probably 10-20x more efficient than that slab of polyhedral bullshit and finally he'd come back to carefully pad here and there to avoid misalignment penalties.
As for optimizing for 4 cores - you take your shiny hand-polished assembly routine and spin it up on 4 threads, most likely in high level code since you're talking to the OS to get threads, synchronize them, etc. It's not wise to chase parallelism at a low level because that goes counter to minimizing the overhead costs in setting it up.
> But you also are often starting with the output of a good compiler.
No, nobody does this. I mean maybe if you're just learning assembly. Starting with the compiler-generated garbage does not help you other than maybe by giving you a benchmark to beat.
> The main advantage of compilers is that the optimizations scale across a large codebase through inlining for example.
This. You will die of natural causes well before beating the compiler en masse on a number of codebases I've worked with. And even when dealing with the tight inner loop, I often optimize with a mind not so much to beat the optimizer, but to aid it.
There are plenty of obscenely low level optimizations - optimizations I'd argue are operating at an even lower level than assembly - that one can apply without actually resorting to assembly. Avoiding cache line aliasing, for example:
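The example that was linked here isn't reproduced, but one flavour of it can be sketched in plain C++, no assembly required (sizes are made up):

    // With a power-of-two row stride, walking a column hits the same cache
    // sets over and over (and can trigger 4K aliasing stalls on some cores).
    // Padding the leading dimension sidesteps it.
    const int N   = 1024;
    const int PAD = 16;               // a few wasted elements per row
    static float matrix[N][N + PAD];  // instead of float matrix[N][N]

    float column_sum(int j) {         // strided column walk
        float s = 0;
        for (int i = 0; i < N; ++i) s += matrix[i][j];
        return s;
    }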
I agree with the gist of your post, but the pre-optimised libraries are often optimised for the general case. It can be quite worthwhile to dive deeper in some cases. Replacing e.g. the library malloc with your own pool based allocator or a general hash table with a hand-rolled one where you exploit certain knowledge about the data can be huge wins.
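A minimal sketch of the pool-allocator idea (single-threaded, fixed-size blocks, made-up interface; real allocators are considerably more involved):

    #include <cstddef>
    #include <vector>

    // Fixed-size block pool: allocation and deallocation are a free-list
    // pop/push instead of the general-purpose bookkeeping malloc does.
    // Assumes blockSize >= sizeof(void*) and is a multiple of alignof(void*).
    class Pool {
        std::vector<char> storage;
        void* freeList = nullptr;
    public:
        Pool(std::size_t blockSize, std::size_t count) : storage(blockSize * count) {
            for (std::size_t i = 0; i < count; ++i)
                deallocate(storage.data() + i * blockSize);
        }
        void* allocate() {                       // O(1), no locks, no syscalls
            void* p = freeList;
            if (p) freeList = *static_cast<void**>(p);
            return p;                            // nullptr when exhausted
        }
        void deallocate(void* p) {
            *static_cast<void**>(p) = freeList;  // push block onto the free list
            freeList = p;
        }
    };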
You're right, but what you're describing seems to be on the order of "algorithmic changes" or "data structure changes" to be more suited to your use-case.
It seems to me that it is a higher level than what I was addressing: "I'll write my code in assembly to get it to run faster". Unless you're writing your pool based allocator in assembly ;)
It's less that "you can't beat the compiler", but rather that "trying to beat the compiler can backfire if you don't take a great deal of care". Some optimization advice doesn't apply to current compilers or systems, and can produce actively worse results. The worst-case scenario isn't "you did no better than the compiler", it's "you did much worse than the compiler".
I ported the recursive variant of the quicksort test and ran it on my computer. The changes I made were to replace the Windows-specific timing functions with Linux-specific clock_gettime() calls. I also changed the rcx and rdx registers to rdi and rsi, because those are what the Linux 64-bit calling convention uses.
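The timing replacement amounts to something like this minimal sketch (the helper name is made up; it just swaps in CLOCK_MONOTONIC for the Windows calls):

    #include <stdint.h>
    #include <time.h>

    // Monotonic clock in microseconds.
    static uint64_t now_us() {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return uint64_t(ts.tv_sec) * 1000000u + uint64_t(ts.tv_nsec) / 1000u;
    }

    // usage: uint64_t t0 = now_us(); sortRoutine(items, count); uint64_t dt = now_us() - t0;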
So on my computer, the assembly code (barely) beat g++ but not clang++. From a cursory glance at the assembler code clang++ generates, the difference seems to be that it adds alignment to critical loops.
It is also smarter about using 32-bit registers when it can get away with it. E.g. the handwritten assembler code contains "xor r9, r9". An equivalent but faster variant that the compiler generates is "xor r9d, r9d".
There is also a slight error in the assembly code. rsp should be aligned to a 16-byte boundary when a call instruction is executed, and the code doesn't ensure that. It likely loses a whole bunch of performance by making calls with a misaligned stack.
It's interesting that your clang and my clang give different results, even though we're using the same version. I suspect it's a result of differing CPU architectures. (i.e. my CPU is a different model to yours perhaps).
I originally did put loop alignment in my asm version, but I took it out because it was actually ever so slightly slower on mine. Make of that what you will.
I think a big difference is the flags. At least for g++, if you don't specify -march=native -mtune=native you're going to take a performance hit. How much of a performance hit depends on the features.
The sorttest.zip Makefile has only the following flags specified: -O3 --std=c++11
Where bjourne has: -O3 --std=c++11 -fomit-frame-pointer -march=native -mtune=native
I might rerun your tests with bjourne's additions!
That's very likely. Mine is an AMD Phenom(tm) II X6 1090T. Though I changed your code a little so that the intro looks like this:
sortRoutine:
; rdi = items
; esi = count
push rbp ; <- stack alignment push
sortRoutine_start:
cmp esi, 2 ; count < 2 -> nothing to sort
jb done
dec esi ; count - 1, like the original "sub rdx, 1"
The "cmp esi, 2; jb done; dec esi" corresponds to your "sub rdx, 1; jbe done". That improves it on my machine to 63 ms/loop. If you are interested I can put it online somewhere.
This shouldn't be needed. 8 byte alignment is fine for the CPU itself. The purpose of 16 byte alignment is to facilitate making 16 byte aligned stack allocations.
I also checked a similar manual for AMD, and it doesn't seem to mention RSP alignment at all, except that "some calling conventions may require ...".
The CPU doesn't care. It only matters when you call functions which allocate 16B objects on the stack.* This function calls only itself and pushes only 8B words on the stack so it's fine with 8B alignment.
* Some functions generated by C compilers do and they segfault if you call them with wrong alignment. Ask me how I know.
edit:
OK, so I downloaded this code. Results:
as-is: 78111us
push rbp: 73093us
sub rsp,8: 72332us
sub rax,8: 72222us
Seems to be a matter of instruction alignment, nothing to do with the stack.
They are equivalent due to zero extension. See http://stackoverflow.com/questions/11177137/why-do-most-x64-... But the encodings of most instructions using 32-bit operands are shorter than those using 64-bit operands. And shorter code is better because it fits better in caches, jumps are closer, and so on.
Though I was wrong about that particular case, because "xor r9d, r9d" and "xor r9, r9" are both three bytes long. But "xor eax, eax" is two bytes and "xor rax, rax" is three bytes. So that's why you should use the old GPRs when you are able to choose.
CPUs special case xor r,r as clearing the register and breaking dependency chains. Do they do the same thing when relying on implicit zero extension? (Can't be bothered to check Agner)
While this may seem silly to some people, I definitely appreciate the sentiment. "The compiler is smarter than you" is thrown around often here, and on Reddit, and a lot of people consider it "common wisdom", but it's not really correct.
Writing code is a dialogue with the compiler: it can do better than you sometimes, and vice versa, but treating the compiler as a magic box that always spits out faster code than you is pretty silly.
I can see where this received wisdom is coming from: a counter-reaction to the common tendency we had well into the 90s to hand-optimize every procedure considered to be even remotely on the hot path. It didn't even have to be inline assembly: it could just be C code sprinkled with registers, Duff's devices and bit shifts.
That used to work well enough for non-portable code targeting a limited range of CPUs, but nowadays the gains are too small, the ROI is negative, and these efforts may actually end up backfiring on you.
I guess we needed to spread the knowledge that "the compiler is smarter than you" even if it wasn't really accurate, just to stop people from doing crazy stuff out of pure inertia.
> I can see where this received wisdom is coming from: a counter-reaction to the common tendency we had well into the 90s to hand-optimize every procedure considered to be even remotely on the hot path. It didn't even have to be inline assembly: it could just be C code sprinkled with registers, Duff's devices and bit shifts.
That's not it at all. The original problem was that the compilers generated several orders of magnitude larger and slower code than what we could code in the demo scene, and other than processor or memory, made zero utilization of the hardware or DMA. And in the demo scene, if you're not getting the maximum performance out of the hardware, you might as well be dead -- "demo or die", as Chaos of Sanity (now Farbrausch) so famously put it.
Compilers didn't really catch up with us: the fastest and best they can do using hardware instead of just the CPU and RAM is CUDA Fortran (pgi Fortran compilers). I know of no compiler taking advantage of DMA or audio hardware, let alone co-processors like for example the Copper and the Blitter. Even on systems like PS3, the GCC compiler took zero advantage of the RSX chip -- it was just a generic PowerPC compiler.
Surely a compiler will sometimes beat a human by generating a perfectly or near perfectly scheduled sequence of instructions for a particular processor, but a human can write a generic piece of assembler code that will get really good performance across a range of different chips in a given processor family, and so still beat a compiler overall.
It's worth noting that the compiler does more than just make things fast. Even the smartest of people muck up things like keeping track of types and doing pointer math every now and then. Let's say you managed the OpenSSL project. If you knew, statistically, that every line of hand-written assembly reduced runtime by Y percent and increased the likelihood of a heartbleed-magnitude security issue (caused by that code) by Z percent, how much Y would you trade for Z?
If the compiler even averages out with a human with performance, the ability to get the sort of messages that, say, rustc generates is utterly invaluable.
That also brings about the risk of doing bad things that the compiler could otherwise prevent. Simply using a non-optimizing compiler with a timing attack-safe algorithm (and using the resulting machine code) would have the same effect. There's little reason to actually write assembly by hand, in that case, unless you're trying to milk performance.
Compilers are usually at a disadvantage compared to human programmers, as they're under pressure to produce code as quickly as possible; seconds if possible, minutes at worst. A human may spend many hours or days writing, profiling, testing, etc. This biases the kinds of algorithms that compilers use (especially JITs, since they have even stricter requirements).
It would be nice to have a compiler/optimiser/analyser/profiler/tester/fuzzer/etc. designed to run for long periods, running all sorts of improvement-finding algorithms, building up a knowledge base about the code on disk (which can be updated incrementally when the code changes), and providing reports and messages to the user.
When we're about to embark on a deep dive, for optimisation/debugging/etc. we can fire up this assistant and have it running for the entire time we're devoting to the problem. It can even keep running overnight if we spend several days on the issue.
Superoptimisation is very cool, although we'd want a bunch of infrastructure around it e.g. to identify bottlenecks, and make reuse easier.
Souper is a cool project, and I definitely agree with the use of such techniques as a form of static/dynamic analysis to inform the user, rather than as a compilation step.
Whilst the time taken by superoptimisation is a big problem, another big problem with this kind of approach is its unpredictability. Rather than using the output of such optimisers directly, it seems preferable to use those optimisers to find hints which we can annotate our code with; that way, the performance will be more predictable, and more robust to code changes.
Profiling information could be gathered during testing; we could kill two birds with one stone if we gathered profiling information during property checking, e.g. in the style of QuickCheck/(Lazy)SmallCheck/etc. Maybe with an option to add weights to the test cases, so we can assign a low weight to tests which throw crazy data at the system, like fuzzing, and higher weight to those with realistic data generators, golden tests, etc.
Your description of the assistant reminds me of a Clojure talk I watched recently[1] where the speaker outlines how a central pool of knowledge about invariant compilation properties could embody the scientific method.
Not watched the talk yet, but sounds very similar to my own thinking. For example, I think the "pipeline" approach of compilation (preprocess -> lex -> parse -> desugar -> inference -> check -> optimise -> code generation) is very restrictive, as it precludes many other activities (e.g. static analysis), forcing custom lexers, parsers, etc. to be created in parallel, which may-or-may-not work with existing code, etc.
I think a more data-based approach would be preferable, for example we might think of "source text", "preprocessed", "lexed", etc. as being tables in a relational database, and the phases of the pipeline as views/projections which populate one table from another. Optimisation would just be a one:many relation, with a single source text corresponding to multiple possible expressions. This data could be stored, collated, mined, augmented, etc. by various other processes, which allows more approaches to be taken than just compiling.
Of course, this is just one idea; and the relational part would only need to be an interface to the data, it could be computed lazily, and wouldn't necessarily be implemented with an actual RDBMS storing all of the intermediate bits.
Just watched the talk. The technique the author has discovered is called "supercompilation" (note: this is distinct from "superoptimisation", mentioned in sibling comments).
The idea is to use an evaluation strategy which works at compile time, reducing statically-known expressions whilst shuffling around unknown values symbolically to preserve the semantics. The result ends up folding constants, inlining, unrolling, etc. automatically, just like the examples in the talk.
Whilst supercompilation can be slow, the main criticism is that the resulting code is quite large, due to the amount of inlining.
The main difference between supercompilers and the technique discussed in the talk is that supercompilation is provably correct: it doesn't rely on testing or placing exhaustiveness requirements on the user. I don't think it's necessary to sacrifice correctness; the speaker mentioned the use of guards in JITs, and how he avoids them for speed, but I think that since we're performing such extensive rewriting of the code anyway, many redundant guards can be supercompiled-away, and those remaining can be shunted to the start of the program, so that no branches appear in the hot path.
The allusions to the scientific method seem to be about properties/invariants, and potential counterexamples to them. I'm all for finding and sharing properties of programs, both conjectured and proven, although I don't like the idea of assuming the conjectures are true.
I've looked into this in Haskell, and found some great work, including:
I think there's definitely some low-hanging fruit there, for finding equivalent expressions, comparing their performance, and generating a list of rewrite rules which the programmer may want to consider adding to their code.
A slightly more ambitious project would use equations/properties about the program to generate a specialised supercompiler (maybe using something like a variant of Knuth-Bendix completion, ordered such that the normal forms are the higher-performing variants).
It's easy to beat a compiler in the small - just takes time & patience. But such an approach doesn't scale. We don't write tiny routines and throw them away; instead, we write big programs made of lots of routines & classes, and we maintain them for years, probably porting them from machine to machine.
I encourage everyone to write some assembly; you'll learn a lot. But use a compiler for your work.
If he sorts 1 million items, I guess he runs out of L1 cache and probably out of L2 cache (a million items of even a few bytes each is several megabytes). Therefore memory accesses may play the biggest role here, and that explains why he sees almost no improvement from recursion elimination.
About Human vs Compiler, I see a very different issue: most developers (especially at Big Corps) only know objects and do not have a clue about what the processor is actually doing.
As a result, most high level programming has very poor performance - whatever the compiler quality. This is certainly why we keep waiting seconds for simple operations.
Questioning compiler output is a very good exercise to become a better developer, whether you can beat the compiler or not.
Sedgewick's 1978 paper[0] on implementing quicksort has some interesting hand optimizations of the assembly code -- loop rotating, unrolling, etc. I wonder if modern compilers do the same?
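For anyone who hasn't read it, a hypothetical before/after of that style of hand unrolling (modern compilers will generally perform this themselves at -O2/-O3 when their cost model says it pays, which is part of the question):

    // Plain loop.
    long sum(const int* a, int n) {
        long s = 0;
        for (int i = 0; i < n; ++i) s += a[i];
        return s;
    }

    // Hand-unrolled 4x with independent accumulators and a scalar tail --
    // the sort of transform done by hand in the 1978-era code.
    long sum_unrolled(const int* a, int n) {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
        }
        long s = s0 + s1 + s2 + s3;
        for (; i < n; ++i) s += a[i];
        return s;
    }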
Best-case seems like a poor metric when the CPU scheduler could certainly cause 7% variation. I would be interested to see, say, 100x the number of runs, and see the mean rather than the best, since one usually cares about the average more than the best.
I also wish I knew what optimization settings GCC/etc was using, and what effect tweaking those has.
Because of noise in general, "best case" always seems like the best metric to me. Over a large number of runs, you're likely to hit the "perfect" measurement on a microbenchmark.
Otherwise, for an "adaptive" number of runs till enough time is spent to have some "confidence" on the measure, I've been fairly happy with: https://github.com/google/benchmark/
The timing _should_ be constant each run, so best case is the best way to remove the scheduler variations. I tried mean also, and the results aren't that different.
Optimization is -O3 (see the attached Makefile at the bottom).
Would -march=native and -fstrict-aliasing make any difference?
It would be interesting to compare the compiled asm with the hand-rolled one.
The code also has some potential improvements, though maybe the compiler is smart enough to find them, such as hoisting the read of pivot.key out of the loop since it doesn't change.
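For concreteness, a hypothetical before/after of that hoist (made-up names, not the actual sorttest code):

    struct Item { int key; };

    // Because the loop stores through 'items', the compiler may have to
    // reload pivot->key every iteration unless it can prove 'pivot' doesn't
    // alias the array.
    int partition_lo(Item* items, int count, const Item* pivot) {
        int lo = 0;
        for (int i = 0; i < count; ++i)
            if (items[i].key < pivot->key)      // re-read each time
                items[lo++] = items[i];
        return lo;
    }

    // Hoisting the read makes it a local the compiler can keep in a register.
    int partition_lo_hoisted(Item* items, int count, const Item* pivot) {
        const int pivotKey = pivot->key;        // read once
        int lo = 0;
        for (int i = 0; i < count; ++i)
            if (items[i].key < pivotKey)
                items[lo++] = items[i];
        return lo;
    }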
Yes, you can pretty easily beat the compiler in simple cases when you do this.
I would seriously challenge anyone to try to, by hand, do what PLUTO+ does.
http://dl.acm.org/citation.cfm?id=2688512
It is implemented in at least one real production C++ compiler. The analogue would be graphite in gcc, and polly in llvm, but they don't have the full cost modeling it does.
Then try to do it for multiple architectures or even different cache models (IE newer vs older processors).
Even simpler things than that, like deciding when it is profitable to add runtime vectorization/alignment checks, etc, is really hard by hand.
Hell, in larger functions, I doubt people can even optimally do register allocation (including live range splitting, remat, etc).
So yeah, stupid quicksort, sure, you can beat it.
I'm not sure what it's supposed to prove?
If you restrict yourselves to small cases that are easily optimizable without any thought, and not amenable to any even slightly advanced optimization, yes, you can beat the compiler.
So I guess that the lesson to take from this post is that you can beat the compiler. We should also appreciate that the people who did similar analyses and did not get a speed up, most probably did not write a blog post about it.
I would like to see in the article a discussion of the assembly the compiler produces, how it differs from the assembly the author wrote, and perhaps why the differences are worse.
>where making good use of the SIMD intrinsics can allow assembly to massively beat the compiler.
Is this the correct use of this terminology? I thought intrinsics were functions that allow you to tell the compiler to use particular instructions, specifically so you can avoid dropping into assembly. In assembly, wouldn't you just call them "instructions"?
Should the second assembler statement use `jle done` rather than `jbe done` to preserve the original semantics? (I know nothing about assembly so could be missing something obvious.)
Yeah, it probably should be. It doesn't affect the performance. The difference would only manifest if you passed in a negative count, which would be an error anyway.
If your version goes UB on a zero length input array, then I think the compiler not only wins, but wins by a lot.
Obviously you can easily fix it, but the statement people generally make is "You can't beat the compiler" not "(You, me, whomever else [that was as far as I got], and/or a huge time investment) can't beat the compiler". All that said...people categorically saying "You can't beat the compiler" annoys me too (though in my case they're right; I can't).
It only fails on a _negative_ count. Zero count works. If your program is passing around negative counts, it's already broken, and the exact specifics of where and when the brokenness manifests aren't particularly important.
Technically I should have used a size_t instead of an int for the count anyway, so it's kinda a moot point. I just picked int to make a simpler toy example program.
When optimizing code, you probably want to work on performance improvements that can actually be fixed by editing the code.
Most of the things you can directly affect are things that happen in every test run, so best-case will include them.
Slower test runs will include events that don't happen on every test run (the computer is busy doing something else), so editing the code has less effect on them, and possibly none at all if it's completely unrelated.
Maybe those other events causing slowdown should be investigated too? But usually you want to look for a way to make them happen every time before working on them.
If the thing you're timing is expected to have a constant running time, the only thing that can slow it down is external factors (e.g. OS background tasks).
Best-case over a large number of runs is the correct way to approach the ideal running time of the task in this case, as you can eventually hit a run that didn't get impeded by anything.
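For example, something like this minimal best-of-N harness (hypothetical; uses std::chrono rather than the platform-specific calls discussed elsewhere in the thread):

    #include <chrono>

    // External interference (scheduler, other processes) only ever adds time,
    // so for a constant-time workload the minimum over many runs is the best
    // estimate of the undisturbed cost.
    template <class F>
    long long best_of_us(int runs, F&& work) {
        using namespace std::chrono;
        long long best = -1;
        for (int r = 0; r < runs; ++r) {
            auto t0 = steady_clock::now();
            work();
            auto us = duration_cast<microseconds>(steady_clock::now() - t0).count();
            if (best < 0 || us < best) best = us;
        }
        return best;
    }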
You should never do this. Best-case favors outliers and does not represent expected performance, which is what we care about. Just because the stars happen to align one time doesn't mean you report that run.
In practice you will hardly ever see outliers like you described in system A, where one run is significantly faster. You will often see cases where one run is significantly slower. The reason could be things like cache misses, swapped-out code, some bad code path happening, etc. (all of these on very different timescales). These things tend to happen only occasionally, so the reverse of your example A (seven five-second runs and one ten-second run) is more plausible. Because such factors tend to be things you can't easily control, taking the minimum is a good approximation when optimizing a code snippet, as opposed to the whole program.
Nice job. Here's 900,000 lines of C++ code for you to now translate to assembly. And after you're done with that, I'd like to change a few lines and have you do it over again, preferably 100 times a day.
It'd be nice if people replied instead of just downvoting.
You're missing the point of a compiler. It does a huge amount of work to reliably get a very, very good solution to a huge problem in a reasonable amount of time. Depending on the optimization settings, of course it is not going to try its hardest to get the very best code out of every single function. Besides, you can always use the output of the compiler as your starting point for hand optimization.
Why don't you try your hand at some Fortran kernels where a compiler might spend a few minutes or hours optimizing the hell out of something extremely important? I doubt you'll beat a Fortran compiler at its main job.
No one is claiming that you can't beat the compiler some of the time. But you can't beat the compiler even 0.01% of the time, given how much code there is out there.
I can't beat the compiler even 0.01% of the time? Really? Because that's the point of the article -- I just grabbed the first piece of C I found and managed to beat the compiler.