
This seems quite ridiculous to me. I have seldom seen the claim "modern compilers are always faster than you"; what I usually see is "they are good enough that it is not worth it". The post draws a very over-confident "conclusion" from a single dubious test.

The main advantage of compilers is that their optimizations scale across a large codebase, through inlining for example.

Also, just moving from Sandy Bridge to Haswell, for example, can cause significant performance swings (in both directions). The maintenance cost of the assembly is, again, a scaling issue.

If you have a single function that takes a significant amount of time in your program, and performance is critical, of course you can try to drop to a lower level. But it is likely more profitable to start with 1) pre-optimized libraries (i.e. don't write your own "sort"); 2) the optimization guidelines of the CPU vendors regarding memory layout, etc.; and 3) C-level vector intrinsics, if the code can benefit from vectorization.
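To illustrate point 3, a minimal sketch of what "C-level vector intrinsics" looks like (assuming AVX support and, for brevity, a length that is a multiple of 8): the vectorization strategy is expressed in C, and register allocation and scheduling are left to the compiler.

    #include <immintrin.h>
    #include <stddef.h>

    /* Element-wise add with AVX intrinsics; assumes n is a multiple of 8. */
    void add_arrays(float *dst, const float *a, const float *b, size_t n)
    {
        for (size_t i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
        }
    }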




On the contrary, I've often seen the "you can't beat the compiler" statement. This[1] recent reddit thread has it in the top comment, which is what prompted me to test it out.

And while all of those are fair points (and I mention them in the conclusion), it doesn't change the fact that beating the compiler isn't always the rocket science it's made out to be.

[1] https://www.reddit.com/r/programming/comments/5f9evm/learnin...


Fair. As a compiler engineer (full-time on clang/LLVM), I wouldn't make such a ridiculous claim (even though the compiler is capable of nasty tricks that "normal" humans wouldn't be able to pull off).

Someone who claims that it is not possible to beat the compiler should start by taking something like a GEMM routine (for example from https://github.com/flame/blis/tree/master/kernels/x86_64 ) and reimplementing it in C/C++ or Fortran, to show that a mainstream optimizing compiler can do better.

A good starting point to understand the gap between hand-written optimized assembly and what you can get with C is http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/index.ht...
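For context, the C baseline in that kind of exercise is roughly the textbook triple loop below (a naive sketch, not the BLIS micro-kernel); the gap between what a compiler makes of this and the hand-tuned assembly kernel is exactly what that material walks through.

    /* Naive dgemm sketch: C += A * B, all n x n, row-major. This is the
     * kind of plain-C starting point a compiler has to optimize. */
    void dgemm_naive(double *C, const double *A, const double *B, int n)
    {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                double sum = C[i * n + j];
                for (int k = 0; k < n; ++k)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
    }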


Are you the author of the linked post? If so I have a couple of questions:

- Why not throw out the best and worst cases for each and then find the mean of run times? Seems like a more "fair" way to compare them.

- Did you compare the assembly generated by the compiler to the assembly you wrote?


You should always compare the best case for this kind of thing. Slower runs are most likely "your thread got switched out by the OS to let something else run", and that's not really a fair test.
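A minimal sketch of that best-of-N approach (clock_gettime with CLOCK_MONOTONIC; the repetition count and the work() placeholder are arbitrary):

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    /* Stand-in for the code under test. */
    static void work(void)
    {
        volatile double sink = 0.0;
        for (int i = 0; i < 1000000; ++i)
            sink += i * 0.5;
    }

    int main(void)
    {
        uint64_t best = UINT64_MAX;
        for (int run = 0; run < 100; ++run) {
            uint64_t t0 = now_ns();
            work();
            uint64_t t1 = now_ns();
            if (t1 - t0 < best)
                best = t1 - t0;
        }
        printf("best of 100: %llu ns\n", (unsigned long long)best);
        return 0;
    }

The idea is that the minimum is the run least polluted by OS noise.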


If you want to guard against context switches by the OS, don't use stuff that "most likely" works. Measure with perf and let it count the context switches.
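perf stat -e context-switches ./bench does that from the command line. If you want the counter in-process, here is a sketch using the same Linux perf_events interface (the busy loop is just a placeholder for the code under test):

    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>
    #include <stdint.h>

    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
        return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_SOFTWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_SW_CONTEXT_SWITCHES;
        attr.disabled = 1;

        int fd = perf_event_open(&attr, 0, -1, -1, 0); /* this process, any CPU */
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* Placeholder for the benchmarked code. */
        volatile double sink = 0.0;
        for (long i = 0; i < 100000000; ++i)
            sink += i * 0.5;

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t switches = 0;
        read(fd, &switches, sizeof(switches));
        printf("context switches during run: %llu\n",
               (unsigned long long)switches);
        close(fd);
        return 0;
    }

If the count is non-zero, that run's measurement is suspect.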


Which is why you use the 90th percentile.


What's wrong with using the best result? If you're concerned the code could run faster than possible: don't be :)


What if the test data is random? You could just get lucky and hit a happy-day scenario.


Then both versions should get that "happy day"? If you are using distinct random data for each version, then you aren't really benchmarking properly.


It doesn't say that they both use the same data sets.


If they are using different data sets, then I'd say it's an invalid benchmark.


Your first question was asked and answered here: https://news.ycombinator.com/item?id=13060404


A note: every compiler writer should read this paper.

https://emeryberger.com/research/stabilizer/


It's totally ridiculous - I spent several years coding almost exclusively in x86 assembly, and the idea that a compiler could best a human at optimizing a specific function is ignorant. The only exception I can think of is so-called 'superoptimization', where the optimal sequence of instructions is determined by exhaustively testing every combination. And that's not an effective strategy for anything beyond a handful of instructions.

Technically you could imagine some intelligent-agent-based search algorithm that finds the near-optimal sequence of instructions for a given statement of some intermediate language, employing heuristics derived from thousands of years of deep learning to get the search time down to something reasonable (say, hours or days instead of eons). But with the pressure on compilers today leaning towards everything being JIT'ed on the fly, I don't think we're likely to ever see it. It's just the old "sufficiently smart compiler" myth.


"optimizing a specific function is ignorant"

Depends entirely on the domain. Sorry, but I have never seen a programmer come up with the kinds of parallelizing transforms, cache blocking, iteration reordering, etc., that most polyhedral optimizers do.
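For concreteness, hand-written cache blocking looks something like the tiled transpose below (a toy sketch with an arbitrary tile size; a polyhedral optimizer derives far more aggressive combinations of tiling, skewing, fusion and parallelization automatically):

    /* Blocked transpose: work in B x B tiles so the source and destination
     * accesses stay within a small working set. B = 32 is an illustrative,
     * untuned choice. */
    #define B 32

    void transpose_blocked(double *dst, const double *src, int n)
    {
        for (int ii = 0; ii < n; ii += B)
            for (int jj = 0; jj < n; jj += B)
                for (int i = ii; i < ii + B && i < n; ++i)
                    for (int j = jj; j < jj + B && j < n; ++j)
                        dst[j * n + i] = src[i * n + j];
    }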

Now, if you aren't doing these kinds of things, and are just trying to optimize the hot loop of some simple program, yeah, you can definitely win given enough time, because you'll just sit there with IACA or whatever, and superoptimize it by hand.

But you are also often starting with the output of a good compiler. If you had to start with nothing, I doubt you would do as well.

Here: http://repo.or.cz/pluto.git/blob/HEAD:/examples/jacobi-2d-im...

Please, without looking at the output of pluto, create a multi-threaded, fully cache-optimized version of this code, optimized for 4 cores, by hand.

pluto can generate C code to do it in 0.2 seconds.

The result is here: http://repo.or.cz/pluto.git/blob_plain/HEAD:/test/jacobi-2d-...

Please also take the following gauss seidel code, and generate both a cache optimized sequential version, and a parallel cache optimized version: http://repo.or.cz/pluto.git/blob_plain/HEAD:/examples/seidel...

Most people would probably not be able to accomplish either better than icc + pluto, pretty much ever, let alone in some reasonable time period.


You speak as though all low-level optimization is alignment, unrolling, pairing, etc. These are basically micro-optimizations that yield negligible gains unless applied over a large code base. That Pluto output is just a wall of code because it has been unrolled several times; it's not any sort of impressive optimization achievement.

A human would probably convert this to fixed point, turn the entire inner-most loop into a couple of address operations and a single fused multiply-add, and process the array 64 bytes per iteration without unrolling anything. At that point it's probably 10-20x more efficient than that slab of polyhedral bullshit, and finally he'd come back and carefully pad here and there to avoid misalignment penalties.
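As a rough illustration of the "64 bytes per iteration around a fused multiply-add" style - on a generic saxpy-like loop rather than the Jacobi stencil itself, and assuming AVX2+FMA and a length that is a multiple of 16:

    #include <immintrin.h>
    #include <stddef.h>

    /* Process 64 bytes (two 256-bit registers of floats) per iteration,
     * each register updated with a single fused multiply-add. */
    void saxpy64(float *restrict y, const float *restrict x, float a, size_t n)
    {
        __m256 va = _mm256_set1_ps(a);
        for (size_t i = 0; i < n; i += 16) {
            __m256 y0 = _mm256_loadu_ps(y + i);
            __m256 y1 = _mm256_loadu_ps(y + i + 8);
            y0 = _mm256_fmadd_ps(va, _mm256_loadu_ps(x + i), y0);
            y1 = _mm256_fmadd_ps(va, _mm256_loadu_ps(x + i + 8), y1);
            _mm256_storeu_ps(y + i, y0);
            _mm256_storeu_ps(y + i + 8, y1);
        }
    }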

As for optimizing for 4 cores - you take your shiny hand-polished assembly routine and spin it up on 4 threads, most likely from high-level code, since you're talking to the OS to get threads, synchronize them, etc. It's not wise to chase parallelism at a low level, because that runs counter to minimizing the overhead of setting it up.
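A sketch of that division of labor - a hypothetical kernel() standing in for the hand-optimized routine, driven from plain C with pthreads:

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define N (1 << 20)

    static float data[N];

    /* Stand-in for the hand-optimized routine; in the scenario above this
     * would be the assembly kernel, called on a slice of the data. */
    static void kernel(float *p, long count)
    {
        for (long i = 0; i < count; ++i)
            p[i] = p[i] * 2.0f + 1.0f;
    }

    struct slice { float *base; long count; };

    static void *worker(void *arg)
    {
        struct slice *s = arg;
        kernel(s->base, s->count);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        struct slice s[NTHREADS];
        long chunk = N / NTHREADS;

        for (int t = 0; t < NTHREADS; ++t) {
            s[t].base = data + t * chunk;
            s[t].count = chunk;
            pthread_create(&tid[t], NULL, worker, &s[t]);
        }
        for (int t = 0; t < NTHREADS; ++t)
            pthread_join(tid[t], NULL);

        printf("done: data[0] = %f\n", data[0]);
        return 0;
    }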

> But you also are often starting with the output of a good compiler.

No, nobody does this. I mean maybe if you're just learning assembly. Starting with the compiler-generated garbage does not help you other than maybe by giving you a benchmark to beat.


> The main advantage of compilers is that the optimizations scale across a large codebase through inlining for example.

This. You will die of natural causes well before beating the compiler en masse on a number of codebases I've worked with. And even when dealing with the tight inner loop, I often optimize not so much to beat the optimizer as to aid it.

There are plenty of obscenely low level optimizations - optimizations I'd argue are operating at an even lower level than assembly - that one can apply without actually resorting to assembly. Avoiding cache line aliasing, for example:

https://www.pvk.ca/Blog/2012/07/30/binary-search-is-a-pathol...
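A small example of the kind of thing that article describes: when the row stride is an exact power of two, a column walk keeps hitting the same cache sets; a little padding (the +8 doubles here is an illustrative, untuned value) breaks the pattern without touching a line of assembly:

    #include <stdio.h>

    #define N   1024
    #define PAD 8   /* one 64-byte cache line of padding per row */

    static double grid[N][N + PAD];

    double column_sum(int col)
    {
        double s = 0.0;
        for (int row = 0; row < N; ++row)
            s += grid[row][col];   /* stride is (N + PAD) * 8 bytes, not N * 8 */
        return s;
    }

    int main(void)
    {
        printf("%f\n", column_sum(3));
        return 0;
    }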


I agree with the gist of your post, but the pre-optimised libraries are often optimised for the general case. It can be quite worthwhile to dive deeper in some cases. Replacing, e.g., the library malloc with your own pool-based allocator, or a general hash table with a hand-rolled one that exploits certain knowledge about the data, can be a huge win.
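A minimal sketch of the pool-allocator idea (fixed 64-byte objects; the names, sizes, and the lack of alignment or thread-safety handling are all illustrative shortcuts):

    #include <stdio.h>
    #include <stdlib.h>

    /* One malloc up front, then O(1) alloc/free from an intrusive free list. */
    typedef union node { union node *next; char payload[64]; } node;
    typedef struct { node *free_list; node *storage; } pool;

    int pool_init(pool *p, size_t count)
    {
        p->storage = malloc(count * sizeof(node));
        if (!p->storage) return -1;
        for (size_t i = 0; i + 1 < count; ++i)
            p->storage[i].next = &p->storage[i + 1];
        p->storage[count - 1].next = NULL;
        p->free_list = p->storage;
        return 0;
    }

    void *pool_alloc(pool *p)              /* O(1), no locking, no syscall */
    {
        node *n = p->free_list;
        if (n) p->free_list = n->next;
        return n;
    }

    void pool_free(pool *p, void *ptr)     /* O(1) */
    {
        node *n = ptr;
        n->next = p->free_list;
        p->free_list = n;
    }

    int main(void)
    {
        pool p;
        if (pool_init(&p, 1024) != 0) return 1;
        void *obj = pool_alloc(&p);
        pool_free(&p, obj);
        printf("allocated and freed one 64-byte object\n");
        free(p.storage);
        return 0;
    }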


You're right, but what you're describing seems to be on the order of "algorithmic changes" or "data structure changes" that make the code more suited to your use case.

It seems to me that this is a higher level than what I was addressing: "I'll write my code in assembly to get it to run faster". Unless you're writing your pool-based allocator in assembly ;)


It's less that "you can't beat the compiler" and more that "trying to beat the compiler can backfire if you don't take a great deal of care". Some optimization advice doesn't apply to current compilers or systems, and can produce actively worse results. The worst-case scenario isn't "you did no better than the compiler", it's "you did much worse than the compiler".



