It's totally ridiculous - I spent several years coding almost exclusively in x86 assembly, and the notion that a compiler could best a human at optimizing a specific function is ignorant. The only exception I can think of is so-called 'superoptimization', where the optimal sequence of instructions is determined by exhaustively testing every combination. And that's not an effective strategy for anything beyond a handful of instructions.
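For anyone unfamiliar, a superoptimizer really is just exhaustive enumeration plus an equivalence check. Here's a toy sketch of the idea in C - the single-register instruction set and the target function are made up for illustration, and it only spot-checks candidates against test inputs, whereas a real superoptimizer would actually prove equivalence:

    /* Toy brute-force superoptimizer: enumerate every sequence of single-register
       ops up to length 4 and print the shortest one that matches the target
       function on a set of test inputs. (A real superoptimizer would prove
       equivalence, e.g. exhaustively or with a solver, rather than spot-check.) */
    #include <stdint.h>
    #include <stdio.h>

    enum op { NEG, NOT, INC, DEC, SHL1, SHR1, NOPS };
    static const char *name[] = { "neg", "not", "inc", "dec", "shl1", "shr1" };

    static int32_t apply(enum op o, int32_t x)
    {
        switch (o) {
        case NEG:  return -x;
        case NOT:  return ~x;
        case INC:  return x + 1;
        case DEC:  return x - 1;
        case SHL1: return (int32_t)((uint32_t)x << 1);
        case SHR1: return (int32_t)((uint32_t)x >> 1);
        default:   return x;
        }
    }

    /* Made-up target: the obvious code for 2*x + 3 takes four ops. */
    static int32_t target(int32_t x) { return 2 * x + 3; }

    int main(void)
    {
        const int32_t tests[] = { -7, -1, 0, 1, 2, 13, 1000 };
        const int ntests = sizeof tests / sizeof tests[0];

        for (int len = 1; len <= 4; ++len) {
            long total = 1;
            for (int i = 0; i < len; ++i) total *= NOPS;   /* NOPS^len candidates */

            for (long code = 0; code < total; ++code) {
                enum op seq[4];
                long c = code;
                for (int i = 0; i < len; ++i) { seq[i] = (enum op)(c % NOPS); c /= NOPS; }

                int ok = 1;
                for (int t = 0; t < ntests && ok; ++t) {
                    int32_t v = tests[t];
                    for (int i = 0; i < len; ++i) v = apply(seq[i], v);
                    ok = (v == target(tests[t]));
                }
                if (ok) {
                    printf("length-%d sequence:", len);
                    for (int i = 0; i < len; ++i) printf(" %s", name[seq[i]]);
                    printf("\n");   /* finds inc, shl1, inc: 2*(x+1)+1 */
                    return 0;
                }
            }
        }
        puts("nothing found up to length 4");
        return 0;
    }

Even this toy version has to churn through up to 6^len candidates per length just to find a three-instruction answer; with a real register file, memory, and hundreds of opcodes the search space explodes, which is why it doesn't scale past a handful of instructions.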
Technically you could imagine some intelligent-agent-based search algorithm that finds the near-optimal sequence of instructions for a given statement of some intermediate language, employing heuristics derived from thousands of years of deep learning to get the search time down to something reasonable (say hours or days instead of eons). But with the pressure on compilers today leaning towards everything being JIT'ed on the fly, I don't think we're likely to ever see it. It's just the old "sufficiently smart compiler" myth.
Depends entirely on the domain. Sorry, but I have never seen a programmer come up with the kinds of parallelizing transforms, cache blocking, iteration reordering, etc., that most polyhedral optimizers do.
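To make that concrete, here's roughly what cache blocking plus iteration reordering looks like when you apply it by hand to a plain matrix multiply. This is only an illustration, not actual polyhedral-optimizer output, and the tile size is a placeholder you'd have to tune:

    /* Cache-blocked matrix multiply, C = C + A * Bm, all n x n row-major.
       TILE is a made-up tile size; tune it so the blocks fit in cache. */
    #include <stddef.h>

    #define TILE 64

    void matmul_blocked(size_t n, const double *A, const double *Bm, double *C)
    {
        for (size_t ii = 0; ii < n; ii += TILE)
            for (size_t kk = 0; kk < n; kk += TILE)
                for (size_t jj = 0; jj < n; jj += TILE)
                    /* work inside one tile so the blocks of A, Bm and C stay hot in cache */
                    for (size_t i = ii; i < ii + TILE && i < n; ++i)
                        for (size_t k = kk; k < kk + TILE && k < n; ++k) {
                            double a = A[i * n + k];
                            for (size_t j = jj; j < jj + TILE && j < n; ++j)
                                C[i * n + j] += a * Bm[k * n + j];
                        }
    }

A polyhedral optimizer derives this kind of tiling, interchange, and (where legal) parallelization automatically from the dependence structure of the loop nest; doing it by hand for every loop nest in a real codebase is the part I've never seen a programmer actually follow through on.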
Now, if you aren't doing these kinds of things, and are just trying to optimize the hot loop of some simple program, yeah, you can definitely win given enough time, because you'll just sit there with IACA or whatever, and superoptimize it by hand.
But you also are often starting with the output of a good compiler. If you had to start with nothing, I doubt you would do as well.
You speak as though all low-level optimization is alignment, unrolling, pairing, etc. These are basically micro-optimizations that yield negligible gains unless applied over a large code base. That Pluto output is just a wall of code because it has been unrolled several times; it's not any sort of impressive optimization achievement.
A human would probably convert this to fixed point, convert the entire innermost loop into a couple of address operations and a single fused multiply-add, and process the array 64 bytes per iteration without unrolling anything. At that point it's probably 10-20x more efficient than that slab of polyhedral bullshit, and finally he'd come back and carefully pad here and there to avoid misalignment penalties.
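The shape I have in mind for that inner loop looks something like this, sketched with AVX-512 intrinsics rather than raw assembly. It's a hypothetical float multiply-add kernel, not the fixed-point version, and the function name is made up, but it shows one FMA consuming 64 bytes per iteration with no unrolling:

    /* Hypothetical hot loop: y[i] += a * x[i], 16 floats (64 bytes) per iteration.
       Build with AVX-512 support, e.g. -mavx512f. */
    #include <immintrin.h>
    #include <stddef.h>

    void axpy64(float *restrict y, const float *restrict x, float a, size_t n)
    {
        __m512 va = _mm512_set1_ps(a);              /* broadcast the scalar once */
        size_t i = 0;
        for (; i + 16 <= n; i += 16) {              /* 64 bytes per trip, no unrolling */
            __m512 vx = _mm512_loadu_ps(x + i);
            __m512 vy = _mm512_loadu_ps(y + i);
            vy = _mm512_fmadd_ps(va, vx, vy);       /* single fused multiply-add */
            _mm512_storeu_ps(y + i, vy);
        }
        for (; i < n; ++i)                          /* scalar tail */
            y[i] += a * x[i];
    }

Compare that to the unrolled wall of scalar code a compiler spits out for the same y[i] += a * x[i] loop and you see where the gap comes from.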
As for optimizing for 4 cores - you take your shiny hand-polished assembly routine and spin it up on 4 threads, most likely from high-level code, since you're talking to the OS to create threads, synchronize them, etc. It's not wise to chase parallelism at a low level, because that runs counter to minimizing the overhead of setting it up.
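In practice that high-level wrapper is about this much code. A sketch with POSIX threads, where kernel() is a plain-C stand-in for the hypothetical hand-written routine and the even 4-way split is just for illustration:

    #include <pthread.h>
    #include <stddef.h>

    /* Stand-in for the hand-polished routine (which would really be assembly). */
    static void kernel(float *dst, const float *src, size_t n)
    {
        for (size_t i = 0; i < n; ++i)
            dst[i] = src[i] * 2.0f;
    }

    typedef struct { float *dst; const float *src; size_t n; } chunk_t;

    static void *worker(void *arg)
    {
        chunk_t *c = arg;
        kernel(c->dst, c->src, c->n);
        return NULL;
    }

    /* Split the array into 4 chunks and run the routine on 4 threads. */
    void run_parallel(float *dst, const float *src, size_t n)
    {
        pthread_t tid[4];
        chunk_t chunk[4];
        size_t per = n / 4;

        for (int t = 0; t < 4; ++t) {
            chunk[t] = (chunk_t){ dst + t * per, src + t * per,
                                  (t == 3) ? n - 3 * per : per };  /* last chunk takes the remainder */
            pthread_create(&tid[t], NULL, worker, &chunk[t]);
        }
        for (int t = 0; t < 4; ++t)
            pthread_join(tid[t], NULL);
    }

All the parallelism lives in ordinary C; the low-level effort stays where it pays off, inside kernel().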
> But you also are often starting with the output of a good compiler.
No, nobody does this. I mean, maybe if you're just learning assembly. Starting with the compiler-generated garbage doesn't help you, other than maybe giving you a benchmark to beat.