It's totally ridiculous - I spent several years coding almost exclusively in x86 assembly, and the notion that a compiler could best a human at optimizing a specific function is ignorant. The only exception I can think of is so-called 'superoptimization', where the optimal sequence of instructions is determined by exhaustively testing every combination. And that's not an effective strategy for anything beyond a handful of instructions.
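For anyone unfamiliar, a superoptimizer really is just exhaustive enumeration plus an equivalence check. Here's a toy sketch of the idea in C - the single-register instruction set and the target function are made up for illustration, and it only spot-checks candidates against test inputs, whereas a real superoptimizer would actually prove equivalence:

    /* Toy brute-force superoptimizer: enumerate every sequence of single-register
       ops up to length 4 and print the shortest one that matches the target
       function on a set of test inputs. (A real superoptimizer would prove
       equivalence, e.g. exhaustively or with a solver, rather than spot-check.) */
    #include <stdint.h>
    #include <stdio.h>

    enum op { NEG, NOT, INC, DEC, SHL1, SHR1, NOPS };
    static const char *name[] = { "neg", "not", "inc", "dec", "shl1", "shr1" };

    static int32_t apply(enum op o, int32_t x)
    {
        switch (o) {
        case NEG:  return -x;
        case NOT:  return ~x;
        case INC:  return x + 1;
        case DEC:  return x - 1;
        case SHL1: return (int32_t)((uint32_t)x << 1);
        case SHR1: return (int32_t)((uint32_t)x >> 1);
        default:   return x;
        }
    }

    /* Made-up target: the obvious code for 2*x + 3 takes four ops. */
    static int32_t target(int32_t x) { return 2 * x + 3; }

    int main(void)
    {
        const int32_t tests[] = { -7, -1, 0, 1, 2, 13, 1000 };
        const int ntests = sizeof tests / sizeof tests[0];

        for (int len = 1; len <= 4; ++len) {
            long total = 1;
            for (int i = 0; i < len; ++i) total *= NOPS;   /* NOPS^len candidates */

            for (long code = 0; code < total; ++code) {
                enum op seq[4];
                long c = code;
                for (int i = 0; i < len; ++i) { seq[i] = (enum op)(c % NOPS); c /= NOPS; }

                int ok = 1;
                for (int t = 0; t < ntests && ok; ++t) {
                    int32_t v = tests[t];
                    for (int i = 0; i < len; ++i) v = apply(seq[i], v);
                    ok = (v == target(tests[t]));
                }
                if (ok) {
                    printf("length-%d sequence:", len);
                    for (int i = 0; i < len; ++i) printf(" %s", name[seq[i]]);
                    printf("\n");   /* finds inc, shl1, inc: 2*(x+1)+1 */
                    return 0;
                }
            }
        }
        puts("nothing found up to length 4");
        return 0;
    }

Even this toy version has to churn through up to 6^len candidates per length just to find a three-instruction answer; with a real register file, memory, and hundreds of opcodes the search space explodes, which is why it doesn't scale past a handful of instructions.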
Technically you could imagine some intelligent-agent-based search algorithm that finds the near-optimal sequence of instructions for a given statement of some intermediate language, employing heuristics derived from thousands of years of deep learning to get the search time down to something reasonable (say hours or days instead of eons). But with the pressure on compilers today leaning towards everything being JIT'ed on the fly, I don't think we're likely to ever see it. It's just the old "sufficiently smart compiler" myth.
Depends entirely on the domain. Sorry, but I have never seen a programmer come up with the kinds of parallelizing transforms, cache blocking, iteration reordering, etc., that most polyhedral optimizers do.
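To make that concrete, here's roughly what cache blocking plus iteration reordering looks like when you apply it by hand to a plain matrix multiply. This is only an illustration, not actual polyhedral-optimizer output, and the tile size is a placeholder you'd have to tune:

    /* Cache-blocked matrix multiply, C = C + A * Bm, all n x n row-major.
       TILE is a made-up tile size; tune it so the blocks fit in cache. */
    #include <stddef.h>

    #define TILE 64

    void matmul_blocked(size_t n, const double *A, const double *Bm, double *C)
    {
        for (size_t ii = 0; ii < n; ii += TILE)
            for (size_t kk = 0; kk < n; kk += TILE)
                for (size_t jj = 0; jj < n; jj += TILE)
                    /* work inside one tile so the blocks of A, Bm and C stay hot in cache */
                    for (size_t i = ii; i < ii + TILE && i < n; ++i)
                        for (size_t k = kk; k < kk + TILE && k < n; ++k) {
                            double a = A[i * n + k];
                            for (size_t j = jj; j < jj + TILE && j < n; ++j)
                                C[i * n + j] += a * Bm[k * n + j];
                        }
    }

A polyhedral optimizer derives this kind of tiling, interchange, and (where legal) parallelization automatically from the dependence structure of the loop nest; doing it by hand for every loop nest in a real codebase is the part I've never seen a programmer actually follow through on.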
Now, if you aren't doing these kinds of things, and are just trying to optimize the hot loop of some simple program, yeah, you can definitely win given enough time, because you'll just sit there with IACA or whatever, and superoptimize it by hand.
But you also are often starting with the output of a good compiler. If you had to start with nothing, I doubt you would do as well.
You speak as though all low-level optimization is alignment, unrolling, pairing, etc. These are basically micro-optimizations that yield negligible gains unless applied over a large code base. That Pluto output is just a wall of code because it has been unrolled several times; it's not any sort of impressive optimization achievement.
A human would probably convert this to fixed point, convert the entire innermost loop into a couple of address operations and a single fused multiply-add, and process the array 64 bytes per iteration without unrolling anything. At that point it's probably 10-20x more efficient than that slab of polyhedral bullshit, and finally he'd come back and carefully pad here and there to avoid misalignment penalties.
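The shape I have in mind for that inner loop looks something like this, sketched with AVX-512 intrinsics rather than raw assembly. It's a hypothetical float multiply-add kernel, not the fixed-point version, and the function name is made up, but it shows one FMA consuming 64 bytes per iteration with no unrolling:

    /* Hypothetical hot loop: y[i] += a * x[i], 16 floats (64 bytes) per iteration.
       Build with AVX-512 support, e.g. -mavx512f. */
    #include <immintrin.h>
    #include <stddef.h>

    void axpy64(float *restrict y, const float *restrict x, float a, size_t n)
    {
        __m512 va = _mm512_set1_ps(a);              /* broadcast the scalar once */
        size_t i = 0;
        for (; i + 16 <= n; i += 16) {              /* 64 bytes per trip, no unrolling */
            __m512 vx = _mm512_loadu_ps(x + i);
            __m512 vy = _mm512_loadu_ps(y + i);
            vy = _mm512_fmadd_ps(va, vx, vy);       /* single fused multiply-add */
            _mm512_storeu_ps(y + i, vy);
        }
        for (; i < n; ++i)                          /* scalar tail */
            y[i] += a * x[i];
    }

Compare that to the unrolled wall of scalar code a compiler spits out for the same y[i] += a * x[i] loop and you see where the gap comes from.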
As for optimizing for 4 cores - you take your shiny hand-polished assembly routine and spin it up on 4 threads, most likely from high-level code, since you're talking to the OS to create threads, synchronize them, etc. It's not wise to chase parallelism at a low level, because that runs counter to minimizing the overhead of setting it up.
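In practice that high-level wrapper is about this much code. A sketch with POSIX threads, where kernel() is a plain-C stand-in for the hypothetical hand-written routine and the even 4-way split is just for illustration:

    #include <pthread.h>
    #include <stddef.h>

    /* Stand-in for the hand-polished routine (which would really be assembly). */
    static void kernel(float *dst, const float *src, size_t n)
    {
        for (size_t i = 0; i < n; ++i)
            dst[i] = src[i] * 2.0f;
    }

    typedef struct { float *dst; const float *src; size_t n; } chunk_t;

    static void *worker(void *arg)
    {
        chunk_t *c = arg;
        kernel(c->dst, c->src, c->n);
        return NULL;
    }

    /* Split the array into 4 chunks and run the routine on 4 threads. */
    void run_parallel(float *dst, const float *src, size_t n)
    {
        pthread_t tid[4];
        chunk_t chunk[4];
        size_t per = n / 4;

        for (int t = 0; t < 4; ++t) {
            chunk[t] = (chunk_t){ dst + t * per, src + t * per,
                                  (t == 3) ? n - 3 * per : per };  /* last chunk takes the remainder */
            pthread_create(&tid[t], NULL, worker, &chunk[t]);
        }
        for (int t = 0; t < 4; ++t)
            pthread_join(tid[t], NULL);
    }

All the parallelism lives in ordinary C; the low-level effort stays where it pays off, inside kernel().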
> But you also are often starting with the output of a good compiler.
No, nobody does this. I mean, maybe if you're just learning assembly. Starting with the compiler-generated garbage doesn't help you, other than maybe giving you a benchmark to beat.