One thing that Intel and AMD do better than any other player in the industry is branch prediction. An absolutely stupefying amount of die area is dedicated to it on x86. Combine that with massive speculative execution resources and you can get decent ILP even out of code that's ridiculously hostile to ILP.
Our modern CPU cores have hundreds of instructions in flight at any one moment because of how deep their OoO execution goes. You can only go that deep out of order if branch prediction is accurate enough not to choke it.
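As a back-of-the-envelope illustration (the window size and branch density here are made-up round numbers, not any particular core's): with ~200 instructions in flight and a branch roughly every 5 instructions, the odds that the whole window is on the correct path fall off sharply as accuracy drops.

```c
/* Back-of-the-envelope: probability that every branch in the OoO window
 * is predicted correctly, for a hypothetical 200-instruction window with
 * a branch roughly every 5 instructions (both numbers are illustrative). */
#include <math.h>
#include <stdio.h>

int main(void) {
    const int branches_in_flight = 200 / 5;            /* ~40 branches */
    const double accuracies[] = { 0.95, 0.99, 0.997 };

    for (size_t i = 0; i < sizeof accuracies / sizeof accuracies[0]; i++) {
        double p_all_correct = pow(accuracies[i], branches_in_flight);
        printf("accuracy %.1f%% -> window fully on the correct path %.1f%% of the time\n",
               accuracies[i] * 100.0, p_all_correct * 100.0);
    }
    return 0;
}
```

At 99% accuracy the window is fully correct only about two thirds of the time; at 99.7% it's closer to 90%.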
> An absolutely stupefying amount of die area is dedicated to it on x86.
Yep. For example, on this die shot of a Skylake-X core,[0] you can see the branch predictor is about the same area as a single vector execution port (about 8% of the non-cache area).
> One thing that Intel and AMD do better than any other player in the industry is branch prediction. An absolutely stupefying amount of die area is dedicated to it on x86.
Zen in particular combines an L1 perceptron predictor with an L2 TAGE[0] predictor[1]. TAGE requires an immense amount of silicon, but it achieves something like 99.7% prediction accuracy, which is... crazy. The perceptron predictor is almost as good, at around 99.5%.
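For anyone curious what "perceptron predictor" means concretely, here's a minimal C sketch of the Jiménez–Lin style design (table size, history length, and training threshold are illustrative, not Zen's actual configuration): each branch hashes to a weight vector, the prediction is the sign of a dot product with the global history, and training nudges the weights when the prediction is wrong or low-confidence.

```c
/* Minimal perceptron branch predictor sketch. Sizes are illustrative. */
#include <stdint.h>
#include <stdbool.h>

#define HIST_LEN  32                               /* global history bits per perceptron */
#define NUM_PERC  1024                             /* number of perceptron entries       */
#define THRESHOLD ((int)(1.93 * HIST_LEN + 14))    /* training threshold from the paper  */

static int8_t   weights[NUM_PERC][HIST_LEN + 1];   /* [0] is the bias weight */
static uint64_t ghist;                             /* global branch history  */

static bool predict(uint64_t pc, int *out_y) {
    int8_t *w = weights[pc % NUM_PERC];
    int y = w[0];                                  /* bias */
    for (int i = 0; i < HIST_LEN; i++)
        y += ((ghist >> i) & 1) ? w[i + 1] : -w[i + 1];
    *out_y = y;
    return y >= 0;                                 /* taken if the dot product is >= 0 */
}

static void train(uint64_t pc, bool taken, bool predicted, int y) {
    int8_t *w = weights[pc % NUM_PERC];
    /* Train on a mispredict or when the output is low-confidence. */
    if (predicted != taken || (y < THRESHOLD && y > -THRESHOLD)) {
        int t = taken ? 1 : -1;
        if (w[0] + t >= -128 && w[0] + t <= 127) w[0] += t;
        for (int i = 0; i < HIST_LEN; i++) {
            int x  = ((ghist >> i) & 1) ? 1 : -1;
            int nw = w[i + 1] + t * x;
            if (nw >= -128 && nw <= 127) w[i + 1] = (int8_t)nw;
        }
    }
    ghist = (ghist << 1) | (taken ? 1 : 0);        /* shift in the actual outcome */
}
```

The appeal is that it can learn correlations across long histories with storage that grows linearly in the history length, where a plain counter table would need storage exponential in it.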
I wrote a software TAGE predictor, but sadly it didn't perform as well as predicted (heh) by the authors of the paper.
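For anyone who wants the flavor of it, a heavily simplified TAGE-style sketch (nothing like a tuned implementation from Seznec's papers; sizes and hashes here are made up) goes roughly like this: a bimodal base table plus tagged tables indexed by the PC hashed with geometrically longer slices of global history, where the longest-history matching table provides the prediction and a mispredict allocates into a longer table.

```c
/* Heavily simplified TAGE-style sketch. Assumes update() is called right
 * after predict() for the same branch, before the history is shifted. */
#include <stdint.h>
#include <stdbool.h>

#define NTABLES 4
#define TBITS   10                      /* 1024 entries per tagged table */
#define TSIZE   (1u << TBITS)

static const int hist_len[NTABLES] = { 8, 16, 32, 64 };  /* geometric-ish lengths */

struct entry { uint16_t tag; int8_t ctr; uint8_t useful; };

static int8_t       bimodal[4096];      /* base predictor counters */
static struct entry table[NTABLES][TSIZE];
static uint64_t     ghist;

/* Fold the low `len` history bits and the PC into a small index/tag. */
static uint32_t mix(uint64_t pc, int len, int bits) {
    uint64_t h = (len < 64) ? (ghist & ((1ULL << len) - 1)) : ghist;
    uint64_t r = pc ^ (pc >> bits);
    for (int i = 0; i < len; i += bits)
        r ^= (h >> i);
    return (uint32_t)(r & ((1u << bits) - 1));
}

static bool predict(uint64_t pc, int *provider) {
    *provider = -1;
    for (int t = NTABLES - 1; t >= 0; t--) {        /* longest history first */
        uint32_t idx = mix(pc, hist_len[t], TBITS);
        uint16_t tag = (uint16_t)mix(pc * 31, hist_len[t], 12);
        if (table[t][idx].tag == tag) { *provider = t; return table[t][idx].ctr >= 0; }
    }
    return bimodal[pc % 4096] >= 0;                 /* fall back to the base table */
}

static void update(uint64_t pc, bool taken, bool predicted, int provider) {
    int dir = taken ? 1 : -1;
    if (provider >= 0) {
        uint32_t idx = mix(pc, hist_len[provider], TBITS);
        struct entry *e = &table[provider][idx];
        int nc = e->ctr + dir;
        if (nc >= -4 && nc <= 3) e->ctr = (int8_t)nc;
        if (predicted == taken && e->useful < 3) e->useful++;
    } else {
        int nb = bimodal[pc % 4096] + dir;
        if (nb >= -2 && nb <= 1) bimodal[pc % 4096] = (int8_t)nb;
    }
    /* On a mispredict, try to allocate an entry in a longer-history table. */
    if (predicted != taken) {
        for (int t = provider + 1; t < NTABLES; t++) {
            uint32_t idx = mix(pc, hist_len[t], TBITS);
            if (table[t][idx].useful == 0) {
                table[t][idx].tag = (uint16_t)mix(pc * 31, hist_len[t], 12);
                table[t][idx].ctr = taken ? 0 : -1;   /* weakly taken / not-taken */
                break;
            }
        }
    }
    ghist = (ghist << 1) | (taken ? 1 : 0);
}
```

The real predictors add an alternate prediction, "useful" counter aging, and carefully chosen folded-history hashing; those details are where a lot of the accuracy comes from, which may be why a straightforward reimplementation underperforms the paper.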
Everything is relative. They do things that seemed quite neat in the 90s, but then progress slowed to a crawl.
I'd call the state of the field quite bad. For example, compilers do embarrassingly little to help with the two main bottlenecks we've had for a long time: concurrency and data layout optimization. And even under the naive machine model (one CPU, free memory), there is so much potentially automatable manual toil in doing semantics-based transformations during performance work that it's not even funny.
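To make the data-layout point concrete (generic C, not aimed at any particular compiler): when a hot loop touches one field of a struct, the array-of-structs layout drags mostly cold data through the cache, and no mainstream compiler will restructure your types for you; the struct-of-arrays rewrite is still manual toil.

```c
/* Illustration of the data-layout problem: the hot loop only reads `x`,
 * but with array-of-structs most of each cache line fetched is cold fields. */
#include <stddef.h>

struct particle_aos { float x, y, z, mass; int flags; };

/* Array-of-structs: ~16 bytes of unrelated data fetched per useful float. */
float sum_x_aos(const struct particle_aos *p, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p[i].x;
    return s;
}

/* Struct-of-arrays: the same loop streams through a dense float array,
 * which is cache-friendly and much easier to vectorize. The compiler
 * cannot perform this layout change for you. */
struct particles_soa { float *x, *y, *z, *mass; int *flags; };

float sum_x_soa(const struct particles_soa *p, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p->x[i];
    return s;
}
```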
A large part of it is that we use languages that don't support these kinds of optimizations. The story isn't just "C compiler improvements hit a wall"; it continues "...and we didn't develop and migrate to languages whose semantics allow those optimizations." (There's a little of this in the GPU world, but the proprietary infighting there has produced a dev experience and app platform so bad that very few apps outside of games venture there.)
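One small, well-worn example of what "semantics allow optimizations" means (plain C, purely illustrative): without extra information the compiler must assume the two pointers may alias, so it either vectorizes behind a runtime overlap check or not at all; `restrict` is the programmer supplying the guarantee that a language with stronger aliasing rules would hand the optimizer by default.

```c
#include <stddef.h>

/* The compiler must assume dst and src can overlap: a store to dst[i]
 * could feed a later load of src[i+1], so vectorization needs either a
 * runtime overlap check or has to be abandoned. */
void scale_may_alias(float *dst, const float *src, float k, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* `restrict` promises no overlap, so the loop can be vectorized freely. */
void scale_no_alias(float *restrict dst, const float *restrict src,
                    float k, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}
```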
There's a whole alternative path of processor history not taken, one in which VLIW had panned out instead of failing because of over-optimism about compiler optimizations.
They do a good job, but the scheduling aspects are really fuzzy.
LLVM and GCC both have models of out-of-order pipelines, but beyond making sure they don't stall the instruction decoder, it's hazy whether those models actually do anything.
The processors themselves are designed around the code the compilers emit and vice versa.
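A concrete way to see why static scheduling buys so little on these cores: the OoO hardware reorders independent work on its own, so what's left is the dependency structure of the source, which the compiler usually can't touch (reassociating floating-point adds changes the result, so it won't do it without something like -ffast-math). A sketch:

```c
#include <stddef.h>

/* One serial dependency chain: every add waits on the previous one,
 * no matter how cleverly the instructions are scheduled. */
double sum_one_chain(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent chains: the OoO core can keep several FP adds in
 * flight at once, without any help from the compiler's pipeline model. */
double sum_four_chains(const double *a, size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++) s0 += a[i];      /* remainder */
    return (s0 + s1) + (s2 + s3);
}
```

On a wide OoO core the four-accumulator version can run several times faster, and none of that comes from the compiler's scheduling model; it comes from handing the hardware independent work to reorder, which is roughly the sense in which it's hazy whether those models do anything.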
Optimizing compilers aren't that great on x86. Sure, they're good enough to make something run at 60fps that didn't before, but they don't really have much x86-specific knowledge.
Perhaps they do a good job on popular platforms like x86, because we can encode decades of experience, but not so great on brand new ones.