
I'd say a good thing to add would be that the lion's share of progress in the last 5 years has been around cache architectures.

Everything described in the article, like superscalar and out-of-order execution, had been squeezed to its practical maximum by around the early Core 2 Duo era, with most later advances coming without qualitative architectural improvements.

In that regard, Apple's recent chips have gotten quite far. They reached near-desktop-level performance without super complex predictors, on-chip op reordering, or gigantic pipelines.

Yes, their latest chip has quite a sizeable pipeline, and total on-chip cache comparable to low-end server CPUs, but their distinction is that they managed to improve cache usage efficiency immensely. A big cache wouldn't do much for performance if you have to flush it frequently. In fact, the average cache flush frequency is what determines where diminishing returns start with respect to cache size.
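To make the flush point concrete, here is a toy model (all parameters invented for illustration, nothing Apple-specific): a direct-mapped cache that gets completely flushed every FLUSH_INTERVAL accesses. Once the cache is large enough to hold everything touched between two flushes, extra capacity stops improving the hit rate.

  /* Toy model: hit rate of a direct-mapped cache that is completely
     flushed every FLUSH_INTERVAL accesses. All parameters are invented
     for illustration, not taken from any real chip. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <stdint.h>

  #define LINE_SIZE      64
  #define FLUSH_INTERVAL 20000          /* accesses between full flushes */
  #define N_ACCESSES     1000000
  #define WORKING_SET    (8u << 20)     /* 8 MiB of randomly touched data */

  static double hit_rate(size_t cache_bytes)
  {
      size_t n_sets = cache_bytes / LINE_SIZE;
      /* one "current line" per set; all-ones means empty */
      uint64_t *tags = malloc(n_sets * sizeof *tags);
      memset(tags, 0xff, n_sets * sizeof *tags);

      unsigned hits = 0;
      uint64_t lcg = 1;
      for (unsigned i = 0; i < N_ACCESSES; i++) {
          if (i % FLUSH_INTERVAL == 0)                /* periodic full flush */
              memset(tags, 0xff, n_sets * sizeof *tags);
          lcg = lcg * 6364136223846793005ull + 1;     /* pseudo-random address */
          uint64_t line = (lcg % WORKING_SET) / LINE_SIZE;
          size_t   set  = (size_t)(line % n_sets);    /* direct-mapped index */
          if (tags[set] == line) hits++;
          else tags[set] = line;
      }
      free(tags);
      return (double)hits / N_ACCESSES;
  }

  int main(void)
  {
      for (size_t kib = 32; kib <= 16384; kib *= 2)
          printf("%6zu KiB cache -> hit rate %.3f\n", kib, hit_rate(kib * 1024));
      return 0;
  }

The plateau in the output appears once capacity exceeds what can be refilled between flushes - exactly the diminishing-returns point that the flush frequency sets.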




Apple CPUs are quite sophisticated wide and deep OoO brainiac designs with state-of-the-art branch predictors.

There is nothing simple about them. The only reason they don't reach desktop-level performance is that the architecture has been optimized for a lower frequency target to limit power consumption.

A desktop-optimized design would probably be slightly narrower (so that decoding is feasible within a smaller time budget) and possibly deeper to accommodate the higher memory latency. Having said that, the last generation is not very far from reasonable desktop frequencies and might work as-is.


Compare die shots of the two. Even after you correct for the density provided by the 7nm process, the A12's predictor is a few times smaller than that of recent Intel Core i cores.


5 minutes of Googling didn't return any image of either Skylake or A12 die shots with labelled predictors. Do you have any pointers?

Also, I know nothing about the details, but I expect that most of the predictor consists of CAM memory used to store the historical information. I doubt that, without internal knowledge, it is possible to distinguish it reliably from other internal memories.


CAM is expensive and requires some kind of replacement scheduling logic. I believe that branch predictors are still implemented as straight one-way associative (direct-mapped) RAM, often even without any kind of tagging, and the only true CAM in the CPU core is the TLB.
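For what it's worth, the textbook version of that idea looks roughly like this (a generic bimodal predictor sketch, not any particular vendor's design): an untagged, direct-indexed table of 2-bit saturating counters, with no lookup tags and no replacement logic at all.

  /* Textbook bimodal branch predictor: an untagged, direct-indexed table
     of 2-bit saturating counters. No CAM, no tags, no replacement policy;
     two branches that alias to the same entry simply share it. */
  #include <stdio.h>
  #include <stdint.h>
  #include <stdbool.h>

  #define PRED_BITS 12                      /* 4096 entries; size is arbitrary */
  #define PRED_SIZE (1u << PRED_BITS)

  static uint8_t counters[PRED_SIZE];       /* 0..3: strongly/weakly (not) taken */

  static unsigned pred_index(uint64_t pc)
  {
      return (unsigned)((pc >> 2) & (PRED_SIZE - 1));  /* just mask low PC bits */
  }

  /* Predict taken if the counter is in the upper half (2 or 3). */
  static bool predict(uint64_t pc)
  {
      return counters[pred_index(pc)] >= 2;
  }

  /* Train after the branch resolves: saturating increment/decrement. */
  static void train(uint64_t pc, bool taken)
  {
      uint8_t *c = &counters[pred_index(pc)];
      if (taken) { if (*c < 3) (*c)++; }
      else       { if (*c > 0) (*c)--; }
  }

  int main(void)
  {
      /* Toy demo: a branch at a made-up PC that is taken 7 times, then not. */
      uint64_t pc = 0x400123;
      for (int iter = 0; iter < 3; iter++)
          for (int i = 0; i < 8; i++) {
              bool actual = (i < 7);
              printf("predict %d, actual %d\n", predict(pc), actual);
              train(pc, actual);
          }
      return 0;
  }

A CAM-based structure would instead compare the incoming PC against every stored tag and need an eviction policy on a miss; the table above has neither, which is the parent's point.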


Interesting. Is the improvement in performance mostly due to improvements in cache size, count, and speed; did fine-tuning the cache parameters (number of caches, cache size, cache line size) help; or are there more fundamental architectural improvements? Do you have links to more information?


Yes, fine-tuned and fast caches are what Apple has been going for over the last few generations. AMD Ryzen also got much faster caches than their previous-gen chips. Most importantly, fine-tuned caches don't come as a power/performance trade-off - they are simply better.
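If anyone wants to see what "fast caches" means on their own machine, the usual trick is a dependent pointer chase over a shuffled buffer of growing size; the steps in ns/load roughly mark the L1/L2/L3/DRAM boundaries. A rough sketch (not a rigorous benchmark; buffer sizes and iteration counts are arbitrary, compile with -O2):

  /* Rough sketch: average load-to-use latency via a dependent pointer
     chase over a randomly shuffled ring, for growing working-set sizes. */
  #define _POSIX_C_SOURCE 199309L
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  static double chase_ns_per_load(size_t bytes)
  {
      size_t n = bytes / sizeof(void *);
      void **ring = malloc(n * sizeof(void *));
      size_t *order = malloc(n * sizeof(size_t));

      /* Random permutation so the hardware prefetchers can't follow it. */
      for (size_t i = 0; i < n; i++) order[i] = i;
      for (size_t i = n - 1; i > 0; i--) {
          size_t j = (size_t)rand() % (i + 1);
          size_t t = order[i]; order[i] = order[j]; order[j] = t;
      }
      /* Link the elements into one big cycle following the permutation. */
      for (size_t i = 0; i < n; i++)
          ring[order[i]] = &ring[order[(i + 1) % n]];
      free(order);

      /* Every load depends on the previous one, so latency can't be hidden. */
      const size_t iters = 5 * 1000 * 1000;
      void **p = ring;
      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (size_t i = 0; i < iters; i++)
          p = (void **)*p;
      clock_gettime(CLOCK_MONOTONIC, &t1);

      volatile void *sink = p; (void)sink;   /* keep the chain live */
      free(ring);

      double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
      return ns / (double)iters;
  }

  int main(void)
  {
      srand(42);
      for (size_t kib = 16; kib <= 64 * 1024; kib *= 2)
          printf("%7zu KiB: %6.2f ns/load\n", kib, chase_ns_per_load(kib * 1024));
      return 0;
  }

Because every load depends on the previous one, the prefetchers and the OoO machinery can't hide the latency, so what you measure is close to the raw load-to-use latency of whichever level the buffer fits in.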

Moreover, as litho generations progress to 10nm and beyond, the difference in power consumption between working and idle transistors gets so small that the traditional assumption that "a slow chip is also a low-power chip" no longer holds. You are better off, power-wise, getting your IPC up and your IO-wait down.

The best analysis of A12 cache performance written in accessible language that I know of is this piece: https://www.anandtech.com/show/13392/the-iphone-xs-xs-max-re...



