
I'd say a good thing to add would be that the lion's share of progress in the last 5 years has been around cache architectures.

Everything described in the article, like superscalar and out-of-order execution, had been squeezed to its practical maximum by around the early Core 2 Duo era, with most later advances coming without qualitative architectural improvements.

In that regard, Apple's recent chips have gotten quite far. They reached near-desktop-level performance without super complex predictors, on-chip op reordering, or gigantic pipelines.

Yes, their latest chip has quite a sizeable pipeline, and total on-chip cache comparable to low-end server CPUs, but their distinction is that they managed to improve cache usage efficiency immensely. A big cache wouldn't do much for performance if you have to flush it frequently. In fact, the average cache flush frequency is what determines where diminishing returns start with respect to cache size.
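To make the flush point concrete, here is a toy model (all parameters invented for illustration, nothing Apple-specific): a direct-mapped cache that gets completely flushed every FLUSH_INTERVAL accesses. Once the cache is large enough to hold everything touched between two flushes, extra capacity stops improving the hit rate.

  /* Toy model: hit rate of a direct-mapped cache that is completely
     flushed every FLUSH_INTERVAL accesses. All parameters are invented
     for illustration, not taken from any real chip. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <stdint.h>

  #define LINE_SIZE      64
  #define FLUSH_INTERVAL 20000          /* accesses between full flushes */
  #define N_ACCESSES     1000000
  #define WORKING_SET    (8u << 20)     /* 8 MiB of randomly touched data */

  static double hit_rate(size_t cache_bytes)
  {
      size_t n_sets = cache_bytes / LINE_SIZE;
      /* one "current line" per set; all-ones means empty */
      uint64_t *tags = malloc(n_sets * sizeof *tags);
      memset(tags, 0xff, n_sets * sizeof *tags);

      unsigned hits = 0;
      uint64_t lcg = 1;
      for (unsigned i = 0; i < N_ACCESSES; i++) {
          if (i % FLUSH_INTERVAL == 0)                /* periodic full flush */
              memset(tags, 0xff, n_sets * sizeof *tags);
          lcg = lcg * 6364136223846793005ull + 1;     /* pseudo-random address */
          uint64_t line = (lcg % WORKING_SET) / LINE_SIZE;
          size_t   set  = (size_t)(line % n_sets);    /* direct-mapped index */
          if (tags[set] == line) hits++;
          else tags[set] = line;
      }
      free(tags);
      return (double)hits / N_ACCESSES;
  }

  int main(void)
  {
      for (size_t kib = 32; kib <= 16384; kib *= 2)
          printf("%6zu KiB cache -> hit rate %.3f\n", kib, hit_rate(kib * 1024));
      return 0;
  }

The plateau in the output appears once capacity exceeds what can be refilled between flushes - exactly the diminishing-returns point that the flush frequency sets.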




Apple CPUs are quite sophisticated wide and deep OoO brainiac designs with state-of-the-art branch predictors.

There is nothing simple about them. The only reason they don't reach desktop-level performance is that the architecture has been optimized for a lower frequency target to limit power consumption.

A desktop-optimized design would probably be slightly narrower (so that decoding is feasible within a smaller time budget) and possibly deeper to accommodate the higher memory latency. Having said that, the last generation is not very far from reasonable desktop frequencies and might work as-is.


Compare die shots of the two. Even after you correct for the density provided by the 7nm process, the A12's predictor is a few times smaller than that of recent Intel Core i cores.


5 minutes of Googling didn't return any image of either Skylake or A12 die shots with labelled predictors. Do you have any pointers?

Also, I know nothing about the details, but I expect that most of the predictor consists of CAM memory used to store the historical information. I doubt that, without internal knowledge, it is possible to distinguish it reliably from other internal memories.


CAM is expensive and requires some kind of replacement scheduling logic. I believe that branch predictors are still implemented as straight one-way associative (direct-mapped) RAM, often even without any kind of tagging, and the only true CAM in the CPU core is the TLB.
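For what it's worth, the textbook version of that idea looks roughly like this (a generic bimodal predictor sketch, not any particular vendor's design): an untagged, direct-indexed table of 2-bit saturating counters, with no lookup tags and no replacement logic at all.

  /* Textbook bimodal branch predictor: an untagged, direct-indexed table
     of 2-bit saturating counters. No CAM, no tags, no replacement policy;
     two branches that alias to the same entry simply share it. */
  #include <stdio.h>
  #include <stdint.h>
  #include <stdbool.h>

  #define PRED_BITS 12                      /* 4096 entries; size is arbitrary */
  #define PRED_SIZE (1u << PRED_BITS)

  static uint8_t counters[PRED_SIZE];       /* 0..3: strongly/weakly (not) taken */

  static unsigned pred_index(uint64_t pc)
  {
      return (unsigned)((pc >> 2) & (PRED_SIZE - 1));  /* just mask low PC bits */
  }

  /* Predict taken if the counter is in the upper half (2 or 3). */
  static bool predict(uint64_t pc)
  {
      return counters[pred_index(pc)] >= 2;
  }

  /* Train after the branch resolves: saturating increment/decrement. */
  static void train(uint64_t pc, bool taken)
  {
      uint8_t *c = &counters[pred_index(pc)];
      if (taken) { if (*c < 3) (*c)++; }
      else       { if (*c > 0) (*c)--; }
  }

  int main(void)
  {
      /* Toy demo: a branch at a made-up PC that is taken 7 times, then not. */
      uint64_t pc = 0x400123;
      for (int iter = 0; iter < 3; iter++)
          for (int i = 0; i < 8; i++) {
              bool actual = (i < 7);
              printf("predict %d, actual %d\n", predict(pc), actual);
              train(pc, actual);
          }
      return 0;
  }

A CAM-based structure would instead compare the incoming PC against every stored tag and need an eviction policy on a miss; the table above has neither, which is the parent's point.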


Interesting. Is the improvement in performance mostly due to improvements in cache size, count, and speed; did fine-tuning the cache parameters (number of caches, cache size, cache line size) help; or are there more fundamental architectural improvements? Do you have links to more information?


Yes, fine-tuned and fast caches are what Apple has been going for over the last few generations. AMD Ryzen also got much faster caches than their previous-gen chips. Most importantly, fine-tuned caches don't come as a power/performance trade-off - they are simply better.
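If anyone wants to see what "fast caches" means on their own machine, the usual trick is a dependent pointer chase over a shuffled buffer of growing size; the steps in ns/load roughly mark the L1/L2/L3/DRAM boundaries. A rough sketch (not a rigorous benchmark; buffer sizes and iteration counts are arbitrary, compile with -O2):

  /* Rough sketch: average load-to-use latency via a dependent pointer
     chase over a randomly shuffled ring, for growing working-set sizes. */
  #define _POSIX_C_SOURCE 199309L
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  static double chase_ns_per_load(size_t bytes)
  {
      size_t n = bytes / sizeof(void *);
      void **ring = malloc(n * sizeof(void *));
      size_t *order = malloc(n * sizeof(size_t));

      /* Random permutation so the hardware prefetchers can't follow it. */
      for (size_t i = 0; i < n; i++) order[i] = i;
      for (size_t i = n - 1; i > 0; i--) {
          size_t j = (size_t)rand() % (i + 1);
          size_t t = order[i]; order[i] = order[j]; order[j] = t;
      }
      /* Link the elements into one big cycle following the permutation. */
      for (size_t i = 0; i < n; i++)
          ring[order[i]] = &ring[order[(i + 1) % n]];
      free(order);

      /* Every load depends on the previous one, so latency can't be hidden. */
      const size_t iters = 5 * 1000 * 1000;
      void **p = ring;
      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (size_t i = 0; i < iters; i++)
          p = (void **)*p;
      clock_gettime(CLOCK_MONOTONIC, &t1);

      volatile void *sink = p; (void)sink;   /* keep the chain live */
      free(ring);

      double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
      return ns / (double)iters;
  }

  int main(void)
  {
      srand(42);
      for (size_t kib = 16; kib <= 64 * 1024; kib *= 2)
          printf("%7zu KiB: %6.2f ns/load\n", kib, chase_ns_per_load(kib * 1024));
      return 0;
  }

Because every load depends on the previous one, the prefetchers and the OoO machinery can't hide the latency, so what you measure is close to the raw load-to-use latency of whichever level the buffer fits in.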

Moreover, as litho generations progress to 10nm and beyond, the difference in power consumption between working and idle transistors gets so small that the traditional assumption that "a slow chip is also a low-power chip" no longer holds. You are better off, power-wise, getting your IPC up and your IO-wait down.

The best analysis of A12 cache performance written in accessible language that I know of is this piece: https://www.anandtech.com/show/13392/the-iphone-xs-xs-max-re...



