Hacker News
Exploring DynamIQ and ARM’s New CPUs: Cortex-A75, Cortex-A55 (anandtech.com)
67 points by msh on May 29, 2017 | 22 comments



g-d those are some deceptive graphs! Take a look at the slide titled "Pushing the performance envelope". A 16% improvement (LMBench memcpy) is displayed so that it looks like a 164% improvement (size of bar increased from 50px to 132px)!


Yeah, I noticed that immediately. It's kinda sad, too, given how the numbers are actually pretty good to begin with.

Why was the marketing department so ashamed about double digit percentage improvements in decent benchmarks? Do they think their customers are too stupid to appreciate that? Too stupid to notice the charts?


Might have to do with the pressure they're feeling from the unavoidable: RISC-V replacing ARM.


This is an obvious troll, but I'd like to point out that if RISC-V were to become big, ARM could probably come up with one of the best implementations.


Stupid question: if ARM keeps aggressively adding instructions and co-processors, how long before it becomes bloated and power-hungry like the x86 architecture? Isn't simplicity the strength of the ARM platform?


Co-processors don't matter for power consumption if they aren't being used, because you can always just turn off the power to those areas of the chip. People talk about a coming era of "dark silicon", where transistors keep getting cheaper without getting more efficient. That means we'll keep adding more and more specialized structures to chips which are very fast and power-efficient in use compared to a general-purpose CPU, but each sits there dark and silent most of the time.

Back in the day the advantages of RISC were that since everybody had small teams of designers, you could spend less time implementing instructions and more time optimizing and adding features like pipelining. And also that RISC, when introduced, could be squeezed onto a single chip, whereas CISCs needed multiple chips, which resulted in big speed penalties. These days everybody has humongous teams, much better tooling, and more transistors than you can shake a stick at. The complex decoding of x86 instructions gives you a 5-10% power penalty on big cores (and a larger one on small cores). And the stricter memory-ordering requirements of x86 might be a disadvantage; I've heard contradictory things. But in general the performance differences between ISAs are pretty small at the moment.


No. All ISAs add instructions that make sense for some important use case and can be executed efficiently (generally with a throughput of at least 1 per cycle). x86's "power hungry" issues are basically that determining how many bytes an instruction takes is a significant fraction of fully decoding it, hence the existence of the µop cache. And, to a much lesser extent, partial register accesses. For the former, ARM has already shown that a simpler scheme can work well (Thumb-2); the latter, AArch64 explicitly fixed over ARMv7, and it isn't going to regress now that everyone knows it's an issue.

Verification is more expensive for x86 sure, but any CPU approaching the complexity of modern OoOE monsters isn't going to be drastically cheaper to verify just because of the ISA.


To add to the sibling comments, it might be better to think of ARM less like Intel and more like Lego for building processors. Standard kits with sensible collections of parts, but you can always just ask for the IP blocks and DIY. If a vendor discovers something isn't working for their application, they can turn it off, build a core without it, and tell the compiler not to use that feature set. It's very modular.

Additionally: x86 grew up on devices that were basically always plugged in. ARM has a huge install base that's significantly power- or heat-constrained.


I think this issue has been overblown. The x86 ISA isn't necessarily more power hungry per 'computational unit' than the ARM ISA (except for some lower-end cases):

https://www.extremetech.com/extreme/188396-the-final-isa-sho...

In particular: https://www.extremetech.com/wp-content/uploads/2014/08/Avera...


It's not like ARM randomly slaps on features just for the fun of it. These features are carefully modeled and studied before being released into the market.


Of course every design team wants to think it works this way. But the big.LITTLE mess (at the OS level), and the fact that Apple is beating them at their own game (like Qualcomm was, at the time), is proof that it's not that simple.


ARM needs to make a design they can sell to a lot of different people. Apple has one customer: Apple. Their chips are larger, they have more transistors to play with, and they can optimize the number of cores to what they want. If you have more transistors you can add more features like decoders, ALUs, and cache, which make your chips faster. So it isn't fair to say Apple beat ARM at their own game.


An ARM CPU is not exactly a self-contained system like an x86 CPU; it is very dependent on the rest of the SoC.

Qualcomm and Apple have their own tricks which work very well on their platforms alone, but they wouldn't be able to simply pull out their ARM implementation and sell it as a really optimized system without their own SoC.


The rest of the SoC plays a huge part, just like the architecture of a PC does. My personal mantra when thinking of how to get the most out of a CPU is "feed the beast", since idle time is the killer. In the most simplistic terms, your L1 caches need to be big enough to hold the program you are running and the data it is using. You can hint to the CPU about what data you will need so it preloads it, and branch prediction does a similar job for fetching instructions and avoiding branch misses.

I for one like to see big caches in SoCs, but the cost of putting them there needs to be balanced against the performance requirements of the product (no point having a bigger chip than you need). There is also the numbers game of 64-bit / number of cores / RAM etc., which is easy to parse but difficult to understand. Few consumers care about IPC or the time taken to wake a core or switch a workload between cores, so great innovations like big.LITTLE are used as marketing numbers rather than to tune an SoC to its best performance. I would like to see one big core, two little cores, and more cache myself.

So what do Apple have? Lots of cores? No. If you build an 8 core SoC and can't keep those cores running then you have listened to marketing.


Intel does the same. Didn't prevent them from producing P4 or Itanium.


It's worth noting that, if you need something which runs on really low power, ARM have their R and M series processors. So even if the A series did become really power-hungry, the other two lines presumably wouldn't.


E.g., the processor in the BBC micro:bit pulls about 10mA @ 2.5V or so. Of course, it only runs at 32MHz or so and has a whopping 16KB of RAM, but you can run one off a pair of AA batteries for a week or two without ever sleeping.


If it wants to reach x86 levels of performance, then the answer is yes. In fact, it could be less efficient, simply because x86 has two decades of optimizations going for it. You notice they keep comparing against the previous gen and not state-of-the-art x86 performance numbers?


I'd heard about them allowing heterogeneous clusters ahead of time, and I was wondering how that would work. Private L2s should do it: you really need to design the L2 to match the profile of the supported CPU(s), but with L3 the latencies are higher and there's less need to specialize.


Nice write-up.

Also, I was curious when consumers might see this in their products; the last line in the article says late 2017/early 2018.


If current trends continue, Huawei (through their subsidiary HiSilicon) will be the first to launch a product with the new ARM IP.

The Cortex A72 was announced in April 2015, Huawei launched the Kirin 950 (4x Cortex A72 + 4x Cortex A53) in November 2015 as part of the Huawei Mate 8. The Cortex A73 was announced in May 2016, Huawei launched the Kirin 960 (4x Cortex A73 + 4x Cortex A53) in November 2016 as part of the Huawei Mate 9.

So yeah, my guess is on the Huawei Mate 10 with a Kirin 970 (4x Cortex A75 + 4x Cortex A55) in November this year. The market has become very iterative and predictable, and this ARM announcement confirms it. Don't actually give Huawei any money for this hardware though, their software update policies are horrible. The more interesting and useful implementations will come from Samsung, Qualcomm and maybe Nvidia in 2018.


Cortex-A75 really seems like a chip designed to enter the PC market (Chromebooks/Windows on ARM), slow and steady.



