ISA showdown: Is ARM, x86, or MIPS intrinsically more power efficient? (extremetech.com)
68 points by indolering on Aug 27, 2014 | hide | past | favorite | 39 comments



ARM and x86 are rather close, but the real disappointment here is the MIPS Loongson, which is basically the "original RISC ISA". Unfortunately I've encountered a huge number of people, particularly academics, who still think (and teach) that MIPS or minor variants of it are the "best" ISAs and that one can easily make cheap, fast, and power-efficient processors based on it. Looking at the current state of things, it seems the only thing MIPS has succeeded in is being cheap and pedagogical.

I think instruction density is quite significant here too - x86 opcodes vary between 1 and 15 bytes, with 2-3 being average, and ARM has Thumb mode, where instructions are either 2 or 4 bytes, but all MIPS instructions are 4 bytes. The Loongson also has twice as much L1 as most of the ARM and x86 processors here, which apparently didn't help it much. Cache consumes power too, so I believe small variable-length encodings (like x86) are ultimately better, since they allow for better utilisation of cache; the extra complexity in the decoder to handle this, which basically amounts to a few barrel shifters, is almost nothing compared to the area and power that more cache would need.
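As a back-of-envelope illustration of the density point (the average encoded sizes are assumptions for illustration, not measurements):

```c
#include <assert.h>

/* Back-of-envelope: how many instructions fit in a 32 KiB L1 I-cache,
   given an assumed average encoded instruction size. */
int insns_in_icache(int cache_bytes, int avg_insn_bytes) {
    return cache_bytes / avg_insn_bytes;
}
/* insns_in_icache(32 * 1024, 3) == 10922  (x86, assumed ~3-byte average)
   insns_in_icache(32 * 1024, 4) ==  8192  (MIPS, fixed 4-byte encoding) */
```

At these assumed averages, roughly a third more code stays resident per kilobyte of I-cache - and extra cache, unlike a denser encoding, is paid for in area and power.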

The entire reason CISC architectures emphasized complex multi-cycle instruction execution is because memory accesses were orders of magnitude slower than the processor and data storage was extremely limited.

When considering cache, these points are all true again. There's a common belief that when optimising for x86 you should avoid the smaller but slower "CISC" instructions, but in situations like tight loops, an instruction that's 2-3x slower individually can be better than the faster, longer one(s) if it means the difference between code and data staying in cache or a 10x+ slowdown from a cache miss somewhere else - especially on an OoO/superscalar design, where the slower instruction can execute in parallel with other nondependent ones. (Intel/AMD's focus on speeding up these small CISC instructions - which they have done - is possibly one of the reasons why x86 performance continues to improve.)
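A rough sketch of that tradeoff, with made-up cycle counts and miss penalties:

```c
#include <assert.h>

/* Effective cycles per instruction for a loop, with an assumed flat
   cache-miss penalty. All numbers here are illustrative, not measured. */
double effective_cpi(double base_cpi, double miss_rate, double miss_penalty) {
    return base_cpi + miss_rate * miss_penalty;
}
/* "Fast" code that misses:  1.0 + 0.02 * 200 = 5.0 cycles/insn.
   A 3x-slower instruction whose density keeps the loop resident:
   3.0 + 0.00 * 200 = 3.0 cycles/insn - the "slower" code wins. */
```

With a ~200-cycle miss penalty, the break-even miss rate for a 3x-slower instruction is only 1%; anything above that and the denser code comes out ahead.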


The Loongson is a 90nm part; the others are 32-45nm. No ISA is going to make up for a doubled feature size.

Really the notable thing to me isn't the ISA nonsense at all. It's how singular a success the Cortex A9 core is. It came at exactly the right moment in history and hit exactly the right sweet spot, being significantly beefier than the A8 yet only minimally more power-hungry. Krait has followed on pretty well, but the A15 can almost be considered a failure at this point.


As mentioned in a different comment here, all the results were scaled to 45nm@1GHz, and while that's not totally accurate, there's still a big gap between it and the ARMs/x86s.

It would be interesting to see how the A7 also compares to the A8/A9, since it's supposed to be a more power-efficient version of the A15.


Yeah, based on just the data in this article, I would see A9 as preferable to A15 for using as the CPU of a mobile device.


Definitely, though wasn't the A15 designed for more of a compact server role? IIRC the A15 had different design goals.


Ok, I would believe that. I don't know much about the correspondence between A8/A9/A15 and real-world devices.


> but all MIPS instructions are 4 bytes.

Like ARM, MIPS also has 16-bit extensions - two, in fact: MIPS16 and microMIPS.

> [cut] thus I believe small variable-length encodings (like x86) are ultimately better since they allow for better utilisation of cache

I disagree: mixed 16/32-bit RISC ISAs have nearly the same code density as x86 and are far simpler to decode.

So besides software compatibility (the killer feature - well, except for Intel of course), the x86 ISA has NO advantage.


So the MIPS ISA is considered to be 'pure' RISC?

My rule of thumb: if something is conceptually "pure" instead of a complicated, carefully balanced mix of grey, it's usually not worth considering.

I've found this to be true for programming languages, ISAs, politics, etc.

It is interesting how purity has a very strong allure - maybe our brains are naturally drawn to a reduced state of complexity, and thus energy consumption?


> It is interesting how purity has a very strong allure - maybe our brains are naturally drawn to a reduced state of complexity, and thus energy consumption?

Or maybe, more often than not, "complicated" is just not a "carefully balanced mix of grey" but more of a clusterfuck... and we learned to be wary of it.

Have a look at this: http://www.infoq.com/presentations/Simple-Made-Easy


I've seen most of his talks, actually I'm a fan. I wouldn't consider Clojure to be a good example of purity though - it has both LISP purists as well as FP purists (Haskell) against it. Actually it is quite pragmatic for running on the JVM and even has optional typing.

If elegance and simplicity are achievable without making too many sacrifices, great! I'd choose Clojure over C++ any day.


MIPS: the Pascal of ISAs.


Link to the study referenced by the article:

http://research.cs.wisc.edu/vertical/papers/2013/isa-power-s...


What irks me is computing and comparing W/MIPS for vastly different processors, like 45W Intel and 5W ARM chips. That just isn't reasonable, as the performance increase is very sublinear in power.


Also, running the i7 in 32-bit mode can't possibly be showing the Intel chip from its best side?


There are a few other things that could also impact results. Looking at the experimental design, there could be a good 25% uncertainty in the results because of how it's done at such a high level.

Rough up-to values for various things I can think of / see:

- 10% because they're using an OS rather than running binaries straight.

- 15% because GCC is odd and -O3 does even stranger things, particularly when it comes to energy.

- 15% because their benchmarks are large workloads rather than microbenchmarks that might better target the architecture; huge lumps of code would also exacerbate the GCC strangeness.

- 17% because they're measuring board rather than CPU power supply (the claim that SoC-based ARM development boards cannot have processor power isolated is questionable - I've seen Beagleboards with the CPU power supply isolated)

- 10% because they're measuring energy consumption at a low resolution (their equipment samples at Hz rates when there's kit that happily samples at kHz or MHz).

Of course, some of these will cancel, and others will be nowhere near as bad as stated. None of this introduces order-of-magnitude changes to the conclusions, although a few of the 'Key Findings' may want questioning.
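As a sketch of how those figures might combine (treating the error sources as independent is itself an assumption, and these are the rough "up to" values above, not measured data):

```c
#include <assert.h>

/* Worst case: all error sources push the same way, so they add. */
double worst_case_sum(const double *e, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += e[i];
    return s;
}

/* If independent, errors partially cancel: combine in quadrature
   (this returns the sum of squares; the combined error is its sqrt). */
double sum_of_squares(const double *e, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += e[i] * e[i];
    return s;
}
/* For {0.10, 0.15, 0.15, 0.17, 0.10}: worst-case sum = 0.67 (67%);
   sum of squares = 0.0939, i.e. a root-sum-square of about 31% -
   in the same ballpark as the "good 25%" estimate above. */
```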


Did anyone find a reasonably prominent link to the source?

It seems to me as if this article is mostly linkbait, simply because it fails to provide anything more than vague phrases about the source: "This paper is an updated version of one I’ve referenced in previous stories, ... the team from the University of Wisconsin"

Half-baked studies frequently attempt to shout down the real hard science.


Think we have the answer in the source article's abstract:

"Our methodical investigation demonstrates the role of ISA in modern microprocessors’ performance and energy efficiency. We find that ARM and x86 processors are simply engineering design points optimized for different levels of performance, and there is nothing fundamentally more energy efficient in one ISA class or the other. The ISA being RISC or CISC seems irrelevant."

http://research.cs.wisc.edu/vertical/papers/2013/isa-power-s...


The one argument I can make is that MIPS is too simple, but I would only make this claim about the simplest of in-order single- or dual-issue implementations. Think of a memcpy loop: 32-bit ARM and PowerPC can update the pointers as a side effect of the load and store instructions, but MIPS cannot. You could make similar arguments in favour of ARM's Thumb instruction set (more work done per 32 bits of instruction loaded, with low decoding overhead vs. x86).
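To make the memcpy point concrete, here is the loop in question as a C sketch; the assembly mappings in the comments are the usual ones, not output from any particular compiler:

```c
#include <assert.h>
#include <stddef.h>

/* On 32-bit ARM the body can compile to "ldrb r3, [r1], #1" /
   "strb r3, [r0], #1", where the pointer bump is a side effect of the
   load/store; PowerPC has update forms (lbzu/stbu). Classic MIPS has no
   such addressing mode, so each pointer update costs a separate addiu
   instruction in the loop body. */
void copy_bytes(unsigned char *dst, const unsigned char *src, size_t n) {
    while (n--)
        *dst++ = *src++;  /* one load + one store + two pointer updates */
}

/* Small self-check helper. */
int copy_bytes_ok(void) {
    const unsigned char in[4] = {1, 2, 3, 4};
    unsigned char out[4] = {0, 0, 0, 0};
    copy_bytes(out, in, 4);
    return out[0] == 1 && out[1] == 2 && out[2] == 3 && out[3] == 4;
}
```

On a simple in-order core those two extra adds per iteration are pure overhead; on anything wider they mostly disappear into spare issue slots.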

For implementations more advanced than this... I don't think you can make any such claim based on ISA. x86 may be at a slight disadvantage due to decoding, but that's about it.


With ARM different companies can license the design and include different system components on the chip. With Intel you need to take a packaged chip provided by Intel. This can allow a system power and cost advantage compared to having multiple chips. It is however a licensing/business model issue rather than a fundamental ISA issue.


Intel is starting to offer custom customer silicon on its chips. They signed a contract with Rockchip a couple of months ago and are now marketing that product to customers in China.


The way this data is normalized destroys any comparison of offsets between mobile/server (and other) scenarios. What's wrong with using the actual unit of measure, like Watts?


This is not a half-baked study. The right comparison is being made, namely performance versus energy. Further, they attempt normalized comparisons; quoting:

To factor out the impact of technology, [we] present technology-independent power by scaling all processors to 45nm and normalizing the frequency to 1 GHz.


Normalization is nice for a mental exercise, but I cannot buy a normalized phone with a normalized i7 that fits in my normalized pocket. Engineering is the art of trade-offs and the i7 has traded off size and power to achieve speed. That is great when you have an i7-scale size and power budget, but if the i7 exceeds your power or size budget it is a non-starter regardless of how efficient (when normalized) it is. Full stop.

The implicit argument of the paper is that Intel could produce a direct size+power+speed replacement for a phone-scale ARM processor, they just need to dial the knobs to small+small+slower. The counter argument is that they have tried but not come close. The Atom line is roughly comparable with respect to speed, but size and power are a problem. The Galileo processor is roughly comparable with respect to power and size but speed is horribly lacking.


There are x86 phones on the market that have similar weight/shape/battery life to ARM phones. Anandtech reviewed one two years ago and found it in the middle of the pack with respect to energy.

The question for Intel is dialing down the profit knob: how much of a hit do they want to take on each unit shipped, by competing with ARM for tiny phone chips.



Unless they carefully considered issues such as memory timing, or actually ran all the CPUs at 1GHz, just scaling results by clock frequency will make the x86s look worse and the ARMs better, because the x86s have a bigger gap between core and memory speed. Anyone who has experience with PC overclocking will know this - increasing the core clock by e.g. 25% will not make any benchmark (except maybe the most trivial of microbenchmarks) result improve by that same amount, and the same goes for the other direction: A 3.4GHz i7 run at 1GHz will not be 3.4x slower. On the other hand (no pun intended), the ARMs have a native frequency closer to 1GHz and scaling their results will not introduce as much error.
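A toy model makes the scaling error visible (all numbers are made up for illustration, and real memory behaviour is far messier than a single fixed stall term):

```c
#include <assert.h>

/* Toy model: runtime = compute_cycles / f_ghz + fixed memory stall time.
   The key assumption is that memory latency does not scale with core clock. */
double runtime_ns(double compute_cycles, double f_ghz, double mem_stall_ns) {
    return compute_cycles / f_ghz + mem_stall_ns;
}
/* For 1000 compute cycles and 500 ns of memory stalls:
   at 1.0 GHz: 1500 ns;  at 3.4 GHz: ~794 ns.
   A 3.4x clock buys only a ~1.9x speedup, so naively dividing results by
   clock frequency overstates how slow the high-clocked part would be. */
```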


I hoped the article had a look at which parts of those ISAs were reversible [1], and thus did not dissipate energy.

For those interested, there is a master's thesis from the '90s that discussed a prototype reversible ISA + RTL that wasted (in theory) no power for the logic (non-IO) parts [2].

[1] http://en.wikipedia.org/wiki/Reversible_computing
[2] http://dspace.mit.edu/bitstream/handle/1721.1/36039/33342527...


It's not clear that any of those technologies has been physically implemented. It's very different from standard CMOS.


"The ISA being RISC or CISC seems irrelevant."

I thought these were all RISC processors when you get past the instruction decoder.


Not really. Both Intel and AMD do weird non-RISCy things beyond the decode stage.

The original difference between RISC and CISC was that RISC eschewed arithmetic+memory-operations in the same instruction. Both Intel and AMD processors violate this commandment. Instead their decomposition of instructions into uops is based more on the more pragmatic notion of choosing uops that are easy to pipeline and execute out of order.


It depends on what you consider RISC to be. The ARMs in that list decode instructions into uops too. The only CPU in that list that's probably "pure RISC", in the sense of the ISA instructions themselves being the uops, is the MIPS.


I wouldn't call X86 RISC. Granted, RISC doesn't have a clear definition anymore these days and is more or less a marketing buzzword.


I think kristianp's point is that current x86 implementations are simply an x86 instruction decoder/emulator running on a very-RISC microcoded machine.

Which is how a lot of mainframes were implemented in the 60s/70s (e.g. KL-10).


Which still leaves the question of the cost of the instruction decoder.


It's a balancing act.

Power consumption by the instruction decoder vs the power consumption of additional cache&memory bandwidth.

It's amusing that despite the complaints towards x86 in the '90s, nowadays it's actually a really good instruction packing format (though it became really sensible only after AMD64).


The article explicitly addresses this: the claim is that decoder cost matters for tiny processors of 1-2 mm2 die size, but at the power levels commonly used in phones and the like, that particular cost is insignificant.


Yet out-of-the-box-computing claims to beat all of these by an order of magnitude. I'm looking forward to seeing how they compare.


ABI is too high level to have any effect on power.


You can have an impact on power consumption at a range of levels. From choosing appropriate algorithms, to switching to different data types, to changing compiler flags. Add together a few of these 'easy' 5-10% energy savings and you've just reduced your application's energy consumption by a third (OK, it's not quite that simple, but the principle stands).
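The compounding arithmetic, as a sketch (the percentages are illustrative, and independence of the savings is assumed):

```c
#include <assert.h>

/* Successive fractional savings compound multiplicatively, not additively -
   the "not quite that simple" part. */
double combined_saving(const double *savings, int n) {
    double remaining = 1.0;
    for (int i = 0; i < n; i++)
        remaining *= 1.0 - savings[i];  /* each saving applies to what's left */
    return 1.0 - remaining;
}
/* Four independent 10% savings: 1 - 0.9^4 = 0.3439, i.e. about a third -
   a bit less than the 40% that naive addition would suggest. */
```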

A couple of citations: http://arxiv.org/pdf/1406.0117v1.pdf (algorithms / data types) http://arxiv.org/pdf/1303.6485.pdf (compiler flags)



