Actually, there is plenty of precedent for that conclusion. Scaling CPU cores up (i.e. building efficient but high-performance cores) is super hard; making low-power cores is pretty simple in comparison. Almost everyone fails or has failed at it. Currently there are two companies on the planet that do that sort of thing, and not so long ago it was only one. Notable names that failed hard at this: Intel, DEC, Sun, Fujitsu, AMD, IBM, HP, and SGI.
It's not unlike designing a race car. You can put in a monster engine, but it'll add weight, require more fuel, and add inertia, which in turn demands a more complicated suspension and wider, heavier tires to take corners quickly, and so on.
The careful balancing of every aspect of a modern CPU is an art very few people practice. I'd say IBM and Sun haven't left this game just yet for very high-performance platforms, but don't expect a Core i3 to come out of either.
True, IBM and Fujitsu/Sun are still competitive in some niches, but not on power efficiency and not for general-purpose applications. Last time I looked, IBM was 3-5 times worse in perf/Watt on SAP benchmarks. Of course IBM is still an option in those specialized applications if you just need that raw power, but no one is going to run a data centre on that silicon.
If everything gets cracked into micro-ops anyway, how much effect does your general instruction set (not SIMD/AltiVec/Thumb, etc.) _really_ have on performance and power consumption?
Power goes as C·V²·f (switched capacitance times voltage squared times frequency). At best. That doesn't change.
If you look at a particular technology node, one of the best correlates of power consumption is chip area. It's pretty much a straight line.
Now, that has probably changed a bit with how aggressively we turn parts of the chip off nowadays. So we probably have to restate it as power consumption vs. active chip area, but it's still a fixed line for a given technology node.
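To make the C·V²·f point concrete, here's a minimal sketch (Python; the capacitance, voltage, and frequency numbers are made up purely for illustration, not real silicon figures) of how the voltage-squared term compounds with frequency:

    # Rough sketch of the dynamic-power relation P ~ C * V^2 * f.
    # All operating-point values below are illustrative, not measured.
    def dynamic_power(c_farads, v_volts, f_hz):
        return c_farads * v_volts**2 * f_hz

    base   = dynamic_power(1e-9, 1.2, 3.0e9)   # hypothetical "fast" operating point
    scaled = dynamic_power(1e-9, 0.9, 1.5e9)   # lower voltage and lower clock
    print(f"power ratio: {scaled / base:.2f}")  # ~0.28: V^2 and f compound quickly

Halving frequency alone only halves power; the big wins come when lowering the clock also lets you drop voltage, because that term enters squared.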
> Whose only competitor is ARM, which is little better in terms of decoding complexity.
T32 indeed has this problem, but A32 does not. With ARMv8, ARM released the A64 instruction set, which was designed from the ground up and, to my knowledge, is also not hard to decode. Nevertheless, decoding complexity is not that relevant anymore. What is much harder and takes more die area is (super-)pipelining the execution, out-of-order execution, etc. But even with all of that together, what typically consumes the most die area is the caches.
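To illustrate why a fixed-width encoding like A64 keeps the decoder simple, here's a toy sketch (Python, not real decoder logic; length_of is a hypothetical helper that returns an instruction's length from its leading bytes):

    # Toy illustration, not a real ISA: with fixed 4-byte instructions the
    # boundaries in a fetch window are known up front, so a wide decoder can
    # carve them out independently (i.e. in parallel in hardware).
    def decode_fixed(window: bytes):
        return [window[i:i+4] for i in range(0, len(window), 4)]

    # With a variable-length encoding (x86, T32), finding instruction i+1
    # requires at least partially decoding instruction i, so the scan is serial.
    def decode_variable(window: bytes, length_of):
        out, i = [], 0
        while i < len(window):
            n = length_of(window[i:])   # must inspect this instruction's bytes
            out.append(window[i:i+n])
            i += n
        return out

    # Example with a made-up encoding where the first byte implies the length.
    toy_len = lambda b: 2 if b[0] < 0x80 else 4
    print(decode_fixed(bytes(range(8))))
    print(decode_variable(bytes(range(8)), toy_len))

That serial boundary-finding is what variable-length front ends spend extra logic (predecode bits, length speculation) working around; with fixed 32-bit instructions the problem simply doesn't exist.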
Caches are transistor-dense, and it's hard to optimize them much further. Maybe something like Transmeta's architecture, a hybrid software/hardware approach, would yield some improvement.
There is absolutely no reason to reach that conclusion.