> Whose only competitor is ARM, which is little better in terms of decoding complexity.
T32 indeed has this problem, since it mixes 16-bit and 32-bit encodings, but A32 does not: every A32 instruction is 32 bits wide. With ARMv8, ARM introduced the A64 instruction set, which was designed from the ground up and is, to my knowledge, also not hard to decode. Nevertheless, decoding complexity is not that relevant anymore. What is much harder and takes more die area is (super-)pipelining, out-of-order execution, etc. But even with all of that taken together, what consumes the most die area is typically the caches.
Caches are already transistor-dense, so it's hard to optimize them further. Maybe something like Transmeta's architecture, a hybrid software/hardware approach, would yield some improvement.
It'd be, of course, totally incompatible with what we have now.
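
To make the T32 vs. A64 decoding point concrete, here is a rough Python sketch of finding instruction boundaries. The function names and the sample halfwords are my own illustration, not taken from any real decoder. It relies on the Thumb-2 rule that a halfword whose top five bits are `0b11101`, `0b11110` or `0b11111` starts a 32-bit instruction, while anything else is a complete 16-bit instruction.

```python
def t32_lengths(halfwords):
    """Return the byte length of each instruction in a T32 halfword stream.

    Illustrative helper: the length of each instruction depends on the
    first halfword, so boundaries must be found one after another.
    """
    lengths = []
    i = 0
    while i < len(halfwords):
        top5 = (halfwords[i] >> 11) & 0x1F
        # These three prefixes mark the first half of a 32-bit encoding.
        size = 4 if top5 in (0b11101, 0b11110, 0b11111) else 2
        lengths.append(size)
        i += size // 2  # skip the second halfword of a 32-bit instruction
    return lengths


def a64_lengths(words):
    """A64 is fixed-width: every instruction is 4 bytes, no inspection needed."""
    return [4] * len(words)


# Sample stream (assumed for illustration): a 16-bit instruction,
# a 32-bit instruction split over two halfwords, then another 16-bit one.
print(t32_lengths([0x2001, 0xF000, 0xF800, 0xBF00]))
print(a64_lengths([0xD503201F, 0xD503201F]))
```

In the T32 case each boundary depends on the one before it, which is what makes decoding several instructions per cycle awkward. With A64 every boundary is known up front.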