>When the CPU executes 32 bit x86 code it "compiles" it to uops that are totally unrecognizable to us and use dozens to hundreds of registers.
Yes, but the x86 code itself can't address hundreds of registers, only a handful, because it assumes that's all there is (because that's all there was in the actual x86 processors way back). So the fact that you're using a complicated instruction decoder to get around this and make use of much more capable hardware underneath seems to me to be a big source of inefficiency: surely you would have more performance if your ISA could directly use the hardware resources, instead of needing a super complicated instruction decoder.
>So sure, Atom sucks at embedded, but x86 is still king at desktop and beyond, and ARM will never be able to challenge it.
ARM is already challenging it. They have ARM-64 servers now. Here's a place selling them, from a quick Google search:
https://system76.com/servers/starling
>If ARM wanted to challenge x86 in terms of single threaded performance, it would need to do the same thing x86 does: have a super complicated instruction decoder that maps the 32 logical registers defined by the ISA to its 128 physical ones, reschedules everything, renames stuff where appropriate, identify loads that can be elided, etc.
Ok, then why not just make a new (or at least extended) ISA that makes direct use of all those things, instead of needing a super complicated instruction decoder? We already have lots of different ISAs for embedded CPUs: ARM has all kinds of variants (ARMv7, ARMv9, etc.), and MIPS does too. For best performance, you have to compile for the exact ISA you're targeting. We don't do this for desktop stuff mainly because Microsoft isn't going to make 30 different versions of Windows, but for embedded systems it's perfectly normal because everything is compiled from source.
> ARM is already challenging it. They have ARM-64 servers now. Here's a place selling them,
TBH that's not challenging x86, any more than Atom is challenging ARM in the embedded space. The fact that they're for sale doesn't mean they're good. Those server CPUs have terrible performance per core.
> Ok, then why not just make a new (or at least extended) ISA that makes direct use of all those things, instead of needing a super complicated instruction decoder?
Directly accessing all the parts which are hidden behind the ISA is called VLIW, and the performance is terrible every time someone tries to reinvent it. It sucked even when Intel released the Itanium, which ran Windows.
The problem is that many of the data dependencies hinge on timings which aren't available at compile time. (The latency of integer division, for instance, is sensitive to the values of its operands. To say nothing of cache timing.) A super complicated instruction decoder knows which data it has and which it doesn't while it is deciding what uops to dispatch on the half dozen or so lanes it's managing. A sufficiently advanced compiler does not, so a VLIW has to wait for all data in the half dozen or so lanes to become available before it is allowed to dispatch the instruction. If you want to do "interesting" rescheduling/renaming, you need to bring back the super complicated instruction decoder. (AFAIK later Itaniums started down the path of a complicated instruction decoder, but the Itanium was canned long before its complexity started approaching contemporary x86 standards. It would have been interesting to watch that develop.)
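To put rough numbers on that, here is a toy model in Python. It is only a sketch under heavy assumptions, not a description of any real core: each "lane" is an independent dependency chain, the latencies are made up (20 stands in for a cache miss or a slow divide), and issue width is unlimited. The function names `lockstep_cycles` and `dynamic_cycles` are mine. The lockstep version waits for the slowest op in every bundle; the dynamic version lets each chain run ahead as soon as its own data is ready.

```python
def lockstep_cycles(lanes):
    """VLIW-style toy: bundle i holds op i of every lane, and the next
    bundle cannot issue until the slowest op in this one has finished."""
    return sum(max(bundle) for bundle in zip(*lanes))


def dynamic_cycles(lanes):
    """Out-of-order toy: each lane is an independent dependency chain
    that proceeds as soon as its own previous op has finished."""
    return max(sum(lane) for lane in lanes)


# Hypothetical per-op latencies in cycles, only known at run time.
lanes = [
    [1, 1, 20, 1],
    [20, 1, 1, 1],
    [1, 20, 1, 1],
]

print("lockstep:", lockstep_cycles(lanes))
print("dynamic: ", dynamic_cycles(lanes))
```

The slow ops land in different bundles, so the lockstep schedule pays for each of them in turn, while the dynamic one overlaps them. Real hardware is messier on both sides, but that is the shape of the problem.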
I think you're fundamentally misunderstanding how much stuff the instruction decoder does. To be fair, I'm not doing it full justice (how can I? It won't fit), but I think you're too quick to assume that all a CPU does is perform the assembly instructions which are fed to it. As the article states, modern computers aren't just fast PDP-11s.
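For the renaming part specifically, a minimal sketch in Python of the idea (not how any particular CPU implements it). Assumptions: 4 architectural registers `r0`..`r3`, 16 physical registers `p0`..`p15`, instructions written as `(dest, sources)`, and physical registers are never recycled, whereas real hardware frees them at retirement. The `rename` function and the tiny program are made up for illustration.

```python
def rename(instrs, num_arch=4, num_phys=16):
    """Map architectural registers to physical ones, giving every
    write a fresh physical register (toy model, no recycling)."""
    table = {f"r{i}": f"p{i}" for i in range(num_arch)}   # current mapping
    free = [f"p{i}" for i in range(num_arch, num_phys)]   # unused physical regs
    renamed = []
    for dest, sources in instrs:
        # Sources read whatever physical register currently holds the value.
        phys_sources = tuple(table[s] for s in sources)
        if not free:
            raise RuntimeError("out of physical registers (toy model never frees)")
        phys_dest = free.pop(0)
        table[dest] = phys_dest
        renamed.append((phys_dest, phys_sources))
    return renamed


program = [
    ("r1", ("r2", "r3")),  # r1 = r2 op r3
    ("r0", ("r1",)),       # r0 = f(r1)
    ("r1", ("r2",)),       # r1 is reused: a false dependency on the lines above
    ("r3", ("r1",)),       # reads the *new* r1
]

for before, after in zip(program, rename(program)):
    print(before, "->", after)
```

After renaming, the two writes to `r1` go to different physical registers, so the third instruction no longer has to wait for the second one to read the old value. That is the kind of thing the hardware works out on the fly from code that only names a handful of registers.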
> ARM is already challenging it. They have ARM-64 servers now. Here's a place selling them, from a quick Google search: https://system76.com/servers/starling
From the site: "Starling Pro ARM is currently unavailable. We’ll loop you in when we have more to share. Want to be the first to learn when it arrives? Get exclusive access before anyone else."
Which is sadly the typical story with ARM servers - limited availability.