
Interesting that mobile platforms / CPUs are losing 32-bit support whilst the dominant desktop platform retains it.

Not sure what the cost of continuing 32-bit support on x86 is for Intel/AMD, but does there come a point when Intel/AMD/MS decide that it should be supported via (Rosetta-like) emulation rather than in silicon?




My understanding is that AArch64 is more or less a fresh start vs AArch32, whereas the ia16/ia32/amd64 transitions are all relatively simple extensions. Almost all the instructions work in all three modes; the defaults are a bit different, the registers available are different, and the addressing modes might be a smidge different. You would make some gains in decoder logic by dropping the older stuff, but not a whole lot; amd64 still has the same variable-width instructions and strong memory consistency model that make it hard.


If you dropped 16-bit and i386 then you might be able to reuse some of the shorter instruction codes like 0x62 (BOUND) that aren't supported in x86-64.

Decoders aren't the problem with variable-length instructions anyway (and they can be good for performance because they're cache-efficient). The main problem is security, because you can obfuscate programs by jumping into the middle of instructions.


Nitpick: 0x62 can’t be reused as it’s already been repurposed for the EVEX prefix ;)


Oh yeah, I just knew there's some x86 emulators that don't bother implementing it and nobody's ever noticed.


The cost of continuing support is increased decoder complexity. The simplicity/orthogonality of the ARM ISA allows simpler instruction decoders. Simpler, faster, and easier to scale. Intel and AMD instruction decoders are impressive beasts that consume considerable manpower, chip space, and physical power.


I remember reading (in 2012 maybe) that ISA doesn't really matter and decoders don't really matter for performance. The decoder is such a small part of the power and area compared to the rest of the chip. I had a hypothesis that as we reach the tail end of microarchitectural performance the ISA will start to matter a lot more, since there are fewer and fewer places left to optimize.

Well, now in 2021, cores are getting wider and wider to get any throughput improvement, and power is a bigger concern than ever. Apple's M1 has an 8-wide decoder feeding the rest of the extremely wide core. For comparison, Zen 3 has a 4-wide decoder and Ice Lake has a 5-wide one. We'll see in 3 years whether Intel and AMD were just being economical or unimaginative, or whether they really can't go wider due to x86 decode complexity. I suppose we'll never really know if they cover for a power-hungry decoder with secret sauce elsewhere in the design.


The state machine that determines the start-of-instruction positions can be run in parallel. In fact, any regex / state machine can be run in parallel over every byte with Kogge-Stone / parallel prefix, because stepping the state machine is an associative (but not commutative) operation.

As such, I believe that decoders (even complicated ones like x86) scale at O(n) total work (aka power used) and O(log(n)) for depth (aka: clock cycles of latency).

-------

Obviously, a simpler ISA would allow for simpler decoding. But I don't think that decoders would scale poorly, even if you had to build a complicated parallel-execution engine to discover the start-of-instruction information.
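
To make the associativity claim concrete, here's a toy Python sketch (my own illustration, not anything a real decoder does): the effect of consuming one byte is represented as a state -> state transition table, and composing two tables is an associative operation, which is exactly the property a Kogge-Stone style parallel prefix needs.

    # Toy Python sketch (illustration only, not real decoder hardware).
    # The effect of consuming one byte is a function state -> state,
    # represented here as a tuple indexed by state. Composing two such
    # tables is associative, which is all a Kogge-Stone-style parallel
    # prefix (scan) needs in order to combine them in a log-depth tree.

    NUM_STATES = 4

    def compose(f, g):
        """Table for 'apply f, then apply g'."""
        return tuple(g[f[s]] for s in range(NUM_STATES))

    # Three arbitrary per-byte transition tables.
    f = (1, 2, 3, 0)
    g = (2, 2, 0, 1)
    h = (3, 0, 1, 2)

    # Grouping doesn't matter, so the tables can be combined pairwise
    # in a tree of depth O(log(n)) instead of strictly one at a time.
    assert compose(compose(f, g), h) == compose(f, compose(g, h))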


I am not sure what you mean by running in parallel.

On x86, determining whether a given byte is the start of an instruction, the start of the part of an instruction that follows the variable number of prefix bytes, or some other kind of byte requires information from the decoding of the previous bytes.

So you may have as many pre-decoders as fetched bytes, all starting to decode in parallel, but they are not independent: the later pre-decoders need signals propagated from the earlier ones, so the delay required to determine the initial instruction bytes grows with the number of bytes fetched and examined simultaneously (16 bytes for most x86 CPUs, which on average may contain 4 or 5 instructions).

We do not know for sure where the exact limitation is, but despite the vague and inconsistent information in the vendor documentation, which in a few cases seems to imply better capabilities, no Intel or AMD CPU has so far been shown to decode more than 4 instructions simultaneously (the only exception is that some pairs of instructions are fused, e.g. certain compare-and-branch combinations, and those count as a single instruction for decoding, increasing the apparent number of simultaneously decoded instructions when those pairs occur in the program).


> I am not sure what you mean by running in parallel.

I literally mean in parallel: computing the FSM "backwards" and "forwards" simultaneously, as well as "from the middle outward". From all positions, simultaneously, in parallel.

As I stated earlier: the Kogge-Stone carry-lookahead adder is the first step to understanding this.

Let's take two 32-bit numbers: A[0:31] and B[0:31]. C[0:32] = A + B.

Simple, yes? Let's focus on a single bit: C[32], the so-called "carry bit". It is easy to see that the carry bit (the 33rd bit of the addition) depends on all 32 bits of A, as well as all 32 bits of B.

Nonetheless, the carry bit can be computed in parallel with Kogge-Stone (as well as other parallel carry-lookahead adders).

I've explained this process in multiple sibling posts at this point. Please give them a look first, and ask questions after you've read them. I understand that I've written quickly, verbosely, and perhaps inaccurately, but hopefully there's enough there to get started.


> because stepping the state machine is an associative (but not commutative) operation.

Proof?

> But I don't think that decoders would scale poorly, even if you had to build a complicated parallel-execution engine to discover the start-of-instruction information.

Surely we have empirical evidence of that, in that Apple's first iteration of a PC chip is noticeably wider than AMD's top-of-the-line offering. On top of this, we know that to make x86 run fast you have to include an entirely new layer of cache just to help the decoders out.

We've had this exact discussion on this site before, so I won't start it up again but even if you are right that you can decode in this manner I think empirically we know that the coefficient is not pretty.


I'm not a chip designer. I don't know the best decoding algorithms in the field.

What I can say is, I've come up with a decoding algorithm that's clearly O(n) total work and O(log(n)) depth. From there, additional work would be done to discover faster methodologies.

The importance of proving O(n) total work is not in saying "X should be done in this manner". It's in saying "X has, at worst, this asymptotic complexity".

I presume that the actual chip designers making decoders are working on something better than what I came up with.

> Proof?

It's not an easy proof. But I think I've given enough information in sibling posts that you can figure it out yourself over the next hour if you really cared.

The Hillis / Steele paper is considered to be a good survey of parallel computing methodologies in the 1980s, and is a good read anyway.

One key idea that helped me figure it out is that you can run FSMs "backwards". Consider an FSM where "abcd" is being matched. (No match == state 1. State 2 == a was matched. State 3 == ab was matched. State 4 == abc was matched. State 5 == abcd was matched.)

You can see rather easily that FSM(a)(b)(c)(d), applied from left to right, is the normal way FSMs work.

Reverse-FSM(d) == 4 if and only if the 4th character is d (that is, the match completes only if the state entering that final character is 4). Reverse-FSM(anything else) == the initial state. Reverse-FSM(d)(c) == 3.

In contrast, we can also run the FSM in the standard forward approach: FSM(a)(b) == 3.

Because FSM(a)(b) == 3, and Reverse-FSM(d)(c) == 3, the two states match up and we know that the final state was 5.

As such, we can run the FSM backwards. By carefully crafting "reverse-FSM" and "forward-FSM" operations, we can run the finite state machine in any order.

Does that help explain the concept? The "later" FSM operations, applied to the later characters, are "backwards" operations (figuring out which earlier state is required, e.g. state 4 before the d and state 3 before the c), while the "earlier" FSM operations on the first characters are the traditional forward stepping of the FSM.

When the FSM operations meet in the middle (i.e. FSM(a)(b) meets with reverse-FSM(c)), all you do is compare states: reverse-FSM(c) moves you to state 4 if and only if the incoming state was 3. FSM(a)(b) is clearly in state 3, so you can combine the two values together and say that the state after abc was in fact 4.

-----

Last step: now that you think of forward and reverse FSMs, the "middle" FSMs are also similar. Consider FSM2(b), which processes the 2nd character: its output is basically a 2->3 link if and only if the 2nd character is b.

FSM(a) == 2 and ReverseFSM(d)(c) == 3, and we have the middle FSM2(b) returning (2->3). So we know that the pieces match up: state 2 goes into the middle link and comes out as the state 3 that the reverse side needs.

So we can see that all the bytes (the first bytes, the last bytes, and even the middle bytes) can be processed in parallel.

For example, let's take FSM2(b)(c) and process the two middle bytes before we process the beginning or end. We see that the 2nd byte is b and the 3rd is c, which means that FSM2(b)(c) creates a 2->4 link (if the state entering FSM2(b)(c) is 2, then the state afterwards is 4).

Now we do FSM(a) and see that we are in state 2. Since FSM2(b)(c) has been preprocessed into a 2->4 link, that puts us at state 4.

So we can really process the FSM in any order. Kogge-Stone gives us an O(n) work / O(log(n)) depth parallel methodology. Done.
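
Here's a small Python sketch of that "abcd" matcher (my own toy code, using the state numbering above): each character becomes a transition table, the tables compose associatively, so the four per-character steps can be grouped left-first, right-first, or middle-out and the answer is the same.

    # Toy sketch of the "abcd" matcher above (illustration only).
    # States as in the comment: 1 = no match, 2 = "a" seen, 3 = "ab" seen,
    # 4 = "abc" seen, 5 = "abcd" seen.

    STATES = (1, 2, 3, 4, 5)
    EXPECTED = {1: 'a', 2: 'b', 3: 'c', 4: 'd'}

    def step(state, ch):
        """Ordinary one-character step of the forward FSM."""
        if state == 5:
            return 5                    # already matched, stay matched
        if EXPECTED[state] == ch:
            return state + 1
        return 2 if ch == 'a' else 1    # restart on 'a', otherwise no match

    def table(ch):
        """One character summarized as a transition table (dict state -> state)."""
        return {s: step(s, ch) for s in STATES}

    def compose(f, g):
        """Apply f, then g. Associative, so any grouping works."""
        return {s: g[f[s]] for s in STATES}

    a, b, c, d = (table(ch) for ch in "abcd")

    left_first  = compose(compose(compose(a, b), c), d)   # ((a;b);c);d
    right_first = compose(a, compose(b, compose(c, d)))   # a;(b;(c;d)) -- the "reverse" pass
    middle_out  = compose(a, compose(compose(b, c), d))   # the middle pair b;c combined first

    assert left_first == right_first == middle_out
    print(left_first[1])   # start in state 1 ("no match") -> prints 5, "abcd" matched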


Is parsing the instructions in parallel strictly a finite-state-machine problem? What about lookahead? etc.


It's pretty obvious to me that the x86 prefix extensions to the ISA form a Chomsky Type-3 regular grammar, which means that a (simple) regex can describe the x86 prefixes, which can be converted into a nondeterministic finite automaton, which can be converted into a deterministic finite state machine.

Or in more colloquial terms: you "parse" a potential x86 instruction by "starting with the left-most byte, reading one byte at a time until you have a complete instruction".

Any grammar that you can parse one byte at a time, left to right, with only a fixed amount of state is a Chomsky Type-3 grammar. (In contrast, 5 + 3 * 2 cannot be parsed that way: the 3 * 2 needs to be evaluated before the 5 + part.)
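
As a sketch of what "read one byte at a time, left to right" looks like in code, here's a toy length decoder for a deliberately made-up, drastically simplified format (not real x86 encoding; the prefix bytes and opcode table are hypothetical):

    # Toy left-to-right length decoder for a MADE-UP format, not real x86:
    # zero or more prefix bytes (0xF0-0xF3), then one opcode byte, then a
    # fixed number of operand bytes looked up per opcode. One byte at a
    # time with a fixed amount of state -- i.e. a regular (Type-3) grammar.

    PREFIXES = {0xF0, 0xF1, 0xF2, 0xF3}          # hypothetical prefix bytes
    OPERAND_BYTES = {0x90: 0, 0xB8: 4, 0x04: 1}  # hypothetical opcode table

    def instruction_length(code, start):
        i = start
        while code[i] in PREFIXES:   # state: "reading prefixes"
            i += 1
        opcode = code[i]             # state: "reading the opcode"
        i += 1
        return (i - start) + OPERAND_BYTES[opcode]

    code = bytes([0xF0, 0xB8, 1, 2, 3, 4, 0x90])
    print(instruction_length(code, 0))   # 6: one prefix + opcode + 4 operand bytes
    print(instruction_length(code, 6))   # 1: bare opcode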


Maybe, but O(n) on a complex decoder could be a material penalty vs a much simpler design.


Sure, but only within a constant factor.

My point is that the difference between the ARM ISA and the x86 ISA is not any kind of asymptotic difference in efficiency that grossly prevents scaling.

Even the simplest decoder for an exactly-32-bit ISA (like POWER9) with no frills would need O(n) scaling. If you do 16 instructions per clock tick, you need 16 individual decoders, one on every 4 bytes.

Sure, that delivers the results in O(1) instead of O(log(n)) like a Kogge-Stone FSM would, but ARM64 isn't exactly cake to decode either. There's micro-op / macro-op fusion going on in ARM64 (e.g. AESE + AESMC are fused on ARM64 and executed as one uop).


x86 instructions can straddle MMU pages and even cache lines.

I guess that will affect the constant factor in that O(n) (assuming that's true; I wouldn't even dare say I believe that or its converse).


> x86 instructions can straddle MMU pages and even cache lines.

That doesn't change the size or power requirements of the decoder, however. The die size is related to the total work done (i.e. O(n) die area), and O(n) also describes the power requirements.

If the core is stalled on MMU pages / cache line misses, then the core idles and uses less power. The die area used doesn't change (because once fabricated in lithography, the hardware can't change).

> (assuming that’s true. I wouldn’t even dare say I believe that or its converse)

Kogge-Stone (https://en.wikipedia.org/wiki/Kogge%E2%80%93Stone_adder) can take *any* associative operation and parallelize it into O(n) work / O(log(n)) depth.

The original application was the Kogge-Stone carry-lookahead adder. How do you calculate the "carry bit" in O(log(n)) time, where n is the number of bits? It is clear that the carry bit depends on all 32 bits of A plus all 32 bits of B (!!!), so it almost defies logic to think that you can figure it out in O(log(n)) depth.

You need to think about it a bit, but it definitely works, and is a thing taught in computer architecture classes for computer engineers.
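
For a concrete software illustration (a minimal sketch, not hardware): each bit position can be summarized as a (generate, propagate) pair, and those pairs combine with an associative operator, which is what lets a Kogge-Stone tree produce the carries in O(log(n)) levels.

    # Minimal Python sketch (software model, not hardware) of carry lookahead.
    # Each bit position is summarized as (generate, propagate); the combine
    # operator below is associative, so the pairs can be reduced in a
    # Kogge-Stone style tree of depth O(log(n)).

    def gp(a_bit, b_bit):
        return (a_bit & b_bit, a_bit ^ b_bit)     # (generate, propagate)

    def combine(lo, hi):
        """Merge a less-significant block with the more-significant block above it."""
        g_lo, p_lo = lo
        g_hi, p_hi = hi
        return (g_hi | (p_hi & g_lo), p_hi & p_lo)

    def carry_out(a, b, n=32):
        """Carry out of an n-bit add. Folded serially here for clarity;
        associativity is what licenses doing this as a parallel tree."""
        block = (0, 1)                            # identity: generate nothing, propagate
        for i in range(n):
            block = combine(block, gp((a >> i) & 1, (b >> i) & 1))
        return block[0]

    a, b = 0xFFFFFFFF, 0x00000001
    assert carry_out(a, b) == ((a + b) >> 32) & 1   # matches ordinary addition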

--------

Anyway, if you understand how Kogge-Stone carry lookahead works, then you understand how any associative operation can be parallelized (the carry-bit calculation being one such associative operation). The next step is realizing that "stepping a state machine" is also associative.

This is trickier to prove, so I'll just defer to the 1986 article "Data Parallel Algorithms" (http://uenics.evansville.edu/~mr56/ece757/DataParallelAlgori...), page 6 in the PDF / page 1175 in the lower corner.

There, Hillis / Steele show that stepping an FSM / regex matcher is an associative operation, and therefore can be implemented with a Kogge-Stone style parallel prefix (called a "prefix sum" or "scan" operation in that paper).


You're right about the implications of Kogge-Stone, but constant factors matter a lot.

In fixed width ISAs, you have N decoders and all their work is useful.

In byte granularity dynamic width ISAs, you have one (partial) decoder per byte of your instruction window. All but N of their decode results are effectively thrown away.

That's very wasteful.

The only way out I can see is if determining instruction length is extremely cheap. That doesn't really describe x86 though.

The other aspect is that if you want to keep things cheap, all this logic is bound to add at least one pipeline stage (because you keep it cheap by splitting instruction decode into a first partial decode that determines instruction length followed by a full decode in the next stage). Making your pipeline ~10% longer is a big drag on performance.


> In fixed width ISAs, you have N decoders and all their work is useful.

ARM isn't fixed width anymore. AESE + AESMC macro-op fuse into a singular opcode, for example.

IIRC, high-performance ARM cores are macro-op fusing and/or splitting instructions into micro-ops. It's not necessarily a 1-to-1 translation from instructions to micro-ops anymore for ARM.

But yes, a simpler ISA will have a simpler decoder. It's just that a lot of people seem to think it's a huge, possibly asymptotically huge, advantage.

-----------------

> The only way out I can see is if determining instruction length is extremely cheap. That doesn't really describe x86 though.

I just described an O(n) work / O(log(n)) latency way of determining instruction length.

Proposal: 1. Have an "instruction length decoder" on every byte coming in. Because this ONLY determines instruction length and is decidedly not a full decoder, it's much cheaper than a real decoder.

2. Once the "instruction length decoders" determine the instruction starts, have 4 to 8 complete decoders read instructions starting "magically" at the right spots.

That's the key to my proposal. Sure, the Kogge-Stone part is a bit more costly, but focus purely on instruction length and you're pretty much set to have cheap and simple full decoders down the line.
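
A toy software model of that two-stage idea (my own sketch; the encoding is made up, not real x86):

    # Toy model of the proposal (MADE-UP encoding, not real x86): stage 1 is
    # a cheap length guess that could run at every byte offset; the starts
    # are then resolved, and stage 2 points a handful of full decoders only
    # at the known instruction starts.

    def guess_length(code, i):
        """Cheap per-byte length guess: the top two bits of the first byte
        select a 1/2/3/4-byte instruction in this made-up format."""
        return (code[i] >> 6) + 1

    def find_starts(code):
        """Resolve instruction starts. Written as a serial walk for clarity;
        the associativity argument above is what would let hardware do this
        as a Kogge-Stone style scan instead."""
        starts, i = [], 0
        while i < len(code):
            starts.append(i)
            i += guess_length(code, i)
        return starts

    def full_decode(code, i):
        """Stand-in for the expensive full decoder, only ever run at starts."""
        return code[i:i + guess_length(code, i)].hex()

    code = bytes([0x05, 0x41, 0x07, 0xC1, 0x02, 0x03, 0x04])
    starts = find_starts(code)
    print(starts)                                   # [0, 1, 3]
    print([full_decode(code, i) for i in starts])   # ['05', '4107', 'c1020304']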


You're right in a sense, in that even really hairy decode problems like x86 only add 15% or so to a core's overall power usage.

But on the other hand, verification is a big NRE cost for chips, and proving that that second ISA works correctly in all cases is a pretty substantial engineering cost even if the resulting chip is much the same.


It likely wouldn't be 15% with an 8-wide x86 decoder, if such a thing were even possible within any reasonable clock budget. So in that sense a fixed-width ISA does buy something. Also, given that chips today are mostly power-limited, 15% of power usage is 15% that could be used to increase performance in other ways.


> The simplicity/orthogonality of the ARM ISA allows simpler instruction decoders. Simpler, faster, and easier to scale. Intel and AMD instruction decoders are impressive beasts that consume considerable manpower, chip space, and physical power

These claims are often made and indeed make some sense, but is there any actual evidence for them? To really know, you'd need access to the RTL of both cutting-edge x86 and Arm designs to work out what the decoders are actually costing in terms of power and area and whether they tend to produce critical timing paths. You'd also need access to the companies' project planning/timesheets to estimate the engineering effort for both (and chances are data isn't really tracked at that level of granularity; you'd also need a deep dive into their bug tracking to determine what is decoder-related, for instance, and estimate how much time has been spent dealing with decoder issues). I suspect Intel/AMD/Arm have no interest in making the relevant information publicly available.

You could attempt this analysis without access to RTL, but isolating the power cost of the decoder from the silicon alone sounds hard and potentially infeasible.


x86 die breakdowns put the area of the decoder as bigger than the integer ALUs. While unused ALUs can be power-gated, there's almost never a time when the decoders are not in use.

Likewise, parsing is a well-studied field and parallel parsing has been a huge focus for decades now. If you look around, you can find papers and patents around decoding highly serialized instruction sets (aka x86). The speedups over a naive implementation are huge, but come at the cost of many transistors while still not being as efficient or scalable as parallel parsing of fixed-length instructions. The insistence that parsing compressed, serial streams can be done for free mystifies me.

I believe you can still find where some AMD exec said that they weren't going wider than 4 decoders because the power/performance ratio became much too bad. If decoders weren't a significant cost to their designs (both in transistors and power), you'd expect widening to be a non-issue.

EDIT: here's a link to a die breakdown from AMD

https://forums.anandtech.com/threads/annotated-hi-res-core-d...


I'm sure that Arm must have done a lot of analysis around this issue when AArch64 was being developed.

After all, the relatively simple Thumb extension had been part of the Arm ISA for a long time (and was arguably one of the reasons for its success), and they still decided to go for fixed width.


Also, out of interest, do you have a link to an x86 die breakdown that includes decoder and ALU area? Looking at WikiChip, for example, they've got a breakdown for Ice Lake: https://en.wikichip.org/wiki/intel/microarchitectures/ice_la... but it doesn't get into that level of detail; a vague 'Execution units' region that isn't even given bounds is the best you get, and it's likely an educated guess rather than definite knowledge of what that bit of die contains. Reverse engineering from die shots can do impressive things, but what you see in public never seems to get to that level of detail, and it would likely be significant effort without assistance from Intel.


Here you go. This one combines a high-res die shot with AMD's die breakdown of Zen 2.

You'll notice that I understated things significantly. Not only is the decoder bigger than the integer ALUs, but it's more than 2x as big if you don't include the uop cache and around 3x as big if you do! It dwarfs almost every other part of the die except the caches and the beast that is load/store.

https://forums.anandtech.com/threads/annotated-hi-res-core-d...

Original slides.

https://forums.anandtech.com/threads/amds-efforts-involved-i...


Thanks, my mistake was searching exclusively for Intel microarchitectures. It'd be interesting to see if there are similar die breakdowns for other microarchitectures around; trawling through conference proceedings is likely to yield better results than a quick Google. I'm just skimming through Hot Chips as it has a publicly accessible archive (those AMD die breakdowns come from ISSCC, which isn't so easily accessible).

The decoder is certainly a reasonable fraction of the core area, though as a fraction of total chip area it's still not too much, as other units in the core are of similar or larger size (floating point/SIMD, branch prediction, load/store, L2), plus all of the uncore stuff (L3 in particular). Really we need a similar die shot of a big Arm core to compare its decoder size too. Hot Chips 2019 has a highlighted die plot of an N1 with different blocks coloured, but sadly it doesn't provide a key as to which block is which colour.


I guess I'm making the wrong argument. I'd agree it's clear an x86 decoder will be bigger, more power hungry, etc. than an Arm decoder. The real question is how much of a factor that is for the rest of the microarchitecture. Is the decoder dragging everything down, or is it just a pain you can deal with at some extra power/area cost that doesn't really matter? That's what I was aiming to get at in the call for evidence. Is x86 inherently inferior to Arm, unable to scale as well to large superscalar CPUs because the decoder drags you down, or is Apple just better at microarchitecture design (perhaps AMD's team would also fail to scale well beyond 4-wide with an Arm design, perhaps Apple's team could have built an x86 M1)?


I'd always assumed that decoding x86 and x64 had a lot in common, so there's not much to be saved there? Happy to be told otherwise.

Agreed that Arm (esp AArch64) must be a lot simpler.


As I understand it, the fact that x86 has variable-length instructions makes a significant difference. On ARM, if you want to look ahead to see what the next instructions are, you can just add a fixed offset to the address of the current instruction. If you want to do that on x86, you have to do at least enough work to figure out what each instruction is so that you know how large it is, and only then will you know where the next instruction begins. Obviously this is not very amenable to any kind of concurrent processing.


You can do quite a bit concurrently, but at the expense of hardware. You ‘just’ speculatively assume an instruction starts at every byte offset and start decoding.

Then, once you figure out that the instruction at offset 0 is N bytes, ignore the speculative decodings starting at offsets 1 through N-1, and tell the speculative decoding at offset N that it is good to go. That means that it in turn can inform its successors whether they are doing needless work or are good to go, etc.

That effectively means you need more decoders and, I guess, have to accept a (slightly?) longer delay in decoding instructions that are 'later' in this cascade.
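
A tiny software model of that cascade (the per-offset lengths here are made up, not real x86): every offset decodes speculatively, then the cascade keeps only the ones that turn out to start real instructions and discards the rest.

    # Toy model of the cascade (made-up lengths, not real x86): pretend every
    # byte offset has already speculatively decoded a length, then walk the
    # results to keep only the decodes that start real instructions.

    speculative_lengths = [1, 3, 2, 2, 1, 4, 1, 1]   # one guess per byte offset

    keep, i = [], 0
    while i < len(speculative_lengths):
        keep.append(i)                   # this offset is "good to go"
        i += speculative_lengths[i]      # offsets in between are thrown away

    print(keep)   # [0, 1, 4, 5] -- half the speculative decodes were wasted work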


IIRC, Intel actually does do that. They do a 16-way decode, with each decoder offset by a single byte.


> the cost of continuing 32 bit support on x86 is for Intel/AMD

Pretty sure they still have 16-bit support.

It's easier to migrate for a CPU mostly used for Android because it's already had a mix of architectures, there are fewer legacy business applications, and distribution through an app store hides potential confusion from users.


Another advantage the Android platform has specifically is that its applications mostly consist of bytecode, limiting the impact that phasing out 32-bit instructions has on the platform. Most high-performance chips are in Android devices like phones and tablets, where native libraries are more the exception than the rule.

Desktop relies on a lot of legacy support because even today developers make (unjustified) assumptions that a certain system quirk will keep working for the next few years. There's no good reason to drop x86 support and there's good reason to keep it. The PC world can't afford to pull an Apple because there's less of a fan cult around most PC products to help shift the blame onto developers when their favourite games stop working.


Run-anywhere was (and halfway still is) a huge selling point for Java. It was always clunky for desktop apps because of slow JVM warmup and nothing feeling native, but Android shifted enough of that into the OS that it's not a problem.


Good point on 16 bit!

I suppose I'm idly wondering at what point the legacy business applications become so old that the speed penalty of emulation becomes something that almost everyone can live with.


There is a lot of binary-only software for x86 that can still be run on a modern OS.

For AArch32, not so much.




