In designs with a L0 uOP cache, they clock gate the decoder when just running ou...

mhh__ · on Oct 2, 2021

Hence the question mark, wasn't it 3DNow!

The fact that it has to have a big streaming cache just for decoding puts it at a loss. Intel have only just gone 6 wide.

Is it the end of the world? No. Do I want a RISC machine? Yes

monocasa · on Oct 2, 2021

3DNow! used an immediate field for an extended opcode, but you still figured out the length of the instruction from the first opcode bytes, so it's not really a suffix. As far as the decoder was concerned, it was just another fixed-length, required imm field for the execution unit.
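
A rough sketch of that point, assuming 32-bit addressing and no prefixes: the length falls out of the `0F 0F` escape plus the ModR/M (and SIB) bytes alone, and the trailing byte is only consulted afterwards to pick the operation. The function name `decode_3dnow` and the two-entry mnemonic table are illustrative, not a full decoder.

```python
# Hypothetical helper: length-decode a 3DNow! instruction (0F 0F ModR/M ... imm8).
# Assumes 32-bit addressing and no prefixes; the table is deliberately tiny.
SUFFIX_OPS = {0x9E: "PFADD", 0xB4: "PFMUL"}

def decode_3dnow(code: bytes):
    if code[0:2] != b"\x0f\x0f":
        raise ValueError("not a 3DNow! escape")
    modrm = code[2]
    mod, rm = modrm >> 6, modrm & 7
    length = 3          # 0F 0F + ModR/M
    disp = 0
    if mod != 3 and rm == 4:
        sib = code[3]   # SIB byte follows ModR/M
        length += 1
        if mod == 0 and (sib & 7) == 5:
            disp = 4    # no base register, disp32 instead
    if mod == 0 and rm == 5:
        disp = 4        # absolute disp32
    elif mod == 1:
        disp = 1
    elif mod == 2:
        disp = 4
    length += disp + 1  # the imm8 is always present, so the length is already known
    suffix = code[length - 1]  # only now read the byte that selects the operation
    return length, SUFFIX_OPS.get(suffix, "unknown")
```

For example, `decode_3dnow(bytes([0x0F, 0x0F, 0xC1, 0x9E]))` is a register-to-register form, and the length is fixed at 4 before the `9E` byte is ever inspected.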

It doesn't have to have an L0 cache; a lot of designs don't.

And the decoder is wider than it seems. x86 instructions have more uOPs mainly from RMW instructions that would be three separate instructions in RISC. It'll be fun to compare Zen 4 (even at Ultrabook TDP) and M1 apples to apples once everyone has access to 5nm.

mhh__ · on Oct 2, 2021

The decoder is wider than it seems, but isn't the throughput lower for those instructions anyway? i.e. different decoders for different instruction complexities.

The designs M1 is competing with all have L0s, surely? I guess it could be masked off for a low power design but I would've thought not.

Either way, it's old and ugly. I don't miss it at all, even if ARM and friends' peripheral story probably isn't as good.

monocasa · on Oct 3, 2021

> The decoder is wider than it seems, but isn't the throughput lower for those instructions anyway? i.e. different decoders for different instruction complexities.

The short decoders don't typically have issues with ModR/M memory destination fields. It's a little confusing because there are typically at least two uOP formats inside the cores. The decoders generally discussed spit out uOPs that still look fairly x86 in the semantics they encode (2-address, can be RMW, etc.), just wide and fixed width. But by the time they've made their way to the functional units, they've been cracked into more uOPs, so AGU, LD, ALU, and ST can all go to different ports that can handle that work.
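
A toy model of the two formats, purely for illustration: `DecodedUop`, `ExecUop`, and `crack` are made-up names, and real cores differ in how and where they split things. It only shows one decoder-level RMW uOP turning into separate AGU, LD, ALU, and ST pieces.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DecodedUop:
    """Decoder output: still x86-shaped (2-address, may be read-modify-write)."""
    op: str
    dst: str
    src: str
    dst_is_mem: bool = False

@dataclass
class ExecUop:
    """What a functional unit sees after cracking."""
    unit: str   # "AGU", "LD", "ALU" or "ST"
    text: str

def crack(u: DecodedUop) -> List[ExecUop]:
    # Register destination: a single ALU op, nothing to split.
    if not u.dst_is_mem:
        return [ExecUop("ALU", f"{u.op} {u.dst}, {u.src}")]
    # Memory destination (RMW): address generation, load, compute, store.
    return [
        ExecUop("AGU", f"addr = {u.dst}"),
        ExecUop("LD", "tmp = load [addr]"),
        ExecUop("ALU", f"tmp = {u.op} tmp, {u.src}"),
        ExecUop("ST", "store [addr], tmp"),
    ]
```

So `crack(DecodedUop("add", "rbx+8", "rax", dst_is_mem=True))` takes one decode slot but yields four pieces of work, which is roughly why counting decoder width alone undersells it.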

> The designs M1 is competing with all have L0s, surely? I guess it could be masked off for a low power design but I would've thought not.

In most designs I've seen, they're actually most important as a power-saving feature. They shine in microbenchmarks for perf, but not as much for general code. The real win is clock gating all the instruction fetch and decode, most of which is shared with RISC too (think I-TLBs and ifetch before decode).