CISC is still an asset rather than a liability, though, as it means you can fit ...

chroma · on Dec 11, 2018

I don't think that's an advantage these days. The bottleneck seems to be decoding instructions, and that's easier to parallelize if instructions are fixed width. Case in point: The big cores on Apple's A11 and A12 SoCs can decode 7 instructions per cycle. Intel's Skylake can do 5. Intel CPUs also have μop caches because decoding x86 is so expensive.

jabl · on Dec 11, 2018

Maybe the golden middle path is compressed risc instructions. E.g the risc-v C extension, where the most commonly used instructions take 16 bits, and the full 32-bit instructions are still available. Density is apparently slightly better than x86-64, while being easier to decode.

(yes, I'm aware there's no high performance risc-v core available (yet) comparable to x86-64 or power, or even the higher end arm ones)

gpderetta · on Dec 11, 2018

Sure, but intel CISC instructions can do more, so in the end is a wash.

chroma · on Dec 11, 2018

That's not the case. Only one of Skylake's decoders can translate complex x86 instructions. The other 4 are simple decoders, and can only transform a simple x86 instruction into a single µop. At most, Skylake's decoder can emit 5 µops per cycle.[1]

1. https://en.wikichip.org/wiki/intel/microarchitectures/skylak...

jawnv6 · on Dec 11, 2018

... so what? most code's hot and should be issued from the uop cache at 6uop/cl with "80%+ hit rate" from your source

you're really not making the case that "decode" is the bottleneck, are you unaware of the mitigations that x86 designs have taken to alleviate that? or are those mitigations your proof that the ISA's deficient

Symmetry · on Dec 11, 2018

That really isn't true in the modern world. x86 has things like load-op and large inline constants but ARM has things like load or store multiple, predication, and more registers. They tend to take about the same number of instructions per executable and about the same number of bytes per instruction.

If you're comparing to MIPS then sure x86 is more efficient. And x86 is instruction do more than RISC-V but most high performance RISC-V uses instruction compression and front end fusion for similar pipeline and cache usage.

gpderetta · on Dec 11, 2018

(generalized) predication is not a thing in ARM64. Is Apple cpu 7 wide even in 32 bit mode?

It is true though, as chroma pointed out, that intel can't decode load-op as full width.

wtallis · on Dec 11, 2018

You can fit more code into the same sized cache, but you also need an extra cache layer for the decoded µops, and a much more complicated fetch/decode/dispatch part of the pipeline. It clearly works, at least for the high per-core power levels that Intel targets, but it's not obvious whether it saves transistors or improves performance compared to having an instruction set that accurately reflects the true execution resources, and just increasing the L1i$ size. Ultimately, only one of the strategies is viable when you're trying to maintain binary compatibility across dozens of microarchitecture generations.

gpderetta · on Dec 11, 2018

The fact is that a postdecode cache is desirable even on an high performance RISC design as even there skipping decode and fetch is desirable for both performance and power usage.

IBM Power9 for example has a predecode stage before L1.

You could say that, in general, riscs can get away without extra complexity for a longer time while x86 must implement it early (this is also true for example for memory speculation due to the more restrictive intel memory model, or optimized hardware TLB walkers), but in the end it can be an advantage for x86 (more mature implementation).

Symmetry · on Dec 11, 2018

In theory, yes. In practice x86-64, while it was the right solution for the market, isn't a very efficient encoding and doesn't fit any more code in cache than pragmatic RISC designs like ARM. It still beats more purist RISC designs like MIPS but not by as much as pure x86 did.

It would be easy to design a variable length encoding scheme that was self-synchronizing and played nicely with decoding multiple instructions per clock. But legacy compatibility means that that scheme will not be x86 based.

bogomipz · on Dec 11, 2018

>"It would be easy to design a variable length encoding scheme that was self-synchronizing and played nicely with decoding multiple instructions per clock."

How might a self-synchronizing encoding scheme work? How could a decoder be divorced from the clock pulse? I am intrigued by this idea.

Symmetry · on Dec 11, 2018

What I mean is self-synchronizing like UTF-8. For example the first bit of a byte being 1 if its the start of an instruction and 0 otherwise. Just enough to know where the instruction starts are without having to decode the instructions up to that point and so that a jump to an address that's the middle of an instruction can raise a fault. Checking the security of x86 executables can be hard sometimes because reading a string of instructions started from address FOO will give you a stream of innocuous instructions whereas reading starting at address FOO+1 will give you a different stream of instructions that does something malicious.

jawnv6 · on Dec 11, 2018

sure, so what's your 6 byte equivalent ARM for FXSAVE/FXRSTOR?

n-gatedotcom · on Dec 11, 2018

What is an example of a commonly used complex instruction that is "simplified"/DNE in RISC? (in asm, not binary)

gpderetta · on Dec 11, 2018

Load-op instructions do not normally exist on RISC, but are common on CISC.

adrianN · on Dec 11, 2018

You do have to worry about the µ-op cache nowadays.