So, in @rygorous's excellent Twitch streams about CPU architecture (first one he...

abainbridge · on Oct 18, 2018

x86 really does decode CISC into RISC-like instructions. They're called micro-ops. Some of the instruction cache stores these translated instructions. People research the details of this. See https://www.agner.org/optimize/blog/read.php?i=142&v=t

The article looked about right to me.

I didn't watch the (3 hour!) video you linked to. Can you give the time offset where the myth you refer to is explained?

gpderetta · on Oct 18, 2018

Intel uops aren't really RISCy at all, at least in the generations since the P4: if you look at Agner's tables, you'll see that even complex load-op operations still map to 1 (fused) micro-op in the fused domain, and they are only broken down when dispatched to execution units (instruction breakdown was performed even in early CPUs, before the CISC/RISC nomenclature was a thing). IIRC decoded uops are not even fixed size in the post-decode cache: large constants take an additional slot.

Separate load and op instructions and fixed-size instructions are pretty much the only things left differentiating RISC and CISC architectures (there is nothing reduced about modern RISCs), so I do not think the claim that x86 CPUs are RISC inside holds.

I think that Agner, who knows what he is talking about, is just being loose with terminology.

In the grand scheme of things it just doesn't matter, it is simply a name. I just dislike it when the x86-is-a-RISC meme gets repeated, as if being a RISC somehow is a virtue in itself.

abainbridge · on Oct 18, 2018

I bow in deference to your superior knowledge.

Back in the late 80s, reducing your instruction set was a good idea because it meant you could spend the transistor budget on other things, like pipelines and caches. RISC came to be seen as a virtue in itself.

When x86 meant the 80286 and was CISC, and MIPS and ARM were RISC, x86 was just bad and wrong. Nowadays x86 is fast and good.

As you kinda said, almost everything about the 1980s definition of RISC has ceased to be true. The only thing left of Patterson and Hennessy's RISC ideas is that they encouraged proper analysis of how real programs use the instruction set (and cache etc), rather than just adding a bunch more instructions to please some assembly-writing customers and aiming for a better Dhrystone score. If we define RISC to mean "doing proper analysis", then x86-is-a-RISC-machine is true :-)

wolfgke · on Oct 18, 2018

> As you kinda said, almost everything about the 1980s definition of RISC has ceased to be true.

A central difference that still exists is that RISC processors are typically load/store architectures. That means that before an operand that exists in memory can be used, it has to be transferred to a register.

This means that an instruction like

  add eax, [ecx]

does not work, say, under ARM. Under ARM, you have to use

  ldr r1, [r1]
  add r0, r0, r1

Intel found out that using a memory address as both source and target is a bad idea

  add [ecx], eax

(since it needs 3 phases: load value from memory, do instruction, store back). No such instructions thus exist in MMX, SSE..., AVX..., ... On the other hand, Intel still believes that using a memory operand as source only is quite a good idea on x86 (look at the encoding of SSE..., AVX..., AVX-512). Nevertheless: having the capability to do such a complicated instruction atomically is very useful for multithreading; consider for example

  lock add [ecx], eax

which atomically adds eax to the value stored at the memory address in ecx.
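To make the three phases concrete, here is a minimal Python sketch (a toy model, not real hardware). It assumes memory is a plain dict and that two "cores" interleave their load/add/store steps in a fixed order; the helper names `interleaved_add` and `atomic_add` are made up for illustration. It shows how a non-atomic read-modify-write can lose an update, which is what the lock prefix is there to prevent.

```python
# Toy model (an assumption for illustration): memory is a dict, and each
# "core" keeps the loaded value in a private temporary between phases.

def load(mem, addr):
    return mem[addr]

def store(mem, addr, value):
    mem[addr] = value

def interleaved_add(mem, addr, a, b):
    """Two cores each run 'add [addr], x' as separate phases, interleaved."""
    t0 = load(mem, addr)      # core 0: load
    t1 = load(mem, addr)      # core 1: load (still sees the old value)
    store(mem, addr, t0 + a)  # core 0: add, then store
    store(mem, addr, t1 + b)  # core 1: add, then store (overwrites core 0)

def atomic_add(mem, addr, x):
    """Models 'lock add [addr], x': the phases cannot be interleaved."""
    store(mem, addr, load(mem, addr) + x)

mem = {0x1000: 10}
interleaved_add(mem, 0x1000, 1, 2)
lost = mem[0x1000]            # core 0's increment is gone

mem = {0x1000: 10}
atomic_add(mem, 0x1000, 1)
atomic_add(mem, 0x1000, 2)
kept = mem[0x1000]            # both increments survive
```

In this model `lost` ends up as 12 while `kept` is 13; the interleaving order is chosen by hand here, whereas on real hardware it depends on timing.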

Also, a very typical distinction (that Intel only dropped with AVX on) is that CISC CPUs typically use 2 operands per instruction (of which one may be memory) and RISC CPUs have 3-operand instructions. So

  add r0, r1, r2

works on ARM, but under x86, only instructions that were introduced from AVX on (i.e. those using a VEX (VEX2 or VEX3) prefix or, for AVX-512, an EVEX prefix; I would have to look up whether something like that is also possible with an XOP prefix) have this capability.

Also very often, CISC instruction sets offer complicated addressing modes, such as in x86

  mov edx, [ecx+4*eax]

It is not completely clear whether this is worth the complexity or not. On one hand, such instructions are hard to use for a compiler (which is the central reason why they were abolished in RISC architectures). On the other hand, skilled programmers can use them to write quite elegant and fast code.
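For readers less used to x86 syntax, the addressing mode above computes base + scale*index and then loads from that address. A minimal Python sketch, assuming a flat little-endian byte array as memory (the helper names `effective_address` and `load32` and the memory layout are hypothetical):

```python
import struct

def effective_address(base, index, scale=1, disp=0):
    """base + index*scale + disp, truncated to 32 bits."""
    assert scale in (1, 2, 4, 8)  # the scale factors x86 can encode
    return (base + index * scale + disp) & 0xFFFFFFFF

def load32(mem, addr):
    """Read a little-endian 32-bit value from a bytes-like 'mem'."""
    return struct.unpack_from("<I", mem, addr)[0]

# Hypothetical memory image: four 32-bit ints stored from offset 8 onwards.
mem = bytearray(8) + struct.pack("<4I", 100, 200, 300, 400)

ecx, eax = 8, 2  # ecx = array base, eax = element index
edx = load32(mem, effective_address(ecx, eax, scale=4))  # mov edx, [ecx+4*eax]
```

Here `edx` picks up the element at index 2 of the array, i.e. the single instruction does what indexing an array of 4-byte elements does in C.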

TLDR: A central difference that still exists is that

- RISC architectures are load-store architectures

- on CISC architectures 2 operands (1 can be memory address) are typically used and "feel more natural"

- on RISC architectures, instructions typically have 3 operands.

- CISC architectures often support many more, and more complicated, addressing modes than RISC.

dfox · on Oct 18, 2018

The main point of RISC architectures is that they are trivially pipelineable, to the extent that making a non-pipelined implementation does not make much sense. All the architecture-visible differences from CISC are motivated by that. Load-store gets you a well-defined subset of instructions that access memory and have to be handled specially; 3-operand arithmetic and a zero register simplify hazard detection and result forwarding logic, and so on.

wolfgke · on Oct 18, 2018

> The main point of RISC architectures is that they are trivially pipelineable

This was the idea behind the original MIPS (the textbook example of a RISC processor - both literally and metaphorically). Unluckily this led to the problem that details of the internal implementation leaked into the instruction set. Just google for 'MIPS "delay slot"'. When, in later implementations of MIPS, this delay slot was not necessary anymore, you still had to pay attention to this obsolete detail when writing assembly code.

The lesson that was learned is that implementation details should not leak into the instruction set.
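For anyone who has not met delay slots: the instruction placed directly after a branch is executed before control actually transfers. A toy Python sketch of that behaviour, using a made-up two-instruction format (`("op", name)` and `("jump", target_index)`) and assuming the delay slot never holds another jump:

```python
def run(program, delay_slot):
    """Run a tiny made-up instruction list; return the names of executed ops."""
    trace = []
    pc = 0
    while pc < len(program):
        kind, arg = program[pc]
        if kind == "jump":
            if delay_slot and pc + 1 < len(program):
                # The instruction after the branch runs before the jump lands.
                trace.append(program[pc + 1][1])
            pc = arg
        else:
            trace.append(arg)
            pc += 1
    return trace

program = [
    ("op", "a"),
    ("jump", 4),
    ("op", "b"),  # sits in the delay slot
    ("op", "c"),
    ("op", "d"),
]

with_slot = run(program, delay_slot=True)
without_slot = run(program, delay_slot=False)
```

With the delay slot modelled, "b" runs even though the jump skips over it; without it, "b" is never executed. Assembly written for the first behaviour has to keep accounting for it on every later implementation.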

Next: what kind of pipeline are we even talking about? It is often very convenient to offer multiple kinds of pipelines depending on the intended usage of the processor. For example, for low-power or realtime applications, an in-order pipeline is better suited. On the other hand, for high-performance applications, an out-of-order pipeline is better suited. For example, ARM offers multiple different IP cores for the same instruction set with different pipelines.

Finally, note that the more regular and easier-to-decode instruction sets of typical RISC CPUs (ARM is explicitly not a typical one in this sense, in particular considering T32) often lead to bigger code than, say, x86. This turned into a problem when CPUs became much faster than memory (indeed, some people say this was an important reason why people today think much more critically about RISC). This is also the reason why RISC-V additionally provides the optional "C" Standard Extension for Compressed Instructions (RVC). Take a look at

> https://riscv.org/specifications/

The authors claim in the beginning of chapter 12 of "User-Level ISA Specification": "Typically, 50%–60% of the RISC-V instructions in a program can be replaced with RVC instructions, resulting in a 25%–30% code-size reduction.".
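The two percentages in that quote are consistent with each other: an RVC instruction is 16 bits instead of 32, so every replaced instruction saves half its size. A quick Python check, assuming a program that otherwise consists only of 32-bit instructions (the function name is made up):

```python
def rvc_code_size_reduction(fraction_compressed):
    """Fraction of code size saved when the given fraction of 32-bit
    instructions is replaced by 16-bit RVC instructions."""
    base_bits = 32
    rvc_bits = 16
    new_avg_bits = (fraction_compressed * rvc_bits
                    + (1 - fraction_compressed) * base_bits)
    return 1 - new_avg_bits / base_bits

low = rvc_code_size_reduction(0.50)   # 50% of instructions compressed
high = rvc_code_size_reduction(0.60)  # 60% of instructions compressed
```

Replacing 50% of the instructions gives a 25% reduction and 60% gives 30% (up to floating-point rounding), which matches the quoted range.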

> 3-operand arithmetics and zero register simplifies hazard detection

Despite the 3-operand format of ARM, at least the A32 and T32 instruction sets offer 2 additional parts for many instructions:

1. conditional execution: for example, ADDNE is only executed when the Z(ero) flag is not set. There are 15 variants for conditional execution, including "always".

2. "S" suffix for many instructions: causes the instruction to update the flags. For example, SUBS causes the processor to update the flags while SUB does not.

The conditional execution was to my knowledge dropped in ARM64 because branch predictors got good enough.
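How the "S" suffix and a condition code interact can be sketched with a toy Python model. It is not an ARM emulator: it assumes a register dict and only the Z flag, and the class and method names are invented for illustration.

```python
class ToyArm:
    """Toy model: four registers and a single Z flag."""

    def __init__(self):
        self.r = {f"r{i}": 0 for i in range(4)}
        self.z = False

    def sub(self, rd, rn, rm, s=False):
        result = (self.r[rn] - self.r[rm]) & 0xFFFFFFFF
        self.r[rd] = result
        if s:  # SUBS updates the flags; plain SUB leaves them alone
            self.z = (result == 0)

    def add(self, rd, rn, rm, cond="al"):
        # cond="ne" executes only when Z is clear; "al" means always
        if cond == "ne" and self.z:
            return
        self.r[rd] = (self.r[rn] + self.r[rm]) & 0xFFFFFFFF

cpu = ToyArm()
cpu.r.update(r1=5, r2=5, r3=7)
cpu.sub("r0", "r1", "r2", s=True)     # SUBS r0, r1, r2: r0 = 0, Z set
cpu.add("r0", "r3", "r3", cond="ne")  # ADDNE r0, r3, r3: skipped
skipped = cpu.r["r0"]

cpu.r["r2"] = 4
cpu.sub("r0", "r1", "r2", s=True)     # SUBS r0, r1, r2: r0 = 1, Z clear
cpu.add("r0", "r3", "r3", cond="ne")  # ADDNE r0, r3, r3: executes
executed = cpu.r["r0"]
```

In the first run the ADDNE is skipped without any branch and r0 stays 0; in the second it executes and r0 becomes 14.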

So: ARM has other things in the instruction set to avoid pipeline stalls. 3-operand instructions are not among them. Rather, the reason for 3-operand instructions is that this instruction format allows the compiler to generate efficient code much more easily.

dfox · on Oct 18, 2018

The stall detection logic remark was meant in the context of a traditional MIPS-style in-order, single-issue pipeline executing a regularly encoded instruction set, where the mentioned features lead to both a smaller implementation of the detection logic itself (which for the traditional MIPS is the bulk of the control logic) and simpler routing of the signals involved.

On the other hand, I completely agree that MIPS-style delay slots are simply a bad idea. But for me, ARM's conditional execution and single flags register are a similarly bad idea that stems from essentially the same underlying thought.

abainbridge · on Oct 18, 2018

Damn, you are right. I thought the load-store architecture was no more in ARM Thumb-2. I was wrong. Thanks for the info.

blattimwind · on Oct 18, 2018

It's only a superficial analogy. Micro-ops were RISC-like in the sense that they used to do one / few things. But their implementation is unlike RISC, micro-ops typically being very large (100+ bits wide) and not even necessarily of fixed length; you may imagine a specific bit in a micro-op more or less directly controlling a certain control line somewhere in an execution unit. Conversely micro-ops also can do a whole bunch of things at the same time.

dfox · on Oct 18, 2018

You have to distinguish between micro-instructions in the sense of "a line in the microcode store", which for horizontally microcoded CPUs contain bits that more or less directly map onto datapath control signals, and micro-operations in the superscalar x86 sense, which typically are more or less a reformulation of x86 instructions into something that is both easier to execute in parallel (which involves breaking instructions into their constituent suboperations, which are RISC-like in the load-store sense, not in the other RISC characteristics) and maps better to the actual execution units (which may involve combining instructions).

deepnotderp · on Oct 18, 2018

This is a point of much contention: micro-ops are NOT RISC-like (some are even variable length). However, the one argument that does have some merit is that RISC is supposed to be load/store, and internally x86 CPUs are load/store m.