
This is no doubt obvious to hardware folks, but one enlightening moment is when I came to understand register renaming.

Previously I had the (wrong) idea that rdi, rsi, etc. corresponded to physical bits. Register renaming involved some exotic notion where these registers might be saved and restored.

Now I understand that rdi, rsi, etc. are nothing but compressed node identifiers in an interference graph. Ideally we'd have infinite registers: each instruction would read from previous registers and output to a new register. And so we would never reuse any register, and there'd be no data hazards.

Alas we have finite bits in our ISA, so we must re-use registers. Register renaming is the technique of reconstructing an approximate infinite-register interference graph from its compressed form, and then re-mapping that onto a finite (but larger) set of physical registers.
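
A toy sketch of that reconstruction (hypothetical Python, nothing like the actual hardware): every destination gets a fresh physical register, and sources are looked up in the current mapping, so the false hazards introduced by register reuse disappear:

    # Toy model: instructions are (dest, src1, src2, ...) over a few architectural regs.
    # Giving every write a brand-new "physical" register leaves only true
    # (read-after-write) dependencies behind.
    def rename(instrs):
        rat = {}                                    # architectural name -> physical reg
        next_preg = 0
        out = []
        for dest, *srcs in instrs:
            psrcs = [rat.get(s, s) for s in srcs]   # look up current mappings
            pdest = "p%d" % next_preg               # never reuse a destination
            next_preg += 1
            rat[dest] = pdest
            out.append((pdest, *psrcs))
        return out

    # Both instructions write rax, but after renaming there is no WAW hazard:
    print(rename([("rax", "rdi", "rsi"), ("rax", "rdi", "rdx")]))
    # [('p0', 'rdi', 'rsi'), ('p1', 'rdi', 'rdx')]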

Mostly "register renaming" is a bad name. "Dynamic physical register allocation" is better.




"Now I understand that rdi, rsi, etc. are nothing but compressed node identifiers in an interference graph. Ideally we'd have infinite registers: each instruction would read from previous registers and output to a new register. And so we would never reuse any register, and there'd be no data hazards"

I'm under the impression that we don't see the causality flowing in the same direction.

To me, we first had a limited set of registers (imposed by the ISA); then, to get better perf through out-of-order execution, CPUs had to infer a dependency graph and use register renaming.

Ironically all this silicon is spent to recover information that was known to the compiler (eg through SSA) and lost during codegen (register allocation).


The space in an instruction is very limited. (If the representation of an instruction needs more bits, you need more bandwidth, more cache space, etc.) So it can be beneficial to address only 8 registers and then detect spills to RAM, etc. It can even be beneficial to specify only 2 registers (a = a + b instead of a = b + c) and replace a copy between registers with a rename.
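
A toy illustration of that last point (made-up structures, not any particular core's mechanism): a register-to-register copy doesn't need an execution unit at all if the rename table simply points both architectural names at the same physical register, which is roughly what "move elimination" does:

    # Toy move elimination: rax = rbx becomes a rename-table update,
    # not a data movement.
    rat = {"rbx": "p7"}           # rbx currently lives in physical register p7

    def rename_mov(dst, src):
        rat[dst] = rat[src]       # copy the mapping, not the value

    rename_mov("rax", "rbx")
    print(rat)                    # {'rbx': 'p7', 'rax': 'p7'}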


Exactly. In principle even memory can be renamed, although I'm not sure any current CPU actually does it (there are rumors). It would be great if the actual SSA graph could be directly passed from the compiler to the CPU, but what's saved by getting rid of renaming would probably be used to handle the much harder decoding. It would probably have implications for context switching overhead.


And essentially that's the idea behind VLIWs with no interlocks: one I worked on made the output registers of the execution stages architecturally visible.


You’re really reconstructing a data dependence graph, not an interference graph. In an interference graph, there is an edge between nodes that are alive at the same time. That information remains implicit in the OOO machine (because a physical register stays allocated to a value through retirement). In a data dependence graph there are edges from operations to their operands. You need to reconstruct the data dependence links explicitly—during renaming you need to map the input operands to the correct renamed physical registers. This is where “renaming” comes from. You “rename” the architectural register operands in each instruction with the corresponding physical register or tag: https://www.d.umn.edu/~gshute/arch/register-renaming.xhtml. After renaming, the data dependence graph will be explicit in the reorder buffer.
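
A small sketch of that distinction (toy code, not a real microarchitecture): during renaming, each source operand is mapped to the instruction that produces it, and those producer-to-consumer links are exactly the edges of the data dependence graph:

    # Toy renaming pass that records, for each source, which earlier
    # instruction (if any) produces it -- the explicit dependence edges.
    def dependence_edges(instrs):
        producer = {}                                # architectural reg -> instr index
        edges = []
        for i, (dest, *srcs) in enumerate(instrs):
            for s in srcs:
                if s in producer:
                    edges.append((producer[s], i))   # edge: producer -> consumer
            producer[dest] = i
        return edges

    # 0: rax = rdi + rsi    1: rcx = rax + 1    2: rax = rdx + 1
    print(dependence_edges([("rax", "rdi", "rsi"), ("rcx", "rax"), ("rax", "rdx")]))
    # [(0, 1)] -- the only true dependence is instr 1 reading instr 0's result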


A couple more points. The architectural registers are real places. Say you write to RDX, don’t do anything with it for a million instructions, then read from it. Where does the data come from? There is an architectural register file and RDX is a specific register in it. The renaming is only temporary. An instruction doesn’t hang on to a ROB entry and rename register until every instruction that may use it has executed. When the instruction is retired in order, the value is written back to the architectural register file, and the rename register and ROB entry are freed for some other instruction. So the renaming is ephemeral, only lasting until retirement.
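
A toy sketch of that retirement step (this is the ROB-plus-architectural-register-file style described here; as a reply below points out, PRF-based designs handle this differently): on in-order commit, the speculative result is copied into the architectural register file and the ROB entry is recycled:

    # Toy in-order retirement for a ROB+ARF design: the renaming is only
    # temporary, and committed values end up in the architectural file.
    arf = {"rdx": 0}                      # architectural register file
    rob = [                               # oldest entry first
        {"dest": "rdx", "value": 42, "done": True},
        {"dest": "rax", "value": None, "done": False},
    ]

    while rob and rob[0]["done"]:
        entry = rob.pop(0)                # retire strictly in program order
        arf[entry["dest"]] = entry["value"]

    print(arf)                            # {'rdx': 42} -- rax not yet committed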


> A couple more points. The architectural registers are real places. Say you write to RDX, don’t do anything with it for a million instructions, then read from it. Where does the data come from? There is an architectural register file and RDX is a specific register in it.

This was true on all the P6-derived cores, but is no longer true on any modern CPU that uses a PRF.

On PRF CPUs (SNB and everything after it), the final backing store of register data is the PRF, and the architectural registers are just pointers into this file. Every instruction gets renamed in the frontend, and there simply is no architectural register file.

Also, the ROB does a very different task on those CPUs, to the point that still calling it a ROB is misleading. The current architectural register names are held in the frontend, as part of the register rename hardware, not in the ROB. The ROB holds on to instruction status, and the PRF register names for all the sources. So a used PRF entry can be garbage collected when it is no longer referenced by either the architectural name file or the ROB.
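
A toy version of that reclamation rule (hypothetical bookkeeping, not Intel's actual mechanism): a physical register can go back on the free list once no architectural name maps to it and nothing still in the ROB reads it:

    # Toy garbage collection of PRF entries.
    prf = {"p0", "p1", "p2", "p3"}
    rat = {"rax": "p2", "rbx": "p3"}          # current architectural names
    rob_sources = [{"p2"}, {"p3", "p1"}]      # sources of in-flight uops

    live = set(rat.values()).union(*rob_sources)
    free_list = prf - live
    print(free_list)                          # {'p0'} can be reallocated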

For a more detailed explanation, read: https://www.realworldtech.com/sandy-bridge/5/


That's correct, but a bit in the weeds. I was addressing OP's statement: "rdi, rsi, etc. are nothing but compressed node identifiers in an interference graph." In a modern CPU, this is true only for speculative state. If I have "ADD RAX, 1" and RAX was last written by an in-flight operation, I don't care about RAX qua RAX. It's just a name that connects an operation that produces a value to an operation that consumes it. However, if the instruction that last wrote RAX was committed long ago, what we care about is the architectural state of RAX. That distinguishes modern OOO CPUs from pure data flow machines.

This fact is easiest to see in a ROB design: https://courses.cs.washington.edu/courses/cse378/10au/lectur... (page 7). The front-end rename table may point either to a ROB entry, or the architectural register file. Note, however, that the basic distinction exists in a PRF design too--some physical registers will hold architectural values (pointed to by the RRAT or similar structure) while others will hold speculative values. We can quibble about terminology, but there will be a subset of the PRF, that is pointed to by the RRAT, that contains the machine's non-speculative, architectural state.


> However, if the instruction that last wrote RAX was committed long ago, what we care about is the architectural state of RAX.

What you care about is which PRF entry will contain the valid value of RAX for the instruction going through rename. Unlike in earlier designs, post-SNB ones always do exactly the same thing during rename, that is, allocate a clean PRF register as destination and read the PRF number currently stored under the name RAX in the RAT and store that as source. Whether that entry was written to by the previous instruction, or has been stale for a thousand instructions is entirely irrelevant.

The ROB design document you posted is explicitly of an older system, that is different from modern ones. On post-SNB designs, the rename table entry and ROB entries can only point to the PRF (the ROB no longer ever contains register state!), and there is no architectural register file.

> Note, however, that the basic distinction exists in a PRF design too--some physical registers will hold architectural values

While technically true, this is extremely misleading. The architectural values reside in whatever rename register was last assigned to them. For the entire pipeline, there is no distinction between architectural registers and the rest of them.

My original complaint is about this sentence in your OP:

> When the instruction is retired in order, the value is written back to the architectural register file

Which is explicitly false of every Intel CPU released since 2011 (except maybe some Atoms?). PRF machines never move values out of the PRF. It is the final source of truth for all registers. When "ADD RAX, 1" is executed, a new PRF entry is allocated for the result, and once that is written, it remains where RAX lives until another instruction that stores to RAX is executed, no matter how far away that is.


> What you care about is which PRF entry will contain the valid value of RAX for the instruction going through rename... The ROB design document you posted is explicitly of an older system, that is different from modern ones.

You're getting caught up in where the data is stored, while I'm talking about the nature of the data conceptually, vis-a-vis OP's analogy between register renaming and graphs. I didn't post the ROB example to disagree with you about how post-SNB CPUs work re: PRF (your description is correct). I posted it because ROBs are a good illustration of the difference between speculative values and committed values, since they store speculative values in one place (the ROB) and committed values in another (the ARF). The same distinction exists with PRFs (see below), but it's harder to see.

> The architectural values reside in whatever rename register was last assigned to them. For the entire pipeline, there is no distinction between architectural registers and the rest of them.

Architectural registers are different because they will have an RRAT entry pointing to them. And there certainly is a distinction between PRF entries that hold architectural state and PRF entries that hold speculative state, e.g. when there is an exception. That's why the RRAT exists in the first place--so the CPU can identify the subset of the PRF that holds the architectural state. If there was "no distinction" you wouldn't have an RRAT.

Concrete example:

    MOV EAX, 15
    -------------- < committed
    MOV ECX, [EBX]
    IMUL EAX, 5
    ADD EAX, 1
    -------------- < renamed

Say the first instruction has committed, the second has issued, and the others are waiting to issue. Say EAX in the first instruction is allocated to PR0, in the third it's PR1, and in the fourth it's PR2. The RRAT has an entry for EAX that points to PR0. The front-end RAT has an entry for EAX that points to PR2. PR0 holds committed state. PR1 and PR2 (will) hold speculative state. The reservation stations and ROB encode a data-flow dependency between the fourth and third instructions. (PR1 is an operand to the fourth instruction, which produces PR2). There is no data dependence graph for anything that's committed, however. It's just values in architectural registers.

If the load throws a page fault, what happens? The machine grabs the architectural state from the RRAT. Execution restarts with PR0 in the front-end RAT entry for EAX, and the page-fault handler sees EAX == 15. If the third and fourth instructions issued in the meantime, their results are thrown away. That is the difference between architectural and speculative state.
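
A toy replay of that recovery (made-up structures, just to make the step concrete): on the fault, the frontend RAT entry for EAX is overwritten from the RRAT, so the handler reads the committed value out of PR0:

    # Toy exception recovery for the example above.
    prf = {"PR0": 15, "PR1": None, "PR2": None}   # PR1/PR2 never became valid
    rrat = {"EAX": "PR0"}                         # last committed mapping
    rat = {"EAX": "PR2"}                          # speculative mapping at rename

    # The load faults: squash in-flight work, restore the committed mappings.
    rat.update(rrat)
    print(prf[rat["EAX"]])                        # 15 -- what the fault handler sees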

> Whether that entry was written to by the previous instruction, or has been stale for a thousand instructions is entirely irrelevant.

It's relevant to the analogy that OP was making between register renaming and graphs. In a data dependence graph, all data flow can be represented with edges from the operations that produce operands to the operations that consume those operands. In an out-of-order machine, that graph exists (after renaming) only for instructions that are still in flight. For instructions that were committed long ago, the graph is collapsed to the result, denoted by an architectural register.

> Which is explicitly false of every Intel CPU released since 2011

The point (for purposes of responding to the OP's analogy) is that you need to update the architectural state when an instruction retires. The fact that "post-2011 Intel CPUs" do this by flipping a pointer rather than copying the data is an implementation detail that is educational, but not really relevant.


Interesting. I always figured there was just a lookup table mapping ISA registers to physical registers that was used by instruction decode, a bit vector (one bit per physical register) indicating which physical registers were in use, and each micro-op would carry with it which physical register(s) would be freed up upon instruction retirement.

If a register might or might not be renamed, what's the advantage of usually storing a given register in a given location? Is the main reason power savings, so checking a single bit to check if a register has been renamed, before paying the power and time to perform a lookup in the renaming table? Is access to a renamed register then slower than access to a non-renamed register?


You need to be able to recover architectural register state (e.g. if there is a branch misprediction, or an instruction throws an exception). Many implementations store speculative results in a reorder buffer, and write to an architectural register file on commit. The architectural register file always reflects the committed state of the processor. In such an implementation, your renaming table (the "lookup table mapping ISA registers to physical registers") might point to either a ROB entry, if the most recent write to an architectural register was an instruction that is still in flight, or the architectural register file.
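
A toy version of that lookup (hypothetical structures): when an instruction is renamed, each source is resolved either to the in-flight ROB entry the table points at, or to the architectural register file if the last writer has already committed:

    # Toy source lookup in a ROB+ARF design.
    arf = {"rsi": 7}
    rename_table = {"rdi": 3}                     # rdi's last writer is ROB entry 3

    def read_source(reg):
        if reg in rename_table:                   # still speculative: get a tag
            return ("wait_for_rob_entry", rename_table[reg])
        return ("value", arf[reg])                # committed: read the ARF

    print(read_source("rdi"))                     # ('wait_for_rob_entry', 3)
    print(read_source("rsi"))                     # ('value', 7)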

Another way to do it is to use a physical register file to store data, as Tuna-Fish explains above. In that case, your front-end table always points into the PRF, but it may be a speculative value (written by an instruction that's still in the ROB and not yet committed) or a committed value. But that doesn't let you recover the architectural state on an exception. So you keep a second table (in the Pentium 4, it's called the RRAT), and whenever an instruction commits, you update that table to indicate that the most recent committed version of a particular architectural register can be found in a particular physical register.

There is at least a third way to do it, which doesn't use an RRAT. MIPS R10k rolls back the ROB to recover the last committed architectural state.


You are more or less correct; he was wrong. Architectural registers are just pointers to the PRF these days.


Are you sure about that? The HDL I've seen doesn't appear to do it that way. If I'm reading BOOM's source correctly, for instance, there's no separate architectural register file.


As was pointed out once in another thread (which I don't recall), almost all aspects of assembly programming on x86 aren't really low-level anymore. It's just another abstracted high-level language, although one that still mirrors/is tied to the fundamental idea of an x86 machine.

As detailed in the article, x86 opcodes are translated/compiled on the fly to micro-ops.

The registers are renamed as needed to squeeze out more parallelism.

Memory locations are actually fulfilled by any of three to five levels of cache, RAM, or swap.

Cores vary in their issue width, and can share the same instruction pipelines via hyperthreading.


While not quite infinite, what you want sounds awfully close to a belt machine:

Each instruction reads data from positions on the belt and then puts a few values at the start of the belt. This causes values on the end to "fall off".

The MILL CPU is an example of this architecture.
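
A toy belt (arbitrary 8-entry length, not anything specific to the Mill): results are pushed onto the front, operands are named by their position on the belt, and the oldest values fall off the end:

    from collections import deque

    # Toy belt: no destination registers at all -- every result lands at
    # position 0 and everything else slides toward the end.
    BELT_LEN = 8
    belt = deque([0] * BELT_LEN, maxlen=BELT_LEN)

    def push(value):
        belt.appendleft(value)        # oldest value falls off the far end

    def add(i, j):                    # operands are belt positions, not registers
        push(belt[i] + belt[j])

    push(2); push(3)                  # belt front: 3, 2, 0, ...
    add(0, 1)                         # 3 + 2 -> new value 5 at the front
    print(belt[0])                    # 5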


The belt approach is sort of a hybrid between register and stack architectures. A disadvantage is that values on the belt may be at unknown locations after branches:

    var x = something();
    if (somethingElse()) { stuff(); }
    print(x); // where is x??
Here x has to be spilled and reloaded. This is worse than a register machine where x can be assigned a register for its entire lifetime.

I think this style of architecture is at its best for straight-line code, but suffers with branchy code, loops, etc. which will require a lot of spilling.


The main reason to adopt a belt is that you don't need to encode the destination (or destinations, for an instruction that returns multiple values). Instruction cache pressure is typically a critical design constraint on VLIW machines, which is presumably why they went this route.

Also, that simplifies the compiler's job and ties into the spiller design, etc.


> Ideally we'd have infinite registers: each instruction would read from previous registers and output to a new register. And so we would never reuse any register, and there'd be no data hazards.

Crazy thought, but would it be possible to build a decentralized and distributed CPU on top of the blockchain?

Edit: to the downvotes, care to explain? I am genuinely curious to know the answer, and curiosity should be a driving factor in this community


If the blockchain in question is Turing-complete, then it would be possible (by trivial consequence of definition) to build an emulator for any other Turing-complete (or even non-Turing-complete) deterministic processor on the blockchain. However, the latency per operation would be prohibitively high for most purposes, unless each "CPU Instruction" is complex enough to greatly stretch definitions and lose meaningful distinctions.

So, the answer to the most straightforward interpretation of your question is basically "Yes, it's obviously possible and likely trivial, but probably not practical".


Latency.



