
Imagine a CPU that executed this sequence of instructions:

    r1 = r2 * r3
    r2 = r3 * r4
    r3 = r4 * r5
    r4 = r5 * r6
At first glance it might seem that the execution order matters, because if you reorder lines in this source code the meaning indeed changes (e.g. the first instruction depends on an input in r2, but r2 gets a new value in the second instruction).

But if you rename the registers (variables), drawing from a larger set of them, you can express the same semantics without overwriting any register. If you do so you can run the operations in parallel on different execution units within the same core.
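For example, using r10-r12 purely as hypothetical spare register names, the renamed sequence could look like:

    r1  = r2 * r3
    r10 = r3 * r4
    r11 = r4 * r5
    r12 = r5 * r6
None of these four instructions reads a result produced by another, so all four can be in flight at the same time.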

The ability to issue more than one instruction at the same time is called "multiple (or wide) issue".

There are different kinds of functional units inside a CPU. For example, some can do basic arithmetic but not divide. At any given time, some functional units will be busy and others free. The amount of parallelism present in the instruction stream that you can effectively exploit is bounded by whether you can map your instructions onto functional units that are free at that time.

Often, while you cannot run a given instruction because, for example, the multiplier unit is busy, the next instruction could be run because it depends on the divider unit, which at that time is free.

The ability to execute instructions out of order is called, well, "out of order execution". It depends on the ability to rename registers and keep track of actual data dependencies. The CPU needs some memory to keep track of that data.
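A sketch of the kind of situation this exploits (register names and unit availability made up for illustration):

    r1 = r2 * r3    (needs the multiplier)
    r4 = r1 * r5    (needs the multiplier AND the value of r1: must wait)
    r6 = r7 / r8    (needs only the divider, which is free: can start right away)
An out-of-order core can start the third instruction before the second one, because its bookkeeping shows that nothing the third instruction reads is still being produced.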

It's unrelated to pipelining (which I could give a try at explaining too, if anybody cares to hear a similar explanation)




Nice explanation. Wouldn’t mind reading more.


I'll try. On pipelining:

Most operations above a certain level of complexity can be broken down into individual smaller steps that have to happen in order, and each step takes some time. Once one such step is done, the next step is performed, and so on.

If the logic for each of those steps is implemented in hardware, once a step is done executing, the logic just sits there idle, waiting for the next operation (composed of smaller steps) to be executed.

Pipelining is a technique that makes use of that otherwise idle logic to start performing the first step of the next operation.

For example, let's consider this small program:

    r3 = r1 + r2
    r2 = r3 + r1
    r1 = r3 + r4
A hypothetical non-pipelined CPU would execute these instructions in 12 clock cycles:

    load instruction 1
    decode instruction 1
    perform r1 + r2
    write result in r3

    load instruction 2
    decode instruction 2
    perform r3 + r1
    write result in r2

    load instruction 3
    decode instruction 3
    perform r3 + r4
    write result in r1

A pipelined execution looks like this:

    load instruction 1 
    decode instruction 1 | load instruction 2
    perform r1 + r2      | decode instruction 2 | load instruction 3
    write result in r3   | perform r3 + r1      | decode instruction 3
                           write result in r2   | perform r3 + r4
                                                  write result in r1
and completes in 6 cycles. Once the pipeline is running at full capacity, it produces 1 result per clock cycle instead of 1 result every 4 cycles, a 4x throughput improvement over the naive approach above.

This example architecture has a pipeline depth of 4.

The execution is still strictly in order. The idea is that you can start doing some work for the next instruction before the previous one is fully completed.

For example, you can decode the next instruction right after you have decoded the current one. Also, you can use the result of the previous instruction in the next instruction's computation even before that result has actually been written to its destination register, provided there is additional logic that ships the result of the current operation straight to the arithmetic unit's operand input for the next cycle (this is usually called forwarding, or bypassing).
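That is what happens between instruction 1 and instruction 2 in the 4-stage diagram above; pulling out the two relevant rows:

    perform r1 + r2      | decode instruction 2
    write result in r3   | perform r3 + r1      <- r3 arrives via the forwarding
                                                    path, not the register file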

The maximum duration of each step is bound by the clock period. The faster the clock, the less time you have for a single step. The deeper the pipeline, the faster the clock can tick while still producing one operation per tick. We'll see next some of the many things that prevent us from just riding this idea to the extreme, deepening the pipeline to ludicrous depths and harnessing ludicrous clock speeds.

You may have noticed that this dummy architecture here has been carefully designed so that the next instruction can consume the result of the previous instruction if the instruction stream so requires.

Imagine a different architecture that is divided into 6 steps, where the + operation itself is pipelined into two steps.

    load instruction 1
    decode instruction 1
    access r1, r2
    start +
    continue +
    write result in r3

    load instruction 2
    decode instruction 2
    access r3, r1
    start +
    continue +
    write result in r2

    load instruction 3
    decode instruction 3
    access r3, r4
    start +
    continue +
    write result in r1

Pipelined, the execution looks like this:

    load instruction 1
    decode instruction 1 | load instruction 2
    access r1, r2        | decode instruction 2 | load instruction 3
    start +              | access r3, r1        | decode instruction 3
    continue +           | start +  (HAZARD!)   | access r3, r4
    write result in r3   | continue +           | start +
                           write result in r2   | continue +
                                                  write result in r1
Now, here the second instruction depends on r3, which has not yet finished computing! This is known as a data hazard. A common way out is to introduce a pipeline "stall", i.e. a no-operation step is injected in the pipeline to resolve the hazard.

    load instruction 1
    decode instruction 1 | load instruction 2
    access r1, r2        | decode instruction 2 | load instruction 3
    start +              | access r3, r1        | decode instruction 3
    continue +           | nop                  | nop
    write result in r3   | start +              | access r3, r4
                           continue +           | start +
                           write result in r2   | continue +
                                                  write result in r1
This short stall is not a full pipeline flush. Ok, so we saw a data hazard, but so far all this has been pretty straightforward.

So far we were looking at a linear instruction stream. Let's see what happens when you have a branch:

    r3 = r1 + r2
    call func1
    r1 = r3 + r4
where func1 is:

    r2 = r3 + r1
    return
Let's see what our dummy 4-stage pipeline machine would do:

    load instruction 1                   
    decode instruction 1 | load instruction 2  
    perform r1 + r2      | decode instruction 2    | load instruction 3
    write result in r3   | perform pc + 1          | decode instruction 3 (HAZARD!!)
                         | write pc=func1, ra=pc+1 | perform r3 + r4      (^^^^^^^^)
                                                   | write result in r1   (^^^^^^^^)

Calling a function basically means setting the program counter (aka instruction pointer) register to point to a different location instead of the next instruction. In order to return from the function you also need to save the return address somewhere (some CPUs use a "link" register, some use the stack; here I used a return address "ra" register); the return address is the address of the instruction that follows the call in the instruction stream, i.e. where execution should resume after the function body.
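In the same pseudo-notation, and with the same hypothetical ra register, the two control flow instructions roughly boil down to:

    call func1   =>   ra = pc + 1
                      pc = &func1

    return       =>   pc = ra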

As you can see, once we set the program counter and tell the CPU that the program flow continues from another place, the work the CPU has started doing in the pipeline becomes invalid.

    load instruction 1
    decode instruction 1 | load instruction 2
    perform r1 + r2      | decode instruction 2    | load instruction 3
    write result in r3   | perform pc+1, &func1    | nop
                         | write pc=func1, ra=pc+1 | load ins func1.1
                                                   | decode ins func1.1 | load ins func1.2
                                                   | perform r3 + r1    | decode ins func1.2
                                                   | write result in r2 | read ra
                                                                        | write pc=ra
                                                  
Jumping to a different address thus incurs additional delay, because we don't know where the jump is taking us until we have decoded the instruction and performed the necessary control flow adjustments.

In our dummy example this adds 2 cycles of latency. Not all is lost, since we can still do a little bit of pipelining, but for all intents and purposes this is a pipeline flush.

Some architectures (especially the older RISCs but it's still quite common in DSPs) work around this problem by introducing one or more "branch delay slots"; these are instructions that appear physically after a call/jump/branch instruction but get executed before the branch is actually taken.

An equivalent program for such an architecture would look like:

    call func1
    r3 = r1 + r2
    r1 = r3 + r4
where func1 is:

    return
    r2 = r3 + r1
And be executed as

    load instruction 1
    decode instruction 1    | load instruction 2 
    perform pc+1, &func1    | decode instruction 2 | load ins func1.1
    write pc=func1, ra=pc+1 | perform r1 + r2      | decode ins func1.1 | load ins func1.2
                              write result in r3   | read ra            | decode ins func1.2
                                                   | write pc=ra        | perform r3 + r1
                                                                        | write r2
By avoiding the pipeline flush we can now complete the same task in 7 cycles, compared to the 9 cycles shown above. Branch delay slots are no panacea though, and modern CPUs have largely moved away from them, favouring other techniques to predict whether and where a branch is likely to be taken. I chose to mention branch delay slots in order to illustrate that pipeline stalls are not inherently necessary whenever branches are involved.

Stalls are ultimately caused by data dependencies. The program counter is just data, on which the instruction stream itself depends.
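In the same spirit, the call we just saw is a write to pc that the very next fetch reads, which is the same read-after-write pattern as the r3 hazard earlier:

    pc = &func1         (the call writes pc...)
    load instruction    (...the next fetch reads pc before it can continue)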

(In the real world obviously things get much more complicated but you get the idea)



