
Imagine a CPU that executed this sequence of instructions:

    r1 = r2 * r3
    r2 = r3 * r4
    r3 = r4 * r5
    r4 = r5 * r6
At first glance it might seem that the execution order matters, because if you reorder lines in this source code the meaning indeed changes (e.g. the first instruction depends on an input in r2, but r2 gets a new value in the second instruction).

But if you rename the registers (variables), drawing from a larger set of them, you can express the same semantics without overwriting any register. If you do so you can run the operations in parallel on different execution units within the same core.
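For example, using r10-r12 purely as hypothetical spare register names, the renamed sequence could look like:

    r1  = r2 * r3
    r10 = r3 * r4
    r11 = r4 * r5
    r12 = r5 * r6
None of these four instructions reads a result produced by another, so all four can be in flight at the same time.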

The ability to issue more than one instruction at the same time is called "multiple (or wide) issue".

There are different kinds of functional units inside a CPU. For example, some can do basic arithmetic but not divide. At any given time, some functional units will be busy and others free. The amount of parallelism present in the instruction stream that you can effectively exploit is bounded by whether you can map your instructions onto functional units that are free at that time.

Often, while you cannot run a given instruction because, for example, the multiplier unit is busy, the next instruction could be run because it depends on the divider unit, which at that time is free.

The ability to execute instructions out of order is called, well, "out of order execution". It depends on the ability to rename registers and keep track of actual data dependencies. The CPU needs some memory to keep track of that data.
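A sketch of the kind of situation this exploits (register names and unit availability made up for illustration):

    r1 = r2 * r3    (needs the multiplier)
    r4 = r1 * r5    (needs the multiplier AND the value of r1: must wait)
    r6 = r7 / r8    (needs only the divider, which is free: can start right away)
An out-of-order core can start the third instruction before the second one, because its bookkeeping shows that nothing the third instruction reads is still being produced.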

It's unrelated to pipelining (which I could give a try at explaining too, if anybody cares to hear a similar explanation)




Nice explanation. Wouldn’t mind reading more.


I'll try. On pipelining:

Most operations above a certain level of complexity can be broken down into individual smaller steps that have to happen in order, and each step takes some time. Once one such step is done, the next step is performed, and so on.

If the logic for each of those steps is implemented in hardware, once a step is done executing, the logic just sits there idle, waiting for the next operation (composed of smaller steps) to be executed.

Pipelining is a technique that makes use of that otherwise idle logic to start performing the first step of the next operation.

For example, let's consider this small program:

    r3 = r1 + r2
    r2 = r3 + r1
    r1 = r3 + r4
A hypothetical non-pipelined CPU would execute these instructions in 12 clock cycles:

    load instruction 1
    decode instruction 1
    perform r1 + r2
    write result in r3

    load instruction 2
    decode instruction 2
    perform r3 + r1
    write result in r2

    load instruction 3
    decode instruction 3
    perform r3 + r4
    write result in r1

A pipelined execution looks like this:

    load instruction 1 
    decode instruction 1 | load instruction 2
    perform r1 + r2      | decode instruction 2 | load instruction 3
    write result in r3   | perform r3 + r1      | decode instruction 3
                           write result in r2   | perform r3 + r4
                                                  write result in r1
and completes in 6 cycles. Once the pipeline is running at full capacity, it produces 1 result per clock cycle instead of 1 result every 4 cycles, a 4x throughput improvement over the naive approach above.

This example architecture has a pipeline depth of 4.

The execution is still strictly in order. The idea is that you can start doing some work for the next instruction before the previous one is fully completed.

For example, you can decode the next instruction right after you have decoded the current one. Also, you can use the result of the previous instruction in the next instruction's computation even before that result has actually been written to its destination register, provided there is additional logic that ships the result of the current operation straight to the arithmetic unit's operand input for the next cycle (this is usually called forwarding, or bypassing).
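That is what happens between instruction 1 and instruction 2 in the 4-stage diagram above; pulling out the two relevant rows:

    perform r1 + r2      | decode instruction 2
    write result in r3   | perform r3 + r1      <- r3 arrives via the forwarding
                                                    path, not the register file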

The maximum duration of each step is bound by the clock period. The faster the clock, the less time you have for a single step. The deeper the pipeline, the faster the clock can tick while still producing one operation per tick. We'll see next some of the many things that prevent us from just riding this idea to the extreme, deepening the pipeline to ludicrous depths and harnessing ludicrous clock speeds.

You may have noticed that this dummy architecture here has been carefully designed so that the next instruction can consume the result of the previous instruction if the instruction stream so requires.

Imagine a different architecture that is divided into 6 steps, where the + operation itself is pipelined into two steps.

    load instruction 1
    decode instruction 1
    access r1, r2
    start +
    continue +
    write result in r3

    load instruction 2
    decode instruction 2
    access r3, r1
    start +
    continue +
    write result in r2

    load instruction 3
    decode instruction 3
    access r3, r4
    start +
    continue +
    write result in r1

Pipelined, the execution looks like this:

    load instruction 1
    decode instruction 1 | load instruction 2
    access r1, r2        | decode instruction 2 | load instruction 3
    start +              | access r3, r1        | decode instruction 3
    continue +           | start +  (HAZARD!)   | access r3, r4
    write result in r3   | continue +           | start +
                           write result in r2   | continue +
                                                  write result in r1
Now, here the second instruction depends on r3, which has not yet finished computing! This is known as a data hazard. A common way out is to introduce a pipeline "stall", i.e. a no-operation step is injected in the pipeline to resolve the hazard.

    load instruction 1
    decode instruction 1 | load instruction 2
    access r1, r2        | decode instruction 2 | load instruction 3
    start +              | access r3, r1        | decode instruction 3
    continue +           | nop                  | nop
    write result in r3   | start +              | access r3, r4
                           continue +           | start +
                           write result in r2   | continue +
                                                  write result in r1
This short stall is not a full pipeline flush. Ok, so we saw a data hazard, but so far all this has been pretty straightforward.

So far we were looking at a linear instruction stream. Let's see what happens when you have a branch:

    r3 = r1 + r2
    call func1
    r1 = r3 + r4
where func1 is:

    r2 = r3 + r1
    return
Let's see what our dummy 4-stage pipeline machine would do:

    load instruction 1                   
    decode instruction 1 | load instruction 2  
    perform r1 + r2      | decode instruction 2    | load instruction 3
    write result in r3   | perform pc + 1          | decode instruction 3 (HAZARD!!)
                         | write pc=func1, ra=pc+1 | perform r3 + r4      (^^^^^^^^)
                                                   | write result in r1   (^^^^^^^^)

Calling a function basically means setting the program counter (aka instruction pointer) register to point to a different location instead of the next instruction. In order to return from the function you also need to save the return address somewhere (some CPUs use a "link" register, some use the stack; here I used a return address "ra" register); the return address is the address of the instruction that follows the call in the instruction stream, i.e. where execution should resume after the function body.
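In the same pseudo-notation, and with the same hypothetical ra register, the two control flow instructions roughly boil down to:

    call func1   =>   ra = pc + 1
                      pc = &func1

    return       =>   pc = ra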

As you can see, once we set the program counter and tell the CPU that the program flow continues from another place, the work the CPU has started doing in the pipeline becomes invalid.

    load instruction 1
    decode instruction 1 | load instruction 2
    perform r1 + r2      | decode instruction 2    | load instruction 3
    write result in r3   | perform pc+1, &func1    | nop
                         | write pc=func1, ra=pc+1 | load ins func1.1
                                                   | decode ins func1.1 | load ins func1.2
                                                   | perform r3 + r1    | decode ins func1.2
                                                   | write result in r2 | read ra
                                                                        | write pc=ra
                                                  
Jumping to a different address thus incurs additional delay, because we don't know where the jump is taking us until we have decoded the instruction and performed the necessary control flow adjustments.

In our dummy example this adds 2 cycles of latency. Not all is lost, since we can still do a little bit of pipelining, but for all intents and purposes this is a pipeline flush.

Some architectures (especially the older RISCs but it's still quite common in DSPs) work around this problem by introducing one or more "branch delay slots"; these are instructions that appear physically after a call/jump/branch instruction but get executed before the branch is actually taken.

An equivalent program for such an architecture would look like:

    call func1
    r3 = r1 + r2
    r1 = r3 + r4
where func1 is:

    return
    r2 = r3 + r1
And be executed as

    load instruction 1
    decode instruction 1    | load instruction 2 
    perform pc+1, &func1    | decode instruction 2 | load ins func1.1
    write pc=func1, ra=pc+1 | perform r1 + r2      | decode ins func1.1 | load ins func1.2
                              write result in r3   | read ra            | decode ins func1.2
                                                   | write pc=ra        | perform r3 + r1
                                                                        | write r2
By avoiding the pipeline flush we can now complete the same task in 7 cycles, compared to the 9 cycles shown above. Branch delay slots are no panacea though, and modern CPUs have largely moved away from them, favouring other techniques to predict whether and where a branch is likely to be taken. I chose to mention branch delay slots in order to illustrate that pipeline stalls are not inherently necessary whenever branches are involved.

Stalls are ultimately caused by data dependencies. The program counter is just data, on which the instruction stream itself depends.
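In the same spirit, the call we just saw is a write to pc that the very next fetch reads, which is the same read-after-write pattern as the r3 hazard earlier:

    pc = &func1         (the call writes pc...)
    load instruction    (...the next fetch reads pc before it can continue)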

(In the real world obviously things get much more complicated but you get the idea)



