Sure, but it seems disingenuous to throw up a straw man like that rather than address how this stacks up against something that's pipelined properly.

AFAIK you're still going to be constrained by propagation delay either way.

I totally understand the switching power argument but less so the performance argument.




To make things very concrete, imagine that your pipeline has an execute stage where various operations get executed. Say you have operations for:

1. bitwise AND, which is extremely fast because each bit of the answer is just the AND of the corresponding bits of the inputs.

2. ADD, which is still fast but definitely slower than the AND: each result bit depends not only on the corresponding input bits, but also on the earlier bits (to propagate carries).

In some barbarically simple timing model:

- Each result bit for the AND might take a single gate delay (because it is a single AND gate), but

- The highest result bit for the ADD might take (say) 10 gate delays.

You'll also need a bit of logic to choose between the above computations depending on the opcode. Let's suppose this selection logic adds another 5 gate delays.

Long story short: when executing an AND, the whole result is ready after 6 gate delays. But when executing the ADD, some bits are not ready until 15 gate delays.

In a typical clocked design, you need to run the clock slowly enough to accommodate the slowest such delay, i.e., even when you are executing ANDs, the clock period is at least 15 gate delays. The clock needs to run slow enough to accommodate any operation, and it doesn't dynamically change on some kind of per-opcode basis (because that would be insanely hard to coordinate with the other pipe stages).

In contrast, in an asynchronous design, as far as I understand it, you don't have a clock at all. Instead, the result has an additional "ready" signal associated with it and, whenever the result is ready (a data dependent computation), the next stage can consume it.

Ideally this would mean your execute stage could process the AND operations in just 6 gate delays instead of having to wait 15. Ideally, it might also mean you don't need to design your pipeline quite so carefully: in a clocked design, a single slow path slows down the entire pipeline; in an asynchronous design, that one particular path may be slow, but that doesn't slow down everyone else.
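
To make the arithmetic concrete, here's a toy Python model of the trade-off. The instruction mix is made up, and the handshake overhead a real asynchronous design would pay is ignored:

    # Per-opcode result latency in gate delays, including the 5-delay
    # selection logic from above (AND = 1 + 5, ADD = 10 + 5).
    LATENCY = {"AND": 6, "ADD": 15}

    ops = ["AND", "AND", "ADD", "AND", "ADD", "AND"]  # made-up mix

    # Clocked: every op takes one period, sized for the slowest op.
    sync_time = max(LATENCY.values()) * len(ops)

    # Asynchronous: each result is consumed as soon as "ready" fires.
    async_time = sum(LATENCY[op] for op in ops)

    print(sync_time)   # 90 gate delays
    print(async_time)  # 54 gate delays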


> In contrast, in an asynchronous design, as far as I understand it, you don't have a clock at all. Instead, the result has an additional "ready" signal associated with it and, whenever the result is ready (a data dependent computation), the next stage can consume it.

You can do the exact same thing in clocked designs as well. The AND produces a "ready" signal that allows its output to skip the stages needed by the ADD side (or conversely, you can have the ADD side produce a stall signal that stops the pipeline). You can actually see this in modern processors: some instructions take a variable number of cycles depending on their arguments (notably loads and stores, but also sometimes multiplies and divides).
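
Sketching that in the same toy model (the 8-gate-delay period and the two-cycle ADD are assumptions for illustration, not from any particular design):

    # Clocked, but with a per-op ready/stall: clock at the fast path's
    # speed and let slow ops hold their stage for an extra cycle.
    PERIOD = 8                     # covers the 6-delay AND path
    CYCLES = {"AND": 1, "ADD": 2}  # ADD stalls one extra cycle (2*8 >= 15)

    ops = ["AND", "AND", "ADD", "AND", "ADD", "AND"]
    total = sum(CYCLES[op] for op in ops) * PERIOD
    print(total)  # 64 gate delays, vs 90 with a single worst-case clock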


The point is that in a synchronous design the delays have to be multiples of a clock cycle. Whenever a latency is not an exact multiple, the circuit is idle. And clock cycles cannot be arbitrarily short because of the overhead for latching, etc.

Also it takes time to schedule instructions. In a simple processor with a 5-stage pipeline you can simply stall the entire pipeline, but since you stall all other instructions too, this is costly. And in a modern superscalar out-of-order processor, a stall is even more expensive, and you cannot reschedule all instructions for the next cycle at the end of the previous cycle because rescheduling is too complex.


Hm, I'm still not sure I understand.

> The point is that in a synchronous design the delays have to be multiples of a clock cycle.

Not really; you can almost always rebalance logic between stages so that each stage fits within a clock period. And if you can't, you can locally skew the clock to borrow time between stages.

> Also it takes time to schedule instructions. In a simple processor with a 5-stage pipeline you can simply stall the entire pipeline, but since you stall all other instructions too, this is costly.

This is an unrelated, higher-level architectural distinction than asynchronous vs. synchronous. The scheduling cost doesn't go away when you use asynchronous circuits.

Also, why is stalling bad here? If the circuit takes 16 gate delays on some inputs and 6 gate delays on others, it doesn't matter whether we use an async or sync design; a fast operation behind a slow operation is still going to wait (stall) for the operation in front of it to complete. That's just a fundamental property of in-order execution (which, again, isn't related to async or sync circuits).

> And in a modern superscalar out-of-order processor, a stall is even more expensive, and you cannot reschedule all instructions for the next cycle at the end of the previous cycle because rescheduling is too complex.

What? In a canonical OoO design, the default is stall! An instruction will only ever proceed to the next stage if its dependencies have been satisfied. When a stall happens you don't need to reschedule, because the instruction won't have been scheduled in the first place!

The important part from the grandparent was this:

"Ideally, it might also mean you don't need to design your pipeline quite so carefully: in a clocked design, a single slow path slows down the entire pipeline; in an asynchronous design, that one particular path may be slow, but that doesn't slow down everyone else."

Which is true! If you do a bad job of balancing your pipeline stages (or can't balance them statically because of variation or whatever), then the single slow path slows down the entire clock. However, when you can rebalance the pipeline statically, as in the example given, there's no reason you have to design your pipeline to wait for the slowest path.

But perhaps I misunderstood the example; let me know if I'm missing anything.


edit: Ghettoimp said something very similar, and probably better than I said it.

Say you've got a 2-stage pipeline (for simplicity). Maybe stage 1 has variable execution times, depending on the instruction being executed. Maybe stage 2 is faster than the worst case for stage 1. In all cases, the clock period must be at least as long as the slowest step of the pipeline, which means the circuit may sit idle for a bit whenever stage 1 completes in less than worst-case time.

In an unclocked equivalent, that idle time can potentially be eliminated whenever the wait isn't actually needed. When stage 1 does a fast operation and stage 2 is ready to receive the result, the data can advance through the pipeline before the clock pulse would've arrived in a clocked circuit. Both are constrained by propagation delay, but a clocked circuit is constrained by both propagation delay and the timing of the clock.
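
A rough sketch of that in Python (the latencies are invented, and the handshake overhead a real asynchronous pipeline would pay is ignored):

    # Two-stage pipeline: stage 1 takes 6 or 15 time units per op,
    # stage 2 always takes 5. Compare a worst-case clock to a handshake.
    stage1 = [6, 15, 6, 6, 15]  # made-up per-op stage-1 latencies
    STAGE2 = 5

    # Clocked: the period covers the worst stage, and each op holds each
    # stage for one full period, so 5 ops take 6 periods through 2 stages.
    period = max(max(stage1), STAGE2)
    clocked_total = period * (len(stage1) + 1)

    # Unclocked: an op hands off as soon as its stage finishes AND the
    # next stage is free.
    s1_free = s2_free = 0
    for lat in stage1:
        done = s1_free + lat          # stage 1 finishes this op
        handoff = max(done, s2_free)  # wait if stage 2 is still busy
        s2_free = handoff + STAGE2
        s1_free = handoff             # stage 1 can accept the next op

    print(clocked_total)  # 90
    print(s2_free)        # 53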


At first you said pipelining was made to address the issues brought up in the paper; now you are saying you just want to see how it stacks up against pipelining. Those are two different things, so rather than calling it a straw man, give Ivan Sutherland the benefit of the doubt.


The straw man is the part I quoted.

He talked about slow operations that gate your chip. Pipelining explicitly addresses this. Unless their async units have better hold times than D flip-flops, both will be gated by propagation delay. It's a straw man since it never mentions pipelining at all.

[edit] Not that I don't have a ton of respect for Sutherland (esp. in the graphics domain), but it would be nice to see something that acknowledges other approaches.


Not quite - pipe stages have costs, both in area and flop delay. In particular, a flip-flop might have a setup time on its input and a clk->Q delay; for really fast clocks, these might add up to close to 1/4 of your clock period.

For example, let's suppose we have a combined flop delay of 1 ns and a combinatorial delay (the logic we want to compute) of 9 ns. We can clock this at 100 MHz, or we can pipeline it 3 ways, splitting the combinatorial block into three 3 ns chunks. Each pipe stage still has a 1 ns flop delay, so the total pipe-stage delay is 4 ns (250 MHz). We split the logic in 3 but only got a 2.5x performance increase because of the fixed costs.
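
A quick Python check of that arithmetic, extended to a hypothetical 9-way split with the same assumed 1 ns flop overhead:

    # Diminishing returns from pipelining: 1 ns of flop overhead per
    # stage, 9 ns of logic split evenly across N stages.
    FLOP_NS, LOGIC_NS = 1.0, 9.0

    for n in (1, 3, 9):
        period = FLOP_NS + LOGIC_NS / n    # per-stage delay in ns
        print(n, period, 1000.0 / period)  # stages, ns/stage, MHz
    # 1 10.0 100.0  -> the unpipelined 100 MHz case
    # 3  4.0 250.0  -> 3x the stages, only 2.5x the clock
    # 9  2.0 500.0  -> 9x the stages, only 5x the clock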

Pipelining is a great tool, but there is a law of diminishing returns that kicks in here.


Thanks, this is the kind of reply I was hoping to get, instead of being told I can't comment because someone has an important last name.


I never said that, don't turn yourself into a victim.


But you didn't quote anything


You pipeline whether it's synchronous or asynchronous. The point of being asynchronous is to eke out better performance when only the faster portions of your pipeline are active.


Other points:

* "clock-speed" adapts automatically to gate speed, rather than be dictated (which has to be set conservatively),

* allows power savings in multiple ways (no globally propagated, skew-sensitive clock; finer-grained clock gating by construction; and, for NCL, wider supply-range adaptability),

* the absence of a global clock by definition means less simultaneous switching, which reduces strain on power supplies (decoupling) and gives much better EMI.



