A Case for Asynchronous Computer Architecture (2000) [pdf] (yale.edu)
65 points by mahami on Nov 30, 2021 | 57 comments



I tried to do a clockless fully async bus interface in around 1988 in a chip I was designing at Masscomp for a fast data acquisition system. Never got built, but it was fun trying, and it would've been really fast. "Lower design complexity" though: hahaha! Nope.


I worked with Alain Martin at Caltech, and I always loved the idea of asynchronous circuits. When I became an FPGA engineer, I realized the big problem with both FPGAs and asynchronous logic: the tooling doesn't generalize well to other domains, so you have to be a narrow specialist to make progress.

If someone could convert synchronous Verilog to async circuits under the hood, they might see huge gains in speed and power use for their circuits, but that is a huge uphill climb.


There is an FPGA company, Achronix, that claimed to do this. Their FPGA architecture was apparently asynchronous, and they had tools that compiled synchronous designs onto it. Don't know how good their tech was, but they got bought by Intel and are still making it AFAIR.


Their async tech worked fine, but FPGAs likely don't get the same benefit as ASICs from async logic. Not to mention debugging is hard, so if you're not committed to a design, you may not want to put in the effort.

The Achronix folks are going strong today (still independent), but with a much more conventional FPGA. The on-chip network may be async, but they hide it well. I hope they have lots of success in the future.


Achronix is still an independent FPGA company. No longer focused on asynchronous FPGA technology. Currently shipping high-performance synchronous FPGA technology on 7nm. Learn more about the Speedster7t FPGA: https://www.achronix.com/product/speedster7t-fpgas


It's a classic idea. There were some early asynchronous mainframes built from discrete logic. It might come back. It's an idea that comes around when you can't make the clock speed any higher.

It's one of those things from the department of "we can make it a little faster at the cost of much greater complexity, higher cost, and lower reliability". That's appropriate to weapons systems and auto racing.


I think the long-term driver won't be speed but power consumption, something that asynchronous computing has the potential to materially improve.


Having a common clock reference (per core) is essential for reducing latency between components. If you have to poll or await some other component arbitrarily, there will necessarily be extra overhead and delays in these areas. There will also need to be extra logic area dedicated to these activities. Make no mistake, just because there's no central clock doesn't mean you are magically off the hook. You still need to logically serialize the instruction stream(s).

Even for low power applications, you would probably use less battery getting the work done quickly in a clocked CPU and then falling back to a lower power state ASAP. Allowing the pipeline effects to take hold in a modern clocked CPU should quickly offset any relative overhead. Heterogeneous compute architecture is also an excellent and proven approach.

Certainly, there are many things that happen in a CPU that should not necessarily be bound by a synchronous clock domain (e.g. a ripple adder). But for those areas where an async CPU is a clear win, would we actually see any gains in practice using real software? Feels like there's a lot of other strategic factors that wash out any specific wins.


My understanding--which seems to coincide with this article, and which Wikipedia seems to agree with (not that that necessarily means much here)--is that in an asynchronous circuit latency would be lower, not higher: the clock has to wait for the worst-case path, while a clockless system can proceed as soon as the required inputs have arrived (or even speculate on partial inputs, something that would offer no value if you had to wait for the next tick anyway).


This is correct. It happens at multiple levels. Oversimplified:

* An async add operation takes variable time based on the number of carries, whereas a sync one is set to the worst-case.

* The clock for an ALU is set for the worst case even when doing something faster (e.g. a NAND rather than an ADD)

* If you have multiple logic stages handled in one clock cycle, the problem is compounded. The clock is set by the slowest stage for all components in the system.

* If your system is doing nothing, you're still clocking it. Clocks are adjusted, but not at a nanosecond-by-nanosecond level.

All-in-all async gives a nice power boost and a nice performance boost (not enough of a boost to displace an entrenched ecosystem, mind you, but a nice boost nonetheless).
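
To make the first point concrete, here's a rough sketch of my own (not from the paper or the parent) estimating how long the longest carry chain actually is for random 64-bit operands; the average settles far sooner than the 64-bit worst case a clocked design has to budget for:

  import random

  def carry_chain(a, b, width=64):
      # Length of the longest carry-propagation run in a ripple-carry add,
      # a rough proxy for how long the sum takes to settle.
      carry, longest, run = 0, 0, 0
      for i in range(width):
          abit, bbit = (a >> i) & 1, (b >> i) & 1
          carry = (abit & bbit) | (carry & (abit ^ bbit))
          run = run + 1 if carry else 0
          longest = max(longest, run)
      return longest

  random.seed(0)
  samples = [carry_chain(random.getrandbits(64), random.getrandbits(64))
             for _ in range(10_000)]
  print("average chain:", sum(samples) / len(samples))  # typically around log2(64) ~ 6
  print("longest seen :", max(samples))                 # still far below the 64-bit worst case

A completion-detecting adder only pays for the chain that actually occurred; a clocked one always budgets for the full 64 bits.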


Yeah, that's the theory, but reality is different, which is probably why we don't see any in production. (The last company that would admit to a tiny bit of clockless logic, Wave, folded.)

The reality is that doing clockless logic introduces a lot of overhead at every stage, both area and timing. There are different styles, and the issues differ between them, but the bottom line is that nobody has been able to realize the theoretical wins in production (note 1). And that's not even addressing the lack of tooling.

note 1: the closest, IMO, is Ivan Sutherland's group, which has some very impressive claims, but still nothing you can run out and buy.


I don't think that's right.

1) I think an architecture change -- any architecture change -- is expensive. Intel and AMD dumped many billions of dollars into R&D around existing architectures, and an asynchronous one starts without a lot of that benefit.

2) There's a ton of stuff -- chipsets, RAM, software, etc. -- built up around synchronous logic. The engineering costs go up astronomically.

3) That's not to mention baseline engineering costs.

Async won't give a 2x boost to performance. I would guess it'd be 10%, maybe even 30%. That's not nearly enough to justify the investment.

Ivan Sutherland's work certainly won't compete with teams 10x+ that size and investment.


There is no 10% boost - it's fiction

ADDED:

With the millions spent on getting just a minor single-digit improvement, do you think the big players wouldn't jump on clockless immediately if they could? Note, Intel did (does?) use domino logic in the ALU and FPU. The benefit just isn't there for clockless. I personally know of companies that tried and gave up on it.

To your other points, you wouldn't boil the ocean; you keep everything else clocked as usual and bridge between them.


I think there's a long graveyard of conceptually better architectures which died because people underestimated costs, from Transmeta, to all sorts of SIMD and MIMD systems, to VLIW.

I agree it's not too expensive to prototype, but it's super-expensive to do *right*.


Clock distribution eats a lot of power at gigahertz frequencies, and a lot of gates.

> If you have to poll or await some other component arbitrarily, there will necessarily be extra overhead and delays in these areas.

You don't poll. You have a lot of small input-clocked domains, each of which runs at whatever speed the data arrives.


True, but asynchronous circuits need a lot of extra gates and signals for detecting when an operation is completed and notifying the next stage that it can proceed.

It is very difficult to estimate which of the two approaches will need less area and power for some given requirements.

It is likely that for a sufficiently complex device an asynchronous implementation will use less power, but the effort to design bug-free complex asynchronous logic is much higher than for synchronous designs, which is probably the main reason why very few commercial asynchronous devices have existed.
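
For a sense of where that extra area goes, here's a tiny illustrative sketch of mine (not from the parent) of dual-rail encoding, one common completion-detection scheme: every logical bit becomes two wires, and an extra network has to check that every bit has resolved before the next stage may fire:

  def to_dual_rail(value, width):
      # Encode an integer as (false_rail, true_rail) wire pairs -- twice the wires.
      return [((~value >> i) & 1, (value >> i) & 1) for i in range(width)]

  def completed(rails):
      # Completion detection: every pair must have exactly one rail asserted.
      # In hardware this checker is a tree of gates a clocked design doesn't need.
      return all(f ^ t for f, t in rails)

  rails = to_dual_rail(0b1011, width=4)
  print(rails)             # [(0, 1), (0, 1), (1, 0), (0, 1)]
  print(completed(rails))  # True once all four bits have resolved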


This seems to be from 20 years ago, the most recent citation was from 2000 and it describes a MIPS chip built on a 1998 process.


And not even a mention of AMULET (https://en.wikipedia.org/wiki/AMULET_microprocessor)


Nor this one:

https://authors.library.caltech.edu/43698/1/25YearsAgo.pdf

It was the original paper for this that got me interested in building silicon tools


Came here to say this.


We had the pleasure of hosting Dr. Manohar at a CIRCT weekly discussion session earlier this year. He presented much more recent work if anyone is interested. The talk and discussion was recorded here: https://sifive.zoom.us/rec/play/Bg99_niHh9OG_8uE_nhaz6otxvA0...

EDIT: talk begins around 7 minutes.


Yes, but I thought that it could be interesting to look at research on the topic from 20 years ago to compare it with present progress.


Has there been much progress? I remember hearing a lot about asynchronous logic circuits back in the 90s, but don't hear about much in the way of breakthroughs since then.


Could you add the publication year in the title of your submission?


(2000)


Added above. Thanks!


Asynchronous would work better, but we're unlikely to get there -- too big a change.

It's like:

* having ECC everywhere

* having a single display standard (as opposed to HDMI/DisplayPort/USB-C/DVI/VGA/...)

* some kind of architecture where a single bad expansion card (USB, PCIe, etc.) can't crash a whole computer

... and so on

On one hand, no brainer. On the other hand, it hasn't happened.

NVidia is breaking ground on the move to SIMD/MIMD-style architectures, as predicted at the same time, and only because it gives a 30x boost in performance. Async will probably net us a 50% performance boost or something.


> * some kind of architecture where a single bad expansion card (USB, PCIe, etc.) can't crash a whole computer

If you mean IOMMU, we do have that. It doesn't seem completely doable because someone could still plug an etherkiller into the card.


I don't care much about hostile attacks. I just lost a few weeks before I figured out my computer was crashing due to a failing wifi card. I care about that sort of thing. That's totally fixable.


Eh, you probably already have the OS architecture for that. The vendor just isn't using it.


I don't think async can make things faster, but it can make them more energy efficient, and the incentive for that is still close to none, as our economic models reward waste until all EROEI is depleted.

But you need to add the ability to switch things off dynamically, meaning cores on the CPU/GPU; so far the industry has addressed this with big.LITTLE, but that requires all software to change, and it's going to take time that we unfortunately do not have as hardware is closing off the ownership model.


I will raise an important distinction: asynchronous logic != dynamic logic.

There can be dynamic synchronous logic, and vice versa.

Dynamic vs. static determines whether the circuit needs to be driven by some constant pacing input (an embedded or an external clock) in order to reach and hold a settled state (to latch), or does not need one.

Strictly speaking, asynchronous vs. synchronous determines whether that pacing input is external or recovered from the input.


Do you mean domino logic?


Domino is _one_ version of asynchronous, but that's a different notion of asynchronous than the article uses. Because of the ambiguity, today we talk about clockless logic, which comes in variants, most notably delay-insensitive and quasi-delay-insensitive. The latter is faster, but less immune to noise (and has terrible timing analysis issues).


Mini-MIPS isn't that different from a conventional out-of-order superscalar microarchitecture. The article even says:

  However, the MiniMIPS pipeline structure can execute instructions out-of-order with respect to each other because instructions that take different times to execute are not artificially synchronized by a clock signal.


Does this mean that the chip isn't clocked? Doesn't that give you a complete metastability nightmare? How does it work?


No metastability nightmare.

One way to do this is to have each component have an output clock, which rises when its output is known stable. If an adder has no carries, that takes 1 ns. If it has every possible carry, it takes 2 ns. You have a second clock propagating backwards to know when the next stage is ready for its next input.

You still have timing. It's just set to when a component is ready with output, or ready to receive input.

Everything goes faster and uses less power.
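
A toy trace of that scheme (illustrative names and numbers of my own, not the MiniMIPS design): the forward "output stable" signal fires as soon as the data-dependent work is done, the backward signal says the next stage has latched it, and the only fixed cost is the handshake itself:

  def handshake(t_ns, stage, compute_ns, next_stage):
      # The stage raises its forward signal the moment its result is stable;
      # compute_ns is data-dependent, not a worst-case clock period.
      t_ns += compute_ns
      print(f"{t_ns:5.1f} ns  {stage}: result stable, req ->")
      t_ns += 0.1  # assumed control overhead of the handshake itself
      print(f"{t_ns:5.1f} ns  {next_stage}: latched, ack <-")
      return t_ns

  t = 0.0
  t = handshake(t, "adder", 1.1, "writeback")  # fast add: short carry chain
  t = handshake(t, "adder", 1.9, "writeback")  # slow add: long carry chain
  # A clocked design would budget the worst case (2 ns here) both times.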


And there are no unnecessary state changes, like you get with a clocked circuit that changes state with the rhythm of the clock, whether or not it is useful. At high frequencies that equates to a lot of power lost.


Well, to some extent metastability is a result of having a clock (and the tiny feedback loops we embed in synchronous flops to create storage). Instead you use logic structures that are designed to be asynchronous and that can adjust their timing to avoid these issues.


Memory cells are the thing that uses the vast majority of power in a CPU. And they are used everywhere, cache, uOP cache, BTB, etc.

An async CPU solves a problem whose fix has marginal benefit on the metrics we care about.

Also, I imagine, they would need to be implemented assuming the worst timing delay from the processes. They can't be binned like modern CPUs.


That doesn't sound right? Dynamic power is consumed by toggling wires, and memory cells are going to be one of the places where toggling is rare because you can't access all memory all the time.

Am I missing something?


The comparison of power usage is often done in the context of external memory. When talking about on-chip memory it becomes an apples-to-oranges comparison.

For a start, it doesn't make sense to power-gate an SRAM, so it is always leaking power. And while writes are not common, reads are. Most applications with SRAM read all the metadata in parallel looking for a match (and often the data too, due to timing constraints and the increased size of the control logic the extra complexity would require). And reading uses power.


Volatile memory consumes constant power to remember its value. Processing circuits only consume power when activated. And it's difficult to keep memory bandwidth saturated in a way that keeps all circuits busy. Computers do work in bursts; then they wait for data. And practically all classical computer-science data structures thrash the cache, like linked lists and OOP in general.


You're confusing DRAM and CPUs - CPUs almost exclusively use static SRAM cells internally, which don't require refresh.


Wikipedia says "SRAM is volatile memory; data is lost when power is removed."

So it must consume power to retain its value.


I would have thought an asynchronous finite-state-machine type of system could be used to create a computer?


Modern clocked processors don't account for worst-case timings. Instead, instructions take a variable number of clock cycles to complete.

In some sense they're already asynchronous, despite being clocked.


Certainly, modern CPUs are pipelined, but each clock cycle is still the worst-case time for all steps in the pipeline.


> each clock cycle is still the worst-case time for all steps in the pipeline

The pipeline takes a variable number of clocks to complete an instruction. The number depends on the instruction, the input data of the instruction, and quite a few other things. In some exotic cases it even depends on power state, e.g. some Intel CPUs took ~20k cycles to power on their AVX units; during that window AVX instructions are much slower.

If for any reason the pipeline is unable to deliver the result by the end of the clock, CPUs don't delay the clock; they continue running it. You'll simply get the result on some later clock cycle.


That's exactly what I mean.


Can you write more about this or provide some examples? Of course, memory access has had variable timing “forever”, but the idea that other functional units can vary their timings for instructions is new to me.


Well, for example, imagine you have a 64-bit adder and you add two numbers together. Let's assume that one of the inputs is '1'. How long does it take until the output is stable? It depends on the second value and, more importantly, how long it takes for all the carries to propagate to the MSB. For a naive circuit and an input of 0 the output will stabilise very quickly; for an input of 0xffff_ffff_ffff_ffff it will take 64 adder delays. An async circuit can have simple additions run faster than the worst-case ones (which will still work), while a synchronous circuit would have a clock that could only go as fast as the slowest case (or pipeline things so that the output appears multiple clocks later).
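
If it helps, those two cases can be checked with a quick sketch (mine, purely to illustrate the point) that counts how far the carry actually ripples in a naive ripple-carry adder:

  def ripple_depth(a, b, width=64):
      # Longest run of bit positions a carry travels through -- a rough
      # stand-in for how long the naive adder's output takes to settle.
      carry, longest, run = 0, 0, 0
      for i in range(width):
          abit, bbit = (a >> i) & 1, (b >> i) & 1
          carry = (abit & bbit) | (carry & (abit ^ bbit))
          run = run + 1 if carry else 0
          longest = max(longest, run)
      return longest

  print(ripple_depth(1, 0))                      # 0: output settles almost immediately
  print(ripple_depth(1, 0xffff_ffff_ffff_ffff))  # 64: carry ripples through every bit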


> imagine you have a 64-bit adder

Integer addition is too easy. On all modern computers, add instructions take at most 1 cycle, even vector ones like AVX2's vpaddq, which adds four 64-bit numbers to another four.


That's a pretty naive adder implementation. A more efficient version would stabilize after two ticks.


yup, but I'm using it as an example to make a point here (hence my use of the word 'naive') - also the whole point of doing async stuff is that you don't have to worry about 'ticks'


Even in async you have to worry about ticks, it's just that the ticks aren't global.

Think of the extent to which a clock can propagate as a lightcone, within that domain everything is synchronized, but it need not be synchronized with things on the outside. The smaller the domains the more asynchronous a design gets. But you'll never be async all the way down, at some point you will have to worry about stabilizing your outputs and passing them on to the next stage in that stable situation.

Compare synchronous serial lines with an asynchronous interface such as a Centronics printer interface. The former will happily send zeros plus clocks all day long absent a signal; the latter will strobe its 'ready' output only when there actually is data, but it will still have to output that pulse, which serves as a very local clock.


A good source of that info is https://www.uops.info/

For instance, on my CPU which is AMD Zen 3, the idiv instruction (it computes integer division and modulo) takes between 9 and 19 cycles for 64-bit version: https://www.uops.info/html-instr/IDIV_R64.html#ZEN3 That’s for the operand already in a register i.e. no RAM access involved.

Whether it takes 9 cycles, 19 cycles, or something in between, depends on the arguments of the instruction, i.e. on the numbers being divided.

Same applies to quite a few other instructions: floating point divisions (divps, divpd), floating point square root (sqrtps, sqrtpd), even 64-bit integer multiplication (imul).

It’s not just the math. Jumps, branches and function calls take very different count of cycles depending mostly on two things: predicted or not, and the state of micro-ops cache at the target address. Albeit these effects are very hard to measure reliably, depends on the code too much, probably for this reason uops.info doesn’t have latency figures for jmp/call/etc.



