Computing 10,000x More Efficiently (slides) (media.mit.edu)
185 points by silentbicycle on Feb 20, 2012 | 41 comments



I'm about to head out the door, but here's what I see. He's got two fundamental changes to the way a modern CPU is designed:

* Internal representation of floats is logarithmic

* Small processing elements are connected in a mesh network that is made explicit to the programmer

The outcome of the second is that there is more work for a programmer, as details of the hardware are no longer hidden. The tradeoff is that none of the overhead to hide those details is required in hardware anymore. This is not new in other application areas, I'm sure DSP folk (and probably GPU folk) can tell us all about it.

The first gives us very fast arithmetic. What I think he's describing is, instead of working with x and y, you would work with log(x) and log(y). If `x * y = z`, then `log(x) + log(y) = log(z)`. (Similarly, exponentiation becomes multiplication in the log domain, so a square root is just a one-bit shift of the log!) Addition can be done in an approximate manner as well, using a bit more work. This is all from slide 5.
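To make that concrete, here's a toy sketch in Python, with plain floats standing in for the narrow fixed-point log word the hardware would actually use:

```python
import math

# Toy log-domain representation: store log2(x) instead of x.
def to_log(x):
    return math.log2(x)

def from_log(lx):
    return 2.0 ** lx

# Multiplication becomes addition of the logs.
def log_mul(lx, ly):
    return lx + ly

# Square root becomes halving the log -- in fixed point, a one-bit right shift.
def log_sqrt(lx):
    return lx / 2

x, y = 3.0, 7.0
print(from_log(log_mul(to_log(x), to_log(y))))   # ~21.0
print(from_log(log_sqrt(to_log(x))))             # ~1.732
```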

Releasing the hardware from the requirement that it be extremely accurate (he sets up his logarithmic representation to allow for ~1% error) allows him to make all this much smaller. He's shown that it can work "well enough" in some cases, and is working on finding more applications for the tech.

This isn't a free lunch (and isn't proposed as such), and not applicable across all of general purpose computing. But I would think there are lots of areas it could improve; whether it's huge datasets and customers that can afford expensive, custom hardware, or low power application areas that would benefit from approximate methods only a full-blown desktop can handle now (pattern recognition or noise reduction on phones and cameras, for example).


The other issue will be the I/O bottleneck. A practical application would need to have on-chip memory to buffer data (not mentioned here, I think). Even so, applications will be severely limited to those where the data needs enough processing to justify the cost of moving it on and off the chip.

I spend most of my time working on FPGAs, and I would imagine that in the end, a practical implementation of this chip would involve a good deal of similar work. Data would have to be piped from one core to adjacent cores, to keep the inter-chip bandwidth tractable. This could leave portions of the array near the edges unused because the data can't get there.

In summary, the applications for such a chip are limited, and will require a different skillset from what we usually ascribe to a programmer, but in those constraints these chips could be incredibly high-performance.

EDIT: I haven't played with log-scale arithmetic yet, but I think that adds / subs will be much more computationally complex--probably more so than multiplies are in the current linear-scale representation. Just a thought.


I don't think addition and subtraction are too bad in this model. The slides mention that there is a simple circuit computing F(t):=log(1+2^t). So if you have two numbers x and y, represented by their logarithms log(x), log(y), the log of their sum can be computed using F as follows: log(x+y) = log(x) + log(1 + y/x) = log(x) + F(log(y) - log(x)).

Whether this is better or worse than multiplies in the current linear-scale representation depends on how hard it is to compute F, I guess.
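A quick sketch of that identity (exact math here, taking the logs to be base 2, as the 2^t suggests; the hardware would approximate F with a small circuit or table, which is where the ~1% error budget comes in):

```python
import math

def F(t):
    # The "addition function" from the slides: F(t) = log2(1 + 2^t).
    return math.log2(1.0 + 2.0 ** t)

def log_add(lx, ly):
    # log2(x + y) = log2(x) + log2(1 + y/x) = lx + F(ly - lx)
    # Keep the larger operand first so the argument to F stays <= 0.
    if ly > lx:
        lx, ly = ly, lx
    return lx + F(ly - lx)

lx, ly = math.log2(5.0), math.log2(3.0)
print(2.0 ** log_add(lx, ly))   # ~8.0
```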


It's not especially obvious, the way his site is laid out, but there's a more conversational article at http://web.media.mit.edu/~bates/Summary.html


An interesting piece of hardware that I have seen was used in ray tracing, where high-speed, reduced-precision FP was used to sort polygons into those that were not visible, those where the low precision hid a change, and those that were otherwise visible. Conventional high-precision processing was needed only for those with changes. The overall performance improvement was 100x.

We are probably close to hitting the wall for general purpose solutions on general purpose hardware, but the future is wide open for application specific hardware.


Are we actually going to see application-specific hardware, or are we going to see something like an FPGA as a coprocessor in consumer hardware?


FPGAs really wouldn't help much. The theoretical gains sound good, but frankly, the practical speed of a CPU is measured by how fast it can run crappy code. An FPGA in a CPU would just be a pointless waste of silicon that almost nobody would program for.

Application-specific hardware is already being added to CPUs -- the newest Intel ones have a hardware RNG and hardware video encoders and decoders.


There's a team at UC Berkeley working on compiling C++ to a combination of CPU/GPU/FPGA code right now, which might change things. I think you need a special architecture to take full advantage of it though.


"already" ? GPUs are almost hardware implementations of directX/openGL APIs and it is not new.


In the ray tracing example I mentioned above, FPGAs got a 10x improvement. Going to silicon got another 10x.


Scaled integers give you an instant and very significant boost in performance. And, what's best, in most cases the error is far smaller than 1%. I've done tons of real-time image processing work on FPGAs where you almost never use floating point hardware due to how resource intensive (and slow) it is.
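For anyone who hasn't used them, here's a toy sketch of what scaled-integer (fixed-point) arithmetic looks like; the Q8 format here is an arbitrary choice:

```python
# Toy fixed-point ("scaled integer") arithmetic: value = raw / 2**FRAC_BITS.
# On an FPGA these are just integer adds/multiplies plus shifts.
FRAC_BITS = 8
SCALE = 1 << FRAC_BITS

def to_fix(x):
    return int(round(x * SCALE))

def from_fix(f):
    return f / SCALE

def fix_mul(a, b):
    # The product of two Q8 values has 16 fractional bits; shift back to Q8.
    return (a * b) >> FRAC_BITS

a, b = to_fix(3.14), to_fix(2.5)
print(from_fix(fix_mul(a, b)))   # ~7.85, error well under 1%
```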

Another construct that is extremely fast, cheap, and can even represent nonlinear and discontinuous relationships is the good old lookup table. In hardware (FPGA) you get one result per clock cycle.

Now, of course, if you can deal with the error and do the job with just a handful of transistors, it could be a good tool to have.

It might make more sense to have a hybrid. Build a CPU that includes a traditional FPU as well as a "physics-based" FPU. Then the programmer can decide which way to go by trading off between accuracy and deterministic behavior against speed and errors.


The traditional way has "IEEE 754" as a standard. Is there any such document for "physics-based" floating point? I'd rather call it "log-based" by the way.

And is this really floating point? As far as I understood, it is fixed point. So integers, just not with a two's-complement or unsigned interpretation.


How do FPGAs + scaled integers compare with a GPU?


An FPGA can't compete with a dedicated chip. The additional hardware switches and routing layers needed to make an FPGA programmable mean that there's a limit to how fast it can go and how much logic you can pack per square millimeter. What you can do is use an FPGA to validate the design and then convert it to an ASIC if the market can justify the costs involved.


Not sure about scaled integers, but I checked FPGA vs. GPU for bitcoin mining and the FPGA solution was an order of magnitude more expensive.


> Now each instruction (increment, load register, occasionally multiply) invokes >10M transistor operations

This was the most interesting statement in the slides to me. Can someone explain why so many?


He seems to be writing about floating point operations. Considering instruction pointer logic, buses, cache, branch prediction and all that jazz, not so surprising, but it still seems a lot to me.


Think of how complex merely fetching and decoding an instruction is on a current x86 machine.

Now consider how the pipeline, multiple issue, hazard detection and mitigation, and out-of-order execution are implemented.

Ten million is probably an underestimate by a fair margin.


According to Wikipedia the Alpha 21264 was 6M transistors not counting caches and the Pentium Pro was 5.5M total. 10M doesn't sound far off for today's processors. Since these are some of the earliest out-of-order superscalar processors, they may give an idea of the minimum cost of those features.


A quad-core i7 has 1.16 billion transistors, for reference. The question posed was whether each instruction might touch 10 million, or 0.86%, of these.


Invocations of massive parallelism are always exciting and often convincing because of the huge speedups you can get for niche applications. But historically, SIMD has not been successful (http://en.wikipedia.org/wiki/SIMD). Or, as a NASA program manager in the area of parallel computing once remarked more colorfully in the mid-90s, "SIMD is dead".

You could argue that GPUs have been a successful back door pathway for SIMD. This seems analogous to the offboard processors that are used in medical imaging or military signal processing (radar) -- there is a SIMD unit, but it is controlled by a general-purpose computer, and the whole thing is in a peculiar niche.


> But historically, SIMD has not been successful

Then why does Intel keep adding larger, more expanded SIMD instruction sets with every new CPU?

Why did ARM invent NEON and place it on all the high-end ARM chips, taking up more space than the entire main ALU?

Why are so many DSPs, for so many purposes, basically dumb SIMD machines? Broadcom's VideoCore, for example, is just a specialized 512-bit SIMD chip with insanely flexible instructions and a gigantic, transposable register file.

OpenCL/CUDA are SIMD languages -- controlling them with a "general purpose computer" isn't any different from your typical Intel chip, which is also a big SIMD unit controlled by a general-purpose computer.

It's rather difficult to call something that is an integral part of billions of CPUs "not successful". Without SIMD, much of modern computing would be crippled by an order of magnitude -- more for GPUs, whose shader units are basically SIMD engines and little more.


- Because SIMD is 'easy' from the processor's point of view. Basically the compiler is tasked with bundling parallel computations together. SIMD died out back in the 90s because no one could get that magical parallelizing compiler to work. Similar story with VLIW ISAs.

- Modern GPUs are not your 80's SIMD. Programs are written with very fine granularity. Lots of processor real estate is dedicated to scheduling/task switching, basically bundling fine-grained computations where possible. Modern GPUs are more like the 80's experimental dataflow machines than the SIMDs.


> SIMD died out back in the 90s because no one could get that magical parallelizing compiler to work.

That "magical parallelizing compiler" does work, and is known as a "human being". This curious type of compiler has been used for over 20 years to produce countless lines of SIMD code, ranging from Altivec to SSE.


The list of examples is beside my point, and the point of the original article.

The original article seems to be talking about SIMD as a general-purpose computer, with general-purpose applications, not as a secondary unit to another system.

My comment acknowledged the uses you list above, especially for GPUs, in which SIMD is a secondary processor to a controlling general-purpose computer.

I programmed Conway's Life on a 32x32 DAP in 1986, and I guess I feel like I've seen these claims once already.


This concept - trading accuracy for efficiency - has always made intuitive sense to me, it's great to see someone building it out.


This blew my mind. It seems such a simple idea to "move the instruction set closer to hardware". I am amazed no one thought about it before, because it seems so obvious.

The author provided convincing examples that 1% floating point accuracy is sufficient in many domains.



People should look into neuromorphic engineering, which has been around for a while and is basically computation using analog VLSI circuits (aVLSI). http://en.wikipedia.org/wiki/Neuromorphic

As a first approximation these are the analog computers our fathers and grandfathers used.


Here's a fun fact. There are approximately 365 * 24 ~ 10,000 hours in a year. If he can pull this off, then problems which currently take a year of CPU time could be solved in _one hour_ with the proviso that your results can be off by up to 1% (assuming that the error can be tracked and doesn't compound over the lifetime of the calculation).

I can think of a lot of data mining/machine learning applications for which that would be a huge benefit.


Any news since 2010?


No, but the author is CEO of Singular Computing, LLC: http://singularcomputing.com/

(not that there's anything there)


So it's basically a digital slide-rule. Am I the only one picturing a million guys in neckties moving a slide rule at a rate of 1 billion times per second?


Raise your hand if you want the product liability exposure caused by your code potentially having 1% (or whatever) errors in calculations.


This isn't for general purpose computing. This would only be used for specific domains where 1% errors don't matter. No one's going to be using this for financial transactions.


Is this the reason for the downvote?

First, I'd ask you to name, say, 1,000 applications where this would be acceptable. Games might be an obvious one.

OK, what else? Because, you see, without a significant ecosystem of applications that can tolerate these kinds of errors, the business case to fire up a foundry and make chips just can't be made.

Military? I don't know. You can justify even the most ridiculous things under the cover of military spending.

I would venture to say that the vast majority of the applications that can deal with a 1% error will work just fine using integer math which, if done correctly, can be far more accurate as someone else pointed out. And, I would further venture to say that they already use integer math.

In other words, in the context of the vast majority of applications this is not a problem needing a solution.

I've done lots of analog circuit designs (yes, analog...scary) where you creatively make use of the physics of the junctions to produce various non-linear transfer functions. It works well and can produce amazingly compact circuits that do what would require lots of digital circuitry. The errors, potential thermal drift and process variations can kill you though.

To give it a different spin, an FPU required an external chip 20 or 30 years ago. Today you have processors smaller than a fingernail running at multiple GHz with pipelined FPUs. My vote is for further development in the direction of faster and more accurate FP calculations rather than faster and less accurate (and full of other liabilities) FPUs.


> To give it a different spin, an FPU required an external chip 20 or 30 years ago. Today you have processors smaller than a fingernail running at multiple GHz with pipelined FPUs.

GPUs have so many FPUs (literally thousands) that their silicon footprint is quite large. Large SIMD FPUs, like those in Sandy Bridge, are also large and power-hungry, so much so that Intel actually has an (undocumented) feature where Sandy Bridge powers down half of the FPU if it hasn't been used recently.

With regard to the importance of accuracy, even Intel realizes the importance of being able to improve power and performance with lower-accuracy FP calculations: see for example their demo hardware here: http://www.theregister.co.uk/2012/02/19/intel_isscc_ntv_digi... .

"The FP unit can do 6-bit, 12-bit, and 24-bit precision and, by being smart about doing the math, it can cut energy consumption by as much as 50 per cent and - perhaps more importantly - you can do the math a lot faster because you only operate on the bits you need and don't stuff registers and do calculations on a lot of useless zeros."


Sure. Makes sense. The question really is about whether or not some of these applications can tolerate errors on the order of 1%. I contend that those applications are few and far between, particularly in this day and age. People don't want to go backwards.

For example, I can do photorealistic rendering of mechanical models from Solidworks. A 1% error in the output of these calculations would not be acceptable. The quality of the images would not be the same. Play with this in Photoshop and see what a 1% error means to images. I would have zero interest in a graphics card that ran faster and used less power but offered a 1% error in calculations. That's an instant deal breaker.

I can see the competitors' marketing: "Buy the other brand if you want 1% errors".

Does a radiologist care about the processing of an MRI happening faster? Sure. Can he/she tolerate a 1% error in the resulting images? Probably not. He probably wants more detail and resolution rather than less. I wouldn't want him/her to be evaluating MRI's with a baked-in 1% error. I think it is ridiculous to think that this sort of thing is actually useful outside of a very narrow field within which errors would be OK.

I'm still waiting for someone to rattle off 100 to 1,000 applications where this would actually be acceptable. Without that it is nothing more than a fun lab experiment with lots of funding to burn but no real-world utility.

Let's check in ten years and see where it's gone.


> Sure. Makes sense. The question really is about whether or not some of these applications can tolerate errors on the order of 1%.

I think the idea here is threefold:

1. The vast majority of computations do not need extremely high precision. A billion times more floating-point operations go into rendering video for games every day than for MRI processing (made-up statistic). Even now, we have double-precision floating point for this reason: most operations only need single-precision.

2. Applications that need a particular level of precision can be written with that level of precision in mind. If you know the ability of a particular operation to provide precision, and you are aware of the numerical properties of your algorithms, you can write faster code that is nevertheless still as accurate as necessary. Most of this isn't done by normal programmers, but rather by numerical libraries and such.

3. Many performance-intensive applications already use 16-bit floating point, even though it has very little native support in modern CPUs. AAC Main Profile even mandates its use in the specification. The world is not new to lower-precision floating point, and it has gained adoption despite its lack of hardware support on the most popular CPUs.

The entire idea that this is "let's make ordinary computations less accurate" is a complete straw man and red herring that nobody actually suggested.
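On point 3, a quick numpy check of how coarse 16-bit floats actually are, for comparison with the ~1% figure in the slides (the numbers below are just illustrative):

```python
import numpy as np

# float16 has a 10-bit mantissa, so its relative rounding error is about
# 2**-11, i.e. roughly 0.05% -- far coarser than float32/float64, yet still
# much tighter than the ~1% error discussed here.
x = np.float64(1.0) / np.float64(3.0)
x16 = np.float16(x)
print(x, float(x16), abs(float(x16) - x) / x)   # relative error ~2e-4
```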


"... name, say, 1,000 applications where this would be acceptable...what else?"

Image processing is already a large market but it is constrained by the fact that doing image processing on today's silicon is expensive.

If instead of "1000x faster" you read "1000x cheaper to do the same thing", you can imagine lots of computer vision / pattern (voice) recognition / noise reduction stuff becoming viable, and the market will be huge.


Yes! Or 1000x less power hungry.

Immediately obvious application: real time moving image recognition in your phone.



