Computing 10,000X more efficiently (2010) [pdf] (media.mit.edu)
134 points by luu on Feb 24, 2014 | 58 comments



Singular Computing has built a chip based on this tech and deployed it in a military UAV. It enabled tracking of objects (which was impossible before due to power constraints) at 6400x the performance/power of the other best known methods.

http://www.defensetechbriefs.com/component/content/article/1...

http://proceedings.spiedigitallibrary.org/proceeding.aspx?ar...

From the paper: "The hardware is designed to perform high dynamic range arithmetic with roughly 1% error per operation. Singular has developed and studied varied algorithms for producing high quality results despite the approximate hardware. These studies use a perfect simulation of the accelerator’s arithmetic. Tasks that have been explored include summing thousands of numbers without accumulating error, k-nearest neighbor classification (KNN), foreground/background separation using Gaussian mixture models, iterative reconstruction tomography, deblurring using the Richardson-Lucy deconvolution algorithm, FFTs, radar processing, neural net learning, and other tasks. Most of these algorithms need slight adaptations to prevent cumulative effects of the 1% error, but with those adaptations all perform as desired."
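
To make the "summing thousands of numbers without accumulating error" claim concrete: the paper doesn't say which adaptation they used, but one standard trick is pairwise (tree) summation. Here's a toy Python sketch with a simulated 1%-error adder (my own model, not Singular's hardware):

    import random

    def noisy_add(a, b, rel_err=0.01):
        # Toy model of an approximate adder: the exact sum times (1 +/- ~1%).
        return (a + b) * (1.0 + random.uniform(-rel_err, rel_err))

    def sequential_sum(xs):
        total = 0.0
        for x in xs:
            total = noisy_add(total, x)
        return total

    def pairwise_sum(xs):
        # Tree summation: each input passes through only ~log2(n) noisy adds.
        if len(xs) == 1:
            return xs[0]
        mid = len(xs) // 2
        return noisy_add(pairwise_sum(xs[:mid]), pairwise_sum(xs[mid:]))

    xs = [1.0] * 10000
    print(abs(sequential_sum(xs) - 10000) / 10000)  # typically tens of percent off
    print(abs(pairwise_sum(xs) - 10000) / 10000)    # typically around 1% off

In the sequential sum, early partial results are exposed to thousands of noisy adds, while in the tree each value sees only about 14, which is presumably the flavor of "slight adaptation" the quote refers to.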


And these are precisely the things we know little about that the NSA may be exploiting with extreme effectiveness, which scares the shit out of me.

I recall an article from 2005 that described a system that could track BULLETS IN MID FLIGHT over a vast swath, in the DARK, providing a detailed view of all ballistics in a fighting theater....


It's probably not of much use to the NSA where breaking encryption requires precise math that will fail quickly if operations are off by 1%.


I can however see lots of applications in various pattern analysis techniques based on machine learning. There are lots of places in those where a bit of fuzz won't really matter, at least for first-pass filtering.

Similarly, voice recognition would probably have to deal with far larger errors from the transmission and capture of sound anyway.

Just because it's not good for decryption doesn't mean it isn't good for the overall set of NSA operations.


Yeah, even human brains can do voice recognition, and they're noisy as fuck.


For floating point operations sure, but if I understand it correctly it could also provide a speed advantage to other operations. It means 1% of the operations will be wrong, but what does that matter when it allows you to search 10,000x more keys? Perhaps you could get even more speedup with specialized hardware for decryption, and then buy rooms full of them.


This topic was already discussed, albeit satirically, in another publication, namely "The Slow Winter."[1] While comedic, it makes good points about hardware architecture.

[1] https://www.usenix.org/system/files/1309_14-17_mickens.pdf


Off-topic: if an integral is an approximation of a physical process, then that process is an approximation of the integral. I.e. we could in principle have a "chip" running a miniature lab experiment (like water flowing) with sensors that would output "an approximate solution of the Navier-Stokes equations". Numerical analysts would then write new versions of their algorithms assuming "an efficient hardware solution of N-S". Has this concept ever been used anywhere? I guess quantum computing would qualify.


> I.e. we could in principle have a "chip" running a miniature lab experiment (like water flowing)

It looks like this is a thing:

http://en.wikipedia.org/wiki/Fluidics

http://en.wikipedia.org/wiki/Water_integrator

http://en.wikipedia.org/wiki/MONIAC_Computer


Yes, but it's been largely displaced by digital computation: http://en.wikipedia.org/wiki/Analog_computer


Analog computers used to be a thing. If you want to get into this now it's actually a good time, thanks to the resurgence in modular synthesizers, which are just analog computation optimized for audio ranges. You can easily find modules specifically for math operations and so forth.


We used to use a carefully biased diode in the feedback loop of an op-amp to create logarithmic signals. I was expecting the article to dispense with digital electronics (in the core of the machine) and use tuned physical properties to actually do the calculations.

What a disappointment that the first few slides imply such a breakthrough and then we find that the research is simply about reducing precision to the minimum needed. And it's patented? (rolls eyes)


it's patented? 16 bit floats have been a thing for a long time.. :/

edit: never mind. somehow convinced myself that this was just 16 bit floating point arithmetic. it is not.


I wrote an IEEE single-precision compatible floating point library for embedded 8088/8086 processors in the mid-to-late '80s ... my employer was very cheap!

On the other hand, I learned a ton about efficient ways to implement various algorithms and remember a great way to do square roots that involved an initial multiplication (using normalized binary FP numbers) and then converged in about six iterations.
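
The exact trick isn't spelled out above, but a minimal Python sketch of the usual shape (seed a guess from the float's exponent, then let a handful of Newton iterations converge; the version described apparently seeded with a multiplication instead) would be:

    import math

    def fast_sqrt(x, iters=6):
        # Seed from the exponent: frexp gives x = m * 2**e with 0.5 <= m < 1,
        # so 2**(e // 2) is already within about a factor of sqrt(2) of sqrt(x).
        if x <= 0.0:
            return 0.0
        m, e = math.frexp(x)
        guess = math.ldexp(1.0, e // 2)
        for _ in range(iters):
            guess = 0.5 * (guess + x / guess)  # Newton step for y*y - x = 0
        return guess

    print(fast_sqrt(2.0), math.sqrt(2.0))

Because Newton's method converges quadratically, roughly six iterations from a seed that close is plenty for single precision.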


you'd have to write a solver for this special hardware, which may see your performance disappear. double precision is usually required for N-S solvers. accounting for errors means injecting noise and averaging (probably) over multiple iterations.

to get 64 bits of precision where only 16 bits are available would be.. tricky.

that said, with 1M PEs on a chip.. i'd definitely give it a go.

source: GPU+CFD are my thing.

edit: seems there is even less than 16 bits of precision available..


I think human brains are probably wired for speed over accuracy too. This might have a lot of potential in machine learning and eventually AI, where information processing speed is often more important than accuracy.


If you believe in evolution and natural selection, then your functionality is just fast and accurate enough to keep you from dying before mating. That's as good as the optimization gets, except by random accident.


My genes are more likely to be passed on for multiple generations if I survive long enough to mate AND care for my children. If they die at age five, my mating did nothing.


True, but his point still stands - if you believe in evolution and natural selection, your brain is designed for "good enough fast enough" rather than "truly correct" reasoning.


If you find this topic interesting, Robert Nozick's "Invariances: The Structure of the Objective World" explores how this fact should affect epistemology and ethics.

It's a tough book but it's worth your time.


I believe that the theory of evolution by sexual selection is far more important. Same guy, a refinement of his original theory. In that case optimization can take many different forms. Parents arranging marriages of their children, or reverence for grandmothers. Or for that matter, the old boy's explanation of peacock feathers.


>then your functionality is just fast and accurate enough to keep you from dying before mating

Approximately. And locally.


There is probably an efficiency reason the human brain has 85,000,000,000 (85 billion) neurons that operate with lossy compression, in parallel, while drawing less than 20 W of power.


Probably more relevant is the greatly reduced clock speed (ignoring that the brain doesn't so much have a centralized clock - neurons only fire at <1 kHz, compared to the ~2 GHz processors used in clusters).

The technology also seems somewhat incomparable. The purely chemical signal processing at the synaptic cleft presumably saves a lot of power.


maybe it's more a throughput thing: if the brain is massively parallel, it's a giant pipeline with many things in flight all the time


Each neuron in the human brain has ~10^4 connections to nearby neurons. This makes it an insanely data-parallel computer. This PoS is a task parallel nightmare.


I think this guy successfully patented half-precision floating point. No really, look:

http://www.google.com/patents/US8150902

What a troll!


This guy went on to found a company, as I recall, which I think was then acquired. It's a pretty clever way to make fast floating point.


Who did acquire them?


It would be nice to see where he is with this now - that was 2010, and it appears nothing has been written about it since.



It would be funny if a technology pioneered by "Singular Computing" were a primary contributor to reaching the Singularity.


It wouldn't be funny, it would probably be the intended point of the name.


Can anyone explain what page eleven, "Tomography," is supposed to mean? The example on the previous page seemed straightforward, but when I got to page eleven I had no idea what I was looking at.


radon transform is used to produce tomography images. just take a look at the wikipedia page... s/he's talking about that...


I guess(??) I could have been clearer.

I understand what tomography is. I am curious how to interpret the three images and the differences. What does this page demonstrate to the reader? To me they look almost exactly the same. The one big difference is that the original has the intersecting arrows. (I am also not really sure what the arrows are supposed to convey.)


It's a simulation showing how 1% FP error doesn't have much effect on the result. They look almost the same...that's the point.

If you don't know what the arrows represent then you maybe don't understand what tomography is after all.


I understand the speed and energy efficiency of the proposed design can only be reached with specially designed hardware, but couldn't some of the expected effects be achieved on existing cpu-core-hardware designs by using the lower precision requirements?

Also, this could be a nice code golf idea: solve algorithm xyz with only a strictly limited CPU instruction set, which could be chosen to be those operations that are very quick or very energy efficient on a target platform...


Specifically, it would be interesting to compare a GPU (instead of a classic CPU) against the proposed imprecise approach.


Reminds me of Albert Edgar and Samuel Lee's paper in CACM in 1979, "FOCUS Microcomputer Number System".


1% error is, in some sort of half-assed sense, comparable to noise at -40 dB relative to the signal, which strikes me as a lot of noise to be adding to a problem.


This HW is a lot like the Parallella's: http://www.parallella.org/


This guy is totally stupid. I just read the 4th slide.

1% error - what does that mean? Error in the mantissa? A shorter mantissa? Simplified IEEE?

He writes O(5K) and presumes it to be "better" than 500K ... wtf? Does he know what O() means?

So, he basically "rounds" the values in images, mentions some recent technologies and says "if we had better engineers, we would have flying cars blah blah blah ...".


Shit HN Says: the adjunct professor at CMU who's designing floating point hardware for processors is "totally stupid."

I'm guessing he intentionally underspecified for his audience, and if asked, could tell you in excruciating detail what 1% error means.

But skimming the patent [1], it seems that there's sometimes an error in a mathematical operation, measured as the computed value minus the expected value. The result doesn't seem surprising if you read how he defined addition in his hardware.

[1]: http://www.google.com/patents/US8150902


This design is utter computational tripe that completely ignores Amdahl's Law or any notion of data-parallelism.

His 1% error comes from relying on 7-bit logarithmic floating point used in a task-parallel manner. This is a non-starter for HPC. While most theoretical models certainly have 1% or greater underlying errors in accuracy, a 1% or worse error in precision is going to doom these algorithms to numerical instability without Herculean additional effort that would obliterate the computational advantage here.

Neural networks? See The Vanishing Gradient Problem.

Molecular dynamics? It's numerically unstable without 48-bit or better force accumulation as proven by D.E. Shaw.

NVIDIA, AMD, and Intel have invested a huge sum in manycore processors that are already too hard to program for most engineers. These processors are a cakewalk in comparison to what's proposed here.

Finally, even if you did find a task amenable to this architecture (and I'd admit there may be some computer vision tasks that might work here), where's the data bus that could keep it fed? We're already communication-limited with GPUs for a lot of tasks. Why do we even need such a wacky architecture?


Computational approximation has been shipped en masse very successfully.

You mention GPUs but did you know that GPUs already do a lot of approximate math, for example, fast reciprocal and fast reciprocal square root?

You mention how approximation must be impossible in all these applications (because REASONS) but all methods that numerically integrate some desired function are doing refined approximation anyway. If you have another source of error, that lives inside the integration step, it may be fine so long as your refinement is still able to bring the error to zero as the number of steps increases.
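
To make that last point concrete, here's a toy Python sketch (my own, and it assumes the per-operation error behaves like zero-mean noise, which is the optimistic case): a midpoint-rule integral whose every evaluation is perturbed by ~1% still converges as the step count grows, because the independent errors average out:

    import math
    import random

    def noisy_midpoint(f, a, b, n, rel_err=0.01):
        # Midpoint rule where every evaluation is multiplied by (1 + ~1% noise),
        # a crude stand-in for doing the arithmetic on approximate hardware.
        h = (b - a) / n
        total = 0.0
        for i in range(n):
            x = a + (i + 0.5) * h
            total += f(x) * (1.0 + random.gauss(0.0, rel_err))
        return total * h

    exact = 2.0  # integral of sin(x) over [0, pi]
    for n in (100, 10000, 1000000):
        approx = noisy_midpoint(math.sin, 0.0, math.pi, n)
        print(n, abs(approx - exact) / exact)  # relative error keeps shrinking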

Your diagnosis of "utter computational tripe" and the accompanying vitriol seem completely inappropriate.


Really? I don't know about GPUs? That's news to me! Did you know that the precision of the fast reciprocal square root on NVIDIA GPUs is 1 ulp out of 23 bits? That's a world of difference from 1 ulp out of 7 bits. I wouldn't touch a 7-bit floating point processor. Life is too damned short for that.

And that's because I have spent days chasing and correcting dynamic range errors that doomed HPC applications that tried to dump 64-bit double-precision for 32-bit floating point. It turns out in the end that while you can do this, you often need to accumulate 32-bit quantities into a 64-bit accumulator. Technically, D.E. Shaw demonstrated you can do it with 48 bits, but who makes 48-bit double precision units?

I stand by the computational tripe definition (with the caveat that Hershel has now posted an app where this architecture is possibly optimal). My objection is to the broad extraordinary claims made in the presentation above.

And hey, you're a game developer, so let me give you an analogy: would you develop a software renderer these days if you were 100% constrained to relying on mathematical operations on signed chars? It's doable, but would you bother? Start with Chris Hecker's texture mapper from Game Developer back in the 1990s; I'm guessing madness would ensue shortly thereafter. Evidence: HPC apps on GPUs that rely entirely on 9-bit subtexel precision to get crazy 1000x speedups over traditional CPU interpolation do not generally produce the same results as the CPU. If the result is visual, it's usually OK. If it's quantitative, no way.


> who makes 48-bit double precision units?

IIRC Cray did (but they called it “single”). =)

Snark aside, I agree broadly with the points you’re making here. This isn’t especially groundbreaking; this is using the fact that logarithmic number representations don’t require much area to implement if you don’t need high accuracy and are willing to trade latency for throughput (something that FPGA programmers have been taking advantage of since forever), and then going shopping for algorithms that can still run correctly in such an environment.
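
To make the logarithmic-number-system point concrete, here's a toy Python sketch of the multiply side (the bit budget is made up for illustration, not taken from the patent):

    import math

    FRAC_BITS = 6            # fractional bits of log2(x); an arbitrary toy budget
    SCALE = 1 << FRAC_BITS

    def encode(x):
        # Keep only a rounded, scaled log2(x); sign and zero handling omitted.
        return round(math.log2(x) * SCALE)

    def decode(code):
        return 2.0 ** (code / SCALE)

    def lns_mul(a_code, b_code):
        # Multiplication in the log domain is just integer addition of the codes.
        # (Addition of values is the hard part: it needs a table/approximation,
        # which is where the real area-versus-accuracy trade-off lives.)
        return a_code + b_code

    a, b = 3.7, 12.9
    approx = decode(lns_mul(encode(a), encode(b)))
    print(approx, a * b, abs(approx - a * b) / (a * b))
    # Worst-case quantization error per encoded value is about
    # ln(2) / 2**(FRAC_BITS + 1), i.e. roughly 0.5% with this budget.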


"Don't integrate the noise" is, generally, pretty good advice.


>Neural networks? See The Vanishing Gradient Problem.

For what it's worth, Geoff Hinton and others have had a lot of success in the last few years by intentionally injecting noise into their neural network computations. The best-known example is [0], but he's also had some success just adding noise to the calculation of the gradient in good old sigmoid MLPs--even reducing the information content of the gradient below a single bit in some cases.

Going back further, stochastic gradient descent and other algorithms have shown for decades that trading accuracy for speed can be very desirable for neural network computations.

[0] http://arxiv.org/pdf/1207.0580.pdf


Except the degree of error introduced by techniques like "dropout" and "maxout" is a hyperparameter under user control, as are the many possible implementations of stochastic gradient descent.

What's being suggested here seems equivalent to saying that because you love the crunchy fired top of creme brulee, why not go ahead and torch the whole thing?

As for contrastive divergence with Restricted Boltzmann Machines overcoming the vanishing gradient problem: that's true, but has anyone even tried doing this with 7-bit floating point and demonstrated that it even works? I'm assuming recurrent neural networks relied on at least 32-bit floating point (correct me if I'm wrong). A reply to me flickered on here briefly indicating they haven't built a processor based on this yet, which the author then deleted.

I think this would be a much more interesting architecture if it would just stop trying to reinvent floating point or "if it ain't broke, don't "fix" it."


>As for contrastive divergence with Restricted Boltzmann Machines overcoming the vanishing gradient problem. That's true, but has anyone even tried doing this with 7-bit floating point and demonstrated it even works?

I wasn't referring to RBMs here, although it does look like I misremembered the details. See slides 74-76 of this presentation for a brief sketch [0]. In these feed-forward MLPs, the information content from each neuron is capped at one bit (which is quite a bit less than seven).

I seem to recall him saying something similar (and with more detail) in a Google Tech Talk a few months later [1], although I'm not sure.

[0]: http://www.ipam.ucla.edu/publications/gss2012/gss2012_10743.... [1]: http://www.youtube.com/watch?v=DleXA5ADG78


Indeed, but I quote: "The pre-training uses stochastic binary units. After pre-training we cheat and use backpropagation by pretending that they are deterministic units that send the REAL-VALUED OUTPUTS of logistics."

So, without trying to sound like a pedant, the manner in which the error is introduced is effectively a hyperparameter, no? The skills involved in doing this correctly will land you the big bucks these days:

http://techcrunch.com/2014/01/26/google-deepmind/

My problem is that I think the programming experience of an entire hardware stack based on 7-bit floating point would be abysmal and that companies that tried to port their software stack to it would die horribly (to be fair, they'd deserve it though). OTOH, if they'd just stick to the limited domains where this sort of thing is applicable, my perception of it would flip 180 degrees - it's probably a pretty cool image preprocessor. What it's not is a replacement for traditional CPUs.

But the people behind this seem to have utter contempt for the engineers tasked with programming their magical doohickey, or putting this in TLDR terms:

1. Scads of 7-bit floating point processors
2. ???
3. PROFIT!!!


>Neural networks? See The Vanishing Gradient Problem.

Only on deep networks. And as another comment mentioned, NNs can withstand and even benefit from noise.

There are numerous ways around it as well. You could just run the network several times (probably in parallel) and average the results together. Same with most of the other problems. It does defeat the purpose a little bit, but it would still retain most of the speedup. You could train multiple networks and average them together (also very successful even in deterministic nets). Perhaps the network itself could be made more redundant or error tolerant.
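
The averaging suggestion is easy to sanity-check with a toy Python sketch (assuming the hardware's errors behave like independent zero-mean noise across runs, which is the favorable case):

    import random
    import statistics

    def noisy_run(true_value=1.0, rel_err=0.01):
        # Stand-in for one pass of some computation on ~1%-noisy hardware.
        return true_value * (1.0 + random.gauss(0.0, rel_err))

    for k in (1, 16, 256):
        trials = [statistics.mean(noisy_run() for _ in range(k))
                  for _ in range(2000)]
        print(k, statistics.stdev(trials))  # spread drops roughly as 1/sqrt(k)

If the errors are biased rather than zero-mean, averaging runs buys you nothing, which is presumably why the slides talk about algorithms needing "slight adaptations."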

Hell if the thing is as fast as promised, you could even do genetic algorithms and get reasonable performance.


"NNs can withstand and even benefit from noise."

This is the sort of broad statement I put in the same fool file as "We used simulated annealing because it's guaranteed to converge." Bonus points for lobbing the genetic algorithm grenade into the mix(1).

In the former case, it's the judicious use of noise rather than basing the entire application on an architecture designed to inject 1% error into every calculation (which seems to me to be a pretty good definition of programmer hell), and in the latter case, simulated annealing only converges assuming you run it for INFINITE TIME. But that doesn't stop people from saying it, or even Jonathan Blow, someone I otherwise respect for his indie game dev creds and success therein, from making an equally uninformed statement such as equating 1% computational error to 0.00001% error.

Getting useful information out of Neural Networks requires extreme attention to detail. If it didn't, Google wouldn't be paying up to $10M a head for people who can do this - the DeepMind acquihire.

Imagine the alternative here: I train a cluster of GPUs or FPGAs and I get useful results months before the engineer working with 1% error everywhere gets a network implementation that even converges. I then design and tape out custom ASICs to do this 10x faster within a year. See bitcoin mining for a successful example of this approach.

1. My doctorate made extensive use of genetic algorithms, I'm very familiar with their strengths and weaknesses.


I never claimed anything about simulated annealing, and I know the noise is less than ideal, but adjustments can be made. Really, my point stands: there are plenty of ways around it, and with the supposed 10,000X speedup it would be a massive advantage.


Even the supposed 10,000x speedup is utterly bogus...

The FPU here has 5,000 transistors.

If you can stomach the lack of full IEEE compliance, you can build a full 32-bit FPU with ~100,000 transistors:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.71....

So really what we have here is at best a ~20x performance boost in exchange for an absolute programming nightmare.

Or you can switch to DSP-like fixed point math with ~25K transistors per ALU:

http://www.idosi.org/wasj/wasj2%284%29/12.pdf

So now we're down to ~5x better performance in exchange for that programming nightmare.

And just to be thorough, if you're hellbent on nontraditional arithmetic, here's an error-free 32-bit integer ALU made with 1696 transistors, that would be >3x more efficient than the architecture here (assuming all you care about is throughput):

http://www.iosrjournals.org/iosr-jece/papers/Vol3-Issue6/B03...

So now we're talking at best 1/3 to 1/4 the performance of what they could get if they dropped logarithmic math and went to something more predictable. Someone call Vinod Khosla, we're gonna be rrrriccchhhh...
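
For what it's worth, the arithmetic behind those ratios, assuming (as the comparison does) that throughput simply scales with how many units fit in a fixed transistor budget:

    budget = 1_000_000  # an arbitrary transistor budget for comparison
    units = {
        "7-bit LNS FPU (slides)":  5_000,
        "32-bit FPU, non-IEEE":  100_000,
        "fixed-point DSP ALU":    25_000,
        "32-bit integer ALU":      1_696,
    }
    for name, transistors in units.items():
        print(name, budget // transistors, "units per", budget, "transistors")

That reproduces the ~20x (200 vs. 10 units), ~5x (200 vs. 40), and roughly 3x in the other direction (589 vs. 200) figures above.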


O() is colloquially used to mean "approximately", among people who do actually know what O() means.



