"It’s done in hardware so it’s cheap" (yosefk.com)
84 points by dmit on Aug 4, 2012 | 29 comments



Slightly off topic, but it's widely accepted among physicists that the act of computation expends energy [1]. Thus, there are actually limits to how much the cost of a given computation can be reduced, regardless of how cleverly we build the computer, or what we build it out of (silicon, DNA, fiber optics, whatever) [2].

[1] http://en.wikipedia.org/wiki/Landauer%27s_principle

[2] If we're willing to use algorithms that don't destroy information, or willing to operate at arbitrarily low temperatures, as I understand it there's no theoretical limit to how small we can make the energy costs, but these restrictions seem highly impractical.
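
For a sense of scale, here's a rough back-of-the-envelope calculation of that limit, assuming room temperature (~300 K):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        const double k_B = 1.380649e-23;  /* Boltzmann constant, J/K     */
        const double T   = 300.0;         /* assumed room temperature, K */
        /* Landauer limit: minimum energy to erase one bit of information. */
        double e_bit = k_B * T * log(2.0);
        printf("Landauer limit at %.0f K: %.3g J per erased bit\n", T, e_bit);
        return 0;
    }

That works out to roughly 2.9×10^-21 J (about 0.018 eV) per erased bit.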


While this is true, current hardware is several orders of magnitude away from the Landauer limit, so there's still quite a bit of room for building computers more cleverly before we confront it.

However, reversible algorithms are not in fact particularly impractical. They require a somewhat different way of thinking about things, but they're dramatically easier than, say, achieving quantum-computing speedups.
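
To make "don't destroy information" concrete, here's a toy sketch (mine, not from the article): an update that can always be undone versus one that can't.

    #include <stdint.h>

    /* Reversible update: (a, b) -> (a, a ^ b). Nothing is erased; applying
       the same step again restores the original (a, b). */
    static void cnot(uint32_t *a, uint32_t *b) {
        *b ^= *a;
    }

    /* Contrast with an irreversible update like *b = 0, which destroys
       whatever b held; that erasure is what Landauer's principle charges for. */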

Similarly, cryogenic computation is entirely reasonable, especially in space.


"several orders of magnitude"???

We are an order of magnitude worth of orders of magnitude away from that limit.

Specifically: 5.539×10^11 times as much energy (for the i7 920, picked randomly).

http://www.wolframalpha.com/input/?i=130+w+%2F+%280.0178+ele...
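
For anyone who can't see the truncated query: it appears to divide the i7 920's ~130 W TDP by the Landauer energy per bit times an operation rate. A rough reproduction (the operation rate below is my own ballpark assumption, so the digits won't match the linked result exactly):

    #include <stdio.h>

    int main(void) {
        const double e_landauer = 2.87e-21;  /* J per erased bit at ~300 K */
        const double tdp        = 130.0;     /* i7 920 TDP, watts          */
        /* Assumed: ~2.66 GHz x 4 cores x a handful of bit-level operations
           per core per cycle. A ballpark, not a measured figure. */
        const double ops_per_s  = 8.0e10;
        printf("~%.1e times the Landauer limit per operation\n",
               tdp / (e_landauer * ops_per_s));
        return 0;
    }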


Thanks for doing the calculation! I'm not checking it right now, but it's in the right range. However, in the tradeoff between shortest time of execution of serial programs and power efficiency per instruction, the i7 is way over at the shortest-time-of-execution extreme. Common GPUs are about a thousand times as power-efficient, and many embedded microcontrollers are nearly as power-efficient as GPUs. Check out the Bitcoin hardware performance Wiki pages for hard data.


Your wikipedia link is broken. Try this:

http://en.wikipedia.org/wiki/Landauer%27s_principle


Thanks, edited and fixed!


As someone who has done extensive work in image processing using custom hardware, I am not really sure what he is talking about. Is this intended to suggest that software is cheaper than hardware? Or that it has performance advantages over specialized hardware? Not sure.

It's tough to beat smartly designed specialized hardware in image processing. Some of the things I've done would require ten general-purpose computers running in parallel to accomplish what I did with a single $100 chip. So, yes: lower cost, higher data rate, reduced thermal load, reduced physical size, lower power requirements, etc.

Maybe I don't get where he is going with this?


It is tough to pull a coherent thesis from this piece, but it seems to be close to: "be aware of the costs involved, because although specialised hardware can be useful in many situations, it is not a magic wand you can wave." From the way it opens, he seems to imply that he is used to people excusing slow/inefficient ideas by handwaving that doing it in hardware will be fast, without doing any real critical evaluation of what gains hardware can actually bring. The fact that some systems can see real gains from specialised hardware does not serve as a counterargument if this is in fact his thesis.


You can get a somewhat better idea of where he's coming from in his previous series (linked in the post), responding to people arguing that high-level languages would be faster if hardware were designed for them. In his view, the Lisp-machine idea of HLL-specialized hardware design rarely pans out vs. just RISC with an optimizing compiler. This piece seems to be applying the same critique to algorithms more generally: "we'll just do it in hardware" isn't a magic win, because the problem isn't always an impedance mismatch with the hardware that can simply be fixed by choosing different hardware.

The places he suggests you can get a win seem sensible: 1) cases where the cost of dispatching instructions and handling intermediate results dominates, in which case a CISC-ish specialized instruction implemented in silicon may be a win over stringing together simpler operations; and 2) cases where you can get extra parallelization in hardware that isn't available through general-purpose instructions (e.g. doesn't map on nicely to SSE-style instructions).
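
A concrete (made-up) illustration of case 1: a saturating multiply-accumulate written as the chain of generic operations a CPU would string together. Fusing that dataflow into one silicon datapath doesn't reduce the arithmetic; it removes the per-step instruction dispatch and the architecturally visible intermediate results, which is where the win comes from.

    #include <stdint.h>

    /* The chain a general-purpose core executes: multiply, widen, add, clamp.
       Each step is a dispatched instruction whose intermediate result has to
       be written back somewhere. */
    static int32_t mac_saturate_generic(int32_t acc, int16_t a, int16_t b) {
        int64_t prod = (int64_t)a * (int64_t)b;   /* multiply   */
        int64_t sum  = (int64_t)acc + prod;       /* accumulate */
        if (sum > INT32_MAX) sum = INT32_MAX;     /* clamp high */
        if (sum < INT32_MIN) sum = INT32_MIN;     /* clamp low  */
        return (int32_t)sum;
    }

    /* A DSP-style fused multiply-accumulate-saturate instruction performs the
       same dataflow in one pass through a dedicated datapath: same arithmetic,
       but no per-step dispatch and no intermediates spilled to registers. */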


Lisp machines belong to a past so distant that I don't think they're relevant for exploring the possibilities of contemporary hardware. The design constraints have moved: clocks are in the gigahertz range, memory is orders of magnitude slower, and we have to deal with multiple levels of cache between them. And that's just the start.

In fact, the concept of "doing it in hardware" (be it specialized logic or dedicated generic processors) is alive and well (and bearing fruit) on every mainframe manufactured.

The fact we all use similar x86 boxes designed to be compatible with MS-DOS is a tragedy.


I don't know if tragedy is the right word for this. We live in an era where the marketplace has more of an effect on the trends in computing than academia. Nothing tragic about that.


It really is. It took years after the Amiga for the average PC to have preemptive multitasking. The same story repeats with 64-bit machines and multi-processors. In the latter case, it took so long for the average PC to have more than one CPU/thread that most current software (and programming languages) simply isn't designed for it and cannot take advantage of those extra compute engines. Even if it were, most programmers never learned how to write software this way.

The PC standard set us back at least a decade, most probably two.


"It took years after the Amiga for the average PC to have preemptive multitasking. "

Partly thanks to the MS OS/2 2.0 fiasco, which also resulted in it taking ten years after Intel released the 80386 for 32-bit programming to become popular. Needless to say, the x64 transition went much better.


It did, but how long after the arrival of the first 64-bit CPUs did that happen? I ran Windows on an Alpha that wasn't much more expensive than a high-end Pentium PC while being unbelievably faster. Would it have survived in a world where being able to run MS-DOS weren't so important? I'd say yes. 32-bit programs were the norm on Amigas and Atari STs from day one. It was only the PC world that lagged years behind every competing platform, and it dragged us back when it finally eradicated its technically superior competition.

Most of the kludginess in every modern x86 computer is dictated by the need to emulate parts of an IBM 5150.


And AMD forced Intel's hand on the move to 64-bit.


Intel had already moved on. All AMD did was make Intel put the x86 back into the 64-bit picture.

I'd prefer a clean break.


The parent of my comment mentioned the 80386, so my comment was about x86 and not, for example, Itanium. In the x86 world, AMD forced Intel's hand.

And I totally agree - I would also prefer a clean break.


I'd also prefer a clean break, something clean like ARM (although it's getting dirtier).

But the Itanium that Intel moved on to? That isn't a clean break. It's a mistake. I'm so happy that AMD was able to force them to make something useful instead.


And notice that PX00307 did not mention multitasking at all:

http://news.ycombinator.com/item?id=3441885

Would Cutler or Letwin consider this acceptable?


Being multithreaded is hardly important at all. You can just have single-threaded programs, and they'll work. Almost all software doesn't need to take advantage of extra compute engines. Reasonably designed software that could take advantage of multi-threading can be modified to do so, and poorly designed software cannot -- it doesn't really have as much to do with being written for single-threaded machines as it does with being poorly designed. You can always spawn a worker process if your current process is some hare-brained monstrosity.


The ancient style of HLL-specialized hardware no longer makes sense, but that doesn't mean that there aren't other improvements that make sense. For example, support for read and write barriers (especially read barriers) would dramatically improve garbage collection.
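
To make the barrier cost concrete: without hardware support, a generational or concurrent collector has to emit something like the following card-marking write barrier on every pointer store (a generic sketch of the technique, not any particular VM's code):

    #include <stdint.h>

    #define CARD_SHIFT 9                 /* 512-byte cards, a typical choice */

    extern uint8_t  *card_table;         /* one dirty byte per heap card */
    extern uintptr_t heap_base;

    /* Software write barrier: record which card the store landed in so the
       collector only has to rescan dirty cards later. These extra instructions
       run on *every* pointer store, which is exactly the overhead
       hardware-assisted barriers would absorb. */
    static inline void write_ref(void **slot, void *new_value) {
        *slot = new_value;
        card_table[((uintptr_t)slot - heap_base) >> CARD_SHIFT] = 1;
    }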


That's plausible, although I'm somewhat skeptical: if it truly worked out to be worth the cost, I'd expect it to have shown up in someone's hardware already, since there is a lot of enterprise software built on the JVM and .NET that would benefit from garbage collection speedups, and which is being run by companies willing to pay for those speedups.

Fwiw, Yosef K.'s own follow-up to his "HLL CPU challenge" did acknowledge several proposals he received as plausible candidates: http://www.yosefk.com/blog/high-level-cpu-follow-up.html


It did show up in somebody's hardware. Azul made (and is still making) hardware for Java that does this. They also release an x86 JVM that is able to fake it a bit by using HW virtualization support and double mapping pages.


So I guess the premise is that you have two extremes: minimal instruction set (RISC) and complex (a CPU designed to run Python, for example).

IMO, as is often the case, the answer lies in the middle. Look at the tremendous impact that adding AES acceleration to x86 processors has had on applications that require encryption.
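
For reference, here's roughly what that acceleration looks like from C: with AES-NI, a full AES round is a single instruction (key expansion omitted; the round keys are assumed to be precomputed):

    #include <wmmintrin.h>   /* AES-NI intrinsics; compile with -maes */

    /* Encrypt one 16-byte block with AES-128, given 11 expanded round keys. */
    static __m128i aes128_encrypt_block(__m128i block, const __m128i rk[11]) {
        block = _mm_xor_si128(block, rk[0]);         /* initial AddRoundKey   */
        for (int i = 1; i < 10; i++)
            block = _mm_aesenc_si128(block, rk[i]);  /* one full round per op */
        return _mm_aesenclast_si128(block, rk[10]);  /* final round           */
    }

Each _mm_aesenc_si128 replaces the per-round table lookups and XORs of a software implementation, which is where the throughput (and constant-time) win comes from.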


That seems to be what he's arguing, actually; that you can get hardware speedups if you carefully target very specific things that a parsimonious addition of hardware features can enable.


Never noticed that tremendous impact - in fact, one of the reasons Rijndael became AES was that it was already very fast on contemporary 32-bit computers.

What impact did you observe?


No, he's saying that you can't assume that "hardware" can improve your algorithm when you're following a strategy that is fundamentally wrong.

"The devil is in the details".


Multiplying two 32-bit integers will always cost a certain amount of energy. Building specialized hardware can get rid of other costs, like the cost of decoding instructions, but the fundamental cost of the multiplication itself will never go away.

If your algorithm requires a million 32-bit multiplications, that sets a firm lower bound on how costly it is. There is no way to magically perform all those multiplications for free.
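
As a rough illustration of the size of that floor (the per-multiply figure below is a commonly cited ballpark for a 32-bit integer multiply in ~45 nm silicon -- an assumption on my part, not a number from the article):

    #include <stdio.h>

    int main(void) {
        const double n_mults    = 1.0e6;    /* a million 32-bit multiplies       */
        const double j_per_mult = 3.0e-12;  /* assumed ~3 pJ per 32-bit multiply */
        printf("~%.1e J just for the arithmetic\n", n_mults * j_per_mult);
        return 0;
    }

Specialized hardware strips away the instruction fetch/decode and data movement around each multiply, which in a general-purpose core usually dwarfs the multiply itself, but those few microjoules of arithmetic stay.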


Yes and no. Yes, any given multiplier topology will have some fundamental costs, but the lower bound on the energy for a particular set of multiplicands, latency requirements, and precision requirements will often be much lower than the cost of a canonical multiplier. If you can't make assumptions about any of these (e.g., if it's a multiplier in a general-purpose core), the costs are more uniform, though unless leakage dominates, the energy will still be value-dependent, since dynamic energy is proportional to the activity factor. In a domain-specific context, though, you often can make those assumptions; they are borne out in the datapath units of nearly any DSP, GPU, or ASIC.
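
A crude way to see the value dependence: dynamic energy scales with how many nodes actually toggle, so counting bit flips between successive operands is a first-order proxy for the activity factor. A toy model (my own, not from the thread), nothing like a gate-level simulation:

    #include <stdint.h>

    /* First-order proxy for switching activity: how many input bits toggled
       between the previous pair of operands and the current pair. Dynamic
       energy goes roughly as alpha * C * V^2, so feeding a multiplier sparse
       or slowly varying data really is cheaper than feeding it random data.
       (__builtin_popcount is a GCC/Clang builtin.) */
    static int toggled_bits(uint32_t prev_a, uint32_t prev_b,
                            uint32_t a, uint32_t b) {
        return __builtin_popcount(prev_a ^ a) + __builtin_popcount(prev_b ^ b);
    }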



