How Much Processing Power Does it Take to be Fast? (dadgum.com)
49 points by blasdel on April 24, 2010 | 10 comments



This is really a fantastically important principle. When we put into perspective what modern machines should be capable of, it's really quite disappointing what they end up doing.

Recently I was arguing with one of my company's engineers when I noticed that selecting a few tens of thousands of things on-screen took 1-2 seconds. He responded that there's a lot going on there. I responded that I'm selecting a few tens of thousands of things on-screen on a machine capable of billions of CPU operations per second, and trillions if I count the GPU. What I'm doing isn't even a rounding error compared against those staggering numbers. It should be so fast that those things are selected before I even move my finger off the mouse button. But it's not: I can actually count seconds before something happens, and I know that we aren't doing billions of things in the meantime.

All this talk of canvas demos and browser apps fails to bring into focus what we are actually fawning over: simple graphics and interactivity in a postcard-sized rendering window, running on a multi-core, multi-gigahertz machine with assisting co-processors for everything from floating point to window rendering, doing things we were all doing on machines with dozens of MHz not even 15 years ago. That alone shows how tall the software stack between the hardware and the user has become.

It's not that my engineer's code was bad (far from it); it's just that it was lazy. He was relying on libraries built on top of libraries built on top of libraries, on down the line. Basically it's turtles all the way down, and we're all building on top of that giant turtle pile when we could be tearing away at relativistic speeds around a pulsar.

I actually get a little depressed when I think about it.


I've been wondering if anyone has actually checked - is the performance loss really just because of all the layers? Is there a straightforward alternative?


The obvious alternative is to not use all the layers, but that is anything but straightforward.

One way the slowness of libraries becomes obvious is during optimisation. Memory copying is a simple example: the C library memcpy has to work in the general case, and afaik it typically just loops over bytes and copies them one by one. That is probably optimal if you are copying a small number of bytes, like 2 or 3 (probably a common case), but on modern CPUs you can get substantial speed-ups by writing your own partially unrolled loop that copies 4 bytes at a time, or even more if you are willing to write assembler, where you can copy 16 bytes at a time with non-temporal cache hints. Think about how many routines copy memory around by using this one library call... and this is just one example. In an actual software-rendering use case I used this to copy a 320x240 framebuffer, and my final, assembler-optimised version was a good 15% faster than memcpy.
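As a rough, hypothetical sketch of the kind of partially unrolled word copy described above (assuming non-overlapping, 4-byte-aligned buffers; the function name is made up, and a real memcpy implementation may well be cleverer than this):

    #include <stddef.h>
    #include <stdint.h>

    /* Copy n bytes a 32-bit word at a time, partially unrolled so each
       iteration moves four words (16 bytes), with a byte loop for the
       tail. Assumes dst/src do not overlap and are 4-byte aligned. */
    static void copy32(void *dst, const void *src, size_t n)
    {
        uint32_t *d = dst;
        const uint32_t *s = src;
        size_t words = n / 4;

        while (words >= 4) {
            d[0] = s[0]; d[1] = s[1]; d[2] = s[2]; d[3] = s[3];
            d += 4; s += 4; words -= 4;
        }
        while (words--)
            *d++ = *s++;

        /* Copy the remaining 0-3 bytes. */
        unsigned char *db = (unsigned char *)d;
        const unsigned char *sb = (const unsigned char *)s;
        for (n &= 3; n; n--)
            *db++ = *sb++;
    }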

The problem is that libraries are convenient, but they have to work in a large number of cases, which may prevent them from using the algorithm that is optimal for your problem. Even just being in a library imposes a small slowdown, because calls can't be inlined; e.g. the C standard library math functions can be optimised just by writing equivalents that can be inlined. The gain is small per call, but it still exists.
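A minimal, hypothetical sketch of that inlining point (not how any particular libm is written): a header-defined equivalent of a trivial math routine can be folded into the caller, whereas an out-of-line library call generally cannot, unless the compiler already treats it as a builtin.

    /* Header-defined equivalent the compiler can inline at every call
       site, avoiding the call overhead of an out-of-line libm fabs()
       (many compilers do treat fabs as a builtin anyway). */
    static inline double my_fabs(double x)
    {
        return x < 0.0 ? -x : x;
    }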

I'm not 100% sure, but the C library math functions may also have to work around limitations of the FPU. The fsin instruction, for instance, fails for values with magnitude 2^63 or more (iirc), so the library function may do expensive argument reduction to handle that, in which case the gain from using a single fsin instruction can be significant, perhaps more than twice as fast as the equivalent C library function.
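A sketch of what calling the instruction directly might look like, assuming x86 and GCC-style inline assembly (the wrapper name is made up, and it deliberately skips the argument reduction that libm performs):

    /* Evaluate sin with a single x87 fsin instruction, skipping libm's
       argument reduction. Only valid for |x| < 2^63; beyond that fsin
       sets the C2 flag and leaves the operand unchanged. */
    static inline double fsin_raw(double x)
    {
        __asm__ ("fsin" : "+t" (x));
        return x;
    }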

Some of this is the rationale behind my FridgeScript language (which tries to be fast at floating-point ops); it is measurably faster than the MS C++ compiler provided that the code is clean (FridgeScript does, to a very good approximation, no optimisation, so things like foo+1+bar+1 mostly end up as three additions instead of two).


Thanks. I'm still not sure where the thousands-fold speed advantage over old processors is going, though. 15% doesn't account for it unless there are a lot of layers stacked up, 50 or so, no?

http://www.google.com/search?hl=en&q=1.15^50
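(1.15^50 ≈ 1084, i.e. fifty layers each costing 15% would indeed compound to roughly a thousandfold slowdown.)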


K (http://kx.com) seems to be, but unfortunately it's closed-source and quite expensive.

From my (limited) experience with it, it's quite fast, especially for an interpreted language. Its primitives are extremely cache-friendly and data-parallel, which helps performance immensely. Its design also cuts away layers and layers of typical boilerplate.


Yeah, but back then it took one guy 6 months of bit fiddling in assembly/C to develop that kind of game. Now we have art departments, storyboard artists, script editors, a legion of QA, and of course the developers. Any time a development process gets that layered, the performance-at-all-costs people eventually get pushed to the side. Although I do agree with the spirit of the article.


This. The performance improvements in modern hardware are primarily beneficial to developers, not users. Developers have always aimed for "fast enough," and the less time and effort it takes to reach that goal, the more functionality can be added in the same number of man hours.

The problem is that people have different definitions of what "fast enough" means, and edge cases might not meet anyone's definition of fast enough if they just weren't considered.


I suppose it is not entirely surprising, since modern programming has drifted away from its most fundamental purpose -- processing data in a certain amount of space and time.

Look at common programming languages: the control of how much storage space is used is in the background, and the control of time taken is non-existent. The dominant, and almost only, concern is representational structure -- how the software is understandable and manipulable.

This is reasonable -- representation is important and comparatively valuable -- but the riches of modern hardware speed have brought a kind of decadence to programming.


Programmers get first dibs at surplus computing resources and they rarely leave much behind for users. It's not really fair.

Games are the exception.


Games are not the only exception. Pretty much anything soft-realtime that needs to do elaborate graphics (3D rendering, etc.) is pushed towards similar trade-offs.



