The "C is Efficient" Language Fallacy

tptacek · on May 4, 2009

This argument is as old as the hills. It's probably true in a lot of niche situations. But the fact is, most C code is faster than higher-level language code, because:

* C programmers have more freedom to arrange data in memory to exploit locality

* C data structures need less bookkeeping

* C programs manage memory manually, and so lack GC overhead

* C programs can easily swap in different allocators for different work sets

* C function calls are usually wired into the code, not indirected through (several layers of) tables.

That's to say nothing of bytecode interpretation overhead, which is probably a straw-man argument.

I buy the instruction scheduling argument for inner-loop numerical and vector scenarios, but even in performant code, that's usually less than 10% of the total, and from what I've seen, both C and HLL code tends to delegate that to machine-specific assembly libraries.

baguasquirrel · on May 4, 2009

I have a friend at Intel who works on their realtime raytracing efforts and he swears by C/C++. Indeed, his arguments mirrors yours and go even farther. There's all sorts of magic you can play when you've got access to the actual bits and bytes. For starters, they do things like stash data in the low-precision bits of floats and the lower three bits of pointers. I wholeheartedly concur with his opinion that in raw performance terms, C/C++ is the fastest.

The problem with his argument and yours is that C is NOT faster for The Rest of Us because it it amounts to premature optimization just by virtue of using it. And because it takes so much effort to get C code to just work, the WHOLE program ends up being slower than the corresponding code in Haskell, OCaml, or perhaps even CLisp. Having worked my last job in a company where the programmers preferred C to C++ and C++ to Python, it startles me how utterly slow our system was. And ditto for the previous job.

At my first place, we had a streaming video server that had blindingly fast compression routines, but it was all held up by a UI loop that harked from the prototype days. Thus the argument that C/C++ is/are the best languages for fast code is moot. It is only the fastest if you've all the time in the world to write it.

tptacek · on May 4, 2009

"Premature optimization" is another hoary old argument that gets dragged into this debate every time it comes up. But I don't think it's valid.

C programs start up faster than HLL programs. They run faster. They consume less memory. They tend to be more responsive.

From "looking up an object by its string name" to "splitting a string on the comma character" to "running the following functions on a 10hz timer", simple operations that all programs do have less overhead in C than they do in many high level languages, because the C program isn't creating and disposing of hundreds of little objects every time it does one of those things.

Hey --- I write most of my code in Ruby. I'm not saying you should use C. But you shouldn't make up bad reasons not to use it when there are so many good reasons to cite.

lallysingh · on May 4, 2009

Small C programs start up faster than HLL programs. Small C programs run faster.

When they get bigger, they quickly bloat up to compensate for how hard C makes it to write large-scale software. Oh, and it takes a lot longer to modify those C programs, for the same reasons.

C's malloc() time is also a lot slower than GC'd languages. For them, a memory allocation isn't much more than a subtraction.

tptacek · on May 4, 2009

First, cite sources for specific cases where a mainstream malloc() is "a lot slower" than a specific GC'd allocation in a mainstream HLL. This rings more truthy than true to me.

Second, fixing malloc slowness is among the easiest and fastest optimizations you can make in a C program (in most cases, a pool and freelist will get you 90% of the way there), and no GC'd allocator is faster than pool and arena allocation (which is usually just a couple ALU operations in both the alloc and free cases).

Third, large C programs tend to start up faster than large HLL programs. For instance, Eclipse is slower to start for me than Firefox --- and if you're thinking of Firefox as your archetypal "big bloated C++ program", remember that Firefox is by design the most complicated application most people ever run.

To make that argument work, you have to find a HLL program of comparable complexity. You didn't do that; you just imply that one exists. I'm calling you out on that.

jjs · on May 4, 2009

> First, cite sources for specific cases where a mainstream malloc() is "a lot slower" than a specific GC'd allocation in a mainstream HLL. This rings more truthy than true to me.

> Second, fixing malloc slowness is among the easiest and fastest optimizations you can make in a C program (in most cases, a pool and freelist will get you 90% of the way there), and no GC'd allocator is faster than pool and arena allocation (which is usually just a couple ALU operations in both the alloc and free cases).

In a semispace or generational GC, allocating memory is just bumping a pointer.

tptacek · on May 4, 2009

You're not counting the cost of deallocation.

lallysingh · on May 5, 2009

Not an easy thing to count globally. Your GC cost (at least in mark-sweep) is dependent on the number of objects you're not deallocating. The rest are implicitly destroyed.

tptacek · on May 5, 2009

You can't decouple the cost of malloc() from the cost of free(); if malloc has to do any more work than simply bumping a pointer (or, in the general case, grabbing a mutex and then bumping a pointer), it's a concession to free() (or to defragmentation, a side effect of free).

jjs · on May 5, 2009

Semispace (and generational) collectors do decouple the cost of allocation from the cost of collection by only considering the set of live objects. They also compact (defragment) as a side effect of collection.

brl · on May 5, 2009

In the case of a minor collection, you don't do a full mark-sweep. You only look at a small fraction of the live objects. Generally you only need to trace objects in the youngest generation which are reachable from GC roots. Since you expect most young generation objects to be eligible for collection, this is extremely fast.

It is unusual for a lot of references to exist from older generations to the youngest generation, but when this occurs there are data structures for detecting and resolving these references very efficiently.

jjs · on May 5, 2009

In the collectors I mentioned, deallocation is a matter of moving the live objects to the other semispace, and considering the "current" semispace to be garbage.

If you're allocating a lot of short-lived objects, then explicitly keeping the live objects is much more efficient than explicitly culling the dead objects.

jdvh · on May 4, 2009

> just bumping a pointer

If only. You have to check for out-of-memory, and throw an exception if so. (This check can sometimes be optimized away, though). You also have to mark the size of the allocated block, so the garbage collector knows how many bytes to copy during the sweep phase.

Also, garbage collectors usually allocate from a shared memory pool, so every allocation also involves some mutex operations.

Allocation is bumping a pointer in theory only.

brl · on May 5, 2009

> You have to check for out-of-memory, and throw an exception if so.

You have to check if there is space for the allocation in the current space. This is two native instructions, a compare and a conditional branch. If the allocation fails, you don't throw an exception, you launch a minor collection.

> You also have to mark the size of the allocated block

Yes and no. When an object is initialized, type information is recorded that contains the size. I don't think that the Java VM explicitly stores the size of the allocation somewhere. It certainly doesn't need to since the same information is available by just looking at the object itself.

> Also, garbage collectors usually allocate from a shared memory pool, so every allocation also involves some mutex operations.

In the Java VM this problem is addressed by dividing the Eden space into chunks where each chunk is private to a single thread.

> Allocation is bumping a pointer in theory only.

An object allocation in Java is around 10 assembly instructions. More than just 'bumping a pointer' but not a lot more.

viraptor · on May 5, 2009

Actually firefox is a large hll program. It's a javascript / xul engine in a large part. The frontend of firefox is actually javascript AFAIK. Maybe someone has more precise information?

1 and 2 were already commented on, so I'll just add that, there was a proof that a copying gc with enough memory can be faster than manual allocation. http://www.cs.umass.edu/~emery/pubs/04-17.pdf

tptacek · on May 5, 2009

The paper you're citing doesn't contain the evidence you claim it has, and in GC vs. "manual" studies, you have to make sure they're not comparing GC to malloc(). Malloc has design goals other than speed. Performant code very often replaces malloc.

viraptor · on May 5, 2009

You're right - I made a stupid copy&paste mistake with both documents open :/

This is the link I was thinking about: http://www.cs.ucsb.edu/~grze/papers/gc/appel87garbage.pdf

tptacek · on May 6, 2009

Doesn't this paper basically say:

* Garbage collection is faster than manual frees when you provide processes with so much memory that collection happens so rarely that its cost is lost in the noise, and

* Garbage collection is faster than manual frees when you ignore all the constant factors in both collection and manually freeing, for instance the cost of taking page faults?

viraptor · on May 6, 2009

Yes - that's true.

But it also gives you additional information: how to check if the cost can be ignored and whether it will be lost in the noise. And that's important, because in serious VMs you can often control the GC parameters. That means you cam make sure that those asymptotes are as close to reality as possible for your exact problem.

It's the same story for hashed collections. You can find big O costs for their operations, but they're amortised costs. Sure - once in a while you have to realloc / rehash / copy the table depending on the implementation - but then you know how to judge if that collection is ok for you. You can modify the starting size exactly in order to lose the rare cost in the noise. The trick is knowing the limits.

Yet somehow many people who love hashmaps hate GC...

You're right about what the paper says, it just didn't spell out the reason why that info is useful ;)

brl · on May 5, 2009

Just from the title alone, I can tell I'm going to like this paper :)

stcredzero · on May 4, 2009

"Premature optimization" is just another way of saying "use the right tool for the job." Look at Mercurial. It's not quite as fast as Git, but it's just fine for its purpose. Most of it is written in Python, with some speed critical parts written in C. This strikes me as using the right tool for the right job at the right time. Write most of the app in something like Python. This makes it easier to write concise, understandable code that reads like pseudocode in a short time. When you have the system running correctly and get it factored in a way you find nice, then rewrite the speed-critical parts of it in C.

Having a running model of your system to profile is one of the best tools you can have for optimization.

tptacek · on May 4, 2009

As true as that is, it's orthogonal to the actual debate at hand. There are plenty of good reasons to pick Python, or Java, or Scala, over C/C++. But this article claims performance is one of those reasons, and then crafts a particularly leaky argument to carry that claim.

stcredzero · on May 4, 2009

I read the article more like: "Writing a program in C is no guarantee of having a fast program. Here are a few examples." (Of both places where C is not the best for speed, and other languages giving good results.)

I think your idea of "the actual debate at hand" is too narrowly focused. As a reader of HN, I'm generally interested in how to get things done in a reasonably efficient manner. I find that "get it running, profile it, understand it, then optimize" is a good thing to think about when discussing performance. It cuts through a lot of delusions people have about their ability to write fast code. (Especially when they try to do it up-front.)

vegai · on May 5, 2009

And notably, Mercurial written in python is much faster than Subversion written in C.

tptacek · on May 5, 2009

Apples and oranges much?

vegai · on May 7, 2009

Yes. Apples also can be compared to oranges by several different criteria.

vorador · on May 5, 2009

Remember that Mercurial is a dvcs.

stcredzero · on May 5, 2009

That's an example of how architecture trumps language.

anamax · on May 5, 2009

> C programs start up faster than HLL programs. They run faster. They consume less memory. They tend to be more responsive.

That's working C programs vs working HLL programs. And, by those metrics, assembly language programs are better still.

> Hey --- I write most of my code in Ruby.

Which suggests that performance of working programs is not the only important factor.

Yes, "premature optimization" is arguably the wrong term for making that suggestion, but it is quite valid.

DeCarabas · on May 5, 2009

The data in the article mentions that the startup time for the C, C++, and OCAML programs were in the noise. So there's at least one data point which suggests that HLL programs do not necessarily start slower than C programs.

stcredzero · on May 4, 2009

Right. C is a power tool. You can use it to create something blindingly fast. You can also use it to create something dog slow. I've seen Smalltalk programs outperform C programs. I know of an instance where a coworker benchmarked a C implementation of a block cipher vs. a Smalltalk implementation, and the Smalltalk ran 3% faster. Why? Naive manual memory management in the C implementation.

tptacek · on May 4, 2009

Another classic HLL vs. C argument: the anecdotal slow C program. Clearly, of these, there are many great examples. Unfortunately, it's a crappy argument, because there are also a zillion incredibly slow HLL programs.

noaharc · on May 4, 2009

I don't think stcredzero meant that as an argument that C is worse than HLL -- it was just a warning that, as with any tool, it can be improperly used.

stcredzero · on May 4, 2009

I'm certainly not against C. I've just used C to write an embedded controller for doing real-time mixing of PWM control signals and laser gyroscope data for a model tilt-wing aircraft. (Which should be able to convert between horizontal and vertical modes in-flight.) I wouldn't use Smalltalk for that. (Yet. If I had 10X more CPU to burn, maybe.) Likewise, I'm not going to write my next Commodities Trading app in C.

gaius · on May 5, 2009

That is a false dilemma tho'. No-one in the Python world has anything invested in a 100% Python solution to anything. Doing the compute-intensive bits in C is in fact expected and encouraged! The same is true in the Tcl camp. Maybe some HLL communities (Java?) like to be "pure" but I've not ever encountered that.

robin_reala · on May 5, 2009

No-one in the Python world has anything invested in a 100% Python solution to anything

PyPy? http://codespeak.net/pypy/dist/pypy/doc/

gaius · on May 6, 2009

That's not for production use, tho'.

wlievens · on May 5, 2009

Java communities typically favor portability (and easy deployment) over what you can gain in terms of speed. A lot of libraries/frameworks actually even advertise themselves as "100% pure java", whereas one might expect "expensive bits optimized in C" as attractive as well (but you rarely see it).

brl · on May 5, 2009

> whereas one might expect "expensive bits optimized in C"

There is no important difference in performance between C and Java according to http://shootout.alioth.debian.org/

wlievens · on May 6, 2009

That all depends on the kind of code, of course.

tptacek · on May 5, 2009

I agree. I like writing my goop in Ruby. But that's not what the article is about.

sfphotoarts · on May 5, 2009

I'd like to see the c code to support this, it's very hard to imagine anything you could do this wrong in something like a block cipher that would ever make it as slow as smalltalk. I'm sorry, I just don't buy this...

stcredzero · on May 5, 2009

The C DLL was one of RSA Data Securities reference implementations from about a decade back. The Smalltalk guy could ask the VM implementor for goodies -- like 32 bit and 64 bit bit-arrays and various primitive operations on them.

So long as you keep an entire algorithm in one method, and restrict yourself to certain optimized operations, the resulting JIT-ed code will look like it was produced by an unoptimized C compiler. There will be no Smalltalk message sends. (This approach is a bit fragile, though, since you have to have esoteric knowledge to maintain code like that and keep the same level of optimization. Most of the time, you'd just implement a speed-critical operation like this as a DLL in C to be called from Smalltalk. Smalltalk/X has a different approach -- you just inline C code in your Smalltalk method.)

Combine that with naive memory management in the C code, and you have C code that loses a benchmark. Mind you, that was bad C code -- reference code only has to be correct, not fast.

"Smalltalk is Slow" usually a sign of ignorance or trolling. For lots of problem domains, the advanced commercial Smalltalk VMs are pretty darn fast.

tptacek · on May 5, 2009

I'm not saying Smalltalk is slow, but I've worked with rsaref since the mid-90s, and I don't remember a block cipher routine in it that was horribly slow due to allocation.

stcredzero · on May 5, 2009

So instead of Highly Optimized Smalltalk (with VM implementor help) = Horrible C, we have Highly Optimized Smalltalk (with VM implementor help) = Average C.

I can live with that.

tptacek · on May 6, 2009

It is easy to sell me on Smalltalk vs. C. Just not on performance. =)

Oxryly · on May 4, 2009

> C programmers have more freedom to arrange data in memory to exploit locality

I think just about everyone here is underestimating the impact of this one "feature". Most optimization approaches are grounded in 1970s computer hardware, where CPU speed and memory speed were two aspects of the approach. That's not the case anymore... memory access times absolutely dominate on modern hardware. Just lookup the number of instruction cycles a cache miss takes for your favorite system. Its appalling.

C and C++ have been able to stay ahead of the curve because they allow strict control over memory layout. Ocaml is often brought up as a competitor, but storing a floating point value in ocaml requires a boxed pointer! Nevermind trying to make an aggregate structure that includes floats and other types bundled together... everything will get boxed and the cache never stands a chance.

C and C++ are the only fully current languages that let you bundle your data exactly as you need it in as small space and in appropriately sized chunks such that memory access doesn't grind your program down. Of course you can fail to take advantage of this ability and write slow code in C or C++; in which case you'll match the benchmarks for your other favorite languages and maybe make a blog post about it. That'd be missing the point though...

DanielBMarkham · on May 4, 2009

I agree.

People don't realize what a power tool C/C++ is because they haven't had it out for a spin at that level. If you've done serious API work, and wondered why you're busy byte-packing, it's because there's some highly optimized code somewhere you're feeding. You can allocate a big honking chunk of memory and create your own world in there. Boxing, for all its goodness, is a performance killer.

silentbicycle · on May 4, 2009

While I don't have enough experience in really low-level tuning, I know that sometimes people get so focused on optimizing their current implementation that they fail to see that their overall design has trapped them in a local maxima. Problems like inappropriate algorithms are usually more obvious in higher level languages, because there's less detail obscuring them. That stuff needs to be right before getting to fine tuning. (Prototyping in another language first helps.)

C and C++ are quite fast locally, but sometimes that works against being fast overall.

ramchip · on May 4, 2009

DSP is one domain where performance of a tiny part of the code (say 1%) is far, far more important than everything else. If you need to apply a filter to remove high frequencies, or if you need a Fourier transform, no high-level tuning or special algorithm will cut it: you need the filter or the transform, even if it's very costly. The bottleneck is real and unavoidable.

I work in data acquisition and in this domain C/C++ is king. The raw number-crunching is essential (for our applications anyway), and any bonus that can be had from SSE operations, cache locality, loop unrolling, software pipelining, etc. can have a major impact on the overall performance. Assembly is still common on DSP processors and the like (they typically provide a special instruction set). Even in C++, we tend to use the bit-twiddling and pig disgusting optimization hacks that low-level access provides :)

Some people use higher level tools such as LabView but the performance is a huge step down. It's OK for prototyping and learning.

cturner · on May 5, 2009

In this thread people are closely tying C/C++. Is objective-c in the same league?

zio · on May 5, 2009

No. Objective C isn't performance oriented, and while it's fine for most desktop apps (!) it's not in the same league as C++. C++ can actually let you use higher level structures that are faster than in C or Objective C because of template meta programming, as shown in the Blitz template library for numerical computation.

http://www.oonumerics.org/blitz/ This is what Templates are made for. And this is what you can't do in something like Java or Objective C. Java Generics boiling down to type cast at run time to and from Object, it doesn't help performance a bit.

tptacek · on May 4, 2009

Maybe, but on the flip side, you could argue that a lot of high-level languages are trapped in hash tables, and are missing a lot of opportunities for fast tree-walking implementations.

It's tricky to really push on high-level languages because for every argument against one, there are 3 other languages that don't exhibit that problem. If there was a platonic ideal of a high-level language to argue from/with, this would be a more useful debate to have.

silentbicycle · on May 4, 2009

Indeed. OCaml is an interesting exception, though: it's very high-level, yet still quite fast. (The trade-off, if you can call it that, is that you have to work within its type system.) It's not necessarily as fast as (or faster than) C, but even relatively naive OCaml code is often only about 50% slower.

There's often a point of diminishing returns in spending time and effort optimizing C, and OCaml can get surprisingly close with much less work.

10ren · on May 4, 2009

The standard approach I heard was to code in a very high level language (like python) to get the algorithms right, then rewrite in C for performance. Python then becomes a sort drawing board as part of the design stage, instead of "coding".

russell · on May 4, 2009

Rewrite only those pieces that are consuming a lot of time or memory, leave the rest in Python, which is particularly good at integration with C/C++. Many programs are a lot of setup, error handling, and edge cases where performance is not an issue. but LOC is.

10ren · on May 5, 2009

I thought about including the idea of only rewriting the hotspots, but left it out as detracting from the main point and it seemed tedious to iterate the details. But then there are 3 replies pointing this out, and which seem to be more valued than mine by the community (according to their votes.)

Ask HN: Does this mean that HN would prefer I wrote comments that do cover all the cases, instead of just sticking to the key point? Or is it just that there seemed to be a gap in my comment, which people naturally wanted to cover?

Note: the replies add more than just "rewrite only hotspots" (like the above one detailing setup, errors, edge cases), and I certainly appreciate extra details being filled in. Is the valuable extra detail the reason for the extra votes? I just feel kind of annoyed that the replies seem to suggest I was stupid in not mentioning the hotspot idea. This has happened to me a few times now. Am I taking it too personally? Is it just a question of different opinions on what is important? Or is it just that I failed to communicate my decision, and then get annoyed when people point out the other branch of the decision?

There is a issue of preference here: I prefer to keep all the code in one language. One factor in this is that my projects are small, and for these, the overhead of switching languages and managing the different source isn't worth it. I'm not running up against efficiency problems either. I'm sure it's different for larger projects, especially multi-person ones, and particularly if the project covers different kinds of activities (for which different languages are suitable), and even more so if there's need to integrate with or reuse existing assets in different languages.

Thanks for any clarifications you may have. :-)

silentbicycle · on May 6, 2009

> I just feel kind of annoyed that the replies seem to suggest I was stupid in not mentioning the hotspot idea.

I didn't mean anything personal, and I don't read any of the other comments that way.

The other comments may have just been voted up by people who also like Lua or Mercurial, or something. I thought it was worth noting that Lua was explicitly designed for that style of development. (Lua's also my favorite language.)

10ren · on May 6, 2009

Thanks. Yes, that's true. I appreciate the information about Lua and an example of Mercurial - it was just the hotspot part. I guess the fairest guess is that a comment is voted up as a whole, based on all the things they add, not just the part that I happen to be concerned about. It seems a bit silly of me now, but I much appreciate your reply.

silentbicycle · on May 4, 2009

Lua's great for this, too (and rewriting just the few hotspots is usually enough, Lua's pretty fast). They designed it to be embedded in C from day one, so it's quite easy.

stcredzero · on May 4, 2009

Mercurial is a great example of this approach. Overall, Mercurial is comparable to Git. (Git does maintain an edge in speed, but they are still competitive.) Most of it is written in Python, though.

barrkel · on May 4, 2009

For me, this argument in favour of C is really missing the point. In the long run, maintenance costs dominate, and unsafe languages like C lead to programs that take more and more programmer time to maintain, and it becomes harder and harder to see the high-level structure and optimize that through refactoring, rather than the low-level bottlenecks that show up in profilers.

lacker · on May 5, 2009

Maintenance costs do not always dominate. If you are working on code that will run on many thousands of machines, then hardware and running-the-cluster costs dominate. Performance improvements that are small in terms of % can still be worth many programmer-months of effort, and it becomes optimal to use C or C++.

barrkel · on May 5, 2009

Of course it depends. If your code is going to last for 20+ years, though (like the code I work on), I can guarantee you that it becomes very difficult to see the forest for the trees.

zach · on May 4, 2009

> This argument is as old as the hills.

Of course. It's the "disproving a generalization from its exceptions" argument fallacy.

jrockway · on May 5, 2009

C programs manage memory manually, and so lack GC overhead

At this point, GC is often faster than manual allocation. (Due to being able to allocate or deallocate a bunch of things at once, rather than having to allocate/free memory whenever the programmer says to.)

tptacek · on May 5, 2009

You too are committing the fallacy of assuming that "managing memory manually" means using malloc() and free(). Fast programs don't use malloc() directly. Hell, even apr programs have standardized on this optimization.

Many GC schemes are probably faster than malloc. But it's a much less credible argument to say that you have a GC that is faster than a custom allocator tuned to a workset. You probably don't.

JulianMorrison · on May 5, 2009

Naive use of the GC is probably faster than any off-the-shelf manual memory management strategy. How many programmer hours are you prepared to pay for to get the last 1% of speed?

tptacek · on May 5, 2009

How many programmer hours do you think it takes to allocate 12-byte objects out a pool of a round number of 12-byte objects with a freelist or a bitmap?

GC faster than off-the-shelf allocator? Probably.

GC faster than off-the-shelf strategy? Doubtful.

And "remaining 1% of performance"? How disingenuous is that? Allocator optimizations have sped up commercial software I've shipped by over 200% in the past. Not sped up the allocator; sped up the program.

vegai · on May 5, 2009

A competent C programmer can clearly make a good, performant program in 10klines. A genius programmer, maybe 100klines.

There's something beyond those figures, though. And it gets worse when a team is involved.

The higher the level of the language rises, the less it allows making bad programs, however big the program is.

neilc · on May 6, 2009

> The higher the level of the language rises, the less it allows making bad programs

I think it's more complicated than that. Ruby is higher level than Java, but I would argue that Java makes it harder to write bad programs, especially in a team setting.

_csoo · on May 4, 2009

What about Forth? As far as I know it can do all the stuff that C can do plus some more nifty things.

jongraehl · on May 6, 2009

Forth is ridiculously simple to implement. But there's no reason to use it, beyond that. It's less expressive than C but not faster.

chancho · on May 4, 2009

Fortran's (alleged) dominance in scientific is probably attributable to tradition (they still teach it to undergrads in non-CS departments) but the native multidimensional arrays have a much bigger impact than aliasing. In Fortran you just index your array like A(i,j,k) and the compiler will compute (and optimize) the addressing for you. In C, a typical (non-computer) scientist who doesn't really focus on mundane shit like this will end up writing something like

    for ( int i=0; i<ni; ++i )
    for ( int j=0; j<nj; ++j )
    for ( int k=0; k<nk; ++k ) {
       a[ k*ni*nj + j*ni + i ] = ...
    }

which sucks. Optimizing multidimensional array access (which characterizes most of scientific computing) is much easier for a Fortran compiler.

Aron · on May 4, 2009

This looks like a lot of calcs for the inner loop but is quite easily optimized. The compiler knows that (j * ni + i) is constant in the k loop and that ni * nj is constant. Check the assembly output. But it is probably better to reorder the loops and traverse linearly through memory so that each cache line brought down is fully consumed in order.

scott_s · on May 5, 2009

The point of that code example is not the calculations, but accessing memory in a cache-friendly way.

chancho · on May 5, 2009

Well, yeah, but since Fortran stores things in column-major that's not really an issue. My broader point was that all of these details are hidden from programmers, so there's less for the programmer to screw up and more room for the compiler to work within.

joeyo · on May 5, 2009

AFAIK, if you want to work with BLAS, LAPACK and friends you have to structure your matrices like this (as a big block of data) rather than pointers to pointers to pointers (ie a[i][j][k]). So even if there is some inefficiency in filling the data structure, presumably the actual matrix multiply or SVD or what-have-you, is actually quite fast.

ezy · on May 4, 2009

This is what people mean when they say pointers may be able to look like arrays, but arrays are not necessarily pointers.

Consider, "float a[X][Y][Z]" and "a[i][j][k]".

ramchip · on May 5, 2009

I don't know how widespread this is, but my uni teaches Python to the physics (and AFAIK most science) undergrads. Some also do C, but no Fortran.

10ren · on May 4, 2009

Java: 1 minute 20 seconds.

About a year later, testing a new JIT for Java, the Java time was down to 0.7 seconds

I've been surprised at the speed of Java recently. I wonder how much improvement is left in dynamic compilation.

The HP project Dynamo was an experimental JIT compiler where the bytecode format and the machine code format were of the same type; the system turned HPA-8000 machine code into HPA-8000 machine code. Counterintuitively, this resulted in speed ups, in some cases of 30% since doing this permitted optimisations at the machine code level. For example inlining code for better cache usage and optimizations of calls to dynamic libraries and many other run-time optimizations which conventional compilers are not able to attempt. http://en.wikipedia.org/wiki/Just-in-time_compilation

alecco · on May 4, 2009

Time-space trade-off. Meaning Java uses a lot of memory for almost anything. Speed isn't everything.

michaelneale · on May 5, 2009

Sometimes memory can cost speed as well.

I find with java essentially you are amortising gains which are payed back with GC at a later date (in many cases I guess it is worth it).

Its not one size fits all !

pohl · on May 6, 2009

This sounds like it had less to do with Java and more to do with a large number of short-lived allocations in your design. A garbage collector can be abused in any language that has one.

kmavm · on May 5, 2009

I worked on VMware's virtual machine monitor from 2000 until June. Overcoming customers' performance fears was a major impediment in the early years. Every fraction of a percent we gave up relative to native ruled out whole classes of applications. Nothing other than C would have been conceivable. Even C++ would have been wildly inappropriate, as it is for most kernels, because so much invisible code can hide behind a close-brace.

The mapping from source to machine representation in C is relatively trivial, which is the source of all C's ups and downs. If you ever plan to count microseconds, cache misses, TLB misses, mispredicted branches, etc., you had better start with a toolchain whose machine-level output is grokable from the source.

nickpp · on May 4, 2009

Read the comments: turns out the author was ignorant of C++ templates (including Blitz++ scientific computing library) and he was lumping C and C++ together in his "benchmarks".

It always annoys me when clueless people judge a language they don't even understand.

Repeat after me: there is no such thing as C/C++.

sundarurfriend · on May 6, 2009

> turns out the author was ignorant of C++ templates

I believe you actually meant 'ignorant of C++ template metaprogramming techniques'. The author seems well aware of C++ templates and even says:

>> the thing I coded immediately before the Stellation experiments was a very hairy template analysis for a C++ compiler

>he was lumping C and C++ together in his "benchmarks".

I couldn't see where, could you point out which comment leads to that inference?

makecheck · on May 5, 2009

Always remember that the total time between you having a problem, and you achieving results, includes the coding (and recoding) time, and the run time. There's a tendency for people to ignore "slow" languages because they focus only on the runtime.

I am well aware that there are good reasons to optimize things in languages like C (and I use them), but consider...

If I take several extra weeks to code, debug and test a C solution, and I could have had a script done much sooner, then my results were not faster overall. Why? Well, the script could be slow as dirt, but if it has a few extra weeks to churn through data and produce results, it may be done before the C program is even ready.

It's also important to remember that not all bugs are in software. Suppose I was looking at an entire problem in the wrong way, and this wasn't apparent until I started seeing results? In that case, my earlier start with a "slow" program meant that this mistake was found much sooner, so the script can be thrown out and redone, producing correct results with not much of a time penalty.

boryas · on May 5, 2009

Also, thought I would mention the fact that if you take a day to write a slow program that takes a week to run, that's cheaper than spending a week writing a fast program that finishes in a day. After all, your time is much more expensive than the computer's!

fauigerzigerk · on May 5, 2009

You're both making a pretty odd assumption here, which is that a program typically runs exactly once and that I am the only user of my own program.

Users' time is also much more expensive than the computer's. That's why we write software in the first place.

makecheck · on May 5, 2009

No, I expect my scripts to run for a long time with many users. This doesn't preclude optimization; some of it is automatic (new hardware, interpreter library improvements), some of it is well established (using SWIG and C to replace only a tiny piece of the program that must be faster).

In some respects, having long-lived software with lots of users makes speed the least of my concerns, because they're always asking for new features, and those are relatively easy to add to scripts.

And the relationship between software speed and productivity isn't linear, because people multitask. If a program takes 10 seconds to run, I might sit and wait for it to complete, without doing anything else. Whereas, if the program takes a minute, I may decide to switch to another quick task, and then return to see results. In this case, both tasks needed to be done, one took longer but it ate up the "slow" runtime of the program, and was only parallelized because of that long runtime.

fauigerzigerk · on May 6, 2009

I was referring to this statement: "Well, the script could be slow as dirt, but if it has a few extra weeks to churn through data and produce results, it may be done before the C program is even ready."

The comparison of development times and running times simply makes no sense if you assume the script is going to run a 1000 times. I agree that this relationship isn't linear. That's exactly why it's pointless to compare the the two numbers as if it were. The only number that is comparable is probably the profit you make in each case.

boryas · on May 8, 2009

Yes, good point. I guess he was referring more to the scientific end of things, where you write a program that runs for days on massive data, and there really isn't a user

AlisdairO · on May 4, 2009

Much as I take his overall point - that a close to the metal language is not always best for performance - the author would be well advised to take a look at the C99 restrict keyword.

setjmp · on May 5, 2009

Just looked it up... looks like you're right. The whole blog post is nonsense.

habibur · on May 5, 2009

Just asking, does currently available compilers support restrict? (or C99 standards)?

I use msvc mostly and gcc sometimes.

AlisdairO · on May 5, 2009

gcc certainly does (use -std=c99). If you don't want to use C99, use __restrict__ to use it as a gcc extension. This is C-specific - I don't believe a similar standard has worked its way into C++ yet, although compilers may have extensions that support it. I don't know about msvc, I'm afraid.

GCC's C99 implementation is mostly complete - you can find out more here: http://gcc.gnu.org/c99status.html

aplusbi · on May 4, 2009

I think a large part of the "C is Efficient" fallacy/truth is that C is more predictable with its assembly output than HLLs. Since ultimately all programs are converted to machine instructions at some point, you will always be able to write machine instructions that are at least as efficient.

The test lies in the ability of the programmer to do so.

madair · on May 4, 2009

Reminds me of The Practice Of Programming, by Kernighan and Pike, a truly excellent book. They cover some of the bum steers of C efficiency quite well there. A classic.

likpok · on May 4, 2009

One issue that I see is that they only (seem) to compare with gcc, which is not particularly good. It would be better to compare against something like icc, which has better register coloring, SIMD support, etc.

This might redeem C a little.

However, the real reason C will not go away any time soon is that there is no replacement for low-level software yet. Nothing eles has quite the same minimal dependencies.

chancho · on May 4, 2009

GCC C vs Intel Fortran : http://shootout.alioth.debian.org/u64/benchmark.php?test=all...

Intel C vs Intel Fortran : http://shootout.alioth.debian.org/gp4/benchmark.php?test=all...

(I'm not even going to link to the GCC Fortran benchmarks. They're embarrassing.)

C is no slower than Fortran on any of those benchmarks, and on some it cleans Fortran's clock. The aliasing issue is the only thing Fortran has going in its favor, but clearly its not ubiquitous. The n-body benchmark, for instance, is fairly typical numerical code. You might even think, since it's simultaneously reading and writing through multiple pointers of the same type, that aliasing is an issue, but its not. And in the rare case that it becomes an issue, there's compiler hints (e.g. C99 restrict) for that.

Picking Fortran over C solely because of aliasing worries is premature optimization of the worst kind.

likpok · on May 4, 2009

This matches with my experiences with ICC doing kernel development. It would vectorize loops and break out the SIMD instructions where gcc would not.

It was quite strange the first time looking through the objdump seeing things like punpwlkd and xmm.

And then discovering what -fast would do to things (it makes icc look at your whole program to optimize, so it does things like ignore CDECL and uses whatever registers it can.

Oxryly · on May 4, 2009

For the given example code the answer is to use the underlying SIMD types and intrinsic instructions. If you're on Intel or PowerPC then the SSE or Altivec registers and instructions are available. Use them and you'll beat any compiler optimization every time. And, most importantly, the chances that up-to-date SIMD types and intrinsics will be available in any language but C/C++ is vanishingly small. Java, Ocaml, Haskell... you name it, you can't properly use SIMD (AFAIK... please let me know if there are exceptions).

And if you're application doesn't fit the native SIMD properly, there's no chance the compiler can really do anything meaningful with it anyway.

jrockway · on May 5, 2009

Some quick googling suggests you are wrong:

http://www.cas.mcmaster.ca/~kahl/Publications/TR/Anand-Kahl-...

http://wwwlasmea.univ-bpclermont.fr/Personnel/Jocelyn.Serot/...

http://tirania.org/blog/archive/2008/Nov-03.html

etc.

Oxryly · on May 5, 2009

I would suggest taking a closer look at those links. They bring up another point in that you'll only find current, usable, robust support for SIMD in C++:

The first link is just a paper; the second is a 1.0 release that is seven years old, it only supports SSE, and it's all in French; the third is only for Mono, it only supports SSE and its SSE support is old and incomplete.

On the other hand, if you try to use SIMD types and intrinsics in C++ you'll find current and comprehensive support from the major compilers on all SIMD platforms.

(I'd love to use a current and comprehensive version of Haskell SIMD, but its just not ready for prime time.)

jrockway · on May 5, 2009

(I'd love to use a current and comprehensive version of Haskell SIMD, but its just not ready for prime time.)

Since you're obviously an expert in the area, why aren't you helping make it ready for prime time?

scott_s · on May 5, 2009

That's really not fair. I know of many things in and outside of my area of research that need improvement, as does any researcher. But we can only work on one thing at a time.

janitha · on May 4, 2009

I disagree with some of the Authors points.

The language is a tool, the efficiency of it really depends on how the programmer who designs and implements a program. You can have a horrible coder write something in C that would be very slow and inefficient, and given the same problem to good programmer and you can arrive at a faster and more efficient result using a bash script.

The reason why lot of people (including my self) believe that C/C++ is a high performance language is not because it fast for all applications, but gives the programmer more control instead of leaving all the fine details to the compiler to second guess. (@tptacek's reasons are perfect for this).

AlisdairO · on May 4, 2009

...except that in this case, when you take full control of the memory management, you slow things down. The problem with using pointers (without using the restrict keyword) is that it can seriously impede out of order processing on modern CPUs. Now, the author has missed the existence of the restrict keyword in C99, but in the absence of that his point is good.

wicknicks · on May 4, 2009

The arguments about language efficiency on a single processor machine are probably outdated today. Most machines have multiple cores. We need good tools which can exploit this CPU architecture. Languages like C/C++ place a large amount of responsibility on the shoulders of a programmer. Effectively, you are writing two programs - one for the task at hand, and the other is memory allocation for it. Control does not always give speedup, also it is not always a boon. It implies managing the different resources on your own. In a large scale application, this is not a great idea. GC is a good alternative over Malloc and Free'ing when the size of the application gets huge.

Also, the scalability of a single system is limited. If you really need extra speed, I think going parallel is the key. Most numerical methods are parallelizable. And C/C++ were not created keeping this mind. It can be a programmer's nightmare to debug multiple threads with memory leaks. Languages like Java do make this task easier using programming paradigms like Map/Reduce.

lacker · on May 5, 2009

It doesn't make sense to use mapreduce as an example of Java making parallelism easier, given that the original Google MapReduce is a C++ framework.

e4m · on May 6, 2009

Boost has portable threads for C++ right now... in fact, the next C++ release (due out this year) will include threads by default. http://en.wikipedia.org/wiki/C%2B%2B0x#Threading_facilities

jamii · on May 5, 2009

The debate seems to be between trusting your compiler to generate good code or writing close-to-metal code yourself.

There might be a third option though - code generation in a HLL. Coconut is a nice example of this:

http://www.youtube.com/watch?gl=GB&hl=en-GB&v=yHd0u6...

Instead the compiler being a black box, Coconut is structured as a set of libraries for code generation, analysis and optimisation. They report outperforming c SIMD by up to 4x on the cell architecture.

bradgessler · on May 5, 2009

Where is the "Assembly is Efficient" Language Fallacy article?

ddg · on May 5, 2009

"In C and C++, there's no such thing as an array - there's just pointers, which you can subscript and a shorthand for pointer arithmetic and indirection(x[n] in C/C++ is the same thing as (x+n).)"

Please remember that "char x[N];" and "char x = malloc(N);" are NOT the same. (Not sure if this is news to anyone, but when I was learning C reading that would have made me think otherwise).

wlievens · on May 5, 2009

He's absolutely right about pointer aliasing. I work on a DSP compiler for architectures strongly geared towards instruction-level parallelism, and when the compiler occurs pointers that may alias (but probably won't, but we can't tell because of the language's liberal use of pointers), many optimizations have to be a lot more conservative.

ilitirit · on May 5, 2009

C is one of the fastest languages and most space-efficient languages out of the box. Other languages get their speed by resource trading (memory, space, time). You could arguably gain these sort of speed ups in C if you invested enough time in writing a VM or JIT-ing/dynamic compiler to host your C program.

illumen · on May 5, 2009

C works in CUDA and OpenCL - which for scientific programming is the fastest thing there is.

The GPU is used with C.

C isn't faster than hand coded assembler on cpus... but is pretty damn quick.

There are specialist compilers for C, like vector C etc.

anyway... whatever. back to typing text into a file now.

malkia · on May 5, 2009

CUDA and OpenCL have C syntax, but they are not C.

oomkiller · on May 5, 2009

This is an interesting article, but I believe that the benchmarks should be updated since a ton of progress has been made in compilers and how they handle aliasing.

jgrant27 · on May 4, 2009

http://news.ycombinator.com/item?id=591897

ramoq · on May 4, 2009

I remember reading this article years ago :)

Allocator2008 · on May 5, 2009

Regarding the discussion of "real" arrays in Fortran vs. "pointer arrays" in c, it seems in the real world often we need resizable arrays, which means the only way to do that is to have pointer arrays. A Fortran array cannot be resized. What if I want to read in a tab delimted file that has ints, and read that into a 2-D matrix, where the matrix rows correspond to the file rows, and the matrix columns correspond to the tab delimited columns in the file. And I have say no idea how big the file is. So the only way I can do this is to have a resizable array. I can "guess" that say my rows/columns won't be more than some arbitrarily large number, but then I likely end up wasting a lot of space. So out of curiousity how would I even do that with "real arrays"? Seems like if I ever need to resize an array, even in just as simple an example as reading in a file, then I need a pointer array, since "real" (fortran-like) arrays will not resize. Am I missing something here?

c00p3r · on May 5, 2009

In most cases manually cooked food is much cheaper and more healthy than purchased one, if you can cook. The same is correct for C/C++ programming in terms of execution speed and resource usage if you have knowledge and experience.

btw, JVM is mere a c++ code.