When Haskell is Faster than C (paulspontifications.blogspot.com)
93 points by metajack on Jan 18, 2013 | 115 comments



These "faster than C" claims are almost always embarrassing (usually involving C code that would easily win if it were as aggressively optimized as the high-level language) but that's almost not the point.

The real point is the larger narrative. The subtext of these posts is what we are really arguing about. So let's just duke that out directly.

High-level language fans have a point, which is that high-level languages sometimes offer a better overall "bang for the buck" in developer time, and that sometimes they can be pretty fast (possibly even out-performing an un-optimized C program). Reasonable C guys aren't arguing against this. We're certainly not arguing that people should use C for everything.

But here's what high-level language fans have to understand. First of all, you depend on us. Your language runtime is (very likely) implemented in our language (possibly with a little assembly thrown in). So as much as you may like your language, it certainly does not obsolete C. C guys like me get cranky when high-level language fans imply that it does.

Second of all, a C+ASM approach will always win eventually, given enough time invested. That is because a C+ASM programmer has at his/her disposal literally every possible optimization technique that is implementable on that CPU, with no language-imposed overhead. What this means is that a higher-level language being "faster than C" is just a local maximum; the global maximum is that C is faster.

Yes, it's absolutely true that in limited development timeframes a higher-level language might still be the right choice, and in rare cases might even have better performance. But for long-term projects that want the absolute best performance, C (or C++) is still the only choice. (But maybe Rust someday).


Just as you're not arguing to use C for everything, people liking high-level languages aren't arguing you should never use C (or, at the very least, a similarly low-level language).

Rather, the point is that using C is very often a premature optimization. It also comes with a sacrifice in terms of productivity, maintainability, safety, testability and readability.

So you should certainly use C. Sometimes. And the question to ask is always: "why C?" and never "why not C?".

Also, you can get quite far by doing something like generating C from a DSL embedded in OCaml or Haskell, giving you both powerful type-level features and many of the advantages of a high-level language without sacrificing low-level control of C. You could use something like CIL[1] to avoid many of C's usual pitfalls. I haven't tried this myself, but I know some people happy with that approach.

[1]: http://www.cs.berkeley.edu/~necula/cil/index.html


> people liking high-level languages aren't arguing you should never use C

Some do; the most frustrating such conversation I can remember was this one: http://news.ycombinator.com/item?id=2537155

This person argued that the only way to make progress in systems programming is to ditch C as the basis of the system and linking model, in favor of something more constrained. That's the kind of talk that gets me cranky. It's precisely because C and assembly are so unconstrained that such a wide variety of programming paradigms can all be implemented efficiently and run under a single OS, and that we can so easily experiment with brand new paradigms without having to get buy-in from the VM gatekeepers.


"Second of all, a C+ASM approach will always win eventually, given enough time invested. That is because a C+ASM programmer has at his/her disposal literally every possible optimization technique that is implementable on that CPU, with no language-imposed overhead. What this means is that a higher-level language being "faster than C" is just a local maximum; the global maximum is that C is faster."

Exactly

The C program presented is very naive. getc? ungetc? malloc inside a loop?

For example, I can create a table to convert 4 characters at once instead of only one. Just as a start. Doable in Haskell, sure, but it would most likely not use the resources of modern processors (SSE2, AVX, etc.).
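
Roughly what I mean, sketched here with 2-byte pairs so the table stays at 64K entries (a full 4-byte table would be impractically large). Untested sketch, names made up:

    #include <stddef.h>
    #include <stdint.h>

    static unsigned char comp[256];    /* per-character table, filled elsewhere */
    static uint16_t pair[65536];       /* two input bytes -> two output bytes */

    static void init_pair_table(void)
    {
        for (unsigned a = 0; a < 256; a++)
            for (unsigned b = 0; b < 256; b++)
                pair[(a << 8) | b] = (uint16_t)((comp[a] << 8) | comp[b]);
    }

    static void translate(unsigned char *buf, size_t n)
    {
        size_t i = 0;
        for (; i + 1 < n; i += 2) {    /* two characters per table lookup */
            uint16_t out = pair[((unsigned)buf[i] << 8) | buf[i + 1]];
            buf[i]     = (unsigned char)(out >> 8);
            buf[i + 1] = (unsigned char)out;
        }
        if (i < n)
            buf[i] = comp[buf[i]];     /* odd trailing byte */
    }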


I wish I could upvote this 10x. When folks were saying "Oh, with HotSpotUltimateOptoMix III Java is going to be faster than C code!" or whatever the optimizer du jour was, I would sigh. We have very much reached a stage where there are computers which can dedicate way more cycles to your effort than you need to get it done in time; we can spend this "surplus" on making writing code easier, and that is a huge win. Enjoy it. Don't worry that someone could craft it in C+Asm and give the idle loop more cycles to play with, it doesn't need them.


Unless of course the idle loop isn't idle. Since the end of the assembler era there has always been a trade-off: programmer time vs processor time. And programmer time is about as expensive as it was in the past, while cycles have gotten cheaper and cheaper.

But the cost of a cycle still isn't 0, and likely it will never be. Optimizing a chunk of code and making it perform 10 times as fast can be a huge competitive advantage when you're operating a sizeable cluster that is really working instead of idling.

For one-offs and incidental use optimization of any kind is nonsense.


I don't think we disagree. I assert both things are true:

1) There are situations in which extracting the most work out of a given compute infrastructure is the best use of one's time and effort.

2) There are situations in which optimization will only lead to additional free cycles which will go unused thus investment in such optimizations is a waste of time.


Not sure how often (2) arises anymore, since you're often paying in battery life or cloud computing charges.


That's a fair point, since the idle loop can be the halt/sleep state.


These are also people who don't understand you can JIT whatever language you like.

It's not common to JIT C or C++, but you could, and gain all of the profiling/whatever advantages. It's just not commonly done because the end result of a C/C++ compiler is "fast enough". Plus, JITs have a weird set of tradeoffs. You'd really need a hybrid model where you did expensive high-level loop transformations statically, then left the IR alone and JIT'd it[1]. This whole set of ideas quickly gets into TeNDRA/ANDF territory, however.

[1] Of course, profiling at runtime through the JIT may discover some more places to loop transform.


C is itself a HLL; e.g., it uses a high-level state model based upon mutable variables rather than direct manipulation of processor registers and hardware memory locations.


Actually pointers are a sore spot. Of course we have `restrict` in C (and not C++, for some strange reason) but it leaks elsewhere too, like qsort.

Of course your main point still holds (you could write macros that implement your container without void* everywhere) but I feel those are the issues where a HLL is most likely to outperform a naïve C implementation (since it would take gyrations to most efficiently implement the same in C).
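
For reference, a minimal sketch of what restrict buys you (since the compiler may then assume the two pointers never alias, it can keep values in registers and vectorize the loop freely):

    #include <stddef.h>

    /* C99: restrict promises the compiler that x and y never overlap. */
    void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
    {
        for (size_t i = 0; i < n; i++)
            y[i] += a * x[i];
    }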


I guess I will be your essay-writing dissenter.

C is a high-level language with fewer features than many other languages, is not necessarily the engine behind other languages, has the problem of its programs poorly implementing a percentage of what other high level languages are capable of doing quickly and securely, and provides slow and troublesome memory allocation out-of-the-box. When comparing the speed of operations, people are rarely comparing apples to apples. And contrary to what is spattered on the boards, C is not an understandable, close-to-the-metal wrapper around assembly instructions (compilers have advanced quite a bit).

I love C. It does feel fast, and I get the illusion of being close to the metal. It was one of my first languages, holding a sentimental place in my heart. Very important things are written in it. It is a high-level language with some okay abstractions.

Is it underneath other high level languages? Maybe, if you mean that the compiler might be written in C in order to bootstrap the language. Of course, one could write the compiler in any language; it's all about translating programmer-friendly symbols into assembly or VM bytecode, right? And speed of compile is a different subject from speed of the compiled program.

But here is the gotcha on raw performance: Your large C program poorly implements a percentage of what other high level languages are capable of doing quickly and securely.

I once foolishly argued in favor of C's performance, saying that one could write a layer that supports all these nice features speedily, such as the data structures I will mention below as well as GC; by the time you do that, you might as well be using a different language. You probably implemented that layer poorly, compared to other languages with large communities pounding at and optimizing that layer. For example, when you implemented your "fast" list with the basic struct and next pointer, did you also implement the new-node creation in such a way that it still uses raw malloc(), as opposed to managing previously-malloced memory efficiently?

How many implementations of a basic list do we need in C? Super large integers? Fixed-point integers? Growable arrays? Lazy/infinite lists? Trees? Hash maps? Surely you don't think these other language designers said to themselves, "Let's support hash maps and make them slow." No, they came up with a fast standard, supported by their language, sometimes complete with various configuration options to make all the tradeoff decisions on making those data structures speed-efficient or memory-efficient for reads or writes. Others, of course, subscribe to a religion, er, a specific tradeoff, such as perl's approach to {}s ("There's more than one way to do it" ... unless you are dealing with hash tables).

What about all the wonderful memory management you can do in C? Aren't you closer to the metal that way, able to make basic memory allocation super speedy? Not really. This is part of the illusion. malloc() is slow enough that developers have rewritten versions of it several times. ROM-based MUDs, for example, manage their own memory, using an initial malloc, of course, but regularly using their own set of allocators and deallocators (free_string, str_dup, etc) on top of that allocation. There are these tricks and more in high level languages, including the sharing of partial structures (kinda like union but with more pointers and fewer bugs associated with those pointers), allowing for resource allocation strategies that can be "faster than C".
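
The pattern I mean looks roughly like this; a minimal sketch with made-up names, not the actual ROM code, and not thread-safe:

    #include <stdlib.h>

    /* Fixed-size node recycler layered on top of malloc: nodes are handed
       back to a free list instead of free(), and reused on the next request. */
    typedef struct node {
        struct node *next;
        char payload[56];
    } node;

    static node *free_nodes = NULL;

    static node *node_new(void)
    {
        if (free_nodes) {                /* reuse a previously released node */
            node *n = free_nodes;
            free_nodes = n->next;
            return n;
        }
        return malloc(sizeof(node));     /* fall back to the system allocator */
    }

    static void node_release(node *n)
    {
        n->next = free_nodes;            /* push onto the free list */
        free_nodes = n;
    }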

If the argument in favor of C's speed at the end of the day is, "When we write crappy programs with buffer overrun holes, memory leaks, and no error handling, it's super fast!", we are (1) not comparing apples to apples and (2) doing ourselves and our customers a grave disservice.

Let's be honest: C is no "closer to the metal" than other high level languages (http://news.ycombinator.com/item?id=3753530 and https://en.wikipedia.org/wiki/Low-level_programming_language...). The days of manually XORing to assign 0 to a variable are well behind us.

Note... I don't mean all high-level languages. There are many very slow implementations of these languages. Programmers are getting better at this stuff in modern implementations, gcc included. And it is fair to say that there are many things other languages can do that are, when you compare apples to apples, faster than C, especially when you factor in modern JIT compilation; and they also do some things slower than a similar function in C.

No, C isn't and won't be obsolete, not until people write popular OSes in other human-readable languages, complete with a body of excellent libraries. We operate in a world of legacy, working code.

Edit: Looks like we both must not have read the article before replying. The author goes over many of these points.


> Let's be honest: C is no "closer to the metal" than other high level languages

This is dead wrong, and your links do not support it. This is exactly the kind of statement that gets me grumpy.

Your link illustrates that an aggressive C optimizer can collapse a chunk of C code down to something smaller and simpler than the original code. This is true.

But what you said is that C is "no closer to the metal" than other high-level languages. Let's examine this assumption.

Take this C function:

  int plus2(int x) { return x + 2; }
You can compile this down into the following machine code on x86-64, which fully implements the function for all possible inputs and needs no supporting runtime of any kind:

  lea    eax,[rdi+0x2]
  ret
Now take the equivalent function in Python:

  def plus2(x):
    return x + 2
In CPython this compiles down to the following byte code:

  3           0 LOAD_FAST                0 (x)
              3 LOAD_CONST               1 (2)
              6 BINARY_ADD          
              7 RETURN_VALUE
Notice this is byte code and not machine code. Now suppose we wanted to compile this into machine code, could we get something out of it that looks like the assembly from our C function above? After all, you are claiming that C is "no closer to the metal" than other languages, so surely this must be possible?

The tricky part here is that BINARY_ADD opcode. BINARY_ADD has to handle the case where "x" is an object that implements an overloaded operator __add__(). And if it does, what then? Surely just a very few instructions of machine code will handle this case, if C is "no closer to the metal" than Python?

Well __add__() can be arbitrary Python code, so the only way you can implement this BINARY_ADD opcode is to implement an entire Python interpreter that runs __add__() in the overloaded operator case. And the Python interpreter is tens of thousands of lines of code in... C.

The end result is that writing the same function in C and Python is the difference between two machine code instructions and implementing an entire interpreter.

This is why I get grumpy when people deny that C is any different than other high-level languages. While this is a somewhat extreme case, you could make a similar argument about most operations that happen in other high-level languages; similar constructs will very frequently have less inherent cost in C.


Seconded. I'd add that languages like Python hide an enormous amount of not-close-to-the-metal-ness in every single statement, because every single statement has the implicit context "Interpreter, please interpret this string relative to your potentially complex internal state".

Haskell is only slightly less prone to this, despite being compiled, since every expression by default becomes a request to instantiate a thunk in the runtime's evaluation tree. Yes, you can contort yourself to avoid this, but at that point you are imperatively programming an expression tree evaluator. This can be fun and rewarding, but it's not that different from scripting a Python interpreter, and it's certainly much further from the metal than C.


Could you explain your second paragraph in a bit more detail? I feel you have homed in on a deep and significant point, but my passing understanding of Haskell isn't allowing me to fully understand it.

In particular, what would it look like if you were avoiding instantiating thunks?


Very crudely (Haskell experts please chime in!):

Every Haskell program is a lazily-evaluated tree of expressions; the program output itself is the root node and links represent data flow dependencies. The Haskell compiler turns source code into an initial representation of this tree as well as machine code that performs a sequence of reduction steps to turn the tree into a final output value. So a tree node at any given time is either a concrete value, or a chunk of code that can be executed to obtain that value (a thunk).

The compiler has the right to choose an arbitrary (semantics-preserving) node evaluation sequence to reduce the tree down to the final result.

Through strictness annotations (http://www.haskell.org/haskellwiki/Performance/Strictness#Ev...) you can force particular subtrees to be fully evaluated whenever-in-program-execution their parent node begins evaluation; to programmers in other languages this seems alien, but in Haskell you, by default, turn control of operation sequencing to the compiler.

Such annotations are sometimes desirable because the thunk representation of, say, a large chain of arithmetic operations on a list of numbers can be very expensive in memory, so one might want to force it to be reduced down to a concrete double as soon as possible (see, e.g., http://benchmarksgame.alioth.debian.org/u32/program.php?test..., and note all the areas where function arguments are preceded by exclamation points; those are strictness annotations).

Strictness annotations are often important for getting good performance out of certain kinds of Haskell code, and are in some cases not just important but essential for controlling memory consumption.

It can be very hard for non-experts to reason about where and why to use strictness, and I submit that this is because you're essentially doing an indirect second level of programming. The un-annotated source code describes the value-transformation semantics, while the strictness annotations are a side-band language for controlling the tree-evaluator that is embedded into your compiled program.

Hope that helps :)


Thanks for writing this up!


While a fan of your original comment, I think the logic got a little flaky here. Claiming any language is 'close to the metal' is to make implicit reference to particular implementations of said language. In recent years there has been an explosion in the use of techniques that have levelled this old distinction.

For example today it is quite possible to have Javascript OO code that runs more efficiently than, say, calling through structs of function pointers in C, since modern compilers have morphed into such strange runtime beasts that they can make (and test at runtime for) stronger assumptions about the contents of those pointers than a traditional compiler ever could. Where an indirect call in a traditional C compiler will only ever be at best "CALL [rax]" (assuming no indirection required to implement dynamic linking), newer implementations might inline the called function entirely, all dependent on what input the program happens to have been fed on a particular run.
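
For concreteness, the C idiom I have in mind looks something like this (illustrative sketch only); a traditional ahead-of-time compiler sees nothing but an opaque indirect call on the marked line:

    #include <stddef.h>

    struct shape_ops {
        double (*area)(const void *self);
    };

    struct shape {
        const struct shape_ops *ops;
        /* object data follows */
    };

    double total_area(const struct shape *const *shapes, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum += shapes[i]->ops->area(shapes[i]);   /* indirect call per element */
        return sum;
    }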

This trend is why I'm not certain of the odds for less expressive languages in medium term. Instead of a struct of function pointers, imagine an inheritance chain of structs of function pointers. Although it can be implemented in C it is a native concept in Javascript, and one that compilers can already optimize. While the JS compiler has explicit information fundamental to the language to infer this structure from (and version any dependent objects to implement a runtime guard), we aren't nearly at the point where our compilers can deduce such an idiom in C or assembly, optimize for it, or solve a huge satisfiability problem at runtime in order to produce guards for it.

It naturally follows that the more semantic information a compiler has available to it, the better chance it has, at least in the medium term (10-20 years?), of applying tractable optimizations.


This sounds like the Sufficiently Smart Compiler line of reasoning that was supposed to make Itanium the architecture of the future and has been forecast to make high-level languages faster than C any day now. In practice, the extra costs imposed by high-level languages (and in particular, the loss of memory management control) have outweighed the benefit of extra high-level information consistently for decades. And I see no reason why this will change.

In particular, the optimizing VMs you are describing still have to be written in something. The proof is in the pudding: no serious language runtime is written in anything other than C or C++. The argument that C is being displaced will be an empty one until you start to see language runtimes being written in something else.

(Just to clarify, since another commenter missed this distinction earlier: I'm talking specifically about VMs/runtimes, not ahead-of-time compilers. Compilers are frequently written in non-C languages because an ahead-of-time transformation doesn't have the same stringent efficiency and resource usage requirements that language runtimes do).


The equivalent `plus2` OCaml function compiles to:

    camlAdd__plus2_1030:
    .L100:
    	addq	$4, %rax
    	ret
(It's adding 4 instead of 2 because ints are tagged rather than boxed: the last bit is 1 for an int and 0 for an address, so an int n is represented as 2n+1.)


Part of the point of the original article is that not even assembler is "close to the metal" any more. How long does that fragment of assembly code take to execute? Depends on whether the instructions are in the I-cache, whether some previous branch prediction has failed, and whether the data are in the cache. All this adds up to a couple of orders of magnitude.


That's cool and OCaml is an interesting language. However I'm sure you know it would not be difficult to come up with a different example where C lets you express "bare metal" idioms that are not as "bare metal" when expressed in OCaml.


> Now take the equivalent function in Python

Those functions may be similar, but they are not equivalent.


Apples and oranges yet again. The original article is about Haskell; an entirely different beast from Python. Haskell compiles down to machine code, not byte code.


I was responding to a comment that said that "C is no closer to the metal than other high level languages", without qualifying which high level languages.

But even Haskell is not as "bare metal" as C. Just because a language can compile to machine code doesn't mean it fills the same role that C does.


I think it is a little unfair to say C the language is not low-level because a C application must implement features (or use libraries) that higher-level languages would already provide.

However, your older post [1] contrasting the code generated by gcc -O0 and -O1 was illuminating and made me rethink my position. Thanks! <:)

[1] http://news.ycombinator.com/item?id=3753530


Of course. That's why people should write 90% of the non-speed-limited code in something higher level and the speed-critical things in C.

"And contrary to what is spattered on the boards, C is not an understandable, close-to-the-metal wrapper around assembly instructions (compilers have advanced quite a bit)"

Yes, it isn't, BUT (thanks to compiler support) it is the language where you can most easily drop down to assembler or something closer to it: http://msdn.microsoft.com/en-us/library/26td21ds(v=vs.80).as...
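
As a trivial illustration (GCC/Clang extended asm here; MSVC's __asm syntax in the MSDN link looks different), reading the x86 time-stamp counter from C:

    #include <stdint.h>

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        /* rdtsc leaves the low 32 bits in eax and the high 32 bits in edx */
        __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }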


Edit: iPhone posted to root, meant in reply to top voted comment sry!

I don't think that's the underlying point at all. A functional language gives the compiler more room to optimize - the compiler has more information because a functional language lets the programmer express constraints in ways imperative languages can't - the compiler fundamentally knows more and thus can generate better machine code. The productivity benefits of HLLs are secondary to this discussion.


Thank you.


Where possible, GHC does not compile Haskell via C anymore.


I don't think his example is helping his argument at all.

He cherry picks optimizations for the Haskell, such as using Data.Vector.Unboxed instead of the regular lists and removing calls to isLetter, but then he rolls his own linked list and uses getc in the C version. He doesn't even have the correct return type for main.

Haskell written by decent Haskell programmers is faster than C written by poor C programmers. Not very surprising.


I'm no C expert - too bad to hear that about his C code. However I can tell people here that Vector.Unboxed is a very common optimization as soon as you start thinking about performance in Haskell. Nothing "expert" about it, really. I, for one, use it in all of my computational Haskell code.


That's actually the problem he's hinting at near the end as he's trying to make his point about optimizing C.

The optimizations he needs to make to the C code didn't seem clear to him when he was writing it, but they're very common optimizations for someone with more experience writing in C.

That he made the seemingly-natural optimization in Haskell while not affording C the same luxury is what's hurting his argument.


"Haskell is easier to optimize than C" != "Haskell is faster than C".


If Haskell is easier to optimize than C, then it could easily be that there's some amount of programmer effort, for which expending that much effort in Haskell yields a faster program than expending that much effort in C. If that amount of effort is in the range of effort most people are able to expend on a class of projects, then Haskell is faster than C for those projects. It may even be that those are most projects.

It is, of course, not the case that Haskell is faster than C with arbitrary effort expended tuning to the specific hardware - no one is claiming that.


Are you writing that in Haskell or in C? Because in Haskell, you'd want to use the /= operator. In C, you'd want to use strcmp.


I think you meant strncmp. Nobody ever wants strcmp, not if they know what it does.


We know what it does and it's the right thing to use. You must be thinking of strncpy or some other ennified string.h function.


You don't know what you're talking about.


It's also important to notice that Data.Vector.Unboxed has the same API as the other Data.Vector implementations and provides the same typechecking that the other implementations do. All it requires is that the underlying type have an Unbox instance. In most cases, you change:

   import Data.Vector as V
to

   import Data.Vector.Unboxed as V
and your code now gets the performance increase. (The reason it's not the default is that it's not the most generic. What if you want to have a vector of IO computations? Good luck unboxing that. But if you just want integers, or something, that's much easier.)


If they have the same semantics why doesn't the library do an unboxing implementation for types that support it and not for the rest?


The semantics are different with regard to laziness: a boxed vector can contain thunks (unevaluated values), so individual elements can be evaluated on demand, while an unboxed vector must be evaluated all at the same time (making it difficult to run a transformation in parallel with multiple threads, for example).


This sounds like exactly the sort of optimization that I'm always hearing a sufficiently smart compiler will make for me.


No, because compilers are prohibited from changing the meaning of the program. Changing a lazy value to a strict value is only permitted when the compiler can prove that the value will always be evaluated. It can't do that for unboxed arrays because that is a global change, so it has to leave it to the programmer.


Fair enough.

My complaint is that the C code isn't given the same chance. He even calls out that reading data with getc is a known performance problem, but then does it anyway. Any book on learning C will point out that getc is slow for reading lots of data, and fscanf or fread should be used instead.
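
For instance, something along these lines; a rough sketch of block-at-a-time I/O with fread/fwrite, error handling kept minimal and the per-byte table left abstract:

    #include <stdio.h>

    int filter_blocks(FILE *in, FILE *out, const unsigned char table[256])
    {
        unsigned char buf[1 << 16];
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
            for (size_t i = 0; i < n; i++)
                buf[i] = table[buf[i]];          /* per-byte work on the whole block */
            if (fwrite(buf, 1, n, out) != n)
                return -1;
        }
        return ferror(in) ? -1 : 0;
    }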


From what I understood from his post, using a buffered input function would a) make the code differ from the specification, b) require more refactoring than the Haskell code needed. b) seems particularly important when the program is not <100LOC, but tens or hundreds of lines of code.


The problem is that his implementation is essentially a test of how he's using libc versus how Haskell is using it.

The pointer arithmetic he's using shouldn't need to be optimized, since the page sizes he's malloc'ing and pulling from memory are small enough that they should stay in the L1/L2 cache for the entire run (he's using 1k blocks of data, most processors use 4k pages). There's almost no optimization to be done there.

The biggest performance hit is actually the single character puts versus a block read or write.

Haskell is probably implemented to read a large block of the file(or perhaps the entire file) into memory, then parse after. That would be a minimal number of system fread calls over the entire run. Versus the C code, where 1 block's parse could be 1024 getc and putc calls.


It's not about system calls, it's about locks. The getc function is buffered (by default) - that's what's going on behind the scenes in that FILE structure. What is slow about calling getc over and over is synchronization around that FILE object (hence the existence of functions getc_unlocked, &c).


I would never encourage use of the f* I/O functions from the standard C library for high performance code. They're all buffered which means there's an extra copy happening. You should use read and other POSIX I/O functions instead.


This is not 100% true.

f* calls are buffered, this is true. But depending on your workload this may be a blessing in disguise, because if you're parsing a very long stream of unstructured data you'll have to re-implement most of that buffering logic in your own code to deal with the inevitable corner cases where your data overlaps the borders of the buffer size chosen. The obvious optimization here is to pre-allocate a buffer of the right size and read in the data in one go, but for a stream of unknown and possibly infinite length (such as used in a filter like this) that is not possible.

So the f* functions have their place, even in optimized code, but should be used with care. If you're reading data with fixed-size blocks (such as in this example) then yes, a simple 'read' call will likely be faster, because as you note the read call will place the data directly into a user-accessible buffer without the need to go through more library calls to get at the data. (fread is a wrapper function around the read system call.)
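
To illustrate the corner case I mean, here is a rough sketch of a read(2) loop that has to carry partial records across block boundaries; handle_line is a hypothetical callback, records are assumed to fit in MAXLINE, and error handling is kept minimal:

    #include <string.h>
    #include <unistd.h>

    enum { BLK = 1 << 16, MAXLINE = 4096 };

    void process_fd(int fd, void (*handle_line)(const char *line, size_t len))
    {
        char buf[MAXLINE + BLK];
        size_t carry = 0;                       /* bytes left over from the last read */
        ssize_t n;

        while ((n = read(fd, buf + carry, BLK)) > 0) {
            size_t avail = carry + (size_t)n;
            size_t start = 0;
            for (size_t i = 0; i < avail; i++) {
                if (buf[i] == '\n') {           /* a complete record */
                    handle_line(buf + start, i - start);
                    start = i + 1;
                }
            }
            carry = avail - start;              /* partial record at the end */
            memmove(buf, buf + start, carry);   /* move it to the front for next round */
        }
    }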


My impression is that, while it is possible to come up with performance-oriented Haskell code, it becomes quite ugly and diverges from those beautiful and idiomatic examples found in books. E.g. explicit strictness specifications, real quicksort (with Array.ST), etc.


The same applies to C, of course.


I wouldn't say so. Some books indeed skip some "real-world" details, but that doesn't affect idiomatic C, which is already performance friendly.


> rolls his own linked list

Are you saying his implementation is bad, or are the available library linkedlists optimized in very special ways...?


>He cherry picks optimizations for the Haskell, such as using Data.Vector.Unboxed instead of the regular lists and removing calls to isLetter

Neither of those complaints are reasonable at all. What is wrong with using Vectors? That is like complaining that a C++ version uses vectors, which it presumably would.

And how is changing a single function call that is unicode aware to one that isn't unfair? How is using getc in the C version unfair?


The unfairness is that he optimizes the Haskell code but then doesn't do the same for the C code and declared Haskell faster. Not what I'd call a reasonable test.


He profiled both programs and made exactly one optimisation to the Haskell program (removing the call to isLetter). Profiling didn't reveal any obvious C optimisations. So he went with what he had. It all seems pretty reasonable to me.

C fans (I am one, by the way) shouldn't get too upset by this. You still aren't going to write an operating system kernel in Haskell.


The obvious C optimization is using getc_unlocked and putc_unlocked, which cuts the C run-time in better than half, beating out Haskell by a smidge.
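
Something like this, assuming single-threaded use of the streams (getc_unlocked/putc_unlocked are POSIX, not ISO C); a sketch with the translation table left abstract:

    #include <stdio.h>

    /* Same per-character loop as before, minus the per-call FILE locking. */
    void filter(FILE *in, FILE *out, const unsigned char table[256])
    {
        int c;
        while ((c = getc_unlocked(in)) != EOF)
            putc_unlocked(table[(unsigned char)c], out);
    }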


And that is a valid criticism. The criticisms that were being responded to were not valid.


For sure. The chorus of "Clearly you should be writing your own buffering" has been bugging me - it's not hard, but there's a lot of room for error, and you end up obfuscating the code that actually solves your problem by reimplementing infrastructure that already exists in libc... If you're consuming a single character at a time from a single thread, I don't think you're likely to beat getc_unlocked by much anyway.


I'm in no position to judge the quality of the Haskell code (I couldn't program my way out of a wet paper bag in Haskell) but after half a lifetime of writing C for a living I can say with confidence that the C code is horribly written and horribly inefficient.

It is tempting to pull the code and fix it.

If you want to compare two languages make sure that you are proficient in both.


I agree - if you're comparing runtime implementations.

I think this is actually a useful comparison, though. I would argue it's a lot easier to become a proficient, performance-conscious programmer in python, java, or even Haskell, than C. And you're more likely to shoot yourself in the foot with C.

From the point of view of someone who hasn't learned either language (maybe a scientist or engineer looking to do some simulation work), the message here is, "with the same time and effort, not only is Haskell as fast as C in many cases, but in some cases it will actually be faster than the C code that you, a beginner, can write."


A similar anecdote from my own experience:

Chesspark had a web client and a Win32 native client. It kept track of your friends with a roster (the underlying stack was based on XMPP). As a way to make new users feel welcome, I and a few coworkers were added to every new user's roster (like MySpace Tom). It wasn't long before this overwhelmed our clients.

Each client was written by a different developer. One was in JavaScript, and one in C. Due to some poor design choices on the C side, the JavaScript client was substantially faster. I don't remember the exact benchmarks, but I think it was close to 50x faster.

The JavaScript code took less time to develop and less time to fix. The C code did eventually get fixed and outpace the JavaScript on raw speed, but if I were doing it again, I never would have made a native client to begin with.


Nice story.

If I flatly said that JavaScript was 50x faster than C, I'd have butchered your story and some real points of interest here.

Supposing your team is productive in JavaScript, but not too exceptional with C, then a similar time investment can produce significantly faster JavaScript programs. In this way, in real practice, language productivity can totally drown technical performance differences. Even when you need to optimize, optimization is also subject to productivity differences.

It is probably inherently somewhat harder to reach the required level of ability in C to get the required performance, and thus to find people with this ability. Ergonomics and ease of entry don't just make people happy, they also have an impact on the performance of the average program to be written.

But even if it turns out that C is at no disadvantage here - it matters more what you will be able to use well. If C makes insane things possible that you will never do, they aren't a good reason for you to use C.

Moreover: following a general policy of writing JavaScript (or other favored language, within reason) is going to give you good-enough performance in a HUGE variety of problems.

Even if this performance is less than C's for an important program, there are two mitigating factors:

1. The slower performance may still be easily 'good enough'. There are diminishing returns on speed improvements. More speed does not always make the software better.

2. You can get the required speed by optimizing and replacing the bits which are bottlenecking the speed-critical bits, without committing to writing the entire project at 'max speed' out of the gate, which will often be a gigantic waste of resources on speed improvements that no one will ever see.

Most projects have limited applicability and limited lifespans. What matters isn't whether you are at maximum performance. What matters is whether performance is good enough for purpose.


What exactly were these "poor design choices" that affected the C implementation's performance so badly?


Good design choices are better than bad design choices?


More like: good design choices are vastly more important than micro-optimisation, and Haskell is a good design choice.


In general, JavaScript wouldn't take less time to develop, and especially less time to fix, than modern C++11/Boost code.

That's a stupid myth ("very slow C++ development times") from the guys who suck at programming.


I want to point out something nobody ever talks about:

Efficient data containers.

For instance, Haskell and C++ come standard with a lot of these (maps, sets, various linked lists, ...), whereas the C standard doesn't come with anything like this. Moreover, type parameters and templates in these languages make usage of such structures easier and safer.

I've seen C code where programmers used suboptimal containers for the problem (like plain arrays or simple linked lists) because they didn't want to pull in a non-standard library or implement a faster but more complex container themselves. In C++ or Haskell, these efficient structures come for free.

Also, we can say the same for algorithms and concurrency. Haskell's safety and expressiveness made it really easy for me to implement such complex yet efficient systems.

These features of high-level languages often lead to faster programs in those languages.


Haskell maps, being persistent, aren't anywhere close to efficient for mutable maps.


The paragraph beginning with "To put it another way, C is no longer close to the real machine" really drove the point home. The further C gets away from the real machine then the less useful C will become. Higher-level languages have the advantage that a compiler can more easily determine what the program is attempting to accomplish and optimize the result for a specified architecture. This will be very difficult to do for C. As a result I would expect the performance of higher-level languages to start exceeding the performance of C. I guess we're just going to have to change our conventional wisdom when that time comes.


Well that point is complete nonsense so you should undrive it.

Just because a piece of code incurs hardware related performance issues does not mean it's "no longer close to the real machine". Cache misses? Reorder your data, or start inserting prefetch statements. Mispredicted branches? Issue a hint, or structure your code better.

Both of these are profile-guided optimizations. It's very difficult for compilers to optimize for access patterns that will only become clear in the context of execution, and often depend on your target spec machine.


But a JIT can perform such optimizations. Which doesn't help Haskell, but may help JavaScript or Lua beat C in some cases.


I'm doubtful that a JIT could realistically adapt its output based on runtime performance metrics, but if you have links I'd love to read more.



None of the problems he cites are issues with his code, though, nor are they as much of an issue with modern processors. Unless I'm missing something, the presentation he cites is running on the Cell processor, not a modern x86 system.

RAM cache misses are pretty bad, but with the blocks that he's dealing with, he's looking at maybe 1 miss per block, and that's assuming that the processor is letting the blocks get entirely out of the 3 levels of cache to RAM.

The complement array should stay in memory 100% of the time because it's tiny and referenced constantly, and the individual arrays shouldn't be that bad. He's not branching all that much, so unless the while and for loops are triggering after 1 or 2 characters, it's unlikely that's the issue.

The actual problem he's more likely running into is the OS rescheduling the thread to a different core, which is a gigantic hit.


As many advocates of functional programming point out, in many cases the speed of development is more valuable than the running time of the code. The great strength of Haskell and other FPs is their readability and modularity. Trying to win people over with benchmarks is the wrong approach IMO.


It's more a matter of trying to nix the "Haskell would be nice, but it's too slow to be practical" line that many see as a killer.


Does Haskell really have a reputation for being slow? In a world where Ruby is the de facto standard for startups, I wouldn't think Haskell would have anything to worry about.


It does. A lot of people have written ignorant blog posts whereby they show their first "real" Haskell program is extremely slow compared to their heavily optimized C version (with years of experience behind it).

A cursory glance at the code, however, reveals that their Haskell program is using linked lists of linked lists of boxed arbitrary precision integers while the C version uses a 2D array of ints.

Okay, perhaps that's a bit of an exaggeration but you get the idea.


The point is that it's comparable to C in speed, which is almost always good enough.


Before you go and rewrite everything in Haskell for speed, it might be worth trying to reproduce the results from here. I tried to and C was much faster than Haskell:

http://paulspontifications.blogspot.com.au/2013/01/when-hask...


When I try it, the Haskell version is faster (1.3s versus 2.6s on a 98MB input file). GHC 7.0.4 with -O3, GCC 4.6.2 with -O3. Not sure why these results are different...


Hmm... it seems to me that whenever any article is posted that claims "X is faster than C", there are immediately 40 replies saying "Well, the author's C is horrible. If I wrote that, it would be much different."

Okay, as someone who has NOT been programming in C 8 hours a day for years on end, I would actually like to see somebody do this -- to show me what GOOD C looks like.

So if someone wouldn't mind, could you take his C code and show me the improved version? This would really help me understand. (And I don't just mean an example where one line could be improved; I'm talking about the whole thing.)


Ok, I pledge to re-write this properly and to benchmark the current implementation vs a nice one. I'll post the results. I need something to get my mind off things and this is as good as any. It will take at least until Monday (it is my son's birthday tomorrow).


Hey Jacques, may I ask for a code review? :-)

http://stdio.be/revseq.c

Compiled with "gcc -pipe -Wall -O3 -fomit-frame-pointer -std=c99 -pthread" on my Mac, it's about twice as fast at the blog author's version.

Time spent: coding 30 minutes, bugfixing 30 minutes.

I have a feeling there is some kind of catch in the description of the algorithm in terms of implementing the output, but I for the life of me could not grok whether they wanted me to parse the entry into these three pieces or not...

EDIT to add: The only optimization I made was inlining the tightly-called "subst" function, did it without any profiling (so the optimization process literally took about 30 seconds:). Before inlining this version was still about 15% faster than the blog author's one.


Sure, I'll pick it up in the post. Neat little project this. I won't peek at your code until I'm done.


Thank you! It was indeed a nice little fun exercise. It's interesting what bugs I made by being tired and not reading the task carefully/not thinking clearly (yesterday was a bit of a long and stressful day):

1) My initial understanding was that I do need to reverse the order, yet somehow after re-reading the article I understood the order does not need to change, and the "reverse" in the name is some kind of jargon. This is quite stupid, and probably not worth mentioning, if only to prove I was tired :)

2) missing that the first iteration of the "business logic" code in my case happens before anything is filled in. Crash.

3) forgetting about the "\n"s - with rather funny "partially correct" output effect.

Very much looking forward to seeing your code!


Hey Andrew,

Ok, I looked at your code. What you really should do (before coding up the solution) is to look at the problem specification. Other than that I like the 'direct' approach, it isn't quite as fast as what I cooked up but yours is a lot shorter.


Thanks! yes, this goes to show the perils of coding after a 12h work day on Friday :-). This affected the use of fgets() I/O (I understood you have to call the line-buffered routines based on their description)

Enjoyed reading your code. Beautiful. Thanks!


Looking forward to seeing your results from this.


Seconded. How are you going to be posting it?


Blog entry, already working on it. It's fun.


Be sure to post it as a HN submission when you finish!


Subscribing to this.


There already is an efficient multithreaded C version on the shootout site. I don't know if that's what people would consider good production code, but at least it isn't laughably bad.

http://benchmarksgame.alioth.debian.org/u32/program.php?test...


On my 64-bit MBP, this program does not terminate - at least not within the 3 minutes I allowed it to run. (the blog article's one completes within 4 seconds, so I thought 60-fold slack is enough).

I did not debug it though.



I know it works on Ubuntu 64-bit. :-) Just that it appears to be hanging on MacOS 64-bit. The algorithms like this one should hardly have such a catastrophic failure - that's why I am curious if someone else can replicate it hanging on MacOS.

BTW, on 8-core Ubuntu box with 32Gb RAM, my silly half-an-hour hack from yesterday - http://stdio.be/revseq.c - still outperforms this one and completes in 60% of time. So I am getting more and more convinced that either I did not get the spec right while writing it.. Or that I should submit it :-)

EDIT: If you have a chance to give it a whirl alongside, could be fun to compare your results with mine...


I want an article that says "Why it doesn't matter what's faster than what - just ship product people use". I code everything I create using the language that gives me the least amount of friction between me and the working product.

On Android that's Java/C/C++, for my servers it's Python/Ruby/Java/C++/C (everything is service oriented consumable APIs), for the browser it's Javascript. Any time I want something faster - I recode and bind it in C/C++ if at all possible.

I feel like a lot of these language comparison posts are just a massive pissing contest. All that matters is that people use the thing that you made. Billions of lines of code have been written in the past. Make sure your code isn't part of the billion that nobody cares about.


I'm afraid the argument that C+ASM is always faster is flawed in reality. Pure ASM, with a bit of C thrown in, maybe, but this is just as impractical for complex codes as C itself is.

It is well known that for numerical codes Fortran beats the pants off C. Why is this? Because the structure of C programs proves difficult to optimise automatically. Indeed the C committee attempted to address one of the main problems by introducing the restrict keyword (the problem of course is aliasing).

For complex codes, ASM isn't an option. For large functions, high levels of optimisation aren't an option for C because C compilers are incapable of optimisation in a reasonable time frame: I have short code that cannot be compiled in less than 2 minutes on either gcc or clang. Full alias analysis requires data flow which is cubic order on function size and C compilers are incapable of partitioning functions to keep the cost down.

Furthermore, C has a weak set of control operations and an object model which is generally inappropriate for modern software. K&R C compilers were hopeless because the ABI required the caller to push and pop arguments to support varargs, preventing tail-recursion optimisation of C function calls.

Subroutine calling is useful, but it is not the only control structure. Continuation passing, control exchange, and other fundamentals are missing from C. These things can always be emulated by turning your program into one large function, but then, it isn't C and it cannot be compiled because the best C compilers available cannot optimise large functions.

Similarly, complex data structures which involve general graph shapes require garbage collection for memory management. With C that's not built in, so you have no choice but to roll your own (there is no other way to manage a graph). It's clear that modern copying collectors will beat the pants off C in this case.

C++ pushes the boundaries. It can trash C easily because it has more powerful constructions. It had real inlining before C, and whole program compilation via templates. It is high enough level for lazy evaluators to perform high level optimisations (expression templates) C programmers could never dream of. And C++ virtual dispatch is bound to be more effective than roll your own OO in C, once the program gets complex because the C programmer will never get it right: the type system is too weak.

Many other languages generate C and have an FFI; some, like Felix, go much further and allow embedding. Indeed, any C++ program you care to write is a Felix program by embedding, so Felix is necessarily faster than C by the OP's argument: C++ is Felix's assembler.

As the compiler writer I have to tell you that the restriction to the weak C/C++ object model is a serious constraint. I really wish I could generate machine code to get around the C language. It's slow. It's hard to express useful control structures in. It tends to generate bad code. With separate compilation bad performance is assured.

I am sorry but the OP is just plain wrong. C is not assured to be faster; on the contrary, it's probably the worst language you could dream up in terms of performance. The evidence is in the C compilers themselves. They're usually written in C, and they're incapable of generating reasonable code in many cases and impossible to improve because C is such a poor language that all the brain power of hundreds of contributors cannot do it.

Compare with the Ocaml compiler, written in Ocaml, which is lightning fast and generates reasonable code, all the time: not as fast as C for micro-benchmarks but don't even think about solving complex graph problems in C, the Ocaml GC (written in C), will easily trash a home brew collection algorithm.

Compare with the ATS(2) compiler, written in Ocaml (ATS), which by using dependent typing eliminates the need for the run-time checks that plague C programs, given the great difficulty of reasoning about the correctness of C code. ATS generates C, but you would never be able to hand write that same C and also be confident your code was correct.

Compare with Felix, compiler written in Ocaml, which generates C++, can do very high level optimisations, which can embed C++ in a more flexible way than a mere FFI, and which provides some novel control structures (fibres, generators) which you'd never get right hand coding in C.

The bottom line is that OP's claim is valid only in a limited context. C is good for small functions where correctness is relatively easy to verify manually and optimisation is easy to do automatically, and any decent C code generating compiler for a high level language will probably generate C code with comparable performance.

So the converse of the argument is true: good high level languages will trash C in all contexts other than micro tests where they will do roughly the same.


The Rust compiler was also written in OCaml (until it could be self-hosted in Rust itself).


It's not actually the structure of C programs, it's the guarantees the language offers. So only about the first paragraph of your rant is right.

Fortran had almost no memory aliasing, explicit global accesses, and offered almost unbridled implementor freedom. As long as the operations got done, it didn't care what happened behind the scenes.

None of the rest of the things you talk about matter when it comes to optimizing C, to be honest. If you gave me no aliasing by default and explicit globals, I probably could do near as well as fortran (though it would take significantly more analysis) in terms of loop transformations.

Note that "full alias analysis" is statically undecidable. When you say cubic order, you are thinking of Andersen's subtyping based algorithm. There are unification based algorithms that are almost linear time (inverse ackermann).

At this point, we have scaled these algorithms almost as far as you can on a single computer. You can do context insensitive andersens on many million LOC without too much trouble.

Context-insensitive unification points-to can scale to whatever size you like.

Context sensitive unification based algorithms do quite well in practice with 10 million LOC + codebases.

The main reason you don't see unification based algorithms used often in free compilers is because the entire set of algorithms are covered by patents owned by MS Research.

As a final note, note that C++ does not really help optimization in practice, it often hurts it.

It is very hard to teach a pointer analysis algorithm about virtual calling. Most compilers treat them like function pointers that get type-filtered, and do some form of class hierarchy analysis to limit the number of call graph targets, or try to incrementally discover the call graph. It's a bit of a mess.

On the other hand, straight function pointers get resolved either context-sensitively or insensitively.

In fact, C++ makes type based aliasing a lot worse due to placement new being able to legally change the type of a piece of memory, which is very hard to track over a program.

Even outside the realm of alias analysis, C++ involves a lot more structures, which means a lot more time has to be spent trying to get pieces back into scalars, or struct splitting, or something, so that you don't end up having to fuck around with memory every time you touch a piece of the structure.

I could go on and on.

In short: Any of C++'s lower level optimization advantages come from less pointer usage by programmers, not language guarantees.

At the high level, it's from better standard implementations and common usage idioms.

In any case, high level languages, particularly those with memory objects (not real pointers, just memory objects, like Java) usually solve none of the pointer/alias analysis related problems. You are still stuck with the same pointer analysis algorithms.

For example: The only nice thing about java's memory system is that doing structure-field sensitive pointer analysis can only help, whereas in C it can hurt, due to some weirdness.

It's just nobody usually gets around to doing pointer analysis on the higher level languages (because it's harder and offers no particular benefit), they lower their language to an IR that already has a good algorithm in it.

Just in case you were wondering, I'm not talking out of my ass. I wrote GCC's first set of high-level loop optimizations, and also its pointer analysis.


Lots of interesting information. However, I think you over-reacted about Fortran. After reading the parent comment and your comment it seems both of you are saying the same thing: Fortran has the advantage of having no pointer aliasing.

Regarding unification-based algorithms, does Microsoft use any of them in their F# compiler? I ask because they have from time to time tried to say they won't sue you for F# technology. I don't know how binding those sweet nothings are. Given your knowledge about compilers and legal systems I am very curious to hear your opinion.


I don't know off hand if they use it. I know they do in static analysis tools. As nice as MS is, they seem to consider compilers solely a cost center. Their compilers produce "relatively good code", but have never really been state of the art.


The contradiction of "C is always faster than everything" (apparently shown untrue by comparison on the reverse-complement problem) is not "Haskell is faster than C".


Indeed not. The useful point this article makes, I think, is that if you're a good Haskell programmer and a competent C programmer, you can produce working code in both, but you can still produce better-performing code in Haskell.

Essentially, the lesson is that there's little point writing code in C for performance unless you're good at it. If you can write good-enough code in a higher-level language that you're happy with, you might already have reached an optimum.


> The winningest programs are always written in highly optimised C

Actually, for example looking at the winningest programs for x64 single core (http://benchmarksgame.alioth.debian.org/):

Ada: 1, C: 5, C++: 2, Fortran: 1, Haskell: 1, Java: 2, Javascript: 1

So plain C is only winningest in 5/13 = 38%, and C/C++ in 7/13 = 54%.


As always talking about performance differences and optimization without thoroughly profiling and pointing out bottlenecks should be frowned upon big time.

I would really love to see the generated assembly code and see what makes the difference in performance. Anyone up to analyzing?


Another monthly claim that XXX is faster than C. Please improve your C experience first.


I've got an honest question: it's not sarcasm...

How do you write fast multithreaded C-code? The article mentions that the C code is too disconnected from the real hardware which, in this case, has multiple cores.

Do you need to call (non portable?) code setting mutexes manually? (not that it would be a problem)

How do you use the CPU's underlying CAS operation? (by inline assembly?)

As an example: the guys who wrote the very fast LMAX disruptor pattern in Java relied on the fact that Java does provide methods inside the AtomicXXX classes calling CAS operations under the hood. But sadly they couldn't "pick" the one operation they'd like, which would have been faster than the one Java decided to use (it's a RFE if I recall correctly: they'd like Oracle to modify Java so that it uses the faster version when it makes sense).

I take it that in C you can inline assembly and do as you want!?
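
To be concrete, is the idea that you'd write something like the following (C11 <stdatomic.h> here, or equivalently the older GCC __sync builtins) rather than hand-rolled inline asm? Just a sketch of what I have in mind:

    #include <stdatomic.h>
    #include <stdbool.h>

    /* On x86-64 this should compile down to a single lock cmpxchg. */
    bool try_update(_Atomic long *slot, long expected, long desired)
    {
        return atomic_compare_exchange_strong(slot, &expected, desired);
    }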


The program in question copies data from stdin, does the barest minimum of preprocessing, statically remaps all characters, and writes to stdout...

Why on earth is this a demonstration of what Haskell can do effectively? There's no room to exploit anything interesting about type level reasoning.


>Conventional wisdom says that no programming language is faster than C, and all higher level languages (such as Haskell) are doomed to be much slower because of their distance from the real machine.

No: conventional wisdom just says that no higher level programming language is consistently faster than C AND/OR better to reason about with regards to memory consumption and runtime behaviour.

(Conventional wisdom also adds fortran, forth, Ada and C++ in the same "speedy" category).

Conventional wisdom adds that micro-benchmarks of some BS outlier examples (Java where the JIT can take advantage of some known condition to do something clever, etc) do not matter in real life programs, which are far more complex.

Conventional wisdom also adds that C to be speedier it doesn't even have to be highly optimized or carefully crafted by some C programming wizard or anything. Merely avoiding gross mistakes (like using an algorithm of the wrong complexity for the job) will do.

Conventional wisdom concludes that writing in your high level language in an unconventional (non idiomatic) way to get to "as fast a C" speeds is bullshit too, because it doesn't represent idiomatic (and far more common) high level language use.


1) There is the C++ language. Will you ever remember it exists? Stop comparing everything to C: it is an oversimplified, outdated language with specific uses for low-level system programming where you don't need complex data structures or high-level application logic.

2) If you summarize all the benchmarks comparing C to other languages, it turns out that the C language is one of the slowest. Of course, that's hardly the case. It's just that the dudes who did the benchmarks suck at C/C++ and at programming in general.

3) Haskell makes you 10 times more productive without microoptimizing? ORLY? Try doing some big enough data manipulation with C++11 & Boost vs Haskell. In C++ you're done with the task as long as you get decent performance from the simplest, naive, very high-level code, whereas in Haskell you're only at the beginning of your optimization journey, because the simplest code is not good enough; after optimization you end up with code that is much more complex, cluttered and hard to understand than in C++, and its performance is still only comparable to the naive C++ version if you're lucky.



