Parsing with Haskell and Attoparsec

lmm · on Aug 30, 2012

So the Haskell's not actually fully parsing the data, it's returning a lazy value that will do some of the work later? It's a perfectly good approach, but it means the performance comparison isn't really fair - the Haskell approach would slow down the rest of your program if you were actually using the value you parsed.

Comparing the line count between a program that uses a library and one that parses by hand is also somewhat unfair.

danieldk · on Aug 30, 2012

He is only parsing the headers in the final Haskell version, and skipping the body. In the C++ version, he also seems to check the MD5 and/or SHA1 hash (if a hash of the body available and OpenSSL with SHA1/MD5 support is available).

So, it's not exactly clear to me if the C++ and the Haskell version are still doing the same thing. Also, in the C++ version he is parsing stuff by hand, it will probably a lot faster when using a proper lexer (you usually can't beat an optimized finite state machine ;)) and parser.

Unfortunately, the emphasis on performance distracts from the main point: parsers written with attoparsec or parsec are usually understandable, type-safe, and pretty, while offering performance near to that of a C/C++ version.

In other words, if you have enough time, you can probably come up with a C or C++ parser that is as fast or faster than a Haskell-based parser (assuming some overhead of boxing/unboxing, garbage collection, more difficult optimization, ...), but would you care if the Haskell implementation had 80% of the performance and was far more maintainable?

Evbn · on Aug 31, 2012

Can't just make up a number though. What if the Haskell has 10% of the performance and uses 10x the RAM?

DasIch · on Aug 30, 2012

Saying the comparison is unfair because the Haskell solution is lazy makes about as much sense as claiming a comparison between a C and C++ implementation is unfair because the C++ one uses objects.

The point of such a comparison is not to provide a accurate language benchmark by comparing algorithmically identical implementations but to compare the most obvious implementations in those languages that solve a specific problem.

Personally I think the Haskell result would still be impressive at twice the speed of the C++ solution simply due to the LOC used and the time in which it was implemented.

lmm · on Aug 30, 2012

>The point of such a comparison is not to provide a accurate language benchmark by comparing algorithmically identical implementations but to compare the most obvious implementations in those languages that solve a specific problem.

Right, but "parse this text file and then throw the result on the floor" isn't a real problem. The haskell version is deferring some of the work to when the object's actually used - but if you wanted to use that parser in a real program, you'd have to do that work. It's like those webserver benchmarks that you can do much better on by making the server print "HTTP" before it's done any processing.

This seems to be a common problem when benchmarking haskell. You absolutely have to use a program that outputs the real-world result you want, otherwise the clever compiler with its lazy semantics will optimize away all the work.

dons · on Aug 30, 2012

No, that's not actually how laziness works when running a parser...

ky3 · on Aug 30, 2012

The 26 seconds he got at the end does parse all the headers while skipping the raw blobs. It's quite amazing.

batgaijin · on Aug 30, 2012

If you have non length variable ASCII would you still want to use Attoparsec or Data.Binary?