This is a pretty good example of why I don't like Haskell:
* The claim is that Haskell is better because it's simpler, yet the final code is decidedly not simple.
* The claim is that you don't need to know about machine behavior, yet the example required numerous manual interventions: explicit inlining, explicit strictness annotations, and avoiding the "obvious" data structures.
After all that contortion, the only reason it was able to beat the C version at all was by using multiple cores: on a 4-core machine it was only ~60% faster, and around 80% faster when parsing UTF-8. Worse, the comparison isn't apples to apples: the C implementation does true UTF-8 parsing via libc (so it isn't inlined), whereas the "faster" Haskell code only counts the first byte of each multibyte sequence.
I would argue that, sans the special cases (which the example doesn't trigger), the C version is the obvious first-pass implementation.
Here is my super dumb implementation, sketched below. It doesn't handle multibyte characters or stdin -- but I don't think the Haskell version did either, so I don't have a problem with that limitation.
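Something along these lines -- plain getc over a named file, counting bytes, words, and lines, nothing more:

```c
/* Minimal wc: bytes, words, lines. No multibyte handling, no stdin. */
#include <stdio.h>
#include <ctype.h>

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "rb");
    if (!f) {
        perror(argv[1]);
        return 1;
    }

    long bytes = 0, words = 0, lines = 0;
    int in_word = 0, c;
    while ((c = getc(f)) != EOF) {
        bytes++;
        if (c == '\n')
            lines++;
        if (isspace(c))
            in_word = 0;
        else if (!in_word) {
            in_word = 1;
            words++;  /* a word starts at its first non-space byte */
        }
    }
    printf("%ld %ld %ld %s\n", lines, words, bytes, argv[1]);
    fclose(f);
    return 0;
}
```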
All the Haskell version is doing for character counts is skipping UTF-8 continuation bytes (those matching 10xxxxxx) and counting the rest. If you look at the corresponding function in the wc sources they link to, it is doing far more work.
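For reference, the entire trick fits in a few lines (a sketch; the function name is mine):

```c
#include <stddef.h>

/* Count code points in well-formed UTF-8: every character contributes
 * exactly one byte that is NOT a continuation byte (10xxxxxx). */
size_t count_chars(const unsigned char *buf, size_t n) {
    size_t chars = 0;
    for (size_t i = 0; i < n; i++)
        chars += (buf[i] & 0xC0) != 0x80;  /* skip continuation bytes */
    return chars;
}
```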
I see calls to mbrtowc, which means they support non-UTF-8 locales, but I'd love to know, given UTF-8 input, what the semantic difference would be. Are there UTF-8 inputs for which the Haskell and C `wc -m` give different answers?
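Malformed input is one obvious candidate: the byte trick counts every non-continuation byte, while an mbrtowc loop has to decide what to do when decoding fails. I don't know offhand whether wc counts or skips such bytes, but here is a sketch (with my own error handling, not wc's) showing the two strategies disagreeing on a truncated sequence:

```c
#include <stdio.h>
#include <string.h>
#include <wchar.h>
#include <locale.h>

int main(void) {
    setlocale(LC_CTYPE, "en_US.UTF-8");  /* assumes this locale exists */

    /* 0xC3 is a two-byte lead with no continuation byte after it,
     * then 'a', then a stray continuation byte. */
    const unsigned char buf[] = { 0xC3, 'a', 0x80 };
    size_t n = sizeof buf;

    /* Strategy 1: count non-continuation bytes. Yields 2 here. */
    size_t trick = 0;
    for (size_t i = 0; i < n; i++)
        trick += (buf[i] & 0xC0) != 0x80;

    /* Strategy 2: decode with mbrtowc, skipping undecodable bytes.
     * Yields 1 here ('a' is the only valid character). */
    mbstate_t st;
    memset(&st, 0, sizeof st);
    size_t decoded = 0, i = 0;
    while (i < n) {
        size_t r = mbrtowc(NULL, (const char *)buf + i, n - i, &st);
        if (r == (size_t)-1 || r == (size_t)-2) {
            memset(&st, 0, sizeof st);  /* resync after a bad byte */
            i++;
        } else {
            decoded++;
            i += (r == 0) ? 1 : r;  /* r == 0 means an embedded NUL */
        }
    }
    printf("byte trick: %zu, mbrtowc: %zu\n", trick, decoded);
    return 0;
}
```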