This is a pretty good example of why I don't like Haskell:
* The claim is that Haskell is better because it's simpler, yet the final code is decidedly not simple.
* The claim is that you don't need to know about machine behavior, yet the example required numerous manual interventions: explicit inlining, explicit strictness annotations, and avoiding the "obvious" data structures.
After all that contortion, the only reason it was able to beat the C version at all was by using multiple cores: on a 4-core machine it was only ~60% faster, and around 80% faster when parsing UTF-8. Worse, the comparison isn't apples to apples: the C implementation does true UTF-8 parsing via libc (so it isn't inlined), whereas the "faster" Haskell code only counts the first byte of each multibyte sequence.
I would argue that, sans the special cases (which the example doesn't trigger), the C version is the obvious first-pass implementation.
Here is my super dumb implementation, sketched below. It doesn't handle multibyte characters or stdin -- but I don't think the Haskell version did either, so I don't have a problem with that limitation.
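Something along these lines -- plain getc over a named file, counting bytes, words, and lines, nothing more:

```c
/* Minimal wc: bytes, words, lines. No multibyte handling, no stdin. */
#include <stdio.h>
#include <ctype.h>

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "rb");
    if (!f) {
        perror(argv[1]);
        return 1;
    }

    long bytes = 0, words = 0, lines = 0;
    int in_word = 0, c;
    while ((c = getc(f)) != EOF) {
        bytes++;
        if (c == '\n')
            lines++;
        if (isspace(c))
            in_word = 0;
        else if (!in_word) {
            in_word = 1;
            words++;  /* a word starts at its first non-space byte */
        }
    }
    printf("%ld %ld %ld %s\n", lines, words, bytes, argv[1]);
    fclose(f);
    return 0;
}
```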
All the Haskell version is doing for character counts is skipping UTF-8 continuation bytes (those matching 10xxxxxx) and counting the rest. If you look at the corresponding function in the wc sources they link to, it is doing far more work.
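For reference, the entire trick fits in a few lines (a sketch; the function name is mine):

```c
#include <stddef.h>

/* Count code points in well-formed UTF-8: every character contributes
 * exactly one byte that is NOT a continuation byte (10xxxxxx). */
size_t count_chars(const unsigned char *buf, size_t n) {
    size_t chars = 0;
    for (size_t i = 0; i < n; i++)
        chars += (buf[i] & 0xC0) != 0x80;  /* skip continuation bytes */
    return chars;
}
```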
I see calls to mbrtowc, which means they support non-UTF-8 locales, but I'd love to know, given UTF-8 input, what the semantic difference would be. Are there UTF-8 inputs for which the Haskell and C `wc -m` give different answers?
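Malformed input is one obvious candidate: the byte trick counts every non-continuation byte, while an mbrtowc loop has to decide what to do when decoding fails. I don't know offhand whether wc counts or skips such bytes, but here is a sketch (with my own error handling, not wc's) showing the two strategies disagreeing on a truncated sequence:

```c
#include <stdio.h>
#include <string.h>
#include <wchar.h>
#include <locale.h>

int main(void) {
    setlocale(LC_CTYPE, "en_US.UTF-8");  /* assumes this locale exists */

    /* 0xC3 is a two-byte lead with no continuation byte after it,
     * then 'a', then a stray continuation byte. */
    const unsigned char buf[] = { 0xC3, 'a', 0x80 };
    size_t n = sizeof buf;

    /* Strategy 1: count non-continuation bytes. Yields 2 here. */
    size_t trick = 0;
    for (size_t i = 0; i < n; i++)
        trick += (buf[i] & 0xC0) != 0x80;

    /* Strategy 2: decode with mbrtowc, skipping undecodable bytes.
     * Yields 1 here ('a' is the only valid character). */
    mbstate_t st;
    memset(&st, 0, sizeof st);
    size_t decoded = 0, i = 0;
    while (i < n) {
        size_t r = mbrtowc(NULL, (const char *)buf + i, n - i, &st);
        if (r == (size_t)-1 || r == (size_t)-2) {
            memset(&st, 0, sizeof st);  /* resync after a bad byte */
            i++;
        } else {
            decoded++;
            i += (r == 0) ? 1 : r;  /* r == 0 means an embedded NUL */
        }
    }
    printf("byte trick: %zu, mbrtowc: %zu\n", trick, decoded);
    return 0;
}
```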