What do you mean by true utf8 parsing?

olliej · on Oct 16, 2019

All the Haskell version is doing is counting through bytes with the high bit set. If you look at the exciting function in the wc sources they link to, it is doing way more work.

_pvxk · on Oct 16, 2019

I see calls to mbrtowc, which means they support non-utf-8 locales, but I'd love to know, given utf-8, what the semantic difference would be. Are there utf-8 inputs for which the Haskell and C `wc -m` give different answers?

olliej · on Oct 17, 2019

You are correct, I was wrong about utf8. It is just doing a more correct decode than the trivial inline test that the Haskell version does.
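
To make the difference concrete, here is a rough sketch in C of the two styles of counting. It is not the code from either implementation. `count_non_continuation` is my guess at the kind of trivial byte test being described, and `count_strict` is a hand-rolled strict decoder, not `mbrtowc`. Both function names, and the policy of skipping one byte on an invalid sequence, are assumptions made for illustration.

```c
#include <stddef.h>

/* Trivial test (assumed): count every byte that is not a UTF-8
 * continuation byte (bit pattern 10xxxxxx). No validation at all. */
size_t count_non_continuation(const unsigned char *s, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) count++;
    }
    return count;
}

/* Stricter decode (hypothetical): count only well-formed sequences.
 * Rejects stray continuation bytes, truncated sequences, overlong
 * encodings, surrogates, and code points above U+10FFFF.
 * On an invalid sequence it skips a single byte and records it in
 * *invalid; that recovery policy is an assumption of this sketch. */
size_t count_strict(const unsigned char *s, size_t n, size_t *invalid) {
    size_t count = 0, bad = 0, i = 0;
    while (i < n) {
        unsigned char b = s[i];
        size_t len;
        unsigned long cp, min;

        if (b < 0x80)                { len = 1; cp = b;        min = 0; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
        else { bad++; i++; continue; } /* stray continuation or bad lead byte */

        int ok = (i + len <= n); /* sequence must not run past the input */
        for (size_t k = 1; ok && k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80) ok = 0;
            else cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        /* Reject overlong forms, out-of-range values, and surrogates. */
        if (ok && (cp < min || cp > 0x10FFFF ||
                   (cp >= 0xD800 && cp <= 0xDFFF))) {
            ok = 0;
        }

        if (ok) { count++; i += len; }
        else    { bad++;   i++; }
    }
    if (invalid) *invalid = bad;
    return count;
}
```

On well-formed input such as the six bytes of `"h\xC3\xA9llo"` both functions return 5. On the overlong pair `"\xC0\xAF"` the trivial test returns 1 while the strict decoder returns 0 and reports 2 invalid bytes. So in this sketch the two only disagree on malformed input; what the real `wc -m` prints for such input depends on its own error handling, which I am not claiming to reproduce here.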