
What do you mean by true utf8 parsing?



All the Haskell version is doing is counting through bytes with the high bit set. If you look at the exciting function in the wc sources they link to, it is doing way more work.


I see calls to mbrtowc, which means they support non-utf-8 locales, but I'd love to know, given utf-8, what the semantic difference would be. Are there utf-8 inputs for which the Haskell and C `wc -m` give different answers?


You are correct, I was wrong about UTF-8. The C version is just doing a more correct decode than the trivial inline test in the Haskell version.
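To make the difference concrete, here is a rough Python sketch, not taken from either program. It assumes the "trivial test" amounts to counting every byte that is not a UTF-8 continuation byte (10xxxxxx), and it uses Python's strict decoder as a stand-in for a full decode; it makes no claim about what mbrtowc-based code reports on malformed input. The function names and sample inputs are made up for illustration.

```python
def count_by_lead_bytes(data: bytes) -> int:
    # Assumed shape of the trivial test: count each byte that is NOT a
    # continuation byte (top two bits 10). No validation is performed.
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def count_by_strict_decode(data: bytes) -> int:
    # Full decode: raises UnicodeDecodeError on malformed UTF-8.
    return len(data.decode("utf-8"))


valid = "naïve café ☃".encode("utf-8")  # well-formed, mixed 1/2/3-byte chars
stray = b"ab\x80cd"                      # lone continuation byte
truncated = b"ab\xe2\x98"                # 3-byte sequence missing its last byte

for label, data in [("valid", valid), ("stray", stray), ("truncated", truncated)]:
    try:
        strict = count_by_strict_decode(data)
    except UnicodeDecodeError:
        strict = "invalid UTF-8"
    print(label, count_by_lead_bytes(data), strict)
```

On well-formed UTF-8 the two counts agree, since every code point has exactly one non-continuation byte. They can only diverge on malformed input, where the byte-level test still produces a number and a real decoder has to decide what to do with the bad bytes.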





