
C has been making strides towards complete Unicode support. I've been having trouble following along though: Am I correct in assuming that there's no actual multi-byte UTF-8 to UTF-32 Rune function and the best approximation depends on whatever wchar_t is? How would I best handle pure Unicode input and output scenarios on a "hostile" OS whose native character encoding is some EBCDIC abomination or a Windows codepage?



Converting arrays of UTF-8-encoded char to arrays of UTF-32-encoded 'runes' would probably not do what you want: it still leaves e.g. combining diacritical marks as separate codepoints from the characters they modify. If you care about breaking text up into codepoints, you probably also care about that sort of thing. The base unit of Unicode text is the extended grapheme cluster, and in order to actually split text into extended grapheme clusters you need a database that tells you what kind of codepoint each codepoint is. Since C is standardized less frequently than Unicode, any kind of Unicode or UTF support in the specification would quickly get out of date.
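To make the codepoint-vs-grapheme distinction concrete, here's a minimal sketch of the first step only, decoding one UTF-8 sequence into a codepoint; the function name and signature are my own, and note that "e" + U+0301 (combining acute) still comes out as two separate codepoints:

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (length n) into *out.
 * Returns bytes consumed, or 0 on invalid input.
 * Yields codepoints only -- grapheme clustering needs Unicode tables. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    if (n == 0) return 0;
    if (s[0] < 0x80) { *out = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0 && n >= 2 && (s[1] & 0xC0) == 0x80) {
        *out = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return *out >= 0x80 ? 2 : 0;           /* reject overlong forms */
    }
    if ((s[0] & 0xF0) == 0xE0 && n >= 3 &&
        (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80) {
        *out = ((uint32_t)(s[0] & 0x0F) << 12)
             | ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        /* reject overlong forms and UTF-16 surrogates */
        return (*out >= 0x800 && (*out < 0xD800 || *out > 0xDFFF)) ? 3 : 0;
    }
    if ((s[0] & 0xF8) == 0xF0 && n >= 4 &&
        (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) {
        *out = ((uint32_t)(s[0] & 0x07) << 18) | ((uint32_t)(s[1] & 0x3F) << 12)
             | ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return (*out >= 0x10000 && *out <= 0x10FFFF) ? 4 : 0;
    }
    return 0;
}
```

Splitting the resulting codepoint stream into extended grapheme clusters is where the per-codepoint property database (UAX #29) comes in.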


Probably link libicu rather than rely on libc.


libicu is a 40 MB mess when you need only about 5 KB of it. Only case folding and one normalization are needed, with tiny tables.

Additionally, you need to know the UNICODE_MAJOR and _MINOR versions in use. They are always years behind, and you never know which table versions are implemented.
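A rough sketch of the "tiny table" case-folding approach the comment alludes to; the table here is hypothetical and covers only a couple of Latin-1 letters, whereas a real one would be generated from Unicode's CaseFolding.txt:

```c
#include <stddef.h>
#include <stdint.h>

struct fold { uint32_t from, to; };

/* Hypothetical excerpt; a full table comes from CaseFolding.txt. */
static const struct fold fold_table[] = {
    { 0x00C0, 0x00E0 },   /* A-grave -> a-grave */
    { 0x00C9, 0x00E9 },   /* E-acute -> e-acute */
};

static uint32_t fold_cp(uint32_t c)
{
    if (c >= 'A' && c <= 'Z')
        return c + ('a' - 'A');                   /* ASCII fast path */
    for (size_t i = 0; i < sizeof fold_table / sizeof fold_table[0]; i++)
        if (fold_table[i].from == c)
            return fold_table[i].to;
    return c;                                     /* unmapped: identity */
}
```

The generated tables for full simple case folding really are small relative to all of ICU, which is the point being made.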


  -Wl,--gc-sections


Are you looking for mbstowcs() or mbtowc()?


wchar_t can be (a) not Unicode in any way, or (b) only 16 bits wide, insufficient to represent a rune.
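Which is why C11's <uchar.h> is the better answer to the original question: mbrtoc32() converts from the locale's multibyte encoding to char32_t, which is UTF-32 whenever the implementation defines __STDC_UTF_32__, and is always wide enough for any codepoint. A sketch, assuming a UTF-8 locale such as "C.UTF-8" is available; the wrapper function is my own:

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Return the first codepoint of s per the current locale's encoding,
 * or U+FFFD on an invalid/incomplete sequence.
 * Caller should have done e.g. setlocale(LC_ALL, "C.UTF-8") first. */
static char32_t first_codepoint(const char *s)
{
    mbstate_t st = {0};
    char32_t c = 0;
    size_t n = mbrtoc32(&c, s, strlen(s), &st);
    if (n == (size_t)-1 || n == (size_t)-2)
        return 0xFFFD;
    return c;
}
```

On an EBCDIC or Windows-codepage system this follows the locale's native encoding, so the usual workaround for "hostile" platforms is to bypass the locale entirely and decode UTF-8 yourself.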



