
C has been making strides towards complete Unicode support. I've been having trouble following along though: Am I correct in assuming that there's no actual multi-byte UTF-8 to UTF-32 Rune function and the best approximation depends on whatever wchar_t is? How would I best handle pure Unicode input and output scenarios on a "hostile" OS whose native character encoding is some EBCDIC abomination or a Windows codepage?



Converting arrays of UTF-8-encoded char to arrays of UTF-32-encoded 'runes' would probably not do what you want: it still leaves e.g. combining diacritical marks as separate codepoints from the characters they modify. If you care about breaking text up into codepoints, you probably also care about that sort of thing. The base unit of Unicode text is the extended grapheme cluster, and in order to actually split text into extended grapheme clusters you need a database that tells you what kind of codepoint each codepoint is. Since C is standardized less frequently than Unicode, any kind of Unicode or UTF support in the specification would quickly get out of date.
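To make the codepoint-vs-grapheme distinction concrete, here's a minimal sketch of the first step only, decoding one UTF-8 sequence into a codepoint; the function name and signature are my own, and note that "e" + U+0301 (combining acute) still comes out as two separate codepoints:

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (length n) into *out.
 * Returns bytes consumed, or 0 on invalid input.
 * Yields codepoints only -- grapheme clustering needs Unicode tables. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    if (n == 0) return 0;
    if (s[0] < 0x80) { *out = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0 && n >= 2 && (s[1] & 0xC0) == 0x80) {
        *out = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return *out >= 0x80 ? 2 : 0;           /* reject overlong forms */
    }
    if ((s[0] & 0xF0) == 0xE0 && n >= 3 &&
        (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80) {
        *out = ((uint32_t)(s[0] & 0x0F) << 12)
             | ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        /* reject overlong forms and UTF-16 surrogates */
        return (*out >= 0x800 && (*out < 0xD800 || *out > 0xDFFF)) ? 3 : 0;
    }
    if ((s[0] & 0xF8) == 0xF0 && n >= 4 &&
        (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) {
        *out = ((uint32_t)(s[0] & 0x07) << 18) | ((uint32_t)(s[1] & 0x3F) << 12)
             | ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return (*out >= 0x10000 && *out <= 0x10FFFF) ? 4 : 0;
    }
    return 0;
}
```

Splitting the resulting codepoint stream into extended grapheme clusters is where the per-codepoint property database (UAX #29) comes in.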


Probably link libicu rather than rely on libc.


libicu is a 40 MB mess when you need only about 5 KB of it. Only case folding and one normalization are needed, with tiny tables.

Additionally, you need to know the UNICODE_MAJOR and _MINOR versions in use. They are always years behind, and you never know which table versions are implemented.
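A rough sketch of the "tiny table" case-folding approach the comment alludes to; the table here is hypothetical and covers only a couple of Latin-1 letters, whereas a real one would be generated from Unicode's CaseFolding.txt:

```c
#include <stddef.h>
#include <stdint.h>

struct fold { uint32_t from, to; };

/* Hypothetical excerpt; a full table comes from CaseFolding.txt. */
static const struct fold fold_table[] = {
    { 0x00C0, 0x00E0 },   /* A-grave -> a-grave */
    { 0x00C9, 0x00E9 },   /* E-acute -> e-acute */
};

static uint32_t fold_cp(uint32_t c)
{
    if (c >= 'A' && c <= 'Z')
        return c + ('a' - 'A');                   /* ASCII fast path */
    for (size_t i = 0; i < sizeof fold_table / sizeof fold_table[0]; i++)
        if (fold_table[i].from == c)
            return fold_table[i].to;
    return c;                                     /* unmapped: identity */
}
```

The generated tables for full simple case folding really are small relative to all of ICU, which is the point being made.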


  -Wl,--gc-sections


Are you looking for mbstowcs() or mbtowc()?


wchar_t can be (a) not Unicode in any way, or (b) only 16 bits wide, insufficient to represent a rune.
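Which is why C11's <uchar.h> is the better answer to the original question: mbrtoc32() converts from the locale's multibyte encoding to char32_t, which is UTF-32 whenever the implementation defines __STDC_UTF_32__, and is always wide enough for any codepoint. A sketch, assuming a UTF-8 locale such as "C.UTF-8" is available; the wrapper function is my own:

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Return the first codepoint of s per the current locale's encoding,
 * or U+FFFD on an invalid/incomplete sequence.
 * Caller should have done e.g. setlocale(LC_ALL, "C.UTF-8") first. */
static char32_t first_codepoint(const char *s)
{
    mbstate_t st = {0};
    char32_t c = 0;
    size_t n = mbrtoc32(&c, s, strlen(s), &st);
    if (n == (size_t)-1 || n == (size_t)-2)
        return 0xFFFD;
    return c;
}
```

On an EBCDIC or Windows-codepage system this follows the locale's native encoding, so the usual workaround for "hostile" platforms is to bypass the locale entirely and decode UTF-8 yourself.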



