Completely agreed, and this is a pretty nice overview of the problems with think...

thaumasiotes · on June 27, 2022

> Graphemes are almost always the closest to what people mean when they say "character".

But which glyphs[1] are considered graphemes and which are composites varies according to the individual writing system, which is a finer level than Unicode deals with.

In French, é, è, and à are all single graphemes; they are three different vowels and those are their written forms. It's just a coincidence that é and è appear to share an "e" element, and it's also a coincidence that è and à appear to share a diacritic marking. The grave accent has no meaning in isolation, and while the "e" does, its meaning is separate from and unrelated to the meaning of either é or è.

But in Hanyu Pinyin, é is two graphemes, è is two graphemes, and à is two graphemes. (With a total of four graphemes being represented across the set.) É and è share their underlying vowel, while è and à express different vowels but share their underlying tone.

You could demand that strings representing Chinese always use U+0301 COMBINING ACUTE ACCENT alongside U+0065 LATIN SMALL LETTER E, while strings representing French must always use U+00E9 LATIN SMALL LETTER E WITH ACUTE. But that would be insane. Unicode can't generally distinguish graphemes from composite glyphs and doesn't want to.
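To make the two encodings concrete, here is a minimal Python sketch using the standard library's `unicodedata` module. The variable names are just for illustration. It shows that the precomposed and decomposed forms are different code point sequences, and that normalization treats them as equivalent without knowing or caring which writing system the string belongs to.

```python
import unicodedata

precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# Different code point sequences, so a naive comparison says they differ.
assert precomposed != decomposed
assert len(precomposed) == 1
assert len(decomposed) == 2

# NFC composes where a precomposed code point exists; NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
assert nfc == precomposed
assert nfd == decomposed
```

Nothing here tells you whether the é is one French grapheme or a Pinyin vowel plus a tone mark; the code points are the same either way.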

[1] In this comment, I will define "glyph" to be the unit of output produced by a font. You put some kind of binary data into the font and get image data out; anything that is considered a single indivisible unit of output from the font is a "glyph".

Groxx · on June 27, 2022

Language / Unicode / encoding interaction details are one of those things where I pretty much always learn something whenever they come up. Including from your comment (thanks!).

It's fascinating how incredibly widely varied every possible pattern / "rule" is. There's just no end to the edge cases.

Don't split or concatenate strings! Period! Ask your translators to do it, they actually have some idea of what is linguistically and socially acceptable!
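A rough sketch of what that looks like in code, assuming a hypothetical in-memory message catalog (`MESSAGES` and `render` are made up for this example, not from any real i18n library). Each entry is a whole sentence supplied by a translator, so word order is theirs to decide. Plural handling is deliberately left out.

```python
# Hypothetical catalog: one complete, translator-written sentence per locale.
MESSAGES = {
    "en": "{user} uploaded {count} files",
    "de": "{user} hat {count} Dateien hochgeladen",
}


def render(locale: str, user: str, count: int) -> str:
    """Fill a whole-sentence template instead of gluing fragments together."""
    return MESSAGES[locale].format(user=user, count=count)


# The fragile alternative bakes English word order into the code:
#   user + " uploaded " + str(count) + " files"
# In German the verb ends up after the object, so no fragment-by-fragment
# translation of that expression comes out right.
print(render("en", "Ana", 3))
print(render("de", "Ana", 3))
```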