
Completely agreed, and this is a pretty nice overview of the problems with thinking of strings in terms of "characters" or "code points".

Graphemes are almost always the closest to what people mean when they say "character".
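To make the gap concrete, here is a minimal Python sketch (standard library only; the sample strings and the `unit_counts` helper are my own illustration, not from the article). It counts the same text in three different units. Python's standard library has no grapheme segmentation, so the "what a person sees" count appears only in the comments.

```python
def unit_counts(text: str) -> dict:
    """Count the same text in three different units (hypothetical helper)."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
    }

# "é" written as "e" + U+0301 COMBINING ACUTE ACCENT: one thing on screen.
decomposed_e = "e\u0301"
# Family emoji: MAN + ZWJ + WOMAN + ZWJ + GIRL, normally drawn as one image.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

print(unit_counts(decomposed_e))
# {'code_points': 2, 'utf8_bytes': 3, 'utf16_code_units': 2}
print(unit_counts(family))
# {'code_points': 5, 'utf8_bytes': 18, 'utf16_code_units': 8}
```

None of those numbers is 1, which is what most people would answer if asked how many "characters" each string contains.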

But you also shouldn't do logic based on graphemes, unless you're contributing to HarfBuzz and know enough to know exactly why this advice is wrong. Don't split or concatenate strings for any reason if you're doing internationalized stuff. E.g. ask your translators to give you a separate string for a drop-cap, and to remove it from the string that follows. Do not just pluck out the first grapheme, because the result could look like nonsense.
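A rough sketch of the most basic way this goes wrong, in Python (standard library only; `naive_drop_cap` and the sample word are hypothetical). It splits on the first code point, which is even cruder than splitting on the first grapheme, but the point stands either way: a grapheme-aware split would avoid this particular breakage and still not know what a sensible drop-cap is in a given language.

```python
import unicodedata
from typing import Tuple

def naive_drop_cap(text: str) -> Tuple[str, str]:
    """Split off the first code point. This is the approach being warned against."""
    return text[0], text[1:]

# "État" spelled with a decomposed É: "E" + U+0301 COMBINING ACUTE ACCENT.
word = "E\u0301tat"
cap, rest = naive_drop_cap(word)

# The drop-cap lost its accent...
print(repr(cap))
# ...and the remainder now starts with a stranded combining mark
# (a nonzero combining class means "attaches to the preceding character").
print(unicodedata.combining(rest[0]) != 0)
```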

---

You should literally never need or want to interact with Unicode directly, unless you're building the foundational layers of other systems (rendering, Unicode normalization, etc.). If you find that you have to, you're probably either losing encoding information somewhere or doing something fundamentally irrational, like splitting a string. If it's the former, that loss is the bug to fix; don't try to patch around it, you'll just cause weirder errors elsewhere. The never-ending pain you encounter while doing this stuff is a sign you're Doing It Wrong™ and should step back and question the basics, not a sign that it just needs one more fix to work correctly.

If you're doing single-language logging for developers or whatever? Yeah, go wild. Though watch out for irrationally chopped user input, ya gotta make sure your log analyzer won't choke on bad UTF-8.
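For the "won't choke on bad UTF-8" part, a minimal Python sketch (standard library only; `decode_log_line` and the sample bytes are made up for illustration). It assumes the analyzer reads raw bytes and that replacing invalid sequences with U+FFFD is acceptable for your logs, which may not hold everywhere.

```python
def decode_log_line(raw: bytes) -> str:
    """Decode a log line, replacing invalid UTF-8 with U+FFFD instead of raising."""
    return raw.decode("utf-8", errors="replace")

# "é" is 0xC3 0xA9 in UTF-8; chopping the last byte leaves a dangling lead byte.
chopped = "café".encode("utf-8")[:-1]

try:
    chopped.decode("utf-8")  # strict decoding rejects the truncated sequence
except UnicodeDecodeError as exc:
    print("strict decode failed:", exc.reason)

# The tolerant decoder keeps the readable prefix and marks the damage.
print(repr(decode_log_line(chopped)))
```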




> Graphemes are almost always the closest to what people mean when they say "character".

But which glyphs[1] are considered graphemes and which are composites varies according to the individual writing system, which is a finer level than Unicode deals with.

In French, é, è, and à are all single graphemes; they are three different vowels and those are their written forms. It's just a coincidence that é and è appear to share an "e" element, and it's also a coincidence that è and à appear to share a diacritic marking. The grave accent has no meaning in isolation, and while the "e" does, its meaning is separate from and unrelated to the meaning of either é or è.

But in Hanyu Pinyin, é is two graphemes, è is two graphemes, and à is two graphemes. (With a total of four graphemes being represented across the set.) É and è share their underlying vowel, while è and à express different vowels but share their underlying tone.

You could demand that strings representing Chinese always use U+0301 COMBINING ACUTE ACCENT alongside U+0065 LATIN SMALL LETTER E, while strings representing French must always use U+00E9 LATIN SMALL LETTER E WITH ACUTE. But that would be insane. Unicode can't generally distinguish graphemes from composite glyphs and doesn't want to.
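As a small illustration in Python (standard library only; variable names are mine): Unicode treats the precomposed and combining spellings of é as canonically equivalent, and normalization converts freely between them. Nothing in either form records whether the text is French or Pinyin.

```python
import unicodedata

precomposed = "\u00e9"  # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# Different code point sequences, so a plain comparison fails.
print(precomposed == decomposed)  # False

# NFC composes, NFD decomposes; after either, the two spellings are identical.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```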

[1] In this comment, I will define "glyph" to be the unit of output produced by a font. You put some kind of binary data into the font and get image data out; anything that is considered a single indivisible unit of output from the font is a "glyph".


Language / Unicode / encoding interaction details are one of those things where I pretty much always learn something whenever they come up. Including from your comment (thanks!).

It's fascinating how incredibly widely varied every possible pattern / "rule" is. There's just no end to the edge cases.

Don't split or concatenate strings! Period! Ask your translators to do it, they actually have some idea of what is linguistically and socially acceptable!



