
An excellent article, although:

> “Ü” is a single grapheme cluster, even though it’s composed of two codepoints: U+0055 UPPER-CASE U followed by U+0308 COMBINING DIAERESIS.

would be a great opportunity to talk about normalization forms, because there’s also a single code point version: U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS.
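To make that concrete, here is a minimal sketch using Python's standard `unicodedata` module. The variable names are just illustrative. It shows that the two spellings compare unequal as raw strings, and that normalizing to NFC (composed) or NFD (decomposed) makes them match.

```python
import unicodedata

decomposed = "U\u0308"   # U+0055 followed by U+0308 COMBINING DIAERESIS
precomposed = "\u00dc"   # U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS

# Both render as the same grapheme, but the code point sequences differ.
raw_equal = decomposed == precomposed

# NFC composes where a precomposed code point exists; NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

print(raw_equal)                          # False
print(nfc == precomposed)                 # True
print(nfd == decomposed)                  # True
print(len(decomposed), len(precomposed))  # 2 1
```

So comparing user-supplied strings is usually safer after normalizing both sides to the same form.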




Does anyone know the history behind why there are two ways to “encode” things like that? What’s the rationale for having both combining and precomposed code points?


I believe a lot of the precomposed characters were (basically) imported directly from old code pages, so that converting to and from the various code pages in use would be a simple formula.

I may be wrong, however.
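As a rough illustration of what a "simple formula" can look like, take ISO-8859-1 (Latin-1) as one example code page. This is my choice of example, not a claim about every legacy encoding. Its byte values map one-to-one onto the first 256 Unicode code points, so the byte 0xDC becomes U+00DC directly, while the decomposed form cannot be written back as Latin-1 at all.

```python
import unicodedata

legacy_byte = bytes([0xDC])               # "Ü" in ISO-8859-1 (Latin-1)
decoded = legacy_byte.decode("latin-1")   # one byte -> one code point
code_point = ord(decoded)

# Converting back is just as direct for the precomposed form.
round_tripped = decoded.encode("latin-1")

# The decomposed form contains U+0308, which has no Latin-1 byte.
decomposed = unicodedata.normalize("NFD", decoded)
try:
    decomposed.encode("latin-1")
    decomposed_fits = True
except UnicodeEncodeError:
    decomposed_fits = False

print(hex(code_point))               # 0xdc
print(round_tripped == legacy_byte)  # True
print(decomposed_fits)               # False
```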



