
Because Unicode (not UTF-anything, Unicode itself) is, or became, a variable-width encoding: e.g. U+0078 U+0304 "x̄" is a single character, but two Unicode code points[0]. So encoding Unicode code points with a fixed-width encoding is completely useless, because your characters are still variable-width. It's also hazardous, since it increases how long it takes for bugs triggered by variable-width characters to surface, especially if you normalize to NFC, which composes the most common combining sequences into single code points.
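A rough Python sketch of the point above, using only the standard `unicodedata` module. The variable names are mine, and the "e" plus combining acute case is an extra illustration that is not in the comment; it is there only to show how NFC can hide the issue for common text.

```python
import unicodedata

# "x" followed by U+0304 COMBINING MACRON: one character to a reader.
x_macron = "x\u0304"
code_points = len(x_macron)                      # Python counts code points
utf32_bytes = len(x_macron.encode("utf-32-le"))  # fixed 4 bytes per code point

# Unicode has no precomposed "x with macron", so NFC leaves two code points.
x_macron_nfc = unicodedata.normalize("NFC", x_macron)

# Assumed extra example: "e" + U+0301 COMBINING ACUTE ACCENT does have a
# precomposed form (U+00E9), so NFC collapses it to one code point.
e_acute_nfc = unicodedata.normalize("NFC", "e\u0301")

print(code_points, utf32_bytes, len(x_macron_nfc), len(e_acute_nfc))
# prints: 2 8 2 1
```

So code that assumes "one UTF-32 unit is one character" would pass on NFC-normalized "é" and only break once something like "x̄" shows up.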

0: Similarly, U+01F1 "Ǳ" (DZ) is two characters, but one Unicode code point, which is much, much worse, as it means you can no longer treat encoded strings as concatenations of encoded characters. UTF-8 as such doesn't have this problem: any 'string' of code points can only be encoded as the concatenation of the encodings of its elements. But UTF-8 in practice does inherit the character-level version of this problem from Unicode.
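The footnote can be checked the same way. This is a minimal sketch with names of my own choosing; using NFKD to expose the two letters is my assumption about how to make the "two characters" visible, not something the footnote itself relies on.

```python
import unicodedata

dz = "\u01f1"  # U+01F1 LATIN CAPITAL LETTER DZ: a single code point
dz_name = unicodedata.name(dz)

# Compatibility decomposition splits it into the two letters "D" and "Z".
dz_compat = unicodedata.normalize("NFKD", dz)

# At the code point level, UTF-8 behaves well: encoding a concatenation
# gives the concatenation of the encodings.
a, b = "x\u0304", dz
concat_ok = (a + b).encode("utf-8") == a.encode("utf-8") + b.encode("utf-8")

print(len(dz), dz_name, dz_compat, len(dz_compat), concat_ok)
# prints: 1 LATIN CAPITAL LETTER DZ DZ 2 True
```

The encoding of "Ǳ" is not the encoding of "D" followed by the encoding of "Z", which is the character-level problem the footnote describes.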



