
The precomposed characters only exist for compatibility with existing character sets and encodings. If you don't want to deal with them in your code, just normalize to NFD and they're gone. If Unicode hadn't cared about compatibility with legacy character sets at all, adoption would have gone very differently, I guess. By now it's probably a moot point since not supporting Unicode is foolish at best, but in the early 90s things were very different.
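
For example, a minimal Python sketch of that normalization step, using the standard unicodedata module:

    import unicodedata

    s = "caf\u00e9"                            # 'é' as the precomposed code point U+00E9
    nfd = unicodedata.normalize("NFD", s)      # decomposes to 'e' + U+0301 COMBINING ACUTE ACCENT
    print([hex(ord(c)) for c in s])            # [..., '0xe9']
    print([hex(ord(c)) for c in nfd])          # [..., '0x65', '0x301']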

As for diacritics, it depends on what you care about when precomposing them. Actual usage in scripts that are in use today? Then it's only a handful, and the worst cases are probably Vietnamese and Ancient Greek, which have a bunch of characters carrying more than one diacritic.
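
A quick illustration in Python (same unicodedata module as above) of a Vietnamese letter carrying two diacritics:

    import unicodedata

    # ệ (U+1EC7) carries a circumflex and a dot below; NFD yields the base
    # letter plus two combining marks, in canonical order.
    print([hex(ord(c)) for c in unicodedata.normalize("NFD", "\u1ec7")])
    # ['0x65', '0x323', '0x302']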

However, the current system with composable diacritics gives you plenty of flexibility: need a character with a diacritic that isn't used in any current language? Just compose them and you've got it. Font support may be spotty (note that Unicode and font support are completely separate things – bashing Unicode for bad fonts is a fairly useless endeavour), but at least you can represent that grapheme in text without resorting to embedding images or overlaying glyphs by other means (cf. TeX), neither of which interoperates with other applications.
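
A Python sketch of that; the particular combination is just an arbitrary example with no precomposed equivalent:

    import unicodedata

    # 'q' with a combining tilde has no precomposed code point in Unicode,
    # but the sequence is still perfectly valid text.
    q_tilde = "q\u0303"                        # 'q' + U+0303 COMBINING TILDE
    print(q_tilde)                             # q̃ (how well it renders depends on the font)
    print(unicodedata.normalize("NFC", q_tilde) == q_tilde)   # True: nothing to compose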

It also means that if some language now develops a script based on, say, Latin, and invents a new diacritic that can go on different vowels, you'd only have to encode a single new code point, not five or six of them. It scales far better and also isn't tied to any specific writing system. I can use ´ on a or on ω and it works the same.
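
For instance (again in Python), the same combining acute accent works on a Latin and a Greek base character alike:

    # U+0301 COMBINING ACUTE ACCENT attached to Latin 'a' and Greek 'ω':
    # each result is two code points, base letter + combining mark.
    for base in ("a", "\u03c9"):
        print(base + "\u0301")                 # á, then ώ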

And could you elaborate on how “nearly every Unicode program handles them wrongly”? I'd argue that most programs coming into contact with Unicode do little more than pass it along without caring about the contents at all. And trying to shoehorn human language into something an average programmer can handle without error is likely impossible. Language is complex, writing is complex; Unicode is complex as a result. This doesn't only apply to text, mind you; there are lots of complex things that are often implemented naïvely or wrongly by programmers who don't know any better. That usually means those programs are broken and their authors should have known better – not that we should try adjusting the world to broken programs.




> And could you elaborate on how “nearly every Unicode program handles them wrongly”?

A good chunk don't handle surrogate pairs correctly (or aren't even aware of them), and the rest get tripped up by the combining-character issue. Even for those who understand it, there are no clear answers: "should a combining-character sequence compare equal to a precomposed one?" And of course there are three levels of UCS support.
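
A small Python sketch of both issues:

    import unicodedata

    # Combining vs. precomposed: the two forms are canonically equivalent,
    # but a naive comparison says they differ until you normalize.
    precomposed = "\u00e9"                     # é as a single code point
    combining   = "e\u0301"                    # 'e' + COMBINING ACUTE ACCENT
    print(precomposed == combining)                            # False
    print(unicodedata.normalize("NFC", precomposed) ==
          unicodedata.normalize("NFC", combining))             # True

    # Surrogate pairs: a single code point outside the BMP becomes two
    # UTF-16 code units, which UTF-16-based APIs often miscount.
    smiley = "\U0001F600"
    print(len(smiley))                                         # 1 code point
    print(len(smiley.encode("utf-16-le")) // 2)                # 2 UTF-16 code units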

The very existence of unnormalized forms is a gigantic mistake that could easily have been avoided: simply declare unnormalized sequences illegal to begin with.

Unicode programming hasn't gotten as bad as timezone programming yet, but it's well on its way :-(



