> And could you elaborate on how “nearly every Unicode program handles them wrongly”?
A good chunk don't handle surrogate pairs correctly (or aren't even aware of them), and the rest get tripped up by the combining character issue. Even for those who understand it, there are no clear answers: "should a combining character sequence compare equal to a precomposed character?" And of course there are three levels of UCS support.
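To make those two failure modes concrete, here is a rough Python sketch. It is my own illustration, not taken from any particular program: the variable names are made up, and it only uses the standard `unicodedata` module. It shows a character outside the BMP needing two UTF-16 code units, and "é" written two ways comparing unequal until both sides are normalized.

```python
import unicodedata

# U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 needs a
# surrogate pair (two 16-bit code units) to encode this one code point.
emoji = "\U0001F600"
utf16_units = len(emoji.encode("utf-16-le")) // 2  # count of 16-bit units
code_points = len(emoji)                           # Python counts code points

# The same visible character "é", written two different ways.
precomposed = "\u00E9"   # single code point: LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# A naive comparison sees two different strings.
raw_equal = precomposed == combining

# Normalizing both sides to NFC first makes them compare equal.
nfc_equal = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", combining)
)
```

Code that indexes by UTF-16 code unit can split the surrogate pair in half, and code that compares strings without normalizing first will treat the two spellings of "é" as different. Whether they *should* be treated as different is exactly the question with no clear answer.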
The whole existence of an unnormalized form is a gigantic mistake that could have been easily avoided - simply make the unnormalized form an illegal sequence to begin with.
Unicode programming hasn't gotten as bad as timezone programming yet, but it is well on its way :-(