
It's interesting how history seems to have repeated itself with UTF-16. With ASCII and its extensions, we had 128 "normal" characters and everything else was exotic text that caused problems.

Now with UTF-16, the "normal" characters are the ones in the Basic Multilingual Plane, which fit in a single UTF-16 code unit. Everything outside it needs a surrogate pair.
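
To make that concrete, here is a quick Python sketch (the helper name `utf16_code_units` is mine, not from any library) that lists the 16-bit units UTF-16 uses for a character inside the Basic Multilingual Plane and for one outside it:

```python
def utf16_code_units(text: str) -> list:
    """Return the UTF-16 code units of `text` as integers (hypothetical helper)."""
    data = text.encode("utf-16-le")  # little-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

bmp_char = "\u00e9"         # e with acute accent, inside the BMP
astral_char = "\U0001F600"  # grinning face emoji, outside the BMP

print([hex(u) for u in utf16_code_units(bmp_char)])     # ['0xe9']
print([hex(u) for u in utf16_code_units(astral_char)])  # ['0xd83d', '0xde00']
```

The second result is a surrogate pair: two code units that only mean something together.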




It's worse. With UTF-8, if you're not processing it properly, it becomes obvious very quickly: the first accented character you encounter breaks. With UTF-16 you probably won't notice any bugs until someone throws an emoji at you.
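
A rough illustration in Python, assuming the bug in question is truncating a string by raw bytes or code units instead of by characters (the function names are made up for this sketch):

```python
def truncate_utf8_bytes(text: str, n: int) -> bytes:
    """Naively keep the first n bytes of the UTF-8 encoding."""
    return text.encode("utf-8")[:n]

def truncate_utf16_units(text: str, n: int) -> bytes:
    """Naively keep the first n 16-bit code units of the UTF-16 encoding."""
    return text.encode("utf-16-le")[:2 * n]

def is_valid(data: bytes, encoding: str) -> bool:
    """True if `data` decodes cleanly in `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

accented = "\u00e9!"     # accented letter followed by '!'
emoji = "\U0001F600!"    # emoji followed by '!'

# UTF-8: the bug shows up with the first accented character.
print(is_valid(truncate_utf8_bytes(accented, 1), "utf-8"))      # False

# UTF-16: accented text passes, so the bug stays hidden...
print(is_valid(truncate_utf16_units(accented, 1), "utf-16-le"))  # True

# ...until a character outside the BMP is cut in the middle of its surrogate pair.
print(is_valid(truncate_utf16_units(emoji, 1), "utf-16-le"))     # False
```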


Unfortunately not. It's easy to process UTF-8 such that you mishandle certain ill-formed sequences that you are unlikely to encounter accidentally. IIS was hit [1], Apache Tomcat was hit [2], PHP was hit twice [3] [4].

UTF-16 has its own warts, but invalid code units and non-shortest forms are exclusive to UTF-8.

[1] http://www.sans.org/security-resources/malwarefaq/wnt-unicod...

[2] http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2008-2938

[3] https://www.cvedetails.com/cve/CVE-2009-5016/

[4] https://www.cvedetails.com/cve/CVE-2010-3870/
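
For anyone unfamiliar with non-shortest forms, here is a minimal Python sketch of the general shape of the problem. It is not a reconstruction of any of the bugs linked above, and `naive_decode_2byte` is a deliberately broken decoder written for illustration. It accepts the two-byte sequence C0 AF and turns it into '/', even though '/' must be encoded as the single byte 2F; a strict decoder rejects it.

```python
def naive_decode_2byte(b0: int, b1: int) -> str:
    """Decode a 2-byte UTF-8 sequence WITHOUT rejecting non-shortest forms (buggy on purpose)."""
    return chr(((b0 & 0x1F) << 6) | (b1 & 0x3F))

def strict_decodes(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts `data`."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

overlong_slash = bytes([0xC0, 0xAF])  # non-shortest encoding of U+002F

print(repr(naive_decode_2byte(*overlong_slash)))  # '/'
print(strict_decodes(overlong_slash))             # False
print(strict_decodes(b"/"))                       # True
```

If a filter checks the raw bytes for '/' before a lenient decoder like this one runs, the slash gets past the check.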



