
Others have talked about the history of UTF-16. I'll focus on that last part: You must not write 32-bit wide characters in UTF-8.

Unicode / ISO 10646 is specifically defined to only have code points from 0 to 0x10FFFF. As a result UTF-8 that would decode outside that range is just invalid, no different from if it was 0xFF bytes or something.

It also doesn't make sense to write UTF-8 that decodes as U+D800 through U+DFFF. Although these code points exist, the standard specifically reserves them to make UTF-16 work, and you're not using UTF-16.
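
For what it's worth, here is a rough Python sketch of those two rules. `is_unicode_scalar` and `strict_decode_ok` are hypothetical helper names made up for illustration; the second one only shows that Python's built-in strict UTF-8 decoder appears to enforce the same limits.

```python
def is_unicode_scalar(cp: int) -> bool:
    """True if cp is in 0..0x10FFFF and is not a UTF-16 surrogate."""
    in_range = 0 <= cp <= 0x10FFFF
    is_surrogate = 0xD800 <= cp <= 0xDFFF
    return in_range and not is_surrogate


def strict_decode_ok(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


samples = [
    b"\xf4\x8f\xbf\xbf",  # U+10FFFF, the last code point
    b"\xf4\x90\x80\x80",  # would decode to 0x110000, one past the limit
    b"\xed\xa0\x80",      # would decode to U+D800, a surrogate
    b"\xff",              # never a legal byte in UTF-8
]
for raw in samples:
    print(raw.hex(), strict_decode_ok(raw))
```

Only the first sample should be accepted; the other three fail for the reasons given in the comments.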




> You must not write 32-bit wide characters in UTF-8.

You can't tell me what to do, dad. I'll encode 64 bits and you can't stop me! Bwahahahaa!

    $ perl -MEncode=encode_utf8 -e'print encode_utf8 "\x{7fff_ffff_ffff_ffff}"' | hex
    0000  ff 80 87 bf bf bf bf bf  bf bf bf bf bf           ÿ␀␇¿¿¿¿¿¿¿¿¿¿
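
If anyone wants to check what that dump holds, here is a small Python sketch. It assumes the layout is an FF lead byte followed by twelve continuation bytes carrying 6 bits each; that is inferred from the bytes in the dump, not taken from Perl's documentation, and `decode_extended` is a made-up name.

```python
dump = bytes.fromhex("ff 80 87 bf bf bf bf bf bf bf bf bf bf")


def decode_extended(seq: bytes) -> int:
    """Fold the low 6 bits of each continuation byte into one integer."""
    lead, tail = seq[0], seq[1:]
    if lead != 0xFF:
        raise ValueError("this sketch only handles the FF lead byte")
    value = 0
    for b in tail:
        if b & 0xC0 != 0x80:
            raise ValueError("expected a 10xxxxxx continuation byte")
        value = (value << 6) | (b & 0x3F)
    return value


value = decode_extended(dump)
print(hex(value), value.bit_length())
```

Under that assumption the bytes decode back to 0x7fffffffffffffff.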


To be fair, that actually isn't valid UTF-8 - the leading byte has no zero bit. The largest valid UTF-8 encoding is FE BF BF BF BF BF BF, with value U+F'FFFF'FFFF. (In fact, the original specification only listed up to FD BF BF BF BF BF (U+7FFF'FFFF).)

Furthermore, even if you assume an implied zero bit at position -1, that would only be FF BF BF BF BF BF BF BF, with value U+3FF'FFFF'FFFF.

Also 7FFF'FFFF'FFFF'FFFF is only 63 bits - fer chrissakes son, learn to count.
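
The arithmetic above can be sketched in a few lines of Python. This assumes the usual pattern where an n-byte sequence has a lead byte of n one-bits, a zero terminator if there is room for one, and whatever bits remain as payload, plus 6 payload bits per continuation byte. `max_value` is a hypothetical helper, not part of any spec.

```python
def max_value(total_bytes: int) -> int:
    """Largest value an n-byte sequence can carry under the pattern above."""
    if total_bytes < 2:
        raise ValueError("single-byte sequences follow a different pattern")
    # Lead byte: total_bytes one-bits, then a zero bit, then payload.
    # For 7 and 8 bytes nothing is left over for payload.
    lead_payload_bits = max(0, 7 - total_bytes)
    total_bits = lead_payload_bits + 6 * (total_bytes - 1)
    return (1 << total_bits) - 1


print(hex(max_value(6)))  # FD BF BF BF BF BF
print(hex(max_value(7)))  # FE BF BF BF BF BF BF
print(hex(max_value(8)))  # FF plus seven BF bytes, zero bit implied
print((0x7FFF_FFFF_FFFF_FFFF).bit_length())
```

The three hex values should match the figures quoted above, and the last line gives the bit count of 7FFF'FFFF'FFFF'FFFF.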


> As a result UTF-8 that would decode outside that range is just invalid, no different from if it was 0xFF bytes or something.

That's needlessly pedantic. If you use an old version of the spec, those bytes are valid.

And "have the capability" seems to me to be talking about what the underlying method is able to do, not the full set of "must not" rules.



