Others have talked about the history of UTF-16. I'll focus on that last part: Yo...

bmn__ · on May 29, 2021

> You must not write 32-bit wide characters in UTF-8.

You can't tell me what to do, dad. I'll encode 64 bits and you can't stop me! Bwahahahaa!

    $ perl -MEncode=encode_utf8 -e'print encode_utf8 "\x{7fff_ffff_ffff_ffff}"' | hex
    0000  ff 80 87 bf bf bf bf bf  bf bf bf bf bf           ÿ␀␇¿¿¿¿¿¿¿¿¿¿
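
For anyone wanting to check what those 13 bytes carry, here is a minimal sketch. It assumes the layout the dump suggests: an FF lead byte followed by 12 continuation bytes, each contributing its low six bits. `decode_ff_form` is a hypothetical helper name, not part of Perl or of any library.

```python
# The 13 bytes from the hex dump above.
dump = bytes.fromhex("ff 80 87 bf bf bf bf bf bf bf bf bf bf")


def decode_ff_form(data: bytes) -> int:
    """Hypothetical helper: undo an FF + 12 continuation bytes layout."""
    if len(data) != 13 or data[0] != 0xFF:
        raise ValueError("expected FF followed by 12 continuation bytes")
    value = 0
    for byte in data[1:]:
        # Continuation bytes look like 10xxxxxx; keep only the six x bits.
        if byte & 0xC0 != 0x80:
            raise ValueError(f"not a continuation byte: {byte:#04x}")
        value = (value << 6) | (byte & 0x3F)
    return value


print(hex(decode_ff_form(dump)))  # 0x7fffffffffffffff
```

The 12 continuation bytes hold 72 bits, so the first nine bits (all of 80 and the top half of 87) are zero padding.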

a1369209993 · on May 29, 2021

To be fair, that actually isn't valid UTF-8 - the leading byte has no zero bit. The largest valid UTF-8 encoding is FE BF BF BF BF BF BF, with value U+F'FFFF'FFFF. (In fact, the original specification only listed up to FD BF BF BF BF BF (U+7FFF'FFFF).)

Furthermore, even if you assume an implied zero bit at position -1, that would only be FF BF BF BF BF BF BF BF, with value U+3FF'FFFF'FFFF.

Also 7FFF'FFFF'FFFF'FFFF is only 63 bits - fer chrissakes son, learn to count.
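
The arithmetic behind those limits can be sketched as follows. This assumes the generalized pattern described above: a lead byte of N one-bits, then a zero bit, then payload, with each continuation byte adding six bits, and the zero bit "implied" for the FF case. `payload_bits` is a hypothetical helper name.

```python
def payload_bits(length: int) -> int:
    """Payload bits in a `length`-byte sequence under the generalized pattern.

    Assumption: for length 8 the lead byte is FF and carries no payload,
    i.e. the terminating zero bit is implied rather than stored.
    """
    if length == 1:
        return 7  # 0xxxxxxx
    if not 2 <= length <= 8:
        raise ValueError("length must be between 1 and 8")
    lead_bits = max(0, 7 - length)  # bits left in the lead byte after the marker
    return lead_bits + 6 * (length - 1)


for length in (6, 7, 8):
    bits = payload_bits(length)
    print(length, bits, hex((1 << bits) - 1))
# 6 31 0x7fffffff
# 7 36 0xfffffffff
# 8 42 0x3ffffffffff

print((0x7FFF_FFFF_FFFF_FFFF).bit_length())  # 63
```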

Dylan16807 · on May 29, 2021

> As a result UTF-8 that would decode outside that range is just invalid, no different from if it was 0xFF bytes or something.

That's needlessly pedantic. If you use an old version of the spec those bytes are valid.

And "have the capability" seems to me to be talking about what the underlying method is able to do, not the full set of "must not" rules.
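
To make the two readings concrete, here is a rough sketch comparing a decoder for the older one-to-six-byte layout (as in RFC 2279) with Python's built-in strict decoder. `decode_old_style` is a hypothetical helper written for this illustration; it handles a single sequence and deliberately skips overlong-form checks.

```python
old_max = bytes.fromhex("fd bf bf bf bf bf")


def decode_old_style(data: bytes) -> int:
    """Hypothetical decoder for one sequence in the old 1-to-6-byte layout."""
    lead = data[0]
    if lead < 0x80:
        length, value = 1, lead
    else:
        # Count the leading one-bits of the lead byte.
        length = 8 - (lead ^ 0xFF).bit_length()
        if not 2 <= length <= 6:
            raise ValueError(f"invalid lead byte: {lead:#04x}")
        value = lead & (0x7F >> length)
    if len(data) != length:
        raise ValueError("wrong number of bytes for this lead byte")
    for byte in data[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError(f"not a continuation byte: {byte:#04x}")
        value = (value << 6) | (byte & 0x3F)
    return value


print(hex(decode_old_style(old_max)))  # 0x7fffffff

# Python's built-in decoder follows the current rules and rejects the same bytes.
try:
    old_max.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)  # False
```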