To be more clear, this flaw happened because, way back when, UCS-2 was thought to be sufficient. Then UTF-16 happened, and all these systems designed for UCS-2 had to be hacked up for UTF-16 instead.
To be even more clear, this happened because early versions of Unicode consisted of fewer than 2^16 code points, and this was thought to be enough. In 1996, Unicode 2.0 was published and allowed characters whose encoding required more than 16 bits. This necessitated UTF-16, which allows expansion up to 32 bits to represent characters in higher planes (and is compatible with most text that had been encoded in UCS-2).
It doesn't allow expansion up to 32 bits. How would that even be possible? Some of the bits are needed to signal that a second pair of bytes follows, so there's a hole in the range and you get fewer than 32 bits' worth of values.
Correct, this is what I meant. UTF-16 can use up to 32 bits to represent a larger number of characters, but cannot uniquely represent 2^32 possible characters.
"Surrogate pairs" use two code units (32 bits total) to increase the number of code points that can be represented. It seems a lot more complicated than UTF-8. I have no idea how many code points can actually be represented in theory. There are under 21 bits' worth of Unicode codepoints, though, so I'm sure the claim that UTF-16 can represent all of them isn't some kind of trick.
A surrogate pair can represent 20 bits, or 2^20 values. A single UTF-16 code unit is of course 16 bits, or 2^16 values. 2^20 + 2^16 == 0x110000, which is precisely as many Unicode code points as there are (0–0x10FFFF).
Decoding UTF-16 (with surrogate pairs) is pretty similar to decoding UTF-8, except the state machine is simpler (it's just 1–2 code units per codepoint, instead of 1–4 bytes). It also means that if you validate that you have a high surrogate followed by a low surrogate, the combination is guaranteed to be a valid codepoint (whereas with UTF-8 there are invalid encodings, e.g. any 1-byte codepoint can also be written as an overlong 2- or 3-byte sequence, which the state machine must reject).
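For anyone curious, here's a rough Python sketch of such a decoder, operating on a list of 16-bit code units (so it sidesteps byte order / BOM handling entirely); decode_utf16_units is just a name I made up, not a real library function:

    def decode_utf16_units(units):
        """Decode a sequence of 16-bit UTF-16 code units into codepoints."""
        out = []
        i = 0
        while i < len(units):
            u = units[i]
            if 0xD800 <= u <= 0xDBFF:            # high (leading) surrogate
                if i + 1 >= len(units):
                    raise ValueError("truncated surrogate pair")
                v = units[i + 1]
                if not (0xDC00 <= v <= 0xDFFF):  # must be followed by a low surrogate
                    raise ValueError("unpaired high surrogate")
                # 10 bits from each code unit, plus the 0x10000 offset
                out.append(0x10000 + ((u - 0xD800) << 10) + (v - 0xDC00))
                i += 2
            elif 0xDC00 <= u <= 0xDFFF:          # lone low (trailing) surrogate
                raise ValueError("unpaired low surrogate")
            else:                                # plain BMP code unit
                out.append(u)
                i += 1
        return out

    # U+0041 'A', then U+1F600 encoded as the surrogate pair 0xD83D 0xDE00
    print(decode_utf16_units([0x0041, 0xD83D, 0xDE00]))  # [65, 128512]

Note how the only validation needed is the pairing itself; any high surrogate followed by any low surrogate maps to a valid codepoint.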
"A surrogate pair can represent 20 bits, or 2^20 values. A single UTF-16 byte is of course 16 bits, or 2^16 values. 2^20 + 2^16 == 0x110000, which is precisely as many unicode code points as there are (0–0x10FFFF)."
This explanation can't be quite right, because the possible surrogate code unit values are carved out of the set of 2^16 code units. So you are double counting those surrogate code units.
From Wikipedia:
"Since the ranges for the high surrogates (0xD800–0xDBFF), low surrogates (0xDC00–0xDFFF), and valid BMP characters (0x0000–0xD7FF, 0xE000–0xFFFF) are disjoint, it is not possible for a surrogate to match a BMP character, or for two adjacent code units to look like a legal surrogate pair."
OK, I've found the gross reason why the 2^16 + 2^20 arithmetic works:
"The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points."
UTF-16 encodes a character outside the BMP with a single surrogate pair, each of the two code units holding 10 bits of the codepoint (after subtracting 0x10000). That's also why Unicode has 17 planes of 65,536 code points each, which would otherwise seem like an odd, non-power-of-two number of codepoints to settle on: the BMP plus the 2^20 codepoints reachable via surrogate pairs, i.e. 16 supplementary planes.
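To make the 10-bits-per-code-unit split concrete, here's the encoding direction as a rough Python sketch (encode_utf16_units is my own made-up name, mirroring the decoder above):

    def encode_utf16_units(cp):
        """Encode a single codepoint as one or two UTF-16 code units."""
        if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            raise ValueError("not a valid Unicode scalar value")
        if cp <= 0xFFFF:                      # BMP: a single code unit
            return [cp]
        v = cp - 0x10000                      # a 20-bit value, 0..0xFFFFF
        return [0xD800 + (v >> 10),           # high surrogate: top 10 bits
                0xDC00 + (v & 0x3FF)]         # low surrogate: bottom 10 bits

    # U+1F600 lives in plane 1 (0x1F600 >> 16 == 1) and encodes as a pair:
    print([hex(u) for u in encode_utf16_units(0x1F600)])  # ['0xd83d', '0xde00']
    # 16 supplementary planes * 65,536 + the BMP's 65,536 == 0x110000 codepoints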