To be more clear, this flaw happened because, way back when, UCS-2 was thought to be sufficient. Then UTF-16 happened, and all these systems designed for UCS-2 had to be hacked up for UTF-16 instead.
To be even more clear, this happened because early versions of Unicode consisted of fewer than 2^16 code points, and this was thought to be enough. In 1996, Unicode 2.0 was published and allowed characters whose encoding required more than 16 bits. This necessitated UTF-16, which allows expansion up to 32 bits to represent characters in higher planes (and is compatible with most text that had been encoded in UCS-2).
It doesn't allow expansion up to 32 bits. How would that even be possible? Some of the bits are needed to signal that a second pair of bytes follows, so there's a hole in the range and you get fewer than 32 bits' worth of values.
Correct, this is what I meant. UTF-16 can use up to 32 bits to represent a larger number of characters, but cannot uniquely represent 2^32 possible characters.
"Surrogate pairs" use two code units (32 bits total) to increase the number of code points that can be represented. It seems a lot more complicated than UTF-8. I have no idea how many code points can actually be represented in theory. There are under 21 bits' worth of Unicode codepoints, though, so I'm sure the claim that UTF-16 can represent all of them isn't some kind of trick.
A surrogate pair can represent 20 bits, or 2^20 values. A single UTF-16 code unit is of course 16 bits, or 2^16 values. 2^20 + 2^16 == 0x110000, which is precisely as many Unicode code points as there are (0–0x10FFFF).
Decoding UTF-16 (with surrogate pairs) is pretty similar to decoding UTF-8, except the state machine is simpler (it's just 1–2 code units per codepoint, instead of 1–4 bytes). It also means that if you validate that you have a high surrogate followed by a low surrogate, the combination is guaranteed to be a valid codepoint (whereas with UTF-8 there are invalid encodings, e.g. any 1-byte codepoint can also be written as an overlong 2- or 3-byte sequence, which the state machine must reject).
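For anyone curious, here's a rough Python sketch of such a decoder, operating on a list of 16-bit code units (so it sidesteps byte order / BOM handling entirely); decode_utf16_units is just a name I made up, not a real library function:

    def decode_utf16_units(units):
        """Decode a sequence of 16-bit UTF-16 code units into codepoints."""
        out = []
        i = 0
        while i < len(units):
            u = units[i]
            if 0xD800 <= u <= 0xDBFF:            # high (leading) surrogate
                if i + 1 >= len(units):
                    raise ValueError("truncated surrogate pair")
                v = units[i + 1]
                if not (0xDC00 <= v <= 0xDFFF):  # must be followed by a low surrogate
                    raise ValueError("unpaired high surrogate")
                # 10 bits from each code unit, plus the 0x10000 offset
                out.append(0x10000 + ((u - 0xD800) << 10) + (v - 0xDC00))
                i += 2
            elif 0xDC00 <= u <= 0xDFFF:          # lone low (trailing) surrogate
                raise ValueError("unpaired low surrogate")
            else:                                # plain BMP code unit
                out.append(u)
                i += 1
        return out

    # U+0041 'A', then U+1F600 encoded as the surrogate pair 0xD83D 0xDE00
    print(decode_utf16_units([0x0041, 0xD83D, 0xDE00]))  # [65, 128512]

Note how the only validation needed is the pairing itself; any high surrogate followed by any low surrogate maps to a valid codepoint.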
"A surrogate pair can represent 20 bits, or 2^20 values. A single UTF-16 byte is of course 16 bits, or 2^16 values. 2^20 + 2^16 == 0x110000, which is precisely as many unicode code points as there are (0–0x10FFFF)."
This explanation can't be quite right, because the possible surrogate code unit values are carved out of the set of 2^16 code units. So you are double counting those surrogate code units.
From Wikipedia:
"Since the ranges for the high surrogates (0xD800–0xDBFF), low surrogates (0xDC00–0xDFFF), and valid BMP characters (0x0000–0xD7FF, 0xE000–0xFFFF) are disjoint, it is not possible for a surrogate to match a BMP character, or for two adjacent code units to look like a legal surrogate pair."
OK, I've found the gross reason why the 2^16 + 2^20 arithmetic works:
"The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points."
UTF-16 encodes a character outside the BMP with a single surrogate pair, each of the two code units holding 10 bits of the codepoint (after subtracting 0x10000). That's also why Unicode has 17 planes of 65,536 code points each, which would otherwise seem like an odd, non-power-of-two number of codepoints to settle on: the BMP plus the 2^20 codepoints reachable via surrogate pairs, i.e. 16 supplementary planes.
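To make the 10-bits-per-code-unit split concrete, here's the encoding direction as a rough Python sketch (encode_utf16_units is my own made-up name, mirroring the decoder above):

    def encode_utf16_units(cp):
        """Encode a single codepoint as one or two UTF-16 code units."""
        if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            raise ValueError("not a valid Unicode scalar value")
        if cp <= 0xFFFF:                      # BMP: a single code unit
            return [cp]
        v = cp - 0x10000                      # a 20-bit value, 0..0xFFFFF
        return [0xD800 + (v >> 10),           # high surrogate: top 10 bits
                0xDC00 + (v & 0x3FF)]         # low surrogate: bottom 10 bits

    # U+1F600 lives in plane 1 (0x1F600 >> 16 == 1) and encodes as a pair:
    print([hex(u) for u in encode_utf16_units(0x1F600)])  # ['0xd83d', '0xde00']
    # 16 supplementary planes * 65,536 + the BMP's 65,536 == 0x110000 codepoints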