
Is there a practical upper limit to Unicode's capacity? Kind of like the IPv4 limits, once you account for all the reservations and local/multicast quirks.



Unicode has 17 planes. Each plane has 65,536 code points, so the total capacity is 1,114,112 code points. In practice it's a bit less, thanks to surrogates, private use areas, and a bunch of "noncharacter" code points. That still leaves close to a million code points.

Last time I checked, just over 13% of the available public space was allocated. Most of the planes remain unused.
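The arithmetic above is easy to check. A quick sketch (the reserved-range counts are taken from the Unicode Standard's definitions of surrogates, noncharacters, and the private use areas):

```python
PLANES = 17
PER_PLANE = 0x10000                  # 65,536 code points per plane

total = PLANES * PER_PLANE           # 1,114,112

# Ranges that can never hold public characters:
surrogates = 0xDFFF - 0xD800 + 1     # 2,048 (U+D800..U+DFFF)
noncharacters = 32 + 2 * PLANES      # U+FDD0..U+FDEF plus U+xxFFFE/U+xxFFFF
                                     # in every plane = 66
private_use = (0xF8FF - 0xE000 + 1) + 2 * (PER_PLANE - 2)
                                     # BMP PUA (6,400) + planes 15-16
                                     # (65,534 each) = 137,468

usable = total - surrogates - noncharacters - private_use
print(total, usable)                 # 1114112 974530
```

So the public, assignable space really is "close to a million" code points.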


And combinations are used, so e.g. a new emoji may take zero or one new code point, not one code point per variation. (Zero because, if I remember right, the family emojis are just something like "woman + boy + girl + man", all existing characters, joined by Zero-Width Joiners.)


Why 17 and not a round number like 16? That would give a nice "round" one mebicodepoint (2^20).


BMP + 16 supplementary planes.

You can blame UTF-16 for this mess. Unicode was originally meant to be able to encode over two billion (2^31) characters. It bent over backwards to accommodate the limits of the bastard child that is UTF-16.

https://en.wikipedia.org/wiki/Plane_(Unicode)
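The "16 supplementary planes" limit falls straight out of the surrogate mechanism: a high surrogate (U+D800–U+DBFF) and a low surrogate (U+DC00–U+DFFF) each carry 10 bits of payload, so pairs can address exactly 2^20 code points above the BMP. A minimal sketch of the standard encoding formula:

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Encode a supplementary-plane code point as a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000                      # 20-bit value
    high = 0xD800 + (v >> 10)             # top 10 bits
    low = 0xDC00 + (v & 0x3FF)            # bottom 10 bits
    return high, low

# U+1F600 (grinning face) encodes as the pair D83D DE00:
print([hex(u) for u in to_surrogates(0x1F600)])

# The highest code point a pair can reach: 2^20 values above U+FFFF.
print(hex(0x10000 + 2**20 - 1))           # 0x10ffff
```

So 1 BMP + 2^20 / 2^16 = 16 extra planes, for 17 total, and U+10FFFF as the hard ceiling.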


Maybe they did it because Windows was backwards and used UCS-2 and, later, UTF-16? If somehow Windows managed to switch to UTF-8, I'm sure they (Microsoft) would mess it up and keep the 4-byte limit (imposed by Unicode) there even if it's later removed, for backwards compatibility. What Microsoft really needs to do, IMO, is rewrite the Windows API to use UTF-8 or UTF-32. Make a `wwchar` type or something...



