
Is there a practical upper limit to Unicode's capacity? Kind of like the IPv4 limits, once you account for all the reservations and local/multicast quirks.



Unicode has 17 planes. Each plane has 65,536 code points, so the total capacity is 1,114,112 code points. In practice it's a bit less, thanks to surrogates, private use areas, and a bunch of "noncharacter" code points. That still leaves close to a million code points.

Last time I checked, just over 13% of the available public space was allocated. Most of the planes remain unused.
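The arithmetic above is easy to check. A quick sketch (the reserved-range counts are taken from the Unicode Standard's definitions of surrogates, noncharacters, and the private use areas):

```python
PLANES = 17
PER_PLANE = 0x10000                  # 65,536 code points per plane

total = PLANES * PER_PLANE           # 1,114,112

# Ranges that can never hold public characters:
surrogates = 0xDFFF - 0xD800 + 1     # 2,048 (U+D800..U+DFFF)
noncharacters = 32 + 2 * PLANES      # U+FDD0..U+FDEF plus U+xxFFFE/U+xxFFFF
                                     # in every plane = 66
private_use = (0xF8FF - 0xE000 + 1) + 2 * (PER_PLANE - 2)
                                     # BMP PUA (6,400) + planes 15-16
                                     # (65,534 each) = 137,468

usable = total - surrogates - noncharacters - private_use
print(total, usable)                 # 1114112 974530
```

So the public, assignable space really is "close to a million" code points.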


And combinations are used, so e.g. a new emoji may take zero or one new code point, not one code point per variation. (Zero because, if I remember right, the family emojis are just something like "woman + boy + girl + man", all existing characters, joined by Zero-Width Joiners.)


Why 17 and not a round number like 16? That would give a nice "round" one mebicodepoint (2^20).


BMP + 16 supplementary planes.

You can blame UTF-16 for this mess. Unicode was originally meant to be able to encode over two billion (2^31) characters. It bent over backwards to accommodate the limits of the bastard child that is UTF-16.

https://en.wikipedia.org/wiki/Plane_(Unicode)
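The "16 supplementary planes" limit falls straight out of the surrogate mechanism: a high surrogate (U+D800–U+DBFF) and a low surrogate (U+DC00–U+DFFF) each carry 10 bits of payload, so pairs can address exactly 2^20 code points above the BMP. A minimal sketch of the standard encoding formula:

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Encode a supplementary-plane code point as a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000                      # 20-bit value
    high = 0xD800 + (v >> 10)             # top 10 bits
    low = 0xDC00 + (v & 0x3FF)            # bottom 10 bits
    return high, low

# U+1F600 (grinning face) encodes as the pair D83D DE00:
print([hex(u) for u in to_surrogates(0x1F600)])

# The highest code point a pair can reach: 2^20 values above U+FFFF.
print(hex(0x10000 + 2**20 - 1))           # 0x10ffff
```

So 1 BMP + 2^20 / 2^16 = 16 extra planes, for 17 total, and U+10FFFF as the hard ceiling.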


Maybe they did it because Windows was backwards and used UCS-2 and, later, UTF-16? If somehow Windows managed to switch to UTF-8, I'm sure they (Microsoft) would mess it up and keep the 4-byte limit (imposed by Unicode) there even if it's later removed, for backwards compatibility. What Microsoft really needs to do, IMO, is rewrite the Windows API to use UTF-8 or UTF-32. Make a `wwchar` type or something...



