
Do you know why UTF-8 won? I feel textual data only constitutes a very tiny portion of memory used, but working with a fixed-size encoding is so much easier than working with variable-size encodings.



The only universal, fixed-size encoding is UTF-32, which, as you can imagine, is very wasteful of space for ASCII text. Like it or not, most of the interesting strings in a given program are probably ASCII.

UTF-16 is not a fixed-size encoding thanks to surrogate pairs. UCS-2 is a fixed-size encoding but can’t represent code points outside the BMP (such as emoji) which makes it unsuitable for many applications.
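
For instance, here is a minimal C11 sketch of the surrogate-pair behavior (the literals are just illustrative):

    /* U+00E9 is inside the BMP and fits in one UTF-16 code unit;
     * U+1F600 (an emoji) is outside the BMP and needs two. */
    #include <stdio.h>
    #include <uchar.h>

    int main(void) {
        char16_t bmp[]   = u"\u00E9";     /* é */
        char16_t emoji[] = u"\U0001F600"; /* 😀 */

        /* array length minus the terminating zero = code-unit count */
        printf("U+00E9  -> %zu code unit(s)\n",
               sizeof bmp / sizeof bmp[0] - 1);    /* prints 1 */
        printf("U+1F600 -> %zu code units: 0x%X 0x%X\n",
               sizeof emoji / sizeof emoji[0] - 1, /* prints 2 */
               (unsigned)emoji[0], (unsigned)emoji[1]); /* 0xD83D 0xDE00 */
        return 0;
    }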

Besides, most of the time individual code points aren't what you care about anyway, so the cost of a variable-size encoding like UTF-8 is only a small part of the overall work needed to support international text.


* UTF-8 is ASCII-compatible, so functions that work with ASCII will kinda sorta work with UTF-8

* all content increases significantly in size with UTF-32 (UTF-16 is also variable-size, and because markup is extremely common and usually ASCII, UTF-8 is not a guaranteed winner against UTF-16 on size but it often is; see the sketch after this list)

* Unicode itself is a variable-size encoding due to combining code points, so a fixed-size code point encoding really doesn't net you anything
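
To make the size point concrete, a rough C sketch with a made-up markup snippet (assuming mostly-ASCII content):

    #include <stdio.h>
    #include <string.h>

    /* count code points by counting bytes that are not
     * 10xxxxxx continuation bytes */
    static size_t code_points(const char *s) {
        size_t n = 0;
        for (; *s; s++)
            if (((unsigned char)*s & 0xC0) != 0x80)
                n++;
        return n;
    }

    int main(void) {
        /* mostly-ASCII markup; "café" has one two-byte character */
        const char *doc = "<p lang=\"fr\">caf\xC3\xA9</p>";
        size_t cps = code_points(doc);              /* 21 code points */

        printf("UTF-8 : %zu bytes\n", strlen(doc)); /* 22 */
        printf("UTF-16: %zu bytes\n", cps * 2);     /* 42 (all BMP here) */
        printf("UTF-32: %zu bytes\n", cps * 4);     /* 84 */
        return 0;
    }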


> "will kinda sorta work"

That sounds more like an anti-feature resulting in unstable programs that almost work.


Most of the time it works flawlessly; other times it fails on some inputs. What determines how well your old code will behave is what you are using it for, so this property is known at design time. If you go for reliability (you certainly should, but most people don't), you can know before writing the program whether you'll need to care about text encoding or not.

Either way, a complete rewrite of the text-handling code should give you flawless functionality. At this point in time, all the important ecosystems that use UTF-8 are almost there.

This is very different from the other encodings, where a complete rewrite of the text-handling code is needed just to stop failing all the time. That pushed the important ecosystems that used other encodings to get almost there much sooner, but there was a long period when everything was broken, and improvements come much more slowly nowadays, because when you need to fix every aspect of something, each iteration takes much more labor.


> That sounds more like an anti-feature resulting in unstable programs that almost work.

It's both. It will generally ignore non-ASCII data, but very commonly that's something you don't care about, in which case it's a net advantage over simply not working at all.


No, it definitely works without errors, as long as the UTF-8 text stays within the ASCII range.


Because Unicode (not UTF-anything, Unicode itself) is/became a variable-width encoding (e.g. U+0078 U+0304 "x̄" is a single character, but two Unicode code points[0]). So encoding Unicode code points with a fixed-width encoding is completely useless, because your characters are still variable-width (it's also hazardous, since it increases how long it takes for bugs triggered by variable-width characters to surface, especially if you normalize to NFC).

0: Similarly, U+01F1 "DZ" is two characters, but one Unicode code point, which is much, much worse as it means you can no longer treat encoded strings as concatenations of encoded characters. UTF-8-as-such doesn't have this problem - any 'string' of code points can only be encoded as the concatenation of the encodings of its elements - but UTF-8 in practice does inherit the character-level version of this problem from Unicode.
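
A minimal C sketch of the combining-character point, using the UTF-8 bytes for U+0078 U+0304:

    #include <stdio.h>
    #include <string.h>

    /* code points = bytes that are not 10xxxxxx continuation bytes */
    static size_t code_points(const char *s) {
        size_t n = 0;
        for (; *s; s++)
            if (((unsigned char)*s & 0xC0) != 0x80)
                n++;
        return n;
    }

    int main(void) {
        /* "x̄" = U+0078 U+0304: one character, two code points,
         * three UTF-8 bytes (0x0304 encodes as 0xCC 0x84) */
        const char *xbar = "x\xCC\x84";
        printf("bytes: %zu, code points: %zu, characters: 1\n",
               strlen(xbar), code_points(xbar)); /* bytes: 3, code points: 2 */
        return 0;
    }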


The only way to "properly" have a fixed-width encoding is to allocate 80-128 bytes per character. Anything else will break horribly on accents and other common code points. So everyone uses the less-easy methods.

I base this number on the "Stream-Safe Text Format", which suggests that while it's preferred that you accept arbitrarily long characters, a cap of 31 code points per character is more or less acceptable (at 4 bytes per code point in UTF-32, that's 124 bytes).


It was mostly because all the C functions that worked with single-byte character encodings also work with UTF-8.
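
For example, a small C sketch (the path is made up, but the property is real: UTF-8 continuation bytes all have the high bit set, so they can never collide with an ASCII delimiter):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* "/home/rené/docs": é is the two-byte sequence 0xC3 0xA9,
         * and both bytes have the high bit set, so searching for the
         * ASCII '/' can never match in the middle of a character */
        const char *path = "/home/ren\xC3\xA9/docs";
        const char *last = strrchr(path, '/');
        printf("basename: %s\n", last + 1); /* prints "docs" */
        return 0;
    }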



