Should UTF-16 be considered harmful? (stackoverflow.com)
77 points by ch0wn on Aug 18, 2011 | 45 comments



UTF-16 has all of the disadvantages of UTF-8 and none of the advantages, and originally comes from the UCS-2 era where they thought, "64k characters are enough for everyone!" Unfortunately, all of Windows uses it, so we as an industry are stuck with it.


You mean that it is not ASCII compatible?

Also, Mac OS X, Java, .NET, Qt, ICU... there is a lot of support for UTF-16, for reasons other than backwards compatibility. Processing UTF-16 is easier in many situations.


Processing UTF-16 is only easier if you have a valid byte-order mark (or know the endianness in advance) and can guarantee that no surrogate pairs exist. Otherwise the same pitfalls exist as when processing UTF-8, plus the endianness issue and the possibility of UTF-16 programmers not knowing about surrogate pairs.
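For the concrete failure mode, here is a minimal Java sketch (Java strings are UTF-16 code units internally; the class name and the sample code point U+1D11E are just for illustration):

    // Minimal sketch of the surrogate-pair pitfall. U+1D11E (MUSICAL
    // SYMBOL G CLEF) lies outside the BMP, so a UTF-16 string stores it
    // as two code units.
    public class SurrogateDemo {
        public static void main(String[] args) {
            String clef = new String(Character.toChars(0x1D11E));

            System.out.println(clef.length());                             // 2 code units
            System.out.println(clef.codePointCount(0, clef.length()));     // 1 code point
            System.out.println(Character.isHighSurrogate(clef.charAt(0))); // true

            // "UCS-2 style" code that treats charAt(i) as a full character
            // would split this pair and corrupt the text.
        }
    }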


What you're describing is UCS-2, not UTF-16. That's why the latter is frustrating to deal with.


UTF-16 is not ASCII-compatible, if that's what you were asking.


Harmful, not really. Sometimes UTF-16 is the most compact representation for a given string, and its semantics are not that confusing. UTF-8 is a better "default choice", but the algorithms for both are very hard to get right. Given an idealized Unicode string object, it's still hard to do things like count the number of glyphs required to display the text in font foo, or convert all characters to "lowercase", or sort a list of names in "alphabetical" order. But the problem is not Unicode or UTF-16 or UTF-8; the problem is defining what a correct computer program is, and ultimately that's the job of the programmer, not the string encoding algorithm.

Think of UTF-8 and UTF-16 as being like in-memory compression for Unicode strings. You can't take a gzip file, count the number of bytes, and multiply by some factor to get the true length of the string. But, this complexity is often worth the space savings, so that's why these encodings exist.

The problem that people have is twofold. First, the average programmer doesn't realize that characters and character encodings are two different things. Second, they don't realize that a Unicode string is a data structure that does one thing and only one thing: maintain a sequence of Unicode codepoints in memory.

A Unicode string is not "text" in some language; it's some characters for a computer. To choose the right font or sort things correctly, you need additional information sent out-of-band. There is no way to do it right given only the Unicode string itself. People are reluctant to send out-of-band data, and so other people assume Unicode is hard. It's not. The problem is that Unicode is not text.
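To make the out-of-band point concrete, here's a rough Java sketch using java.text.Collator; the sample words are arbitrary and the exact orderings depend on the JDK's collation data:

    import java.text.Collator;
    import java.util.Arrays;
    import java.util.Locale;

    // Sketch: the same code points sort differently depending on locale,
    // and the locale is not stored in the string itself.
    public class CollationDemo {
        public static void main(String[] args) {
            String[] words = { "zebra", "ähnlich", "apple" };

            String[] german  = words.clone();
            String[] swedish = words.clone();

            Arrays.sort(german,  Collator.getInstance(Locale.GERMAN));
            Arrays.sort(swedish, Collator.getInstance(new Locale("sv", "SE")));

            // German collation treats 'ä' much like 'a'; Swedish sorts it
            // after 'z'. (Exact results depend on the JDK's tables.)
            System.out.println(Arrays.toString(german));
            System.out.println(Arrays.toString(swedish));
        }
    }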


UTF-16 gives you the worst of both worlds: it wastes a lot of memory for ASCII-like languages (mostly ASCII plus some special characters), you have to deal with byte ordering, and you still don't get the advantage of directly addressing a specific character without parsing the whole string up to the character you want.

But.

If you widen the character even more, you'd probably still want it to be somewhat word-aligned, so you'd use 32 bits per character, which would have enough storage for all of Unicode plus some. The cost, though, is obvious: you waste memory.

Depending on the language, you can waste a lot of memory: think ASCII or ASCII-like text (the Central European languages). In UTF-8 those need, depending on the language, barely more than one byte up to two bytes per character. Representing these languages with 4 bytes per character makes you use nearly 4 times the memory reasonably needed.

This changes the farther east you move. Representing Chinese (no ASCII, many code points high up in the set) in UTF-8, you begin wasting a lot of memory due to the ASCII compatibility: encoding a CJK code point in UTF-8 takes around one byte more than just storing the code point as a 16-bit integer.

So for international software running on potentially limited memory while targeting East Asian languages, you will again be better off using UTF-16, as it requires less storage for characters higher up in the Unicode range.

Also, if you know that your text is just extended ASCII, you can optimize and access characters directly without parsing the whole string, giving you another speed advantage.

I don't know what the best way to go is. 32 bits is wasteful; UTF-16 is sometimes wasteful, has endianness issues, and still needs parsing (but is less wasteful than 32 bits in most realistic cases); and UTF-8 is really wasteful for high code points and always requires parsing, but doesn't have the endianness issues.
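If you want numbers instead of hand-waving, a quick sketch like this will print the byte counts per encoding (the sample strings are arbitrary, and the "UTF-32" charset name is assumed to be available in your JDK):

    import java.nio.charset.StandardCharsets;

    // Sketch: byte counts for the same strings under different encodings.
    public class SizeDemo {
        public static void main(String[] args) throws Exception {
            String[] samples = {
                "hello world",      // plain ASCII
                "přílišný kůň",     // Latin script with diacritics
                "日本語のテキスト"   // CJK, all in the BMP
            };
            for (String s : samples) {
                System.out.printf("%s: utf-8=%d utf-16=%d utf-32=%d%n",
                    s,
                    s.getBytes(StandardCharsets.UTF_8).length,
                    s.getBytes(StandardCharsets.UTF_16LE).length,
                    s.getBytes("UTF-32").length);
            }
        }
    }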

I guess as always these are just tools and you have to pick what works in your situation. Developers have to adapt to what was picked.


> Representing Chinese (no ASCII, many code points high up in the set) in UTF-8, you begin wasting a lot of memory due to the ASCII compatibility.

That's not "a lot", that's only 33% more in the best case of purely CJK text.

OTOH any non-CJK characters are 50% smaller. It starts to even out if you add a few foreign names or URLs to the text.

And for HTML, UTF-16 is just crazy. Makes HTML markup twice as expensive.

CJK websites don't use UTF-16; they use Shift-JIS or GBK, which are technically more like UTF-8.


How exactly does UTF-16 make HTML twice as expensive?

I guarantee you that there are very few apps that will double their memory usage if you start using UTF-16 text. Even if you start looking at bandwidth, once you compress the text there is very little difference. (You are compressing your HTML, right?)
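Rather than argue about it, you can measure it on your own content; a rough Java sketch (the sample markup is made up, and String.repeat needs Java 11+):

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPOutputStream;

    // Sketch: raw vs. gzip-compressed sizes of the same markup in UTF-8
    // and UTF-16.
    public class CompressDemo {
        static int gzipSize(byte[] data) throws Exception {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(data);
            }
            return out.size();
        }

        public static void main(String[] args) throws Exception {
            String html = "<html><body><p>这是一个测试页面</p></body></html>".repeat(200);

            byte[] utf8  = html.getBytes(StandardCharsets.UTF_8);
            byte[] utf16 = html.getBytes(StandardCharsets.UTF_16LE);

            System.out.printf("raw:  utf-8=%d  utf-16=%d%n", utf8.length, utf16.length);
            System.out.printf("gzip: utf-8=%d  utf-16=%d%n", gzipSize(utf8), gzipSize(utf16));
        }
    }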

The case for UTF-8 saving memory makes a lot of sense if you're writing embedded software; however, in most stacks the amount of memory wasted by UTF-16 is trivial compared to the amount wasted by the GC, bloated libraries, interpreted code, etc.

If you're using .NET or the JVM char is 16 bits wide anyway. The UTF-8 vs. UTF-16 debate is a perfect example of microbenchmarking, where theoretically there is a great case for saving a resource but in aggregate it makes very little difference.


> How exactly does UTF-16 make HTML twice as expensive?

    < 0 h 0 t 0 m 0 l 0 > 0
> If you're using .NET or the JVM char is 16 bits wide anyway.

Hopefully you don't need to worry about what .NET/JVM have to do for legacy reasons, and you can use UTF-8 for all input and output.


> If you widen the character even more (...)

If you go to UTF-32 you waste a lot more space in any conceivable situation (extra-BMP characters are less than 0.1% of the text even in CJK languages), and you still don't get anything in exchange, thanks to combining diacriticals and ligatures. Thanks, Unicode Committee!


But as one of the replies in that thread pointed out, if you want to compress strings, use LZ! UTF-16 won't give you as good compression.


Unicode should be considered harmful, possibly even text. Never think you understand text, it is a very complex medium, and every time this topic is brought up, you learn something about some odd quality of some language that you might never have heard of. Yes, UTF-16 is variable length, yes, it does make many European scripts larger. Size is always a trade-off and there won't be one standard for encoding.

Text is hard, do not approach it with a C library you built in an afternoon, leave it to the professionals. I just wish I knew any...


> Size is always a trade-off and there won't be one standard for encoding.

I don't see why, in modern multi-core CPUs/GPUs where 99% of time is spent idle, we can't just have "raw" text (UCS4) to manipulate in memory, and compressed text (using any standard stream compression algorithm) on disk/in the DB/over the wire.

Anything that's not UCS4 is already variable-length-encoded, so you lose the ability to random-seek it anyway; and (safely performing) any complex text manipulation, e.g. upper-casing, requires temporarily converting the text to UCS4 anyway. At that point, you may as well go all the way, and serialize it as efficiently as possible, if you're just going to spit it out somewhere else. I guess the only difference is that string-append operations would require un-compressing compressed strings and then re-compressing the result—but you could defer that as long as necessary using rope[1].

[1] http://en.wikipedia.org/wiki/Rope_(computer_science)


The oft-stated advantage of UTF-32/UCS4 is that you can do random access. But random character access is almost entirely useless for real text processing tasks. (You can still do random byte access for UTF-8 text, and if your regexp engine spits out byte offsets, you're fine.)
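A small sketch of what "random byte access into UTF-8" looks like in practice, resynchronizing to the next code point boundary by skipping continuation bytes (the helper name and sample string are just illustrative):

    import java.nio.charset.StandardCharsets;

    // Sketch: random byte access into UTF-8 works because continuation
    // bytes always look like 10xxxxxx, so you can resync to a boundary.
    public class Utf8Seek {
        // Index of the first code point boundary at or after pos.
        static int alignForward(byte[] utf8, int pos) {
            while (pos < utf8.length && (utf8[pos] & 0xC0) == 0x80) {
                pos++;
            }
            return pos;
        }

        public static void main(String[] args) {
            byte[] bytes = "naïve 日本語".getBytes(StandardCharsets.UTF_8);
            int i = alignForward(bytes, 3);   // offset 3 is the second byte of 'ï'
            System.out.println(i);            // 4
            System.out.println(new String(bytes, i, bytes.length - i, StandardCharsets.UTF_8));
        }
    }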

Even when you're doing something "simple" like upcasing/downcasing, the advantages of UTF-32 are not great. You are still converting variable length sequences to other variable length sequences -- e.g., eszett upcases to SS.

Now the final piece to this is that for some language implementations, compilation times are dominated by lexical analysis. Sometimes, significant speed gains can be had by dealing with UTF-8 directly rather than UTF-32 because memory and disk representation are identical, and memory bandwidth affects parsing performance. This doesn't matter for most people, but it matters to the Clang developers, for example. Additional system speed gains are had from reducing memory pressure.

Sure, we have plenty of memory and processor power these days. But simpler code isn't always worth 3-4x memory usage.

Text is not simple.


> I don't see why, in modern multi-core CPUs/GPUs where 99% of time is spent idle

We have 3 levels of caching and hyperthreaded cores because memory access is so ridiculously slow compared to the CPU. Quadrupling the amount of data that goes through this bottleneck isn't going to help.

> Anything that's not UCS4 is already variable-length-encoded

You can't access the n-th character in UCS4 anyway, because Unicode has combining characters (e.g. ü may be u + a combining ¨).
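A short Java sketch of exactly that, using java.text.Normalizer; without normalization the two spellings of ü don't even compare equal:

    import java.text.Normalizer;

    // Sketch: "ü" can be one code point (U+00FC) or two (u + combining
    // diaeresis).
    public class NormalizeDemo {
        public static void main(String[] args) {
            String composed   = "\u00FC";    // precomposed ü
            String decomposed = "u\u0308";   // u + COMBINING DIAERESIS

            System.out.println(composed.equals(decomposed));                       // false
            System.out.println(composed.length() + " vs " + decomposed.length());  // 1 vs 2

            String a = Normalizer.normalize(composed,   Normalizer.Form.NFC);
            String b = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
            System.out.println(a.equals(b));                                       // true
        }
    }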


tchrist seems pretty knowledgeable on Unicode issues. He even did three talks at OSCON on the topic. My takeaway was that the best language for dealing with Unicode was Perl, and Ruby was the second best.


I gave a presentation on this topic at RubyConf Brazil last year, and I would be hard pressed to describe Ruby as "good" at dealing with Unicode, unless by "good" you mean "avoids making almost any decisions at all" (which might actually be a good thing, but it's debatable).

Ruby 1.9 doesn't even offer Unicode case folding, so from a practical standpoint working with Unicode text is a PITA with Ruby unless you use third party libraries.

Ruby source can include constants, variables, etc. with Unicode (or other character set) symbols, which is very cool, but for Unicode text processing I've found Ruby to be frustratingly lacking.


Here are the slides for his "Unicode Shootout: The Good, The Bad, and the Ugly":

http://code.activestate.com/lists/perl5-porters/166738/

Hmmm... According to that email he says that Java is second, but in his first talk he did say that Ruby (1.9+) was second in his mind (and he seemed to be visibly frustrated with Java's Unicode support).


ICU is one library.


UTF-16 is not harmful, just the "I use UTF-16, so I don't have to worry about Unicode issues anymore" stance is harmful.

Processing text is hard. For example if you write an editor, you must be aware of things like right-to-left text sequences, full-width characters and graphemes composed of multiple codepoints.
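For the grapheme part, a rough sketch with the JDK's BreakIterator (the sample string is arbitrary):

    import java.text.BreakIterator;

    // Sketch: counting user-perceived characters (grapheme clusters)
    // instead of UTF-16 code units.
    public class GraphemeDemo {
        public static void main(String[] args) {
            String s = "e\u0301tude";   // "étude" with a combining acute accent

            BreakIterator it = BreakIterator.getCharacterInstance();
            it.setText(s);

            int graphemes = 0;
            for (int b = it.next(); b != BreakIterator.DONE; b = it.next()) {
                graphemes++;
            }

            System.out.println(s.length());   // 6 code units
            System.out.println(graphemes);    // 5 user-perceived characters
        }
    }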

If you implement a case insensitive search, you must be aware of title case, and the nasty German ß which case-folds to two upper-case characters (SS). (Ok, you'll probably just use regexes, it's the author of the regex engine who has to worry about it. And it is ugly, because it means that character classes can match two characters, which you have to consider in all length-based optimizations).
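A tiny sketch of the ß case in Java (Locale.ROOT is used just to keep the mapping locale-independent):

    import java.util.Locale;

    // Sketch: case mapping can change the length of a string, so a
    // case-insensitive match cannot assume a 1:1 character mapping.
    public class CaseDemo {
        public static void main(String[] args) {
            String s = "straße";

            System.out.println(s.toUpperCase(Locale.ROOT));            // STRASSE
            System.out.println(s.length());                            // 6
            System.out.println(s.toUpperCase(Locale.ROOT).length());   // 7

            // equalsIgnoreCase only does simple, per-char folding:
            System.out.println("STRASSE".equalsIgnoreCase(s));         // false
        }
    }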

To a first approximation it doesn't matter whether you implement your stuff in UTF-8 or in one of the UTF-16 encodings; you have to deal with the same issues (including variable byte length for a single codepoint).


I see a different problem: the scope of Unicode has crept outside the scope of its designers' expertise. Next thing you know, we'll have a code point for the symbol "arctanh". (Moreover, sin through arccosh will be contiguous, erf will be placed where you'd expect arctanh, and arctanh will sit between blackboard bold B and D, because C is listed a trillion code points earlier as "Variant Complex Number Sign", between "Engineering Right Angle Bracket" and "Computer Programming Left Parenthesis".)

Encoding the technical and mathematical symbols in two bytes is necessarily kludgy. There are International Standard kludges that everyone is supposed to use, but not everyone does. My solution would be for Unicode to stick to human languages.


You shouldn't care about UTF-8/UTF-16/UCS-4 except for performance, and that totally depends on what you are doing with what data-sets.

As someone who speaks multiple languages and has written a fair amount of language-processing code, the simple truth is that if you are not using a vetted Unicode framework for writing your application--and if you're asking whether UTF-16 should be considered harmful, you're probably not--you are almost certainly introducing massive numbers of bugs that your own cultural biases are blinding you to.

Do not underestimate the difficulty of writing correct multi-lingual-aware programs. These frameworks exist for a reason, and are often written by professionally trained linguists.


Interfacing with third-party libraries is an obvious case where you do need to worry about encodings for reasons that are not performance.


My experience is that the harder a framework tries to "support" Unicode correctly, the more likely it's going to have bugs ranging from annoying to showstopper.

Simply passing UTF-8 through would almost work, if it weren't for the fact that "pass through" also lets illegal byte sequences through, and those can cause all sorts of trouble.
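One way to avoid blind pass-through is to decode strictly at the boundary; a sketch using the JDK's CharsetDecoder (the method and class names here are just illustrative):

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    // Sketch: decode strictly so malformed UTF-8 is rejected instead of
    // silently passed through.
    public class ValidateUtf8 {
        static String decodeStrict(byte[] input) throws CharacterCodingException {
            CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            return decoder.decode(ByteBuffer.wrap(input)).toString();
        }

        public static void main(String[] args) throws Exception {
            System.out.println(decodeStrict("ok ✓".getBytes(StandardCharsets.UTF_8)));

            byte[] bad = { (byte) 0xC3, (byte) 0x28 };   // 0xC3 needs a continuation byte
            try {
                decodeStrict(bad);
            } catch (CharacterCodingException e) {
                System.out.println("rejected malformed input: " + e);
            }
        }
    }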


Is there any reason to use a character encoding other than UTF-8 (for new content)? There is a lot of legacy content in non-Unicode encodings. Also, UCS-4 might be a useful optimization for processing text.


I spent many years maintaining a USENET newsreader, first on BeOS, then on Mac OS X. USENET being so old, its users are often very entrenched in their ways. Chinese users were particularly fond of Big5 and seemed in no hurry to move to Unicode. I supported around 15 different encodings for displaying messages, yet it was still very likely to see an article that my app wasn't displaying properly.

I've never maintained an email client, but I'd bet the situation there is much the same.


Email is absolutely horrible. Email is where I started to appreciate statistical analysis for charset detection. Email starts with 7-bit characters and rapidly goes downhill from there.

Email is also where I started to dislike UNIX devs who think that \n is a proper line ending on networked systems. "\r\n" is not the Windows way, it's the network way.

When dealing with email you eventually learn to completely disregard the specs and do whatever works which further screws up the ecosystem for everyone else.


Size, perhaps? Remember, most of the world does not speak English. I bet Baidu uses UTF-16 for its indexes.


Is size actually a concern in the modern world? For transmission and storage, size is practically a non-issue, and we have fairly cheap compression anyway. And in memory? We have gigabytes of RAM; who cares? Of the potential situations where it may become an issue that I can think of, pretty much none of them are on end user hardware.


I reiterate my example of Baidu.

Just because size isn't a problem for your application, doesn't mean there aren't others out there dealing with petabytes of text.


"Of the potential situations where it may become an issue that I can think of, pretty much none of them are on end user hardware."

Do what you need to on your company's machines, but keep it out of ecosystems it has no reason to be in.


The other way around: having a wildly variable size depending on the character can be a weakness. In Japanese, EUC was sometimes used despite its shortcomings to force 16 bits per character.


Baidu uses GBK, which is more efficient than UTF-16, doesn't suffer from Han unification, and has ASCII compatibility similar to UTF-8.


If whatever framework or libraries you're using don't hide this from you nearly completely, you might be doing it wrong. Just use the default Unicode format of the OS and framework or standard libraries you're using and never think of it again.

Much like dates and times, correct handling of Unicode is best left to the poor souls stuck dealing with them full time, all the time.


> Much like dates and times, correct handling of Unicode is best left to the poor souls stuck dealing with them full time, all the time.

Reminds me that date parsing is fucked up in iOS and you actually need to know your shit to get it right when dealing with users around the globe. But I don't see people dealing with this full time.


In all the apps I've worked on, passing dates around as 'seconds since 1970' has made date parsing on iOS effortless - you may want to try that.
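Roughly what that looks like with java.time (which is Java 8+, so newer than this thread; the epoch value and zones are just examples):

    import java.time.Instant;
    import java.time.ZoneId;

    // Sketch: pass an unambiguous instant around and only convert to a
    // local wall-clock view at the edges.
    public class EpochDemo {
        public static void main(String[] args) {
            long epochSeconds = 1313625600L;   // 2011-08-18T00:00:00Z

            Instant instant = Instant.ofEpochSecond(epochSeconds);

            System.out.println(instant);                                    // UTC instant
            System.out.println(instant.atZone(ZoneId.of("Asia/Tokyo")));    // local views
            System.out.println(instant.atZone(ZoneId.of("Europe/Berlin")));
        }
    }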


  > passing dates around as 'seconds since 1970' has
  > made date parsing on iOS effortless

I'm confused. How does the way that you store dates internally have anything to do with date parsing? If you are accepting dates from users, then you still have to normalize them to Unix timestamps.

That also doesn't cover things like:

* Take this date and convert it into a week represented by the last Sunday of the week (i.e. 20110410 represents 20110403-20110409).

* Basically anything involving days of the week.


Give me a call in 2038.


What happens when the library has issues or doesn't follow the Unicode spec? What about the fact that Unicode Is Hard(tm) and you will end up shooting yourself in the foot if you think you can bury your head in the sand about it? Not many regex implementations allow you to reference Unicode characters by name or type rather than directly by hex code.
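For the "by type" part, at least java.util.regex does support Unicode general categories; a small sketch (the sample text is arbitrary):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch: matching by Unicode general category (\p{L} = any letter,
    // in any script) rather than by hard-coded hex ranges.
    public class UnicodeRegexDemo {
        public static void main(String[] args) {
            Pattern letters = Pattern.compile("\\p{L}+");
            Matcher m = letters.matcher("Grüße, 世界! 123");

            while (m.find()) {
                System.out.println(m.group());   // prints "Grüße" and "世界", not "123"
            }
        }
    }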


Theoretically, maybe.

But, as with everything, things have a tendency to screw up, and you should know why they did (or somebody better paid should).


That is not an option for low-level programming.


"correct handling of Unicode is best left to the poor souls stuck dealing with them full time, all the time"


I agree with much of what the top-voted answer says, with the exception of the criticism of wchar_t. wchar_t is not equal to UTF-16: it is supposed to be a type large enough to store any single codepoint. Environments where wchar_t is UTF-16 simply have a broken wchar_t; when wchar_t is UCS4, it fulfills its original purpose.


I dislike UTF-16 for a very simple reason: it's another encoding to deal with. If I could force the entire world to dump UTF-16 and use UTF-8, I would. Hell, I'd force everyone to dump UTF-8 and use UTF-16 if I could, too. It's just such a pain in the ass to have to deal with so many ways of encoding text on the web.



