What Every Software Developer Must Know About Unicode (2003) (joelonsoftware.com)
96 points by jervisfm on Jan 1, 2014 | 33 comments



This article deals mostly with Windows and is from 2003, so it fails to emphasise what is now standard practice as much as it should:

    Use UTF-8 everywhere you can.
UTF-8 is:

* the most backwards compatible (can be passed through many tools intended for ASCII-only, with a few limitations – including avoiding decomposed (combining-mark) latin sequences)

* most likely to give an appropriate result if the end-user incorrectly interprets it

* the most space efficient encoding (on average)

* avoids endianness problems

* de-facto encoding for most Mac and Linux C APIs

* verifiable with a high degree of accuracy, unlike many other encodings which can't be verified at all (there's a validity-check sketch at the end of this comment)

Specifically:

* If you have to pick an encoding, always try to use UTF-8 unless you're only storing text to pass into an API which requires something different.

* The Winapi (aka Win32) is the only commonly used API that regularly requires something other than UTF-8 (the Windows Unicode APIs use UTF-16 – not UCS-2 as indicated in the Spolsky article). Windows' UTF-16 requirement is a pain for platform independence -- be careful. However, you should still aim to use UTF-8 for all text files on Windows and only use UTF-16 for the Windows API calls (never use the locale-specific non-Unicode encodings).

* There are a few language+environment combinations that literally can't open Unicode filenames. These include MinGW C++, which has no platform-independent way of opening file streams with Unicode filenames. You need to fall back to C _wfopen and UTF-16 to open files correctly (see the sketch just below).
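
A rough sketch of that fallback, assuming Windows/MinGW; utf8_to_wide and open_utf8 are my own illustrative helper names, not standard or MinGW APIs:

    #include <stdio.h>
    #include <string>
    #include <windows.h>   // MultiByteToWideChar

    // Open a file whose name is held as UTF-8, on a toolchain (e.g. MinGW)
    // whose file streams have no wide-path overload: widen the path with
    // MultiByteToWideChar and call _wfopen directly.
    static std::wstring utf8_to_wide(const std::string& utf8) {
        int n = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, nullptr, 0);
        std::wstring wide(n, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &wide[0], n);
        if (n) wide.resize(n - 1);      // drop the trailing NUL the call wrote
        return wide;
    }

    static FILE* open_utf8(const std::string& utf8_path, const wchar_t* mode) {
        return _wfopen(utf8_to_wide(utf8_path).c_str(), mode);
    }

    int main() {
        FILE* f = open_utf8(u8"r\u00E9sum\u00E9.txt", L"rb");  // path stored as UTF-8
        if (f) fclose(f);
    }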

Note: you don't always have to choose the encoding. For example, the Mac class NSString and the C# String class use UTF-16 internally, but you don't normally need to care what they do internally, since any time you access the internal characters you specify the desired encoding. You should usually extract characters as UTF-8.
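
To back up the "verifiable with a high degree of accuracy" point above, here's a minimal validity-check sketch following the RFC 3629 byte ranges (the function name is mine, not from any library); random bytes in a legacy 8-bit encoding almost never pass it:

    #include <cstdint>
    #include <cstddef>

    // Sketch of a UTF-8 validity check per the RFC 3629 byte ranges.
    // Rejects overlong forms, surrogates, and code points above U+10FFFF,
    // which is why text in some other 8-bit encoding almost never
    // accidentally looks like valid UTF-8.
    bool is_valid_utf8(const uint8_t* p, size_t n) {
        for (size_t i = 0; i < n; ) {
            uint8_t b = p[i];
            size_t len = 0;
            uint8_t lo = 0x80, hi = 0xBF;               // allowed range for byte 2
            if      (b <= 0x7F)              { i += 1; continue; }
            else if (b >= 0xC2 && b <= 0xDF) len = 2;
            else if (b == 0xE0)              { len = 3; lo = 0xA0; }
            else if (b >= 0xE1 && b <= 0xEC) len = 3;
            else if (b == 0xED)              { len = 3; hi = 0x9F; }  // exclude surrogates
            else if (b >= 0xEE && b <= 0xEF) len = 3;
            else if (b == 0xF0)              { len = 4; lo = 0x90; }
            else if (b >= 0xF1 && b <= 0xF3) len = 4;
            else if (b == 0xF4)              { len = 4; hi = 0x8F; }  // cap at U+10FFFF
            else return false;                   // 0x80-0xC1, 0xF5-0xFF never start a char
            if (i + len > n) return false;       // truncated sequence
            if (p[i + 1] < lo || p[i + 1] > hi) return false;
            for (size_t k = 2; k < len; ++k)
                if (p[i + k] < 0x80 || p[i + k] > 0xBF) return false;
            i += len;
        }
        return true;
    }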


I started treating what Joel says much less seriously after discovering that FogBugz sends E-mail using "windows-1252" or "windows-1250" character sets instead of UTF-8. I reported this as a problem to customer service, but was told that it will not be changed. Quoting from the reply I got "joelonsoftware.com is a blog. Fog Creek Software is a business."

Some people might give Joel additional credibility because of his successful business — but there seems to be a gap between theory and practice, which undermines that credibility, at least for me.


I'm sorry you got that reply. I don't doubt you received it, but it doesn't reflect how I believe my company should have responded to you.

This is actually a feature. FogBugz DOES send email in UTF-8 (just copy some Hebrew text and send an email to yourself) if it has to. If it doesn't have to, it won't, simply because it's working around old email clients that don't support UTF-8. FogBugz has supported all kinds of languages for many, many years, so there's some possibility you are talking about a reply you received before that even happened, but I think you just observed that a normal English-only email from FogBugz has its content-type set to windows-1252. You could argue at this point that we probably don't need that workaround anymore, I guess. I just have to weigh that against whether it's broken now (it's not) vs. the new support emails we would get from people who were having problems with email because they were still using Eudora 3.0.

It turns out that it's actually BECAUSE of the whole reason Joel wrote that article that FogBugz sends English email in windows-1252.


  The Winapi (aka Win32) is the only commonly used API that 
  regularly requires something other than UTF-8
Sadly, that's not really correct. Core Foundation and Cocoa on Mac OS X use UTF-16. Qt uses UTF-16. Java uses UTF-16. JavaScript uses UTF-16. Many of these APIs have easy methods for converting to and from other encodings, and offer a certain amount of abstraction over the underlying encoding, but it still shows through in that string length and character indexing work via UTF-16 code units, not Unicode code points.

  the Mac class NSString or the C# String class use UTF-16 
  internally, you don't normally need to care what they do 
  internally since any time you access the internal 
  characters, you specify the desired encoding.
While it's true that they do abstract over the encoding somewhat, and they offer conversions, the abstraction is fairly leaky, as string length and indexing all happen in terms of UTF-16 code units.

I'm a huge fan of UTF-8, and I agree that new APIs should generally favor it, but there is so much legacy code with UTF-16 assumptions baked in that you can't really say that only the Winapi uses UTF-16.


There's a reason a ton of legacy code is UCS-2^/UTF-16 and not UTF-8.

The code is older than UTF-8.

Windows NT was first released in '93; UTF-8 didn't exist (much less have wide adoption) for most of its development. Likewise, NeXTSTEP had an initial release in '89. Java and JavaScript (both '95) could have adopted UTF-8, but they're almost certainly running on an OS that expects UTF-16 soooo... yeah.

^UCS-2 was superseded by UTF-16, so you'll find them used interchangeably a lot.


It doesn't matter when the projects started; it matters when they added Unicode support. Windows NT didn't support Unicode until 4.0, released in 1996. NeXTSTEP may have been released in '89, but Unicode itself wasn't finished until '91, so it couldn't possibly have supported Unicode upon release. The original releases of NeXTSTEP just used C strings; it wasn't until OpenStep in 1994 that they introduced NSString, based on UCS-2.

UTF-8 was publicly released in January 1993. So by the time these projects became Unicode enabled, UTF-8 had already existed for at least a year.

Java and JavaScript had no underlying platform constraints to choose UCS-2/UTF-16, since the underlying platforms didn't support Unicode during their development.

Qt 2.0 was the first release of Qt to introduce Unicode support in QString, and it was released in 1999.

No, the real problem was just the fundamental design mistake that the Unicode consortium made when first developing Unicode. They thought that 16 bits would be enough to fit all of the world's actively used writing systems, and that the simplest way to support an extended character set would be to just switch the underlying character type from 8-bit integers to 16-bit integers. This was a mistake in many ways. 16 bits is not sufficient, especially when CJK is taken into account, so they had to do a lot of unification that wasn't really appropriate, which led to a lot of resistance to using Unicode from CJK users.

Changing to 16-bit integers for the fundamental character type meant that every API had to be duplicated to provide a wide character version. Some APIs already had wide character support for legacy wide character sets, but differences in existing wide character support between NT (which used 16-bit wide characters) and many Unices (which used 32-bit wide characters) meant that writing portable code was quite difficult. Using 16-bit integers for an internal representation also means there's a native endianness, and once you need to interchange data, endianness becomes a big issue. And so on.

UTF-8 was the solution to many of these problems, and it was introduced before Unicode support had become widespread, but the idea that 16 bit types should be used for Unicode had already permeated people's consciousness and likely early development efforts. It's too bad that more people didn't learn from Plan 9's experience switching to UTF-8, which happened all the way back in 1992 (they switched Plan 9 to UTF-8 before publicly announcing it, which acted as a very good proof of concept).


> The Winapi (aka Win32) is the only commonly used API that regularly requires something other than UTF-8 (the Windows Unicode APIs use UTF-16 – not UCS-2 as indicated in the Spolsky article).

Why has Microsoft still not fixed this? They have the various undifferentiated functions [Abc] which map to [Abc]A or [Abc]W depending on whether UNICODE is defined; it seems obvious that they should add [Abc]U for UTF-8 and provide a similar define to map the undifferentiated names to those. Even if all the functions did was convert between UTF-8 and UTF-16 and call [Abc]W, it would save everyone from having to write and debug their own implementation of the same thing.
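
For illustration, a sketch of such a shim: MessageBoxU is a hypothetical name (no such Win32 export exists); all it does is widen the UTF-8 arguments and forward to the real MessageBoxW:

    #include <string>
    #include <windows.h>

    // Hypothetical [Abc]U-style shim: convert UTF-8 arguments to UTF-16 and
    // forward to the existing W function. This is all the proposal amounts to.
    int MessageBoxU(HWND hwnd, const char* text, const char* caption, UINT type) {
        auto widen = [](const char* utf8) {
            int n = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, nullptr, 0);
            std::wstring w(n, L'\0');
            MultiByteToWideChar(CP_UTF8, 0, utf8, -1, &w[0], n);
            return w;               // keeps the trailing NUL; harmless for c_str()
        };
        return MessageBoxW(hwnd, widen(text).c_str(), widen(caption).c_str(), type);
    }

    int main() {
        MessageBoxU(nullptr, u8"Gr\u00FC\u00DF dich, world", u8"UTF-8 shim", MB_OK);
    }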

> There are a few language+environment combinations that literally can't open Unicode filenames. These include MinGW C++ which has no platform independent way of opening file streams with unicode filenames.

This seems like much the same issue. I don't know whether this is somehow the fault of the C++ standard or not, but it seems like there should be a way to specify (if not have it be the default) that any C or C++ standard library function taking a const char* or std::string should Do The Right Thing when provided with a null-terminated UTF-8 string. If the OS needs something different underneath then let the library do the conversion -- half the point of standard libraries is to abstract away things like that instead of making everybody futz with them all the time.


>> The Winapi (aka Win32) is the only commonly used API that regularly requires something other than UTF-8 (the Windows Unicode APIs use UTF-16 – not UCS-2 as indicated in the Spolsky article).

> Why has Microsoft still not fixed this?

Is there a way they can fix it without breaking backward compatibility?


They can, by introducing a few thousand stub functions which convert and delegate to ★W functions, but there'd be little point in doing so. Every sane program out there uses the ★W functions and some insane still use the ★A ones. So it would only be beneficial for new code while all existing code remains the same, with the same encoding bugs if there are any.

I'm also not sure whether there are that many cases where it really helps. UTF-8 only on Windows is painful, and so is using UTF-8 only alongside all the other things that use UTF-16 (Qt, Java, etc.). Usually in those cases you use a library/framework/whatever that handles the platform abstraction or just conform to what's expected.


> Every sane program out there uses the ★W functions and some insane still use the ★A ones. So it would only be beneficial for new code while all existing code remains the same, with the same encoding bugs if there are any.

Every sane program uses the undifferentiated functions, defines UNICODE so that they map to the ★W functions, and uses TCHAR, which defining UNICODE maps to a wide char. Older programs don't define UNICODE and often use char (or CHAR) instead of TCHAR. If they created a different define (e.g. '#define UTF8') which mapped the undifferentiated names to the new UTF-8 functions and defined TCHAR as CHAR, then anything written either way would do the right thing just by defining UTF8 and recompiling. Only programs that explicitly call the ★W functions (which should never have been exposed) wouldn't be "fixed" to use UTF-8, but neither would they be broken.
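
As a sketch of that mechanism (my_TCHAR is an illustrative stand-in so this compiles without windows.h; the UTF8 branch is the hypothetical part, the other two mirror what the real headers do):

    #include <cstdio>

    #if defined(UNICODE)
      typedef wchar_t my_TCHAR;   // today: TCHAR is wide, undifferentiated names map to ...W
    #elif defined(UTF8)           // hypothetical new mode proposed above
      typedef char    my_TCHAR;   // TCHAR stays narrow, strings are UTF-8,
                                  // and undifferentiated names would map to new ...U stubs
    #else
      typedef char    my_TCHAR;   // legacy: locale "ANSI" code page, names map to ...A
    #endif

    int main() {
        std::printf("sizeof(my_TCHAR) = %u\n", (unsigned)sizeof(my_TCHAR));
    }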

> UTF-8 only on Windows is painful

...because Microsoft hasn't fixed it.

> Usually in those cases you use a library/framework/whatever that handles the platform abstraction or just conform to what's expected.

That's a cop out. You're just deferring to the frameworks, which also shouldn't be using anything other than UTF-8, and which may have more difficulty in fixing it, because the transition mechanism Microsoft used to get to Unicode is well adaptable to another transition. Not every library you have to use will use the same encoding as the framework, and then you're back to a huge pain. The only way to fix it is for everything to always use UTF-8, and to deprecate everything else going forward.


Agreed with what you wrote except "the most backwards compatible (can be passed through many tools intended for ASCII-only)"... That amounts to knowingly sweeping bugs under the rug.

If a tool is intended for ASCII only, don't pass it UTF-8 strings. Otherwise, you'll probably expose yourself to malformed UTF-8 strings and potential problems (e.g. PHP's mysql_escape_string vs mysql_real_escape_string).


Technically, it's not that it can be passed through tools intended for ASCII only. It can be passed through tools which assume an ASCII-compatible character set (a character set of which ASCII is a subset), which are 8-bit clean, and which don't make incorrect assumptions about being able to truncate strings at arbitrary points and be left with two valid strings.

Which is actually generally true of any tools which had been internationalized with legacy, pre-Unicode character sets like the ISO 8859 series.
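
A small sketch of why that's safe for UTF-8 specifically: every byte of a multi-byte sequence has the high bit set, so byte-oriented code that only looks for ASCII delimiters can never match in the middle of a character:

    #include <cstdio>
    #include <string>

    // Byte-oriented splitting on an ASCII delimiter works on UTF-8 text because
    // lead and continuation bytes of multi-byte sequences are all >= 0x80, so
    // they can never collide with ':' (0x3A) or any other ASCII byte.
    int main() {
        std::string line = u8"m\u00FCller:cr\u00E8me br\u00FBl\u00E9e:\u6771\u4EAC";
        size_t start = 0, pos;
        while ((pos = line.find(':', start)) != std::string::npos) {
            std::printf("field: %s\n", line.substr(start, pos - start).c_str());
            start = pos + 1;
        }
        std::printf("field: %s\n", line.substr(start).c_str());
    }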


Yes, obviously there will be caveats when using a tool beyond its intended domain:

* you do need to know that the tool passes non-ASCII through unchanged

* the text should not contain decomposed (combining-mark) latin sequences

* you're on your own if you're trimming strings to byte lengths

I've added the second point to my comment.

It's not about sweeping bugs under the rug at all. It's about using non-latin text on the command line and in code. Most command-line tools are ASCII-oriented but will pass non-ASCII bytes through unchanged. However, most require that you avoid decomposed (combining-mark) latin sequences. Since Unicode includes single-codepoint versions of all valid latin accented glyphs, and these are the default entry methods, this isn't usually a problem, but yes, it really constitutes a subset of UTF-8 and you must know about this limitation to avoid bugs.
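
For concreteness, here are the two forms of "é" (a sketch; output assumes a UTF-8 source and terminal):

    #include <cstdio>
    #include <string>

    // The same visible "é" in its two Unicode forms. Byte-for-byte they differ,
    // which is exactly the limitation discussed above for naive byte-oriented tools.
    int main() {
        std::string precomposed = u8"\u00E9";    // U+00E9                       -> bytes C3 A9
        std::string decomposed  = u8"e\u0301";   // 'e' + U+0301 COMBINING ACUTE -> bytes 65 CC 81
        std::printf("precomposed: %zu bytes, decomposed: %zu bytes, bytewise equal: %s\n",
                    precomposed.size(), decomposed.size(),
                    precomposed == decomposed ? "yes" : "no");   // prints 2, 3, no
    }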


So when writing Windows applications with Win32, would you use UTF-16 internally and read/write UTF-8 files, or use UTF-8 internally and convert to/from UTF-16 at the boundaries to Win32 functions? My company currently does the first: UNICODE is defined and TCHAR is UTF-16.


I would never use anything except UTF-8 for a general text file – for the efficiency reasons listed above – unless there was a strong reason otherwise. General text files are the biggest source of problems because they have no metadata indicating what encoding they actually contain.

For everything else, it really depends what your data is for.

If your data is only ever going to hold a file path that you need to pass to the Winapi, then there's no real problem with UTF-16, although needing multiple code paths for your text handling can become an issue.

I write multi-platform C++ programs and even on Windows I use UTF-8 for all internal strings – including those that will eventually be passed to the Windows API. It just makes string handling simpler.

However, I use a range of abstraction classes around all OS calls that transparently convert to/from UTF-16 as needed. A good example is boost::filesystem for paths and file I/O, which internally stores UTF-16 on Windows but frees me as the programmer from needing to know or care about that detail – instead I can use UTF-8 everywhere and let the abstraction handle the encoding.


A good summary, but for one important detail: in UTF-16, some code points (lying in the so-called "astral planes", i.e. not in the "basic multilingual plane") take 32 bits.

Emoji, for example, lie in the first supplementary plane: 🍒🎄🐰🚴. Firefox and Safari display them properly, Chrome doesn't; no idea about IE and Opera.

UCS-2 is a strict 16-bit encoding (a subset of UTF-16), and it cannot represent all characters.

It is the encoding used by JavaScript, which can be problematic when characters that need two code units (surrogate pairs) are used. For example, `"🐙🐚🐛🐜🐝🐞🐟".length` is 14 even though there are only seven characters, and you could slice that string in the middle of a character.


This is a bit pedantic, but JS implementations don't necessarily use UCS-2 as the internal encoding. The issue is that the spec requires characters to be exposed to programs as 16-bit values. See here: http://mathiasbynens.be/notes/javascript-encoding


From my understanding, Chrome doesn't fail to display emoji because of poor Unicode handling; it's a difference in the (font?) renderer used.

See http://apple.stackexchange.com/questions/41228/why-do-emoji-... and https://code.google.com/p/chromium/issues/detail?id=90177


It should be noted that at the time JavaScript was developed, there was only the basic multilingual plane - the extension was only introduced with Unicode 2.0 in 1996.


I wonder how hard it'd be to get JavaScript/ECMAScript onto a better encoding... Do we actually have a "better" encoding?


Depends what you mean by "better". UTF-8 generally ends up using fewer bytes to represent the same string than UTF-16, unless you're using certain characters a lot (e.g. for Asian languages), so it's a candidate, but it's not like you could just flip a switch and make all JavaScript use UTF-8.


I think the size issue is a red herring. UTF-8 wins some, UTF-16 wins others, but either encoding is acceptable. There is no clear winner here so we should look at other properties.

UTF-8 is more reliable, because mishandling variable-length characters is more obvious. In UTF-16 it's easy to write something that works with the BMP and call it good enough. Even worse, you may not even know it fails above the BMP, because those characters are so rare you might never test with them. But in UTF-8, if you screw up multi-byte characters, any non-ASCII character will trigger the bug, and you will fix your code more quickly.

Also, UTF-8 does not suffer from endianness issues like UTF-16 does. Few people use the BOM and no one likes it. And most importantly, UTF-8 is compatible with ASCII.


There is absolutely no situation in which UTF-16 wins over UTF-8, because of the surrogate pairs required. That makes both encodings variable length.

UTF-32 is probably what you're thinking of.


I know that both encodings are variable-length. That is the issue I am trying to address.

My point is that in UTF-16 it's too easy to ignore surrogate pairs. Lots of UTF-16 software fails to handle variable-length characters because they are so rare. But in UTF-8 you can't ignore multi-byte characters without obvious bugs. These bugs are noticed and fixed more quickly than UTF-16 surrogate pair bugs. This makes UTF-8 more reliable.

I am not sure why you think I am advocating UTF-16. I said almost nothing good about it.


Bugs in UTF-8 handling of multibyte sequences need not be obvious. Google "CAPEC-80."

UTF-16 has an advantage in that there's fewer failure modes, and fewer ways for a string to be invalid.

edit: As for surrogate pairs, this is an issue, but I think it's overstated. A naïve program may accidentally split a UTF-16 surrogate pair, but that same program is just as liable to accidentally split a decomposed character sequence in UTF-8. You have to deal with those issues regardless of encoding.


> A naïve program may accidentally split a UTF-16 surrogate pair, but that same program is just as liable to accidentally split a decomposed character sequence in UTF-8. You have to deal with those issues regardless of encoding.

The point is that using UTF-8 makes these issues more obvious. Most programmers these days think to test with non-ascii characters. Fewer think to test with astral characters.


Anything in the range U+0800 to U+FFFF takes three bytes per character in UTF-8 and two in UTF-16 (http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings...):

"Therefore if there are more characters in the range U+0000 to U+007F than there are in the range U+0800 to U+FFFF then UTF-8 is more efficient, while if there are fewer then UTF-16 is more efficient. "

That same page also states: "A surprising result is that real-world documents written in languages that use characters only in the high range are still often shorter in UTF-8, due to the extensive use of spaces, digits, newlines, html markup, and embedded English words", but I think the "[citation needed]" is added rightfully there (it may be close in many texts, though).
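
A quick way to check the numbers for a specific string (a sketch; counts are bytes of the encoded text, C++11 literals assumed):

    #include <cstdio>
    #include <string>

    // Byte counts of the same text in UTF-8 vs UTF-16: ASCII is half the size
    // in UTF-8, while BMP characters above U+07FF (e.g. CJK) cost 3 bytes in
    // UTF-8 but only 2 in UTF-16.
    int main() {
        std::string    ascii8  = u8"hello, world";
        std::u16string ascii16 =  u"hello, world";
        std::string    cjk8    = u8"\u4F60\u597D\u4E16\u754C";   // 你好世界
        std::u16string cjk16   =  u"\u4F60\u597D\u4E16\u754C";
        std::printf("ascii: UTF-8 %zu bytes, UTF-16 %zu bytes\n",
                    ascii8.size(), ascii16.size() * sizeof(char16_t));   // 12 vs 24
        std::printf("cjk:   UTF-8 %zu bytes, UTF-16 %zu bytes\n",
                    cjk8.size(), cjk16.size() * sizeof(char16_t));       // 12 vs 8
    }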


UTF-8 is variable length in that it can be anywhere from 1 to 4 bytes, while UTF-16 can be either 2 or 4. That makes a UTF-16 decoder/encoder half as complex as a UTF-8 one.


Surrogate pairs are way more complex than anything in UTF-8.


> Even worse, you may not even know it fails above the BMP, because those characters are so rare you might never test with them.

I don't think this is too relevant because anyone who claims to know UTF-16 should know about the surrogates. And if you are handling mostly Asian text (which is where UTF-16 is more likely to be chosen), then those high characters become a lot more common.


UTF-8 has its own unique issues, like non-shortest forms and invalid code units, that you are even less likely to encounter in the wild. Bugs in handling of these have enabled security exploits in the past.


🚴 isn't displayed correctly in my Firefox, though all the other characters you mention are.

FF 26 on Win 7.


Emojis are even more fun than that. Some of them take two Unicode characters.



