UTF-8 Everywhere (2012) (utf8everywhere.org)
122 points by thefox on June 19, 2016 | 153 comments



The Python problem is amusing. Python 3 has three representations of strings internally (1-byte, 2-byte, and 4-byte) and promotes them to a wider form when necessary. This is mostly to support string indexing. It probably would have been better to use UTF-8, and create an index array for the string when necessary.

You rarely need to index a string with an integer in Python. FOR loops don't need to. Regular expressions don't need to. Operations that return a position into the string could return an opaque type which acts as a string index. That type should support adding and subtracting integers (at least +1 and -1) by progressing through the string. That would take care of most of the use cases. Attempts to index a string with an int would generate index arrays internally. (Or, for short strings, just start at the beginning every time and count.)
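
Roughly, such an opaque index could look like this (a Python sketch, purely illustrative; the class and method names are invented, and this is not how CPython works internally):

    class Utf8Index:
        """Opaque position into a UTF-8 buffer; only +n is sketched here."""
        def __init__(self, data: bytes, offset: int = 0):
            self._data = data
            self._offset = offset

        def __add__(self, n: int) -> "Utf8Index":
            off = self._offset
            for _ in range(n):
                off += 1
                # skip continuation bytes (0b10xxxxxx) to land on the next code point
                while off < len(self._data) and (self._data[off] & 0xC0) == 0x80:
                    off += 1
            return Utf8Index(self._data, off)

        def char(self) -> str:
            return self._data[self._offset:(self + 1)._offset].decode('utf-8')

With that, (Utf8Index("naïve".encode('utf-8')) + 2).char() comes out as 'ï' without the byte offset ever leaking out; subtraction and comparison would work the same way.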

Windows and Java have big problems. They really are 16-bit char based. It's not Java's fault; they standardized when Unicode was 16 bits.


I think it's even better to take this one step further and have your default "character" actually be a grapheme[1]. In almost any case where you're dealing with individual character boundaries you want to split things on the grapheme level, not the code-point level.

This doesn't matter much for (normalized) western European text, but if the language in question needs to use separate diacritical code points you'll likely end up with hanging accents and the like. Swift is the only language I know of that has grapheme clusters as the default unit of character; I'd love to see it in more places.

[1]: http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries


Never understood that either. Why is this so rare? Even in technical discussions like this one, some people will look at you funny upon hearing this suggestion.


Navigating a UTF-8 string on codepoint level is a fairly simple algorithm, since UTF-8 is self-synchronizing. This means it can easily be done without relying on external libraries or data files. It's also stable with respect to Unicode version--it always produces the same result independent of what version of the Unicode tables you use.

Moving to grapheme cluster boundaries means that the algorithm may work incorrectly if you input a string of Unicode N+1 to an implementation that only supports Unicode N. It also makes the "increment character" function very complicated. In the UTF-8 version, this looks roughly like:

    char *advance(char *str) {
      uint8_t c = (uint8_t)*str;
      /* Count the number of leading 1's in the byte (shift it into the top byte
         of a 32-bit word first, otherwise clz would see the 24 zero bits above it) */
      int num1s = __builtin_clz(~((uint32_t)c << 24));
      if (num1s == 0) return str + 1;  /* ASCII byte */
      return str + num1s;              /* lead byte: the count is the sequence length */
    }
Grapheme-based indexing looks like this:

    char *advance_grapheme(char *str) {
      while (true) {
        uint32_t codepoint = read_codepoint(str);
        str = advance(str);
        uint32_t nextCodepoint = read_codepoint(str);
        /* lookupProp is typically something like
           table[table2[codepoint >> 4] * 16 + (codepoint & 15)]; */
        GraphemeClusterBreak left = lookupProp(codepoint);
        GraphemeClusterBreak right = lookupProp(nextCodepoint);
        /* is_cluster_boundary applies the several UTR #29 rules
           based on left versus right... */
        if (is_cluster_boundary(left, right))
          break;
      }
      return str;
    }
See the vast difference in the two implementations? It's a lot of complexity, and it's worth asking if that complexity needs to be built into the main library (strings are a fundamental datatype in any language). It's also important to note that it's questionable whether such a feature implemented by default is going to actually fix naive programmers' code--if you read UTR #29 carefully, you'll notice that something like क्ष will consist of two grapheme clusters (क् and ष), which is arguably incorrect. Internationalization is often tied heavily to GUI and, especially for problems like grapheme clusters, it arguably makes more sense for toolkits to implement and deal with the problems themselves and provide things like "text input widget" primitives to programmers rather than encouraging users to try to implement it themselves.


History has shown that, when it comes to strings, developers have a hard time getting even something as simple as null-termination correct. If grapheme handling is complex, that's an argument for having it implemented by a small team of experts exactly once. The resulting abstraction might not be leak-proof, but then no abstraction is.


> (strings are a fundamental datatype in any language)

(Probably a bit "unfair" to "pounce" on an off-hand parenthetical like this, but I'm in a bit of a pedantic mood...)

This is not true for e.g. Haskell, where String is defined as [Char], i.e. a list of characters. (Of course the Haskell community is suffering from that decision, but that's another story.)

I'm not sure why strings would need to be a fundamental type, though. Sure, they would probably be part of the standard library for almost all languages, but they don't need to be "magical" in the way most fundamental types (int, etc.) are.


Thanks for the detailed response. This looks like a performance problem; big lookup tables thrash the caches.

I agree with mtviewdave that more complexity means this should really be in the std lib. Java's charAt(int i) is misleading at best.


I don't think there should even be a "default" character at all. In some cases, codepoints are the right choice; in others, graphemes are. If we make this explicit - e.g. by forbidding direct iteration over strings and instead providing functions/methods that expose codepoints and graphemes as iterable sequences - the programmer has to make a conscious choice every time they iterate.
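
In Python terms, the two choices might look like this (using the third-party regex module for grapheme clusters, since the stdlib re has no \X):

    import regex  # third-party: pip install regex

    s = 'e\u0301'                          # 'e' followed by U+0301 COMBINING ACUTE ACCENT
    by_codepoint = list(s)                 # 2 items: 'e' and the combining accent
    by_grapheme = regex.findall(r'\X', s)  # 1 item: the accented letter as a unit

Forcing the caller to pick one of the two spellings is exactly the conscious choice being argued for.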


In principle I'd agree with you, but the response from most people learning a language that treats strings like that is likely to be "You can't iterate over strings? And it doesn't even have a character type!" I'm not saying people are stupid: I had never really thought about the difference between a code point and grapheme until relatively recently when I had to do some low-level text layout stuff and it became important.

My point is that I don't think throwing people into the deep end and expecting them to grok the codepoint/grapheme division before they get to use the language is likely to be productive. Defaulting to graphemes carries the advantage that if a programmer who doesn't know much about languages that require more Unicode finesse uses it purely intuitively, they'll get things right in a lot of cases. Using codepoints, on the other hand, makes it easy to put text through a grinder while doing relatively innocent things.

Unfortunately while we're discussing the finer points of using graphemes and codepoints in APIs a million supplementary characters were brutally eviscerated by code running on languages that haven't quite gotten past UCS-2 XD


I think that throwing people into the deep end is the only way to get them to do it right in this case. Defaulting to graphemes is too often the wrong answer, as well (e.g. you don't want that in a parser).

Really, this is not dissimilar to the "what do you mean, there are more letters than A to Z?" issue that plagued software written in the US back before Unicode became dominant. The way we (my perspective on this is as a native speaker of a language with a non-Latin alphabet) eventually solved it is by basically forcing Unicode onto those people. It broke their simple and convenient picture of the world, and replaced it with something much more complicated. But it was necessary.

My position is that letting programmers get away with a simplistic view of text processing (by allowing defaults that "mostly" work) is what creates those issues. So adjusting the abstractions such that they expose more of the underlying complexity is a good thing. People SHOULD believe that doing text processing the right way is hard, because it is.


Perl 6 supports them natively too.


Perl has historically had excellent Unicode support. I remember going from Perl to Python (like a decade ago), and being annoyed at how messy Unicode support still was. Ruby, too, lacked good Unicode support, for many years after Perl had it pretty good.

But, Perl 6 definitely gets it more right than other implementations I've seen.


Since all other systems have standardized on code points this would lead to subtle incompatibilities. For example, checking for length prior to inserting in a database must be done in code points.

What I find more frustrating is how the documentation for many systems describes the basic unit of text as a character, without specifying whether a code point or grapheme is meant, and without leading people to an explanation of the difference. There is still a lot of software that processes Unicode text incorrectly, not because it is difficult to do so, but because nobody told the developer how things should be done.


1 grapheme = 1 unit of the datatype is a good start. There is a lot of complexity in Unicode, but a simple "char" is the mental model most developers carry, largely blind to that complexity. Kudos to anyone trying to bring that more "common sense" model closer to reality.

However... How does a language or library attempting to abstract this part (like Swift might) deal with the other, unrelated, annoying aspects of Unicode? Even if 1 glyph is 1 "char", and we normalize all the inputs, there is still, say, bidirectional text.


I don't think there is an easy answer to this. Even just pure RTL text is hard to support: IIRC Android and iOS only recently made all the built-in views support it. This is complicated by the fact that most developer teams don't have, and can't afford to hire, someone familiar with RTL to implement this kind of stuff.


I don't think the rules in Unicode are simple for anyone. I have seen professional developers who are native speakers of RTL languages screw it up. All this stuff about the directionality of a paragraph and the embedded markers... It boggles the mind. But a truly international product should be getting it right. Sad mismatch there.

PS: bit of trivia that people forget these days, Win32 has been supporting it for longer than android/iOS.


> Swift is the only language I know of that has grapheme clusters as the default unit of character

Also Elixir and Perl 6. For a bit more info see https://news.ycombinator.com/item?id=


There are a couple cases for string indexing, usually involving parsing or regular expressions. You might want to slice the quotes off of a quoted string, or slice from one match of a regular expression to a match of a different regular expression starting at a different index. These come up infrequently enough that it doesn't make sense to make a better API just for these use cases, but frequently enough that it would be a serious impediment if we didn't do some kind of string indexing.

I agree, however, that it's completely irrelevant whether the indexes correspond to code units (i.e. byte offsets in UTF-8) or whether they correspond to code points (how it works in Python currently), as long as we have some way to store, compare, and otherwise manipulate locations within a string.

Some Rust developers at one point proposed making string indexes their own (opaque) type, as you suggest, so that they couldn't be confused with integers used for other purposes. The extra complexity of such an API meant that this proposal was never really taken seriously, and it only prevents a small category of programming errors.

You might be interested in looking at some string APIs which are mostly without string indexing, like Haskell's Data.Text, which is one of the most well-designed string APIs ever made.

https://hackage.haskell.org/package/text-1.2.2.1/docs/Data-T...

As for Windows, my Windows apps use UTF-8 everywhere, and then convert to wchar_t at the last possible moment when interacting with the Windows API. I believe this is what UTF-8 Everywhere suggests.


> You might want to slice the quotes off of a quoted string

That's why I really like dealing with UTF-8. As you said, you can just index by byte instead of having to worry about code point boundaries. This is because the encoding of one code point will never match inside the encoding of another, larger one.

So if I search for a one byte quote character it will never match the second, third or fourth bytes in a larger code point. Same with other control characters.

Works when searching for 2 or 3 byte code points as well.
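
A quick illustration of that self-synchronizing property (Python, just to show the byte values):

    text = '"héllo", he said'.encode('utf-8')
    # Continuation bytes are 0x80-0xBF, so an ASCII needle like b'"' can never
    # match inside the two bytes of 'é' (0xC3 0xA9):
    text.index(b'"')                  # 0: the opening quote
    text.index(b'"', 1)               # 7: the closing quote, not a false hit inside 'é'
    # Multi-byte needles behave the same way:
    text.index('é'.encode('utf-8'))   # 2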


> in Rust... The extra complexity of such an API meant that this proposal was never really taken seriously,

In fact it looks like even standard indexes have been deprecated, in favor of Iterators over the string:

https://doc.rust-lang.org/stable/std/primitive.str.html#meth...

Iteration is better IMO.


Makes a lot of sense. I get frustrated with how explicit I have to be with Rust sometimes (currently I'm writing something with a lot of reusable, immutable structures and am put off by how much Rc.clone() that requires) but making it known to the programmer that every indexing of a String is effectively an iteration is a good call.


Yeah, this is a tough thing when coming from other languages.

I've been using references pretty effectively to get around some issues like this, though in other cases Rc is my only resort.

If you have some code up on github, I'd be happy to take a look and see if there's some other options that are less cumbersome.


The advantage of an opaque type is that you should be able to add one to it and advance one rune. Go copies the string functions from C, but doesn't (at least as of last year) offer "advance one rune" and "back up one rune" functions.


If by rune you mean code point, that doesn't really gain you anything. If you are searching for something it's faster to just scan a byte at a time.

If you want to split the string into "characters" you need to do it at the grapheme level (multiple code points), and for that you need to use a Unicode library. But that adds overhead when you're just scanning.

A JSON or XML scanner does not need the added overhead of advancing by code point or grapheme.


I don't understand the argument for "characters = grapheme clusters". From my perspective, there are a lot of different ways you'd want to iterate over a string. Grapheme cluster breaks, tailored grapheme cluster breaks, word breaks, line breaks, code points, code units… all of these make sense in some context. However, there are precious few times that I've wanted to iterate over grapheme clusters, so telling people that they should do that instead of something else doesn't make sense to me. (I mean, what problem is so common that we would want to iterate this way by default?)

For parsing, it often makes sense to iterate over code points or code units, since many languages are defined in terms of code points (and you can translate that to code units, for performance). XML 1.1, JavaScript, Haskell, etc... many languages are defined in terms of the underlying code points and their character classes in the Unicode standard. JSON and XML 1.0 are not everything.


We're pretty much on the same page here. When you want to slice a string (because you can only display or store a certain amount), or you want to do text selection and other cursor operations, you can't do it by code point. That's where you want to break at character boundaries, which are graphemes or grapheme clusters.

For parsing it's easier to just scan for a byte sequence in UTF-8 because you know what you're looking for ahead of time. If you're looking for a matching quote, brace, etc. you just need to scan for a single byte in your text stream. Adding a smart iterator to the process that moves to the start of each code point is not necessary and will slow things way down.

I just gave JSON and XML as examples and not an exhaustive list. If you know the code points you are scanning for it's way more efficient to scan for their code units. The state machine in a parser will be operating at the byte level anyways.

I have yet to see a good example where processing/iterating by code point is the better choice (other than the grapheme code of the unicode library).


I'm not convinced that state machines will operate at the byte level. First of all, not all tokenizers are written using state machines. Even if that is the mathematical language we use to talk about parsers, it's still relatively common to make hand-written parsers. Secondly, if you take a Unicode-specified language and convert it to a state machine that operates on UTF-8, you can easily end up with an explosion in the number of possible states. Remember, this trick doesn't really change the size of the transition table, it just spreads it out among more states. On the other hand, you can get a lot more mileage out of using equivalency classes, as long as you're using something sensible like code points to begin with.

If you're curious, here's the V8 tokenizer header file:

https://github.com/v8/v8/blob/master/src/parsing/scanner.h

You can see that it works on an underlying UTF-16 code unit stream which is then composed into code points before tokenization. This extra step with UTF-16 is a quirk of JavaScript.

If you think that V8 shouldn't be processing by code point, feel free to explain that to them.


State machines would have to operate on the byte level. Otherwise each state would have to have 65536 entries. The trick to handle UTF-8 would be to have 0-127 run as a state machine and > 127 break out to functions that handle the various Unicode ranges that are valid for identifiers.

For languages that only allow non-ASCII in string literals, a pure state machine would suffice.

Not sure why you're mentioning parsers. At that point you're dealing with tokens.

As for UTF-16, it's an ugly hack that never should have existed in the first place. Unfortunately the Unicode people had to fix their UCS-2 mistake.

Since JavaScript is standardised to be either UCS-2 or UTF-16, it probably made sense to make the scanner use UTF-16.


State machines don't have to operate on the byte level because the tables can use equivalency classes. This will often result in smaller and faster state machines than byte-level state machines, if your language uses Unicode character classes here and there.


Looks like Javascript source code is required to be processed as UTF-16:

ECMAScript source text is represented as a sequence of characters in the Unicode character encoding, version 3.0 or later. [...] ECMAScript source text is assumed to be a sequence of 16-bit code units for the purposes of this specification. [...] If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16.


Right, but the UTF-16 is read code point by code point, not code unit by code unit. At that point, it might as well be UTF-8 or UTF-32.


Walking over grapheme clusters is common in UIs, e.g. visually truncating a string.

Really I think you are arguing against the notion of "default iteration" altogether. As you say, the right type of iteration is context dependent, and it ought to be made explicit.


I'm not sure it is so common in UIs. Truncation is done by a single library function, so that's one case where it's used. Another case is for character wrapping, but that's fairly uncommon. I'm having trouble coming up with another case where it's used. Font shaping is done by a font shaping engine, which applies an enormous number of rules specific to the script in use. Text in a text editor isn't deleted according to grapheme cluster boundaries, and the text cursor doesn't fall on grapheme cluster boundaries either. These are all rules that change according to the script in use.


Actually, Go has supported ranging over the code points of a string since Go 1 [0], and has unicode/utf8.DecodeRuneInString, which produces the same iteration sequence as range.

[0]: https://blog.golang.org/strings


> The Python problem is amusing. Python 3 has three representations of strings internally (1-byte, 2-byte, and 4-byte) and promotes them to a wider form when necessary. This is mostly to support string indexing. It probably would have been better to use UTF-8, and create an index array for the string when necessary.

It's especially amusing because Python 3 strings internally cache their UTF-8 equivalent once it has been used.


I didn't know that. That's funny. They really are doing this the hard way.


It really is the hard way. For the curious, read through the comment here describing the data layout and invariants:

https://github.com/python/cpython/blob/master/Include/unicod...


Yeah, I didn't really understand this until I heard Go/Plan 9 guys ranting about this. In other words, char* IS unicode if you use utf-8. Otherwise you need wchar_t and all that junk.

I think Python got unicode in the same era as Java, so it's understandable that Python 2 doesn't work like this. But if they are going to break the whole world for unicode, I also think it would have been better to do something like Go does (e.g. the rune library).


Using raw bytes makes it possible to create corrupted strings. Slicing is not a safe operation. Python 3 doesn't have this issue.


It's very easy to not get corrupted strings when byte indexing into UTF-8. In a while loop, if the index is not at the end of the string and the top two bits of the character at the index are both one, advance the index by one.

Or throw an invalid index exception if the top to bits are one if that makes more sense for the language you're using.


In a while loop, if the index is not at the end of the string and the top two bits of the character at the index are both one, advance the index by one.

So what you're saying is that it's very easy to get corrupted strings by anyone who doesn't have an understanding of UTF-8 at the bit level - which in my experience seems to be the majority of programmers.


Not at all. The language/library implementor should handle the details. My example was the argument checking of the slice function.

And indexing by code points doesn't solve the problem either. The majority of programmers don't know what a grapheme is or how to collate or sort unicode strings.


Except what happens in the real world is that people who are used to indexing and slicing ASCII strings however they please don't think "I should use a library for this", instead they just keep indexing and slicing as per usual and don't think anything of it until their Chinese customers start complaining of random program crashes, or missing text - which the developer then has difficulty trying to reproduce because hey, it works for them.

My only gripe with your argument is that I don't think it's easy to avoid corrupted text in modern text processing - which is precisely why there are libraries for it because it's actually really easy to get it wrong - even if you know what you're doing.


Which is why we have languages like Go where we can put those types of developers. Incidentally, Go uses UTF-8. Higher-level languages like Go, Python, etc. were designed so newbie and/or ignorant programmers could do less damage.

When I was working on a project before Unicode we would switch our dev PCs to the other languages we supported. What a pain that was. The only issues we had were when a translated string was much longer than the screen space allocated to it. I believe Swedish was the main culprit. No problems with simplified and traditional Chinese as those were more compact. I have no sympathy for dev shops that can't get internationalization right. As with everything else in the corporate dev world, management doesn't seem to want to hire/retain the more experienced programmers.

I think you have a gripe with my argument because you may be missing my point. If a high-level language chooses to let a programmer index into a UTF-8 string at the byte level (for performance and other reasons), it's very easy for it to prevent the programmer from slicing in the middle of a multi-byte code point.

The reason is that the language function to slice a Unicode string would either throw an exception or just advance to the next valid index. There wouldn't be a way for the programmer to slice a Unicode string in the middle of a code point.


I think you have a gripe with my argument because you may be missing my point

I get your point, it just doesn't apply to many real world situations I've seen where you don't have the luxury of just using a higher level language or a library that takes care of all these things, or keeping programmers who don't understand what they are doing away from that sort of thing.

The most egregious example that I've personally seen was a developer working on a legacy Cobol banking program that needed Chinese support retro-fitted to it.

The app was originally only developed with ASCII in mind and so sliced through strings willy-nilly, which naturally caused problems with Chinese text.

The developer working on the "fix" before me, was calling out to ICU through the C API of the version of Cobol that we used and was still messing things up - he'd actually modified ICU in some custom way to prevent the bug from crashing the program, but was still causing corrupted text.

I basically undid all his changes, and wrapped all COBOL string splicing to call a function that always split a string at a valid position - truncating invalid bytes at the start/end as necessary. Much simpler and resulted in the removal of an unnecessary dependency on ICU.

This bug had been outstanding for several months when I first joined that company, and it was the first one I was assigned to work on - and luckily for them they'd accidentally hired someone who had done lots of multilingual programming before.

it's very easy for it to prevent the programmer from slicing in the middle of a multi-byte code point.

Okay, but even you made a mistake in your first example of what to do, and that's the sort of code that someone who knows what they are doing could write, and will seem to work in the conditions under which it was tested (working on my machine, ship it!), but that will cause seemingly random problems once it hits users.


> I get your point, it just doesn't apply to many real world situations I've seen where you don't have the luxury of just using a higher level language or a library that takes care of all these things

No, I still think you're missing some of it. I am not advocating that what I said is the solution for everything.

Someone said that slicing UTF-8 strings leads to string corruption and endorsed the Python 3 Frankenstein unicode type as a way to avoid it. I just gave a way of preventing that.

Now you argued that a novice programmer would fail to implement it properly. So you're comparing my method implemented by a novice programmer to a method implemented by professional compiler writers. That hardly seems fair. :)

So my argument is that if my method were to be implemented by professional compiler writers it would prevent corrupted strings while still using UTF-8 as the internal representation.

> I basically undid all his changes, and wrapped all COBOL string splicing to call a function that always split a string at a valid position - truncating invalid bytes at the start/end as necessary.

> luckily for them they'd accidentally hired someone who had done lots of multilingual programming before.

So an expert programmer implemented a string splitting function that didn't corrupt strings. :D

> but even you made a mistake in your first example of what to do

I'm writing this on an iPad while watching TV and playing a game on another Android tablet while looking at the Wikipedia UTF-8 article on a tiny phone screen while a little white dog is trying to bite my fingers (wish I was making this up). Not exactly my usual programming environment. ;)


> Now you argued that a novice programmer

sigh if only it was novice programmers making these mistakes :-/


Upvote for that comment.

The stuff I've seen in some people's multithreaded code just makes me want to cry.


It's impossible to get corrupted strings if you use wide characters. "Impossible" is better than "hard not to".


Which wide characters are you talking about? Because on Windows, where wide characters are 16 bits, it's quite possible to get corrupted strings (and in fact quite a few well-known programs, written by quite well-known software companies, make this exact mistake).

All you need to do is index/slice a string half-way through any character that is outside Unicode's Basic Multilingual Plane.


This drives me crazy. The Win32 API was designed for UCS-2. Then UTF-16 came out and the API was shoehorned into using it, but as you said they still haven't caught all the places where it thinks it's UCS-2.


"wide enough" So if it's Unicode, each element is a Unicode character.


First of all, Unicode doesn't define characters; it defines codepoints.

I get that this might seem pedantic, but it's important to be pedantic about this, otherwise misconceptions and ambiguities occur e.g. 'just use wide characters' - the definition of which changes depending on the platform.

Second of all, "wide enough" for all intents and purposes means 32 bits. Technically Unicode only needs 21 bits to cover the currently defined codespace, but computers don't deal well with that, and so 32 bits is the minimum "wide enough" character size.

This creates a lot of wasted space and memory, not to mention pushes medium length strings across cache line boundaries for very little benefit - the ability to directly index/slice strings without accidentally corrupting data.

Now obviously you want to avoid accidentally corrupting data; the tradeoff comes down to whether you need direct, arbitrary indexing, or whether it's worth doing some processing to determine the correct place to split in order to make space gains.

The technical world has come down overwhelmingly in favour of the latter, and that's why you see hardly anyone using utf-32. It's simply not as good a solution for most real world concerns.


Not impossible. If you slice in the middle of a grapheme then you get a corrupted string as well. You'll get an alternate glyph instead of a square box but it's still corrupted.


It's not a corrupted string. You may have mangled it in a way that doesn't preserve all semantics, but no one will crash with an encoding issue.

A "string" means "a sequence of characters". Wide characters (or the equivalent interface) preserve this property.

Graphemes operate at a higher level than characters. You could construct a grapheme-strings, I suppose, but that has tons of edge cases, and if you don't like character-strings, I doubt you will like grapheme-strings.


Why should it crash? The proper procedure when validating a UTF-8 string is to replace errors with U+FFFD.
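
For instance, in Python:

    >>> b'caf\xe9'.decode('utf-8', errors='replace')   # stray Latin-1 byte for 'é'
    'caf\ufffd'
    >>> # with strict error handling the same input raises UnicodeDecodeError instead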

The term character has many meanings. Graphemes are characters and that's what most users expect, something that's displayed as a single graphical unit.


I use "character" in the same way that the Unicode Consortium uses the word. Though "code point" would be more precise.


That's what they were hoping for. Didn't turn out that way. From icu-project.org:

"As with glyphs, there is no one-to-one relationship between characters and code points. What an end-user thinks of as a single character (grapheme) may in fact be represented by multiple code points; conversely, a single code point may correspond to multiple characters."


Or, if by "string" we mean "sequence of code points" (rather than graphemes) then it doesn't get corrupted by any chopping or rearrangement which only permutes the code points.

If we chop UTF-8, we can end up with bad characters, or possibly invalid overlong forms.


I consider changing a glyph to some other glyph(s) as corruption. Take an emoji flag character as an example. Split it between the code points and you end up with two boxed letters.
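
Concretely, in a language that slices by code point (Python here):

    >>> flag = '\U0001F1FA\U0001F1F8'   # the US flag: two regional-indicator code points
    >>> len(flag)
    2
    >>> flag[:1], flag[1:]              # slicing between them gives two boxed letters
    ('🇺', '🇸')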

If you chop in the middle of a multi-byte code point then you end up with U+FFFDs. In both cases the visual representation has been altered.

As I wrote elsewhere, it is easy for the slice routines of a language to check whether the programmer tried to slice in the middle of a multi-byte code point and either return an error or just advance to the start of the next code point.


UTF-8 slicing destroys characters and graphemes.

Slicing a code-point-character string destroys only graphemes.

Clear win.

A code point string has other niceties, like being indexed by simple integers. If end is the index of the last code point of a grapheme, then the next grapheme starts at end + 1.

If end is the index of the last UTF-8 encoding of a code point, then the next grapheme does not start at end + 1.

We can have it so that it does by making end point to the last byte of the UTF-8 encoding of the code point; but then it doesn't point at the start of the character, recovering which is awkward.

The code uglification can be addressed by piling on abstractions: integer-like iteration gizmos that can be incremented and decremented thanks to function or operator overloading.

I feel that that level of abstraction has no place in character-level data processing, if anywhere, whose basic operations should be expressible tersely in a few machine instructions.

Also, we mustn't lose sight of what the T means in UTF-8: transfer. It's not called UPF-8 (the Unicode processing format in 8 bits).

Working with UTF-8 instead of with the objects that UTF-8 denotes is like working with a textual representation of Lisp s-expressions that still contain the parentheses and whitespace delimitation, and quotes around strings and so on, refusing to parse them to obtain the object which they represent. People who do this should immediately turn in their CS degrees.

All those other issues you refer to are addressed by more parsing. If you want the glyphs, the correct thing is to parse the code-point string and make a list or vector of glyph representations.

With that representation you can still break the text "carpet" into "car" "pet" which destroys semantics; that is dealt with by parsing into words.

Chopping lists of words destroys phrases; so parse phrases, and transform at the phrase level.

And so on.


> Clear win.

I'd call it a slight improvement. And after the major step back of using 2 to 4 times more memory for strings I'd call it a net loss.

> A code point string has other niceties, like being indexed by simple integers.

Again, no benefit of this. The only argument I've heard here is to prevent bad slicing and I've shown a way to prevent that.

> Also, we mustn't lose sight of what the T means in UTF-8: transfer. It's not called UPF-8 (the Unicode processing format in 8 bits).

By this argument we can't use UTF-16 or UTF-32 for internal processing of strings either. Back to code pages then.


His point was that the transfer format should be conceptually independent of the processing. Obviously it has to be encoded in RAM in some way, but the programmer doesn't need to worry about the memory layout.


Why? Why should it be conceptually different if it's easy to work with the encoded form?

Many Unicode-aware languages work with UTF-8 or UTF-16 internally, so working with the "transfer format" is common practice.

While it may not be necessary to know how the languages you program in work under the hood, expert programmers do want/need to know. That way they can write better code, or switch to another language, or get the language devs to improve their internal handling.


Think of JSON.

The programmer shouldn't have to know that a newline character is written as \n in a JSON string.

The JSON string "a\nb" takes 6 characters to write, but its length should be given as 3.

99% of people want to manipulate a JSON model, not the JSON (or BSON) serialization itself. The 1% can still use a byte array and do whatever hacks they like.


Bad example. If you want to embed that string in your code you have to type those 6 characters anyways.

A better example is if you want to find a newline in a string. If you do a find in a UTF-16 string it may be position 8, and a find in UTF-8 may be position 12. Does it matter what the actual number is? NO. You just pass it to the next function or whatever.
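
For example (Python, just to make the numbers concrete):

    >>> s = 'héllo\nwörld'
    >>> s.encode('utf-8').index(b'\n')
    6
    >>> s.encode('utf-16-le').index('\n'.encode('utf-16-le'))
    10
    >>> s.index('\n')   # by code point it's 5; none of these numbers matter on their own
    5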


Oops, made a mistake. The bit pattern in the upper two bits is 10, not 11, if you're not at the first byte of a code point.
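
So the boundary check is really just a mask against the top two bits, something like (a sketch):

    def snap_forward(data: bytes, i: int) -> int:
        """Advance i past any continuation bytes so it lands on a code point boundary."""
        while i < len(data) and (data[i] & 0xC0) == 0x80:   # 0b10xxxxxx
            i += 1
        return i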


Good thing it's very easy not to make mistakes :->


That's the argument for opaque indices. You can only slice at rune boundaries.


Likewise, I wish JS had just changed the internal representation of strings to UTF-8, and accepted that some older code might break, instead of adding in the new string/regex bits. It would have made far more code simply start working with large/multibyte/international characters than it ever would have broken.


Wouldn't it be possible with a global switch (i.e. like "use strict")? You could just switch the default character encoding without breaking existing JS websites/apps.


I haven't done Python for a while, so maybe it's changed, but I thought the new Python 3 strings were indexed by byte rather than character, and that it was 'as designed' despite being very non-intuitive (unlike the rest of Python).

Edit: see pjscott's comment below - it's by code point, not byte, but still not by character.


No, Python 3 indexes by character, not byte.

    $ python3
    Python 3.4.3 (default, Oct 14 2015, 20:28:29) 
    >>> s = "オンライン"
    >>> s
    'オンライン'
    >>> len(s)
    5
    >>> s [0]
    'オ'
    >>> s [1]
    'ン'
    >>> s [2]
    'ラ'
    >>> s[3]
    'イ'
    >>> s[4]
    'ン'


It's important to be insufferably pedantic about this: they index by code point, which is almost but not quite what people expect a character to be.

    $ python3
    >>> "위키백과"[1]
    '키'
    >>> "위키백과"[1] # Should be identical, right?
    'ᅱ'


> It's important to be insufferably pedantic about this

That is perhaps the most succinct and accurate way I've heard to explain and justify why you're sounding like a wet blanket to people that may not understand, while acknowledging that you know how you sound, but there is a reason for it. I expect to use this in the future.


That's an awesome example! To show more of how it works for Western language speakers who might be confused, how about

c = "é"

c[0], c[1]

It's the same phenomenon with Latin characters. (Extra bonus: for me, the combining acute accent character then combines in the terminal with the apostrophe that Python uses to delimit the string!)

Another idea to see the effect is "a" + "é"[1]. (The result is 'á'... and as in your examples, a precomposed "é" is also available which doesn't exhibit any of these phenomena.)
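
Spelled out in a REPL, forcing the decomposed form with unicodedata:

    >>> import unicodedata
    >>> c = unicodedata.normalize('NFD', 'é')   # decomposes to 'e' + U+0301
    >>> len(c)
    2
    >>> [hex(ord(ch)) for ch in c]
    ['0x65', '0x301']
    >>> 'a' + c[1]
    'á'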


Same with English, sort of. What is

    "difficult"[2]?
Is it 'f'? Or the ligature `ffi`? :)


Seems like Go handles this well with its UTF-8 runes.


It's interesting how history seems to have repeated itself with UTF-16. With ASCII and its extensions, we had 128 "normal" characters and everything else was exotic text that caused problems.

Now with UTF-16, the "normal" characters are the ones in the Basic Multilingual Plane that fit in a single UTF-16 code unit.


It's worse. With UTF-8, if you're not processing it properly it becomes obvious very quickly with the first accented character you encounter. With UTF-16 you probably won't notice any bugs until someone throws an emoticon at you.


Unfortunately not. It's easy to process UTF-8 such that you mishandle certain ill-formed sequences that you are unlikely to encounter accidentally. IIS was hit [1], Apache Tomcat was hit [2], PHP was hit twice [3] [4].

UTF-16 has its own warts, but invalid code units and non-shortest forms are exclusive to UTF-8.

[1] http://www.sans.org/security-resources/malwarefaq/wnt-unicod...

[2] http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2008-2938

[3] https://www.cvedetails.com/cve/CVE-2009-5016/

[4] https://www.cvedetails.com/cve/CVE-2010-3870/
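
For reference, the classic IIS hole was an overlong encoding of '/'; a conforming decoder has to reject it (Python shown purely as an illustration):

    # 0xC0 0xAF is the overlong 2-byte form of '/' used in the IIS exploit.
    b'\xc0\xaf'.decode('utf-8')   # raises UnicodeDecodeError: 0xC0 is never a valid lead byte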


This article is from 4 years ago. Since then, UTF-8 adoption has increased from 68% to 87% of the top 10 million websites on Alexa:

https://w3techs.com/technologies/history_overview/character_...


_Unicode_ adoption increased to 87%. At the cost of non-Unicode encodings.

UTF-16 isn’t good enough for the web: even for content in Ukrainian or Hebrew, UTF-8 saves sizeable bandwidth, because spaces, punctuation marks, newlines, digits, and English-inspired HTML tags all encode in 1 byte per character in UTF-8, and for the web, bandwidth matters.


> _Unicode_ adoption increased to 87%. At the cost of non-Unicode encodings.

Am I reading that site incorrectly? It says UTF-8: 87.2%, not Unicode.

Then down below:

" The following character encodings are used by less than 0.1% of the websites"

UTF-16

https://w3techs.com/technologies/overview/character_encoding...


When you create a table in MySQL, a text attribute (VARCHAR etc.) is not encoded in UTF8 by default.

I think UTF8 should be the default and only format for storing text attributes in all databases and all other text encodings should be removed from database systems.


We can't even convince Microsoft, Apple, and everything else Unix based to agree on line endings. How on earth are we going to convince everyone that one character encoding format is the only way they should store their data?

Annoying as it is to deal with, our history as computer scientists demands that we maintain compatibility with older systems and encoding formats that were once used but are now almost forgotten. If we removed all the other encoding formats (code paths that, while underused, still function perfectly fine) we would lose the ability to parse and manipulate a lot of old data.


The universal line ending character is \n. (Except in Microsoft's universe, but that's never been compatible with the rest.)


Not true: most of the plain-text protocols like SMTP, FTP, HTTP/1.1, etc. also mandate \r\n.


I have the feeling that back in 1990, ISO 10646 wanted 32-bit characters but had no software folks on that committee, while the Unicode people were basically software folks but thought that 16 bits were enough (this dates back to the original Unicode proposal from 1988). UTF-8 was only created in 1992, after the software folks rejected the original DIS 10646 in mid-1991.


This seems specific to Windows. UTF-8 is already standard on Linux and the web, for example. It's just Microsoft.


UTF-16's (and UCS-2's) tentacles extend further than just Windows NT. A lot of stuff created in the 90's uses it, notably Java, .NET (and thus C#), JavaScript/ECMAScript, parts of C++, GSM, Python (older versions), Qt, etc.


An interesting suggestion they make is to keep utf-8 also for strings internal to your program. That is, instead of decoding utf-8 on input and encode utf-8 on output, you just keep it encoded the whole time.


What would be a good alternative for strings internal to your program?

I work with multilingual text processing applications, and I strongly support that concept. A guideline of "use UTF8 or die" works well and avoids lots of headaches - it is the most efficient encoding for in-memory use (unless you work mostly with Asian charsets where UTF16 has a size advantage) and it is compatible with all legal data, so it's quite effective to have a policy that 100% of your functions/API/datastructures/databases pass only UTF8 data, and when other encodings are needed (e.g. file import/export) then at the very edge of your application the data is converted to that something else.

Having a mix of encodings is a time bomb that sooner or later blows up as nasty bugs.


Abstraction is the alternative. Design an API that treats encodings uniformly, and the encoding becomes an internal implementation detail. You can then have a polymorphic representation that avoids unnecessary conversions. NSString and Swift String both work this way.


Vector of pointers to grapheme clusters for example.

Sometimes vector of objects that include other information, like glyphs etc.


Doesn't Python use UTF-16 internally?

IMHO, UTF-16 is the worst of both worlds. It breaks backwards compatibility in the simple case and wastes storage, but still has to have complex multi-byte decoding because it's not a fixed length encoding.

UTF-8 is probably the best compromise of the lot, with the advantages of UTF-32 being outweighed by the massive overhead in the most common case.


No. Python 2.7 uses UCS-2 or UCS-4 depending on how it was compiled. Python 3 uses ASCII, UCS-2 or UCS-4 determined at runtime per string depending on the string's contents.
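
You can see the per-string representation picking different widths via sys.getsizeof (exact numbers vary by CPython version and build, so none are shown here):

    >>> import sys
    >>> [sys.getsizeof(s) for s in ('abcd', 'абвг', '\U0001F600' * 4)]
    # three different sizes for four "characters": stored 1, 2 and 4 bytes wide respectively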


I'm on Mac and I've had problems with Chrome sending ajax requests or decoding ajax responses in ISO-8859-1, if I remember well. I had to add "; charset=utf-8" to my headers. I remember it was a browser problem, and I think it was the same for all browsers.


For backwards-compatibility's sake, where a web page doesn't specify a character set, browsers will assume the predominant pre-Unicode encoding used in your region.


Why is this still the case? UTF-8 is dominant now; wouldn't it make more sense to assume UTF-8?


The older the site, the less likely it is that it will have been updated. Therefore, it's reasonable to assume that newer sites will either declare UTF-8, or can be modified to declare UTF-8, while old sites stay the way they always were, pre-UTF-8.

Keeping the backwards-compatibility heuristic the same makes sense.


Old sites lacked encoding declarations, and old browsers (e.g. early versions of IE) didn't support them.

Sites that want UTF-8 can ask for it.


One word: Java


This may be off-topic, but I wonder if anyone is planning a redesign of Unicode for the far future? Or is there a better way to handle characters, so we don't require a giant library like ICU?


If your goal is to eliminate ICU, there's not any change you can realistically make. Unicode has problems, but the most obvious things to fix (CJK unification, precomposed versus combining characters, different semantic characters with completely identical glyphs (Angstrom sign versus A-with-circle-above, e.g.)) do not eliminate the need for ICU.

Languages are horribly complicated. The Turkish ı/İ issue makes capitalization a locale-dependent thing, and things like German ß/ẞ/ss/SS make case conversion in general mind-boggling. The treatment of diacritics in Latin script for collation purposes differs very heavily between major European languages, so sorting and searching are again locale-dependent. And by the time you're dealing with the locale mess of languages, handling locale-specific number, date, and time representations is pretty much trivial.

The need for giant Unicode character tables and CLDR tables, or tables that capture similar information, is quite frankly necessary to handle internationalization to any substantial degree.
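
A small taste of the case-conversion mess, using Python's default (locale-independent) Unicode case mappings:

    >>> 'straße'.upper()
    'STRASSE'
    >>> 'STRASSE'.lower()   # the round trip is lossy: the ß is gone
    'strasse'
    >>> 'I'.lower()         # fine for English, wrong for Turkish (which wants a dotless ı)
    'i'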


"so sorting and searching are again locale-dependent"

It's worse. Sorting is dependent on the task at hand. http://userguide.icu-project.org/collation: "For example, in German dictionaries, "öf" would come before "of". In phone books the situation is the exact opposite."

That page has lots more 'interesting' cases, for example:

"Some French dictionary ordering traditions sort accents in backwards order, from the end of the string. For example, the word "côte" sorts before "coté" because the acute accent on the final "e" is more significant than the circumflex on the "o"."

That means that, given two strings s and t such that s sorts before t, you can append characters to t to get u which sorts before s. EDIT (after reading the reply of kelnage): _for some strings s and t_


No, I don't think that example does imply that. I interpret it as meaning that for the variants of the same "base word" (i.e. all characters are unaccented) the ordering is defined by the positions of the accents rather than their respective orderings. It says nothing about two words that have different lengths or bases.


What would you do differently? Unicode isn't complex because people like things that are hard to understand, it's complex because it took on an exceedingly difficult problem.


Given more and more custom fonts in OSes/websites, maybe by using some new APIs we don't need to specify everything in the Unicode standard. We could design a new font format, or just a separate data file, to store that locale-specific information. The Unicode code points then become parking slots for different fonts (with locale info to be registered). And we can use the standard/default data file to keep the old info from the current Unicode standard (say, Unicode 8.0).

This is just my first thought. It seems that the job of ICU is transferred to the OS or web browser.


I think the Unicode standard should not limit the use of fonts. Instead, let the font or the additional locale data file tell us how to deal with those locale issues.


If you want to handle characters by anything much simpler than current Unicode, you need to simplify the reality that Unicode describes, changing or eliminating a bunch of major human languages. Not all of them, and not even most of them, but still hundreds of millions of people would need to change how they use their language.

It could happen in a century or two, actually, we are seeing some language trends that do favor internationalization and simplification over localization and keeping with linguistic tradition.


Simplification (caused by internationalization) and diversification (caused by localization) are two ends of a spectrum, but languages, in both their spoken and written forms, have bounced between those ends throughout history. In a century or two, by the time simplification has succeeded on Earth, the settlers on Titan will rebel with their own graphical symbols for displaying language.


> we are seeing some language trends that do favor internationalization and simplification over localization and keeping with linguistic tradition.

I know you're not necessarily advocating it, but if our cultures change to adapt to our technological limitations, that's the reverse of what I think should be happening - there's a problem with the tech.


Right, there will be fewer languages in common use. The faded ones could be kept alive in the digital world by using special fonts.


ICU would still be necessary for collation and case conversion.


> In both UTF-8 and UTF-16 encodings, code points may take up to 4 bytes.

Wrong: up to 4 bytes UTF16, and up to 6 bytes UTF8.

> Cyrillic, Hebrew and several other popular Unicode blocks are 2 bytes both in UTF-16 and UTF-8.

Cyrillic, Hebrew and several other languages still have spaces and punctuation, which take a single byte in UTF-8. Now it’s 2016, RAM and storage are cheap and getting cheaper, but CPU branch misprediction still costs the same 20 cycles and is not going to decline.

> plain Windows edit control (until Vista)

Windows XP is 14 years old, and now in 2016 its market share is less than 3%. Who cares what things were like before Vista?

> In C++, there is no way to return Unicode from std::exception::what() other than using UTF-8.

The exceptions that are part of the STL don’t return Unicode at all; they are in English.

If you throw your custom exceptions, return non-English messages in exception::what() in utf-8, catch std::exception and call what() — you’ll get English error messages for STL-thrown exceptions, and non-English error messages for your custom exceptions.

I’m not sure mixing GUI languages in a single app is always the right thing.

> First, the application must be compiled as Unicode-aware

The oldest Visual Studio I have installed is 2008 (because I sometimes develop for WinCE). I’ve just created a new C++ console application project, and by default it is already Unicode-aware.

So, for anyone using Microsoft IDE, this requirement is not a problem.


Modern UTF-8 is limited to 4 bytes (not 6). http://stackoverflow.com/questions/9533258/what-is-the-maxim...
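
Easy to check:

    len(chr(0x10FFFF).encode('utf-8'))            # 4: the highest code point fits in 4 bytes
    b'\xfd\xbf\xbf\xbf\xbf\xbf'.decode('utf-8')   # old 6-byte form: raises UnicodeDecodeError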

I haven't checked your other claims but this stands out:

> The exception that are part of STL don’t return Unicode at all, they are in English.

Do you mean they return the text as bytes using some (likely ASCII) character encoding and all the text characters are in ASCII range?

There Ain't No Such Thing As Plain Text. (2003) http://www.joelonsoftware.com/articles/Unicode.html


> Do you mean they return the text as bytes using some (likely ASCII) character encoding and all the text characters are in ASCII range?

If you rely on std::exception::what() while building a localizable software, you’ll end with inconsistent GUI language. Because some exceptions (that are part of STL) will return English messages, other exceptions (that aren’t part of STL) will return non-English messages.

This means if you’re developing anything localizable, you can’t rely on std::exception::what().

Then why care about its prototype?


The standard does not specify what the standard exceptions return from what(). It does not have to be in English.

Why care about it's prototype? You may want to embed into what() unicode strings that describe the error and came from elsewhere. E.g. a path, a URL, an XML element id, etc. from the context the exception originated. It may be shown to the user or written to the log. Localization is irrelevant here.


Is there any application where UTF-8 isn't the best choice for long-term (i.e., 20-200 year) forward compatibility?


Places and situations where you can't accommodate variable-length encodings. As far as future-proofing, UTF-8 is essentially the new ASCII, in that UTF-8 will remain a backward-compatibility goal for any other format that will succeed it.


> As far as future-proofing, UTF-8 is essentially the new ASCII, in that UTF-8 will remain a backward-compatibility goal for any other format that will succeed it.

Yes, I love that every byte transmitted on the Internet still reserves code points for controlling teletype (or similar) machines.


After considering this problem in detail in the past, I too favoured UTF-8 at the time.

I remember a project (circa 1999) I worked on which was a feature-phone HTML 3.4 browser and email client (one of the first). The browser/IP stack handled only ASCII/code-page characters to begin with. To my surprise it was decided to encode text on the platform using UTF-16, and thus the entire code base was converted to use 16-bit code points (UCS-2). On a resource-constrained platform (~300k RAM IIRC), better, I think, would have been to update the renderer and email client to understand UTF-8.

Nice as it might be to have the idea that a UTF-16 or UTF-32 code unit is a "character", it is, as has been pointed out, not the case, and when you look into language you can see how it never can be that simple.


I quite like Swift's approach: Characters, where a character can be "An extended grapheme cluster ... a sequence of one or more Unicode scalars that (when combined) produce a single human-readable character." This seems, in practice, to mean things like multibyte entries and modified entries end up as a single entry.

As the trade-off, directly indexing into strings is... Either not possible or discouraged, and often relies on an opaque(?) indexing class.

The main weirdness I have encountered so far is that the Regex functions operate only on the old, objective-c method of indexing, so a little swizzling is required to handle things properly.


Offtopic, but does anyone know of a way to ensure I don't introduce non-ASCII filenames, to ensure broad portability across systems? I've had to resort to disabling UTF-8 on Linux to achieve that.


What's the use case? Making sure you don't introduce them as a desktop user? As an app developer? (What does the app do?) As a sysadmin with third-party unknown apps?

You can't really "disable utf-8" on Linux. You can change how things are encoded when displaying or saving. (via locale/lang variables) But if the app wants to create a file named "0xE2 0x98 0x83" (binary version of course), it's still free to do that.


I just don't want garbage file names when sharing a file system between systems that don't agree on the encoding. I was thinking maybe some mount option. I can use ISO-8859-1 and skip UTF-8. I haven't found a mount option for ext4 or xfs yet.


There isn't one. The names in ext4 and xfs are opaque binary with some simple limitations (like null bytes). Encodings simply don't exist at the fs layer.

You could probably write some filter using fusefs, but in practice... I think you should configure the servers / clients to agree on encoding instead. Better supported and shouldn't be that much work.


Linux filesystems are not encoding-aware. Paths are just treated as opaque byte strings. However, there is ongoing work to add configurable safe filenames to Linux: https://lwn.net/Articles/686789/

But it won't allow you to force ISO-8859-1 in this form. However you could filter out non-ASCII characters.
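A filter like that is simple to sketch in C (purely illustrative, not part of the patch series above):

    #include <stdbool.h>

    /* Accept a proposed file name only if every byte is 7-bit ASCII. */
    static bool is_ascii_name(const char *name) {
      for (; *name; name++) {
        if ((unsigned char)*name >= 0x80)
          return false;  /* high bit set: not plain ASCII */
      }
      return true;
    }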


This militancy to force everyone to use UTF-8 is bad engineering. I'm thinking of GNOME 3, where you aren't even allowed the option of choosing ASCII as a default setting, only UTF-8 or ISO-8859-x. A default setting is just as important for what it filters out as for what it passes through. I use a lot of older tools on *nix that are ASCII-only, in tool chains that slurp and munge text. If the chain includes any of these UTF-8-only apps, I'm constantly dealing with the problem of invalid ASCII passing through.


With or without a BOM?


Putting a BOM in UTF-8 is just silly. Unlike UTF-16, there's no choice of byte order. The only time you'll see a BOM in UTF-8 is in poorly converted UTF-16.


Apparently, PowerShell requires a BOM to recognize UTF-8 scripts. https://github.com/chocolatey/choco/wiki/CreatePackages#char...


Yep, but a lot of MS software will only read UTF-8 correctly if a BOM is present.


Without. UTF-8 is such a distinctive pattern that if text with high bits set matches UTF-8, it's almost certainly UTF-8. There's no need for a BOM to tell you it's UTF-8 (looking at you, Windows), and it can easily confuse software instead.


Huh? What would a BOM in UTF-8 even do? 1-byte objects can't have an internal byte ordering.


MS popularized the idea of adding the UTF-16 BOM into UTF-8 to distinguish between UTF-8 text files and Windows code page files, or what they called "Unicode" and "ANSI." There's (nearly?) unanimous agreement among everyone else that BOMs in UTF-8 text are really stupid.

Note that the "BOM" in this case means storing the U+FEFF character in UTF-8 form (just as UTF-16 stores it in the appropriate endianness). This means that the result would be EF BB BF.
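If it helps, here's a tiny C snippet showing where those bytes come from, i.e. packing U+FEFF into the three-byte UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx:

    #include <stdio.h>

    int main(void) {
      unsigned cp = 0xFEFF;
      unsigned char b0 = 0xE0 | (cp >> 12);          /* 11101111 = EF */
      unsigned char b1 = 0x80 | ((cp >> 6) & 0x3F);  /* 10111011 = BB */
      unsigned char b2 = 0x80 | (cp & 0x3F);         /* 10111111 = BF */
      printf("%02X %02X %02X\n", b0, b1, b2);        /* prints EF BB BF */
      return 0;
    }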


Not everything is a web request or response that has a “content-encoding” header transmitted somewhere out of band.

The BOM lets you distinguish whether a byte stream is non-Unicode, UTF-8, UTF-16, or UTF-32 (a rough sketch of that check follows below).

Like it or not, it's part of the standard:

http://unicode.org/faq/utf_bom.html#BOM
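For what it's worth, here's that check sketched in C (my own code, not from the FAQ; note that "unknown" can still be BOM-less UTF-8 or a legacy code page):

    #include <stddef.h>
    #include <string.h>

    const char *sniff_bom(const unsigned char *p, size_t n) {
      /* Check the 4-byte forms before the 2-byte ones, since the
         UTF-32LE BOM starts with the UTF-16LE BOM. */
      if (n >= 4 && !memcmp(p, "\x00\x00\xFE\xFF", 4)) return "UTF-32BE";
      if (n >= 4 && !memcmp(p, "\xFF\xFE\x00\x00", 4)) return "UTF-32LE";
      if (n >= 3 && !memcmp(p, "\xEF\xBB\xBF", 3))     return "UTF-8";
      if (n >= 2 && !memcmp(p, "\xFE\xFF", 2))         return "UTF-16BE";
      if (n >= 2 && !memcmp(p, "\xFF\xFE", 2))         return "UTF-16LE";
      return "unknown";
    }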


From section 2.6 in the standard: "Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature."

Yes, it can be used to distinguish a UTF-8 stream but it's not recommended. One issue is that you can't tell whether those bytes are really a BOM or valid text in some other non-Unicode encoding.

I'm curious where you've encountered missing content-encoding headers or other OOB indicators where it wasn't because of programmer error or laziness.


> it can be used to distinguish a UTF-8 stream but it's not recommended

If a specification says “something may be encountered”, for me, when I write my software, it means I must support that thing. Otherwise, the software won’t conform to the spec.

> I'm curious where you've encountered missing content-encoding headers or other OOB indicators where it wasn't because of programmer error or laziness.

Everywhere.

Most filesystems don’t have encoding headers for their text files. Most databases don’t have headers for their blob columns.

Only the web has encoding headers.


There's a difference between what a program should accept as input and what it should generate as output. The standard just says to expect a BOM on input and suggests not generating one on output. In other words, "a UTF-8 BOM is a bad idea, but some yutz out there started doing it, so we should ignore it on input." Someone else mentioned that the yutz was Microsoft.

I misread what you wrote about where you saw no indication that it was UTF-8. You were talking about places other than the web.

BOM for UTF-8 text files seems to be a Microsoft thing. Everyone else just defaults to UTF-8. But you can't be sure whether it's a UTF-8 BOM or some other encoding. Most editors let the user override what it is.

Why would you store text in a blob column? If a database can't handle UTF-8 in its text columns it needs to be fixed (or taken out back and shot).


> There's a difference between what a program should accept as input and what it should generate as output.

I’m a Windows developer. In my world, a program should generate its output in whatever format user wants it to be.

When I press “File/Save as” in Visual Studio and click on the down arrow icon, I see a choice of more than 100 different encodings (including all flavors of Unicode with and without the BOM), and an independent choice of 3 line endings (Windows, Mac, Unix).

> BOM for UTF-8 text files seems to be a Microsoft thing

Practically — maybe, most Microsoft apps tend to understand those BOMs, and most *nix tools don’t, even on input.

Officially — definitely no, we both saw the spec on unicode.org.


> I’m a Windows developer. In my world, a program should generate its output in whatever format user wants it to be.

When generating output for a user, letting them choose is a good idea. But for interop with other programs I leave it off unless the program needs it.

> Officially — definitely no, we both saw the spec on unicode.org.

The spec says the BOM is optional. Some Microsoft programs however require it.


> for interop with other programs I leave it off unless the program needs it.

Plain text isn’t exactly a machine-friendly format.

If you want to interop with other programs, a better choice is e.g. XML, which has the encoding problem fixed as part of the standard.

> The spec says the BOM is optional. Some Microsoft programs however require it.

Could you please name a Microsoft program that you think requires a BOM?

I’m asking because I have a completely different experience. For me, Microsoft programs open text files just fine, with or without the BOM. But most *nix and OS X programs show me garbage where the BOM is.


> Plain text isn’t exactly a machine-friendly format.

Works fine for unix. :D

> Could you please name a Microsoft program that you think requires a BOM?

Visual C++ off the top of my head. It mangles UTF-8 string literals if there's no BOM in the source file.

> For me, Microsoft programs open text files just fine, with or without the BOM. But most *nix and OS X programs show me garbage where the BOM is.

That's what I was trying to say about the BOM being prevalent on the Windows side of the fence. Some programs require it, and some always generate it, so most programs now accept it.

On the unix/osx side everyone switched to UTF-8, so the BOM is redundant. Everything is UTF-8, so the silliness of "this needs a BOM, that doesn't need a BOM" doesn't exist. Good example of what the "UTF-8 Everywhere" site is trying to promote.

Personally I really wish Microsoft would eventually fix their UTF-8 codepage. Would be so nice not having to convert to/from UTF-16 at the Win32 API boundary.
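For anyone who hasn't had the pleasure, the dance looks roughly like this (a sketch only; error handling trimmed, and the helper name is mine):

    #include <windows.h>
    #include <stdlib.h>

    /* Widen a UTF-8 string so it can be passed to a ...W Win32 API. */
    wchar_t *utf8_to_wide(const char *utf8) {
      int n = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
      if (n == 0) return NULL;
      wchar_t *wide = malloc((size_t)n * sizeof *wide);
      if (wide) MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, n);
      return wide;  /* caller frees; usable with e.g. CreateFileW */
    }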


> Works fine for unix. :D

The trend towards higher-level data formats is universal across all OSes.

Even on Unix, users typically read HTML, write ODF or DOCX (both XML-based), print PostScript, etc.

Plain text is friendly towards developers. But it’s neither interop-friendly nor user friendly.

> Visual C++ off the top of my head

Only the C++ compiler. MS can’t change the compiler because of backward compatibility. The IDE, however, works fine with such files.

> "UTF-8 Everywhere" site is trying to promote.

The transition is going to be expensive, because most languages and frameworks (C++/MFC/ATL/Qt, .NET languages, JVM languages, Python, etc.) have used Unicode (UCS-2 or UTF-16) strings for decades already.

To justify the costs, the benefits of the transition must be substantial.

And there aren’t any.


> Plain text is friendly towards developers. But it’s neither interop-friendly nor user friendly.

Kind of got off track here. You can process a lot of formats as text (HTML, CSS, XML, etc.), so a BOM there is unnecessary and sometimes detrimental. On the unix side there are a lot of text utilities that do useful things with these formats. That's probably why BOMs are non-existent there.

> MS can’t change the compiler because of backward compatibility.

You care to tell MS that? Every single time I've done a major VS upgrade my code had to be changed because something that was valid before stopped being valid.

> And there aren’t any.

If you can't see any benefit of using UTF-8 then I'm done debating with you.


In UTF-8 a BOM can be placed to support round-tripping the information with UTF-16.


So what order do you put the BOM in? Does it not even matter?


The Unicode BOM is code point U+FEFF. The process of encoding it determines the byte order.

Encoded to UTF-8 it becomes EF BB BF. Encoded to UTF-16 big-endian it becomes FE FF. Encoded to UTF-16 little-endian it becomes FF FE.

Converting it back from UTF-8 always gives you U+FEFF, since UTF-8 doesn't care about endianness. Converting it back from UTF-16 using the correct endianness gives you U+FEFF. Converting it using the wrong endianness gives you U+FFFE, which Unicode defines as a noncharacter that should never appear in text.
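A tiny C illustration of that last point: the same two bytes, read under the two byte orders, give you either the BOM or the noncharacter.

    #include <stdio.h>

    int main(void) {
      const unsigned char bom[2] = { 0xFE, 0xFF };
      unsigned as_be = (bom[0] << 8) | bom[1];  /* 0xFEFF: the BOM */
      unsigned as_le = (bom[1] << 8) | bom[0];  /* 0xFFFE: noncharacter, never valid text */
      printf("%04X %04X\n", as_be, as_le);
      return 0;
    }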


Makes sense, thanks :)


Joking? A BOM is completely unnecessary in UTF-8; it's only useful for losslessly preserving UTF-16 text when converting back and forth.


Some days, I imagine a parallel universe, where the ancient Chinese had called ideograms a bad idea, and went on to develop a proper alphabet. Unicode would be pretty much unnecessary.



