UTF-16's (and UCS-2's) tentacles extend further than just Windows NT. A lot of software designed in the '90s uses it, notably Java, .NET (and thus C#), JavaScript/ECMAScript, parts of C++, GSM, Python (older versions), Qt, etc.
An interesting suggestion they make is to also use UTF-8 for strings internal to your program. That is, instead of decoding UTF-8 on input and encoding UTF-8 on output, you just keep the data encoded the whole time.
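In Python that might look like the following sketch (the function names are mine, purely for illustration): validate at the boundary, but never hold the text as anything other than UTF-8 bytes.

    def read_utf8(path):
        # Read UTF-8 bytes and validate them, but keep no decoded copy.
        with open(path, "rb") as f:
            data = f.read()
        data.decode("utf-8")  # validation only; raises UnicodeDecodeError on bad input
        return data           # still bytes

    def write_utf8(path, data):
        # Output is a straight byte copy; there is no encode step.
        with open(path, "wb") as f:
            f.write(data)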
What would be a good alternative for strings internal to your program?
I work with multilingual text-processing applications, and I strongly support that concept. A guideline of "use UTF-8 or die" works well and avoids a lot of headaches. It is the most efficient encoding for in-memory use (unless you work mostly with Asian character sets, where UTF-16 has a size advantage), and it is compatible with all legal data. So it's quite effective to have a policy that 100% of your functions/APIs/data structures/databases pass only UTF-8 data, and when other encodings are needed (e.g. file import/export), the data is converted to that something else at the very edge of your application.
Having a mix of encodings is a time bomb that sooner or later goes off in the form of nasty bugs.
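A minimal sketch of that edge-conversion policy in Python (function names and file names are hypothetical):

    def import_file(path, source_encoding):
        # Edge of the application: whatever arrives is converted to UTF-8 once.
        with open(path, "rb") as f:
            raw = f.read()
        return raw.decode(source_encoding).encode("utf-8")

    def export_file(path, utf8_data, target_encoding):
        # Edge of the application: convert back out only when a consumer demands it.
        with open(path, "wb") as f:
            f.write(utf8_data.decode("utf-8").encode(target_encoding))

    # Usage (file names are hypothetical):
    #   data = import_file("report.txt", "windows-1252")
    #   ... everything in between sees only UTF-8 bytes ...
    #   export_file("report-sjis.txt", data, "shift_jis")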
Abstraction is the alternative. Design an API that treats encodings uniformly, and the encoding becomes an internal implementation detail. You can then have a polymorphic representation that avoids unnecessary conversions. NSString and Swift String both work this way.
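A toy Python sketch of that idea (the class is invented for illustration and says nothing about NSString's real internals): callers only ever see code points, the stored encoding stays private, and a conversion is skipped when the data is already in the requested form.

    class PolyString:
        def __init__(self, data, encoding):
            self._data = data          # raw bytes in their original encoding
            self._encoding = encoding  # private implementation detail

        @classmethod
        def from_utf8(cls, data):
            return cls(data, "utf-8")

        @classmethod
        def from_utf16(cls, data):
            return cls(data, "utf-16-le")

        def code_points(self):
            # The only place the encoding matters; callers never see it.
            return self._data.decode(self._encoding)

        def to_utf8(self):
            # No-op when the data is already UTF-8: the avoided conversion.
            if self._encoding == "utf-8":
                return self._data
            return self.code_points().encode("utf-8")

    s = PolyString.from_utf16("héllo".encode("utf-16-le"))
    print(s.code_points())  # "héllo", regardless of internal representation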
IMHO, UTF-16 is the worst of both worlds. It breaks ASCII backwards compatibility in the simple case and wastes storage, yet it still requires complex multi-unit decoding because it's not a fixed-length encoding.
UTF-8 is probably the best compromise of the lot, with the advantages of UTF-32 being outweighed by the massive overhead in the most common case.
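The trade-off is easy to measure from a Python prompt (the sample strings are arbitrary):

    # Payload size in bytes per encoding; the -le variants skip the BOM.
    samples = {"ASCII": "hello, world",
               "CJK":   "こんにちは世界",
               "emoji": "💩"}  # outside the BMP: a surrogate pair in UTF-16
    for name, text in samples.items():
        print(name,
              "utf-8:",  len(text.encode("utf-8")),
              "utf-16:", len(text.encode("utf-16-le")),
              "utf-32:", len(text.encode("utf-32-le")))
    # For the ASCII string, UTF-8 is half the size of UTF-16 and a quarter
    # of UTF-32, yet the emoji still takes two UTF-16 code units, so UTF-16
    # pays the variable-length cost anyway.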
No. Python 2.7 uses UCS-2 or UCS-4 depending on how it was compiled. Python 3 (since 3.3, per PEP 393) stores each string as Latin-1 (one byte per code point), UCS-2, or UCS-4, chosen per string at creation time depending on the string's contents.
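You can watch CPython 3.3+ pick the representation per string with sys.getsizeof:

    import sys

    # PEP 393: CPython picks the narrowest representation that fits the
    # widest code point in the string, when the string is created.
    for text in ["a" * 100,            # ASCII: 1 byte per code point
                 "\u00e9" * 100,       # é, Latin-1 range: still 1 byte each
                 "\u0394" * 100,       # Δ: needs 2 bytes each (UCS-2)
                 "\U0001f600" * 100]:  # 😀: needs 4 bytes each (UCS-4)
        print(repr(text[0]), sys.getsizeof(text))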
I'm on Mac and I've had problems with Chrome sending AJAX requests or decoding AJAX responses as ISO-8859-1, if I remember correctly. I had to add "; charset=utf-8" to my Content-Type headers. I remember it being a browser issue, and I think it was the same for all browsers.
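For reference, the fix amounts to sending an explicit charset. A minimal reproduction with Python's standard http.server (port and body are arbitrary):

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = "héllo wörld".encode("utf-8")
            self.send_response(200)
            # Without "; charset=utf-8" some browsers fall back to a legacy
            # default such as ISO-8859-1 and mangle the non-ASCII bytes.
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    HTTPServer(("", 8000), Handler).serve_forever()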
For backwards-compatibility's sake, where a web page doesn't specify a character set, browsers will assume the predominant pre-Unicode encoding used in your region (e.g. windows-1252 in much of the West).
The older the site, the less likely it is to have been updated. So it's reasonable to assume that newer sites will either declare UTF-8 or can be modified to declare it, while old sites stay the way they always were, pre-UTF-8.
Keeping the backwards-compatibility heuristic the same makes sense.