
This seems specific to Windows. UTF-8 is already standard on Linux and the web, for example. It's just Microsoft.



UTF-16's (and UCS-2's) tentacles extend further than just Windows NT. A lot of stuff created in the 90's uses it, notably Java, .NET (and thus C#), JavaScript/ECMAScript, parts of C++, GSM, Python (older versions), Qt, etc.


An interesting suggestion they make is to keep UTF-8 for strings internal to your program as well. That is, instead of decoding UTF-8 on input and encoding it on output, you just keep it encoded the whole time.


What would be a good alternative for strings internal to your program?

I work on multilingual text processing applications, and I strongly support that concept. A guideline of "use UTF-8 or die" works well and avoids lots of headaches. It is the most efficient encoding for in-memory use (unless you work mostly with Asian scripts, where UTF-16 has a size advantage) and it can represent all legal data. So it's quite effective to have a policy that 100% of your functions, APIs, data structures and databases pass only UTF-8 data; when other encodings are needed (e.g. file import/export), the data is converted at the very edge of your application.

Having a mix of encodings is a time bomb that sooner or later blows up as nasty bugs.
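
A minimal sketch of that policy in Python, assuming a legacy ISO-8859-1 file as the only non-UTF-8 source. The helper names (`import_text`, `export_text`, `words`) are hypothetical and only illustrate the shape: convert at the edges, pass UTF-8 bytes everywhere else.

```python
# Hypothetical helpers for illustration; the names and the ISO-8859-1
# example are assumptions, not part of any particular library.

def import_text(raw: bytes, source_encoding: str) -> bytes:
    """Input edge: convert bytes in a known external encoding to UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")

def export_text(utf8: bytes, target_encoding: str) -> bytes:
    """Output edge: convert internal UTF-8 bytes to an external encoding."""
    return utf8.decode("utf-8").encode(target_encoding)

def words(utf8: bytes) -> list:
    """Internal function: takes and returns UTF-8 only.

    Splitting on ASCII whitespace is safe on raw UTF-8 because ASCII byte
    values never occur inside a multi-byte sequence.
    """
    return utf8.split()

legacy = "caf\u00e9 au lait".encode("iso-8859-1")   # pretend this came from an old file
internal = import_text(legacy, "iso-8859-1")        # UTF-8 from here on
tokens = words(internal)
exported = export_text(b" ".join(tokens), "iso-8859-1")
```

Only the two edge functions ever need to know that another encoding exists.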


Abstraction is the alternative. Design an API that treats encodings uniformly, and the encoding becomes an internal implementation detail. You can then have a polymorphic representation that avoids unnecessary conversions. NSString and Swift String both work this way.
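
A rough sketch of the idea in Python. The `Text` class is hypothetical and is not how NSString or Swift String are implemented; it only shows an API where the stored encoding is hidden and conversion happens only when a caller asks for a different one.

```python
import codecs

class Text:
    """Hypothetical string type whose storage encoding is an internal detail."""

    def __init__(self, data: bytes, encoding: str):
        # Normalize aliases so that "UTF8" and "utf-8" compare equal.
        self._encoding = codecs.lookup(encoding).name
        self._data = data

    def encoded(self, encoding: str) -> bytes:
        """Return the text in the requested encoding, converting only if needed."""
        target = codecs.lookup(encoding).name
        if target == self._encoding:
            return self._data  # already stored this way: no conversion
        return self._data.decode(self._encoding).encode(target)

    def __str__(self) -> str:
        return self._data.decode(self._encoding)

    def __eq__(self, other) -> bool:
        # Equality is defined on the text, not on the stored bytes.
        return isinstance(other, Text) and str(self) == str(other)

    def __hash__(self) -> int:
        return hash(str(self))

a = Text("h\u00e9".encode("utf-16-le"), "utf-16-le")
b = Text("h\u00e9".encode("utf-8"), "UTF8")
```

Here `a == b` holds even though the two objects store different bytes, and `b.encoded("utf-8")` hands back the stored bytes without re-encoding.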


A vector of pointers to grapheme clusters, for example.

Sometimes a vector of objects that include other information, like glyphs.


Doesn't Python use UTF-16 internally?

IMHO, UTF-16 is the worst of both worlds. It breaks backwards compatibility with ASCII in the simple case and wastes storage, but still needs complex multi-unit decoding because it's not a fixed-length encoding.

UTF-8 is probably the best compromise of the lot, with the advantages of UTF-32 being outweighed by the massive overhead in the most common case.
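
To put rough numbers on this, here is a small Python sketch that compares encoded sizes. The sample strings are my own picks, chosen to cover ASCII, a few CJK characters, and one character outside the Basic Multilingual Plane.

```python
# Sample strings are arbitrary illustrations.
samples = {
    "ascii": "hello",
    "cjk": "\u65e5\u672c\u8a9e",   # 3 characters inside the BMP
    "emoji": "\U0001F600",         # 1 character outside the BMP
}

def encoded_sizes(text: str) -> dict:
    """Byte length of `text` in each encoding (the -le variants add no BOM)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

for name, text in samples.items():
    print(name, len(text), encoded_sizes(text))
```

The ASCII sample doubles in size under UTF-16 and quadruples under UTF-32. The CJK sample is where UTF-16 comes out smaller (6 bytes against 9 for UTF-8). The single emoji takes two 16-bit units in UTF-16, a surrogate pair, which is why UTF-16 still needs variable-length decoding.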


No. Python 2.7 uses UCS-2 or UCS-4 depending on how it was compiled. Python 3.3 and later picks Latin-1 (one byte per character, which covers pure ASCII), UCS-2 or UCS-4 at runtime, per string, depending on the string's contents.
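
One way to observe the per-string width, assuming CPython 3.3 or later (the PEP 393 layout, where as far as I know the one-byte form covers Latin-1). `sys.getsizeof` reports implementation-specific sizes, so treat this as a sketch for CPython only; `bytes_per_char` is a hypothetical helper.

```python
import sys

def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate per-character storage by growing a string of one repeated character.

    Subtracting the size of the 1-character string cancels the fixed header.
    """
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n

widths = {
    "ascii": bytes_per_char("a"),
    "latin-1": bytes_per_char("\u00e9"),
    "bmp": bytes_per_char("\u0100"),
    "non-bmp": bytes_per_char("\U00010000"),
}
print(widths)
```

On CPython builds I'd expect widths of 1, 1, 2 and 4 bytes respectively; other Python implementations may store strings differently.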


I'm on a Mac and I've had problems with Chrome sending AJAX requests or decoding AJAX responses as ISO-8859-1, if I remember correctly. I had to add "; charset=utf-8" to my headers. I remember it was a browser problem, and I think it was the same for all browsers.


For backwards-compatibility's sake, where a web page doesn't specify a character set, browsers will assume the predominant pre-Unicode encoding used in your region.
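
A quick Python illustration of what that fallback does to undeclared UTF-8 content, assuming the fallback is ISO-8859-1. The sample text and the header string are just examples.

```python
# Bytes the server actually sends: UTF-8, but with no declared charset.
body = "na\u00efve caf\u00e9".encode("utf-8")

as_utf8 = body.decode("utf-8")          # what the author intended
as_latin1 = body.decode("iso-8859-1")   # what a Latin-1 fallback produces

# Declaring the charset removes the guesswork.
header = "Content-Type: text/html; charset=utf-8"

print(as_utf8)
print(as_latin1)
```

Each accented letter is two bytes in UTF-8, and the Latin-1 decode turns each of those bytes into its own character, which gives the familiar "Ã©" style of garbage.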


Why is this still the case? UTF-8 is dominant now; wouldn't it make more sense to assume UTF-8?


The older the site, the less likely it is that it will have been updated. Therefore, it's reasonable to assume that newer sites will either declare UTF-8, or can be modified to declare UTF-8, while old sites stay the way they always were, pre-UTF-8.

Keeping the backwards-compatibility heuristic the same makes sense.


Old sites lacked encoding declarations, and old browsers (e.g. early versions of IE) didn't support them.

Sites that want UTF-8 can ask for it.


One word: Java



