*The problem is that while this is trivial for Western languages I don't even kn...

ajross · on April 3, 2012

Note that precisely such an encoding detection algorithm was specified as part of XML to avoid the absurdity of the BOM.

funkah · on April 3, 2012

Sounds hard to me.

chc · on April 3, 2012

How about this simplified version:

1. Try both byte orders

2. If one produces valid text and the other does not, choose that one (this will get you the correct answer almost every time, even if the source text is Chinese)

3. If both happen to produce valid text, use the one with the smallest number of scripts

(Note that this just determines byte order, while Patrick was talking about the more ambitious task of heuristically determining whether a random string of bytes is text and if so what encoding it is. My point is just that you really don't need to be told the order of the bytes in most cases.)

kijin · on April 3, 2012

Simple in theory, but hard enough in practice that companies like Microsoft screw it up from time to time.

Try saving a text file in Windows XP Notepad with the words "Bush hid the facts" and nothing else. Close it and open the file again. WTF Chinese characters! Conspiracy!

jerf · on April 3, 2012

That's not Microsoft "screwing it up", that's you not feeding the algorithm enough characters for it to be really sure. While that short string is below the threshold, the threshold is actually quite surprisingly small; if I remember correctly it's just over 100 bytes and any non-pathological input will be correctly identified with effectively 100% success.

chc · on April 3, 2012

That's a bug having to do with uncertain encoding (which is what I called "the more ambitious task"), not uncertain byte order.