
The problem is that while this is trivial for Western languages, I don't even know how you'd begin when presented with arbitrary text using an East Asian or African written language.

Not actually that hard.

Consider a document which is encoded in either a) ASCII like you know it or b) ASCII where the top 4 bits and bottom 4 bits of every byte are transposed. How would you tell the difference? Well, one can imagine creating a histogram of the 4-bit values in each half of the bytes and comparing them to expectations based on the distribution in naturally occurring English text. The half with most of its entries in 0x5, 0x6, and 0x7 is the high-order half, because lowercase ASCII letters run from 0x61 to 0x7A and dominate English text.
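A minimal sketch of that histogram test, in Python. The function names are made up for illustration, and the scoring rule (fraction of bytes whose high half is 0x5, 0x6, or 0x7) is just the rule of thumb from the paragraph above, not a tuned model of English.

```python
from collections import Counter

# High-nibble values where ASCII letters are expected to cluster
# (taken from the rule of thumb in the comment, not a trained model).
LETTER_NIBBLES = (0x5, 0x6, 0x7)

def swap_nibbles(data: bytes) -> bytes:
    """Transpose the top 4 bits and bottom 4 bits of every byte."""
    return bytes(((b << 4) & 0xF0) | (b >> 4) for b in data)

def high_nibble_score(data: bytes) -> float:
    """Fraction of bytes whose high nibble is 0x5, 0x6, or 0x7."""
    if not data:
        return 0.0
    counts = Counter(b >> 4 for b in data)
    return sum(counts[n] for n in LETTER_NIBBLES) / len(data)

def looks_transposed(data: bytes) -> bool:
    """Guess that the nibbles are swapped if swapping them back
    makes the data look more like English ASCII."""
    return high_nibble_score(swap_nibbles(data)) > high_nibble_score(data)

sample = b"The quick brown fox jumps over the lazy dog"
print(looks_transposed(sample))                # plain ASCII
print(looks_transposed(swap_nibbles(sample)))  # transposed ASCII
```

On very short or non-English input the two scores can be close, so treat the result as a guess, not a verdict.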

If you don't know what naturally occurring e.g. Japanese looks like in Unicode code points, take this on faith: flipping the byte order does not give you a document which looks plausibly correct. (Also, crucially, Japanese with the order flipped doesn't resemble any sensible document in any language -- you end up with Unicode code points from a mishmash of unrelated blocks.)

P.S. Why care about that algorithm? Here's a hypothetical: you're a forensic investigator or system administrator who, given a hard drive which has been damaged, needs to extract as much information as possible from it. The BOM is very possibly not in the same undamaged sector which you are reading right now, and it may be impossible to stitch the sectors without first reading the text. How would you determine a) whether an arbitrary stream of bytes was likely a textual document, b) what the encoding was, c) what endianness it was, if appropriate, and d) what human language it was written in?




Note that precisely such an encoding detection algorithm was specified as part of XML to avoid the absurdity of the BOM.
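For reference, the XML 1.0 spec describes this in its (non-normative) Appendix F: since a document is expected to start with "<?xml", the first four bytes give away the encoding family. A rough Python sketch of the no-BOM cases follows. The function name and the family labels are invented here, and the table covers only some of the patterns listed in the appendix.

```python
from typing import Optional

# First four bytes of a document starting with "<?xm" (or just "<")
# in several encoding families, without a BOM. The labels are
# illustrative names, not official encoding identifiers.
XML_SIGNATURES = [
    (b"\x00\x00\x00\x3c", "ucs-4-be"),
    (b"\x3c\x00\x00\x00", "ucs-4-le"),
    (b"\x00\x3c\x00\x3f", "utf-16-be"),
    (b"\x3c\x00\x3f\x00", "utf-16-le"),
    (b"\x3c\x3f\x78\x6d", "ascii-compatible"),  # UTF-8, ISO 8859, etc.
    (b"\x4c\x6f\xa7\x94", "ebcdic"),
]

def sniff_xml_encoding_family(head: bytes) -> Optional[str]:
    """Return a coarse encoding family for the start of an XML
    document, or None if no known pattern matches."""
    for signature, family in XML_SIGNATURES:
        if head.startswith(signature):
            return family
    return None

decl = '<?xml version="1.0"?>'
print(sniff_xml_encoding_family(decl.encode("utf-16-le")))
print(sniff_xml_encoding_family(decl.encode("utf-8")))
```

This only narrows things down far enough to read the encoding declaration, which then names the exact encoding.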


Sounds hard to me.


How about this simplified version:

1. Try both byte orders

2. If one produces valid text and the other does not, choose that one (this will get you the correct answer almost every time, even if the source text is Chinese)

3. If both happen to produce valid text, use the one with the smallest number of scripts

(Note that this just determines byte order, while Patrick was talking about the more ambitious task of heuristically determining whether a random string of bytes is text and if so what encoding it is. My point is just that you really don't need to be told the order of the bytes in most cases.)
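The three steps above can be sketched in Python for UTF-16. Some loud assumptions: "valid text" here means it decodes strictly and contains no unassigned or private-use code points and no control characters other than tab, newline, and carriage return. The standard library has no script property, so the script count is approximated by the first word of each letter's Unicode name (LATIN, CJK, HIRAGANA, ...). All function names are made up for illustration.

```python
import unicodedata
from typing import Optional

ALLOWED_CONTROLS = {"\t", "\n", "\r"}

def decode_if_valid(data: bytes, encoding: str) -> Optional[str]:
    """Decode strictly; reject unassigned, private-use, and
    unexpected control characters. Returns None if rejected."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return None
    for ch in text:
        category = unicodedata.category(ch)
        if category in ("Cn", "Co"):
            return None
        if category == "Cc" and ch not in ALLOWED_CONTROLS:
            return None
    return text

def script_count(text: str) -> int:
    """Crude proxy for the number of scripts: distinct first words
    of the Unicode names of the letters in the text."""
    scripts = set()
    for ch in text:
        if unicodedata.category(ch).startswith("L"):
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return len(scripts)

def guess_utf16_byte_order(data: bytes) -> Optional[str]:
    # Step 1: try both byte orders.
    candidates = {}
    for encoding in ("utf-16-le", "utf-16-be"):
        text = decode_if_valid(data, encoding)
        if text is not None:
            candidates[encoding] = text
    # Step 2: if exactly one produces valid text, choose it.
    if len(candidates) == 1:
        return next(iter(candidates))
    # Step 3: if both are valid, prefer the one with fewer scripts.
    if len(candidates) == 2:
        le = script_count(candidates["utf-16-le"])
        be = script_count(candidates["utf-16-be"])
        if le != be:
            return "utf-16-le" if le < be else "utf-16-be"
    return None  # neither is valid, or it is a tie

print(guess_utf16_byte_order("line one\nline two\n".encode("utf-16-be")))
```

A tie in step 3 returns None instead of guessing. Very short inputs can tie, so a real tool would want a better tiebreaker.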


Simple in theory, but hard enough in practice that companies like Microsoft screw it up from time to time.

Try saving a text file in Windows XP Notepad with the words "Bush hid the facts" and nothing else. Close it and open the file again. WTF Chinese characters! Conspiracy!
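You can see why that particular string is ambiguous without a copy of XP. The sketch below only reinterprets the bytes and says nothing about what Notepad's detection actually does internally: the 18 ASCII bytes happen to form 9 well-formed little-endian UTF-16 code units, and every one of them lands in the CJK Unified Ideographs block (U+4E00 to U+9FFF).

```python
# The ASCII bytes of the famous string, exactly as saved to disk.
data = "Bush hid the facts".encode("ascii")

# Reinterpret the same bytes as little-endian UTF-16.
as_utf16 = data.decode("utf-16-le")

# Check whether every resulting character is a CJK Unified Ideograph.
all_cjk = all(0x4E00 <= ord(ch) <= 0x9FFF for ch in as_utf16)

print(len(data), len(as_utf16), all_cjk)
print([f"U+{ord(ch):04X}" for ch in as_utf16])
```

So both readings are "valid text" in the step 2 sense, and with only 18 bytes there is very little evidence to pick between them.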


That's not Microsoft "screwing it up", that's you not feeding the algorithm enough characters for it to be really sure. While that short string is below the threshold, the threshold is actually surprisingly small; if I remember correctly it's just over 100 bytes, and any non-pathological input above that will be correctly identified with effectively 100% success.


That's a bug having to do with uncertain encoding (which is what I called "the more ambitious task"), not uncertain byte order.



