I think the article doesn't really clarify the reason these bugs exist. After all, is it really a bug? What if I so happen to be working with Chinese Unicode files without BOM that are also valid ASCII files? Then it would seem to me that Notepad was previously behaving correctly, and ever since Vista they introduced a bug!
No, this isn't a low-level bug tied to any particular Windows API function. This is the inevitable result of the folly of trying to guess a file's encoding. When we find ourselves cornered into doing something this ridiculous, it becomes apparent that we, as a society of programmers, are extremely disorganized.
> What if I so happen to be working with Chinese Unicode files without BOM that are also valid ASCII files?
What's the longest natural Chinese substring you can find for which this is the case? When we ran the encoding-and-language-detection heuristic we developed for our research work (this was 10+ years ago, I think), the answer was 5 characters -- there was no substring of 6+ characters (12 bytes of UTF-16) which could be reasonably written in ASCII.
This is because much of ASCII is unprintable, more-than-doubly so if you're concerned strictly with 7-bit ASCII as opposed to e.g. Latin-1. When's the last time you saw a string contain 0x18 ("cancel"), 0x07 ("bell"), etc.?
An assortment of 0x..18 characters for your amusement: 予,先,券,単,又. I picked ones which are included in commonly used words in Japanese. Seeing any one of these once in UTF-16 is dispositive that the bytestream is not ASCII.
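To make the control-byte point concrete, here is a quick sketch of my own (not the project's code) that lists the bytes of a string's UTF-16 encoding which fall in the ASCII control range. Most of those bytes (bell, cancel, and friends) essentially never appear in real ASCII text, and the NUL bytes that pure-ASCII text produces in UTF-16 are just as telling:

```ruby
# List the bytes of a string's UTF-16BE encoding that would be ASCII
# control characters (0x00-0x1F) if the stream were read as ASCII.
def ascii_control_bytes(str)
  str.encode("UTF-16BE").bytes.select { |b| b < 0x20 }
end

ascii_control_bytes("AB")  # => [0, 0]  (the NUL high bytes of "A" and "B")
ascii_control_bytes("吊")  # => [10]    (U+540A: its low byte is an ASCII line feed)
```

One such byte is already strong evidence the stream isn't ASCII; a handful is dispositive.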
This problem is worth thinking about (which is why a customer asked a team of CS researchers and linguists to think about it, including one barely competent programmer who nonetheless had reasonably good intuitions for what character distributions looked like), but it turns out to be much, much less hard than many people originally expected.
Anyhow, long story short: make a histogram of the bytes, dot product with the histogram signature you have of a few large corpora in decent-guess-based-on-apriori-knowledge languages/encoding, normalize. The screamingly obvious candidate is almost always the winner.
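A minimal sketch of that recipe, in Ruby (my own toy version, not the actual research code; the signature "corpora" here are one-line stand-ins, where the real thing would use large corpora per candidate language/encoding):

```ruby
# Build a unit-normalized byte histogram (256 bins).
def histogram(bytes)
  h = Array.new(256, 0.0)
  bytes.each { |b| h[b] += 1 }
  norm = Math.sqrt(h.sum { |x| x * x })
  norm.zero? ? h : h.map { |x| x / norm }
end

# Dot the input's histogram against each signature; highest score wins.
def best_encoding(bytes, signatures)
  hist = histogram(bytes)
  signatures.max_by { |_name, sig| hist.zip(sig).sum { |a, b| a * b } }.first
end

# Toy signatures standing in for real corpus histograms.
signatures = {
  "ASCII"    => histogram("the quick brown fox jumps over the lazy dog".bytes),
  "UTF-16LE" => histogram("日本語のサンプルテキストです".encode("UTF-16LE").bytes),
}

best_encoding("hello world, this is plain text".bytes, signatures)  # => "ASCII"
```

Even with signatures this crude, the screamingly obvious candidate dominates the dot product: ASCII-ish input overlaps the letter/space bins, while UTF-16 Japanese piles its mass on 0x30 and high bytes.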
If you want to do it really, really quickly you can even evaluate this heuristic with a Bloom filter in an FPGA.
Do you have an intuition for what such a string looks like if you interpret its bytes as ASCII? No? Just guess: "almost plausibly an English document", "gibberish but mostly ASCII", or "absolutely zero probability of being mistaken for ASCII".
I whipped up a quick Ruby script:
```
require "colorize"
chinese = File.read("/tmp/chinese.txt")
puts chinese.bytes.map { |b| s = b.chr; s.ascii_only? && s.match?(/[[:print:]]/) ? s.blue : "?".red }.join
```
which walks the UTF-8 bytes of that string and renders each byte blue where it collides with a printable ASCII character, and as a red question mark otherwise.
What's hard, however, is telling multi-byte CJK encodings apart, and telling them apart from UTF-8.
I've tried it, as I was working on extending my mojibake-fixing library [1] to the language that gave us the word "mojibake". The ambiguous cases actually happen.
Suppose you found this string (which someone actually tweeted) encoded in Shift-JIS:
(|| * m *)ウ、ウップ・・
Those half-width katakana with a comma in the middle look a lot like typical Japanese mojibake, so the classifier would be entirely justified in deciding that Shift-JIS is wrong. Next let's try decoding those bytes as EUC-JP instead:
(|| * m *)海劾餅ゥ
Hey, maybe that's a name or something! No other encoding works, so the classifier happily decides it's EUC-JP. Except this string is complete nonsense and the person actually meant the first string.
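The mechanism behind this ambiguity: Shift-JIS encodes half-width katakana as single bytes in 0xA1-0xDF, while EUC-JP reads pairs of bytes in that same range as JIS X 0208 kanji, so a run of half-width katakana decodes cleanly either way. A tiny sketch (my own two-byte example, not the tweeted string):

```ruby
# Two raw bytes that are valid in both Shift-JIS and EUC-JP.
bytes = "\xB3\xA4".b

as_sjis = bytes.dup.force_encoding("Shift_JIS").encode("UTF-8")
as_euc  = bytes.dup.force_encoding("EUC-JP").encode("UTF-8")

as_sjis  # => "ｳ､"  (two half-width characters)
as_euc   # => "海"  (one perfectly plausible kanji)
```

Neither decode raises an error, so byte validity alone can't settle it; you're back to guessing which output a human more plausibly typed.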
I do agree with you that it's a problem worth trying to solve (clearly). We can't just throw up our hands and say "this is a problem that shouldn't happen, so let's not solve it". Microsoft Office exists, and its mojibake-causing "features" will never be removed due to backward compatibility, so this happens and will continue to happen.
But, any ideas where the corpus for telling apart encodings should actually come from? I've been using Twitter, but this limits the domain. Only thoroughly defective Twitter clients send anything but UTF-8 to Twitter. There are lots of defective Twitter clients, it turns out, but the creative mistakes that Excel users make every day are one in a billion on Twitter. I can apply artificial mojibake, but then I'm not correlating the text with the encoding it's likely to be in.
> But, any ideas where the corpus for telling apart encodings should actually come from?
If you want a lot of fairly natural Japanese, try Wikipedia. If you want something a bit more comprehensive and less restricted, the "Balanced Corpus of Contemporary Written Japanese" is a thing that exists, but you might have to jump through a few hoops.
Thanks! I saw your retweet about it. Let me warn you that I haven't solved the CJK encoding problem, so Japanese developers are only going to benefit if their mojibake involves UTF-8.
The corpus I'm looking for needs to be messy and informal, unlike Wikipedia, as the ambiguous cases tend to be crazy emoticons. I guess I can't hope for more than Twitter with artificially increased mojibake.
> This problem is worth thinking about (which is why a customer asked a team of CS researchers and linguists to think about it, including one barely competent programmer who nonetheless had reasonably good intuitions for what character distributions looked like), but it turns out to be much, much less hard than many people originally expected.
There are a couple related ideas that might make this more obvious in hindsight:
- You can determine the language of a substitution-ciphered text just from its frequency distribution.
- Imagine getting a page each of text in several different languages, say english, french, portuguese, polish, and turkish. "Normalize" everything to ascii characters, so laïcité turns into laicite. It will be trivially easy, by looking at any page in isolation, to determine which language is represented there.
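That second experiment can be sketched in a few lines of Ruby. This is a toy version of my own with one-sentence stand-in "pages" (real signatures would use at least a page of text per language, as described above):

```ruby
# Unit letter-frequency distribution over ASCII-folded text.
def letter_freq(text)
  counts = Hash.new(0.0)
  text.downcase.scan(/[a-z]/) { |c| counts[c] += 1 }
  total = counts.values.sum
  counts.transform_values { |v| v / total }
end

# Smallest total absolute difference between letter distributions wins.
def guess_language(text, samples)
  freq = letter_freq(text)
  samples.min_by do |_lang, sample|
    sig = letter_freq(sample)
    ("a".."z").sum { |c| (freq.fetch(c, 0.0) - sig.fetch(c, 0.0)).abs }
  end.first
end

samples = {
  "english" => "the quick brown fox jumps over the lazy dog and runs away",
  # ASCII-folded Polish (UDHR article 1)
  "polish"  => "wszyscy ludzie rodza sie wolni i rowni pod wzgledem swej godnosci i swych praw",
}
```

Single-letter frequencies are cruder than what real language identifiers use (bigrams or trigrams help a lot), but even this separates distant languages cleanly.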
Trivially easy for a human, maybe. In most cases it's pretty straightforward, but there are a few notoriously tricky language pairs that any automated solution is going to have trouble with: most notably Norwegian and Danish (and Swedish to a lesser extent), and Czech and Slovak. There are also some tricky cases in the Iberian area, where some dialects of Spanish are quite similar to Portuguese, and in the Balkans, where Croatian, Serbian and Bosnian are basically identical, although in some cases they can be distinguished by the writing system used.
> This is the inevitable result of the folly of trying to guess a file's encoding.
I'd say just: "This is the inevitable result of the folly of trying to guess." Guessing without asking for confirmation has no place in system libraries, nor in any "serious" (read: money-related) application.
Guessing is OK for providing nice defaults that work most of the time, but the user should have the final word.
Donald Knuth's annual Christmas lecture at Stanford was just released on YouTube a few days ago. It's about comma-free codes, an idea closely related to this bug: https://www.youtube.com/watch?v=48iJx8FVuis
The Wikipedia article mentions applications other than Notepad and implies IsTextUnicode was fixed in Vista.
Kaplan's blog post explains that the change in Vista was actually in Notepad (it switched to a different algorithm) and that IsTextUnicode was left broken, so the other applications mentioned in the Wikipedia page would presumably still be broken on Vista and above.
my favorite thing about this is the apparent fact that someone typed "bush hid the facts" into a notepad document then saved it. "oh man, this is big... better write this down..."
Ironic that this came after the DOJ stopped investigating Microsoft for abusing their monopoly on Windows when Bush came into office.
I remember watching the DOJ deposition videos of Bill Gates drinking Pepsi while quibbling over what type of Java they were talking about Microsoft competing against.