I think the article doesn't really clarify the reason these bugs exist. After all, is it really a bug? What if I so happen to be working with Chinese Unicode files without BOM that are also valid ASCII files? Then it would seem to me that Notepad was previously behaving correctly, and ever since Vista they introduced a bug!
No, this isn't a low-level bug tied to any particular Windows API function. This is the inevitable result of the folly of trying to guess a file's encoding. When we find ourselves cornered into doing something this ridiculous, it becomes apparent that we, as a society of programmers, are extremely disorganized.
> What if I so happen to be working with Chinese Unicode files without BOM that are also valid ASCII files?
What's the longest natural Chinese substring you can find for which this is the case? When we ran the encoding-and-language-detection heuristic we developed for our research work (this was 10+ years ago, I think), the answer was 5 characters -- there was no substring of 6+ characters (12 bytes of UTF-16) which could be reasonably written in ASCII.
This is because much of ASCII is unprintable, more-than-doubly so if you're concerned strictly with 7-bit ASCII as opposed to e.g. Latin-1. When's the last time you saw a string contain 0x18 ("cancel"), 0x07 ("bell"), etc.?
An assortment of 0x..18 characters for your amusement: 予,先,券,単,又. I picked ones which are included in commonly used words in Japanese. Seeing any one of these once in UTF-16 is dispositive that the bytestream is not ASCII.
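To make the control-byte point concrete, here is a quick sketch of my own (not the project's code) that lists the bytes of a string's UTF-16 encoding which fall in the ASCII control range. Most of those bytes (bell, cancel, and friends) essentially never appear in real ASCII text, and the NUL bytes that pure-ASCII text produces in UTF-16 are just as telling:

```ruby
# List the bytes of a string's UTF-16BE encoding that would be ASCII
# control characters (0x00-0x1F) if the stream were read as ASCII.
def ascii_control_bytes(str)
  str.encode("UTF-16BE").bytes.select { |b| b < 0x20 }
end

ascii_control_bytes("AB")  # => [0, 0]  (the NUL high bytes of "A" and "B")
ascii_control_bytes("吊")  # => [10]    (U+540A: its low byte is an ASCII line feed)
```

One such byte is already strong evidence the stream isn't ASCII; a handful is dispositive.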
This problem is worth thinking about (which is why a customer asked a team of CS researchers and linguists to think about it, including one barely competent programmer who nonetheless had reasonably good intuitions for what character distributions looked like), but it turns out to be much, much less hard than many people originally expected.
Anyhow, long story short: make a histogram of the bytes, dot product with the histogram signature you have of a few large corpora in decent-guess-based-on-apriori-knowledge languages/encoding, normalize. The screamingly obvious candidate is almost always the winner.
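A minimal sketch of that recipe, in Ruby (my own toy version, not the actual research code; the signature "corpora" here are one-line stand-ins, where the real thing would use large corpora per candidate language/encoding):

```ruby
# Build a unit-normalized byte histogram (256 bins).
def histogram(bytes)
  h = Array.new(256, 0.0)
  bytes.each { |b| h[b] += 1 }
  norm = Math.sqrt(h.sum { |x| x * x })
  norm.zero? ? h : h.map { |x| x / norm }
end

# Dot the input's histogram against each signature; highest score wins.
def best_encoding(bytes, signatures)
  hist = histogram(bytes)
  signatures.max_by { |_name, sig| hist.zip(sig).sum { |a, b| a * b } }.first
end

# Toy signatures standing in for real corpus histograms.
signatures = {
  "ASCII"    => histogram("the quick brown fox jumps over the lazy dog".bytes),
  "UTF-16LE" => histogram("日本語のサンプルテキストです".encode("UTF-16LE").bytes),
}

best_encoding("hello world, this is plain text".bytes, signatures)  # => "ASCII"
```

Even with signatures this crude, the screamingly obvious candidate dominates the dot product: ASCII-ish input overlaps the letter/space bins, while UTF-16 Japanese piles its mass on 0x30 and high bytes.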
If you want to do it really, really quickly you can even evaluate this heuristic with a Bloom filter in an FPGA.
Do you have an intuition for what such a string looks like if you interpret its bytes as ASCII? No? Just guess: "almost plausibly an English document", "gibberish but mostly ASCII", or "absolutely zero probability of being mistaken for ASCII".
I whipped up a quick Ruby script:
```
require "colorize"
chinese = File.read("/tmp/chinese.txt")
puts chinese.bytes.map { |b| s = b.chr; s.ascii_only? && s.match?(/[[:print:]]/) ? s.blue : "?".red }.join
```
which walks the UTF-8 bytes of that string and renders each byte blue where it collides with a printable ASCII character, and as a red question mark otherwise.
What's hard, however, is telling multi-byte CJK encodings apart, and telling them apart from UTF-8.
I've tried it, as I was working on extending my mojibake-fixing library [1] to the language that gave us the word "mojibake". The ambiguous cases actually happen.
Suppose you found this string (which someone actually tweeted) encoded in Shift-JIS:
(|| * m *)ウ、ウップ・・
Those half-width katakana with a comma in the middle look a lot like typical Japanese mojibake, so the classifier would be entirely justified in deciding that Shift-JIS is wrong. Next let's try decoding those bytes as EUC-JP instead:
(|| * m *)海劾餅ゥ
Hey, maybe that's a name or something! No other encoding works, so the classifier happily decides it's EUC-JP. Except this string is complete nonsense and the person actually meant the first string.
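The mechanism behind this ambiguity: Shift-JIS encodes half-width katakana as single bytes in 0xA1-0xDF, while EUC-JP reads pairs of bytes in that same range as JIS X 0208 kanji, so a run of half-width katakana decodes cleanly either way. A tiny sketch (my own two-byte example, not the tweeted string):

```ruby
# Two raw bytes that are valid in both Shift-JIS and EUC-JP.
bytes = "\xB3\xA4".b

as_sjis = bytes.dup.force_encoding("Shift_JIS").encode("UTF-8")
as_euc  = bytes.dup.force_encoding("EUC-JP").encode("UTF-8")

as_sjis  # => "ｳ､"  (two half-width characters)
as_euc   # => "海"  (one perfectly plausible kanji)
```

Neither decode raises an error, so byte validity alone can't settle it; you're back to guessing which output a human more plausibly typed.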
I do agree with you that it's a problem worth trying to solve (clearly). We can't just throw up our hands and say "this is a problem that shouldn't happen, so let's not solve it". Microsoft Office exists, and its mojibake-causing "features" will never be removed due to backward compatibility, so this happens and will continue to happen.
But, any ideas where the corpus for telling apart encodings should actually come from? I've been using Twitter, but this limits the domain. Only thoroughly defective Twitter clients send anything but UTF-8 to Twitter. There are lots of defective Twitter clients, it turns out, but the creative mistakes that Excel users make every day are one in a billion on Twitter. I can apply artificial mojibake, but then I'm not correlating the text with the encoding it's likely to be in.
> But, any ideas where the corpus for telling apart encodings should actually come from?
If you want a lot of fairly natural Japanese, try Wikipedia. If you want something a bit more comprehensive and less restricted, the "Balanced Corpus of Contemporary Written Japanese" is a thing that exists, but you might have to jump through a few hoops.
Thanks! I saw your retweet about it. Let me warn you that I haven't solved the CJK encoding problem, so Japanese developers are only going to benefit if their mojibake involves UTF-8.
The corpus I'm looking for needs to be messy and informal, unlike Wikipedia, as the ambiguous cases tend to be crazy emoticons. I guess I can't hope for more than Twitter with artificially increased mojibake.
> This problem is worth thinking about (which is why a customer asked a team of CS researchers and linguists to think about it, including one barely competent programmer who nonetheless had reasonably good intuitions for what character distributions looked like), but it turns out to be much, much less hard than many people originally expected.
There are a couple related ideas that might make this more obvious in hindsight:
- You can determine the language of a substitution-ciphered text just from its frequency distribution.
- Imagine getting a page each of text in several different languages, say english, french, portuguese, polish, and turkish. "Normalize" everything to ascii characters, so laïcité turns into laicite. It will be trivially easy, by looking at any page in isolation, to determine which language is represented there.
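That second experiment can be sketched in a few lines of Ruby. This is a toy version of my own with one-sentence stand-in "pages" (real signatures would use at least a page of text per language, as described above):

```ruby
# Unit letter-frequency distribution over ASCII-folded text.
def letter_freq(text)
  counts = Hash.new(0.0)
  text.downcase.scan(/[a-z]/) { |c| counts[c] += 1 }
  total = counts.values.sum
  counts.transform_values { |v| v / total }
end

# Smallest total absolute difference between letter distributions wins.
def guess_language(text, samples)
  freq = letter_freq(text)
  samples.min_by do |_lang, sample|
    sig = letter_freq(sample)
    ("a".."z").sum { |c| (freq.fetch(c, 0.0) - sig.fetch(c, 0.0)).abs }
  end.first
end

samples = {
  "english" => "the quick brown fox jumps over the lazy dog and runs away",
  # ASCII-folded Polish (UDHR article 1)
  "polish"  => "wszyscy ludzie rodza sie wolni i rowni pod wzgledem swej godnosci i swych praw",
}
```

Single-letter frequencies are cruder than what real language identifiers use (bigrams or trigrams help a lot), but even this separates distant languages cleanly.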
Trivially easy for a human, maybe. In most cases it's pretty straightforward, but there are a few notoriously tricky language pairs that any automated solution is going to have trouble with: most notably Norwegian and Danish (and Swedish to a lesser extent), and Czech and Slovak. There are also some tricky cases in the Iberian area, where some dialects of Spanish are quite similar to Portuguese, and in the Balkans, where Croatian, Serbian and Bosnian are basically identical, although in some cases they can be distinguished by the writing system used.
> This is the inevitable result of the folly of trying to guess a file's encoding.
I'd say just: "This is the inevitable result of the folly of trying to guess." Guessing without asking for confirmation has no place in system libraries, nor in any "serious" (read: money-related) application.
Guessing is OK for providing nice defaults that work most of the time, but the user should have the final word.
Donald Knuth's annual Christmas lecture at Stanford was just released on YouTube a few days ago. It's about comma-free codes, an idea closely related to this bug: https://www.youtube.com/watch?v=48iJx8FVuis
The Wikipedia article mentions applications other than Notepad and implies IsTextUnicode was fixed in Vista.
Kaplan's blog post explains that the change in Vista was actually in Notepad (it switched to a different algorithm) and that IsTextUnicode was left broken, so the other applications mentioned in the Wikipedia page would presumably still be broken on Vista and above.
my favorite thing about this is the apparent fact that someone typed "bush hid the facts" into a notepad document then saved it. "oh man, this is big... better write this down..."
Ironic that this came after the DOJ stopped investigating Microsoft for abusing their monopoly on Windows when Bush came into office.
I remember watching the DOJ deposition videos of Bill Gates drinking Pepsi while quibbling over what type of Java they were talking about Microsoft competing against.