I'm just asking about Google because they're in such a good position to do it.
Is any private company doing it? It'd be neat if there were a web site where you could submit some writing samples (emails or whatever) and then see everything else that person has posted online (regardless of whether it was anonymous or not).
I'm sure there's no way to do it with total accuracy, but with enough input shouldn't it be possible to be highly accurate?
Anyone know of any software that can take a large number of writing samples and determine who wrote which ones?
If not, how would you go about creating it?
Everyone thinks "Aha, you have some catchphrases" (I do) or "Aha, you were one of the only Republicans and thus someone saying 'death tax' was more likely you" (true) or "You cited nationalreview.com more than the rest of the forum together" (true), but it turns out the distribution of really stupid stuff (stopwords, essentially) works better.
Ironically, this is the same thing they've discovered for making female/male authorship decisions, although I never went the next step and asked "So what relationship does my distribution have with the average guy's distribution?"
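For anyone wondering what "distribution of stopwords" looks like in practice, here is a minimal sketch. Everything in it is an assumption made for illustration: the tiny STOPWORDS list, the function names, and the choice of cosine similarity as the comparison. It is not the method used in the experiment described above, just one simple way to compare function-word frequencies.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny list of function words.
# A serious attempt would use a few hundred.
STOPWORDS = ["the", "a", "of", "and", "to", "in", "that",
             "it", "is", "i", "but", "so", "this", "you"]

def profile(text):
    """Return the relative frequency of each stopword in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w in STOPWORDS)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in STOPWORDS]

def cosine(u, v):
    """Cosine similarity between two equal-length frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def attribute(unknown, known_samples):
    """Pick the author whose stopword profile is closest to the unknown text.

    known_samples maps an author name to one string of that author's writing.
    """
    target = profile(unknown)
    return max(known_samples,
               key=lambda author: cosine(target, profile(known_samples[author])))

samples = {
    "alice": "the cat of the house and the dog of the yard",
    "bob": "i think that it is so but you know that it is",
}
guess = attribute("the bird of the tree and the fish of the sea", samples)
print(guess)
```

With samples this short the result means nothing, of course. The point is only that the features are boring words, not catchphrases or topics.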
Incidentally, here's the reason you'll never have to worry about this in the context of "Google the Internet for everything Patrick McKenzie has ever written": imagine I have a 99.9% effective filter for you, and I dragnet an Internet filled with 5 billion documents of which you've written 1,000. The 0.1% error rate applied to 5 billion documents means I then identify roughly 5 million documents as written by you... but you only wrote 1,000 of them.
This sort of "don't search the haystack unless you're bloody sure it is packed full of needles" thing is why you never want to test a population not known to be at risk for the disease, etc. (Or why you retest in the event of a positive using a different test.)
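The haystack arithmetic above can be checked in a few lines. This is a sketch under one stated assumption: "99.9% effective" is read as the filter being right 99.9% of the time on both your documents and everyone else's. The function name dragnet and its parameters are made up for the example.

```python
def dragnet(total_docs, true_docs, accuracy):
    """Return (documents flagged, fraction of flagged docs that are really yours).

    Assumes the same accuracy for catching true documents and for
    rejecting everyone else's (an assumption, not stated in the thread).
    """
    other_docs = total_docs - true_docs
    true_positives = true_docs * accuracy
    false_positives = other_docs * (1 - accuracy)
    flagged = true_positives + false_positives
    return flagged, true_positives / flagged

flagged, precision = dragnet(total_docs=5_000_000_000,
                             true_docs=1_000,
                             accuracy=0.999)
print(f"flagged: {flagged:,.0f}")
print(f"fraction actually yours: {precision:.4%}")
```

That comes out to roughly 5 million flagged documents, of which about 0.02% are actually yours. Shrink total_docs to a population where the author is known to be active (one forum, say) and the same filter becomes useful, which is the "only test the at-risk population" point.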