Linguistics identifies anonymous users

rachelbythebay · on Jan 9, 2013

First come the programs which generate fingerprints from writing samples. Then comes the corpus of known author fingerprints which can take a submission of arbitrary content and turn it into a series of authors and confidence values.

Later, someone will figure they can profit by gaming this system and writing posts which appear to be someone else. All they have to do is try to clone an author's style and then feed their attempts to the analysis program to see how close they came. Then they can just fine-tune things until it's a strong enough match and push it out there as legitimate.

Yep, we might just have a "linguistic similarity analysis arms race" at some point. I concluded this in a post some months back. http://rachelbythebay.com/w/2012/08/29/info/

ChuckMcM · on Jan 9, 2013

Hardly a fair race. And of course its on-going in the pursuit of better ad targeting. And it its one of those weird places you find yourself when dealing with something like Quora and other sites. Basically every 'sample' has both data (the sample) and meta-data (where it came from, when) and those things add up. Looking at some of the research which de-anonymized datasets it was pretty clear that who was saying what, especially a large corpus of such utterances, would yield identify information, if not the actual person but the 'same' person.

andrewcooke · on Jan 9, 2013

nice post (didn't know about dns serial nos). but why does it have to be a humorous phrase (as timestamp)? couldn't it be a site that generates a random nonce every day (and provides a service that compares two nonces)?

rachelbythebay · on Jan 9, 2013

I guess it doesn't have to be humorous to be useful. I'd just like to have them be quirky enough to be appreciated by users. Just think of the first time your favorite site changed its appearance for a special holiday or similar. It gives you some idea there are people involved and it's not just a bunch of robots flinging bits around.

andrewcooke · on Jan 21, 2013

http://colorlessgreen.net/

j_s · on Jan 9, 2013

Authenticode uses timestamp servers.

http://blogs.msdn.com/b/ieinternals/archive/2011/03/22/authe...

  > Time-stamping adds a cryptographically-verifiable timestamp to your 
  > signature, proving when the code was signed. [...] not all publisher 
  > certificates are enabled to permit timestamping to provide indefinite 
  > lifetime [...] to free a Certificate Authority from the burden of 
  > maintaining Revocation lists (CRL, OCSP) in perpetuity.

krickle · on Jan 9, 2013

Followed by the linguistic anonymity arms race!

noonespecial · on Jan 9, 2013

They'll have a hard time defeating my thesaur-ator and idiom-ator.

kragen · on Jan 9, 2013

IIRC, Larry Detweiler's sock-puppeting attempts on the cypherpunks mailing list were unmasked by other cypherpunks with this technique in the early 1990s.

Detweiler, natively paranoid and geographically isolated in Colorado, had come to the conclusion that most of the prominent cypherpunks posters were sock-puppets of a single person he called "Medusa" — he invented the word "pseudospoofing" to describe the phenomenon, although "sock-puppet" is the more common current term — and resolved to use the same technique in response.

I never read the actual word-frequency analyses, just Detweiler complaining about having had it used against him, which I assume he didn't imagine.

There was a recent paper (two or three years ago?) that claimed that people could usually defeat this kind of linguistic analysis simply by trying to change their writing style. I'm skeptical that this is applicable in general.

http://33bits.org/ is a pretty good blog on the unreasonable difficulty of anonymity in a computerized world.

e12e · on Jan 9, 2013

"And posts must be translated to English, a process which boosted author identification from 66 to around 80 per cent but was imperfect using freely available tools like Google and Bing."

I find that more than a little surprising. I would think machine translation would detract from the signal and/or normalize the text, rather than boost the differences.

robk · on Jan 9, 2013

This is very surprising given the statistical machine learning approach of Google Translate. Surely this adds in too much noise to be trustworthy.

taneliv · on Jan 9, 2013

Could it be that some words do not get translated and appear in the original language? These would be very distinctive features in the later steps.

galaktor · on Jan 9, 2013

That is indeed an interesting point. The article states they use function words. Assuming that there is an injective relation between those function words in the original language and English, I could imagine that the translation would make it easier to compare (translated) foreign text to the normalized English database. As opposed to comparing the same function word in (say) Italian to that word in English.

maxerickson · on Jan 9, 2013

I watched some of the video of the presentation. They attribute the increase in accuracy to the analysis tools being developed/specialized for English.

lindauer · on Jan 9, 2013

There was a really nice paper at the last IEEE Security & Privacy conference where they did an analysis like this on blog postings.

http://randomwalker.info/publications/author-identification-...

I tried a similar analysis for a class project using Twitter data, and it worked surprisingly well considering the small amount of data in a tweet. Would-be-anonymous posters beware!

dalke · on Jan 9, 2013

For an historical example, the anonymous author of Primary Colors was identified through a literary analysis (http://en.wikipedia.org/wiki/Primary_Colors_%28novel%29#Unma... )

Don Foster "did a textual analysis of Primary Colors to identify words, phrases, and expressions that were repeatedly used and to look for "quirky expressions" and peculiarities of punctuation. Afterwards, he simply analyzed writing samples until he located a consistent usage of the same "telltale" signs of authorship. Sample after sample was rejected until finally he happened onto the writings of Joe Klein and found what he had been looking for. The literary quirks and features that Foster had isolated in Primary Colors occurred with such frequency in Klein's articles that despite Klein's initial denial, Foster knew he was Anonymous."

Similar methods have been used to, for example, try and figure out who wrote which portions of the anonymously authored "Federalist Papers."

sp332 · on Jan 9, 2013

Amazon used to pull out "statistically improbable phrases", 2- or 3-word phrases that occurred in the book you were looking at that were rare among all the books in their "look inside" database. Some examples: http://www.mentalfloss.com/blogs/archives/25839

stephengillie · on Jan 9, 2013

The 1998 Rom-com Cupid [1] had an episode with an "elite linguistics ninja" that could purportedly tell where a person was born and grew up based on word usage.

[1] https://en.wikipedia.org/wiki/List_of_Cupid_episodes

cdman · on Jan 9, 2013

The 1956 musical My Fair Lady [1] features a professor with the same claim. Nothing new under the sun I guess :-)

[1] http://en.wikipedia.org/wiki/My_Fair_Lady

honestcoyote · on Jan 9, 2013

Would this work as a way of hiding? Write plainly. Put it into the Google translator to be converted to the language of your choice and then this is again translated back to English. Correct the individual words mistranslated but leave the awkward phrasing intact.

lsiebert · on Jan 9, 2013

We shouldn't have to force ourselves to adopt different ways of talking when we are anonymous. We should analyze the places where this variance exists, regularly compile information on the most probable ways of writing, and then programmatically alter our words, using the previous established probabilities, keeping them updated as time goes on.

Alternatively, we can analyze our speech for those elements that are significantly different from others to allow identification, and alter just those.

NegativeK · on Jan 9, 2013

I've forced myself to use different writing styles when trying to be anonymous. I've no clue if I succeeded, but I'm relatively sure that I wasn't as clearly myself. Every time, I've wondered what it would take to write an anonymizing tool that would strip identifying patterns from the sentences.

Of course, modifying word choice seems trivial in comparison to analyzing how a person lays out an entire paper or communicates the steps of their thought process.

gnosis · on Jan 9, 2013

"I've wondered what it would take to write an anonymizing tool that would strip identifying patterns from the sentences."

Peter Wayner came up with a technique called "Mimic Functions" [1] that could (at least in principle) change a file so that it assumes some statistical properties of a different file.

Unfortunately, it's easier said than done. One problem is that the sender doesn't know exactly which statistical tests would be used by the attacker, so while mimic functions could be devised to emulate certain statistical properties of a given file, it may be impossible to mimic all the statistical properties of any but the most trivial file -- especially if you want the result to make sense to a human reader. It might fool a machine, though.

[1] - https://en.wikipedia.org/wiki/Mimic_function

pavel_lishin · on Jan 9, 2013

I bet switching to a different keyboard layout would do wonders. I would wager that doing that would slow down your thought->words output stream enough to alter it.

lykedis · on Jan 9, 2013

who iz stoopid enuf 2 right in der normal stylz wen dey do bad tings?!

saraid216 · on Jan 9, 2013

In Designing Virtual Worlds, Richard Bartle explicitly advises that developers who intend to masquerade as players be very careful not to give themselves away, and offers a nice list of things to pay attention to.

nerdfiles · on Jan 9, 2013

I genuinely believe their findings do not apply to me; or, to say the least, that I am a statistical outlier.

There are still trolls out there, alive and linguisizing.