Hacker News
"Anonymized" data really isn't—and here's why not (arstechnica.com)
79 points by timwiseman on Sept 8, 2009 | 14 comments



Hey all,

I'm one of the authors of the Netflix and social network de-anonymization papers referenced, and the author of http://33bits.org. I also had the pleasure of having many talks with law professor Paul Ohm over the technical aspects of his excellent paper (which another commenter has posted a link to.)

Paul's paper is a great example of people in the law community "getting it." (He happens to have a CS degree and is a hobbyist perl hacker!) We need more people like Paul, and we equally need tech people who understand law and policy. I encourage everyone to give Paul's paper a quick read, gain awareness of the issues, such as why the current privacy laws are wrongheaded and what needs to be changed, and be on the lookout for ways to change things (as consumers, as tech influencers, as citizens). Cheers.


"Impatient readers can skip to the bullet-point summary at the end."

Have you heard of jump links? ;0)>


Folks interested in this subject should read this very fascinating blog:

http://33bits.org/

(The title comes from the factoid that 33 bits of information is enough to uniquely address any living person. After reading through a few practical examples I'm convinced: anonymity is an illusion.)


Fortunately, we can fall back on pseudonymity to some extent, as well as lying.

A close friend of mine, over more than 5 years of college, shared one grocery store "club" card with all roommates and acquaintances, eventually totaling a couple dozen users.

The store's disincentives to doing this, usually a bonus paid to whoever was using the card when it hit some predetermined spending level, were ineffective: this was a group who would gladly pay $1 for a lottery ticket, and this was a bargain by comparison.

Personally, I routinely give bogus ZIP codes for any situation other than where I wish to receive mail.


The big grocery chain around here recently introduced a new program where shoppers get $0.10 off a gallon of gas for every $50 they spend in the store, tracked of course through their "Advantage" card. You show the card at participating stations, and they take the discount off there.

The actual savings are nominal: spend $150, fill up w/ 20 gallons (the limit), and save what, $6.00? That's only 4%!

But wow, people have become much more serious about using their own cards. Participation must be near 100%.


My favorite zip code to give out: 90210


Unfortunately the article starts off: "Boom! But it was only an early mile marker in Sweeney's career; in 2000, she showed that 87 percent of all Americans could be uniquely identified using only three bits of information: ZIP code, birthdate, and sex."

That makes it confusing whether we're talking about binary bits or unique pieces of information.
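To make the distinction concrete: Sweeney's three "bits" are three fields, and each field carries many binary bits. Here is a rough sketch in Python of what "uniquely identified" means for such a field combination. The records and the `unique_fraction` helper are made up for illustration; they are not Sweeney's data or method.

```python
from collections import Counter

# Hypothetical toy records: (zip_code, birthdate, sex). Invented for illustration.
records = [
    ("02139", "1970-01-15", "F"),
    ("02139", "1970-01-15", "F"),  # shares all three fields with the row above
    ("02139", "1982-07-04", "M"),
    ("90210", "1965-03-22", "F"),
    ("90210", "1990-11-30", "M"),
]

def unique_fraction(rows):
    """Fraction of rows whose field combination appears exactly once."""
    if not rows:
        return 0.0
    counts = Counter(rows)
    unique = sum(1 for row in rows if counts[row] == 1)
    return unique / len(rows)

print(unique_fraction(records))
```

In this toy set, three of the five rows are the only ones with their combination, so those three people would be singled out by the three fields alone.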

33 bits is presumably from the world population being < 2^33 (WolframAlpha puts world pop. at 6 Billion in 2007).
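That arithmetic is easy to check. A quick sketch, assuming the 6 billion figure quoted here (the function name `bits_to_identify` is just for illustration):

```python
def bits_to_identify(population: int) -> int:
    """Smallest b such that 2**b >= population, i.e. enough bits
    to give every member of the population a distinct label."""
    if population < 1:
        raise ValueError("population must be positive")
    return (population - 1).bit_length()

world_population = 6_000_000_000  # assumed figure, taken from the comment
print(bits_to_identify(world_population))  # 33
print(2**32, 2**33)  # 4294967296 8589934592
```

32 bits only covers about 4.3 billion distinct values, which is why the number lands on 33.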


Anonymization is in essence a cryptography problem, so any anonymization scheme should by default be considered insecure without extraordinary evidence to the contrary.


When AOL released their 'anonymized data set' I happened to be in a datacenter that had some pretty good bandwidth, so I grabbed a copy before it was pulled. Even now, several years later, it is amazing how much you can mine from it that is still relevant today.


You can still find it around the interwebs, in torrent sites etc... I'm just not sure about the legality of using it for practical or academic purposes.


Here's the Paul Ohm paper referenced therein: http://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID1450006_code...


Ohm's blog on "Freedom to Tinker".

http://www.freedom-to-tinker.com/blog/paul

I couldn't get SSRN to cooperate; the download buttons just kept returning me to the abstract. BTW, Ohm himself admits SSRN has problems, in the comments of the post where he announces the paper.

http://www.freedom-to-tinker.com/blog/paul/anonymization-fai...


I guess this is analogous to a Heisenberg principle for computer science. You either have information on a person which increases the likelihood of identification, or you can have no information at all.


It's the classic tautology: if more than one person knows something, it's not a secret.



