Debunking the Myth of "Anonymous" Data (eff.org)
195 points by gslin on Nov 11, 2023 | 102 comments



I work for a data privacy startup, and this article unfortunately groups all forms of anonymization together. It is specifically criticizing forms of anonymization that only treat direct identifiers like names, addresses, and phone numbers. That is usually referred to as "pseudonymization", and they are correct to point out that an only moderately sophisticated attacker can still link people in the dataset using combinations of indirect identifiers like age, birthday, and zipcode. Pseudonymization is a weak form of privacy.

More sophisticated methods of privacy also anonymize indirect identifiers, and in some cases personal attributes. They do this by adding noise to the data in such a way that the noise has relatively* minimal impact on the results of computations made over the dataset, but a significant impact on the ability to re-identify someone using indirect identifiers or attributes.

*There is always a tradeoff between privacy and utility. The only way to achieve 100% private data is 100% noise, but the privacy-utility tradeoff curve isn't linear, and you can still achieve very good utility and very good privacy in many cases, especially with the best tools. Methods are also improving over time, reducing the impact of the tradeoff.


In the interest of charity, can you define "noise" and explain why rudimentary denoising algorithms that have existed since the 1960s can't penetrate it?

Time and time again people think they've anonymized data and they're always proven incorrect.

Just come to grips with the fact you're participating in the sale of my (or whoever's) private data. Hope you sleep well!


"Denoising" algorithms can't remove this kind of noise. It doesn't work like that. You're simply mistaken.

Generally the state of the art for adding noise to data is differential privacy or microaggregation. In the case of differential privacy it's typically Gaussian or Laplacian noise, but it is not a trivial application. Noise applied through microaggregation is not a mathematical function, because microaggregation targets k in k-anonymity, and the "noise" can only be measured after treatment, through various distortion measures.
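For anyone curious what "adding noise" means concretely, here's a minimal sketch of the Laplace mechanism (illustrative only; the query, count, and epsilon are made up):

    import numpy as np

    def dp_count(true_count, epsilon, sensitivity=1.0):
        # Adding or removing one person changes a count by at most 1 (the
        # sensitivity), so Laplace noise with scale sensitivity/epsilon gives
        # epsilon-differential privacy for this single query.
        return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

    true_answer = 1342                          # count computed on the raw data
    print(dp_count(true_answer, epsilon=0.5))   # noisy answer that is safer to release

Because the noise is freshly drawn randomness calibrated to the query, rather than a structured signal riding on top of the data, there is nothing for a classical denoising filter to learn and subtract from a single release.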

My company doesn't sell private data or facilitate the sale of private data. We improve privacy within organizations to reduce the risk of privacy leaks. Data brokers and ad targeters would have no interest in our software. But thanks for the condescending comment!


Who do you think "data brokers and ad targeters" purchase data from?

I don't understand how you could help companies protect the private data they collect from users and think that you're not facilitating the collection and sale of it.


Almost literally every company with human customers collects user data because it’s necessary to run a business. Very few of those have a business model where the purpose is to profit from the sale of that data. Every one of our customers is treating first party data for first party use cases, which is far more common than buying or selling data. How many companies have prod user data and developers that need an up to date prod-like database? A ton. Many just copy prod and distribute to devs. The good ones try to keep scripts for populating the database with test data up to date, but that’s a nightmare. Anonymization is an elegant solution to this problem. There are countless other use cases where organizations need to use their customer data internally and (should) want to protect identities.

Most companies don’t want to sell their user data because they consider it competitive, and most data brokers don’t want to anonymize because why would they if they’re already selling raw data? They aren’t interested in our software because we prevent singling out of identities.


> collects user data because it’s necessary to run a business

That's a lie we developers like to tell ourselves. But the reality is that most companies track data because some sold them software to track data.

It's ridiculous how every single small business website has cookie banners, GDPR consent, and includes multiple tracking scripts on their website to collect all the user data they can get their hands on.

Yet these small businesses have no idea what data they are collecting. The owners are completely clueless. They have all this data collection shit on their website because someone else who has no clue put it there.

They see facebook and google collecting all this data, and think they need to collect data too. But they have no idea what to do with it. At best they look at a few numbers and get a feeling that all the money they spent on their website was worth it.

But it's all bullshit. 90% of businesses don't need invasive tracking, don't know what to do with the collected data, and could save money if they didn't bother to track all that useless data.


If you have a name, address, and phone number associated with a customer account, then you have collected PII. Good luck avoiding that.

User tracking actually requires linking to associate online activities with user personas. Anonymization systems are designed to break linking.


From sources that don't use differential privacy to mask the relevant features.

Can you maybe share your mathematical opposition to differential privacy? It's provably effective.


> they're always proven incorrect

This is pretty obvious survivorship bias. You think that they’re wrong because you only hear about the cases when they were. There are tons of examples of breaches where the anonymization held up, it’s just that it’s not interesting so nobody talks about it.


Umm, yes?

If I have a lock on my door that successfully prevented 99 people from breaking in, but one skilled lockpicker subverted it, then my lock has failed.


And yet we all still use door locks, despite the fact that a skilled lock picker could break every standard door lock in minutes. This illustrates the importance of threat modeling.


That’s the wrong analogy. If 99% of locks are resistant to picking but yours wasn’t, it’s ridiculous to then say that locks don’t work.


The analogy isn't that one out of a hundred locks doesn't work, but that none of the locks work in the face of a skilled attacker, who may comprise 1% of attackers.

This isn't saying that locks are worthless, but it is saying that thinking of them as secure is a false confidence. This sort of truth is why there's a saying in the security field that you're at the greatest risk the moment that you think you're secure.

In terms of data security, the only way to be actually secure from data leakage is to not be in possession of the data.


That’s simply incorrect.

Edit: my source for this is that I have been involved with hundreds of data breaches where the EU was satisfied that the privacy controls were sufficient to say that no personal data was compromised.


What part of what I said was incorrect?

Your source doesn't make the case that I was incorrect about any of my assertions. That doesn't mean I'm right, of course, but I don't see how examples of privacy controls successfully protecting data disproves them. There can still be cases where the controls failed.


> What part of what I said was incorrect?

All parts of it. You are asserting that a sufficiently skilled attacker can magically overcome all privacy controls. This fundamentally misunderstands how privacy controls work. Don’t make absolute statements, they’re always incorrect (see what I did there?).

When data is compliantly anonymized, the ability to deanonymize it has been irrevocably destroyed. When evaluating privacy controls, you evaluate them against a trusted insider with full access and unlimited time. There are lots of organizations whose controls meet this bar.


>Just come to grips with the fact you're participating in the sale of my (or whoever's) private data. Hope you sleep well!

What an uncharitable response, in that it forecloses on the original commenter even explaining their argument better or clarifying. I mean, what? Fuck having a bit more honest debate, just jump straight to a pre-assumed moral judgement?


> Just come to grips with the fact you're participating in the sale of my (or whoever's) private data. Hope you sleep well!

I think it's really sad that the open source community has historically cultivated a culture which doesn't recognize this kind of paranoia as a lack of mental wellness. I think it's because the zealotry comports well with the extreme amount of energy needed to bootstrap the ecosystem into existence.

You can care about privacy while being unrealistic about why you don't truly have it and never will, because of how much of a risk it is to the powers that be. You can live in a world ruled by proxy wars between gigantic political superpowers and empires who control the upstream levers of the macroeconomy to ensure that the world will always go in a direction opposite to what you'd prefer it to be. It's been like this for thousands if not tens or hundreds of thousands of years.

But moreover, you can and likely do live in a society, full of people who will always value convenience over the freedom you seem to cherish. But even you likely are willing to give up some freedom and privacy for some convenience, unless you employ private security personnel to guard your private compound. And my guess is that if you were inclined in that direction, you wouldn't really be posting on HN. Short of that, to live in a modern society is to some degree accept this compromise and to accept one's own hypocrisy -- and my heart goes out to you as much as to myself because the existential anguish of this never goes away. Nonetheless, it is generally historically accepted that as we grow older, we are to develop the ability to constructively manage around this and gain the wisdom to do so.

Perhaps the reason you sarcastically told GP "hope you sleep well" is that you are projecting and you know that you yourself do not. My advice to you and those of your ilk: stop obsessing over theoretical purity and get on with enjoying your life in the scarce time you have left on this planet. You'll have far fewer regrets at the end. And you will finally sleep well.


I'd rather work on building alternatives that respect their users than pay corporations to abuse me, thanks.


I see such attempts as well-intentioned, and while benign, ultimately ineffective. Unless you find a way to make the right way the convenient way, social convergence will be a wind always pushing against you rather than at your back.


I'm usually the first to defend the EFF but I agree that they've gone a bit far here. The anonymization script which I wrote for my company just replaces every string in the customer's database with a cryptographic hash--except a list of strings like "failed" and "success".

So unless your city is named "success", it's going to be missing from the dataset.

It's a bit bewildering to actually run the app in this mode, but there's a lot of diagnostically relevant information you can get out of a database like that. You can even confirm bug fixes. Meanwhile, somebody looking to harm the user would need to already know quite a lot about that user before they could make any use out of such a thing.
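For concreteness, a minimal sketch of that kind of scrub script (the row values and allowlist here are made up, and per a later reply the real hashes are salted):

    import hashlib
    import secrets

    ALLOWLIST = {"failed", "success"}     # status strings left readable
    SALT = secrets.token_hex(16)          # one salt per scrubbed copy

    def scrub(value):
        # Same input -> same token within this copy, so keys still join up,
        # but the original strings are not present in the output.
        if value in ALLOWLIST:
            return value
        return hashlib.sha256((SALT + value).encode()).hexdigest()

    row = {"status": "failed", "city": "Chicago", "name": "Jane Doe"}
    print({k: scrub(v) for k, v in row.items()})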

Good faith efforts to protect user privacy exist; it is not helpful to lump them in with the rest like this:

> Sometimes companies say our personal data is “anonymized,” implying a one-way ratchet where it can never be dis-aggregated and re-identified. But this is not possible—anonymous data rarely stays this way.

Is it not possible, or not common?

People need to be informed about how to apply scrutiny to anonymization techniques, not scared into assuming ill intent when they see one.


> replaces every string in the customer's database with a cryptographic hash

> So unless your city is named "success", it's going to be missing from the dataset.

You've made the typical mistake of thinking that just because you've made it harder, you've also made it anonymous.

For instance, you might have replaced "Chicago" with 9cfa1e69f507d007a516eb3e9f5074e2, but if for instance a lot of people with that tokenised city name also have transactions for a store that only exists in Chicago, you can infer with a reasonable degree of accuracy the reverse mapping. A couple such data points, and you can be almost certain.

If you know when you made a couple of transactions that are in the database, you can find a set of likely candidates for each one and then see what fields they have in common. Once you know your data, you can infer a whole load of the reverse mappings for various fields.

If those transactions involve another user, you can start to correlate those mappings with data you know about them, and start to build up a web of transactions that person had, even if you don't necessarily know yet who the other users they interact with are.

All this is possible without the use of rainbow tables, but chances are your hash function is a standard one, so with a single known mapping you can work out which hash function you chose, and look for speculative entries in the data. e.g. let's just look for MD5("Chicago"), SHA1("Chicago"), SHA256("Chicago") and see if there are any matches. If there are, we can use that hash function and trivially create a rainbow table for every city in the US, first names, surnames, etc.
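As a toy illustration of that last step (made-up values): once you have one suspected plaintext/token pair, identifying the hash function and then reversing every low-entropy column is only a few lines.

    import hashlib

    def reverse_unsalted(tokens, wordlist, algos=("md5", "sha1", "sha256")):
        # Try each common hash over the wordlist; whichever one produces
        # matches is the function the "anonymizer" used.
        for algo in algos:
            table = {hashlib.new(algo, w.encode()).hexdigest(): w for w in wordlist}
            hits = {t: table[t] for t in tokens if t in table}
            if hits:
                return algo, hits
        return None, {}

    # Simulate a column that was "anonymized" with plain SHA-256.
    tokens = [hashlib.sha256(c.encode()).hexdigest() for c in ("Chicago", "Houston")]
    wordlist = ["New York", "Los Angeles", "Chicago", "Houston"]  # in practice: every US city
    print(reverse_unsalted(tokens, wordlist))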


I think of this as a graph of interconnected data and metadata. Unless the entire graph is anonymized, it's not really anonymized.

Deducing relationships between metadata elements (the city field and the purchase store) ends up being the tricky part, and it is highly domain specific. Hashes with salts make it a bit harder too.


Let’s at least grant the benefit of the doubt, that the poster knew to salt the hash. We can take it as given that incompetence makes for poor anonymization.

Your example of “transactions in Chicago” is much more salient; there’s clearly a cat-and-mouse dynamic where data can be de-anonymized, especially if the dataset is public. How much that will actually be possible will be specific to the data in question; but the risk is non-zero. There’s certainly a case that no amount of obfuscation is sufficient if a user has not explicitly consented to their data being used this way.


> Let’s at least grant the benefit of the doubt, that the poster knew to salt the hash. We can take it as given that incompetence makes for poor anonymization.

Actually let's not, because adding salting doesn't actually get it right either. Rather it's just another easily-broken system that is only good enough to fool its own designer. It's still trivial to run through a list of the most common city names, and recover nearly all of the entries. And if there is just a single "salt" per DB, which would be necessary for the apparent requirement that matching city names stay matching, even cycling through all combinations of letters is nearly practical. There just isn't enough starting entropy to make hashing meaningful.


You've answered your own questions and kind of defeated your own point here:

> Meanwhile, somebody looking to harm the user would need to already know quite a lot about that user before they could make any use out of such a thing.

The one thing I see EFF did wrong in this article is, they picked a bad quote to start with. That line from Matt Blaze, it's almost a pure tautology. There's a better line (though I don't know who to credit with it - I picked it up on HN some time ago): there is no such thing as "anonymized data", there's only "anonymized until correlated with enough other datasets".

You say "somebody" would need to already know a lot about the user to make use of your database. But from my perspective - perspective of the user - that "somebody" could just as well be your company. Or your customer buying that data from you... and buying similar data from other companies.


> You say "somebody" would need to already know a lot about the user to make use of your database. But from my perspective - perspective of the user - that "somebody" could just as well be your company

Well yes, noticing problems in these databases and providing consulting towards their mitigation (ideally before the pain shows up) is more or less what we're selling. Some degree of analysis must remain possible because otherwise it would be unclear what the user is paying us for.

My point is just that different situations call for different privacy postures. "Anonymization = Untrustworthy" glosses over this in a way that doesn't help us find ways to improve the situation.


> My point is just that different situations call for different privacy postures.

True. Privacy is a special case of security, and it's only useful to talk about security in terms of what threats are of concern.

> "Anonymization = Untrustworthy" glosses over this in a way that doesn't help us find ways to improve the situation.

If your concern is to be as anonymous as possible, then "Anonymization = untrustworthy" is not an unreasonable stance. The only way I know of that data can be collected and handled in an anonymous way is to aggregate the collected data immediately and discard all individualized data.
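A minimal sketch of what "aggregate immediately, discard everything individual" can look like (the event fields are hypothetical): nothing per-user is ever written down, only running totals.

    from collections import Counter

    page_views = Counter()   # the only thing ever persisted

    def ingest(event):
        # Fold the event into an aggregate; user_id, IP, timestamp, etc.
        # are discarded here and never stored.
        page_views[event["page"]] += 1

    ingest({"user_id": "u123", "ip": "203.0.113.7", "page": "/pricing"})
    ingest({"user_id": "u456", "ip": "198.51.100.2", "page": "/pricing"})
    print(page_views)   # Counter({'/pricing': 2})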

Replacing data with something like a hash is useful in many ways, but there are many things it isn't that useful with.


If their goal is to be as anonymous as possible, why did they give you the data in the first place? If they're aware of your practices and still in the loop, then presumably their relationship with you is such that some degree of trade-off is acceptable.

I think the better focus for the article would be on transparency. I didn't write the script because I was conforming to some well defined policy about what's in bounds and what's out. I wrote it because somewhere in my heart I felt a vague duty to improve the situation. I'm getting a little pushback on my defense of the script--which is fair--but better would be to scrutinize the policies under which I had to ask my gut about it in the first place.


> If their goal is to be anonymous as possible, why did they give you the data in the first place?

Often, it's because there is no other choice. When I'm revealing personal data to companies and such, that's usually why. Very few companies that I'm aware of are actually trustworthy on these matters, but I can't live without doing business with many of them regardless.

I absolutely do decide to reveal data to some companies that are optional, though. Like with all security matters, there's a cost/benefit calculation to be done here.

> I wrote it because somewhere in my heart I felt a vague duty to improve the situation.

Which is entirely laudable, and I think a worthwhile activity. It is, however, an effort that mitigates some risk rather than making things safe. People like the EFF are focused more on the goal of safety. It's almost certainly not a goal that can be fully achieved, but I'm glad that they're shooting that high anyway.


I understand your point and I agree with it within its scope, however I want to reinforce the point often missed by data handlers: you're talking about tradeoffs involved in doing the thing, whereas we (the users) are saying, don't do the thing. Anonymization may be a spectrum of trade-offs, but for the side EFF represents here, the two issues are: 1) that companies do the thing that makes anonymization a meaningful concept in the first place, and then 2) use one end of the spectrum to sell this to the public, while actually sitting on the opposite end of that spectrum.


These scripts always have the best intentions but typically end up neglected and suddenly production data is on a dev laptop (at least, in my experience, both as an employee and a customer).

Make mock data inputs or snapshot data based on internal/QA usage, please!


> replaces every string in the customer's database with a cryptographic hash [..] So unless your city is named "success", it's going to be missing from the dataset.

I'm not sure I understand. Wouldn't it be trivial to hash the name of every known city, thereby reversing your cryptographic hash? As is done for passwords, only with a much smaller space of possibilities.


They are salted hashes. Also, I chose city out of thin air. Realistically the column names are generic, "output", "input" things like that. So you'd also have to guess that this particular user is putting city data in their outputs before you'd be able to run the "known cities" attack.


Yes, password cracking works on salted hashes too. And the way you now describe it, with the entries in every column being hashed, and the column names themselves being meaningless*, I'm confused what value such a database has for anyone. They're just columns of random data, and the analyst doesn't even know what the data represents?

*And presumably, somehow, there being no way to recover this meaning. E.g. if an attacker is reversing hashes, they could try several sources of guesses - city names, country names, common first and last names in several languages, heights and weights in metric and imperial in the human range, valid phone numbers, IPv4 addresses, dates, blood type, license plate and post numbers or entire addresses, e-mail addresses if the attacker has a list of known-valid emails... all of these, and I'm sure many others that I forgot, are sufficiently small that they can be brute-force guessed by simply trying all possibilities, so no matter how generic the name of a column is, if it contains any of this data, the hashing can be reversed.


This conversation has me thinking that I could go a step further and replace the hashes with only enough characters to make them unique:

> Baltimore=>0x01a1...1ef7=>01a

> Aaron=>0x17bf...86f1=>17bf

>{"Foo":”Bar"}=>0x19f4...9af2"=>19f

This would create lots of false positives if one attempted to reverse it, but it's still a 1:1 map, so data which satisfies unique and foreign key constraints would still do so after the transformation. As for the utility, it makes it so that you can run the app and see what the user sees--except their data is hidden from you. So suppose the user's complaint is:

> two weeks ago, workflow xyz stopped reaching its final state, but never explicitly changed status to 'failed' so we weren't alerted to the problem.

I can hash "xyz" and get its handle and then go clicking around in the app to see what stands out about it compared to its neighbors. Perhaps there's some should-never-happen scenario, like an input that is declared but has no value, or a cyclic dependency.

I don't need to know the actual names of the depended-upon things to identify a cycle in them. The bugs are typically structural; it's not very common that things break only when the string says "Baltimore" and not when it says "1ef7".

Privacy wise, the goal is to be as supportive as possible without having to ask them to share their screen or create a user for us so we can poke around. And when it's a bug, to have something to test with so we can say "It's fixed now." instead of "Try it again, is it fixed now? How about now?"

I'm the third party here, and I'm trying to prevent my users from sharing their users' data with me. Maybe it's not strictly "anonymization" because I'm not sure that this data references people. Remaining unsure is kind of the point.
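A rough sketch of the unique-prefix idea (simplified to a single shared prefix length rather than per-value lengths, and assuming the mapping is built in one pass so collisions can be detected):

    import hashlib

    def unique_prefixes(values):
        # Map each distinct value to the shortest hash prefix that is unique
        # within this dataset; the result stays a 1:1 map, so unique and
        # foreign key constraints keep holding after the transformation.
        digests = {v: hashlib.sha256(v.encode()).hexdigest() for v in set(values)}
        for length in range(3, 65):
            prefixes = {v: d[:length] for v, d in digests.items()}
            if len(set(prefixes.values())) == len(prefixes):
                return prefixes
        return digests

    print(unique_prefixes(["Baltimore", "Aaron", '{"Foo":"Bar"}', "xyz"]))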


But remember rule #1 of data security: if there is a way to access the data legitimately, there is a way to access it illegitimately. The only question is the amount of time and effort required to do so.


> So unless your city is named "success", it's going to be missing from the dataset.

https://geotargit.com/called.php?qcity=Success

> There are 11 places in the world named Success!

Neglecting corner cases like this is common in handcrafted, "well-intentioned" approaches. I put well-intentioned in quotes because assuming you know all the issues, and premature confidence in the solution, demonstrate a lack of intention to actually get it right. Hubris permeates the mindscape of move-fast-break-things types.

Sometimes the failures are more obvious such as those pointed out by sibling comments. Part of the EFF's general point is it's a complicated business with so many pitfalls that it's a shame we have to learn by trial and error when people's private data is on the line. There's a reason many large companies have entire departments exclusively focused on privacy in their data stores.


> That is usually referred to as "pseudonymization"

For your world, maybe. For most things with a public face, "anonymization" is what's used, not "pseudonymization". So yes, this article is referring to its more-common use.



> *There is always a tradeoff between privacy and utility. The only way to achieve 100% private data is 100% noise, but the privacy-utility tradeoff curve isn't linear, and you can still achieve very good utility and very good privacy in many cases, especially with the best tools. Methods are also improving over time, reducing the impact of the tradeoff.

This is the crux of the problem. In every situation where I've seen it deployed, "anonymization" is as far up the utility curve as the corporate interest can get while still credibly calling their data anonymized. The standards and rules that are in place for restricting what types of data can be called "anonymized" to the public are so far up the utility curve that there's functionally no privacy at all.

I don't know what your startup does or what steps it takes to make that tradeoff responsibly, but assuming that you are responsible, you are one of the few, and most of the customers in the world would rather use a client that is using the word "anonymous" more irresponsibly because:

1. they can get away with it

2. there's a lot more business value in obfuscating data as little as possible

The EFF article is bang-on here, because the majority of the "anonymization" in the space isn't being done by responsible folks (like yourself, presumably), it's being done by psychopathic business interests.


My experience is no different and in fact it’s hard to convince anyone to treat indirect identifiers at all. Everyone wants the bare minimum, which is unfortunate, but understandable for someone trying to run a business. All of our current customers are only using their data internally and anonymizing to protect identities as that data is shared across other internal groups, mostly for testing and analysis. Even protecting only direct identifiers is far more responsible than the vast majority of businesses, though we push for indirect treatment.

If you treat indirect identifiers with our system, the minimum level of anonymization would ensure that there are always a minimum of two matching combinations of indirect identifiers, i.e. k=2, for every record in the dataset. This is far from foolproof for preventing data leakage, but it does prevent singling out and provides a lot of protection in the case of breach.
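For anyone unfamiliar with the k terminology, the check itself is simple; a toy version with hypothetical columns:

    import pandas as pd

    def min_k(df, quasi_identifiers):
        # k is the size of the smallest group sharing a combination of
        # quasi-identifiers; k >= 2 means no record can be singled out by
        # those columns alone.
        return int(df.groupby(quasi_identifiers).size().min())

    df = pd.DataFrame({
        "zip":       ["60614", "60614", "60615", "60615"],
        "age":       [34, 34, 71, 71],
        "gender":    ["F", "F", "M", "M"],
        "diagnosis": ["flu", "asthma", "flu", "copd"],   # sensitive attribute
    })
    print(min_k(df, ["zip", "age", "gender"]))   # 2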


The article suggests but does not explicitly describe adding noise, but it doesn't matter because the conclusion is the same:

> There is always a tradeoff between privacy and utility. The only way to achieve 100% private data is 100% noise

The point is that the people whose data is being "protected" rarely have a say that their data is in the data set in the first place, and to what level their data is noised when it is.


Sorry to break the news to you, but your employer is probably selling snake oil. And this is exactly the point of TFA:

> However, in practice, any attempt at de-identification requires removal not only of your identifiable information, but also of information that can identify you when considered in combination with other information known about you.

So, anonymization may be well intended and sophisticated, but is only effective as long as you control and anonymize all the data available to your adversary. Which you obviously cannot. The best you can achieve with anonymized data is to allow non-malicious users to run some analysis on your data in a controlled environment without exposing them to information they should not see. If that's your employer's business model, well, there may be a market niche for that.


I designed our anonymization systems. You clearly have a misunderstanding of the threat model and the treatment methods.


Of course I have a misunderstanding. You only shared some very vague statements. Could you share some insight on your threat model and your envisioned use case?

TFA is obviously referring to the general case, where you can't really put arbitrary restrictions on the adversary.


> More sophisticated methods of privacy also anonymize indirect identifiers, and in some cases personal attributes.

The original datasets are brokered anyway. [citations disclosed]


Pseudonymization is creating a new random identifier and keeping a linking table from those new identifiers to the original identifiers.

Pseudonymized data is for all intents and purposes completely anonymous as long as the people you share the data with cannot access the linking table.

That constraint can be enforced technically, contractually, or legally, depending on how important the breach would be.
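That is, something roughly this shape (sketch; in practice the linking table lives in a separate, more tightly controlled system):

    import uuid

    linking_table = {}   # token <-> identity mapping, kept under stricter access

    def pseudonymize(identifier):
        # Hand out a stable random token per identifier; only someone with
        # the linking table can go back from token to person.
        if identifier not in linking_table:
            linking_table[identifier] = str(uuid.uuid4())
        return linking_table[identifier]

    shared_record = {"customer": pseudonymize("jane@example.com"), "purchase": "laptop"}
    print(shared_record)      # what gets shared
    print(linking_table)      # what never leaves the controller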


It's pseudonymization regardless of the replacement data format. It can be a token, a format-consistent value, e.g. "john.smith@example.com", redacted, etc. All of that is considered pseudonymous.

Pseudonymized data is not completely anonymous. For example, your name, address, date of birth, and other identifiable information are likely available in public datasets like property tax records or voter registration data. There are also mostly-legal datasets available for purchase with more complete demographic and personal information. If your birthday, gender, and zipcode are present in the pseudonymized dataset, I have an ~87% chance of succeeding in a linkage attack matching you in one of those public datasets.
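Mechanically, that linkage attack is just a join on the quasi-identifiers; a toy version with made-up data:

    import pandas as pd

    # "Pseudonymized" release: names tokenized, quasi-identifiers left intact.
    released = pd.DataFrame({
        "token":     ["a91f", "b22c"],
        "birthdate": ["1985-03-14", "1990-07-02"],
        "gender":    ["F", "M"],
        "zip":       ["60614", "73301"],
        "diagnosis": ["asthma", "flu"],
    })

    # Public or purchasable data: voter rolls, property records, etc.
    public = pd.DataFrame({
        "name":      ["Jane Doe", "John Roe"],
        "birthdate": ["1985-03-14", "1990-07-02"],
        "gender":    ["F", "M"],
        "zip":       ["60614", "73301"],
    })

    # Where the (birthdate, gender, zip) combination is unique, the join
    # attaches a name to the "anonymous" diagnosis.
    print(released.merge(public, on=["birthdate", "gender", "zip"]))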


For the curious: more about the 87% figure, with a link to the source paper: https://www.johndcook.com/blog/2018/12/07/simulating-zipcode.... (Not the author, just a fan of John D. Cook's blog.)


If your birthday, gender, and zip code are present in the dataset then it is not pseudonymized.

By definition in any modern privacy specification, the only difference between anonymous and pseudonymous is that you can recover the true identity.

I believe a decade-plus ago "pseudonymized" meant simply making something less identifiable, but not anymore.


That's not how the term is typically used. Most people will assume you have only protected direct identifiers if you tell them data has been pseudonymized.

https://csrc.nist.gov/glossary/term/pseudonymization


https://en.m.wikipedia.org/wiki/Pseudonymization

See section about the new definition under GDPR.

Like I said, modern privacy standards. The US is still quite far behind: I've had to do HIPAA training and it shows.

I'd advise you to set your bar a lot higher than your national standards if you ever plan to do international products: nobody in the US will complain if you use stricter definitions but you'll instantly be rejected in Europe if you call that pseudonymization.


What bar are you talking about? We use the state of the art to treat indirect identifiers without pseudonymization, but we pseudonymize as needed if that’s overkill. The bar is set very high for our service. Sadly, very few companies are actually interested in treating indirect identifiers or consider anything besides direct identifiers a problem.

I’m familiar with GDPR, we work with EU companies, and all the ones we work with use the original definition. NIST link aside, these terms are hardly standards anyway. They are nearly colloquial vocabulary, which unfortunately in this space I expect to remain imprecise and vague. This is why we generally only use it in marketing and comms, while in the actual product we drill down into specifications for direct and indirect identifiers, distortion, and risk.


Many red-teaming attacks don't rely on the identifiers at all, they are essentially "duck typing" the entities in the data model. Same approach is commonly used for dealing with data where a single identity may have multiple pseudonymized identifiers.

There are a bunch of regulatory box-checking exercises like this that don't actually provide anonymization if the attacker is sophisticated.


Great article putting all the relevant content in one place. Does anyone know of any de-anonymization services? The startup I am working at is privacy focused and we are looking for a way to demonstrate why you need an additional layer to protect and compartmentalize.

Short of us buying up data in bulk and then doing the de-anonymization in-house I am not seeing an easy way to do this. Or even an advertised partner, seems like all the articles are really careful to not do free marketing for companies in this space.


There are a small handful of us out there, targeting slightly different but overlapping use cases, but this is me https://www.privacydynamics.io. Happy to answer any questions about it.


This is the same problem as white-hat security: you need people who know an awful lot about how bad guys work but are good guys. Security has developed a legitimate market over the years, but that took time.

You could try to ask independent consultant(s) who you trust to work on that problem and let them grow that practice. The problem is that they would need to have access to marketing platforms.


I just realized you wrote “de-anonymization”. We don’t do that, but we do a risk analysis that simulates an attack, making some pessimistic assumptions about what data an attacker may have for linkage, to estimate how risky data is to share, essentially as a benchmark. TLDR; more unique combinations of quasi-identifiers make data more susceptible to a linkage attack. We also pitch our system as a risk analysis only tool, though no one is using it for that currently. If you’re interested in methods, I’d recommend El Emam’s book, “Guide to the De-Identification of Personal Health Information”, which covers a lot in detail. Our attack simulation is based on his work.


A good popular take, but they, either intentionally or out of ignorance, omit newer, proven techniques like differential privacy.


Still sounds potentially problematic. Per wikipedia: "Differential privacy provides a quantified measure of privacy loss and an upper bound and allows curators to choose the explicit trade-off between privacy and accuracy. It is robust to still unknown privacy attacks. However, it encourages greater data sharing, which if done poorly, increases privacy risk. Differential privacy implies that privacy is protected, but this depends very much on the privacy loss parameter chosen and may instead lead to a false sense of security. Finally, though it is robust against unforeseen future privacy attacks, a countermeasure may be devised that we cannot predict."

If I am dependent on the curator to determine the level of privacy then I've already lost.


> …it encourages greater data sharing, which if done poorly, increases privacy risk.

This is a really useful argument, because it’s the equivalent of the FDA’s “generally believed to be safe”. If you look into something and this is the risk you find, then it’s safe.


GRAS. Btw.


I agree. DP is better than no privacy at all, but it certainly doesn't make me feel comfortable.


You are already dependent on a set of curators to not simply outright lie and export the data as captured. This take lacks subtlety; unless you are going to abandon the set of functionality ("where is the best fried chicken near me?") that this sort of metadata facilitates, you need to make decisions about which curators you trust, and then participate in driving them to honesty and accountability.

From my perspective this article leans too heavily into the FUD, and really doesn't succeed at keeping the call to action ticking over. On a good day, the EFF can be really good. Today, not so much so.


The problem with the EFF is that it’s full of people who made up their minds about what’s ok and what isn’t over a decade ago, and they are largely just playing the hits now. No new material, and no consideration that things might change.


Differential privacy does not anonymize your data. It's a (mathematically solid) instrument to make it coarse enough to only give up the part you want.

The EFF is completely correct in their assessment - there's no way to make your data anonymous and usable by the third party at the same time. The root issue lies elsewhere.

There's FHE, which does allow your data to be processed without accessing it, but it has prohibitive compute requirements and isn't practical for most purposes. I'm not sure if it allows the data to be usable enough without revealing it, either.


> there's no way to make your data anonymous and usable by the third party at the same time.

This is absurd hyperbole, here’s a trivial proof. The US population is 51.1% female. If you live in the US your gender information is included in this data point, but it cannot be de-anonymized. It is however useful for determining TAM for many products.

Anonymity is a spectrum, and there are many points along that spectrum which are solutions to usable data which also respects privacy.


Privacy is a spectrum, anonymity is not.


Techniques like differential privacy do not work for some types of data models, including many of the more interesting/risky ones. I've never seen a technique that can deliver an analytical model at scale that is both analytically effective and anonymous while also robust against sophisticated de-anonymization attacks. There is no good theoretical foundation to suggest that such things are possible.

Most modern techniques for ensuring anonymization make assumptions that won't constrain sophisticated blackhats. They are good policy in a legal ass-covering sense and increase the cost required to de-anonymize but that is about it.


Can you give an example? Global DP systems generally offer excellent privacy protection for the analytical distortion cost. Global DP is often not implemented simply due to workflow inconveniences but otherwise it’s great for analytical use cases.


If your scale is large enough and you don’t care about identifying individuals, synthetic data does this fairly effectively.


Or they don't want you collecting data in the first place, in which case the techniques are irrelevant.


Entirely this. That data is so widely collected is the root problem.


In particular areas things like differential privacy may work, for example medical sets where there are lots of regulations and potential fines for the companies involved.

But do you think your average ad tech company gives a fuck? They are going to keep the original data because that's where the money is. Yea, maybe they'll have privacy datasets they sell/release to other groups, but all the real data will remain in a database, and with most companies in this industry, be given to government agencies on demand.


It's not because of the money. Most could make money just fine with properly anonymized data. They just don't care and/or don't want to take the effort or spend money to do it because the consequences for leaking private info are so minimal.


There are also Zero Knowledge Proofs, which instead have a privacy guarantee.

Edit: Not trying to start a DP vs ZKP (or Homomorphic Encryption) flame war. We're all on the same side of privacy, right? Different tools for different situations.


"De-identified data isn't. Either it's not de-identified, or it's not data anymore."

-- Cynthia Dwork, one of the inventors of Differential Privacy, Knuth Prize and Gödel Prize winner.

https://youtu.be/RWpG0ag6j9c?feature=shared&t=274


GDPR specifically mentions pseudonymous data in Recital 26:

"The principles of data protection should apply to any information concerning an identified or identifiable natural person. Personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person. To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly."

Under EU law, there is no special class of "personally identifying information". Any data that relates to a person or could be related to a person is protected. It isn't enough to just strip the name and SSN field out of the database and call it anonymous, you need to demonstrate that the data couldn't be attributed to anyone through any reasonably practical process.

https://gdpr-info.eu/recitals/no-26/


The crucial piece in your link is:

> To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments.

This has been interpreted that any objectively possible way to de-anonymize is reasonable. In particular, considering that all the data of the organization doing the anonymization can be used to try to de-anonymize.


I think they're trying to keep cryptographically secure data out of scope.


That broad definition opens itself to so much… Even raw movie rating data was famously partially de-anonymized.

I’m not sure that I could imagine a detailed database with personal activity at a reasonable scale that couldn’t be de-anonymized, at least partly, if one assumes that people use related services and that one has access to them via public social media, mostly.

- Emails: of course;

- social media: easy;

- search queries: LOL;

- transport: through commute;

- any cultural good: you can connect to contemporary commentary;

- fashion shopping: pretty much trivial with OOTD posts;

- medical records: not everyone, but most major diagnoses could be matched to social media for many people;

- https blobs via VPN: most people would connect to the same four sites again and again, but exact timestamps can be correlated with public statements on social media;

- grocery shopping: harder, but doable as you’d have neighborhood from the store address and a lot of surface…

Is the letter of that law against pseudonymous databases?


Is it funny that it is so broad? People don't want their information leaked! Why is this so hard to understand?

It makes your job harder? Tough shit. It makes law enforcement tougher to do? Ohhhh nooo, they might have to work for a living instead of pushing buttons.

Stop trying to defend companies that do this in service of capital. It is heinous. Some, if not most of us, want to be left alone and not have our addresses and medical history and YouTube stats available for the whole world.

There is no such thing as an anonymous dataset. It has been proven again, and again.


No need to be snarky or insulting: I was trying to ask for a sincere assessment of that law.

> do this in service of capital

Sir, this is ycombinator.com

If the intent is that every database with an individual-level breakdown falls under GDPR PII protections, I presume that far more processes and declarations would have to be applied to circumstances where there never was an intent or a credible option to de-anonymize the data. This is not how it is enforced, understood, or applied today.

The large companies that you criticize so readily would have no issue automating the paperwork or building hashing solutions that divert the problem, but their privacy-respecting competitors would fall under a lot more paperwork and legal risk than they could handle.

Thankfully, there are solutions: a lot of people are now handling internal data processes with the same open-source tool, dbt. That platform could help change standards if they knew current patterns do not respect the letter of the law. But their lawyers seem to think otherwise.


That is almost verbatim the definition of PII from Wikipedia and from OMB Memorandum M-07-16 (the US government definition of PII).


I like to make an appearance in threads in this domain and just say, yeah, it's not just possible but "fun" to put the puzzle pieces together.

It is a very tough problem to solve. Especially when you consider the richness of the datasets you use to put the pieces together.

In my experience the only effective means is to poison the data, in addition to the common sense steps mentioned here.

Poisoning a dataset means seeding a wide variety of the datasets you use to discover PII with fictional look alikes that resist debunking.

Additionally you can poison the core set if you are very clever about it.


Anonymization of personal data is a very tricky thing. It is so tricky that the GDPR doesn't even really mention it. I don't even know of any official SOP for doing it.

Of course k-anonymity and differential privacy can help you with a closed data set, but once you add records to a dataset over time everything breaks down.

I once tried to find a way to anonymize data on different clients and then connect the records of the same entity on the server. The main problem with this method was that one could artificially generate a new non-identifying data point for an individual and then see which record it ended up in in the central database.
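For context, one common shape of such a scheme is a keyed hash computed on each client so the server can link records without ever seeing the raw identifier (a sketch, key distribution handwaved), which also makes the probing attack described above easy to picture:

    import hashlib
    import hmac

    SHARED_KEY = b"somehow distributed to every client"   # the weak point

    def pseudonym(identifier):
        # Deterministic keyed hash: same person -> same pseudonym on every
        # client, so the server can join records without the raw identifier.
        return hmac.new(SHARED_KEY, identifier.encode(), hashlib.sha256).hexdigest()

    # Anyone able to submit data can probe: generate a data point for a guessed
    # identifier and watch which existing record it lands in.
    print(pseudonym("jane@example.com"))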


More about why location information IS personal information: https://consciousdigital.org/location-data-is-personally-ide...


I don't think humanity is going to get around disclosing how exactly all that data flows into marketing, UX design and policies. Not in the form of pop-science books or blockbuster documentaries but detailed statistics and open reports by the companies themselves.


There's anonymizing data so that creepy guy two floors down can't stalk the barista across the street, and there's anonymizing data so that you can publish it to the world.

Barely even comparable.


As others pointed out, the article mixes a lot of things together. The EU (GDPR) has a very specific and very hard-to-meet anonymization bar (tldr: it requires anonymization at a level where it's mathematically improbable to de-anonymize the user). None of the "anonymization" examples in the article would pass this EU bar.


Actually the GDPR "just" requires protecting the data against "reasonably likely" attacks.

« To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly.

To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments. »

https://www.privacy-regulation.eu/en/r26.htm


And yet this does not prevent anyone from doing it in the EU.


I can’t speak for everyone, but the company I work for takes this very seriously and does its best to comply with the law. That said, all of it is very complicated. Most (if not all) privacy problems I observed so far are caused not by malice but by incompetence. And those get fixed as soon as found.


Data is never anonymous. If you get enough of it, you can correlate who is who and turn it into information on the user.


Every time I hear "anonymous data", I think of that time AOL published anonymized search logs (for academic research). The anonymization was negligent, and an NYT reporter de-anonymized and tracked down one of the users with the local & personal info present in the search queries.

https://en.wikipedia.org/wiki/AOL_search_log_release

https://web.archive.org/web/20130404175032/http://www.nytime...


Another fun one was the Netflix Prize, where Netflix published an anonymized dataset, but some researchers at UT Austin were able to de-identify many/most of the users in the dataset by linking them to IMDB profiles based on preferences.


There's no way most of the users in the Netflix data had IMDb accounts.


From a skim, it looks like their main point is that with 8 movie ratings, it's trivial to link two datasets (eg anonymized Netflix rating to public IMDb profile).

https://scholar.google.com/citations?view_op=view_citation&h...

https://systems.cs.columbia.edu/private-systems-class/papers...
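The core of the linkage idea fits in a few lines: score every public profile by how many (movie, rating) points it shares with the anonymized record, and a handful of overlaps is usually enough to single one profile out (the real attack also uses approximate dates and tolerates fuzzy matches). Toy version with made-up data:

    # Anonymized Netflix-style record: a set of (movie, rating) pairs.
    anon_record = {("Movie A", 5), ("Movie B", 2), ("Movie C", 4), ("Movie D", 1)}

    public_profiles = {
        "imdb_user_1": {("Movie A", 5), ("Movie B", 2), ("Movie C", 4), ("Movie X", 3)},
        "imdb_user_2": {("Movie A", 4), ("Movie Y", 5)},
    }

    # Count overlapping pairs; the best-scoring profile is the candidate match.
    scores = {user: len(anon_record & ratings) for user, ratings in public_profiles.items()}
    print(max(scores, key=scores.get), scores)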


I have thoughts like that all the time but then I remember how many people have accounts at xyz at all and how often I created an account on abc, made a list or two and never returned (until I did). I have over 600 logins saved in my password manager and "holy fuck shit" some of the data on these sites is still "part of me", I just don't actively spend time on it. It's all quite usable.


Thank you, that was my first thought. There's hyperbole and then there's whatever that claim was. I don't want to call it a lie, so let's just call it a gross exaggeration.

Many I would have been fine with. Most? No way.


Reminds me of the telco offer to sell GPS location traces on anyone you wanted, but to ensure 'anonymity' you had to order a batch of 30 people minimum, and you did not get information on which trace belonged to whom in that set.

Figuring out which of those traces belonged to which employee was a real puzzle /s



