I'm usually the first to defend the EFF but I agree that they've gone a bit far here. The anonymization script which I wrote for my company just replaces every string in the customer's database with a cryptographic hash--except a list of strings like "failed" and "success".
So unless your city is named "success", it's going to be missing from the dataset.
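For a sense of the shape of it (not the literal script -- the hash choice, column handling, and the abbreviated whitelist here are just illustrative):

    import hashlib

    # Status strings that keep their literal value so the app still behaves sensibly.
    WHITELIST = {"failed", "success"}

    def anonymize(value):
        # Replace any string not on the whitelist with a hex digest; leave non-strings alone.
        if not isinstance(value, str) or value in WHITELIST:
            return value
        return hashlib.sha256(value.encode("utf-8")).hexdigest()

    def anonymize_rows(rows):
        # rows: list of dicts, e.g. straight out of a DB cursor.
        return [{col: anonymize(val) for col, val in row.items()} for row in rows]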
It's a bit bewildering to actually run the app in this mode, but there's a lot of diagnostically relevant information you can get out of a database like that. You can even confirm bug fixes. Meanwhile, somebody looking to harm the user would need to already know quite a lot about that user before they could make any use out of such a thing.
Good faith efforts to protect user privacy exist; it's not helpful to lump them in with the rest like this:
> Sometimes companies say our personal data is “anonymized,” implying a one-way ratchet where it can never be dis-aggregated and re-identified. But this is not possible—anonymous data rarely stays this way.
Is it not possible, or not common?
People need to be informed about how to apply scrutiny to anonymization techniques, not scared into assuming ill intent when they see one.
> replaces every string in the customer's database with a cryptographic hash
> So unless your city is named "success", it's going to be missing from the dataset.
You've made the typical mistake of thinking that just because you've made it harder, you've also made it anonymous.
For instance, you might have replaced "Chicago" with 9cfa1e69f507d007a516eb3e9f5074e2, but if a lot of people with that tokenised city name also have transactions for a store that only exists in Chicago, you can infer the reverse mapping with a reasonable degree of accuracy. A couple of such data points, and you can be almost certain.
If you know when you made a couple of transactions that are in the database, you can find a set of likely candidates for each one and then see what fields they have in common. Once you've identified your own data, you can infer a whole load of the reverse mappings for various fields.
If those transactions involve another user, you can start to correlate those mappings with data you know about them, and build up a web of the transactions that person had, even if you don't necessarily know who those other users are yet.
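To make that first inference concrete, here's roughly what it looks like in code (the column names, store name, and store-location data are all invented for illustration):

    from collections import Counter

    # Hypothetical "anonymized" rows; pretend the store field is known or was
    # recovered some other way, while the city is still just a token.
    rows = [
        {"city_token": "9cfa1e69f507d007a516eb3e9f5074e2", "store": "Chicago-Only Deli #3"},
        # ... thousands more ...
    ]

    # Public knowledge: which cities each store actually operates in.
    store_locations = {"Chicago-Only Deli #3": {"Chicago"}}

    # For each city token, count which real cities its transactions are consistent with.
    votes = {}
    for row in rows:
        for city in store_locations.get(row["store"], set()):
            votes.setdefault(row["city_token"], Counter())[city] += 1

    # A token whose votes are dominated by one city is, with high probability, that city.
    for token, counter in votes.items():
        city, count = counter.most_common(1)[0]
        print(f"{token} is probably {city} ({count} corroborating transactions)")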
All this is possible without the use of rainbow tables, but chances are your hash function is a standard one, so with a single known mapping you can work out which hash function you chose, and look for speculative entries in the data. e.g. let's just look for MD5("Chicago"), SHA1("Chicago"), SHA256("Chicago") and see if there are any matches. If there are, we can use that hash function and trivially create a rainbow table for every city in the US, first names, surnames, etc.
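That check is a few lines (a sketch assuming no per-value salt; the city list is truncated for brevity):

    import hashlib

    def digests(value):
        data = value.encode("utf-8")
        return {"md5": hashlib.md5(data).hexdigest(),
                "sha1": hashlib.sha1(data).hexdigest(),
                "sha256": hashlib.sha256(data).hexdigest()}

    observed_tokens = set()  # every hashed value pulled out of the shared dataset

    # Step 1: figure out which (unsalted) hash function was used.
    algo = next((name for name, d in digests("Chicago").items() if d in observed_tokens), None)

    # Step 2: with a hit, build a lookup table over any small, public domain.
    # (Strictly a lookup table rather than a rainbow table -- rainbow tables trade
    # memory for recomputation, but for domains this small a plain dict is enough.)
    if algo:
        us_cities = ["Chicago", "New York", "Los Angeles"]  # in reality, the full ~20k list
        hasher = getattr(hashlib, algo)
        reverse = {hasher(c.encode("utf-8")).hexdigest(): c for c in us_cities}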
I think of this as a graph of interconnected data and metadata. Unless the entire graph is anonymized, it's not really anonymized.
Deducing relationships between metadata elements (city field, and purchase store) ends up being the tricky part, and highly domain specific.
Hashes with salts make it a bit harder too.
Let’s at least grant the benefit of the doubt, that the poster knew to salt the hash. We can take it as given that incompetence makes for poor anonymization.
Your example of “transactions in Chicago” is much more salient; there’s clearly a cat-and-mouse dynamic where data can be de-anonymized, especially if the dataset is public. How much of that will actually be possible is specific to the data in question, but the risk is non-zero. There’s certainly a case that no amount of obfuscation is sufficient if a user has not explicitly consented to their data being used this way.
> Let’s at least grant the benefit of the doubt, that the poster knew to salt the hash. We can take it as given that incompetence makes for poor anonymization.
Actually let's not, because adding salting doesn't actually get it right either. Rather it's just another easily-broken system that is only good enough to fool its own designer. It's still trivial to run through a list of the most common city names, and recover nearly all of the entries. And if there is just a single "salt" per DB, which would be necessary for the apparent requirement that matching city names stay matching, even cycling through all combinations of letters is nearly practical. There just isn't enough starting entropy to make hashing meaningful.
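One way that plays out (a sketch -- the salt construction is assumed, and the known mapping could simply be the attacker's own record):

    import hashlib, itertools, string

    def token(salt, value):
        # Assumed construction: hex(SHA-256(salt + value)). The real script may differ,
        # but any deterministic, DB-wide construction shares the weakness.
        return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

    known_plain = "Chicago"          # something the attacker already knows about one record
    known_token = "<observed hash>"  # the token that value became in the shared data

    # Brute-force a short alphanumeric salt (lengths 0-4 shown; longer just costs CPU).
    alphabet = string.ascii_lowercase + string.digits
    salt = next((candidate
                 for n in range(5)
                 for candidate in ("".join(c) for c in itertools.product(alphabet, repeat=n))
                 if token(candidate, known_plain) == known_token), None)

    # With the salt recovered, every low-entropy column falls to a plain dictionary attack.
    if salt is not None:
        cities = ["Chicago", "New York", "Los Angeles"]  # the full list in practice
        reverse = {token(salt, c): c for c in cities}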
You've answered your own questions and kind of defeated your own point here:
> Meanwhile, somebody looking to harm the user would need to already know quite a lot about that user before they could make any use out of such a thing.
The one thing I see EFF did wrong in this article is, they picked a bad quote to start with. That line from Matt Blaze, it's almost a pure tautology. There's a better line (though I don't know who to credit with it - I picked it up on HN some time ago): there is no such thing as "anonymized data", there's only "anonymized until correlated with enough other datasets".
You say "somebody" would need to already know a lot about the user to make use of your database. But from my perspective - perspective of the user - that "somebody" could just as well be your company. Or your customer buying that data from you... and buying similar data from other companies.
> You say "somebody" would need to already know a lot about the user to make use of your database. But from my perspective - perspective of the user - that "somebody" could just as well be your company
Well yes, noticing problems in these databases and providing consulting towards their mitigation (ideally before the pain shows up) is more or less what we're selling. Some degree of analysis must remain possible because otherwise it would be unclear what the user is paying us for.
My point is just that different situations call for different privacy postures. "Anonymization = Untrustworthy" glosses over this in a way that doesn't help us find ways to improve the situation.
> My point is just that different situations call for different privacy postures.
True. Privacy is a special case of security, and it's only useful to talk about security in terms of what threats are of concern.
> "Anonymization = Untrustworthy" glosses over this in a way that doesn't help us find ways to improve the situation.
If your concern is to be as anonymous as possible, then "Anonymization = untrustworthy" is not an unreasonable stance. The only way I know of that data can be collected and handled in an anonymous way is to aggregate the collected data immediately and discard all individualized data.
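In code, that posture is closer to this than to hashing (a toy sketch; the field names are invented):

    from collections import Counter

    def aggregate(events):
        # Collapse raw per-user events into coarse counts; only the counts survive.
        # Field names ("city", "button") are invented for illustration.
        tallies = Counter((e["city"], e["button"]) for e in events)
        # A real pipeline would also drop or add noise to buckets with very small
        # counts, and the caller must discard `events` immediately after this call.
        return tallies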
Replacing data with something like a hash is useful in many ways, but there are many things it isn't that useful for.
If their goal is to be as anonymous as possible, why did they give you the data in the first place? If they're aware of your practices and still in the loop, then presumably their relationship with you is such that some degree of trade-off is acceptable.
I think the better focus for the article would be on transparency. I didn't write the script because I was conforming to some well defined policy about what's in bounds and what's out. I wrote it because somewhere in my heart I felt a vague duty to improve the situation. I'm getting a little pushback on my defense of the script--which is fair--but better would be to scrutinize the policies under which I had to ask my gut about it in the first place.
> If their goal is to be as anonymous as possible, why did they give you the data in the first place?
Often, it's because there is no other choice. When I'm revealing personal data to companies and such, that's usually why. Very few companies that I'm aware of are actually trustworthy on these matters, but I can't live without doing business with many of them regardless.
I absolutely do decide to reveal data to some companies that are optional, though. Like with all security matters, there's a cost/benefit calculation to be done here.
> I wrote it because somewhere in my heart I felt a vague duty to improve the situation.
Which is entirely laudable, and I think a worthwhile activity. It is, however, an effort that mitigates some risk rather than making things safe. People like the EFF are focused more on the goal of safety. It's almost certainly not a goal that can be fully achieved, but I'm glad that they're shooting that high anyway.
I understand your point and I agree with it within its scope, however I want to reinforce the point often missed by data handlers: you're talking about tradeoffs involved in doing the thing, whereas we (the users) are saying, don't do the thing. Anonymization may be a spectrum of trade-offs, but for the side EFF represents here, the two issues are: 1) that companies do the thing that makes anonymization a meaningful concept in the first place, and then 2) use one end of the spectrum to sell this to the public, while actually sitting on the opposite end of that spectrum.
These scripts always have the best intentions but typically end up neglected and suddenly production data is on a dev laptop (at least, in my experience, both as an employee and a customer).
Make mock data inputs or snapshot data based on internal/QA usage, please!
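Even something this crude goes a long way (a sketch; the fields and values are obviously made up):

    import random, uuid

    CITIES = ["Springfield", "Rivertown", "Lakeside"]   # entirely made up
    STATUSES = ["success", "failed", "pending"]

    def mock_row():
        return {"id": str(uuid.uuid4()),
                "city": random.choice(CITIES),
                "status": random.choice(STATUSES),
                "amount": round(random.uniform(1, 500), 2)}

    seed_rows = [mock_row() for _ in range(1000)]  # load these into the dev DB instead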
> replaces every string in the customer's database with a cryptographic hash [..] So unless your city is named "success", it's going to be missing from the dataset.
I'm not sure I understand. Wouldn't it be trivial to hash the name of every known city, thereby reversing your cryptographic hash? As is done for passwords, only with a much smaller space of possibilities.
They are salted hashes. Also, I chose city out of thin air. Realistically the column names are generic--"output", "input", things like that. So you'd also have to guess that this particular user is putting city data in their outputs before you'd be able to run the "known cities" attack.
Yes, password cracking works on salted hashes too. And the way you now describe it, with the entries in every column being hashed, and the column names themselves being meaningless*, I'm confused what value such a database has for anyone. They're just columns of random data, and the analyst doesn't even know what the data represents?
*And presumably, somehow, there being no way to recover this meaning. E.g. if an attacker is reversing hashes, they could try several sources of guesses - city names, country names, common first and last names in several languages, heights and weights in metric and imperial in the human range, valid phone numbers, IPv4 addresses, dates, blood type, license plate and post numbers or entire addresses, e-mail addresses if the attacker has a list of known-valid emails... all of these, and I'm sure many others that I forgot, are sufficiently small that they can be brute-force guessed by simply trying all possibilities, so no matter how generic the name of a column is, if it contains any of this data, the hashing can be reversed.
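Enumerating those domains is cheap. A sketch, assuming the hash function has been identified as MD5 and there is no per-value salt (a recovered DB-wide salt would just get prepended to each guess):

    import hashlib
    from datetime import date, timedelta

    def guesses():
        # Blood types: 8 values.
        yield from ("A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-")
        # Every date in a plausible birth-date range: ~45k values (try a few formats in practice).
        d = date(1900, 1, 1)
        while d <= date(2021, 12, 31):
            yield d.isoformat()
            d += timedelta(days=1)
        # Heights in cm: a few hundred values; phone numbers, IPv4, names, etc. work the same way.
        yield from (str(cm) for cm in range(120, 221))

    tokens_in_dataset = set()  # whatever hashed values were shared
    recovered = {}
    for guess in guesses():
        token = hashlib.md5(guess.encode("utf-8")).hexdigest()
        if token in tokens_in_dataset:
            recovered[token] = guess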
This conversation has me thinking that I could go a step further and replace the hashes with only enough characters to make them unique:
> Baltimore=>0x01a1...1ef7=>01a
> Aaron=>0x17bf...86f1=>17bf
> {"Foo":"Bar"}=>0x19f4...9af2=>19f
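Roughly like this (a sketch of that truncation; the three-character starting length just mirrors the examples above):

    import hashlib

    def short_handles(values):
        # Map each distinct string to a short, mutually distinct hash prefix.
        # Sketch only: a real version would need the mapping to stay stable
        # across tables and runs, and should use a salted/keyed hash.
        full = {v: hashlib.sha256(v.encode("utf-8")).hexdigest() for v in set(values)}
        handles = {}
        for value, digest in full.items():
            n = 3  # start at three characters, like the examples above
            while digest[:n] in handles.values():
                n += 1
            handles[value] = digest[:n]
        return handles

    # e.g. short_handles(["Baltimore", "Aaron", '{"Foo":"Bar"}']) returns three short,
    # mutually distinct handles that can stand in for the original strings.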
This would create lots of false positives if one attempted to reverse it, but it's still a 1:1 map, so data which satisfies unique and foreign key constraints would still do so after the transformation. As for the utility, it makes it so that you can run the app and see what the user sees--except their data is hidden from you. So suppose the user's complaint is:
> two weeks ago, workflow xyz stopped reaching its final state, but never explicitly changed status to 'failed' so we weren't alerted to the problem.
I can hash "xyz" and get its handle and then go clicking around in the app to see what stands out about it compared to its neighbors. Perhaps there's some should-never-happen scenario like an input that is declared but has no value, or there's a cyclic dependency.
I don't need to know the actual names of the depended-upon things to identify a cycle in them. The bugs are typically structural; it's not very common that things break only when the string says "Baltimore" and not when it says "01a".
Privacy-wise, the goal is to be as supportive as possible without having to ask them to share their screen or create a user for us so we can poke around. And when it's a bug, to have something to test with so we can say "It's fixed now." instead of "Try it again, is it fixed now? How about now?"
I'm the third party here, and I'm trying to prevent my users from sharing their users' data with me. Maybe it's not strictly "anonymization" because I'm not sure that this data references people. Remaining unsure is kind of the point.
But remember rule #1 of data security: if there is a way to access the data legitimately, there is a way to access it illegitimately. The only question is the amount of time and effort required to do so.
Neglecting possible corner cases is common in handcrafted, "well-intentioned" approaches such as this. I put well-intentioned in quotes because assuming you know all the issues, and being prematurely confident in the solutions, demonstrates a lack of intention to actually get it right. Hubris permeates the mindscape of move-fast-break-things types.
Sometimes the failures are more obvious such as those pointed out by sibling comments. Part of the EFF's general point is it's a complicated business with so many pitfalls that it's a shame we have to learn by trial and error when people's private data is on the line. There's a reason many large companies have entire departments exclusively focused on privacy in their data stores.