
They are salted hashes. Also, I chose city out of thin air. Realistically the column names are generic, "output", "input" things like that. So you'd also have to guess that this particular user is putting city data in their outputs before you'd be able to run the "known cities" attack.



Yes, password cracking works on salted hashes too. And the way you now describe it, with the entries in every column being hashed, and the column names themselves being meaningless*, I'm confused what value such a database has for anyone. They're just columns of random data, and the analyst doesn't even know what the data represents?

*And presumably, somehow, there being no way to recover this meaning. E.g. if an attacker is reversing hashes, they could try several sources of guesses - city names, country names, common first and last names in several languages, heights and weights in metric and imperial in the human range, valid phone numbers, IPv4 addresses, dates, blood type, license plate and post numbers or entire addresses, e-mail addresses if the attacker has a list of known-valid emails... all of these, and I'm sure many others that I forgot, are sufficiently small that they can be brute-force guessed by simply trying all possibilities, so no matter how generic the name of a column is, if it contains any of this data, the hashing can be reversed.
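To make the guessing attack concrete, here is a rough sketch. It assumes the attacker has the salt (salts are normally stored next to the hashes) and that a single salt covers the whole column. The hash function (SHA-256), the salt value, the function names, and the tiny city list are all made up for illustration, not taken from the system being discussed.

```python
import hashlib

def salted_hash(salt: bytes, value: str) -> str:
    """Hash a value with a known salt (SHA-256 is an assumption here)."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()

def reverse_by_guessing(salt, target_hashes, candidates):
    """Hash every candidate guess and keep the ones that appear in the column."""
    lookup = {salted_hash(salt, c): c for c in candidates}
    return {h: lookup[h] for h in target_hashes if h in lookup}

# Hypothetical salt and "anonymized" column contents.
salt = b"example-salt"
column = [salted_hash(salt, v) for v in ["Baltimore", "Denver", "Zzyzx"]]

# The attacker's guess list; a real one would hold every known city name.
known_cities = ["Austin", "Baltimore", "Chicago", "Denver"]
recovered = reverse_by_guessing(salt, column, known_cities)
```

Every value that is on the guess list is recovered, and the cost is one hash per guess, which is why small value spaces (cities, dates, phone numbers) offer little protection regardless of what the column is called.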


This conversation has me thinking that I could go a step further and replace the hashes with only enough characters to make them unique:

> Baltimore=>0x01a1...1ef7=>01a

> Aaron=>0x17bf...86f1=>17bf

> {"Foo":"Bar"}=>0x19f4...9af2=>19f
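A minimal sketch of that truncation step, under some assumptions: SHA-256 as the hash, a hypothetical salt, a minimum handle length of 3 hex characters, and a hypothetical function name `shortest_unique_handles`. Each value gets the shortest prefix of its digest that no other digest in the set starts with, so handles can differ in length, as in the examples above.

```python
import hashlib

def full_hash(value: str, salt: bytes = b"example-salt") -> str:
    """Salted SHA-256 hex digest (hash choice and salt are assumptions)."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()

def shortest_unique_handles(values, min_len=3):
    """Map each distinct value to the shortest digest prefix unique to it."""
    digests = {v: full_hash(v) for v in set(values)}
    handles = {}
    for value, digest in digests.items():
        others = [d for v, d in digests.items() if v != value]
        for length in range(min_len, len(digest) + 1):
            prefix = digest[:length]
            # Accept the prefix only if no other digest begins with it.
            if not any(o.startswith(prefix) for o in others):
                handles[value] = prefix
                break
        else:
            raise ValueError(f"no unique prefix for {value!r}")
    return handles

handles = shortest_unique_handles(
    ["Baltimore", "Aaron", '{"Foo":"Bar"}', "Baltimore"]
)
```

Because a handle is only accepted when no other digest shares it, two different values can never end up with the same handle within one data set. The handles do depend on the whole set, though, so adding rows later can force some of them to grow.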

This would create lots of false positives if one attempted to reverse it, but it's still a 1:1 map, so data which satisfies unique and foreign key constraints would still do so after the transformation. As for the utility, it makes it so that you can run the app and see what the user sees--except their data is hidden from you. So suppose the user's complaint is:

> two weeks ago, workflow xyz stopped reaching its final state, but never explicitly changed status to 'failed' so we weren't alerted to the problem.

I can hash "xyz" and get its handle and then go clicking around in the app to see what stands out about it compared to its neighbors. Perhaps there's some should-never-happen scenario, like an input that is declared but has no value, or a cyclic dependency.

I don't need to know the actual names of the depended-upon things to identify a cycle in them. The bugs are typically structural; it's not very common that things break only when the string says "Baltimore" and not when it says "1ef7".
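As a sketch of why the names don't matter: a depth-first search finds a dependency cycle using nothing but the opaque handles. The dependency map below is invented for illustration, and `find_cycle` is a hypothetical helper, not something from the actual app.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of handles, or None if acyclic.

    deps maps each handle to the handles it depends on.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in deps.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                # nxt is already on the current path, so we closed a loop.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(deps):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None

# Invented example: handles only, no real names anywhere.
deps = {"01a": ["17bf"], "17bf": ["19f"], "19f": ["01a"], "2c4": ["01a"]}
cycle = find_cycle(deps)
```

Here `cycle` comes back as `["01a", "17bf", "19f", "01a"]`, which is enough to point at the broken workflow without ever seeing what those three things are called.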

Privacy-wise, the goal is to be as supportive as possible without having to ask them to share their screen or create a user for us so we can poke around. And when it's a bug, to have something to test with so we can say "It's fixed now." instead of "Try it again, is it fixed now? How about now?"

I'm the third party here, and I'm trying to prevent my users from sharing their users' data with me. Maybe it's not strictly "anonymization" because I'm not sure that this data references people. Remaining unsure is kind of the point.


But remember rule #1 of data security: if there is a way to access the data legitimately, there is a way to access it illegitimately. The only question is the amount of time and effort required to do so.



