belorn · on Feb 10, 2015

> I feel like most of those research questions could be answered if

If the person who releases this kind of information has the foresight to know what the questions are going to be, they could provide the answers directly rather than go half-way and modify the data. It would likely be less work than trying to produce anonymized data that is both useful and secure.

What I see used in cases like this is one of two options: either full public access, or restricted access where only a select few get the chance to do the research. The 0.01% misuse thus has to be weighed against those two real options, rather than against the theoretical case of anonymized data.

m8urn · on Feb 10, 2015

As I explained in the article, I seriously doubt that more than a tiny number of these passwords are still valid. And there is no reason for them to be, having already been widely available, indexed (and cached) by every search engine, archived at archive.org, and downloaded by thousands or tens of thousands of people. Anyone who would use this data maliciously probably already has it.

Much of this data is the same data monitored by sites like haveibeenpwned.com and a dozen others. Facebook scrapes these. LastPass will send you alerts. The risk here is minimal; the research value is much more than you realize.

meowface · on Feb 10, 2015

>Anyone who would use this data maliciously probably already has it.

You might be surprised. The fact that these dumps are supposedly quite old certainly mitigates the risk, but I've seen cases of primary email accounts being taken over using a plaintext password from a dump 5+ years old. No one had ever tried the password on the email account because the address wasn't in the dump and wasn't identical to the username, though it was very close.

Aggregators like haveibeenpwned.com and LastPass use the passwords they scrape responsibly; they don't release them all in a big batch like this. Many cybercriminals do the same kind of scraping and share these aggregated lists privately, but they're always going to be missing things, so there's no question they're all going to be pulling in your list, too. And odds are yours has at least one dump that a lot of them missed.

I do understand there is some research benefit here, but even in the best possible scenario I don't think the value from the research outweighs the costs.

m8urn · on Feb 10, 2015

First of all, a good number of these passwords were simply gathered through Google. Some were gathered via the archive.org archive of pastebin pastes and their normal web page archive. Some were from forums that were located via Google. This data is already out there; being aggregated doesn't make it any easier to hack these people.

Try searching for "Cucum01:Ber02" or "shawman:badman" and you will see how many passwords are indexed. I have hundreds of searches like these that I monitor and scrape.

Second, I regularly share my data with the owners of password checking sites such as haveibeenpwned so that users can find out about these breaches. Releasing this data isn't something I have taken lightly; I debated it for years. I have weighed the risks and felt it was important to release the raw data, although not everyone will agree with me on this. I made a good effort to minimize the risks to actual users.

Finally, keep in mind that most users are already at risk simply because they have bad passwords. Ten percent of users have a password on the top 1000 list. A large percentage of users are at risk because the websites they are on don't have proper security. This is how people get hacked, not because of a password found on this list.
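For anyone who wants to check a figure like the "top 1000" one against their own data, here is a minimal sketch of how such a coverage number could be computed. The function name `top_n_coverage` and the sample passwords are made up for illustration; this is not the method used to produce the number above, and the top-N list here is derived from the same dataset it is measured on.

```python
from collections import Counter


def top_n_coverage(passwords, n=1000):
    """Fraction of accounts whose password is among the n most common ones.

    `passwords` holds one entry per account, so repeats are expected.
    """
    if not passwords:
        return 0.0
    counts = Counter(passwords)
    # Sum the account counts of the n most frequent passwords.
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(passwords)


# Hypothetical toy data, not taken from any real dump.
sample = [
    "123456", "123456", "password", "xK9#qLm2", "123456",
    "letmein", "password", "t7!Rw0pz", "qwerty", "123456",
]
print(top_n_coverage(sample, n=2))
```

With the toy list, the two most common passwords cover 6 of the 10 accounts.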

jschwartzi · on Feb 10, 2015

Still, the whole purpose of a password is to remain secret. He's certainly doing these users a disservice by releasing this list regardless of the hypothetical likelihood of the data already being available. Basically the arguments for doing this all seem to boil down to "they should already know their passwords are compromised" which nobody can guarantee is the case.

I agree that having a crappy password puts you at risk, but what about the people who genuinely tried to use some common sense but are on this list anyway? Is it their fault for not religiously keeping up with the latest indexed password lists?

pbreit · on Feb 10, 2015

OK, I'll bite: can you give us some ideas on how this would lead to a genuine advancement in user authentication (that we wouldn't have with username/pw de-linked)?

totony · on Feb 10, 2015

Example:

Username: mickael

Password: mickael69

EDIT: Just to be more precise, there is a correlation here, and with so much data a lot can be known. Patterns can then be forbidden from password fields so the website is less prone to dictionary attacks.
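A rough sketch of how that correlation could be measured when usernames and passwords stay linked. The helper names, the `min_len` cutoff, and the sample pairs (other than the example above) are my own assumptions, and a simple substring test is only one of many possible patterns.

```python
def contains_username(username, password, min_len=4):
    """True if the password contains the username, ignoring case.

    Very short usernames are skipped (assumed cutoff) to avoid
    accidental matches.
    """
    name = username.lower()
    return len(name) >= min_len and name in password.lower()


def username_reuse_rate(pairs):
    """Fraction of (username, password) pairs where the password embeds the username."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(1 for user, pw in pairs if contains_username(user, pw))
    return hits / len(pairs)


# Hypothetical pairs for illustration only.
pairs = [
    ("mickael", "mickael69"),
    ("alice", "tr0ub4dor&3"),
    ("bob", "bob123"),        # skipped: username shorter than min_len
    ("carol", "Carol2015"),
]
print(username_reuse_rate(pairs))
```

This kind of count is impossible once usernames and passwords are de-linked, which is the point of the example.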

pbreit · on Feb 11, 2015

So what would you do here? Disallow "mickael" from the password? That's pretty user-hostile and almost completely pointless.

totony · on Feb 11, 2015

Is it pointless to reduce the attack surface of your website? And, no, for a banking system, it is not that user-hostile to say things like "we have found that using <pattern> in your password makes it easy for people to guess, please choose a more complicated password".
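As a sketch of what such a check might look like at signup (the function name, the `min_len` cutoff, and the specific rules are assumptions for illustration, not a recommendation of a complete policy):

```python
import re


def username_pattern_warning(username, password, min_len=4):
    """Return a warning message if the password is built around the username.

    Checks the username and its reverse against the letters of the
    password, so "mickael69" and "69leakcim!" are both flagged.
    Only ASCII letters are considered (a simplifying assumption).
    """
    name = username.lower()
    if len(name) < min_len:
        return None
    # Drop digits and symbols so padding like "69" or "!" is ignored.
    letters = re.sub(r"[^a-z]", "", password.lower())
    if name in letters or name[::-1] in letters:
        return (
            "We have found that using your username in your password "
            "makes it easy for people to guess, please choose a more "
            "complicated password."
        )
    return None


print(username_pattern_warning("mickael", "mickael69"))
print(username_pattern_warning("mickael", "correct horse battery staple"))
```

The first call prints the warning; the second prints `None` because the password has no trace of the username.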