A little confused. I thought a hash algorithm was designed to make every bit of ...

andrewla · on June 15, 2018

The usual way to do this is to treat the bloom filter as a cache that can have false positives. So instead of querying haveibeenpwned for each attempted password from the user, you only query if the bloom filter says that the password was found. If the bloom filter says that the password was not found, then that means it is definitively not found in the upstream.

_wldu · on June 15, 2018

Exactly. More people should read the original Bloom Filter paper. It covers this. Bloom (1970):

"In application areas where error-free performance is a necessity, these new methods are not applicable." (p. 422)

https://dl.acm.org/citation.cfm?doid=362686.362692

gknoy · on June 15, 2018

The bloom filter will tell you whether _the hash of your password_ is in the set. It doesn't matter whether your _password_ matches a pwned one, it only matters whether your password has the same hash as one that is pwned. This tool expands on that by simplifying the test such that it gets more false positives, but still has no false negatives.

Imagine that password hashes are six characters, and `abcdef` is the only pwned password we know about. A smaller version of this tool would reject your password if it hashes to anything that starts with `abc` (such as `abc000`, `abcdee`, `abcdef`, etc). With this toy example, we can see that this is deliberately pessimistic -- it only is a "your password might be pwned" indicator.

It's like throwing out anything in your fridge that is past its expiration date, even if you don't know whether it's actually spoiled.

lightbyte · on June 15, 2018

>Imagine that password hashes are six characters, and `abcdef` is the only pwned password we know about. A smaller version of this tool would reject your password if it hashes to anything that starts with `abc` (such as `abc000`, `abcdee`, `abcdef`, etc).

Doesn't that not work very well at all though? If these "abcXXX" strings are hashes, than just because they share a prefix doesn't mean the unhashed password are even remotely similar.

"good-password-123-dog-xxx..." could hash to "abcdeg" and "password" could be the thing that hashed to "abcdef".

HellsMaddy · on June 16, 2018

You're completely right, but I think you're misinterpreting the usefulness of this bloom filter. It would most likely serve only as the first layer in a password strength check.

Because there is no possibility of a false negative, you know that if the bloom filter does not include the hash you gave it, the password definitely does not exist in the original corpus. In this case, the password is probably at least semi-unique. If you're using this bloom filter in the context of a web application registration process, this would likely be the case for anyone trying to sign up with a strong/unique password, e.g. a random one generated by a password manager.

However, if the bloom filter maybe contains the hash, you would likely fall back to some more expensive strength check, like querying the full HIBP database to see if this exact password definitely exists in there.

psobot · on June 15, 2018

Similarity between passwords doesn't really matter for the purposes of this checker - all it's looking for is a _possible_ exact match. If your password hashes to "abcXXX", then there are two possibilities:

- "abc" is in the bloom filter, and so your password _might_ be in the list (depending on what the remaining characters past "abc" are)

- "abc" is not in the bloom filter, so your password _is definitely not_ in the list.

This results in a high rate of false positives (times when a password is incorrectly thought to be part of the list, but it's not) but that's deemed to be an acceptable tradeoff for this algorithm.

sbarker · on June 15, 2018

Collisions in modern, properly implemented algorithms are rare. http://everydayinternetstuff.com/2015/04/hash-collision-prob... https://www.google.com/amp/s/www.theregister.co.uk/AMP/2017/...

lightbyte · on June 15, 2018

Collisions for the entire hash are rare, but this is just a prefix of the hash. I imagine it is significantly more common for a collision in the prefix.

ubernostrum · on June 15, 2018

I maintain a library that talks to the Have I Been Pwned password database, which uses a 5-hexadecimal-digit prefix of the SHA1 hash.

Someone filed a PR asking to add a cache for HIBP's responses, and I did some quick math and came up with ~1200 as a birthday-paradox number for the first five digits of a SHA1 hex digest (assuming uniform/random distribution of hex digits).

tylersmith · on June 15, 2018

If the bloom filter returns positive you check the HIBO DB by getting all the hashes with that prefix and check the returned list for the full hash.

basch · on June 15, 2018

The average number of hashes returned is 478

The smallest is 381 (hash prefixes "E0812" and "E613D")

The largest is 584 (hash prefixes "00000" and "4A4E8")

https://www.troyhunt.com/ive-just-launched-pwned-passwords-v...

FroshKiller · on June 15, 2018

Little code running in the real world any given moment is modern or properly implemented.

notimetorelax · on June 15, 2018

I guess the logic here was as follows:

It's not good to send clear text user password to yet another system

-> send it hashed

But full hash might still be vulnerable to the rainbow table attack

-> let's use only a part of the hash, AND, reject full hashes.

In summary, I believe, they tried to avoid creating more ways in which user passwords could be leaked.