Why Searching Through 500M Pwned Passwords Is So Quick (troyhunt.com)
304 points by pstadler on Feb 27, 2018 | 172 comments



Ummm... because it’s an O(1) array lookup not a search at all? Infuriating.

It’s read-only static data. Spending even 60ms on the response is ridiculous. Reading from files in blob storage... WTF?

Ctrl-F Redis - was disappointed.

Actually, even forget Redis. Pre-generate each of the 1 million possible HTTP responses and store in a string array. The 5 character hex is the index into the array. Write < 100 lines of Go to load the data structure and serve it. What am I missing?
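
Something like this, maybe (a rough, hypothetical sketch; it assumes the 16^5 response bodies have already been pre-generated into responses/00000.txt ... responses/fffff.txt):

  // Hypothetical sketch of the "array of pre-generated responses" idea.
  // Assumes one pre-built text body per 5-character hex prefix on disk.
  package main

  import (
      "fmt"
      "log"
      "net/http"
      "os"
      "strconv"
      "strings"
  )

  var responses [1 << 20]string // 16^5 = 1,048,576 pre-generated bodies

  func main() {
      for i := range responses {
          prefix := fmt.Sprintf("%05x", i)
          body, err := os.ReadFile("responses/" + prefix + ".txt")
          if err != nil {
              log.Fatal(err)
          }
          responses[i] = string(body)
      }
      http.HandleFunc("/range/", func(w http.ResponseWriter, r *http.Request) {
          prefix := strings.ToLower(strings.TrimPrefix(r.URL.Path, "/range/"))
          idx, err := strconv.ParseUint(prefix, 16, 64)
          if err != nil || len(prefix) != 5 {
              http.Error(w, "bad prefix", http.StatusBadRequest)
              return
          }
          fmt.Fprint(w, responses[idx]) // O(1) lookup, no search
      })
      log.Fatal(http.ListenAndServe(":8080", nil))
  }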

This is like “Hello World” in those HTTP Framework Benchmarks that used to make the rounds every few months.


The 250MB Redis instance on Azure costs $0.055 an hour[1], so nearly $500 per year.

Going down the route of hosting a 100-line Go program, the cheapest "B1" instance type with 1GiB of RAM costs $105.12 a year, and then if you want the service to be HA you probably need another instance in a different zone, maybe a load balancer as well.

I suspect Azure Functions + Blob Storage costs a lot less. Although he didn't mention how much CloudFlare was costing him.

[1] https://azure.microsoft.com/en-us/pricing/details/cache/ [2] https://azure.microsoft.com/en-us/pricing/details/virtual-ma...


You could host this whole thing on the cheapest Digitalocean droplet if it were written as the person you're replying to described.


And then it's a single server that has none of the scaling, ease of maintenance, or high availability of just using functions and table storage. This is the perfect scenario for serverless, why would you want to run a VM instead?

The current system works great, most of the hits will be cached, and optimizing the origin to shave off a few milliseconds isn't worth it at all.


It's actually NOT a perfect scenario for serverless, because it's a read-only (no writes), simple search "web app". The one thing in your list that you don't get from a VM is high availability, I'll give you that, but otherwise this solution is an over-engineered monstrosity compared to solving this simple problem in a more traditional way, as many many other people are calling out here.

If the only reason your web app is fast is when its data is cached, you need to take a long hard look at how your web app is built. It's not that hard to build a performant query system like this even with that amount of data.


Read vs write has nothing to do with serverless, it's about well-encapsulated logic that doesn't need any server management and has granular pay-per-use billing.

It doesn't get much simpler than a function that does a key/value lookup or returns a text file; what exactly is over-engineered about that? The app is fast, it's < 100ms which is perfectly fine for non-cached access. This isn't mission critical real-time, it's a completely free service to lookup a password hash.

Everyone here is caught up in the usual "we can do better" perspective without actually thinking about why it was done this way. In reality the current method is cheaper, more reliable, more scalable and has 0 maintenance compared to anything else suggested so far.


I'm guessing he didn't do that because he's a prominent .NET/Azure guy and I don't think that stuff is well-supported in that environment.


He mentioned that Cloudflare is free for him.


Precisely my first thought. Static readonly data? Why isn't this just sitting in memory?


Don't filesystem caches basically do this for you?


Yes, but with a heck of a lot more overhead than:

  response.body = result[(int)hexCode];


No, with a lot less overhead because you don't need to write custom web server code.

Edit: and if he points the requests directly at blob storage as he's suggesting, we're very close to flat files on a filesystem again.


You can do that after mmap(2) and you never have to reach out for read(2). Might need to use unsafe language constructs though.
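
In Go, for instance, the mapped file comes back as an ordinary byte slice, so no unsafe is needed. A sketch, under a big assumption about the data layout:

  // Sketch only: assumes the dump has been preprocessed into a file of raw,
  // sorted, fixed-width 20-byte SHA-1 digests (not the published hash:count
  // text format), so every record sits at a fixed offset.
  package main

  import (
      "bytes"
      "encoding/hex"
      "fmt"
      "os"
      "sort"
      "syscall"
  )

  func main() {
      f, err := os.Open("pwned-sha1.bin")
      if err != nil {
          panic(err)
      }
      defer f.Close()
      fi, err := f.Stat()
      if err != nil {
          panic(err)
      }

      // The kernel's page cache backs this slice; lookups never issue read(2).
      data, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()),
          syscall.PROT_READ, syscall.MAP_SHARED)
      if err != nil {
          panic(err)
      }
      defer syscall.Munmap(data)

      const rec = 20 // bytes per SHA-1 digest
      n := len(data) / rec
      // SHA-1 of "P@ssw0rd", per the article's example.
      target, _ := hex.DecodeString("21BD12DC183F740EE76F27B78EB39C8AD972A757")

      // Plain binary search over the fixed-width records.
      i := sort.Search(n, func(i int) bool {
          return bytes.Compare(data[i*rec:(i+1)*rec], target) >= 0
      })
      fmt.Println("pwned:", i < n && bytes.Equal(data[i*rec:(i+1)*rec], target))
  }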


Sounds reasonable, 500mil * 160 bits of SHA-1 would be 500 mb of space. Even with some metadata to manage it, the smallest instances should be able to handle that.


500,000,000 × 160 bits / (8 bits/byte) = 10,000,000,000 bytes ≈ 9,537 MiB ≈ 9.31 GiB


I think this solution is banking on storage + azure function cost being cheaper than the per instance cost to service the same load with an acceptable latency.


Couldn't he run a simple KV Consul cluster in a few zones and regions in either AWS or Google Cloud and be done?


That's ridiculously complex and more expensive than just writing two functions that call the table storage and blob storage APIs. Given the popularity of serverless, I'm surprised by all these answers saying he should run actual VMs when this is the perfect use-case for functions.


For an extreme diet, I'd use the HTML application cache for the pretty pictures. I know it's deprecated, but it still has massive support in all browsers and is instantaneously fast. That removes all origin hits except the API requests.

On the lookups, Cloudflare Workers have a global state that is shared amongst nodes. I've been playing with that. If you hold the string array in that and can get origin hits below 10%, you could run this service with many millions of requests a day practically for free. On a VPS though, not Amazon.


Just putting this on a vps or cheap dedicated server with 16gb ram would've led to sub-ms response times at much lower costs (if you don't get Azure and Cloudflare for free like him). At those response speeds, scalability is also not really an issue if you cache aggressively at the edge.

Argo is then nice to have but not really necessary. If the server responds in <1ms, those 30% saved on the RTT are probably not detectable for the user.


I'm not sure that you quite realize that Troy's intention here is to eventually be soaking a whole lot of zeroes with this. Add to that that, if people are relying on this as a service, it had better not go down under any remotely plausible circumstances, and some growth-hacker's dedicated server from Hetzner or whatever is unsuitable. Edge caching only blunts this sufficiently if everybody's searching for the same passwords, and proving that that's the case is a burden you have not shouldered.

To that end, "just use a VPS or a cheap dedicated server" sounds a whole lot like the middlebrow thing we're supposed to not be fans of around here.


> Troy's intention here is to eventually be soaking a whole lot of zeroes this this

huh?


Sorry, shorthand common in my neck of the woods. Used for any Thing per Your Favorite Time Unit, measured logarithmically. In this case, requests-per-second.

(Edit: ah, typo. s/this this/with this. Clearer?)


@pbhjpbhj - I believe the meaning of the sentence is that Troy Hunt is expecting/planning for a few orders of magnitude of growth.

"Soak a few zeros" for example going from 1,000 calls per second to 1,000,000 calls per second.


what an odd way to say that.


"Soak" is often used for "take" in engineering circles. "Zeroes" is often used for orders of magnitude in engineering circles. I've never before heard these put together in that fashion, however.


Sorry, I still don't understand it.

Could you write out deelowe's quoted sentence, above, as clearly as you're able?


> I'm not sure that you quite realize that Troy's intention here is to eventually be soaking a whole lot of zeroes with this.

Troy's goal is to get this used by a lot of people with a large number of requests per second (so, a number with a lot of zeros in it)


def not clearer. I know we are all supposed to be smart here, but maybe use more common words and language. People come from a lot of different backgrounds here.


On the other hand, I'm happy to learn a new fun term. Language is boring without aphorisms.

Learning 'soak a few zeros' is more valuable to me than the whole rest of this thread. I doubt I'll ever use it, but I like it.


> Edge caching only blunts this sufficiently if everybody's searching for the same passwords

AFAIK caching already kicks in when people check different passwords that have the same first five characters in their hash.


Sorry, you're right, I should have been clearer. The bigger concern is that caching doesn't replace fault tolerance and rolling your own disaster recovery. I didn't mean to suggest that caching was not being done or would not be helpful. A million keys would be pretty uniformly dispersed; intuitively it seems like you'd need a lot more traffic than he's currently got before caching reliably soaks enough load off of "a VPS or cheap dedicated server" that you can afford to have the server blow up--because servers blow up.

"A VPS or cheap dedicated server" is YOLO stuff. Not something you do when you want other people to rely on you.


You can set up three VPSes and use Cloudflare's load balancing or a solution by someone like DigitalOcean? That'll still be cheaper than using Lambdas and blob storage.


If you use Cloudflare, they should still handle attacks and caching. For the origin server, I can't imagine any scenario where you'd get a lot of growth from the immediate attention. How are more than a few thousand requests per second realistic? It's not as if the whole world would suddenly constantly check passwords. And if you use his API for 3rd party services it'd make more sense to open source it so that others can use a local copy.

> To that end, "just use a VPS or a cheap dedicated server" sounds a whole lot like the middlebrow thing we're supposed to not be fans of around here.

I know that we're all supposed to be fans of AWS and everything as a service. But in reality I've found simple VPSes (can be Hetzner, OVH, AWS, Google, whatever) to be faster, cheaper and less prone to downtime (as most downtime comes from software rather than hardware anyway).


Couldn't he just pregenerate all 1,048,576 responses, load them somewhere in RAM (or just a bunch of static files on an nginx with caching on) and be done with it? I mean he writes that a single response, gzipped, averages 10KB, so that's only around 10GB in total.

Even better: host this on a service like Netlify and not even have the 30 day cache timeout Troy has here (which means 30 days old info in case of new breaches). Just regenerate the entire set on the dev box whenever there's a new breach (should be fast enough, it's a linear search & split) and push it to Netlify, it'll invalidate all edge caches automatically.
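
The regenerate step really is just a linear scan and split; a rough sketch (the input file name is hypothetical, and it assumes the dump variant that's sorted by hash):

  // Sketch of the "linear scan & split" step: turn the single ordered hash
  // dump (lines like "HASH:count") into one response file per 5-character
  // prefix, ready to push to any static host.
  package main

  import (
      "bufio"
      "log"
      "os"
  )

  func main() {
      in, err := os.Open("pwned-passwords-ordered-by-hash.txt")
      if err != nil {
          log.Fatal(err)
      }
      defer in.Close()
      if err := os.MkdirAll("responses", 0o755); err != nil {
          log.Fatal(err)
      }

      var out *os.File
      current := ""
      scanner := bufio.NewScanner(in)
      for scanner.Scan() {
          line := scanner.Text() // "HASH:count"
          if len(line) < 6 {
              continue
          }
          prefix, suffix := line[:5], line[5:]
          if prefix != current {
              if out != nil {
                  out.Close()
              }
              if out, err = os.Create("responses/" + prefix + ".txt"); err != nil {
                  log.Fatal(err)
              }
              current = prefix
          }
          out.WriteString(suffix + "\n") // API responses only contain the suffix
      }
      if out != nil {
          out.Close()
      }
      if err := scanner.Err(); err != nil {
          log.Fatal(err)
      }
  }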


The breach data isn't updated that often. It was 6 months between v1 and v2, so a long cache is totally fine. And then invalidate the cache if there is an update.


That.. sort of supports my argument :-) He could cache it indefinitely instead of 30 days if he's going to invalidate the cache anyway.


I do appear to have misinterpreted what your point was. Sorry about that. Although I assume the idea was limiting how much is cached at any point on their part. Not sure it really matters though, it's not that much data.


It sounds like he did pregenerate all the responses, then stored them in Azure Blob Storage. From the article I can't tell why he even needs to use Azure Function. Azure Blob Storage can be configured for public access over HTTP [0], so he could point Cloudflare directly at it.

[0] https://docs.microsoft.com/en-us/azure/architecture/patterns...


The function takes in the hash prefix and sends the correct file in response. That decision has to be made somewhere.


I'm saying that, at least as described in the blog post, the Azure function is unnecessary because that decision can be made on the frontend (or in the API client).

Here is how it sounds like it works now when someone visits https://haveibeenpwned.com/Passwords

1) User enters abc123 in the form

2) The frontend translates that to the hash prefix 61ee8

3) The frontend interpolates that into the URL for the function, https://api.pwnedpasswords.com/range/{hashPrefix}

4) The frontend requests https://api.pwnedpasswords.com/range/61ee8

5) The Azure function interpolates the hash prefix into the URL for the response in blob storage, something like http://pwnedpasswords.blob.core.windows.net/foo/{hashPrefix}...

6) The Azure function requests http://pwnedpasswords.blob.core.windows.net/foo/61ee8.txt

7) The Azure function returns that to the frontend

Here is what I was proposing

1) User enters abc123 in the form

2) The frontend hashes that to 61ee8

3) The frontend interpolates that into the URL for the response in blob storage, something like http://pwnedpasswords.blob.core.windows.net/foo/{hashPrefix}...

4) The frontend requests http://pwnedpasswords.blob.core.windows.net/foo/61ee8.txt
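
The client-side code is the same either way; here's a sketch in Go against the existing public endpoint (under the proposal you'd only swap baseURL for the blob storage address):

  // Sketch of the client side of the flow above. Only the 5-character prefix
  // of the SHA-1 hash ever leaves the machine; under the proposal, baseURL
  // would simply point at blob storage instead of the Azure Function.
  package main

  import (
      "bufio"
      "crypto/sha1"
      "fmt"
      "net/http"
      "strings"
  )

  const baseURL = "https://api.pwnedpasswords.com/range/"

  func pwnedCount(password string) (int, error) {
      sum := sha1.Sum([]byte(password))
      hexSum := fmt.Sprintf("%X", sum[:])
      prefix, suffix := hexSum[:5], hexSum[5:]

      resp, err := http.Get(baseURL + prefix)
      if err != nil {
          return 0, err
      }
      defer resp.Body.Close()

      // Each response line is "SUFFIX:count"; match ours locally.
      scanner := bufio.NewScanner(resp.Body)
      for scanner.Scan() {
          parts := strings.SplitN(strings.TrimSpace(scanner.Text()), ":", 2)
          if len(parts) == 2 && strings.EqualFold(parts[0], suffix) {
              var n int
              fmt.Sscan(parts[1], &n)
              return n, nil
          }
      }
      return 0, scanner.Err()
  }

  func main() {
      n, err := pwnedCount("abc123")
      fmt.Println(n, err)
  }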


Sure that works for that site, but it's also a public API that can be used by any client so it's easier to just give out a single URL rather than point to files. The HIBP website just uses the same API instead of doing different logic.


When you don't need transactions, don't have a cache invalidation problem, and are querying read-only data, then this architecture - or really any architecture that takes advantage of what makes HTTP scalable, mostly idempotent, cacheable responses to GETs - makes sense.


I do think it is a bit disingenuous that the author compares this to a classic web application and its data needs, when the needs here are so trivial. It is very well designed for what it does (the security needs of such a checker almost dictate the design), but it is a model usable by very few applications.

As an aside, the article talks a bit about Brotli, and it's worth noting that Brotli is essentially LZ77 pre-seeded with a ~119KB static dictionary of commonly seen web text. It is of course going to be fantastic for compressing an HTML document, where much of the content is verbose and common, but would do nothing above gzip for the hash result data. I would be surprised if it yielded a single byte of savings in that case. Brotli is supported by most browsers because it was snuck into the WOFF 2.0 standard, so browsers that support the new web font standard automatically have to support Brotli.

https://dennisforbes.ca/index.php/2016/01/28/eat-your-brotli...


Depends on whether you compare this to what a classic web application should be doing versus what it actually does.


But it looks like IIS doesn't support brotli yet. Which also means it is probably not supported in azure either.


Not to mention that the dataset is relatively small.

A 16GB of ram server could basically serve responses sub-millisecond at line rate for like $25/month.


Where are you getting 16GB ram for $25 a month? Would be pretty handy.


I currently have 2TB HDD, 16GB RAM, infinite bandwidth (100Mbps U/D) for 19.19€/month IIRC at Kimsufi [0]

[0] https://www.kimsufi.com


And that pitches fault tolerance and replicability out the window unless nontrivial developer work is dumped into it. If the disk goes farming in your Kimsufi instance, you are looking at a bad day.


That really wasn't the point of my reply.

Sure, if the machine dies, my services die. But it's still dirt cheap.


A dirt bike is cheap, too. It's not a car. It's definitely not a good car. It doesn't have much to help you when you hit a tree.


I pay 30 USD for 16 gig RAM, 6 TB HDD and a high-end i7. It's cheaper than buying the hardware and running it at my home. EC2 would cost me $130 just for the instance, excluding the 6TB space. So far my server has more than 1000 days uptime, 0 problems. Also if you have 1 instance in a cloud or 1 instance on a dedi you should still have backup plans in place. Cloud is good for variable load, definitely not for constant load. Also you can get dedis with SAN storage if all you care about is storage.


and 200-400ms latency to not-Europe.


210ms from South Africa, 91ms from NYC, 198ms from India, 360ms from Melbourne, 55ms from Morocco.

It's really not that bad...


A DNS request/response, TCP handshake, SSL handshake and last but not least an HTTP request/response is an awful lot of 210ms round trips.
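
Roughly, assuming a cold connection with TLS 1.2, no session resumption and a far-away resolver: 1 RTT for DNS + 1 for the TCP handshake + 2 for the TLS handshake + 1 for the request/response ≈ 5 × 210 ms ≈ 1 s before the first byte arrives. TLS 1.3 or session resumption shaves one of those off.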



A Hetzner CX41 is 16GB RAM @ €15.90/mo (about 18 usd) https://www.hetzner.com/cloud


Surprised Troy is so pro-cloudflare. I feel like they create a lot of security headaches.


Troy is clearly very pro Cloudflare and Azure. I’m only familiar with them at a high level and it would be great to hear counterpoints to his praise and get (at least the impression of) a debate that encompasses their pros and cons.

+1 for being interested in expanding this


Basically they add yet another layer of centralization and they're a MITM for all your traffic, so they're an easy target for surveillance. If only they weren't so convenient... I use their DNS service because it's top notch, but I also tend to enable their MITM service because their CDN nodes are super fast and require just a single toggle to enable. They read my far-future expires headers and just cache the content on the edge, no extra work needed.


Don't know about Cloudflare, but as a Microsoft Regional Director he's certainly pro-Azure.


Such a weird title; it makes me think he works for them, but he doesn't



didn't he work for microsoft previously?



He says that he supports CF because they enable SSL on so many websites that wouldn't have it otherwise. I guess the fact that he gets a lot of their premium features for free helps as well.


The fact that CloudFlare has "Flexible SSL" leads me to believe they don't care about security at all. I've asked them to at least post a "CF-SSL: Flexible" header or similar to responses so someone could develop a Chrome extension to alert on this behavior.

They're presenting sites over HTTPS to visitors while tunneling the full request in plain text over HTTP across the internet back to the origin server. Because of this, web browsers like Chrome will allow access to sensitive APIs like Geolocation despite the end to end communication being insecure.

If I could, I'd personally blacklist all "Flexible SSL" sites from my daily browsing.


Just revoke CF's certificate on your local machine. If they won't tell you when they're being flexible you have to assume they're always being flexible.


That has undesired effects. I have a page that uses SSL CF<->server via their certificate. The certificate for the end user is still a CF certificate. The end user has no way of finding out if the connection to origin is encrypted.


Agreed. It's getting a bit much, almost every one of his blog posts has some advertisement for Cloudflare.

2 of Cloudflare's 3 SSL options generate an insecure setup yet look perfectly fine to the user with a shiny green padlock in the address bar. That really isn't acceptable.


In this case though, security concerns do not apply; the source data is secure, the website does not use authentication, and HTTPS adds an extra layer of security on any input (which is hashed and cropped clientside)


> on any input (which is hashed and cropped clientside)

You could still run a volume analysis on the reply. If it's very short, you can guess it's a 404. If it isn't, you might be able to get which prefix was queried... \end{tin foil hat}


That won't work since all possible 5 character prefixes return data.


They don't all return the same amount though. According to my Chrome here and now:

> https://api.pwnedpasswords.com/range/aaaaa = 28.8KB

> https://api.pwnedpasswords.com/range/01234 = 28.5KB

> https://api.pwnedpasswords.com/range/af0fa = 23.2KB


Isn't Chrome seeing the decrypted and decompressed response in DevTools? I'd be more interested in what you see in Wireshark.


Sadly not available on my current machine :(

The experiment is left as an exercise to the reader :)


Can you elaborate on that?


OP is probably referring to [1] or the general fact that they MITM most traffic that uses their services.

[1] https://blog.cloudflare.com/incident-report-on-memory-leak-c...


Tbf that's the actual service they offer. It's not like they are being sneaky.


To be sure, but anybody using CloudFlare needs to keep in the back of their head that everything going through CF is not only accessible to a hypothetical Bad Guy at CloudFlare it can also (and remember has, this is a real bug that happened) get exposed to unrelated parties on the Internet if CF mixes your data with somebody else's by mistake.

This makes CF seem fine for your different variants of popcorn.gif, your (subresource integrity checked) Javascript implementation of the VIC-20 computer, or a public blog post, and NOT so great for patient access to histology results, private web forums, banking, and many other things on the Web.


Of course they do that. How else would such a service possibly operate?


By offering two different models and letting you choose between them for each hostname under Cloudflare.

1) Injecting themselves on the application layer (the current model)

2) Injecting themselves on the network layer, passing through only encrypted traffic between the client and your servers

Not every feature they currently offer would still be possible in model #2, but most of them (including DDOS protection) can still work in a limited fashion.


Coming soon


That's very good news! The knowledge that a large part of my internet usage is MITM'ed by a single entity is continuously making me uncomfortable.


I love the k-Anonymity model. Makes it actually feasible to check passwords against HIBP when carrying out password audits for clients. Shameless plug but I've added it to my Active Directory audit password tool: https://github.com/eth0izzle/cracke-dit


Lots of suggestions in this thread about better architecture but they all seem to forget that this is designed to be minimal in cost, complexity and maintenance while delivering 100% availability and great performance.

While Redis or a VM would be faster, that's way more overhead compared to a few cloud functions and table storage. This whole thing is event-driven and easy to build with just your browser, along with having cheap and granular billing. Cloudflare also already caches the responses so there's really no need for the origin to be perfect.


Somewhat related (and nitpicky), but there are some spelling errors (derrivation, seperate...) on the Cloudflare post that explains k-anonymity:

https://blog.cloudflare.com/validating-leaked-passwords-with...

P.S.: As a non-native speaker I had to look those words up to check them, as I trusted the spelling from an official blog post.


Thanks. I'll get them corrected.


“dependended” is another.


Fixed


Here's a slightly more easily auditable version of the checker, in Python:

https://www.pastery.net/wwzqua/

The bash one was fine, I just prefer the readability of Python to make sure I know that only my truncated hash version is ever sent.


What bash checker are you referring to?



Wouldn't using a simple bloom filter make sense in their case? Just build the thing offline and your app loads it in RAM at startup.


Here's a quick table of sizes of the Bloom Filter with the FP rate

    False positives            Size (MB)
    0.1                              285
    0.01                             571
    0.001                            857
    0.0001                          1120
    0.00001                         1390
    0.000001                        1670
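
For reference, those sizes line up with the standard sizing formula for n = 500M entries:

  m = -n * ln(p) / (ln 2)^2  bits,  with k = (m/n) * ln 2 hash functions

e.g. p = 0.01 gives about 4.79 billion bits ≈ 571 MiB, with k ≈ 7 hashes per lookup.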


That's only if you're sending a bloom filter for the entire hash table though. If you just use the existing buckets you could reduce the API response sizes considerably.

The real problem is that the current API returns a count of the number of times each particular password has appeared, and AFAIK there's no good way to do that with a bloom filter.


Sure there is. Use a counting bloom filter.

In order to increase the count for a given key, you expand the key into your probe sequence, find the minimum counter across all of the probed locations, and then increment all of the values equal to that minimum value. One generally uses saturating addition to avoid overflow. If you want to support removing items, you actually increment the values at all of the probe locations, at a cost of increasing the expected deviation from correct counts.

To read the count for a given key, you generate your probe sequence and return the minimum of the values stored at the probed locations.

Note that a regular Bloom filter is just a counting Bloom filter using 1-bit counters.

In this case, I presume one would generate a counting bloom filter and then either store logarithm or quantile of the count, since there isn't a common use case for distinguishing between 4,096 and 4,095 occurrences of a given password.
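
A minimal sketch of that conservative-update counting filter (the double-hashing probe scheme here is my own simplification):

  // Counting Bloom filter with the "increment only the minimum"
  // (conservative update) rule described above.
  package main

  import (
      "fmt"
      "hash/fnv"
  )

  type CountingBloom struct {
      counters []uint8 // 8-bit saturating counters
      k        int     // probe locations per key
  }

  func NewCountingBloom(m, k int) *CountingBloom {
      return &CountingBloom{counters: make([]uint8, m), k: k}
  }

  // probe derives k positions from the key via double hashing.
  // Assumes the filter size fits in a uint32.
  func (cb *CountingBloom) probe(key string) []int {
      h := fnv.New64a()
      h.Write([]byte(key))
      sum := h.Sum64()
      h1, h2 := uint32(sum), uint32(sum>>32)|1 // odd step, never zero
      pos := make([]int, cb.k)
      for i := range pos {
          pos[i] = int((h1 + uint32(i)*h2) % uint32(len(cb.counters)))
      }
      return pos
  }

  // Add increments only the counters currently holding the minimum,
  // with saturating addition at 255.
  func (cb *CountingBloom) Add(key string) {
      pos := cb.probe(key)
      lowest := uint8(255)
      for _, p := range pos {
          if cb.counters[p] < lowest {
              lowest = cb.counters[p]
          }
      }
      if lowest == 255 {
          return // saturated
      }
      for _, p := range pos {
          if cb.counters[p] == lowest {
              cb.counters[p]++
          }
      }
  }

  // Count returns the estimate: the minimum over the probed counters,
  // which never undercounts but may overcount.
  func (cb *CountingBloom) Count(key string) uint8 {
      lowest := uint8(255)
      for _, p := range cb.probe(key) {
          if cb.counters[p] < lowest {
              lowest = cb.counters[p]
          }
      }
      return lowest
  }

  func main() {
      cb := NewCountingBloom(1<<20, 7) // real sizes would come from the table above
      cb.Add("password1")
      cb.Add("password1")
      fmt.Println(cb.Count("password1"), cb.Count("hunter2"))
  }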


I don't understand why everyone's obsessed with using a Bloom Filter for this. The median response size on the API is 12.2KB with 305 entries. A Bloom Filter with 305 entries and a 0.000001 false positive rate would be about 1KB. It seems to me you're introducing a lot of complexity (Bloom Filter vs. simple string match) and possibility of false positives to save 11KB.


A bloom filter can be pushed to clients. Everything is static and could be served from a CDN. If there is a hit you could then do a secondary request to perform an actual lookup. For passwords which have not been pwned there'd be a 100% savings on CPU.


True, although there's an interesting side effect. With the false positive rate tuned low any call to the actual API would likely be saying "My password is one of the ones you already know about" with quite high probability.


Presumably the use case is to reject either common or compromised passwords, and they'd like to eliminate a point of failure by not depending upon Troy's service being up / maintained forever.

Also, a Bloom filter has a smaller page cache footprint and disk space usage than using a precise set.


I've experimented with using a bloom filter for this dataset: https://github.com/jthomas/serverless-pwned-passwords


You'd have to either send the password, or something very easily derived from the password for this to work. Or have every client download the bloom filter data which is quite a lot of bandwidth by comparison.


what?

Just send a hash from the client and have the bloom filter built offline with the same hashes, no big deal. You'd never ask clients to download it.


What? Your approach would be highly susceptible to a rainbow table-like attack AND not allow for k-Anonymity.


I think so. Just set the bits and test them as needed.


Wouldn't be useful for the range queries, all possible prefixes return data


True, I didn't see they allow queries on ranges. I guess you might be able to build something on a Bloom filter trie for that, but that might be pushing it.

But for one of the calls (checking a whole password against what's stored), which probably accounts for most of the usage of the API, a bloom filter seems like a perfect fit.


Doesn't seem particularly necessary though yeah? Bloom filters are super useful when the cost of doing the computation for the actual lookup is high, but in this case it's a single key-value lookup. That's about as cheap as an operation can possibly get.


Anyone know, if we permute all 6-16 character length alphanumeric strings, how many would have their SHA-1 hash be a match for a given five character prefix?

I’m certainly not saying I think this is an issue! I’m just academically curious about the number and how to go about calculating it.


A SHA-1 digest is a 160-bit value, conventionally written as a 40-digit hexadecimal string. With 16 possible values per digit, that yields a total search space of 16^40 possible values.

If we slice off the first five digits (which Troy originally used as the database partition key), we get 1,048,576 possible five digit prefixes (16^5). Then we continue calculating with the sixth position as the new first position in the string.

The math from here is a straightforward function mapping 16^n -> 16^(n - 5):

* 16 six digit values match any given five digit partition key, p,

* 16^2 = 256 seven digit values match any p,

* 16^3 = 4,096 eight digit values match any p,

* 16^4 = 65,536 nine digit values match any p,

* 16^5 = 1,048,576 ten digit values match any p,

* 16^6 = 16,777,216 11 digit values match any p,

* 16^7 = 268,435,456 12 digit values match any p,

* 16^8 = 4,294,967,296 13 digit values match any p,

* 16^9 = 68,719,476,736 14 digit values match any p,

* 16^10 = 1,099,511,627,776 15 digit values match any p,

* 16^11 = 17,592,186,044,416 16 digit values match any p.

So in general, to calculate how many n digit SHA-1 digests correspond to any m digit prefix, we simply calculate (16^n)/(16^m), which yields 16^(n - m). Thus we have (16^[6..16]) / (16^5) for the five digit prefix case. Hopefully that elucidates it for you!


Shouldn't the size of the alphanumeric alphabet used for the passwords be in there somewhere? The question was about how many permutations of all 6-16 character alphanumeric strings map to a given 5 digit SHA-1 prefix.

Your answer seems to be for the case where the alphabet of the input string is hex digits.

I.e., I think we want Sum[A^i,{i,6,16}]/16^5, where A is the size of the alphanumeric alphabet.

For A=62, this is 4.6x10^22 or 2^75.3.

For A=95, this is 4.2x10^25 or 2^85.1.


Note that there's no need to actually perform N-M+1 exponentiations and N-M additions. Two exponentiations, a subtraction, and a division will give you the same answer.

  Sum[A^i,{i,0,N}] = (A^(N+1) - 1) / (A-1).
  Sum[A^i,{i,M,N}] = (A^(N+1) - A^M) / (A-1).
Consider A=10, M=2, N=6. Sum = (10,000,000 - 100) / 9 = (9,999,900) / 9 = 1,111,100, which is easily checked mentally.

Yes, I've stared at permutations for far too long, but I can't have been the first to notice the pattern. It must be in plenty of textbooks.

Another cute combinatorics question: using an alphabet of size A, generate the Nth largest M-character sequence where no two adjacent characters are equal. A simple count-and-check algorithm takes O(N) time, but there's a simple O(M) algorithm (that is, constant time if the number of characters is fixed), assuming constant-time basic arithmetic operations.
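
For what it's worth, here's my own take on that O(M) construction in Go, 0-indexed and counting from the lexicographically smallest (the "largest" variant is just N -> total-1-N):

  // N-th length-M string over an alphabet of size A with no two adjacent
  // characters equal, in lexicographic order. Total count: A * (A-1)^(M-1).
  // The 'a'+index mapping below assumes A <= 26.
  package main

  import "fmt"

  func nthNoAdjacentRepeat(A, M int, N uint64) string {
      // block = number of valid completions once a position is fixed
      block := uint64(1)
      for i := 0; i < M-1; i++ {
          block *= uint64(A - 1)
      }
      out := make([]byte, 0, M)
      prev := int(N / block) // first character: A choices
      N %= block
      out = append(out, 'a'+byte(prev))
      for pos := 1; pos < M; pos++ {
          block /= uint64(A - 1)
          c := int(N / block) // A-1 choices at every later position
          N %= block
          if c >= prev {
              c++ // skip the previous character
          }
          out = append(out, 'a'+byte(c))
          prev = c
      }
      return string(out)
  }

  func main() {
      // All 6 valid strings for A=3, M=2: ab ac ba bc ca cb
      for n := uint64(0); n < 6; n++ {
          fmt.Print(nthNoAdjacentRepeat(3, 2, n), " ")
      }
      fmt.Println()
  }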


If by "alphanumeric" you mean [0-9a-zA-Z], that's an alphabet of 2 * 26+10=62 characters. Since I'll be doing this in my head, I'll add two punctation characters so we end up at 64=2^6.

Now, I assume you mean "five character prefix of the hex-encoded 160 bit hash"? One hex digit is half a byte; a nibble or 4 bits. 5 * 4=20 bits, leaves 160-20=140 bit hash. Which can represent 2^140 different values.

For a password of 1 to 16 characters from our alphabet, we have 64^16 passwords. The five character prefixes account for 64^5 of those.

64^16=2^(6 * 16)=2^96

64^5=2^30.

(2^96-2^30) < 2^140.

So, if I understood your question: all of them could (should) have unique representations in the remaining hash.

But I think you meant how many would collide in the prefix?

That'd be (2^96-2^30)/(2^20).

I think, it's getting to be bed time. But at least I should be sufficiently wrong for someone to correct me ;)


In theory, if sha-1 were a "good" hash function in the sense of evenly distributed, then if there are 2^20 prefixes then each would get 1/2^20 fraction of all the hashes.


...or CloudFlare/CloudFront plus a DynamoDB table with a primary key of the first/last n characters of the hash, with a potential secondary index for filtering.

Btw, indexing can be done cheaper with Google I think.

There is also another way (probably better) and that is to use S3. 1TB can be stored for as little as $20 - the rest is endpoint caching.

Luckily for all of us it is easier than ever to single-handedly scale to millions of users at minimal cost.


You'd be paying a lot for bandwidth on S3 and Cloudfront.


Minor complaint: Start the blog post with a link to the service you are talking about. I actually had trouble finding it.


It's two clicks away:

* first paragraph links to https://www.troyhunt.com/ive-just-launched-pwned-passwords-v... ;

* first paragraph of that links to https://haveibeenpwned.com/.


That's the wrong link though. I eventually found it at https://haveibeenpwned.com/Passwords


TL;DR why brotli is HTTPS-only: some middle-boxes mangle responses with content encodings they don't know.


What does he do with the logs of all the passwords submitted in searches?


> What does he do with the logs of all the passwords submitted in searches?

> imagine if you wanted to check whether the password "P@ssw0rd" exists in the data set. [...] The SHA-1 hash of that string is "21BD12DC183F740EE76F27B78EB39C8AD972A757" so what we're going to do is take just the first 5 characters, in this case that means "21BD1". That gets sent to the Pwned Passwords API and it responds with 475 hash suffixes (that is everything after "21BD1") and a count of how many times the original password has been seen.


What about the logs from queries submitted via the HIBP website form?

"Another idea I'm toying with is to use the Cloudflare Workers John mentioned earlier to plug directly into Blob Storage. Content there can be accessed easily enough over HTTP (that's where you download the full 500M Pwned Password list from) and it could take out that Azure Function layer altogether. That's something I'll investigate further a little later on as it has to potential to bring cost down further whilst pumping up performance."

How to read this? The full list will be downloadable? Users can do queries locally on the 500M file instead of over the internet? It would be nice to avoid having to submit queries over an untrusted network (the internet), but I doubt that is what is being considered in this paragraph.


The form on HIBP uses the same JS client hashing, you can check the HTTP requests yourself in dev tools.

Yes, the whole dataset is available. The first paragraph mentions the release of the v2 dataset and you can read the full blog post here: https://www.troyhunt.com/ive-just-launched-pwned-passwords-v...

You can get the 8.8gb file directly here: https://haveibeenpwned.com/Passwords


Thanks for the answer.

That page acknowledges the issue, which is all I was curious about:

"Getting back to the online search, being conscious of not wanting to send the wrong message to people, immediately before the search box I put a very clear, very bold message: "Do not send any password you actively use to a third-party service - even this one!""


How am I supposed to pronounced 'pwned'?

Can't we find something other than 4chan language to describe this?


I'm not a fan of the term but "pwned" predates 4chan.

http://knowyourmeme.com/memes/owned-pwned


‘owned’ with a P on the front.

At this point the term pwned seems far too well known (in the industry) to change it.

For example, it seems to be in my iPhone’s internal dictionary.



Powned. Literally owned then add a hard p


That password header warning is the coolest security improvement I've seen online.


Was expecting an Algolia ad but was delightfully surprised.


Starting to get a little uncomfortable with how hard this is being pushed on HN right now.

https://hn.algolia.com/?query=troyhunt.com&sort=byDate&prefi...


It's good to be skeptical, but everything I've read so far makes me believe Hunt's motives are good. His efforts are beneficial to HN readers and beyond.

This search shows four results within the last week and then it starts to drop off. How different are these search results from other popular sites? Ars Technica, Techcrunch, Anandtech, Bloomberg, LWN? Even if you focus on individuals, maybe it's not too different from the articles of Bruce Schneier, ESR, Linus, Theo de Raadt, etc?


>How different are these search results from other popular sites?

Generally the sites you listed have unrelated articles posted which are all on the subject of some distinct topic. The articles posted here in the past week have all been about the compromised password tool.



I'm not sure if you're just being snarky or not, but use the link I gave in my comment. There are 3 articles this week about the "have I been pwned" tool.


There are 3 distinct blog posts about the data, and how it is enabled. All of which are interesting and relevant in different ways. I don't see any issue with them all showing up.


CloudFlare can read every password submitted through their service and here is why that's so great...

It's beautifully elegant, because...

What? This is also the same company that spilled memory all over every cache everywhere.


Nope. 100% incorrect, Troy's service is using Cloudflare but you don't send the password to his API: https://blog.cloudflare.com/validating-leaked-passwords-with...


My apologies, I was wrong. I was assuming that you were terminating ssl and therefore in a position to read the plaintext of user requests.

This would allow you to simply look up whether the password was pwned based on form submits.

[edit] ok, I understand what you are saying, Troy's service allows some password privacy due to api design and happens to use cloudflare stuff. Sorry I was so slow.

I was conflating two unrelated things; a) what happens when you submit a password through cloudflare and b) what happens when you happen to use a specific password checking api which uses cloudflare.


I guess you didn't read his note at the beginning of the article that you should read his previous posts on the subject. The whole point is that you give only a part (the first 5 characters) of the hash (SHA-1 in this case, if I'm not mistaken?) to the API, and you are then returned a list of hashes with the same first characters; then you check that response against the hash of the actual password.

In effect, Troy's service never knows the password. It doesn't even know if it existed in the list you received.


Troy's API doesn't work like that. Doesn't need to send the password at all.


What do you mean? It still sends a portion of the hash, not the whole password, but that's still some information, no? An attacker brute-forcing hashes doesn't care if he finds the exact hash or a set of a thousand he can try. In fact, since the database is public, if I have your range and the password is in it, I only have to try a few hundred passwords at most to find which one yours is. Granted, I also need to know which service you're using this on, but that doesn't seem that hard to find out.


I think the scenario you are describing is the following: the attacker somehow gets the hash submitted to the API and then runs JtR or similar to crack the password.

That doesn't work because only the first five characters of the hash are sent to the API. The API then returns a list of all the hashes that have that prefix that it knows about. The client then (privately) is able to see if their password hash is in that list.


No, it's more like "attacker gets the range, sends it to the API, gets all the possible hashes from the API, tries the corresponding plaintext passwords, thus needs only ~200 tries instead of X million".

Granted, your password was compromised already if it was in that list, so it needs to be changed anyway, but there's some information revealed by the hash (of course, how else could it be checked?).


But the whole point of the api is that you don't use the password if it is found in the list. So the attacker will know about the 200 strings that are definitely not the password.


Assuming you managed to instantly change the password on every single service you've used it, yes.


Nobody is talking about it, but to compromise someone, you also need the username, which is not sent.


Ah, that's true. In any case, I don't think this is really a reasonable vector of attack, I was just being pedantic at jgc's "the password is not sent" comment.


Got it. That makes sense.


Wonderful, thank you [edit] thinking this through.... but regardless of how the api works, is it not possible that you (cloudflare) could just have the list and check yourselves since you know the submitted password?

It could be a value added service for all your customers.

[Sorry for the late edit, not being evil here].


If you read the details of Troy's API you'll see that at no time does the password or even a hash of the password leave the computer of the person using the API.


Yep, I was sort of on the wrong tack. The api is fine, it just seems like an unnecessary dance for cloudflare integrated services since you have the password anyway.


The entire dataset is available for download, so everyone can do the check themselves if they want to.


You send a truncated digest of the password, which was obtained using a somewhat dubious hash function. I guess there are currently no known preimage attacks on this scheme, but revealing this information is more than nothing.


This is slightly OT and probably not a popular opinion, but does anyone else feel that Troy having this massive dataset of emails is unethical?

I definitely believe it is illegal and was surprised that during his recent visit to the US that the FBI did not arrest him.


Simply possessing the information (which is already freely available online) is not on its own unethical. It depends entirely upon what he does with the information, which in this case is protect the owners from malicious individuals who also have access to the information because it's freely available online.


It is if it's stolen property... which most of that data is, since it belonged to the companies that were hacked.


I'm not going to pretend to understand the legal niceties here, but the analogy to "stolen property" rings false because 1) he's not depriving anybody of their passwords 2) the very fact they've been leaked means they aren't really of value anymore.


Strongly no.

In a better world the existence of such a dataset would be unnecessary.

In the less ideal one you have to look at its existence in a broader context.

- it provides a clearly beneficial service to owners of those email accounts bridging the gap in the legal systems (disclosure requirements)

- it's exposed in a way that keeps the risk of 3rd party exploitation low: it doesn't leak information unless you have access to the email account

- it has not to my knowledge been abused in any way and public is keeping check on that - there would have been an outcry that would not go unnoticed

- it's less of a high value target than you'd think - it's built from data that is already in the wild and could be pieced together by a sufficiently motivated actor, especially in the light of the recent combined lists with hundreds of millions of emails surfacing..

Quite frankly if you look at just that it scores better than most datasets gathered by websites requiring user registration.

The only thing missing here really is the consent to be included in that list, but given the sources of the data and the points above I find lack of it more than sufficiently justified.


You jumped to illegal before even explaining why it's unethical.


In the UK (where Troy is not based), I guess a prosecutor might try and build a case around either:

- Section 58 of the Terrorism Act[0] which makes it illegal to possess information "likely to be useful to a person committing or preparing an act of terrorism", though the law provides "reasonable excuse" as an explicit defence.

- Section 3A of the Computer Misuse Act[1] "making, supplying or obtaining articles for use in unauthorised access to computer material" [my paraphrasing]

It feels like Troy has a pretty strong defence of "reasonable excuse" to my lay understanding though.

[0] https://www.legislation.gov.uk/ukpga/2000/11/section/58 [1] https://www.cps.gov.uk/legal-guidance/cybercrime-prosecution...


Curious which specific US law(s) it violates. Sounds like he downloaded a few torrents and built a search interface.


During his recent visit to the US he testified before Congress on issues regarding data breaches. That would have been one hell of a bait and switch.


He has written about that very problem here: https://www.troyhunt.com/the-ethics-of-running-a-data-breach...


Charging people to access my stolen info seems like it is less of a service and more about Troy than anything else.


From what I understand, the only people that are charged access to the service are large organizations that want callbacks when new data that matches their users is entered. This seems like a very reasonable balance between making money to pay for hosting related fees, and providing a free and publicly available service.



These passwords were already publicly available, some with the email address associated with them.



