Why Searching Through 500M Pwned Passwords Is So Quick (troyhunt.com)
304 points by pstadler on Feb 27, 2018 | 172 comments



Ummm... because it’s an O(1) array lookup not a search at all? Infuriating.

It’s read-only static data. Spending even 60ms on the response is ridiculous. Reading from files in blob storage... WTF?

Ctrl-F Redis - was disappointed.

Actually, even forget Redis. Pre-generate each of the 1 million possible HTTP responses and store in a string array. The 5 character hex is the index into the array. Write < 100 lines of Go to load the data structure and serve it. What am I missing?
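
Something like this, maybe (a rough, hypothetical sketch; it assumes the 16^5 response bodies have already been pre-generated into responses/00000.txt ... responses/fffff.txt):

  // Hypothetical sketch of the "array of pre-generated responses" idea.
  // Assumes one pre-built text body per 5-character hex prefix on disk.
  package main

  import (
      "fmt"
      "log"
      "net/http"
      "os"
      "strconv"
      "strings"
  )

  var responses [1 << 20]string // 16^5 = 1,048,576 pre-generated bodies

  func main() {
      for i := range responses {
          prefix := fmt.Sprintf("%05x", i)
          body, err := os.ReadFile("responses/" + prefix + ".txt")
          if err != nil {
              log.Fatal(err)
          }
          responses[i] = string(body)
      }
      http.HandleFunc("/range/", func(w http.ResponseWriter, r *http.Request) {
          prefix := strings.ToLower(strings.TrimPrefix(r.URL.Path, "/range/"))
          idx, err := strconv.ParseUint(prefix, 16, 64)
          if err != nil || len(prefix) != 5 {
              http.Error(w, "bad prefix", http.StatusBadRequest)
              return
          }
          fmt.Fprint(w, responses[idx]) // O(1) lookup, no search
      })
      log.Fatal(http.ListenAndServe(":8080", nil))
  }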

This is like “Hello World” in those HTTP Framework Benchmarks that used to make the rounds every few months.


The 250MB Redis instance on Azure costs $0.055 an hour[1], so nearly $500 per year.

Going down the route of hosting a 100-line Go program, the cheapest "B1" instance type with 1GiB of RAM costs $105.12 a year, and then if you want the service to be HA you probably need another instance in a different zone, maybe a load balancer as well.

I suspect Azure Functions + Blob Storage costs a lot less. Although he didn't mention how much CloudFlare was costing him.

[1] https://azure.microsoft.com/en-us/pricing/details/cache/ [2] https://azure.microsoft.com/en-us/pricing/details/virtual-ma...


You could host this whole thing on the cheapest Digitalocean droplet if it were written as the person you're replying to described.


And then it's a single server that has none of the scaling, ease of maintenance, or high availability of just using functions and table storage. This is the perfect scenario for serverless, why would you want to run a VM instead?

The current system works great, most of the hits will be cached, and optimizing the origin to shave off a few milliseconds isn't worth it at all.


It's actually NOT a perfect scenario for serverless, because it's a read-only (no writes), simple search "web app". The one thing in your list that you don't get from a VM is high availability, I'll give you that, but otherwise this solution is an over-engineered monstrosity compared to solving this simple problem in a more traditional way, as many many other people are calling out here.

If the only reason your web app is fast is when its data is cached, you need to take a long hard look at how your web app is built. It's not that hard to build a performant query system like this even with that amount of data.


Read vs write has nothing to do with serverless, it's about well-encapsulated logic that doesn't need any server management and has granular pay-per-use billing.

It doesn't get much simpler than a function that does a key/value lookup or returns a text file; what exactly is over-engineered about that? The app is fast, it's < 100ms which is perfectly fine for non-cached access. This isn't mission critical real-time, it's a completely free service to lookup a password hash.

Everyone here is caught up in the usual "we can do better" perspective without actually thinking about why it was done this way. In reality the current method is cheaper, more reliable, more scalable and has 0 maintenance compared to anything else suggested so far.


I'm guessing he didn't do that because he's a prominent .NET/Azure guy and I don't think that stuff is well-supported in that environment.


He mentioned that Cloudflare is free for him.


Precisely my first thought. Static readonly data? Why isn't this just sitting in memory?


Don't filesystem caches basically do this for you?


Yes, but with a heck of a lot more overhead than:

  response.body = result[(int)hexCode];


No, with a lot less overhead because you don't need to write custom web server code.

Edit: and if he points the requests directly at blob storage as he's suggesting, we're very close to flat files on a filesystem again.


You can do that after mmap(2) and you never have to reach out for read(2). Might need to use unsafe language constructs though.
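
In Go, for instance, the mapped file comes back as an ordinary byte slice, so no unsafe is needed. A sketch, under a big assumption about the data layout:

  // Sketch only: assumes the dump has been preprocessed into a file of raw,
  // sorted, fixed-width 20-byte SHA-1 digests (not the published hash:count
  // text format), so every record sits at a fixed offset.
  package main

  import (
      "bytes"
      "encoding/hex"
      "fmt"
      "os"
      "sort"
      "syscall"
  )

  func main() {
      f, err := os.Open("pwned-sha1.bin")
      if err != nil {
          panic(err)
      }
      defer f.Close()
      fi, err := f.Stat()
      if err != nil {
          panic(err)
      }

      // The kernel's page cache backs this slice; lookups never issue read(2).
      data, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()),
          syscall.PROT_READ, syscall.MAP_SHARED)
      if err != nil {
          panic(err)
      }
      defer syscall.Munmap(data)

      const rec = 20 // bytes per SHA-1 digest
      n := len(data) / rec
      // SHA-1 of "P@ssw0rd", per the article's example.
      target, _ := hex.DecodeString("21BD12DC183F740EE76F27B78EB39C8AD972A757")

      // Plain binary search over the fixed-width records.
      i := sort.Search(n, func(i int) bool {
          return bytes.Compare(data[i*rec:(i+1)*rec], target) >= 0
      })
      fmt.Println("pwned:", i < n && bytes.Equal(data[i*rec:(i+1)*rec], target))
  }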


Sounds reasonable, 500mil * 160 bits of SHA-1 would be 500 mb of space. Even with some metadata to manage it, the smallest instances should be able to handle that.


500,000,000 × 160 bits / (8 bits/byte) = 10,000,000,000 bytes ≈ 9,537 MiB ≈ 9.31 GiB


I think this solution is banking on storage + azure function cost being cheaper than the per instance cost to service the same load with an acceptable latency.


Couldn't he run a simple KV Consul cluster in a few zones and regions in either AWS or Google Cloud and be done?


That's ridiculously complex and more expensive than just writing two functions that call the table storage and blob storage APIs. Given the popularity of serverless, I'm surprised by all these answers saying he should run actual VMs when this is the perfect use-case for functions.


For an extreme diet, I'd use the HTML application cache for the pretty pictures. I know it's deprecated, but it still has massive support in all browsers and is instantaneously fast. That removes all origin hits except the API requests.

On the lookups, Cloudflare Workers have a global state that is shared amongst nodes. I've been playing with that. If you hold the string array in that and can get origin hits below 10%, you could run this service with many millions of requests a day practically for free. On a VPS though, not Amazon.


Just putting this on a vps or cheap dedicated server with 16gb ram would've led to sub-ms response times at much lower costs (if you don't get Azure and Cloudflare for free like him). At those response speeds, scalability is also not really an issue if you cache aggressively at the edge.

Argo is then nice to have but not really necessary. If the server responds in <1ms, those 30% saved on the RTT are probably not detectable for the user.


I'm not sure that you quite realize that Troy's intention here is to eventually be soaking a whole lot of zeroes with this. Add to that that, if people are relying on this as a service, it had better not go down under any remotely plausible circumstances, and some growth-hacker's dedicated server from Hetzner or whatever is unsuitable. Edge caching only blunts this sufficiently if everybody's searching for the same passwords, and proving that that's the case is a burden you have not shouldered.

To that end, "just use a VPS or a cheap dedicated server" sounds a whole lot like the middlebrow thing we're supposed to not be fans of around here.


> Troy's intention here is to eventually be soaking a whole lot of zeroes this this

huh?


Sorry, shorthand common in my neck of the woods. Used for any Thing per Your Favorite Time Unit, measured logarithmically. In this case, requests-per-second.

(Edit: ah, typo. s/this this/with this. Clearer?)


@pbhjpbhj - I believe the meaning of the sentence is that Troy Hunt is expecting/planning for a few orders of magnitude of growth.

"Soak a few zeros" for example going from 1,000 calls per second to 1,000,000 calls per second.


what an odd way to say that.


"Soak" is often used for "take" in engineering circles. "Zeroes" is often used for orders of magnitude in engineering circles. I've never before heard these put together in that fashion, however.


Sorry, I still don't understand it.

Could you write out deelowe's quoted sentence, above, as clearly as you're able?


> I'm not sure that you quite realize that Troy's intention here is to eventually be soaking a whole lot of zeroes with this.

Troy's goal is to get this used by a lot of people with a large number of requests per second (so, a number with a lot of zeros in it)


def not clearer. I know we are all supposed to be smart here, but maybe use more common words and language. People come from a lot of different backgrounds here.


On the other hand, I'm happy to learn a new fun term. Language is boring without aphorisms.

Learning 'soak a few zeros' is more valuable to me than the whole rest of this thread. I doubt I'll ever use it, but I like it.


> Edge caching only blunts this sufficiently if everybody's searching for the same passwords

AFAIK caching already kicks in when people check different passwords that have the same first five characters in their hash.


Sorry, you're right, I should have been clearer. The bigger concern is that caching doesn't replace fault tolerance and rolling your own disaster recovery. I didn't mean to suggest that caching was not being done or would not be helpful. A million keys would be pretty uniformly dispersed; intuitively it seems like you'd need a lot more traffic than he's currently got before caching reliably soaks enough load off of "a VPS or cheap dedicated server" that you can afford to have the server blow up--because servers blow up.

"A VPS or cheap dedicated server" is YOLO stuff. Not something you do when you want other people to rely on you.


You can set up three VPSes and use Cloudflare's load balancing or a solution by someone like DigitalOcean? That'll still be cheaper than using Lambdas and blob storage.


If you use Cloudflare, they should still handle attacks and caching. For the origin server, I can't imagine any scenario where you'd get a lot of growth from the immediate attention. How are more than a few thousand requests per second realistic? It's not as if the whole world would suddenly constantly check passwords. And if you use his API for 3rd party services it'd make more sense to open source it so that others can use a local copy.

> To that end, "just use a VPS or a cheap dedicated server" sounds a whole lot like the middlebrow thing we're supposed to not be fans of around here.

I know that we're all supposed to be fans of AWS and everything as a service. But in reality I've found simple VPSes (can be Hetzner, OVH, AWS, Google, whatever) to be faster, cheaper and less prone to downtime (as most downtime comes from software rather than hardware anyway).


Couldn't he just pregenerate all 1,048,576 responses, load them somewhere in RAM (or just a bunch of static files on an nginx with caching on) and be done with it? I mean he writes that a single response, gzipped, averages 10KB, so that's only around 10GB in total.

Even better: host this on a service like Netlify and not even have the 30 day cache timeout Troy has here (which means 30 days old info in case of new breaches). Just regenerate the entire set on the dev box whenever there's a new breach (should be fast enough, it's a linear search & split) and push it to Netlify, it'll invalidate all edge caches automatically.
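
The regenerate step really is just a linear scan and split; a rough sketch (the input file name is hypothetical, and it assumes the dump variant that's sorted by hash):

  // Sketch of the "linear scan & split" step: turn the single ordered hash
  // dump (lines like "HASH:count") into one response file per 5-character
  // prefix, ready to push to any static host.
  package main

  import (
      "bufio"
      "log"
      "os"
  )

  func main() {
      in, err := os.Open("pwned-passwords-ordered-by-hash.txt")
      if err != nil {
          log.Fatal(err)
      }
      defer in.Close()
      if err := os.MkdirAll("responses", 0o755); err != nil {
          log.Fatal(err)
      }

      var out *os.File
      current := ""
      scanner := bufio.NewScanner(in)
      for scanner.Scan() {
          line := scanner.Text() // "HASH:count"
          if len(line) < 6 {
              continue
          }
          prefix, suffix := line[:5], line[5:]
          if prefix != current {
              if out != nil {
                  out.Close()
              }
              if out, err = os.Create("responses/" + prefix + ".txt"); err != nil {
                  log.Fatal(err)
              }
              current = prefix
          }
          out.WriteString(suffix + "\n") // API responses only contain the suffix
      }
      if out != nil {
          out.Close()
      }
      if err := scanner.Err(); err != nil {
          log.Fatal(err)
      }
  }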


The breach data isn't updated that often. It was 6 months between v1 and v2, so a long cache is totally fine. And then invalidate the cache if there is an update.


That.. sort of supports my argument :-) He could cache it indefinitely instead of 30 days if he's going to invalidate the cache anyway.


I do appear to have misinterpreted what your point was. Sorry about that. Although I assume the idea was limiting how much is cached at any point on their part. Not sure it really matters though, it's not that much data.


It sounds like he did pregenerate all the responses, then stored them in Azure Blob Storage. From the article I can't tell why he even needs to use Azure Function. Azure Blob Storage can be configured for public access over HTTP [0], so he could point Cloudflare directly at it.

[0] https://docs.microsoft.com/en-us/azure/architecture/patterns...


The function takes in the hash prefix and sends the correct file in response. That decision has to be made somewhere.


I'm saying that, at least as described in the blog post, the Azure function is unnecessary because that decision can be made on the frontend (or in the API client).

Here is how it sounds like it works now when someone visits https://haveibeenpwned.com/Passwords

1) User enters abc123 in the form

2) The frontend translates that to the hash prefix 61ee8

3) The frontend interpolates that into the URL for the function, https://api.pwnedpasswords.com/range/{hashPrefix}

4) The frontend requests https://api.pwnedpasswords.com/range/61ee8

5) The Azure function interpolates the hash prefix into the URL for the response in blob storage, something like http://pwnedpasswords.blob.core.windows.net/foo/{hashPrefix}...

6) The Azure function requests http://pwnedpasswords.blob.core.windows.net/foo/61ee8.txt

7) The Azure function returns that to the frontend

Here is what I was proposing

1) User enters abc123 in the form

2) The frontend hashes that to 61ee8

3) The frontend interpolates that into the URL for the response in blob storage, something like http://pwnedpasswords.blob.core.windows.net/foo/{hashPrefix}...

4) The frontend requests http://pwnedpasswords.blob.core.windows.net/foo/61ee8.txt
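
The client-side code is the same either way; here's a sketch in Go against the existing public endpoint (under the proposal you'd only swap baseURL for the blob storage address):

  // Sketch of the client side of the flow above. Only the 5-character prefix
  // of the SHA-1 hash ever leaves the machine; under the proposal, baseURL
  // would simply point at blob storage instead of the Azure Function.
  package main

  import (
      "bufio"
      "crypto/sha1"
      "fmt"
      "net/http"
      "strings"
  )

  const baseURL = "https://api.pwnedpasswords.com/range/"

  func pwnedCount(password string) (int, error) {
      sum := sha1.Sum([]byte(password))
      hexSum := fmt.Sprintf("%X", sum[:])
      prefix, suffix := hexSum[:5], hexSum[5:]

      resp, err := http.Get(baseURL + prefix)
      if err != nil {
          return 0, err
      }
      defer resp.Body.Close()

      // Each response line is "SUFFIX:count"; match ours locally.
      scanner := bufio.NewScanner(resp.Body)
      for scanner.Scan() {
          parts := strings.SplitN(strings.TrimSpace(scanner.Text()), ":", 2)
          if len(parts) == 2 && strings.EqualFold(parts[0], suffix) {
              var n int
              fmt.Sscan(parts[1], &n)
              return n, nil
          }
      }
      return 0, scanner.Err()
  }

  func main() {
      n, err := pwnedCount("abc123")
      fmt.Println(n, err)
  }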


Sure that works for that site, but it's also a public API that can be used by any client so it's easier to just give out a single URL rather than point to files. The HIBP website just uses the same API instead of doing different logic.


When you don't need transactions, don't have a cache invalidation problem, and are querying read-only data, then this architecture - or really any architecture that takes advantage of what makes HTTP scalable, mostly idempotent, cacheable responses to GETs - makes sense.


I do think it is a bit disingenuous that the author compares this to a classic web application and its data needs, when the needs here are so trivial. It is very well designed for what it does (the security needs of such a checker almost dictate the design), but it is a model usable by very few applications.

As an aside, the article talks a bit about Brotli, and it's worth noting that Brotli is essentially LZ77 pre-seeded with a ~119KB static dictionary of commonly seen web text. It is of course going to be fantastic for compressing an HTML document, where much of the content is verbose and common, but would do nothing above gzip for the hash result data. I would be surprised if it yielded a single byte of savings in that case. Brotli is supported by most browsers because it was snuck into the WOFF 2.0 standard, so browsers that support the new web font standard automatically have to support Brotli.

https://dennisforbes.ca/index.php/2016/01/28/eat-your-brotli...


Depends on whether you compare this to what a classic web application should be doing versus what it actually does.


But it looks like IIS doesn't support brotli yet. Which also means it is probably not supported in azure either.


Not to mention that the dataset is relatively small.

A 16GB of ram server could basically serve responses sub-millisecond at line rate for like $25/month.


Where are you getting 16GB ram for $25 a month? Would be pretty handy.


I currently have 2TB HDD, 16GB RAM, infinite bandwidth (100Mbps U/D) for 19.19€/month IIRC at Kimsufi [0]

[0] https://www.kimsufi.com


And that pitches fault tolerance and replicability out the window unless nontrivial developer work is dumped into it. If the disk goes farming in your Kimsufi instance, you are looking at a bad day.


That really wasn't the point of my reply.

Sure, if the machine dies, my services die. But it's still dirt cheap.


A dirt bike is cheap, too. It's not a car. It's definitely not a good car. It doesn't have much to help you when you hit a tree.


I pay 30 USD for 16 gig RAM, 6 TB HDD and a high-end i7. It's cheaper than buying the hardware and running it at my home. EC2 would cost me $130 just for the instance, excluding the 6TB space. So far my server has more than 1000 days uptime, 0 problems. Also if you have 1 instance in a cloud or 1 instance on a dedi you should still have backup plans in place. Cloud is good for variable load, definitely not for constant load. Also you can get dedis with SAN storage if all you care about is storage.


and 200-400ms latency to not-Europe.


210ms from South Africa, 91ms from NYC, 198ms from India, 360ms from Melbourne, 55ms from Morocco.

It's really not that bad...


A DNS request/response, TCP handshake, SSL handshake and last but not least an HTTP request/response is an awful lot of 210ms round trips.
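
Roughly, assuming a cold connection with TLS 1.2, no session resumption and a far-away resolver: 1 RTT for DNS + 1 for the TCP handshake + 2 for the TLS handshake + 1 for the request/response ≈ 5 × 210 ms ≈ 1 s before the first byte arrives. TLS 1.3 or session resumption shaves one of those off.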



A Hetzner CX41 is 16GB RAM @ €15.90/mo (about 18 usd) https://www.hetzner.com/cloud


Surprised Troy is so pro-cloudflare. I feel like they create a lot of security headaches.


Troy is clearly very pro Cloudflare and Azure. I’m only familiar with them at a high level and it would be great to hear counterpoints to his praise and get (at least the impression of) a debate that encompasses their pros and cons.

+1 for being interested in expanding this


Basically they add yet another layer of centralization and they're a MITM for all your traffic, so they're an easy target for surveillance. If only they weren't so convenient... I use their DNS service because it's top notch, but I also tend to enable their MITM service because their CDN nodes are super fast and require just a single toggle to enable. They read my far-future expires headers and just cache the content on the edge, no extra work needed.


Don't know about Cloudflare, but as a Microsoft Regional Director he's certainly pro-Azure.


Such a weird title; it makes me think he works for them, but he doesn't



didn't he work for microsoft previously?



He says that he supports CF because they enable SSL on so many websites that wouldn't have it otherwise. I guess the fact that he gets a lot of their premium features for free helps as well.


The fact that CloudFlare has "Flexible SSL" leads me to believe they don't care about security at all. I've asked them to at least post a "CF-SSL: Flexible" header or similar to responses so someone could develop a Chrome extension to alert on this behavior.

They're presenting sites over HTTPS to visitors while tunneling the full request in plain text over HTTP across the internet back to the origin server. Because of this, web browsers like Chrome will allow access to sensitive APIs like Geolocation despite the end to end communication being insecure.

If I could, I'd personally blacklist all "Flexible SSL" sites from my daily browsing.


Just revoke CF's certificate on your local machine. If they won't tell you when they're being flexible you have to assume they're always being flexible.


That has undesired effects. I have a page that uses SSL CF<->server via their certificate. The certificate for the end user is still a CF certificate. The end user has no way of finding out if the connection to origin is encrypted.


Agreed. It's getting a bit much, almost every one of his blog posts has some advertisement for Cloudflare.

2 of Cloudflare's 3 SSL options generate an insecure setup yet look perfectly fine to the user with a shiny green padlock in the address bar. That really isn't acceptable.


In this case though, security concerns do not apply; the source data is secure, the website does not use authentication, and HTTPS adds an extra layer of security on any input (which is hashed and cropped clientside)


> on any input (which is hashed and cropped clientside)

You could still run a volume analysis on the reply. If it's very short, you can guess it's a 404. If it isn't, you might be able to get which prefix was queried... \end{tin foil hat}


That won't work since all possible 5 character prefixes return data.


They don't all return the same amount though. According to my Chrome here and now:

> https://api.pwnedpasswords.com/range/aaaaa = 28.8KB

> https://api.pwnedpasswords.com/range/01234 = 28.5KB

> https://api.pwnedpasswords.com/range/af0fa = 23.2KB


Isn't Chrome seeing the decrypted and decompressed response in DevTools? I'd be more interested in what you see in Wireshark.


Sadly not available on my current machine :(

The experiment is left as an exercise to the reader :)


Can you elaborate on that?


OP is probably referring to [1] or the general fact that they MITM most traffic that uses their services.

[1] https://blog.cloudflare.com/incident-report-on-memory-leak-c...


Tbf that's the actual service they offer. It's not like they are being sneaky.


To be sure, but anybody using CloudFlare needs to keep in the back of their head that everything going through CF is not only accessible to a hypothetical Bad Guy at CloudFlare it can also (and remember has, this is a real bug that happened) get exposed to unrelated parties on the Internet if CF mixes your data with somebody else's by mistake.

This makes CF seem fine for your different variants of popcorn.gif, your (subresource integrity checked) Javascript implementation of the VIC-20 computer, or a public blog post, and NOT so great for patient access to histology results, private web forums, banking, and many other things on the Web.


Of course they do that. How else would such a service possibly operate?


By offering two different models and letting you choose between them for each hostname under Cloudflare.

1) Injecting themselves on the application layer (the current model)

2) Injecting themselves on the network layer, passing through only encrypted traffic between the client and your servers

Not every feature they currently offer would still be possible in model #2, but most of them (including DDOS protection) can still work in a limited fashion.


Coming soon


That's very good news! The knowledge that a large part of my internet usage is MITM'ed by a single entity is continuously making me uncomfortable.


I love the k-Anonymity model. Makes it actually feasible to check passwords against HIBP when carrying out password audits for clients. Shameless plug but I've added it to my Active Directory audit password tool: https://github.com/eth0izzle/cracke-dit


Lots of suggestions in this thread about better architecture but they all seem to forget that this is designed to be minimal in cost, complexity and maintenance while delivering 100% availability and great performance.

While Redis or a VM would be faster, that's way more overhead compared to a few cloud functions and table storage. This whole thing is event-driven and easy to build with just your browser, along with having cheap and granular billing. Cloudflare also already caches the responses so there's really no need for the origin to be perfect.


Somewhat related (and nitpicky), but there are some spelling errors (derrivation, seperate...) on the Cloudflare post that explains k-anonymity:

https://blog.cloudflare.com/validating-leaked-passwords-with...

P.S.: As a non-native speaker I had to look those words up to check them, as I trusted the spelling from an official blog post.


Thanks. I'll get them corrected.


“dependended” is another.


Fixed


Here's a slightly more easily auditable version of the checker, in Python:

https://www.pastery.net/wwzqua/

The bash one was fine, I just prefer the readability of Python to make sure I know that only my truncated hash version is ever sent.


What bash checker are you referring to?



Wouldn't using a simple bloom filter make sense in their case? Just build the thing offline and your app loads it in RAM at startup.


Here's a quick table of sizes of the Bloom Filter with the FP rate

    False positives            Size (MB)
    0.1                              285
    0.01                             571
    0.001                            857
    0.0001                          1120
    0.00001                         1390
    0.000001                        1670
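
For reference, those sizes line up with the standard sizing formula for n = 500M entries:

  m = -n * ln(p) / (ln 2)^2  bits,  with k = (m/n) * ln 2 hash functions

e.g. p = 0.01 gives about 4.79 billion bits ≈ 571 MiB, with k ≈ 7 hashes per lookup.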


That's only if you're sending a bloom filter for the entire hash table though. If you just use the existing buckets you could reduce the API response sizes considerably.

The real problem is that the current API returns a count of the number of times each particular password has appeared, and AFAIK there's no good way to do that with a bloom filter.


Sure there is. Use a counting bloom filter.

In order to increase the count for a given key, you expand the key into your probe sequence, find the minimum counter across all of the probed locations, and then increment all of the values equal to that minimum value. One generally uses saturating addition to avoid overflow. If you want to support removing items, you actually increment the values at all of the probe locations, at a cost of increasing the expected deviation from correct counts.

To read the count for a given key, you generate your probe sequence and return the minimum of the values stored at the probed locations.

Note that a regular Bloom filter is just a counting Bloom filter using 1-bit counters.

In this case, I presume one would generate a counting bloom filter and then either store logarithm or quantile of the count, since there isn't a common use case for distinguishing between 4,096 and 4,095 occurrences of a given password.
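
A minimal sketch of that conservative-update counting filter (the double-hashing probe scheme here is my own simplification):

  // Counting Bloom filter with the "increment only the minimum"
  // (conservative update) rule described above.
  package main

  import (
      "fmt"
      "hash/fnv"
  )

  type CountingBloom struct {
      counters []uint8 // 8-bit saturating counters
      k        int     // probe locations per key
  }

  func NewCountingBloom(m, k int) *CountingBloom {
      return &CountingBloom{counters: make([]uint8, m), k: k}
  }

  // probe derives k positions from the key via double hashing.
  // Assumes the filter size fits in a uint32.
  func (cb *CountingBloom) probe(key string) []int {
      h := fnv.New64a()
      h.Write([]byte(key))
      sum := h.Sum64()
      h1, h2 := uint32(sum), uint32(sum>>32)|1 // odd step, never zero
      pos := make([]int, cb.k)
      for i := range pos {
          pos[i] = int((h1 + uint32(i)*h2) % uint32(len(cb.counters)))
      }
      return pos
  }

  // Add increments only the counters currently holding the minimum,
  // with saturating addition at 255.
  func (cb *CountingBloom) Add(key string) {
      pos := cb.probe(key)
      lowest := uint8(255)
      for _, p := range pos {
          if cb.counters[p] < lowest {
              lowest = cb.counters[p]
          }
      }
      if lowest == 255 {
          return // saturated
      }
      for _, p := range pos {
          if cb.counters[p] == lowest {
              cb.counters[p]++
          }
      }
  }

  // Count returns the estimate: the minimum over the probed counters,
  // which never undercounts but may overcount.
  func (cb *CountingBloom) Count(key string) uint8 {
      lowest := uint8(255)
      for _, p := range cb.probe(key) {
          if cb.counters[p] < lowest {
              lowest = cb.counters[p]
          }
      }
      return lowest
  }

  func main() {
      cb := NewCountingBloom(1<<20, 7) // real sizes would come from the table above
      cb.Add("password1")
      cb.Add("password1")
      fmt.Println(cb.Count("password1"), cb.Count("hunter2"))
  }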


I don't understand why everyone's obsessed with using a Bloom Filter for this. The median response size on the API is 12.2KB with 305 entries. A Bloom Filter with 305 entries and a 0.000001 false positive rate would be about 1KB. It seems to me you're introducing a lot of complexity (Bloom Filter vs. simple string match) and possibility of false positives to save 11KB.


A bloom filter can be pushed to clients. Everything is static and could be served from a CDN. If there is a hit you could then do a secondary request to perform an actual lookup. For passwords which have not been pwned there'd be a 100% savings on CPU.


True, although there's an interesting side effect. With the false positive rate tuned low any call to the actual API would likely be saying "My password is one of the ones you already know about" with quite high probability.


Presumably the use case is to reject either common or compromised passwords, and they'd like to eliminate a point of failure by not depending upon Troy's service being up / maintained forever.

Also, a Bloom filter has a smaller page cache footprint and disk space usage than using a precise set.


I've experimented with using a bloom filter for this dataset: https://github.com/jthomas/serverless-pwned-passwords


You'd have to either send the password, or something very easily derived from the password for this to work. Or have every client download the bloom filter data which is quite a lot of bandwidth by comparison.


what?

Just send a hash from the client and have the bloom filter built offline with the same hashes, no big deal. You'd never ask clients to download it.


What? Your approach would be highly susceptible to a rainbow table-like attack AND not allow for k-Anonymity.


I think so. Just set the bits and test them as needed.


Wouldn't be useful for the range queries, all possible prefixes return data


True, I didn't see they allow queries on ranges. I guess you might be able to build something on a Bloom filter trie for that, but that might be pushing it.

But for one of the calls (checking a whole password against what's stored), which probably accounts for most of the usage of the API, a bloom filter seems like a perfect fit.


Doesn't seem particularly necessary though yeah? Bloom filters are super useful when the cost of doing the computation for the actual lookup is high, but in this case it's a single key-value lookup. That's about as cheap as an operation can possibly get.


Anyone know, if we permute all 6-16 character length alphanumeric strings, how many would have their SHA-1 hash be a match for a given five character prefix?

I’m certainly not saying I think this is an issue! I’m just academically curious about the number and how to go about calculating it.


A SHA-1 digest is a 160-bit value, conventionally written as a 40-digit hexadecimal string. With 16 possible values per digit, that yields a total search space of 16^40 possible values.

If we slice off the first five digits (which Troy originally used as the database partition key), we get 1,048,576 possible five digit prefixes (16^5). Then we continue calculating with the sixth position as the new first position in the string.

The math from here is a straightforward function mapping 16^n -> 16^(n - 5):

* 16 six digit values match any given five digit partition key, p,

* 16^2 = 256 seven digit values match any p,

* 16^3 = 4,096 eight digit values match any p,

* 16^4 = 65,536 nine digit values match any p,

* 16^5 = 1,048,576 ten digit values match any p,

* 16^6 = 16,777,216 11 digit values match any p,

* 16^7 = 268,435,456 12 digit values match any p,

* 16^8 = 4,294,967,296 13 digit values match any p,

* 16^9 = 68,719,476,736 14 digit values match any p,

* 16^10 = 1,099,511,627,776 15 digit values match any p,

* 16^11 = 17,592,186,044,416 16 digit values match any p.

So in general, to calculate how many n digit SHA-1 digests correspond to any m digit prefix, we simply calculate (16^n)/(16^m), which yields 16^(n - m). Thus we have (16^[6..16]) / (16^5) for the five digit prefix case. Hopefully that elucidates it for you!


Shouldn't the size of the alphanumeric alphabet used for the passwords be in there somewhere? The question was about how many permutations of all 6-16 character alphanumeric strings map to a given 5 digit SHA-1 prefix.

Your answer seems to be for the case where the alphabet of the input string is hex digits.

I.e., I think we want Sum[A^i,{i,6,16}]/16^5, where A is the size of the alphanumeric alphabet.

For A=62, this is 4.6x10^22 or 2^75.3.

For A=95, this is 4.2x10^25 or 2^85.1.


Note that there's no need to actually perform N-M+1 exponentiations and N-M additions. Two exponentiations, a subtraction, and a division will give you the same answer.

  Sum[A^i,{i,0,N}] = (A^(N+1) - 1) / (A-1).
  Sum[A^i,{i,M,N}] = (A^(N+1) - A^M) / (A-1).
Consider A=10, M=2, N=6. Sum = (10,000,000 - 100) / 9 = (9,999,900) / 9 = 1,111,100, which is easily checked mentally.

Yes, I've stared at permutations for far too long, but I can't have been the first to notice the pattern. It must be in plenty of textbooks.

Another cute combinatorics question: using an alphabet of size A, generate the Nth largest M-character sequence where no two adjacent characters are equal. A simple count-and-check algorithm takes O(N) time, but there's a simple O(M) algorithm (that is, constant time if the number of characters is fixed), assuming constant-time basic arithmetic operations.
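
For what it's worth, here's my own take on that O(M) construction in Go, 0-indexed and counting from the lexicographically smallest (the "largest" variant is just N -> total-1-N):

  // N-th length-M string over an alphabet of size A with no two adjacent
  // characters equal, in lexicographic order. Total count: A * (A-1)^(M-1).
  // The 'a'+index mapping below assumes A <= 26.
  package main

  import "fmt"

  func nthNoAdjacentRepeat(A, M int, N uint64) string {
      // block = number of valid completions once a position is fixed
      block := uint64(1)
      for i := 0; i < M-1; i++ {
          block *= uint64(A - 1)
      }
      out := make([]byte, 0, M)
      prev := int(N / block) // first character: A choices
      N %= block
      out = append(out, 'a'+byte(prev))
      for pos := 1; pos < M; pos++ {
          block /= uint64(A - 1)
          c := int(N / block) // A-1 choices at every later position
          N %= block
          if c >= prev {
              c++ // skip the previous character
          }
          out = append(out, 'a'+byte(c))
          prev = c
      }
      return string(out)
  }

  func main() {
      // All 6 valid strings for A=3, M=2: ab ac ba bc ca cb
      for n := uint64(0); n < 6; n++ {
          fmt.Print(nthNoAdjacentRepeat(3, 2, n), " ")
      }
      fmt.Println()
  }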


If by "alphanumeric" you mean [0-9a-zA-Z], that's an alphabet of 2 * 26+10=62 characters. Since I'll be doing this in my head, I'll add two punctation characters so we end up at 64=2^6.

Now, I assume you mean "five character prefix of the hex-encoded 160 bit hash"? One hex digit is half a byte; a nibble or 4 bits. 5 * 4=20 bits, leaves 160-20=140 bit hash. Which can represent 2^140 different values.

For a password of 1 to 16 characters from our alphabet, we have 64^16 passwords. The five character prefixes account for 64^5 of those.

64^16=2^(6 * 16)=2^96

64^5=2^30.

(2^96-2^30) < 2^140.

So, if I understood your question: all of them could (should) have unique representations in the remaining hash.

But I think you meant how many would collide in the prefix?

That'd be (2^96-2^30)/(2^20).

I think, it's getting to be bed time. But at least I should be sufficiently wrong for someone to correct me ;)


In theory, if sha-1 were a "good" hash function in the sense of evenly distributed, then if there are 2^20 prefixes then each would get 1/2^20 fraction of all the hashes.


...or CloudFlare/CloudFront plus a DynamoDB table with a primary key of the first/last n characters of the hash, with a potential secondary index for filtering.

Btw, indexing can be done cheaper with Google I think.

There is also another way (probably better) and that is to use S3. 1TB can be stored for as little as $20 - the rest is endpoint caching.

Luckily for all of us it is easier than ever to single-handedly scale to millions of users at minimal cost.


You'd be paying a lot for bandwidth on S3 and Cloudfront.


Minor complaint: Start the blog post with a link to the service you are talking about. I actually had trouble finding it.


It's two clicks away:

* first paragraph links to https://www.troyhunt.com/ive-just-launched-pwned-passwords-v... ;

* first paragraph of that links to https://haveibeenpwned.com/.


That's the wrong link though. I eventually found it at https://haveibeenpwned.com/Passwords


TL;DR why brotli is HTTPS-only: some middle-boxes mangle responses with content encodings they don't know.


What does he do with the logs of all the passwords submitted in searches?


> What does he do with the logs of all the passwords submitted in searches?

> imagine if you wanted to check whether the password "P@ssw0rd" exists in the data set. [...] The SHA-1 hash of that string is "21BD12DC183F740EE76F27B78EB39C8AD972A757" so what we're going to do is take just the first 5 characters, in this case that means "21BD1". That gets sent to the Pwned Passwords API and it responds with 475 hash suffixes (that is everything after "21BD1") and a count of how many times the original password has been seen.


What about the logs from queries submitted via the HIBP website form?

"Another idea I'm toying with is to use the Cloudflare Workers John mentioned earlier to plug directly into Blob Storage. Content there can be accessed easily enough over HTTP (that's where you download the full 500M Pwned Password list from) and it could take out that Azure Function layer altogether. That's something I'll investigate further a little later on as it has to potential to bring cost down further whilst pumping up performance."

How to read this? The full list will be downloadable? Users can do queries locally on the 500M file instead of over the internet? It would be nice to avoid having to submit queries over an untrusted network (the internet), but I doubt that is what is being considered in this paragraph.


The form on HIBP uses the same JS client hashing, you can check the HTTP requests yourself in dev tools.

Yes, the whole dataset is available. The first paragraph mentions the release of the v2 dataset and you can read the full blog post here: https://www.troyhunt.com/ive-just-launched-pwned-passwords-v...

You can get the 8.8gb file directly here: https://haveibeenpwned.com/Passwords


Thanks for the answer.

That page acknowledges the issue, which is all I was curious about:

"Getting back to the online search, being conscious of not wanting to send the wrong message to people, immediately before the search box I put a very clear, very bold message: "Do not send any password you actively use to a third-party service - even this one!""


How am I supposed to pronounced 'pwned'?

Can't we find something other than 4chan language to describe this?


I'm not a fan of the term but "pwned" predates 4chan.

http://knowyourmeme.com/memes/owned-pwned


‘owned’ with a P on the front.

At this point the term pwned seems far too well known (in the industry) to change it.

For example, it seems to be in my iPhone’s internal dictionary.



Powned. Literally owned then add a hard p


That password header warning is the coolest security improvement I've seen online.


Was expecting an Algolia ad but was delightfully surprised.


Starting to get a little uncomfortable with how hard this is being pushed on HN right now.

https://hn.algolia.com/?query=troyhunt.com&sort=byDate&prefi...


It's good to be skeptical, but everything I've read so far makes me believe Hunt's motives are good. His efforts are beneficial to HN readers and beyond.

This search shows four results within the last week and then it starts to drop off. How different are these search results from other popular sites? Ars Technica, Techcrunch, Anandtech, Bloomberg, LWN? Even if you focus on individuals, maybe it's not too different from the articles of Bruce Schneier, ESR, Linus, Theo de Raadt, etc?


>How different are these search results from other popular sites?

Generally the sites you listed have unrelated articles posted which are all on the subject of some distinct topic. The articles posted here in the past week have all been about the compromised password tool.



I'm not sure if you're just being snarky or not, but use the link I gave in my comment. There are 3 articles this week about the "have I been pwned" tool.


There are 3 distinct blog posts about the data, and how it is enabled. All of which are interesting and relevant in different ways. I don't see any issue with them all showing up.


CloudFlare can read every password submitted through their service and here is why that's so great...

It's beautifully elegant, because...

What? This is also the same company that spilled memory all over every cache everywhere.


Nope. 100% incorrect, Troy's service is using Cloudflare but you don't send the password to his API: https://blog.cloudflare.com/validating-leaked-passwords-with...


My apologies, I was wrong. I was assuming that you were terminating ssl and therefore in a position to read the plaintext of user requests.

This would allow you to simply look up whether the password was pwned based on form submits.

[edit] ok, I understand what you are saying, Troy's service allows some password privacy due to api design and happens to use cloudflare stuff. Sorry I was so slow.

I was conflating two unrelated things; a) what happens when you submit a password through cloudflare and b) what happens when you happen to use a specific password checking api which uses cloudflare.


I guess you didn't read his note at the beginning of the article that you should read his previous posts on the subject. The whole point is that you give only a part (the first 5 characters) of the hash (SHA-1 in this case, if I'm not mistaken?) to the API, and you are then returned a list of hashes with the same first characters; then you check that response against the hash of the actual password.

In effect, Troy's service never knows the password. It doesn't even know if it existed in the list you received.


Troy's API doesn't work like that. Doesn't need to send the password at all.


What do you mean? It still sends a portion of the hash, not the whole password, but that's still some information, no? An attacker brute-forcing hashes doesn't care if he finds the exact hash or a set of a thousand he can try. In fact, since the database is public, if I have your range and the password is in it, I only have to try a few hundred passwords at most to find which one yours is. Granted, I also need to know which service you're using this on, but that doesn't seem that hard to find out.


I think the scenario you are describing is the following: the attacker somehow gets the hash submitted to the API and then runs JtR or similar to crack the password.

That doesn't work because only the first five characters of the hash are sent to the API. The API then returns a list of all the hashes that have that prefix that it knows about. The client then (privately) is able to see if their password hash is in that list.


No, it's more like "attacker gets the range, sends it to the API, gets all the possible hashes from the API, tries the corresponding plaintext passwords, thus needs only ~200 tries instead of X million".

Granted, your password was compromised already if it was in that list, so it needs to be changed anyway, but there's some information revealed by the hash (of course, how else could it be checked?).


But the whole point of the api is that you don't use the password if it is found in the list. So the attacker will know about the 200 strings that are definitely not the password.


Assuming you managed to instantly change the password on every single service you've used it, yes.


Nobody is talking about it, but to compromise someone, you also need the username, which is not sent.


Ah, that's true. In any case, I don't think this is really a reasonable vector of attack, I was just being pedantic at jgc's "the password is not sent" comment.


Got it. That makes sense.


Wonderful, thank you [edit] thinking this through.... but regardless of how the api works, is it not possible that you (cloudflare) could just have the list and check yourselves since you know the submitted password?

It could be a value added service for all your customers.

[Sorry for the late edit, not being evil here].


If you read the details of Troy's API you'll see that at no time does the password or even a hash of the password leave the computer of the person using the API.


Yep, I was sort of on the wrong tack. The api is fine, it just seems like an unnecessary dance for cloudflare integrated services since you have the password anyway.


The entire dataset is available for download, so everyone can do the check themselves if they want to.


You send a truncated digest of the password, which was obtained using a somewhat dubious hash function. I guess there are currently no known preimage attacks on this scheme, but revealing this information is more than nothing.


This is slightly OT and probably not a popular opinion, but does anyone else feel that Troy having this massive dataset of emails is unethical?

I definitely believe it is illegal and was surprised that during his recent visit to the US that the FBI did not arrest him.


Simply possessing the information (which is already freely available online) is not on its own unethical. It depends entirely upon what he does with the information, which in this case is protect the owners from malicious individuals who also have access to the information because it's freely available online.


It is if it's stolen property... which most of that data is, since it belonged to the companies that were hacked.


I'm not going to pretend to understand the legal niceties here, but the analogy to "stolen property" rings false because 1) he's not depriving anybody of their passwords 2) the very fact they've been leaked means they aren't really of value anymore.


Strongly no.

In a better world the existence of such a dataset would be unnecessary.

In the less ideal one you have to look at its existence in a broader context.

- it provides a clearly beneficial service to owners of those email accounts bridging the gap in the legal systems (disclosure requirements)

- it's exposed in a way that keeps the risk of 3rd party exploitation low: it doesn't leak information unless you have access to the email account

- it has not to my knowledge been abused in any way and public is keeping check on that - there would have been an outcry that would not go unnoticed

- it's less of a high value target than you'd think - it's built from data that is already in the wild and could be pieced together by a sufficiently motivated actor, especially in the light of the recent combined lists with hundreds of millions of emails surfacing..

Quite frankly if you look at just that it scores better than most datasets gathered by websites requiring user registration.

The only thing missing here really is the consent to be included in that list, but given the sources of the data and the points above I find lack of it more than sufficiently justified.


You jumped to illegal before even explaining why it's unethical.


In the UK (where Troy is not based), I guess a prosecutor might try and build a case around either:

- Section 58 of the Terrorism Act[0] which makes it illegal to possess information "likely to be useful to a person committing or preparing an act of terrorism", though the law provides "reasonable excuse" as an explicit defence.

- Section 3A of the Computer Misuse Act[1] "making, supplying or obtaining articles for use in unauthorised access to computer material" [my paraphrasing]

It feels like Troy has a pretty strong defence of "reasonable excuse" to my lay understanding though.

[0] https://www.legislation.gov.uk/ukpga/2000/11/section/58 [1] https://www.cps.gov.uk/legal-guidance/cybercrime-prosecution...


Curious which specific US law(s) it violates. Sounds like he downloaded a few torrents and built a search interface.


During his recent visit to the US he testified before Congress on issues regarding data breaches. That would have been one hell of a bait and switch.


He has written about that very problem here: https://www.troyhunt.com/the-ethics-of-running-a-data-breach...


Charging people to access my stolen info seems like it is less of a service and more about Troy than anything else.


From what I understand, the only people that are charged access to the service are large organizations that want callbacks when new data that matches their users is entered. This seems like a very reasonable balance between making money to pay for hosting related fees, and providing a free and publicly available service.



These passwords were already publicly available, some with the email address associated with them.



