Let's Encrypt OCSP and Issuance Outage Postmortem (letsencrypt.org)
223 points by agrajag on May 25, 2017 | 34 comments



And this is why you always... aww hell, who am I kidding? That was one complex series of events, and I would have done no better.

Grats to the Let's Encrypt team for figuring it out, and thanks for writing up the postmortem. It's always interesting to read things like this, and it just goes to show that sometimes all the monitoring you thought had you covered isn't quite enough.


I'm not sure (at least on the rollout side, which is what caused the largest impact)...

My team does a lot of releases and rollouts. We use automation in the majority of cases and automatically pause if alert conditions occur during the rollout.

We usually release a dev version (against which we can run tests, including the typical traffic we expect the service to handle), a canary version, and a production version. Ideally you want to catch this sort of stuff during the tests that you run against your dev instances. Of course, you have to have the right testing infrastructure, monitoring, and alerting in place to take advantage of that.

For some of our services, we use graduated rollouts so we can gradually direct traffic from the old version of the stack to the new one over a period of time. This allows us (where us = our monitoring) to quickly identify if there are issues (e.g. can compare error rates of version x with version x+1) and reset the percentage of traffic going to version x+1 to 0% if there is an issue.
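As a minimal sketch of that comparison (the threshold and rates here are illustrative, not our real values):

    def should_rollback(old_error_rate, new_error_rate, tolerance=2.0):
        # Use the stable version's error rate as the baseline; floor it
        # so a near-zero baseline doesn't make every canary blip fatal.
        baseline = max(old_error_rate, 1e-4)
        return new_error_rate / baseline > tolerance

    # 0.1% errors on version x vs 2% on version x+1: reset canary traffic to 0%.
    if should_rollback(0.001, 0.02):
        print("rolling traffic to version x+1 back to 0%")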

From the postmortem, they indicated the need to add additional monitoring to identify an increase in error rate + traffic rate. I'm pretty sure the Production Readiness Reviews called out in the SRE book by Google (see here: https://landing.google.com/sre/book/chapters/evolving-sre-en...) would have caught this. The book also has some good insights on how to do better monitoring (hint: don't have humans eyeballing dashboards).

That said, I've seen plenty of small bugs make it to that stage of deployment with that level of impact, so it's certainly not trivial to get 100% test/monitoring coverage... Reviewing the causes of outages, and specifically identifying and resolving action items from them, goes a long way toward ensuring you improve over time.


So they were sending b64 strings in URLs? Of course this would fail :|


Why's that?


Because of the way slashes (a base64 character) get parsed in URLs.
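For example, in Python (standard-library base64; the input bytes are chosen so the encoding hits the problem characters):

    import base64

    data = b"\xfb\xef\xff"                 # encodes to '+' and '/'
    print(base64.b64encode(data))          # b'++//' -- mangled if embedded in a URL path
    print(base64.urlsafe_b64encode(data))  # b'--__' -- '+' -> '-', '/' -> '_'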


It's easy to do better if you've ever listened to Joe Armstrong or read his dissertation. The idea of centralized monitoring is just broken and can never be truly reliable.


How does Joe Armstrong's dissertation help you monitor the correct metrics? Edit: or what about his dissertation would have made monitoring unnecessary?


It's easy to do better if you've ever armchair quarterbacked or just read about doing it. The idea of actually outlining a coherent argument is just broken and can never be truly reliable.


Can you say who Joe Armstrong is and what the dissertation is? I would be interested in reading it.


Joe Armstrong's dissertation became Erlang


Oh wow I didn't know any of this. This will be some good weekend reading. Thanks.

Here is the link in case anyone else is interested:

http://erlang.org/download/armstrong_thesis_2003.pdf


Their original video demo of hot code swapping is also a classic worth checking out:

https://www.youtube.com/watch?v=uKfKtXYLG78

That got a lot of programmers interested in Erlang ~10yrs ago.


Anybody know what LaTeX book template this is?


I got the order backwards: Erlang (1986) became Joe Armstrong's dissertation (2003).


I thought this was a great writeup, but one significant issue I see unanswered is why all OCSP responses fell out of cache at the same time. In a cache with an even distribution of expiration times, they should have seen a gradual increase in traffic as responses steadily fell out of cache. Adding jitter when setting cache durations for the CDN should help even out the rate at which responses fall out of cache.
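Something like this would do it (a sketch; the 20% spread is an arbitrary choice):

    import random

    def jittered_ttl(base_ttl_seconds, spread=0.2):
        # Randomize each entry's TTL so responses cached at the same
        # moment don't all expire at the same moment.
        return int(base_ttl_seconds * random.uniform(1 - spread, 1 + spread))

    print(jittered_ttl(86400))  # a nominal 24h TTL lands between ~19.2h and ~28.8h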

In addition, monitoring cache rate and measuring request rate at the CDN should have been big indicators that it wasn't a DDoS.

Lastly, is this kind of upstream throttling with no customer communication common? That seems like a big failing on the ISP's side.


I work at a hosting company, and I can attest to throttling at the ISP side being somewhat common. My own company will throttle or null route customers almost immediately if enough traffic comes through that it starts to affect other customers behind the same switches.

We try to notify customers as quickly as possible (emails go out within minutes) but there are a lot of cases where the emails end up in unmonitored inboxes and customers don't realize they're down until their clients complain to them about it.

In any case, it sounds like there may be a bit of a logic bug client-side if a service downtime causes all of their clients to generate so much traffic that it looks like a DDoS. The clients should be throttling themselves to prevent overloading a downed upstream service; why didn't that happen here? That's worth investigating, though it's much more difficult to fix at this point with that many copies of the client in the wild. EDIT: I just thought to actually look up OCSP; I had assumed it was the mechanism something like certbot uses to renew certificates. This is built into browsers? Yeah, never mind on this whole paragraph then.


> This is built into browsers?

Yup, unfortunately, browsers often have to do revocation (and SCT) checks because traditional servers like Apache and nginx don't staple OCSP responses by default (and even if they are configured to do so, the implementations in these servers are not robust against outages).

Firefox checks OCSP for DV certs, but that will be disabled in a near-future release.


Your reference to SCT here seems weird.

[Anybody following at home: SCTs are signed timestamps proving a particular Certificate Transparency log server logged this certificate's details at a moment in time. These log servers are public proof of what certificates should exist. The cryptography behind the log servers makes it impossible for them to lie about what they know beyond a certain horizon, policy today sets that horizon at 24 hours]

Today the only thing any browser (Chrome) does with SCT by default is to verify whether valid (properly signed) SCTs are provided for certain certificates. This doesn't result in any additional connections, although if OCSP is needed already the SCTs might optionally be included with OCSP.

Eventually browsers will be able to automatically report information to detect any discrepancies (e.g. where a log is telling different things to different people), but today that's something which exists only in prototype, not as a default feature in production web browsers ordinary people use every day.


Even better: OCSP failures cannot result in failing closed because of middleboxen. So all this traffic wasn't needed anyway.


Sorry, I guess I only know how to implement/use an SSL certificate, not necessarily how the generation/providing part works. What did this outage mean? If you already had a certificate generated, were you not affected by this? Or does a certificate enabled on a website need to be checked/validated?

I did look up what OCSP is after looking at the article:

>an Internet protocol used for obtaining the revocation status of an X.509 digital certificate.

So does this mean they weren't able to verify that the SSL was still good and so you'd get a warning in the browser saying "This site is not secure" or something?


(this is purely about the OCSP server outage; others have commented on the issuance server outage)

In most cases, clients ignore any failure to contact the OCSP servers. This means that:

1) OCSP servers aren't an additional point of failure for your website

2) A man-in-the-middle attack using a stolen and revoked certificate can prevent your browser from knowing it's stolen by blocking the connection to the OCSP server

Possibly due to #2, or for other reasons (I can only speculate), some clients treat failure to contact OCSP servers more seriously and abort the connection. During the outage, those clients were unable to talk to servers that:

1) were using LetsEncrypt

2) enabled OCSP

3) did not have a valid stapled OCSP response (for example because OCSP stapling was not configured, or their server lost the response and couldn't get a new one during the downtime)

The size of the intersection of affected clients and sites is very small, but during that window they were completely unable to talk to each other. So in broad terms the impact was very small, but for those affected it was quite large.
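Roughly, the client-side decision described above looks like this (a sketch of the behaviour, not any particular browser's code):

    def revocation_decision(stapled_response_valid, responder_reachable, hard_fail):
        if stapled_response_valid:
            return "proceed"       # the server supplied fresh proof; no lookup needed
        if responder_reachable:
            return "follow the responder's answer"
        # Responder unreachable (as in this outage):
        return "abort" if hard_fail else "proceed"   # most clients soft-fail

    # The affected intersection: hard-fail clients hitting sites with no valid staple.
    print(revocation_decision(False, False, True))   # -> "abort"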


As you observe, OCSP today is not widely respected (for most sites Chrome doesn't even check OCSP at all for example) which is bad news if anybody's certificate gets stolen or misused.

OCSP Stapling is (part of) the eventual solution. Web server software will go get the OCSP answers for its own certificates, and "staple" those to the certificates when it serves them. So now client software doesn't have to wrestle with unreliable networks and make extra connections, the OCSP response is right there with the certificate during connection to the site.

However, Quality of Implementation for OCSP Stapling in some of the most popular HTTP servers is poor. Let's take the example of Apache httpd, possibly still the most popular server in the world.

By default Apache doesn't do OCSP stapling at all. So you need to configure that, and doing so isn't even a one-line "Yup, staple please" either; instead, it appears the person writing this code for Apache went through the specification and, any time they weren't sure what to do, said "Eh, I'll leave that to the sysadmin" and added a configuration option, with more or less random defaults.

As a result by default Apache will forget a perfectly good OCSP answer it knows in favour of trying to get a new one. If that fails (as it did here due to Let's Encrypt's problems) Apache doesn't say "Oh well, I have a good one already, I can use that". It makes a fake "error" OCSP response and serves that up. Why? Nobody we've ever been able to find knows what that could be useful for, but the Apache developer decided it would be a good default. "Yup, if anything goes wrong, just irreversibly break the entire server, that way they'll be sure to notice".

It will also happily staple worthless outdated answers, or answers saying e.g. "Temporary failure, try later", which likewise will just cause visitors to your site to get turned away, rather than continuing to use a known-good answer it has.

And nobody at Apache seems the least bit interested in fixing this.
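The behaviour everyone seems to want is simple to express; a sketch of it (fetch_response is a hypothetical stand-in for the real OCSP fetch, returning a (response_bytes, expiry_timestamp) pair):

    import time

    class StapleCache:
        def __init__(self, fetch_response):
            self.fetch = fetch_response
            self.good = None                 # last known-good (bytes, expires_at)

        def current_staple(self):
            try:
                self.good = self.fetch()     # replace only on a successful fetch
            except Exception:
                pass                         # fetch failed: keep the cached answer
            if self.good and self.good[1] > time.time():
                return self.good[0]          # serve the still-valid response
            return None                      # staple nothing, never a fake error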


As an Apache user, I’ve yet to look into enabling OCSP stapling, so thanks for this informative post. I presume the developer you are referring to is (one of) the developers of mod_ssl. I found the bug report[1] where the Apache developers state that they won’t enable stapling by default because “it would enable a "phoning home" feature (to the CA's OCSP responders) as a side effect of configuring a certificate”. That seems reasonable to me. However, the other behaviour that you’ve mentioned seems less so. Do you have any references (mailing list discussions, links to bug reports, etc.) for this?

By the way, your opening line should probably be edited to say something like "which is bad news if anybody's private key gets stolen or misused and they need to revoke the corresponding certificate(s)". Most readers of this discussion will know what you mean, but some who are still learning about PKI may be confused.

[1] https://bz.apache.org/bugzilla/show_bug.cgi?id=50740#c20


Wikipedia has a good description of what OCSP Stapling is[1] and how it works. When I read the Apache project's WONTFIX reason, I presumed that it was related to how plain OCSP requires the client to "phone home" in order to check whether a certificate has been revoked or not, which has implications for the privacy of the browser's user.

However, now that I know how OCSP Stapling works (the web server caches and proxies time-stamped OCSP responses that are signed by the CA), the Apache position is much less reasonable. As a Let’s Encrypt user, I “phone home” every couple of months to renew my X.509 key and certificate. That’s not a privacy concern for me or anyone else who happens to browse my site.

I also found a good article by Hanno Böck[2] which provides more details on how OCSP Stapling is thoroughly broken on Apache (as described by tialaramex).

[1] https://en.wikipedia.org/wiki/OCSP_stapling

[2] https://blog.hboeck.de/archives/886-The-Problem-with-OCSP-St...


Man, I don't follow much of this... well, part of it. I use Apache myself. I followed DigitalOcean's guide on how to set up Let's Encrypt, and I was concerned about the mention of "backports"; I didn't really get that. I've set up SSL certificates before (the virtual hosts, plus some help with proper cipher suites and whatnot; I have an A+ rating on Qualys), and after getting over the .pem vs. .cert thing it was no problem... though sometimes the process would say "no domain found" or "more than one virtual host" when there was only one. At any rate, it worked for me, so I'm glad I finally took the plunge.


This outage mainly affected certificate generation and renewal. People who wanted to renew an existing/expired SSL cert or generate a new SSL cert weren't able to do so during this window. However, I believe the impact was small.

SSL certs don't need any external service for basic verification. The clients (in this case your browser) have a set of root certificates issued by many CAs, which serve as the basis for verifying them.

The impact wasn't huge, as most people who do renewal just had a one-time failure in the cron jobs which renew the certs. And since renewals are tried a month before cert expiry, this was a non-issue.
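The renewal policy fits in a few lines (a sketch; 30 days is certbot's default renewal cutoff, and 90 days is the Let's Encrypt certificate lifetime):

    from datetime import datetime, timedelta

    def should_renew(not_after, window_days=30):
        # Renew once inside the final 30 days of the 90-day lifetime,
        # leaving roughly a month of retries if the CA has an outage.
        return datetime.utcnow() >= not_after - timedelta(days=window_days)

    print(should_renew(datetime.utcnow() + timedelta(days=45)))  # False: too early
    print(should_renew(datetime.utcnow() + timedelta(days=10)))  # True: renew now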


Oh, that's a good point. I thought it was a week of notice in advance, but a month sounds safer. It's still 90 days, right?

Thanks


Yeah. The common advice is to have an automated setup for renewals so you don't have to worry about it. It's also trivial: just add `@weekly letsencrypt renew` to your crontab and you're done.


I know that it should be trivial, but on my local/VPS servers the cron jobs would not execute, and I wasn't sure why... on my Raspberry Pi, however, they executed as expected.

edit: it's possible it's a root problem, as I don't believe you can log in as root on a Raspberry Pi. I know I'm bad for using the root user.


I have been thinking of a yellow, single-click SSL warning for things like "OCSP server cannot be contacted". Not enabled by default for now, of course.


Is it possible to have a test stage that replays/simulates the entire previous 24 hours of production traffic to compare the previous output with the new output? This might let you know if a fix to a known bug actually makes things worse.

Obviously I'm not implying this is a step Let's Encrypt should already have in place.
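A minimal sketch of the replay-and-diff idea (the handlers and captured requests are hypothetical stand-ins):

    def diff_replay(captured_requests, old_handler, new_handler):
        # Feed yesterday's production traffic through both versions and
        # collect every request where the outputs diverge.
        mismatches = []
        for request in captured_requests:
            old_out, new_out = old_handler(request), new_handler(request)
            if old_out != new_out:
                mismatches.append((request, old_out, new_out))
        return mismatches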


That's a huge step. There's lots of data to save for the input, lots of diff output, and you have to look at every detail of the output, since many of the differences are intentional. Then you would automate the output processing, and a wrong rule would miss the new bug (because it was unexpected, so the automation wouldn't account for it).


I used Let's Encrypt for the first time today, awesome! Saved $9.00 haha


LetsEncrypt is just awesome, and it's a superb mission to make every website easily secured! SSLs were otherwise a big scam! (Yes, I know, I love them!)




