Root cause analysis: significantly elevated error rates on 2019‑07‑10 (stripe.com)
203 points by gr2020 on July 12, 2019 | 108 comments



In the face of so many outages from big companies, I wonder how Visa/MasterCard is so resilient.

Is it because they are past the maturity curve and don't make "any" changes to their system, as opposed to other companies that are still maturing?


Mainframes.

> Visa, for example, uses the mainframe to process billions of credit and debit card payments every year.

> According to some estimates, up to $3 trillion in daily commerce flows through mainframes.

https://www.share.org/blog/mainframe-matters-how-mainframes-...

https://blog.syncsort.com/2018/06/mainframe/9-mainframe-stat...

https://www.ibm.com/it-infrastructure/servers/mainframes


Specifically they run IBM zTPF on their mainframes, which is also used by airlines. Some installations have uptimes measured in decades.

https://www.ibm.com/it-infrastructure/z/transaction-processi...


It's rarely the hardware that fails, it's more often due to software. So I wonder what the software that's running on mainframes does differently than the software that's written for ordinary computers.


> So I wonder what the software that's running on mainframes does differently than the software that's written for ordinary computers.

Not change.


Maybe it's not what it does but that it's written in COBOL?


Both have had plenty of downtime:

https://www.ft.com/content/1fd2a066-860f-11e8-a29d-73e3d4545...

I suspect they sometimes 'fail open' (ie. allow all payments through and reconcile later) too.


No they don't. If I sell a diamond ring for $20k and Visa passes the card as valid when it's not, the buyer just got a free $20k ring. The card could be expired, cancelled, or without enough balance. The merchant must be paid, their processor has to pay them, and the bank that issued the card must provide that credit until the cardholder pays it back. If the card was expired or only had a $10 balance, the cardholder will refuse to pay, and it gets really messy fast. Visa is not willing to assume such risk; they simply provide a network. If it goes down, it goes down and everyone on their network is screwed.

When a dispute is at play, it's a hot potato that no one wants to hold between the merchant, processor, ISO, sales agent, and bank. The card networks have been smart to eliminate themselves from that step.


> No they don’t.

On the contrary, I developed early merchant and payment gateway tech, and they absolutely do. The scenario you describe is extraordinarily rare, which allows an arbitrage between CAP perfection and customer satisfaction.

On a separate note, at any given time, some parts of our national payments ecosystem are “down”. There are enough players involved that you have an appearance of resilience.

You can see this in a mall, when one store’s card swipe terminals are down and another’s are not; it almost never happens that all the stores are down at the same time.

You can think of all these other players as an incidental circuit breaker pattern upstream of Visa.

VisaNet itself is surprisingly unscaled, capable of only about 24,000 transactions per second. Twenty years ago, our gateway would hit 15,000 transactions per second in real-world use. To do that, we scattered/gathered across many independent paths into card networks and various merchant banks.
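Roughly, that scatter/gather shape looks like this (a sketch only; the path names and send_auth function are placeholders, not the real gateway):

    import concurrent.futures
    import itertools

    UPSTREAM_PATHS = ["acquirer-a", "acquirer-b", "network-direct-1", "network-direct-2"]
    _next_path = itertools.cycle(UPSTREAM_PATHS)

    def send_auth(path, txn):
        # Placeholder for sending one authorization over a given path.
        return {"path": path, "txn_id": txn["id"], "approved": True}

    def authorize_batch(transactions):
        # Scatter: each transaction goes out over the next path in the rotation.
        with concurrent.futures.ThreadPoolExecutor(max_workers=len(UPSTREAM_PATHS)) as pool:
            futures = [pool.submit(send_auth, next(_next_path), txn) for txn in transactions]
            # Gather: collect responses as they complete.
            return [f.result() for f in concurrent.futures.as_completed(futures)]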

https://usa.visa.com/content/dam/VCOM/download/corporate/med...

https://www.capgemini.com/wp-content/uploads/2017/07/Domesti...


Actually, merchants, acquirers, and issuers can do this. It happens sometimes. When it happens, other limits come into play downstream, such as terminal configuration. There are separate offline limits, and it is unlikely they would set it that high, so a $20,000 offline charge would be declined, even if a lesser charge would be approved, stored, and processed later. As for expired cards, the expiration date is on the mag stripe and in the chip, so the transaction could be rejected at any point, even offline at the terminal. It's also printed on the card so it might be rejected before it's even swiped or dipped.
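A rough sketch of the kind of offline checks a terminal applies (the floor limit and field names here are invented for illustration):

    from datetime import date

    OFFLINE_FLOOR_LIMIT_CENTS = 150_00  # hypothetical per-terminal offline limit

    def offline_decision(amount_cents, expiry_year, expiry_month):
        today = date.today()
        if (expiry_year, expiry_month) < (today.year, today.month):
            return "DECLINE"              # expiry is on the stripe/chip, checkable offline
        if amount_cents > OFFLINE_FLOOR_LIMIT_CENTS:
            return "DECLINE"              # a $20,000 charge exceeds any sane offline limit
        return "APPROVE_OFFLINE"          # store and forward for later online clearing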


It's even done regularly in some cases, various US airlines take "offline" card transactions and process them later for food & drinks.

There are of course limits on how big an offline transaction you can take intentionally or unintentionally and probably the airline wears the full cost of failed transactions in this case.

Doesn't matter that much when it's for a $5 coffee, plus they know who you were if they really wanted to chase it down.

And as mentioned electronic terminals absolutely have automatic offline modes also.


They absolutely do, it is called "stand-in processing". I saw this while working in ATM at a major bank. The terminal operator (e.g. in our case, the ATM authorization system) can stand-in for the payment network when required. There are per-card number transaction limits that are well-defined in their contracts, and fraud liability can shift during this period of time. The payment network can also stand in for the issuer. In either case, once the network is restored all the authorization advices are forwarded.
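In sketch form, stand-in processing looks something like this (the limit and data structures are invented, not any real network's rules):

    from collections import defaultdict

    STAND_IN_PER_CARD_LIMIT_CENTS = 200_00   # hypothetical contractual limit
    stand_in_totals = defaultdict(int)       # per-card running total while the link is down
    pending_advices = []                     # authorization advices to forward after recovery

    def stand_in_authorize(card_number, amount_cents):
        if stand_in_totals[card_number] + amount_cents > STAND_IN_PER_CARD_LIMIT_CENTS:
            return False                     # beyond what we'll stand in for
        stand_in_totals[card_number] += amount_cents
        pending_advices.append({"card": card_number, "amount": amount_cents})
        return True                          # fraud liability may shift while standing in

    def forward_advices(send):
        # Once the network is restored, replay every queued advice via `send`.
        while pending_advices:
            send(pending_advices.pop(0))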


Credit cards are very asynchronous, going back to the days when carbon copy paper was used and no real-time verification might have been involved at all.

Shop owners would even get a reward for snipping a bad credit user’s card in half (something that survives to this day only as a meme).


That’s a great point. In spite of technical changes such as Apple Pay/Android Pay, chip cards, and so on, I can't recall an instance when I was unable to use a credit card globally. It seems most failures when running a credit card are pretty localized, too, and never at the interchange level...


I suspect there's a lot of caching involved as well. When making a purchase you probably don't need all the info to go all the way to the bank and back.

Stolen/lost cards can simply be flagged in a master db/table and can be rejected quickly for example.


They're also much simpler, and the system behind payment solutions hasn't changed that much in the last 10 years.


They are also miles behind on features customers want...

For example:

* My credit card statement should have links to the merchant, the address, a list of the things I bought, a link to the returns process, etc.

* Why can't my statement also have the total number of calories I've purchased in the last month, or grams of carbon in fuel I've put in the truck?

* Why can't I use my mastercard to pay another mastercard user directly?

* Why hasn't mastercard produced a '2 factor' for card payments rather than forcing every bank to implement their own?

* Why can't I buy a dual Mastercard/Visa/Other card, which works with merchants who are picky and will only accept one or the other?

* Why are we still issuing bits of plastic in the digital age anyway?

* Why don't the cards have a microusb plug on one edge, or NFC to plug into a phone or computer to log in, to act as an identity card, to authenticate or make payments, or anything else other companies issue smartcards for?

* Why doesn't mastercard work with mobile providers to issue cards that let you spend your pay-as-you-go balance, turning a mobile provider into a bank?

It seems mastercard's business is 'stuck', and there are opportunities to innovate all around them, but they won't.


Half of what you "want" is a quasi-dystopian nightmare.


Don’t be such a pessimist. There’s nothing “quasi” about it.


If this is what he wants for himself, it's not dystopian, it's personal info.


Why would you want Target telling Mastercard that you bought Spongebob underwear and 1,968 calories worth of freeze pops?


Why do you believe this isn't already the case?


>Why are we still issuing bits of plastic in the digital age anyway?

Phones die.

If you don’t care, I suggest you look into Apple Pay or something similar. You’ll find many merchants that you won’t be able to pay.


In the West, maybe, but in China you pay for everything with WeChat or Alipay, and similar solutions are popping up successfully across Asia in every country. In China it is accepted everywhere, and when it isn't, it effectively still is, because the shop cashiers will usually use their own phone to complete the transaction.


Maybe in the US; here in the UK, contactless payment is now close to universal for any vendor who accepts cards. This suggests it’s eminently possible.


They are not, they go down quite often. lol.


[2019-07-10 20:13 UTC] During our investigation into the root cause of the first event, we identified a code path likely causing the bug in a new minor version of the database’s election protocol. [2019-07-10 20:42 UTC] We rolled back to a previous minor version of the election protocol and monitored the rollout.

There's a 20 minute gap between investigation and "rollback". Why did they roll back if the service was back to normal? How can they decide, and document the change, within 20 minutes? Are they using CMs to document changes in production? Were there enough engineers involved in the decision? Clearly not all variables were considered.

To me, this demonstrates poor Operational Excellence values. Your first goal is to mitigate the problem. Then, you need to analyze, understand, and document the root cause. Rolling back was a poor decision, imo.


(Stripe CTO here)

Thanks for the questions. We have testing procedures and deploy mechanisms that enable us to ship hundreds of deploys a week safely, including many which touch our infrastructure. For example, we do a fleetwide version rollout in stages with a blue/green deploy for typical changes.

In this case, we identified a specific code path that we believed had a high potential to cause a follow-up incident soon. The course of action was reviewed by several engineers; however we lacked an efficient way to fully validate this change on the order of minutes. We're investing in building tooling to increase robustness in rapid response mechanisms and to help responding engineers understand the potential impact of configuration changes or other remediation efforts they're pushing through an accelerated process.

I think our engineers’ approach was strong here, but our processes could have been better. Our continuing remediation efforts are focused there.


Thank you for taking the time to respond to my questions. I believe the high potential of causing a follow-up incident was left out of the post (or maybe I missed it?).

I hope that lessons are learned from this operational event, and that you invest in building metrics and tooling that allow you to, first of all, prevent issues and, second, shorten the outage/mitigation times in the future.

I'm happy you guys are being open about the issue, and taking feedback from people outside your company. I definitely applaud this.


> ship hundreds of deploys a week safely

That seems like a lot of change in a week, or does deploys mean something else like customer websites being deployed?


They very likely have continuous deployment, so each change could potentially be released as a separate deploy. If the changes touch the data model, they've got to run a migration. So hundreds seems reasonable to me.


From the outside it sounds like, whatever the database is, it has far too many critical services tightly bound within it. E.g. leader election implemented internally instead of as a service with separate lifecycle management, so pushing the database query processor's minor version forward forces the leader election code or replica config handling forward too... ick.

From the description/comment it also sounds like the database operates directly on files rather than file leases, as there's no notion of a separate local, cluster-scoped, byte-level replication layer below it. Harder to shoot a stateful node. And it sounds like it's tricky to externally cross-check various rates, i.e. monitor replication RPCs and notice that certain nodes are stepping away from the expected numbers, without depending on the health of the nodes themselves.

Hopefully the database doesn't also mix geo-replication for local access requirements / sovereignty in among the same mechanisms, rather than separating that out into aggregation layers above purely cluster-scoped zones!

Of course, this is all far far easier said than done given the available open source building blocks. Fun problems while scaling like crazy :)


In my experience customers deeply detest the idea of waiting around for a failure case to re-occur so that you can understand it better. When your customers are losing millions of dollars in the minutes you're down, mitigation would be the thing, and analysis can wait. All that is needed is enough forensic data so that testing in earnest to reproduce the condition in the lab can begin. Then get the customers back to working order pronto. 20 minutes seems like a lifetime if in fact they were concerned that the degradation could happen again at any time. 20 minutes seems like just enough time to follow a checklist of actions on capturing environmental conditions, gather a huddle to make a decision, document the change, and execute on it. Commendable actually, if that's what happened.


> In my experience customers deeply detest the idea of waiting around for a failure case to re-occur so that you can understand it better.

Bryan Cantrill has a great talk[0] about dealing with fires where he says something to the effect of:

> Now you will find out if you are more operations or development - developers will want to leave things be to gather data and understand, while operations will want to rollback and fix things as quickly as possible

[0] Debugging Under Fire: Keep your Head when Systems have Lost their Mind - Bryan Cantrill: https://www.youtube.com/watch?v=30jNsCVLpAE


I understand it. I've worked in AWS, and now in OCI, dealing with systems that affect hundreds to thousands of customers whose businesses are at stake.

Mitigation is your top-priority. Bringing the system back to a good shape.

If there needs to be follow-up actions, take the less-impactful steps to prevent another wave.

If there was a deployment, roll-back.

My concern here is that the deployment had been made months ago, and many other changes that could make things worse had been introduced since, which is the case here. Taking an extra 10-20 minutes to make sure everything is fine, versus taking a hot call and causing another outage, makes a big difference.

I'm just asking questions based on the documentation provided; I do not have more insights.

I am happy Stripe is being open about the issue, that way many the industry learns and matures regarding software-caused outages. Cloudflare's outage documentation is really good as well.


> My concern here is that the deployment had been made months ago, and many other changes that could make things worse had been introduced since.

Make every bit of software in your stack export its build date as a monitoring metric. Have an alert if any bit of software goes over 1 month old. Manually or automatically rebuild and redeploy that software.

That prevents the 'bit rot' that means you daren't rebuild or roll back something.
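A minimal sketch of that idea, assuming a Prometheus-style setup (the metric name, port, and threshold are made up):

    from prometheus_client import Gauge, start_http_server

    BUILD_TIMESTAMP = 1720569600  # injected at build time, e.g. via an env var

    build_info = Gauge("myapp_build_timestamp_seconds",
                       "Unix time at which this binary was built")
    build_info.set(BUILD_TIMESTAMP)
    start_http_server(9100)  # expose /metrics for scraping

    # Alerting rule (PromQL), fires when any running binary is older than 30 days:
    #   time() - myapp_build_timestamp_seconds > 30 * 24 * 3600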


In a lot of environments this is a terrible idea. In private environments exposing build manifest information is a good idea, but not so that you can alert at 1 month. Where I work, software that's 2-3 years old is considered good - mature, tested, thoroughly operationalized, and understood by all who need to interact with it on a daily basis. Often, consistency of the user experience is better than being bug free.


I think this is a good point. Don't roll back if you don't know why your new code is giving you problems. You may fix things with the rollback, or you may put yourself in a worse situation where the forward/backwards compatibility has a bug in it. The issue may even be coincidental to the new code.

However, it's hard to say whether this is a poor decision unless we know that they didn't analyze the path and determine that it would most likely be fine. If they did do that, then it's just a mistake and those happen. 20 minutes is enough time to make that call for the team that built it.


A rollback without understanding is definitely risky. An uninformed rollback is one of the factors that killed Knight Capital Group in 2012. For those not familiar, the actual problem was they failed to update one of a cluster of eight servers, and the server on the old version was making bad trades. They attempted to mitigate with a rollback, which made all eight servers start to make bad trades. In the end they lost $460 million over the course of about 45 minutes.

The full report is here if you're curious: https://www.sec.gov/litigation/admin/2013/34-70694.pdf


Knight Capital also didn't know what version of software their servers were running, didn't know which servers were originating the bad requests, had abandoned code still in the codebase, and reused flags that controlled that abandoned code (another summary: https://sweetness.hmmz.org/2013-10-22-how-to-lose-172222-a-s...). I'm not sure what you can infer about the risk of a rollback in a less crazy environment.


If rollbacks are not safe then you have a change management problem.

If you have a good CM system, you should have a timeline of changes that you can correlate against incidents. Most incidents are caused by changes, so you can narrow down most incidents to a handful of changes.

Then the question is, if you have a handful of changes that you could roll back, and rollbacks are risk free, then does it make sense to delay rolling back any particular change until the root cause is understood?


It's not always as simple as that. What if the problem was that something in a change didn't behave as specified and wound up writing important data in an incorrect but retrievable format? Rolling back might not recognise that data properly and could end up either modifying it further so the true data could no longer be retrieved or causing data loss elsewhere as a consequence.


In that case you would probably still roll back to prevent further data corruption and restore the corrupted records from backups.

There are certainly changes that cannot be rolled back such that the affected users are magically fixed, which is not what I am suggesting. In the context of mission critical systems, mitigation is usually strongly preferred. For example, the Google SRE book says the following:

> Your first response in a major outage may be to start troubleshooting and try to find a root cause as quickly as possible. Ignore that instinct!

> Instead, your course of action should be to make the system work as well as it can under the circumstances. This may entail emergency options, such as diverting traffic from a broken cluster to others that are still working, dropping traffic wholesale to prevent a cascading failure, or disabling subsystems to lighten the load. Stopping the bleeding should be your first priority; you aren’t helping your users if the system dies while you’re root-causing. [...] The highest priority is to resolve the issue at hand quickly.”

I have seen too many incidents (one in the last 2 days in fact) that were prolonged because people dismissed blindly rolling back changes, merely because they thought the changes were not the root cause.


> In that case you would probably still roll back to prevent further data corruption and restore the corrupted records from backups.

OK, but then what if it's new data being stored in real time, so there isn't any previous backup with the data in the intended form? In this case, we're talking about Stripe, which is presumably processing a high volume of financial transactions even in just a few minutes. Obviously there is no good option if your choice is between preventing some or all of your new transactions or losing data about some of your previous transactions, but it doesn't seem unreasonable to do at least some cursory checking about whether you're about to cause the latter effect before you roll back.


I think you guys are considering this from the wrong angle...

Rollbacks should always be safe. They should always be automatically tested. So a software release should do a gradual rollout (ie. 1, 10, 100, 1000 servers), but it should also restart a few servers with the old software version just to check a rollback still works.

The rollout should fail if health checks (including checking business metrics like conversion rates) on the new release or old release fails.

If only the new release fails, a rollback should be initiated automatically.

If only the old release fails, the system is in a fragile but still working state for a human to decide what to do.
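In sketch form, with placeholder deploy and health-check hooks:

    STAGES = [1, 10, 100, 1000]

    def rollout(new_version, old_version, deploy, health_ok):
        # Prove the rollback path still works before betting the fleet on it.
        deploy(old_version, count=1)
        if not health_ok(old_version):
            raise RuntimeError("old version no longer healthy; rollback path is broken")

        for count in STAGES:
            deploy(new_version, count=count)
            if not health_ok(new_version):        # health checks include business metrics
                deploy(old_version, count=count)  # automatic rollback of this stage
                return False
        return True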


This is one of those ideas that looks simple enough until you actually have to do it, and then you realise all the problems with it.

For example, in order to avoid any possibility of data loss at all using such a system, you need to continue running all of your transactions through the previous version of your system as well as the new version until you're happy that the performance of the new version is satisfactory. In the event of any divergence you probably need to keep the output of the previous version but also report the anomaly to whoever should investigate it.

But then if you're monitoring your production system, how do you make that decision about the performance of the new version being acceptable? If you're looking at metrics like conversion rates, you're going to need a certain amount of time to get a statistically significant result if anything has broken. Depending on your system and what constitutes a conversion, that might take seconds or it might take days. And you can only make a single change, which can therefore be rolled back to exactly the previous version without any confounding factors, during that whole time.

And even if you provide a doubled-up set of resources to run new versions in parallel and you insist on only rolling out a single change to your entire system during a period of time that might last for days in case extended use demonstrates a problem that should trigger an automatic rollback, you're still only protecting yourself against problems that would show up in whatever metric(s) you chose to monitor. The real horror stories are very often the result of failure modes that no-one anticipated or tried to guard against.


I think the 80 / 20 rule applies here.


My point was that it's all but impossible for any rollback to be entirely risk-free in this sort of situation. If everything was understood well enough and if everything was working to spec well enough for that to happen, you wouldn't be in a situation where you had to decide whether to make a quick rollback in the first place.

I'm not saying that the decision won't be to do the rollback much of the time. I'm just saying it's unlikely to be entirely without risk and so there is a decision to be considered. Rolling back on autopilot is probably a bad idea no matter how good a change management process you might use, unless perhaps we're talking about some sort of automatic mechanism that could do so almost immediately, before there was enough time for significant amounts of data to be accumulated and then potentially lost by the rollback.


Because people make mistakes. Mistakes get fixed in post mortems, retros, best practices, etc. But mistakes will still happen.


The odds of you understanding all of the constraints and moving variables in play, and doing situation analysis better than the seasoned ops team at a multibillion dollar company are pretty low. Maybe hold off on the armchair quarterbacking.


I dunno. Based on what's on show here I'd rather buy their product than yours if you were competing.


"[Four days prior to the incident] Two nodes became stalled for yet-to-be-determined reasons."

How did they not catch this? It's super surprising to me that they wouldn't have monitors for this.


(Stripe infra lead here)

This was a focus in our after-action review. The nodes responded as healthy to active checks while silently dropping updates on their replication lag; together, this created the impression of a healthy node. The missing bit was verifying the absence of lag updates. (Which we have now.)
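Expressed roughly in code, "verifying the absence of lag updates" looks something like this (an illustrative sketch, not the actual monitoring code):

    import time

    MAX_SILENCE_SECONDS = 120      # no replication-lag report for 2 minutes => suspect
    last_lag_report = {}           # node_id -> unix time of the last lag datapoint

    def record_lag_report(node_id):
        last_lag_report[node_id] = time.time()

    def silently_stalled_nodes():
        # Nodes that pass active health checks but have stopped reporting lag.
        now = time.time()
        return [node for node, ts in last_lag_report.items()
                if now - ts > MAX_SILENCE_SECONDS]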


You might want to clarify this in the post. To me it reads like you knowingly had degraded infra for days leading up to an incident which might have been preventable had you recovered these instances.


Thanks for the suggestion, we’re adding a clarifying note to the report’s timeline.


I am a curious and very amateur person, but do you think that if "100%" uptime were your goal, this:

"[Three months prior to the incident] We upgraded our databases to a new minor version that introduced a subtle, undetected fault in the database’s failover system."

could have been prevented if you had stopped upgrading minor versions, i.e. froze on one specific version and not even applied security fixes, instead relying on containing it as a "known" vulnerable database?

The reason I ask is that I've heard of ATMs still running Windows XP or stuff like that. But if it's not networked, could it be that that actually has a bigger uptime than anything you can do on Windows 7 or 10?

What I mean is, even though it is hilariously out of date to be using Windows XP, still, by any measure it's had a billion device-days to expose its failure modes.

when you upgrade to the latest minor version of databases, don't you sacrifice the known bad for an unknown good?

excuse my ignorance on this subject.


> could have been prevented if you had stopped upgrading minor versions, i.e. froze on one specific version and not even applied security fixes, instead relying on containing it as a "known" vulnerable database?

This is a valid question.

As a database and security expert, I carefully weigh database changes. However, developers and security zealots typically charge ahead "because compliance."

Email me if you need help with that.


You could use that same logic to argue that they should never write any new code, just live forever on the existing code.

But customers want new features, so Stripe does changes.


How do you have an ATM that's not networked?


Same user (sorry I guess I didn't enter my password carefully as I can't log in.)

Well I mean they're not exactly on the Internet with an IP address and no firewall, are they? (Or they would have been compromised already.)

Whatever it is, it must be separated off as an "insecure enclave".

So that's why I'm wondering about this technique. You don't just miss out on security updates, you miss performance and architecture improvements, too, if you stop upgrading.

But can that be the path toward 100% uptime? Known bad and out of date configurations, carefully maintained in a brittle known state?


Secure .. enclave? I'm sorry but I think you're throwing buzzwords around hoping to hit a homerun here.


No, it's a fair question. The word "enclave" has a general meaning in English as a state surrounded entirely by another, or metaphorically a zone with some degree of isolation from its surroundings.

So the legit question is, can insecure systems (e.g. ancient mainframes) be wrapped by a security layer (WAF, etc.) to get better uptime than patching an exposed system?


yes, thank you.


If you can think of every possible failure and create monitoring and reporting for it before it happens, then you're the best dev on the planet.


And also have the greatest bosses on the history of earth giving you unlimited time to do this.


And then filtering out a lot of the crap and false alarms that the tools and supporting infrastructure throw.

I kinda lost count of how many times Nagios barfed itself and reported an error while the application was fine


In this environment:

Stripe splits data by kind into different database clusters and by quantity into different shards. Each cluster has many shards, and each shard has multiple redundant nodes.

having a few nodes down is perfectly acceptable. I guess they would have had an alert if the number of down nodes exceeded some threshold.
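Something along those lines, as a sketch (the threshold and shard names are invented):

    MIN_HEALTHY_NODES_PER_SHARD = 2

    def shards_needing_attention(shard_health):
        # shard_health maps shard name -> list of per-node health booleans.
        return [shard for shard, nodes in shard_health.items()
                if sum(nodes) < MIN_HEALTHY_NODES_PER_SHARD]

    # e.g. shards_needing_attention({"payments-7": [True, False, False]}) == ["payments-7"]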


In this case that doesn't sound like it was the issue; it was the lack of promotion of a new master due to the bug in the shard-promotion system.


In many HA setups, you're supposed to not have to care if any single thing goes down because it should auto recover

The article said that the node stalled in a way that was unforeseen, which may have caused standard recovery mechanisms to silently fail.


Right, but they didn't recover speedily. To have the cluster in such a state for so long sounds like poor monitoring to me because this can knowingly interfere with an election later.


The health check said it was ok. How would they know it needed to be recovered?

The fault was the bad health check. Not the process.


They only just clarified that monitoring was in place and they were reporting as healthy. See the comments above.


So the article identifies a software bug and a software/config bug as the root cause. That sounds a bit shallow for such a high visibility case - I was expecting something like the https://en.wikipedia.org/wiki/5_Whys method with subplots on why the bugs were not caught in testing. By the way, I only clicked on it because I was hoping it would be an occasion to use the methods from http://bayes.cs.ucla.edu/WHY/ - alas no, it was too shallow for that.


It is likely that this RCA was shallow because it was intended for everyone--including non-technical users, who (at least in my experience) tend to misinterpret or get confused by deep technical or systemic failure analysis.

It would be excellent if Stripe published a truly technical RCA, perhaps for distribution via their tech blog, so that folks like us could get a more complete understanding and what-not-to-do lesson (if the failing systems were based on non-proprietary technologies, that is).


From reading the RCA, this should be the trinity of mysql + orchestrator + vitess. If stripe can't get it right, there is no chance for the others.


Anybody know what database they’re using?


MongoDB is the primary data store used at Stripe.


Really speaks volumes about how mature MongoDB has become considering how solid Stripe's reliability is.


MongoDB is a really scary database to use at scale.

It doesn't shard nicely. Failovers have rather nasty semantics that can cause nasty bugs in client side code. Performance cliffs abound.

If your datastore is anything over 1TB, I'd be using postgres, or if you can manage it something bigtable-like.


Not always ;). As someone with experience managing mongo at scale, this really speaks volumes to the amount of effort needed to make it not do the wrong thing. And even then, there are unknown unknowns like this that can pop up at any time.


As I mentioned earlier: "human error often, configuration changes often, new changes often." https://news.ycombinator.com/item?id=20406116


This reads like the marketing/PR teams wrote much of it. Compare to the Cloudflare post-mortem from today: https://blog.cloudflare.com/details-of-the-cloudflare-outage...


I'm Stripe's CTO and wrote a good deal of the RCA (with the help of others, including a lot of the engineers who responded to the incident). If you've any specific feedback on how to make this more useful, I'd love to hear it.


I don't think either one is particularly "useful" to me as a consumer of the business, other than knowing that "we have top people working on it right now" and there's a plan in place to try and avoid future problems.

What's fun for a software person is that there's a lot of interesting digressions and stuff to learn in the Cloudflare one. The whole explanation of the regexp at the end is something that no one cares about from the business side, but is an interesting read in and of itself.

It's worth noting that yours came out a bit more than a week faster than theirs, which jgrahamc clearly spent a lot of time writing. No idea if anyone cares about the speed with which these things are released...


Hi Dave, you probably won't remember me (we only spent about 2 months together in Stripe), but I bet Mr Larson remembers.

The first question is who this is written for: it lacks the detail I would write for the incident review meeting audience, while also lacking a simpler story for the non-technical. As it stands at the time I read it, I don't think it serves any audience very well.

I understand that the internal report's level of detail might be excessive here, but if technical readers are the target, some more details would have helped. For example, the monitoring details that Will described in another thread are a key missing detail that, if anything, would make Stripe look better, as problems like that happen all the time. I bet there are more details that are equally useful, that would be in an internal report, and that would not reveal delicate information. In general, the only reason I could follow the document well is that I remember how the Stripe storage system worked last year, and I could handwave a year's worth of changes. Since this part of the Stripe infrastructure is relatively unique, it's difficult to understand from the outside, and the document looks as if it doesn't have enough information.

In particular, the remediations say very little that is understandable from the outside: most of the text could apply to pretty much any incident on a storage or queuing subsystem I was ever a part of. More alerts, an extra chart in an ever-growing dashboard, some circuit breakers to deal with the specific failure shape... It's all real, but without details, it says very little.

I understand why you might not want to divulge that level of detail, though. If we want fewer details, then the article could cut all kinds of low-information sections and instead focus more on the response and the things that will be changed in the future. The most interesting bit about this is the quick version rollback, which, in retrospect, might not have been the right call. A more detailed view of the alternatives, and of why the decisions that ultimately led to the second incident were made, would be enlightening and would humanize the piece.

Thank you for not just providing a public root cause analysis, but coming here to discuss it in HN.


I work at Stripe, on the marketing team, and assisted a bit here. My last major engineering work was writing the backend to a stock exchange.

If anyone on HN knows anyone who has the sort of interesting life story where they both know what can cause a cluster election to fail and like writing about that sort of thing, we would eagerly like to make their acquaintance.


Maybe Kyle Kingsbury (aka @aphyr) is the person you are looking for?

https://jepsen.io/services#consulting


Kyle used to work at Stripe and left. I don’t think he would come back, unfortunately. That guy is absolutely amazing, especially with regard to distributed DBs and writing about them.


For starters, maybe provide more details beyond the vague information that some feature of some database didn't work as expected. Imagine you are giving this to your employees (especially new ones) to learn something. How much actually useful knowledge is being shared here to learn from?


Unexpected things are bound to happen. But one thing that stuck out to me is that you don't seem to have a safe way to test changes (which would have prevented the second failure). Are there no other environments to test changes on? Is there no way to incrementally roll out? Is there not another environment which can step in in place of a failing one while you investigate? These seem like fairly common industry practices which help you deal with unexpected failures, but I don't see a mention of if/why these practices failed and if/how that is being remediated.


It would be great if, in these types of situations, the CC tokens' validity period were extended, or at least known; the documentation only states that it is short. For our app, if the tokens were valid longer, we could write this up as a non-event and retry when things were better.


>This reads like the marketing/PR teams wrote much of it.

The remediation part is quite cautious/generic but overall it seems like a good faith effort by someone constrained by corporate rules.


Out of curiosity, how would you have preferred to see a shard unable to accept writes? I think in both post-mortems, you would see comparable graphs - usage and then a drop in usage. I think it's easier to document a failed regex versus "here's our cluster architecture that we've been using for 3 months".

Also, does your company's engineering decisions change based on other companies' post-mortems?


Is this Stripe's first public RCA? Looking through their tweets, there do not appear to be other RCAs for the same "elevated error rates". It seems hard to conclude much from one RCA.


Why don't they call 'significantly elevated error rates' an 'outage' instead?


(Stripe CTO here)

That's a reasonable question. We wrote this RCA to help our users understand what had happened and to help inform their own response efforts. Because a large absolute number of requests with stateful consequences (including e.g. moving money IRL) succeeded during the event, we wanted to avoid customers believing that retrying all requests would be necessarily safe. For example, users (if they don’t use idempotency keys in our API) who simply decided to re-charge all orders in their database during the event might inadvertently double charge some of their customers. We hear you on the transparency point, though, and will likely describe events of similar magnitude as an "outage" in the future - thank you for the feedback.
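For reference, the idempotency-key pattern looks roughly like this with the Stripe Python library (the order-ID keying scheme here is just one possible choice):

    import stripe

    stripe.api_key = "sk_test_..."

    def charge_order(order_id, amount_cents, customer_id):
        return stripe.Charge.create(
            amount=amount_cents,
            currency="usd",
            customer=customer_id,
            idempotency_key=f"order-{order_id}",  # same key on retry => same charge
        )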


And thank you for the answer and for being open to outsider input.


Because "A substantial majority of API requests during these windows failed. " implying that there was not a complete outage.

I don't understand why people demand the usage of incorrect language.


In my mind, a "degradation" would be if some fraction of requests were randomly failing, but they would be likely to eventually succeed if retried. Or if the service itself was essentially accessible, but some non-essential functionality was not working correctly.

On the other hand, if for a significant number of users the site was completely unusable for some period of time, then I think it's fair to use the word "outage". (Even if it's not a complete outage affecting all users.)

I don't know whether other people would interpret these terms the same way I do, nor do I think there's enough information in this blog post to determine for sure which label is more accurate for this particular incident. So personally, I'm not going to be too picky about the wording.


> Because "A substantial majority of API requests during these windows failed. " implying that there was not a complete outage.

The fact that you needed to qualify “outage” with “complete” clearly means the word on its own is not incorrect for cases where a system was “only” mostly unavailable rather than completely so.

> I don't understand why people demand the usage of incorrect language.

The irony.


My guess is that it's because not everything was down so it wasn't a total outage. From the post mortem:

> Stripe splits data by kind into different database clusters and by quantity into different shards.

So in theory any request that didn't interact with the problematic database should have been OK (I don't know if the offending DB was in the critical path of _every_ request).


Since both companies' root cause analyses are currently trending on HN, it's pretty apparent that Stripe's engineering culture has a long way to go to catch up with Cloudflare's.


"We identified that our rolled-back election protocol interacted poorly with a recently-introduced configuration setting to trigger the second period of degradation."

Damn, what a mess. Sounds like y'all are rolling out way too many changes too quickly, with little to no time for integration testing.

It's a somewhat amateur move to assume you can just arbitrarily roll back without consequence, without testing, etc.

One solution I don't see mentioned: don't upgrade to minor versions, ever. And create a dependency matrix so that if you do roll back, you also roll back all the other things that depend on the thing you're rolling back.


Yes this was very surprising. The system was working fine after the cluster restart. There was no need for an emergency rollback.

Doing a large rollback based on a hunch seems like an overreaction.

It's totally normal for engineers to commit these errors. That's fine. The detail that's missing in this PM is what kind of operational culture, procedures and automation is in place to reduce operator errors.

Did the engineer making this decision have access to other team members to review their plan of action? I believe that a group (2-3) of experienced engineers sharing information in real-time and coordinating the response could have reacted better.

Of course, I wasn't there so I could be completely off.


"That's fine."

idk the suits have a very different viewpoint; 30 minutes of downtime for a large financial system isn't fine. it can be very costly.


I think the GP means that, as far as incidents occurring goes, so long as care is (or was) taken to prevent them and learn from them, that's all one can really reasonably ask for. The first incident falls under that heading and 'is fine' in a 'life happens' sense.

The following incident comes across as reckless and avoidable, as there should have been procedures to safely test the rollback (and perhaps there were, but a perfect storm allowed it to fail in prod). Lacking details about how the second incident came to be or how similar incidents will be prevented going forward places the second incident as 'not fine'.

This information is what the GP comment is asking for.

Compare this PM with Cloudflare's PM, where they detail how they tested rules, what safeguards were in place, how the incident came to be, and how they intend to prevent similar incidents; the impression given here is that they will put up more fire alarms and fire extinguishers but do little fire prevention.


Not sure why this is downvoted, but it all really looks like untested deployments to production servers.


Possibly downvoted because of the name-calling ('what a mess', 'amateur move'), which degrades discussion and is against the site guidelines. It's also sort of distasteful to pile on like that.

https://news.ycombinator.com/newsguidelines.html



