Stripe – Outage postmortem

jorgeortiz85 · on Oct 10, 2015

Hi, I work in infrastructure at Stripe and I'm happy to provide more insight. Several threads here have commented on our tooling and processes around index changes. I can give a bit more detail about how that works.

We have a library that allows us to describe expected schemas and expected indexes in application code. When application developers add or remove expected indexes in application code, an automated task turns these into alerts to database operators to run pre-defined tools that handle index operations.

In this situation, an application developer didn't add a new index description or remove an index description, but rather modified an existing index description. Our automated tooling erroneously handled this particular change and interpreted it not as a single intention but instead encoded it as two separate operations (an addition and a removal).

Developers describe indices directly in the relevant application/model code to ensure we always have the right indices available -- and in part to help avoid situations like this. In addition, the tooling for adding and removing indexes in production is restricted to a smaller set of people, both for security and to provide an additional layer of review (also to help prevent situations like this). Unfortunately, because of the bug above, the intent was not accurately communicated. The operator saw two operations, not obviously linked to each other, among several other alerts, and, well, the result followed.

There are some pretty obvious areas for tooling and process improvements here. We've been investigating them over the last few days. For non-urgent remediations, we have a custom of waiting at least a week after an incident before conducting a full postmortem and determining remediations. This gives us time to cool down after an incident and think clearly about our remediations for the long-term. We'll be having these in-depth discussions, and making decisions about the future of our tooling and processes, over the next week.

asuffield · on Oct 11, 2015

(Tedious disclaimer: my opinion, not speaking for my employer, etc)

I'm an SRE at Google, where postmortems are habitual. The thing that jumped out at me here is that a production change was instantaneously pushed globally, instead of being canaried on a fraction of the serving capacity so that problems could be detected. That seems like your big problem here.

(Of course, without knowing how your data storage works, it's difficult to tell how hard it is to fix that.)

jorgeortiz85 · on Oct 11, 2015

Yup.

This is one of our few remaining unsharded databases (legacy problems...), so we can't easily canary a fraction of serving capacity. However, one clear remediation we can implement easily is to have our tooling change a replica first, failover to it as primary, and, if problems are detected, quickly fail back to the healthy former primary.

Lesson learned. We'll be doing a review of all of our database tooling to make sure changes are always canaried or easily reversible.

eldavido · on Oct 10, 2015

hi jorge

I'd actually applied to work at stripe about two years ago, you guys turned me down ;)

I was responsible for ops at a billion-device-scale mobile analytics company for about 1.5 years. Your tooling is far superior to anything we produced. I like the idea of a single source of truth describing the data model (code, tables, query patterns, etc.) a lot, and doubly-so that it's revision-controlled and available right alongside the code.

I think it's far from decided though, how much to involve human operators in processes like this. Judging from this answer, you seem to be on the extreme end of "automate everything". How then, I'm curious, do you train/communicate to developers what can be done safely vs. something that would cause i/o bottlenecks, slowdown, or other potentially production-impacting effects? Can you even predict these things accurately in advance? (Some of our worst outages were caused by emergent phenomena that only manifested at production scale, such as hitting packet throughput and network bandwidth limits on memcached -- totally unforseeable in a code-only test environment).

It sounds like you let developers request changes (a la "The Phoenix Project") but ops is responsible for final approval of the change? That actually sounds like a great system. Would love some elaboration on this.

In any case, great writeup and from one guy who's been there when the pager goes off to another, sounds like the recovery went pretty smoothly.

jorgeortiz85 · on Oct 10, 2015

This is indeed a tricky balance. We want developers to iterate quickly, but we also want to understand the impact of production changes. With a small team and small sets of data, it's easy for everyone to understand the impact of changes and it's easy for modern hardware to hide inefficiencies. As we grow, the balance changes. It's harder for any one person to understand everything. It's also harder to hide inefficiencies with larger data sets.

We're always learning and improving. In order to scale, we'll need better ways to manage complexity and isolate failure. Our tools, patterns, and processes have changed quite a bit over the last few years, and they will continue to change. Ultimately, we want every Stripe employee to have the right information evident to them when they make decisions. This will be challenging, especially as we grow, but I'm excited to take on that challenge.

If you're still interested in working at Stripe, I'd encourage you to reapply! Our needs have changed quite a bit since you applied, and we're willing to reconsider candidates after a year has passed. Feel free to shoot me a resume: jorge@stripe.com

toomuchtodo · on Oct 10, 2015

Shouldn't developers understand how a database change is going to impact an environment based on the code they've written?

BinaryIdiot · on Oct 10, 2015

Yes they very much should! But in my, admittedly anecdotal, experience only the best / most senior ever do. Almost every junior or mid developer I've worked with (and a small handful of senior folks) not only have no idea how changes like this would impact the larger environment but many won't even care to look into it.

noir_lord · on Oct 11, 2015

In part though that's because the tooling to do it easily absolutely sucks, the impedance mismatch (overused but in context here) between the two parts of the system causes a lot of the underlying issues, better tooling is a large part of the solution I think but I've not seen anything that would help and the surface area of a modern RDBMS is so large without even getting into vendor specific stuff I'm not sure what that would even look like.

BinaryIdiot · on Oct 11, 2015

That's certain a great point! If there was a way to automatically test much of this I bet even the newest of engineers could stop this. Doing that is tough, hmm...

noir_lord · on Oct 11, 2015

I think the only way you could do it on top of a RDBMS is to use a strict subset of features that are common (something that many ORM's already do) which reduce the problem scope down to something manageable, the issue then would be that there would always be the temptation to use something outside that subset and forgo the easier testing, fast forward and you have the same issue.

It would be interesting to build a RDBMS that enforced that subset by simply not allowing those features to be used/abused with support for many of the modern features (JSONB etc) but that is way beyond my area of expertise.

dexterdog · on Oct 10, 2015

You would think but far too many developers don't really know how databases work under load.

devit · on Oct 10, 2015

Why not just use simple version controlled database migrations, and testing them in a test environment?

tempestn · on Oct 10, 2015

Generally you want your database migrations described in a straightforward manner for development; the migrations will contain a straightforward change from old to new (and back). With a live (busy) production database, it is often necessary to handle things differently to maintain up-time.

As a simple example, to make an atomic change to a write-only table, you could create a copy of the table, alter the copy as necessary, then in a single rename operation, rename the live table to '_old' and the '_new' table to live. You most likely would not want to add two additional table schema and all of those steps to your development database operations.

It's entirely possible that they could capture what is done in production as migrations, and test them first, but it would still likely be separate from what the application developers are working with.

devit · on Oct 10, 2015

Development databases normally have a small amount of data, so migrations should execute instantly or nearly so no matter how complex they are.

tempestn · on Oct 11, 2015

True, but I don't think it negates anything I wrote. You don't keep development migrations simple so they'll run quickly; you keep them simple so they're easy to create and understand. Writing migrations (whether automated or manual) for production is a separate task and even a separate skill from designing the database structure itself, so there's no reason why the two need to be (or should be) combined.

tempestn · on Oct 11, 2015

Meant to write 'read-only' in the example there. Those steps wouldn't work well for a table that's being written to, since it could change in the process. Anyway, it was just an example.

raspasov · on Oct 10, 2015

What kind of database was the incident on?

lacksconfidence · on Oct 10, 2015

have you considered integrating index statistics into these changes? To take an example from mysql, there is the INDEX_STATISTICS table in information_schema that contains the current number of rows read from the index. Checking this twice with a one minute interval before applying the index drop could have shown that the index was under heavy usage, and might require human intervention.

codahale · on Oct 10, 2015

MongoDB doesn't track this information, unfortunately.

vostro_mf · on Oct 11, 2015

It looks like the latest version does: https://jira.mongodb.org/browse/SERVER-2227

The problem with MongoDB is that teams think they can get away by just setting it and forgetting it. Real companies have DBAs that monitor it and understand it and make a living through it. They're just trying to automate it using fancy ui's. That's what you get for trying to automate your DBAs.

codahale · on Oct 11, 2015

3.1.x is a development branch and not intended for production use. When they release 3.2, MongoDB will support it.

spudlyo · on Oct 10, 2015

That was my thought as well, but this change was done by an Operator and not a DBA, who tend to be a bit more curious about these kinds of changes.

Animats · on Oct 10, 2015

What's so striking about this is that entire retail chains can be shut down by a problem in some cloud server. Starbucks had a server outage in April which caused stores to close.[1]

There's a trend towards "hosted POS", where point of sale systems have to talk to the "cloud" to do anything, even handle cash. Until recently, most POS systems were running off a server in the manager's office, which communicated to servers and credit card systems elsewhere. A network outage didn't affect cash transactions. Often, the systems could even process a credit card transaction without external help, giving up real-time validation but still logging the transaction for later processing.

There's a single point of failure being designed into retail here.

[1] http://www.usatoday.com/story/tech/2015/04/24/starbucks-poin...

thrownaway2424 · on Oct 10, 2015

When Blue Bottle Coffee switches to Square there was a noticeable decline in the throughput at the cash register. It just takes the retail employee longer to do anything on an iPad. Pretty much everything can be done faster on a real cash register. There have also been the requisite outages, of course. Recently I was at Blue Bottle and the Square terminal wasn't opening the cash drawer. They were making change out of the tip jar :-/

Animats · on Oct 10, 2015

There's something that a lot of retailers don't get - never put an obstacle in front of the customer giving you their money. Don't let lines form at checkouts. Don't clog up the counter with impulse-purchase stuff. Don't put displays in the path of customers headed for checkout. Don't make customers jump through hoops with loyalty cards and data entry. Don't do anything that slows the checkout process.

There are expensive retail consultants who clean up stores and improve sales by doing this.

Gap gets this right. Gap stores have big, clear counters, so you can bring up lots of merchandise and have a place to put it. This increases sales per customer. Gap is amazingly successful despite a rather blah product line.

taftster · on Oct 10, 2015

Oh my goodness, 1000x yes. I am so sick of retailers putting crap in front of their checkout counters (Best Buy, Walmart, Barnes & Noble, every Indian run gas station, and thousands of others). Like running me through a rat maze of bookshelves, display cases, and pallets full of unopened stock is somehow going to make me happy about buying from there. Every time I see this in a retail store, it annoys me to death and seriously taints my view on the shopping experience. If I don't have enough counter room to put my purchases on or if I can't find the freaking entrance to your checkout corn maze, I'm going to leave my purchases on the ground.

Ironically, I think "open" stores (like an Apple store) exhibit the same behavior. Even though there's technically no labyrinth to navigate through, good luck finding an efficient way to purchase a product, especially if it's just a small one. You first signal an Apple checkout fairy to grace you with her checkout wand, so that someone else can find you and swipe your card. If your purchase is small, they hardly even give you time of day. So stupid.

And the analogy can be extended to online stores too. Any store with a "just add this too" in the checkout experience invokes the same negative reaction. Though I am a heavy Amazon customer, their checkout process is just barely tolerable in this way. And think about GoDaddy and how miserably awful their checkout experience has been (and continues to be) throughout the years.

manigandham · on Oct 10, 2015

Sure, that's your reaction and it makes sense.

But for every person like you, there are about 50 others who do browse through all the stuff and perhaps add more to their shopping. Not everyone is in a hurry and capitalizing on small additions and combo deals and sales signups can be great business for the retailer.

icelancer · on Oct 11, 2015

>Gap gets this right.

I did consulting for them and they hired aforementioned retail consultants for this job. I will admit I initially thought it was stupid, having been a wise retail manager at the age of 17 at one point in my life (and did a lot of level-1 tracking of sales and A/B testing, even back then!). But when these retail consultants showed me their data and how it was formatted... it was unbelievable. Most of what I experienced in retail were these nebulous and stupid ideas that were conjured from the ether from my direct managers and general managers and the like. While I'm sure not all the information I saw trickled down to employees, it was truly enlightening and gave me zero room to argue particulars with how they were studying and implementing retail flow, checkout, and return procedures.

jacobolus · on Oct 10, 2015

On the other hand, having a line out the door seems to sometimes be good marketing.

mikekchar · on Oct 11, 2015

I had not thought about this before the parent posted, but there is a difference between a line on the way in and a line on the way out. The first is attractive and the second is obnoxious. One of the reasons I love British pubs and Japanese ticket-machine restaurants is that you pay up front and when you are ready to go home you just leave.

Japanese convenience stores also get this right most of the time. There is almost never a line up at the cash register. As soon as one begins to form, the employees will drop whatever they are doing and open up a new register. At some places, the register can actually handle more than one customer at a time, so if you have someone fumbling for change, they can start tallying up goods for the next person in line.

Possibly I'm not typical, but I often spend a lot of time browsing in stores. As soon as I have decided that I'm done, though, I get very irritated at anything that slows me down. I had never really put my finger on it until reading the OP's post.

Animats · on Oct 10, 2015

No. That went out decades ago. The Syufy chain of theaters (later Century) used to try to create lines by not opening up enough ticket windows. Then came video rental. Then came half-empty theaters.

Retailers no longer have the power to make customers wait. Consumers have too many other buying options.

peoplelovelines · on Oct 10, 2015

Wrong. Clearly you've never lived in NYC, where people apparently use lines as a proxy for how good something is. Lots of other big cities too.

Also, box office revenue hasn't collapsed or anything, so I'm not sure what you mean with your example. In fact, come to think of it, I see people lining up for things multiple times every year at the theaters I frequent.

oalders · on Oct 10, 2015

When I see those long lines in front of [insert trendy food item] stores I think, "I'll have to try that one day when there's no lineup" but when that day actually comes, I've already forgotten about the place. Some of us do gauge popularity by lineups, but we also can't personally be bothered to stand in line.

chrisabrams · on Oct 10, 2015

Yeah, here in NYC people prefer to go to places where they wait in line. No line == no good.

intopieces · on Oct 10, 2015

>Retailers no longer have the power to make customers wait. Consumers have too many other buying options.

And yet Apple had a line 12+ hours before opening to release their 9th generation iPhone -- despite also taking orders online.

Hopdoddy in Austin has a line out the door for lunch and dinner; they even have a person dedicated to managing it, along with waitstaff to take drink orders

I think the line is alive and well in America for certain establishments.

onedev · on Oct 10, 2015

I absolutely adore Hopdoddy's and waiting in line is considered part of the experience to be honest.

chrisabrams · on Oct 10, 2015

You've clearly never been to a big city.

thrownaway2424 · on Oct 10, 2015

You can have a line and good throughput at the same time. It's easy to develop a line. What you don't want is to have two espresso machines with three group heads each and two baristas just standing around because the counter is too slow.

raverbashing · on Oct 10, 2015

Exactly this

The "2.0 generation" (or was that 5 years ago?) is forgoing a lot of redundancy and safeguards that existed in the previous systems

When did we assume a fast data link (and all the infrastructure behind it) would be available 24/7?

At least do cash transactions and note them in a notebook

cookiecaper · on Oct 10, 2015

Yep, I'm not sure what the appeal here is. It just seems like companies are putting out less robust solutions because it saves them time and money and people are going for it because the company has sleek Silicon Valley marketing, regardless of the practicalities.

noir_lord · on Oct 10, 2015

This is one of the best incident reports I think I've ever read, detailed, honest and without recrimination.

I'm reminded of something I read in a book a while back about aviation disasters "It's not the first problem that kills you, it's the second".

In this post the described system for change management is at least as good as any I've seen in production and yet a series of small problems got out of hand quickly.

squiggy22 · on Oct 10, 2015

Totally agree. Came here to say pretty much the same thing. On a related note, highly recommend the Checklist manifesto (book) for devops.

j_s · on Oct 10, 2015

http://amzn.com/B0030V0PEW

methehack · on Oct 10, 2015

I appreciate the post-mortem and, of course, we've all been there. I have to say, though, that the cause is a little surprising. That DDL needs to be executed sequentially is pretty basic and known by, I'm sure, everyone in the engineering and operations organizations at Stipe. It surprises me that an engineering group that is obviously so clever and competent would come up with a process that lost track of the required execution order of two pieces of DDL. If process (like architecture) reflects organization, what does this mean about the organization that came up with this process? It's not a sloppy exactly. Is it overly-specialized? It reminds me of that despair.com poster "Meetings: none of us is as dumb as all of us" in that a process was invented that makes the group function below the competence level of any given member of the group.

toomuchtodo · on Oct 10, 2015

It's pretty easy to monday morning quarterback other org's choices or actions here. Every person posting here will have something in their org break at some point that was "obvious" to the rest of us.

EDIT: Someone, somewhere is going to have a bad day because they didn't know what you did. This is why sharing knowledge is so important. That's part of what HN exists for! Share what you know! Help improve open source tools! Help your fellow IT professional get a good night's sleep.

potatosareok · on Oct 10, 2015

I get that but I'm wondering who is this postmortem written for? For other engineers? Not entirely - it's seems to be written partly as PR piece.

In that case I don't need to basically congratulate Stripe on messing up and then posting PR piece on how they messed up, especially when it's such a trivial mistake (by that I mean it's not anything technically interesting for what went wrong). I guess I'll concede that what is technically interesting is not objective - many things I consider complicated other would consider basic, so I don't really have the right to serve as the arbiter of what is technically interesting.

What happened? We dropped an index. Why? Bad tooling. Fix? Patch code, add index. Future fix? Vague goals to stop this from happening.

Though I might be taking it too far, I don't see why I need to give props to someone for doing messing up something relatively basic and then fixing it - don't people complain enough about kids getting participation trophies?

Anyway, don't mean to call out any specific Stripe engineer, it's failed process at multiple levels (guy who drops index has no visibility into DB?).

onewaystreet · on Oct 10, 2015

> I get that but I'm wondering who is this postmortem written for?

For their customers.

> In that case I don't need to basically congratulate Stripe on messing up and then posting PR piece on how they messed up, especially when it's such a trivial mistake (by that I mean it's not anything technically interesting for what went wrong).

Few screw-ups are ever that technically interesting. The point of a postmortem isn't to be interesting, it's to explain what went wrong, why, and what you are doing to prevent it from happening again in the future.

toomuchtodo · on Oct 10, 2015

I would encourage you to accept that their post mortem was released in good faith, and its purpose is to both technical knowledge sharing as well as PR. I know I personally value technical organizations that are honest and forthcoming when things go south.

manigandham · on Oct 11, 2015

> I know I personally value technical organizations that are honest and forthcoming when things go south.

Why? How does the honesty (in this case openness really) change the quality?

Genuine question: Would you rather have an org that's always reliable but private in their tech or one that has issues but open about them?

icelancer · on Oct 11, 2015

The second. Because "always reliable" isn't. So when something goes down and there is nothing being communicated, that's truly infuriating.

manigandham · on Oct 11, 2015

This particular post-mortem by Stripe makes me trust them less as it's a fairly simple mistake that shouldn't have been made.

Plenty of companies also communicate the status and that something is happening but don't fully expound on all the internal details. Not sure why it's such a big difference if they did. It feels like fake PR trust to me.

toomuchtodo · on Oct 11, 2015

> This particular post-mortem by Stripe makes me trust them less as it's a fairly simple mistake that shouldn't have been made.

You are, of course, entitled to your opinion. I don't think its going to hurt their business at all.

manigandham · on Oct 11, 2015

Didn't say it was going to hurt their business. Only that a post-mortem doesn't somehow change my opinion on their quality or reputation, and in this case is the opposite.

perlgeek · on Oct 10, 2015

> It surprises me that an engineering group that is obviously so clever and competent would come up with a process that lost track of the required execution order of two pieces of DDL.

To me it seemed that it was a bug in the tooling that split the index change into two separate change requests. I'm sure a change request supports more than one piece of DDL, and it must have worked in the normal case, otherwise they would have run into this problem much earlier. So it was likely some weird corner case.

Now that I think about it, this could happen to us as well. We have peer review for each database change, but only at the source level (so, definition of the schema); the actual commands for schema changes are usually generated automatically. If some bug in that step existed that was only triggered in weird corner cases, we'd be screwed.

A backup of our main database takes about 7 hours to restore, and then we'd have to reply the binlogs from the time of the last snapshot up to the failed schema changes, so I guess we'd lose about a solid work day if things went down south. Yikes.

mst · on Oct 10, 2015

> Now that I think about it, this could happen to us as well. We have peer review for each database change, but only at the source level (so, definition of the schema); the actual commands for schema changes are usually generated automatically.

If you're using any of the DBIx::Class deployment tools they should be perfectly happy writing the DDL to disk and then running it from there, specifically to make it possible to audit the DDL as part of the commit that changed the result classes.

Generated != unauditable, especially when the tools are trying their best to co-operate :D

jlarocco · on Oct 10, 2015

I was thinking the same thing.

If nothing else, I don't understand the role of "database operator" if they'll just blindly delete a critical index without thinking about it. Shouldn't that person have known better than anybody how critical the index was?

gms · on Oct 10, 2015

There is an open secret you may not be aware of: every startup is a shitshow on the inside.

xacaxulu · on Oct 10, 2015

Also top 20 financial institutions, USG orgs and places that store your healthcare and tax information ;-) I'd argue that a good number of startups these days (especially ones borne out of larger organizations with lots of combined experience) are way more capable of handling these issues with finesse and speed.

fein · on Oct 10, 2015

Just startups? It's just as hilarious in corporate America.

perlgeek · on Oct 10, 2015

There are different levels at which you can operate a database.

One is to keep it running, monitor disc space, response times etc. but otherwise leave the schema to the developers.

Or you can own the schema, discuss all the changes and migrations with the developers etc.

If it was the first kind of DB operation (which wouldn't surprise me, because that's what our $work has as well), it's not surprising that they trust the developers to provide sane DDL patches.

jlarocco · on Oct 10, 2015

Okay, I've never seen the first kind of DB op. The places I've worked with dedicated DB people were a combination of the first and second types you listed. That situation would explain the Stripe outage, though.

TBH, it seems odd to call the first one a "database operator" instead of an IT admin.

eldavido · on Oct 10, 2015

I had this thought too. I'm actually surprised it was something this simple for an engineering org as competent and storied as Stripe, guess it proves Murphy gets all of us at one point or another.

The issue IMO is splitting an otherwise atomic procedure (creation/drop of index) into two change tickets. I'd be interested to know how the DDL was communicated to ops that led to its getting split.

vkat · on Oct 10, 2015

We do a ton of database updates regularly and we rely on a simple wiki put together that has all install instructions for a release which goes through sr engineers peer review. Seems like having a single tool/document for code and db updates could have mitigated this?

mercurial · on Oct 10, 2015

How about automated migrations? At the scale of Stripe, you may want to have dedicated personnel keeping an eye on it, but had it been in the same SQL file, none of that would have happened...

jblow · on Oct 10, 2015

Don't presume to give them so much benefit of the doubt. Apparently Stripe are kind of jokers.

This incident report reminded me of the 'Game Day Exercise' post from 2014:

https://stripe.com/blog/game-day-exercises-at-stripe

in which one robustness check that should be a continuous-integration kind of test, or at least a daily test of a normally working system, is such a big deal to them that they make a big 'Game Day' about it, and serious problems result from this one simple test.

After they have lots of paying customers, of course.

I know we are supposed to be positive and supportive on HN but this was a red flag that the entire department has no idea what an actual robust system looks like and were so far away from that, after having built a substantial amount of software, that expecting them to ever get there may be wishful thinking.

So I am completely unsurprised that they are having this kind of problem. The post-mortem reveals problems that could only occur in systems designed by people who do not think carefully about robustness ... which is consistent with the 2014 post. It kind of shocks me that anyone lets Stripe have anything to do with money.

wpietri · on Oct 10, 2015

Nicely done. Good job avoiding retrospectively blaming people and instead focusing on future system improvements.

(For those wondering why this is important, Sidney Dekker's "Field Guide to Understanding Human Error" is a mindblowing book.)

jskulski · on Oct 11, 2015

Agreed. It looks like this was caused not because the application developer or DBA caused an error, but because the system didn't allow for ticket dependencies.

wpietri · on Oct 11, 2015

Definitely. But it would have been very easy to yell at the developer ("you should have known not to do it that way") or the DBA ("why are you doing tickets out of order? you know we have to do deletes last!").

Especially after a dramatic event, those are very easy reactions to have, and they can sound very sensible.

siliconc0w · on Oct 10, 2015

For schema/index changes I prefer migrations. These should be committed along side the code, tested, and promoted through environments. This largely prevents dependency issues because the migrations are ordered.

Devops is like 80% dependency management. It's painful initially but you have to crack down on manual changes to production - all production changes should be defined in code, committed to git, tested, and flow through non-prod environments first. (just had an outage yesterday that was essentially because I failed to do this)

rawnlq · on Oct 10, 2015

Do people even test for perf issues in non-prod environments?

toomuchtodo · on Oct 10, 2015

I've found it to be extremely difficult to replicate production workloads in staging/development.

siliconc0w · on Oct 10, 2015

Yes you can do scaled down testing in non-prod. An example test may try 100,000 or 1,000,000 operations when in production you might see 100,000,000. So you won't catch subtle performance problems but you'll catch order of magnitude problems (i.e something that is O(n) vs O(log(n))

Although in this case the test for this change may just EXPLAIN a known query and ensure the output matches what is expected.

kevan · on Oct 10, 2015

We do in a staging environment, but usually only with releases where we think something we changed will cause a major difference in how the app behaves under heavy load. It's not cheap to keep a full staging environment around, but when you're playing around with CDN optimizations it's a lifesaver.

Alan01252 · on Oct 10, 2015

I try to at least have the slow query log on and monitored in staging. It's pretty easy to forget an index and not notice in a dev environment.

egwor · on Oct 10, 2015

If it's a requirement to be within X performance then yes.

pcunite · on Oct 10, 2015

>> the database operator didn’t have a way to check whether the index had recently been used for a query.

Appreciate the disclosure. Stripe engineers have been so helpful to me and my business. Keep up the good work.

leeleelee · on Oct 10, 2015

What is the point of having a human "database operator" who carries out simple tasks like deleting an index when it shows up in a work request log? Is this how most companies structure their dev teams?

If you have a human there to perform tasks, then it would seem natural to allow them "human advantages" such as the ability to communicate with the person who created the request, or the ability to have some of their own checks and balances before performing the index deletion (ex. let's take a look and see if this seems safe based on the current schema and codebase).

I am also surprised how easy it was for a single dev to make a request that subsequently results in the modification of a production database.

thrownaway2424 · on Oct 10, 2015

Indeed, what is the point? If they have some separation of concerns between dev and ops then it also makes sense to give ops some separate rights and responsibilities, such as vetting a change like this by analyzing the resulting query plans from a sample of production database queries. I'm sure something like that will be in their internal postmortem action items under "prevention".

vruiz · on Oct 10, 2015

> At 00:08 UTC, our on-call engineer had been paged and had responded. At 00:10 UTC, we linked the API degradation to the removal of the index.

I'm sure stripe has very good metrics of all their systems, nonetheless that's some rockstar level debugging skills.

kzhahou · on Oct 10, 2015

Well the DB admin had just removed the index a few minutes before, and immediately the API times spiked. When you remove an index there is exactly one risk --slowdown-- and that's precisely what the admin noticed.

vruiz · on Oct 10, 2015

That might be it, but by "on-call engineer" I understood it was someone else.

kzhahou · on Oct 10, 2015

Sure. I picture it as:

"Hey {on-call-engineer}, we're seeing a huge slowdown all of the sudden."

"Was there any change in the last few minutes?"

"We just deleted a database index."

"Yup, that'll do it. Gotta restore it."

protomyth · on Oct 10, 2015

We had a utility I wrote at one of the places I worked where you ran it against the database and it showed you all of the query plans running. It was easy with 15 seconds to see some non-indexed query and what was executing it. We used to run it after new version deployments to see if we had query problems. Tooling is very important.

perlgeek · on Oct 10, 2015

As a stopgap solution, enabling the slow query log can also be very helpful.

protomyth · on Oct 10, 2015

Yep. A lot of databases have some form of system tables that can tell you a lot about a running system with just some queries. I know Sybase and MS SQL server have the tables to find all the current queries and what they are doing. Ingress had a neat graph (seeing FSM should illicit a panic in most DBAs) and a log monitor. Read up on what is part of the database and write some tools.

shazow · on Oct 10, 2015

That sounds really handy, any chance it's publicly available or you can make it so?

protomyth · on Oct 10, 2015

No, I couldn't release it due to not having the source or being in a position to ask permission. It was for Sybase but the same thing can be achieved in MS SQL by reading some system tables. It really wasn't that hard of a thing to write.

thezilch · on Oct 10, 2015

I'd hope one of the first things for DB slow down is the running queries (eg. `show processlist` in flavors of MySQL) and seeing the number of rows examined. Maybe your DB doesn't have that, so you see hopefully see long running queries and they are mostly the same one, or you can see the same queries in your DB's engine status (eg. `show innodb status` in MySQL). If you don't have rows examined from earlier, you can run whatever equivalent you have of `explain` on these long running queries that are happening a lot, and now you should find the number of rows it is examining (and probably that it is running a scan query and not indexed).

I mean, you only get these scars from actually working with your DB and hopefully in development / capacity planning stages, so maybe that makes you a rockstar. Not to take anything away from the Stripe engineer(s).

nartz · on Oct 10, 2015

Good post mortem, just some thoughts, would love to hear what others think.

Shouldn't the app developers be in charge of deploying their code and making sure its working? It seems odd that they pass off a migration like this to a DBA to then go deploy randomly, or that they weren't around when it was happening to monitor it.

Also, it seems like there should have been a script that can be run that encapsulates the required dependencies, namely, don't drop an index before building the new one maybe? This should at least minimize the amount of context needed in a ticket.

Relying on fully correct context in tickets seems like it could be super error prone.

perlgeek · on Oct 10, 2015

It's a vicious circle: an index is missing, so requests take longer, so more DB queries run at the same time, the load on the DB server shoots through the roof.

And then it needs to rebuild an index on top of the already above-average load.

la6470 · on Oct 10, 2015

Why would someone delete a existing index without recreating it first? This is common sense for a DBA but not for our full stack engineers who have to know everything about everything g.

ryporter · on Oct 10, 2015

People seem to be focusing on the fact that the two operations should have been linked, but I think they actually should have been further separated. The old index presumably could have been left around for days, at the cost of some performance, and only deleted once they were sure that the new index was successfully being used by all production code.

While splitting these two changes may be silly in the case of a simple index change, I think that it's a good general policy to only deploy the minimal set of changes to a production database at once. On the production system I manage (admittedly, much simpler, and many orders of magnitude smaller), I always deploy changes in two stages, with the second stage generally deployed about a week later, once I'm sure that only new code will be running against the database.

This comment is not meant to second guess the Stripe developers (who produced a great postmortem), but to suggest another possible remediation.

mst · on Oct 10, 2015

I had a similar thought but if it's that critical an index it's entirely possible they couldn't afford the write overhead to keep both copies around.

pbiggar · on Oct 10, 2015

Very interesting.

In a continuous delivery environment like this, you generally want to test and validate any change you make. Do any DBs support the idea of turning off an index (but keep updating it). That way you could disable it, keep going for a while, and then delete it once you're sure it's unused?

tempestn · on Oct 10, 2015

In MySQL/MariaDB (and presumably other major databases) you can instruct any given query to ignore an index, but it has to be done at the query level. Perhaps the db interaction layer of the application code could be written such that all select queries are built with a variable containing a list of indexes to disable. You could then set that globally (or using dependency injection) and any query using the index would stop using it, without needing to go around and search for uses manually.

coolrhymes · on Oct 11, 2015

Great postmortem report. I am very surprised to see that the DB ops simply deleted the index without even considering the consequences. More importantly, why isn't the 2 changes part of a single migration file? For e.g. in Django, south migrations can have both remove, create indexes in one file and executed together. Also, like some of the guys mentioned, why is the update performed to your entire global cluster. Shouldn't it be incremental like for e.g. one availability zone at a time?

t0mk · on Oct 11, 2015

Would be cool to start the report with a tldr, containing just the essence of the incident, sth like

"dev needed to update index and at time .. submitted two tickets for new ibdex creation and old index removal. at time .. Op processed the removal ticket first which caused outage in service .. It was alerted at time .. and on-call op identified it at time .. He proceeded to ...

Just for people who want to know what's happened but dont care for details.

radicalbyte · on Oct 10, 2015

Is it an option for you to maintain a set of tests which simulate behavior on the tables?

I've have great success doing this with datawarehouses (i.e. star schema, large tables with few writes). You run the tests after each index change on acceptance. It caught a errors.

For OLTP it's harder, you need to record some production workload and reply it. At your scale it's easier said than done, though.

brown9-2 · on Oct 10, 2015

If the change that was introduced required creating a new index, it's unlikely that a test would have though to remove the now-outdated index or testing performance if the create-new/remove-old operations were reversed.

matthewarkin · on Oct 10, 2015

Woot, a postmortem from Stripe! I've been asking for these for over a year and hopefully we'll see more from them in the future.

scurvy · on Oct 10, 2015

"Quick code fixes" almost always make the problem worse and just cause stress and anxiety. It's always better to simply tackle the root cause and fix that.

Unless you're disabling a feature, don't push "quick code fixes". You'll pay for it later.

MichaelGG · on Oct 10, 2015

Except this seems to be counter factual? And the quick fixes were just temporary until the index was finished building. Seems like a perfectly fine thing to do to restore service during an outage, as long as they're rolled back or reviewed later.

scurvy · on Oct 11, 2015

> Seems like a perfectly fine thing to do to restore service during an outage, as long as they're rolled back or reviewed later.

Rolling back later seems to forgotten in most cases (unless it's disabling a feature). Then you end up with this weird behavior and code path that few remember. "Oh yeah we did that when...." You usually only rediscover it after that quick hack is a problem. This is decades of experience here.

mst · on Oct 10, 2015

Given it was temporary and they said it led to the API running "slightly degraded", it sounds like disabling a feature is pretty much exactly what they did.

protomyth · on Oct 10, 2015

So, and engineer can just submit a production change that an database operator will execute? What is the ceremony around this and does Stripe employ DBAs for production. What is the review process for a production change?

jewel · on Oct 10, 2015

The change was to add a new index and then remove the old one. It's something that would have passed review.

The cause was a defect in the tooling. The requests weren't tied together in the way they were displayed and so the DBA removed the old index first.

protomyth · on Oct 10, 2015

That's not a tooling defect, that's a process and personal knowledge of the system defect. No DBA should remove an index without knowing what is replacing it or that code got deployed to make it obsolete. Checking if that index is currently being used would be a minimum. If a DBA cannot stop a deploy you have process problems.

mst · on Oct 10, 2015

The post mortem specifically noted that it looked unused, presumably because the code in master now claimed to depend on the new index.

protomyth · on Oct 10, 2015

I do not mean by looking at the source code. I mean by looking at the production database that you are about to remove an index from. This is a financial institution, and I am really struggling with this post mortem to understand how they do not have more ceremony around production changes. Checks and balances are important.

aidos · on Oct 10, 2015

I suspect that it's more common than you'd guess.

In one situation I was working for a supplier to a massive corporation (a household name). They had all sorts of red tape in the process. Every change required filling in forms and getting official sign off from several parties before getting the release code over.

Anyway, we had a database user that was so restricted the we couldn't run the install process of a new product. I knew they ran the releases as SA so the start of my release script upgraded all the permissions of our user to give full access to everything.

protomyth · on Oct 10, 2015

I guess its very common, but this is a financial institution.

What you describe as a solution strikes me as an amazing way to destroy production data. Upgrading every user can basically lead to one hell of an amazing outside attack. I now only have to get one user / password to compromise your database.

A certain amount of red tape is a needed thing to make sure you don't affect your customer's business.

aidos · on Oct 11, 2015

True, I didn't stop to consider the financial aspect.

Was more just pointing out that there's often this separation between the people making the changes and the people making sure it's safe to release. But really, the safe to release step is often not going to catch things – in many cases because it's barely checked.

To be fair, I only upgraded my user to give full access to my db and I revoked the permissions once the install script had finished running. I'm not a complete monster :-)

protomyth · on Oct 13, 2015

Good... deploys are a strange world sometime.

I just didn't want to hear that you had set yourself up for a CEE (career ending event). People tend to be a tad bit wacky and whacky after a security breech.

codegeek · on Oct 10, 2015

ohh. Could this mean that a payment may have failed ? I did have a failure of a payment around this time but it said "Card declined". So I hope it really was the card and not related to this incident.

matthewarkin · on Oct 10, 2015

Failures caused by this should have returned an api_error, a card_error (like card declined) would have been returned upstream from Stripe.

rdancer · on Oct 11, 2015

"Outrage postmortem"

vostro_mf · on Oct 10, 2015

Sigh, yet another example of hot shot teams using MongoDB just because it's new and sexy. Existing, established tools such as Oracle and Postgres would have offered lots of ways of avoiding such a problem.

dang · on Oct 10, 2015

We detached this subthread from https://news.ycombinator.com/item?id=10366459 and marked it off-topic.

rjbrock · on Oct 10, 2015

I am a huge fan of postgres, but removing an index on postgres like this (especially if it is the primary means of querying a large table) would have the same effects.

This is just a failure of the tool that executes indexing operations, and not of the db itself

pbreit · on Oct 10, 2015

Was it mentioned anywhere this occurred in their mongo environment?

cookiecaper · on Oct 10, 2015

Although it's probably not directly relevant to this problem, I agree with you. MongoDB is the new MySQL; early on the scene and sexy, but at a real cost. There are other solutions doing the same things much better, and your life will be easier if you do your research before jumping into the sexy solution.

Companies should think carefully before introducing MongoDB (or any immature project) into critical production stacks.

depsypher · on Oct 10, 2015

They clearly just need to add Devops. That'll fix everything.

depsypher · on Oct 10, 2015

Wow, message received HN. Don't make fun of Devops! It doesn't seem that long ago that it was hailed as the silver bullet that eliminates the "throw it over the wall" mentality that causes issues exactly like this one.

This issue was caused by a failure in communication between team members. That communication is just as important as good engineering.

kzhahou · on Oct 10, 2015

Your comment was obviously meant as commentary on the DevOps trend, but in fact devops is not so well-defined a trend to make a good target for sniping comments like this. I mean, people talk about "devops" today to mostly just say "we must increase investment in our infrastructure." No one's saying "devops will fix everything." So I don't think people downvoted you for cutting on devops per se, as much as for not presenting any clear POV.

It's not like Node, where evangelists DO go around saying it's the best thing ever and blindly ignore the problems with it. Now... there's a good target for snippy remarks! :-P

geofft · on Oct 10, 2015

I would interpret the message as "Don't make fun," not "Don't make fun of Devops." If you have a technical point to make, being sarcastic and curt is one of the worst possible ways to accurately convey that technical point.

Furthermore, if Stripe already uses devops, then your statement adds nothing to the conversation (maybe this follow-up comment would have, but your original comment didn't), and if they don't, then commentary about a practice that wasn't being used is the definition of an off-topic comment. So in either case, your comment is worthy of downvotes, regardless of the merits of its topic.

I don't think I've ever seen anyone downvoted on HN for being opposed to established wisdom with good reason. I've often seen people downvoted who were being opposed to either established or non-established wisdom in a way that doesn't contribute productively to the conversation. They weren't being downvoted for their beliefs, but for their lack of productive contribution to the conversation.