This is known generally as the "Thundering Herd" problem:
The thundering herd problem occurs when a large number of processes waiting for an event are awoken when that event occurs, but only one process is able to proceed at a time. After the processes wake up, they all demand the resource and a decision must be made as to which process can continue. After the decision is made the remaining processes are put back to sleep, only to wake up again to request access to the resource.
This occurs repeatedly, until there are no more processes to be woken up. Because all the processes use system resources upon waking, it is more efficient if only one process is woken up at a time.
This may render the computer unusable, but it can also be used as a technique if there is no other way to decide which process should continue (for example when programming with semaphores).
Though the phrase is mostly used in computer science, it could be an abstraction of the observation seen when cattle are released from a shed or when wildebeest are crossing the Mara River. In both instances, the movement is suboptimal.
From: http://en.wikipedia.org/wiki/Thundering_herd_problem
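As a toy illustration of that last point (mine, not from the quoted Wikipedia entry), here is a minimal Python sketch using a standard threading.Condition: notify_all() wakes every waiter for a single event (the herd), while notify() wakes exactly one. The run() and waiter() names are made up for the example.

    import threading
    import time

    def run(notify_one, n_waiters=20):
        """Park n_waiters threads on one Condition, fire the event once,
        and count how many threads actually woke up to compete for it."""
        cond = threading.Condition()
        woke = []

        def waiter():
            with cond:
                if cond.wait(timeout=1.0):   # True if notified, False if timed out
                    woke.append(1)           # this thread woke and grabbed the lock

        threads = [threading.Thread(target=waiter) for _ in range(n_waiters)]
        for t in threads:
            t.start()
        time.sleep(0.2)                      # let everyone reach cond.wait()

        with cond:
            if notify_one:
                cond.notify()                # wake exactly one waiter
            else:
                cond.notify_all()            # thundering herd: wake everybody
        for t in threads:
            t.join()
        return len(woke)

    print("notify_all woke", run(notify_one=False), "threads for one event")  # ~20
    print("notify     woke", run(notify_one=True), "threads for one event")   # 1

Every extra wakeup in the first case is a context switch and a lock acquisition spent on nothing, which is exactly the waste described above.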
We actually encounter thundering herd problems on a very regular basis. The Starbucks page has nearly 14M fans, and posts may get tens of thousands of comments/likes with a high update rate. You have a lot of readers on a frequently changed value, which means it is not often current in cache, and you can have a pileup on the database.
Since we encounter this on a regular basis, we have built a few different systems to handle it gracefully.
Unfortunately, the event today was not just a thundering herd, because the value never converged: every client that fetched the value from the DB still considered it invalid and forced another re-fetch.
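For what it's worth, the generic way to keep a hot, frequently-changing key from dogpiling the database is to let exactly one caller rebuild the cache entry while everyone else briefly serves the stale copy or waits. This is only a sketch of that pattern (sometimes called a dogpile lock or lease), not Facebook's actual code; get_from_cache, put_in_cache, and load_from_db are placeholders for whatever client library is really in use.

    import threading
    import time

    _rebuild_locks = {}                # key -> lock guarding its rebuild
    _locks_guard = threading.Lock()    # (a real implementation would expire old entries)

    def cached_get(key, get_from_cache, put_in_cache, load_from_db, ttl=30):
        """Fetch `key`, letting only one caller per key hit the database at a time.

        get_from_cache(key) returns (value, is_stale) or None; the other two
        callables stand in for the real cache/DB clients.
        """
        hit = get_from_cache(key)
        if hit is not None and not hit[1]:
            return hit[0]                          # fresh hit: the common, cheap path

        with _locks_guard:
            lock = _rebuild_locks.setdefault(key, threading.Lock())

        if lock.acquire(blocking=False):           # this caller won the right to rebuild
            try:
                value = load_from_db(key)
                put_in_cache(key, value, ttl)
                return value
            finally:
                lock.release()

        # Someone else is already rebuilding: serve stale data or wait briefly,
        # instead of adding one more query to the pileup on the database.
        if hit is not None:
            return hit[0]
        time.sleep(0.05)
        hit = get_from_cache(key)
        return hit[0] if hit is not None else load_from_db(key)

Note that the failure mode described above (the value never converging, so every freshly loaded value is itself treated as invalid) is exactly the case this pattern cannot fix; it only limits how many clients hammer the DB at once.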
The second strangest part of this outage (to me) was that the cluster "was quickly overwhelmed by hundreds of thousands of queries a second." Does Facebook not have a way to curtail the number of queries being sent to its databases? I'm not a highly experienced programmer or DBA, but I've seen mod_perl websites that can do this at their API layer. It was engineered in once it was realized that the database cluster could only handle so many queries and connections at a time, and they didn't want to lock up the database servers.
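A crude version of that kind of cap at the API layer is just a counted semaphore in front of the database: past the limit, fail fast (or queue) rather than opening more connections. A hypothetical sketch, not a description of Facebook's actual tier; QueryThrottle and DBOverloaded are names I made up.

    import threading

    class DBOverloaded(Exception):
        """Raised when the query cap is hit; callers can back off or serve a fallback."""

    class QueryThrottle:
        """Allow at most max_in_flight concurrent DB queries; shed the rest."""

        def __init__(self, max_in_flight=100):
            self._slots = threading.BoundedSemaphore(max_in_flight)

        def run(self, query_fn, *args, **kwargs):
            if not self._slots.acquire(blocking=False):
                # Refusing this request is cheaper than melting the database.
                raise DBOverloaded("too many queries in flight, try again later")
            try:
                return query_fn(*args, **kwargs)
            finally:
                self._slots.release()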
The strangest thing to me was that the clients had permission to essentially cause a race condition across the whole site. From what I understand, the client ran an API call that forced a re-fetch of a key which apparently only needed to be fetched once, and which therefore could have been staged in advance by a backend process (not running on the frontend site) that updated the cache and prevented a database fetch, or just updated the database, whichever was more necessary. And when the database connection failed under the aforementioned thundering herd of QPS, the client triggered yet another re-fetch from the DB, which again could have been prevented by a backend process pre-loading the new value. So, if my outsider's idea of how Facebook's code works is accurate, this could have been prevented if either the cache/database had been "pre-fetched" in the background without relying on client API calls (roughly the idea sketched below), or if client API calls simply weren't allowed to all modify [and read] a single key in the database. The latter seems less likely than the former, but possible.
(Sorry if I'm speaking out of line or as an uneducated FB user, but according to these comments and RJ's breakdown this is how it appears to me.)
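If I follow the suggestion above correctly, the fix amounts to pushing new values out from a backend job so no client request ever has to notice a miss and rebuild the key itself. A rough sketch under that assumption; db, cache, and publish_config are invented stand-ins, not Facebook internals.

    def publish_config(key, new_value, db, cache, ttl=300):
        """Roll out a new configuration value from a backend job."""
        db.update(key, new_value)         # make the authoritative copy correct first
        cache.set(key, new_value, ttl)    # then overwrite the cached copy directly
        # Clients only ever read the cache; they are never allowed to invalidate
        # this key or trigger a re-fetch, so there is nothing left to stampede over.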
This is a common pitfall when people start using Memcache. Memcache puts less load on the DB, which means you can keep the same DB hardware and scale up your pageviews...until Memcache craps out and all that load is now back on the DB again :(
Since most of their (original) codebase is/was PHP, I would say there is a high likelihood that their db is MySQL (though I guess you could refer to it as Oracle since they own it).
When stuff breaks, I find that this phenomenon can oftentimes make it hard to differentiate the source of your problem from the symptoms of your problem. We usually would do it by going through the logs and seeing what problems showed up first.
Are there any architectures/patterns/methods that can help make it easier to find the source of performance issues?
I often encounter systems that are designed for very short queries processing small numbers of rows, where the number of connections from the app servers is configured to be far greater than the number of CPUs on the database.
Since those systems do very little IO, configuring connection pools to start with many more connections than the DB has CPUs, and to add more when the connections are busy (i.e. the DB is getting slower, probably because it is highly loaded), is guaranteed to cause resource contention on the DB and to escalate the issue when something goes wrong.
It's a total waste, and yet it's the most common configuration in the world.
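A concrete illustration of the alternative (my example, not the commenter's): size the pool from the database's cores and cap it hard, so it cannot grow just because the DB got slow. Shown here with SQLAlchemy's pool settings; the URL and the core count are placeholders.

    from sqlalchemy import create_engine

    DB_SERVER_CORES = 16    # cores on the database host, not on the app servers

    engine = create_engine(
        "postgresql://app@db-host/appdb",
        pool_size=DB_SERVER_CORES * 2,  # small fixed pool, sized to the DB (rule of thumb)
        max_overflow=0,                 # never add connections because the DB slowed down
        pool_timeout=2,                 # fail fast when saturated instead of queueing forever
    )

The point being that a pool which responds to a slow database by opening more connections just adds contention; a hard cap turns overload into quick failures at the edge instead.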
This certainly won't be the first time that a system designed to increase uptime actually reduces it. I've seen a lot of "redundant" systems that are actually less reliable than simple standalones thanks to all the extra complexity of clustering.
I guess at Facebook's scale you have to build in fallbacks but this is a reminder that you can easily do more harm than good.
Very true - the complexity of the system also comes into the equation when things go wrong and it takes real people to figure out why an outage is happening. Highly complex systems imply longer debugging time, and at a certain point a simpler system with theoretically lower uptime can give you higher practical uptime just because engineers can actually understand and debug it.
Sometimes cluster systems are mistaken for high-availability solutions while they are actually load-balancing solutions and can decrease availability due to added complexity and dependencies.
Sometimes the designers and builders of a system make decisions about tradeoffs between availability, scaling, and error recovery that they don't understand until after the system is in the field. Any sufficiently complex system becomes vulnerable to unexpected interactions between subsystems, especially when one or more subsystems fails catastrophically. http://en.wikipedia.org/wiki/System_accident has more on the topic, but you do want to read Perrow's book Normal Accidents: http://books.google.com/books?id=VC5hYoMw4N0C&printsec=f...
I am not sure that increasing uptime was a goal for this system. The way I read the story, this is about a system designed to make it easier to manage a server farm, and that system was misconfigured and not robust against such configuration errors. So it is more about a system designed to keep a server farm consistent. Staging those configuration changes could be a way to decrease the effects of future errors of this kind, as it gives operators time to notice that something is amiss.
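A staged rollout along those lines is simple to sketch; apply_config and looks_healthy below are hypothetical hooks standing in for whatever config-push and monitoring machinery actually exists.

    import time

    def staged_rollout(hosts, new_config, apply_config, looks_healthy,
                       stages=(0.01, 0.10, 0.50, 1.0), soak_seconds=600):
        """Push a config change to 1%, 10%, 50%, then all hosts, pausing between
        stages so operators have time to notice if something is amiss."""
        done = 0
        for fraction in stages:
            target = int(len(hosts) * fraction)
            for host in hosts[done:target]:
                apply_config(host, new_config)
            done = target
            time.sleep(soak_seconds)             # soak period between stages
            if not all(looks_healthy(h) for h in hosts[:done]):
                raise RuntimeError(f"rollout halted after {done} hosts")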
The other thing is that systems that cut across your architecture need to be handled with utter paranoia. They'll be the coupling points that cause global outages.
This is exactly why I've always been skeptical of extensive unit testing. Certainly you want some, but the bugs almost always seem to crop up at the integration points. I always try to focus my tests on as much of the stack as I can reasonably address.
Of course, doing whole-stack testing in a meaningful way at Facebook's scale is a very tricky thing.
It's not an either-or thing, I agree, but I see so much emphasis placed on unit testing lately, often at the expense of integration testing, particularly in the Ruby world, where people mock and stub so much of the code that you're testing something very far removed from the real runtime.
I don't think a lot of devs (perhaps especially the TDD faithful) appreciate how clumsy, immature, and downright primitive modern testing and mocking tools are. Even the best mocking frameworks give you nearly free rein to make up any behavior for mocked objects, and many devs don't appreciate the importance of the gap between mocked behavior and actual behavior.
Which makes integration testing all the more critical.
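A tiny Python illustration of that gap: an unconstrained mock happily "implements" a method the real object does not have, so the unit test passes and the bug only surfaces at the integration boundary. PaymentGateway and checkout are invented for the example.

    from unittest.mock import MagicMock

    class PaymentGateway:
        def charge(self, amount):               # the real object only has charge()
            return {"status": "ok", "amount": amount}

    def checkout(gateway, amount):
        return gateway.charge_card(amount)      # bug: no such method on the real gateway

    # Unit test with a free-form mock: passes, because the mock invents charge_card().
    assert checkout(MagicMock(), 42) is not None

    # Against the real object (or a spec'd mock), the same call fails immediately.
    try:
        checkout(PaymentGateway(), 42)
    except AttributeError as e:
        print("caught at the integration boundary:", e)

    try:
        checkout(MagicMock(spec=PaymentGateway), 42)
    except AttributeError as e:
        print("a spec'd mock catches it too:", e)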
Stick with Mysql
* * * * If it aint broke, dont fix it!
Me too Melissa, and it's out there in the media that a group of hackers caused the problem, is this true, Mr Robert Johnson?
PLEASE !!!! WHAT CAN YOU DO TO HELP THIS FROM EVER HAPPENING AGAIN???????????????? PLEASE!!!!!!!!!!!!!!! CANDY
Kip da updates comin'
Did anyone get a message like I did about someone trying to access your account from another state?
I actually came across a pretty good comment which explains it (to the layman) pretty well:
"Marvin, the server was like a dog chasing its tail...it kept going in circles, but never caught it. They basically had to hold the tail for the dog so he could bite a flea on it. :) LOL"
At least one actually understood it quite well and translated the issue adequately, and hilariously, for the remaining audience:
Angee Lening: "Marvin, the server was like a dog chasing its tail... it kept going in circles, but never caught it. They basically had to hold the tail for the dog so he could bite a flea on it. :) LOL"
They built their own custom DNS server? Most of the failures people encountered were failures to contact the nameserver itself. Perhaps in the rush to try to fix it someone screwed that up as well.
It seems like that was their attempt to "shut down" the site. It's a pretty effective way, too... The easiest way to stop the stampede was to drop off the net completely. By tweaking their DNS, they were able to give themselves enough room to breathe. Then they could slowly start to bring people back.
That's just a guess. But from their post, it seems reasonable.
So umm no mention of the 4chan DOS attack? I mean, not that I hang out there or anything, but a friend told me that they organized an attack. You'know. Jus sayin. /b/ye
The Russians were barely able to slow down Facebook back in 2009. Unless someone on 4chan has a bigger botnet and just decided for lulz to DDoS Facebook, 4chan would not even come close to generating enough traffic to register on the #2 website on the Internet.
Completely possible. Except that the posts calling for the attack were made prior to the widely reported outages. But I can't argue your point. I just found it curious that no one mentioned the planned attack.