Details on today's Facebook outage (facebook.com)
108 points by far33d on Sept 24, 2010 | 51 comments



This is known generally as the "Thundering Herd" problem:

The thundering herd problem occurs when a large number of processes waiting for an event are awoken when that event occurs, but only one process is able to proceed at a time. After the processes wake up, they all demand the resource and a decision must be made as to which process can continue. After the decision is made the remaining processes are put back to sleep, only to wake up again to request access to the resource.

This occurs repeatedly, until there are no more processes to be woken up. Because all the processes use system resources upon waking, it is more efficient if only one process is woken up at a time.

This may render the computer unusable, but it can also be used as a technique if there is no other way to decide which process should continue (for example when programming with semaphores).

Though the phrase is mostly used in computer science, it could be an abstraction of the observation seen when cattle are released from a shed or when wildebeest are crossing the Mara River. In both instances, the movement is suboptimal.

From: http://en.wikipedia.org/wiki/Thundering_herd_problem
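A minimal sketch of the wake-one idea behind avoiding a thundering herd, using Python's threading primitives (illustrative only, not how any particular kernel or Facebook implements it): when the resource is released, wake exactly one waiter with notify() instead of waking the whole herd with notify_all().

    import threading

    lock = threading.Lock()
    resource_free = threading.Condition(lock)
    holder = None  # which worker currently owns the resource

    def worker(name):
        global holder
        with resource_free:
            # The predicate loop means a woken waiter re-checks the state
            # and goes back to sleep if it lost the race.
            while holder is not None:
                resource_free.wait()
            holder = name
        # ... use the resource outside the lock ...
        with resource_free:
            holder = None
            resource_free.notify()  # wake exactly one waiter
            # notify_all() here would wake the whole herd, and all but one
            # would immediately go back to sleep; that is the wasted work
            # the passage above describes.

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(50)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()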


We actually encounter Thundering Herd problems on a very regular basis. The Starbucks page has nearly 14M fans, and posts may get tens of thousands of comments/likes at a high update rate. You have a lot of readers of a frequently changed value, which means it is often not current in cache and you can get a pileup on the database.

Since we encounter this on a regular basis we have built a few different systems to gracefully handle them.

Unfortunately, the event today was not just a thundering herd because the value never converged. All clients who fetched the value from a db thought it was invalid and forced a re-fetch.
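For readers unfamiliar with how such pileups are usually avoided, here is a generic sketch of the "lock key" pattern, written in Python with an in-process stand-in for memcached's get/add/set (this is illustrative, not a description of Facebook's actual systems): only the client that wins a set-if-absent on a lock key goes to the database; everyone else backs off and re-reads the cache.

    import threading
    import time

    _store = {}
    _store_lock = threading.Lock()

    def cache_get(key):
        with _store_lock:
            return _store.get(key)

    def cache_add(key, value):
        """Set only if the key is absent (memcached 'add' semantics)."""
        with _store_lock:
            if key in _store:
                return False
            _store[key] = value
            return True

    def cache_set(key, value):
        with _store_lock:
            _store[key] = value

    def query_database(key):
        time.sleep(0.1)  # stand-in for the expensive query
        return "fresh value for %s" % key

    def get(key):
        while True:
            value = cache_get(key)
            if value is not None:
                return value
            # Only the client that wins the 'add' recomputes; the rest
            # back off briefly and re-check the cache instead of piling
            # onto the database.
            if cache_add("lock:" + key, 1):
                try:
                    value = query_database(key)
                    cache_set(key, value)
                    return value
                finally:
                    with _store_lock:
                        _store.pop("lock:" + key, None)
            time.sleep(0.05)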


Would monitoring the rate of invalidation and triggering an event handler help?


The second strangest part of this outage (to me) was that the cluster "was quickly overwhelmed by hundreds of thousands of queries a second." Does Facebook not have a way to curtail the number of queries being sent to its databases? I'm not a highly experienced programmer or DBA, but I've seen mod_perl websites that can do this at their API layer. It was engineered in once it was realized that the database cluster could only handle so many queries and connections at a time, and they didn't want to lock up the database servers.
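One common way to curtail queries, sketched below under the assumption of a shared application layer sitting in front of the database (the limit and names are made up for illustration), is to cap the number of in-flight queries and fail fast once the cap is hit:

    import threading

    MAX_INFLIGHT_QUERIES = 64  # hypothetical limit
    _db_slots = threading.BoundedSemaphore(MAX_INFLIGHT_QUERIES)

    class DatabaseOverloaded(Exception):
        pass

    def throttled_query(run_query, *args, **kwargs):
        # Fail fast rather than queueing unbounded work behind a database
        # that is already saturated.
        if not _db_slots.acquire(blocking=False):
            raise DatabaseOverloaded("too many concurrent queries, try again later")
        try:
            return run_query(*args, **kwargs)
        finally:
            _db_slots.release()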

The strangest thing to me was that the clients had permission to essentially cause a race condition across the whole site. From what I understand, a client API call either forced a re-fetch of a key that apparently only needed to be fetched once (and thus could theoretically have been staged in advance by a backend application, not running on the frontend site, that updates the cache and prevents a database fetch, or just updates the database, whichever is more appropriate), or, when the database connection failed due to the aforementioned thundering herd of QPS, it also triggered a re-fetch from the db (which again could have been prevented by a db app pre-loading the new value). So, if my outsider's idea of how Facebook's code works is accurate, this could have been prevented if either the cache/database was "pre-fetched" in the background without using client API calls, or if client API calls simply weren't allowed to all modify [and read] a single key in the database. The latter seems less likely than the former, but possible.

(Sorry if I'm speaking out of line or as an uneducated FB user, but according to these comments and RJ's breakdown this is how it appears to me)
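If I'm reading it right, the suggestion amounts to something like the sketch below: warm the cache from an offline deploy step so no client request ever has to fall back to the database for the new value (the db/cache objects and method names here are hypothetical):

    def deploy_config_value(key, new_value, db, cache):
        # Make the authoritative copy current first...
        db.write(key, new_value)
        # ...then warm the cache in the same step, so clients that read
        # the key always see a valid cached copy and never take the
        # "value looks wrong, go back to the DB" path that fed the loop.
        cache.set(key, new_value)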


This is a common pitfall when people start using Memcache. Memcache puts less load on the DB, which means you can keep the same DB hardware and scale up your pageviews...until Memcache craps out and all that load is now back on the DB again :(


Anybody know what DB their memcached was fronting? Oracle (in dedicated mode)?


Since most of their (original) codebase is/was PHP, I would say there is a high likelihood that their db is MySQL (though I guess you could refer to it as Oracle since they own it).


Facebook uses MySQL.


When stuff breaks I find that this phenomenon can often make it hard to differentiate the source of your problem from its symptoms. We would usually do it by going through the logs and seeing what problems showed up first.

Are there any architectures/patterns/methods that can help make it easier to find the source of performance issues?


I'm often encountering systems that are designed for very short queries processing small numbers of rows where the number of connections from the app servers is configured to be far greater than the number of CPUs.

Since those systems are doing very little IO, configuring connection pools to start with many more connections than the DB has CPUs, and to add more when the connections are busy (i.e. the DB is getting slower, probably because it is highly loaded), is guaranteed to cause resource contention on the DB and to escalate the issue if something goes wrong.

It's a total waste, and yet the most common configuration in the world.
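A sketch of the sizing rule being argued for, assuming short CPU-bound queries (the pool class and numbers are illustrative, not from any particular product): keep the pool fixed at roughly the database host's core count and shed load instead of growing the pool when the database slows down.

    from queue import Queue, Empty

    class FixedConnectionPool:
        def __init__(self, connect, size):
            self._pool = Queue(maxsize=size)
            for _ in range(size):
                self._pool.put(connect())

        def acquire(self, timeout=1.0):
            # Block briefly, then give up: piling more work onto a slow
            # database only escalates the contention.
            try:
                return self._pool.get(timeout=timeout)
            except Empty:
                raise RuntimeError("database saturated; shed load instead")

        def release(self, conn):
            self._pool.put(conn)

    # e.g. pool = FixedConnectionPool(open_db_connection, size=16)
    # where 16 is roughly the DB host's core count, not hundreds.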


I've also heard this called a Cache Stampede, since it's commonly caused by the sudden invalidation of a cache, which was the case for this outage.


i know this isn't reddit, but when you say "thundering herd", do you mean the processes, or the comments on the story?


This certainly won't be the first time that a system designed to increase uptime actually reduces it. I've seen a lot of "redundant" systems that are actually less reliable than simple standalones thanks to all the extra complexity of clustering.

I guess at Facebook's scale you have to build in fallbacks but this is a reminder that you can easily do more harm than good.


Very true - the complexity of the system also comes into the equation when things go wrong and it takes real people to figure out why an outage is happening. Highly complex systems imply longer debugging time, and at a certain point a theoretically lower uptime can give you higher practical uptime just because engineers can actually understand and debug the simpler system.


Sometimes cluster systems are mistaken for high-availability solutions while they are actually load-balancing solutions and can decrease availability due to added complexity and dependencies.


Sometimes designers and builders of a system make decisions about tradeoffs between availability, scaling, and error recovery that they don't understand until after the system is in the field. Any sufficiently complex system becomes vulnerable to unexpected interactions between subsystems, especially when one or more subsystems fail catastrophically. http://en.wikipedia.org/wiki/System_accident has more on the topic, but you do want to read Perrow's book Normal Accidents: http://books.google.com/books?id=VC5hYoMw4N0C&printsec=f...


I am not sure that increasing uptime was a goal for this system. The way I read the story, this is about a system designed to make it easier to manage a server farm, which was misconfigured and not robust against such configuration errors. So, it is more about a system designed to keep a server farm consistent. Staging those configuration changes could be a way to decrease the effects of future errors like this, as it gives operators time to notice that something is amiss.


The other thing is that systems that cut across your architecture need to be handled with utter paranoia. They'll be the coupling points that cause global outages.


This is exactly why I've always been skeptical of extensive unit testing. Certainly you want some but the bugs almost always seem to crop up at the integration points. I always try to focus my tests on as much of the stack as I can reasonably address.

Of course, doing whole-stack testing in a meaningful way at Facebook's scale is a very tricky thing.


Unit testing does not preclude integration testing; they complement each other excellently.


It's not an either-or thing, I agree, but I see so much emphasis placed on unit testing lately, often at the expense of integration testing, particularly in the Ruby world, where people mock and stub so much of the code that you end up testing something very far removed from the real runtime.


I don't think a lot of devs (perhaps especially the TDD faithful) appreciate how clumsy, immature, and downright primitive modern testing and mocking tools are. Even the best mocking frameworks give you nearly free rein to make up any behavior for mocked objects, and many devs don't appreciate the importance of the gap between mocked behavior and actual behavior.

Which makes integration testing all the more critical.
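An invented example of the gap being described (nothing from the thread, just Python's unittest.mock): a mock will accept behavior the real dependency would reject, so the unit test stays green while only an integration-style test against the real object catches the bug.

    from unittest import mock

    class RealCache:
        def get(self, key):
            if not isinstance(key, str):
                raise TypeError("keys must be strings")
            return None

    def lookup(cache, user_id):
        # Bug: passes an int where the real cache requires a string key.
        return cache.get(user_id)

    def test_with_mock_passes():
        fake = mock.Mock()
        fake.get.return_value = "cached"
        assert lookup(fake, 42) == "cached"  # green, but meaningless

    def test_against_real_object_fails():
        try:
            lookup(RealCache(), 42)
            assert False, "expected TypeError"
        except TypeError:
            pass  # only this test catches the bug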


My favorite comments on the FB page:

  Stick with Mysql

  * * * * If it aint broke, dont fix it!

  Me too Melissa, and it's out there in the media that a group of hackers caused the problem, is this true, Mr Robert Johnson?

  PLEASE !!!! WHAT CAN YOU DO TO HELP THIS FROM EVER HAPPENING AGAIN???????????????? PLEASE!!!!!!!!!!!!!!! CANDY

  Kip da updates comin'

  Did anyone get a message like I did about someone trying to access your account from another state?


I actually came across a pretty good comment which explains it (to the layman) pretty well:

"Marvin, the server was like a dog chasing its tail...it kept going in circles, but never caught it. They basically had to hold the tail for the dog so he could bite a flea on it. :) LOL"


"This means that Facebook is a database-driven app :D LOL , just refactor FB, you know it is a mess."


Yeah, I noticed that too, it's kind of sad when you think about it


The text of the article is:

"You are using an incompatible web browser.

Sorry, we're not cool enough to support your browser. Please keep it real with one of the following browsers:"

This is why I don't use Facebook. It's not 1990 anymore. You don't need User-Agent sniffing.


Whoa, who did User-Agent sniffing in 1990?


Tim Berners-Lee


That was an excellent description of the problem. Too excellent, it seems, for many of the commenters. :-p


Yeah, I don't know how regular users find posts like this... but that was way over everyone's head.


At least one actually understood it quite well and translated the issue adequately, and hilariously, for the remaining audience:

   Angee Lening Marvin:
   the server was like a dog chasing its tail...
   it kept going in circles, but never caught it. 
   They basically had to hold the tail for the dog so he 
   could bite a flea on it. :) LOL


Agree, ideally they should have made two posts: one like this, and another for the 'normal' human being.


This is how http://twitter.com/facebook (how ironic)


Ironic? I don't really think so. They aren't exactly competitors.


For a second there, while reading the comments, I wasn't entirely sure I hadn't accidentally been redirected to a redesigned YouTube.

Interesting article, terrible comments.


They built their own custom DNS server? Most of the failures that people encountered were a failure to contact the nameserver itself. Perhaps in the rush to try to fix it someone screwed up that as well.


It seems like that was their attempt to "shut down" the site. It's a pretty effective way, too... The easiest way to stop the stampede was to drop off the net completely. By tweaking their DNS, they were able to give themselves enough room to breathe. Then they could slowly start to bring people back.

That's just a guess. But from their post, it seems reasonable.


This was a result of us throttling traffic to the site - not the original outage.


Or maybe disabling DNS was how they purposely took down the site, then slowly let people back in, as they said.


DNS was never mentioned in that post. Further, it's a curious way of controlling flow, though I suppose in a moment of panic...if nothing else works...


Perhaps the DNS issue was their way of shutting off the site.



So umm no mention of the 4chan DOS attack? I mean, not that I hang out there or anything, but a friend told me that they organized an attack. You'know. Jus sayin. /b/ye


DOSing Facebook? It would require a huge effort to create a measurable increase in Facebook's traffic.


Not if they found some expensive requests to use and/or if they have access to large botnets.


The Russians were barely able to slow down Facebook back in 2009. Unless someone on 4chan has a bigger botnet and just decided for lulz to DDoS Facebook, 4chan would not even come close to generating enough traffic to register on the #2 website on the Internet.


Sure, if you found some really, really expensive requests, perhaps. As has been stated, it's pretty hard to put a bump in FB's traffic.

The more likely explanation is that someone on 4chan noticed FB was down -- and then started talking about DOSing it.


Completely possible. Except the posts calling for the attack were made prior to the widely known outages. But I can't argue your point. I just found it curious that no one mentioned the planned attack.


Anything is possible though, right? If you target the right weakness?


Don't be too naive. The 4chan guys are great but not at that scale.

After Facebook's outage started, there were like 9128 threads in /b/ with trolls claiming it was their success.



