This is known generally as the "Thundering Herd" problem:
The thundering herd problem occurs when a large number of processes waiting for an event are awoken when that event occurs, but only one process is able to proceed at a time. After the processes wake up, they all demand the resource and a decision must be made as to which process can continue. After the decision is made the remaining processes are put back to sleep, only to wake up again to request access to the resource.
This occurs repeatedly, until there are no more processes to be woken up. Because all the processes use system resources upon waking, it is more efficient if only one process is woken up at a time.
This may render the computer unusable, but it can also be used as a technique if there is no other way to decide which process should continue (for example when programming with semaphores).
Though the phrase is mostly used in computer science, it could be an abstraction of the observation seen when cattle are released from a shed or when wildebeest are crossing the Mara River. In both instances, the movement is suboptimal.
From: http://en.wikipedia.org/wiki/Thundering_herd_problem
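As a toy illustration of that last point (mine, not from the quoted Wikipedia entry), here is a minimal Python sketch using a standard threading.Condition: notify_all() wakes every waiter for a single event (the herd), while notify() wakes exactly one. The run() and waiter() names are made up for the example.

    import threading
    import time

    def run(notify_one, n_waiters=20):
        """Park n_waiters threads on one Condition, fire the event once,
        and count how many threads actually woke up to compete for it."""
        cond = threading.Condition()
        woke = []

        def waiter():
            with cond:
                if cond.wait(timeout=1.0):   # True if notified, False if timed out
                    woke.append(1)           # this thread woke and grabbed the lock

        threads = [threading.Thread(target=waiter) for _ in range(n_waiters)]
        for t in threads:
            t.start()
        time.sleep(0.2)                      # let everyone reach cond.wait()

        with cond:
            if notify_one:
                cond.notify()                # wake exactly one waiter
            else:
                cond.notify_all()            # thundering herd: wake everybody
        for t in threads:
            t.join()
        return len(woke)

    print("notify_all woke", run(notify_one=False), "threads for one event")  # ~20
    print("notify     woke", run(notify_one=True), "threads for one event")   # 1

Every extra wakeup in the first case is a context switch and a lock acquisition spent on nothing, which is exactly the waste described above.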
We actually encounter thundering herd problems on a very regular basis. The Starbucks page has nearly 14M fans, and posts may get tens of thousands of comments/likes with a high update rate. You have a lot of readers on a frequently changed value, which means it is not often current in cache, and you can have a pileup on the database.
Since we encounter this on a regular basis, we have built a few different systems to handle it gracefully.
Unfortunately, the event today was not just a thundering herd, because the value never converged: every client that fetched the value from the DB still considered it invalid and forced another re-fetch.
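For what it's worth, the generic way to keep a hot, frequently-changing key from dogpiling the database is to let exactly one caller rebuild the cache entry while everyone else briefly serves the stale copy or waits. This is only a sketch of that pattern (sometimes called a dogpile lock or lease), not Facebook's actual code; get_from_cache, put_in_cache, and load_from_db are placeholders for whatever client library is really in use.

    import threading
    import time

    _rebuild_locks = {}                # key -> lock guarding its rebuild
    _locks_guard = threading.Lock()    # (a real implementation would expire old entries)

    def cached_get(key, get_from_cache, put_in_cache, load_from_db, ttl=30):
        """Fetch `key`, letting only one caller per key hit the database at a time.

        get_from_cache(key) returns (value, is_stale) or None; the other two
        callables stand in for the real cache/DB clients.
        """
        hit = get_from_cache(key)
        if hit is not None and not hit[1]:
            return hit[0]                          # fresh hit: the common, cheap path

        with _locks_guard:
            lock = _rebuild_locks.setdefault(key, threading.Lock())

        if lock.acquire(blocking=False):           # this caller won the right to rebuild
            try:
                value = load_from_db(key)
                put_in_cache(key, value, ttl)
                return value
            finally:
                lock.release()

        # Someone else is already rebuilding: serve stale data or wait briefly,
        # instead of adding one more query to the pileup on the database.
        if hit is not None:
            return hit[0]
        time.sleep(0.05)
        hit = get_from_cache(key)
        return hit[0] if hit is not None else load_from_db(key)

Note that the failure mode described above (the value never converging, so every freshly loaded value is itself treated as invalid) is exactly the case this pattern cannot fix; it only limits how many clients hammer the DB at once.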
The second strangest part of this outage (to me) was that the cluster "was quickly overwhelmed by hundreds of thousands of queries a second." Does Facebook not have a way to curtail the number of queries being sent to its databases? I'm not a highly experienced programmer or DBA, but I've seen mod_perl websites that can do this at their API layer. It was engineered in once it was realized that the database cluster could only handle so many queries and connections at a time, and they didn't want to lock up the database servers.
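A crude version of that kind of cap at the API layer is just a counted semaphore in front of the database: past the limit, fail fast (or queue) rather than opening more connections. A hypothetical sketch, not a description of Facebook's actual tier; QueryThrottle and DBOverloaded are names I made up.

    import threading

    class DBOverloaded(Exception):
        """Raised when the query cap is hit; callers can back off or serve a fallback."""

    class QueryThrottle:
        """Allow at most max_in_flight concurrent DB queries; shed the rest."""

        def __init__(self, max_in_flight=100):
            self._slots = threading.BoundedSemaphore(max_in_flight)

        def run(self, query_fn, *args, **kwargs):
            if not self._slots.acquire(blocking=False):
                # Refusing this request is cheaper than melting the database.
                raise DBOverloaded("too many queries in flight, try again later")
            try:
                return query_fn(*args, **kwargs)
            finally:
                self._slots.release()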
The strangest thing to me was that the clients had permission to essentially cause a race condition across the whole site. From what I understand, the client ran an API call that forced a re-fetch of a key which apparently only needed to be fetched once, and which therefore could have been staged in advance by a backend process (not running on the frontend site) that updated the cache and prevented a database fetch, or just updated the database, whichever was more necessary. And when the database connection failed under the aforementioned thundering herd of QPS, the client triggered yet another re-fetch from the DB, which again could have been prevented by a backend process pre-loading the new value. So, if my outsider's idea of how Facebook's code works is accurate, this could have been prevented if either the cache/database had been "pre-fetched" in the background without relying on client API calls (roughly the idea sketched below), or if client API calls simply weren't allowed to all modify [and read] a single key in the database. The latter seems less likely than the former, but possible.
(Sorry if I'm speaking out of line or as an uneducated FB user, but according to these comments and RJ's breakdown this is how it appears to me.)
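If I follow the suggestion above correctly, the fix amounts to pushing new values out from a backend job so no client request ever has to notice a miss and rebuild the key itself. A rough sketch under that assumption; db, cache, and publish_config are invented stand-ins, not Facebook internals.

    def publish_config(key, new_value, db, cache, ttl=300):
        """Roll out a new configuration value from a backend job."""
        db.update(key, new_value)         # make the authoritative copy correct first
        cache.set(key, new_value, ttl)    # then overwrite the cached copy directly
        # Clients only ever read the cache; they are never allowed to invalidate
        # this key or trigger a re-fetch, so there is nothing left to stampede over.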
This is a common pitfall when people start using Memcache. Memcache puts less load on the DB, which means you can keep the same DB hardware and scale up your pageviews...until Memcache craps out and all that load is now back on the DB again :(
Since most of their (original) codebase is/was PHP, I would say there is a high likelihood that their db is MySQL (though I guess you could refer to it as Oracle since they own it).
When stuff breaks, I find that this phenomenon can oftentimes make it hard to differentiate the source of your problem from the symptoms of your problem. We usually would do it by going through the logs and seeing what problems showed up first.
Are there any architectures/patterns/methods that can help make it easier to find the source of performance issues?
I often encounter systems that are designed for very short queries processing small numbers of rows, where the number of connections from the app servers is configured to be far greater than the number of CPUs on the database.
Since those systems do very little IO, configuring connection pools to start with many more connections than the DB has CPUs, and to add more when the connections are busy (i.e. the DB is getting slower, probably because it is highly loaded), is guaranteed to cause resource contention on the DB and to escalate the issue when something goes wrong.
It's a total waste, and yet it's the most common configuration in the world.
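A concrete illustration of the alternative (my example, not the commenter's): size the pool from the database's cores and cap it hard, so it cannot grow just because the DB got slow. Shown here with SQLAlchemy's pool settings; the URL and the core count are placeholders.

    from sqlalchemy import create_engine

    DB_SERVER_CORES = 16    # cores on the database host, not on the app servers

    engine = create_engine(
        "postgresql://app@db-host/appdb",
        pool_size=DB_SERVER_CORES * 2,  # small fixed pool, sized to the DB (rule of thumb)
        max_overflow=0,                 # never add connections because the DB slowed down
        pool_timeout=2,                 # fail fast when saturated instead of queueing forever
    )

The point being that a pool which responds to a slow database by opening more connections just adds contention; a hard cap turns overload into quick failures at the edge instead.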
This certainly won't be the first time that a system designed to increase uptime actually reduces it. I've seen a lot of "redundant" systems that are actually less reliable than simple standalones thanks to all the extra complexity of clustering.
I guess at Facebook's scale you have to build in fallbacks but this is a reminder that you can easily do more harm than good.
Very true - the complexity of the system also comes into the equation when things go wrong and it takes real people to figure out why an outage is happening. Highly complex systems imply longer debugging time, and at a certain point a simpler system with theoretically lower uptime can give you higher practical uptime just because engineers can actually understand and debug it.
Sometimes cluster systems are mistaken for high-availability solutions while they are actually load-balancing solutions and can decrease availability due to added complexity and dependencies.
Sometimes the designers and builders of a system make decisions about tradeoffs between availability, scaling, and error recovery that they don't understand until after the system is in the field. Any sufficiently complex system becomes vulnerable to unexpected interactions between subsystems, especially when one or more subsystems fails catastrophically. http://en.wikipedia.org/wiki/System_accident has more on the topic, but you do want to read Perrow's book Normal Accidents: http://books.google.com/books?id=VC5hYoMw4N0C&printsec=f...
I am not sure that increasing uptime was a goal for this system. The way I read the story, this is about a system designed to make it easier to manage a server farm, and that system was misconfigured and not robust against such configuration errors. So it is more about a system designed to keep a server farm consistent. Staging those configuration changes could be a way to decrease the effects of future errors of this kind, as it gives operators time to notice that something is amiss.
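A staged rollout along those lines is simple to sketch; apply_config and looks_healthy below are hypothetical hooks standing in for whatever config-push and monitoring machinery actually exists.

    import time

    def staged_rollout(hosts, new_config, apply_config, looks_healthy,
                       stages=(0.01, 0.10, 0.50, 1.0), soak_seconds=600):
        """Push a config change to 1%, 10%, 50%, then all hosts, pausing between
        stages so operators have time to notice if something is amiss."""
        done = 0
        for fraction in stages:
            target = int(len(hosts) * fraction)
            for host in hosts[done:target]:
                apply_config(host, new_config)
            done = target
            time.sleep(soak_seconds)             # soak period between stages
            if not all(looks_healthy(h) for h in hosts[:done]):
                raise RuntimeError(f"rollout halted after {done} hosts")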
The other thing is that systems that cut across your architecture need to be handled with utter paranoia. They'll be the coupling points that cause global outages.
This is exactly why I've always been skeptical of extensive unit testing. Certainly you want some, but the bugs almost always seem to crop up at the integration points. I always try to focus my tests on as much of the stack as I can reasonably address.
Of course, doing whole-stack testing in a meaningful way at Facebook's scale is a very tricky thing.
It's not an either-or thing, I agree, but I see so much emphasis placed on unit testing lately, often at the expense of integration testing, particularly in the Ruby world, where people mock and stub so much of the code that you're testing something very far removed from the real runtime.
I don't think a lot of devs (perhaps especially the TDD faithful) appreciate how clumsy, immature, and downright primitive modern testing and mocking tools are. Even the best mocking frameworks give you nearly free rein to make up any behavior for mocked objects, and many devs don't appreciate the importance of the gap between mocked behavior and actual behavior.
Which makes integration testing all the more critical.
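A tiny Python illustration of that gap: an unconstrained mock happily "implements" a method the real object does not have, so the unit test passes and the bug only surfaces at the integration boundary. PaymentGateway and checkout are invented for the example.

    from unittest.mock import MagicMock

    class PaymentGateway:
        def charge(self, amount):               # the real object only has charge()
            return {"status": "ok", "amount": amount}

    def checkout(gateway, amount):
        return gateway.charge_card(amount)      # bug: no such method on the real gateway

    # Unit test with a free-form mock: passes, because the mock invents charge_card().
    assert checkout(MagicMock(), 42) is not None

    # Against the real object (or a spec'd mock), the same call fails immediately.
    try:
        checkout(PaymentGateway(), 42)
    except AttributeError as e:
        print("caught at the integration boundary:", e)

    try:
        checkout(MagicMock(spec=PaymentGateway), 42)
    except AttributeError as e:
        print("a spec'd mock catches it too:", e)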
Stick with Mysql
* * * * If it aint broke, dont fix it!
Me too Melissa, and it's out there in the media that a group of hackers caused the problem, is this true, Mr Robert Johnson?
PLEASE !!!! WHAT CAN YOU DO TO HELP THIS FROM EVER HAPPENING AGAIN???????????????? PLEASE!!!!!!!!!!!!!!! CANDY
Kip da updates comin'
Did anyone get a message like I did about someone trying to access your account from another state?
I actually came across a pretty good comment which explains it (to the layman) pretty well:
"Marvin, the server was like a dog chasing its tail...it kept going in circles, but never caught it. They basically had to hold the tail for the dog so he could bite a flea on it. :) LOL"
At least one actually understood it quite well and translated the issue adequately, and hilariously, for the remaining audience:
Angee Lening: "Marvin, the server was like a dog chasing its tail... it kept going in circles, but never caught it. They basically had to hold the tail for the dog so he could bite a flea on it. :) LOL"
They built their own custom DNS server? Most of the failures people encountered were failures to contact the nameserver itself. Perhaps in the rush to try to fix it someone screwed that up as well.
It seems like that was their attempt to "shut down" the site. It's a pretty effective way, too... The easiest way to stop the stampede was to drop off the net completely. By tweaking their DNS, they were able to give themselves enough room to breathe. Then they could slowly start to bring people back.
That's just a guess. But from their post, it seems reasonable.
So umm no mention of the 4chan DOS attack? I mean, not that I hang out there or anything, but a friend told me that they organized an attack. You'know. Jus sayin. /b/ye
The Russians were barely able to slow down Facebook back in 2009. Unless someone on 4chan has a bigger botnet and just decided for lulz to DDoS Facebook, 4chan would not even come close to generating enough traffic to register on the #2 website on the Internet.
Completely possible. Except that the posts calling for the attack were made prior to the widely reported outages. But I can't argue your point. I just found it curious that no one mentioned the planned attack.