
This is known generally as the "Thundering Herd" problem:

The thundering herd problem occurs when a large number of processes waiting for an event are awoken when that event occurs, but only one process is able to proceed at a time. After the processes wake up, they all demand the resource and a decision must be made as to which process can continue. After the decision is made the remaining processes are put back to sleep, only to wake up again to request access to the resource.

This occurs repeatedly, until there are no more processes to be woken up. Because all the processes use system resources upon waking, it is more efficient if only one process is woken up at a time.

This may render the computer unusable, but it can also be used as a technique if there is no other way to decide which process should continue (for example when programming with semaphores).

Though the phrase is mostly used in computer science, it could be an abstraction of the observation seen when cattle are released from a shed or when wildebeest are crossing the Mara River. In both instances, the movement is suboptimal.

From: http://en.wikipedia.org/wiki/Thundering_herd_problem
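
To make the wake-one-versus-wake-all distinction concrete, here is a minimal Python sketch (an illustration, not from the article): each notify() hands one item to exactly one waiter, whereas the commented-out notify_all() would wake every blocked thread only for most of them to re-check the condition and go back to sleep.

  import threading
  import time

  cond = threading.Condition()
  items = 0

  def waiter(worker_id):
      global items
      with cond:
          # Each waiter sleeps until an item is available for it.
          while items == 0:
              cond.wait()
          items -= 1
      print("worker %d got an item" % worker_id)

  workers = [threading.Thread(target=waiter, args=(i,)) for i in range(8)]
  for w in workers:
      w.start()

  for _ in range(8):
      with cond:
          items += 1
          cond.notify()        # wake exactly one waiter per item
          # cond.notify_all()  # wakes all of them; most re-check and sleep again
      time.sleep(0.01)

  for w in workers:
      w.join()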




We actually encounter Thundering Herd problems on a very regular basis. The Starbucks page has nearly 14M fans, and posts may get tens of thousands of comments/likes with a high update rate. You have a lot of readers on a frequently changed value, which means it is often not current in cache, and you can get a pileup on the database.

Since we encounter this on a regular basis, we have built a few different systems to handle it gracefully.

Unfortunately, the event today was not just a thundering herd, because the value never converged: every client that fetched the value from the db thought it was invalid and forced a re-fetch.
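
Facebook hasn't published the internals of those systems, but a common way to keep a herd of readers off the database when a hot key goes missing is a short-lived "re-fetch lock", so only one client goes to the DB while the rest briefly poll the cache. A rough Python sketch under that assumption (the cache/db objects and key names are placeholders, not Facebook's API):

  import random
  import time

  def get_with_single_flight(cache, db, key, ttl=60, lock_ttl=5):
      """Fetch `key` from cache; on a miss, let only one caller hit the DB.

      `cache` is assumed to expose memcached-style get/add/set/delete,
      and `db` a load(key) method -- both are placeholders.
      """
      value = cache.get(key)
      if value is not None:
          return value

      lock_key = key + ":refetch-lock"
      # add() is atomic: exactly one caller wins the lock and reads the DB.
      if cache.add(lock_key, "1", expire=lock_ttl):
          try:
              value = db.load(key)
              cache.set(key, value, expire=ttl)
              return value
          finally:
              cache.delete(lock_key)

      # Everyone else backs off briefly and re-reads the cache instead of
      # piling onto the database.
      for _ in range(50):
          time.sleep(0.05 + random.random() * 0.05)
          value = cache.get(key)
          if value is not None:
              return value
      return None  # give up and let the caller serve a stale/default value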


Would monitoring the rate of invalidation and triggering an event handler help?
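
Possibly; one crude version of that idea is a per-key breaker that counts invalidations over a sliding window and, once it trips, stops honoring the "this value is invalid" signal. A hypothetical Python sketch (nothing here reflects Facebook's actual tooling):

  import time
  from collections import defaultdict, deque

  class InvalidationBreaker:
      """Trips when a key is invalidated too often within a short window."""

      def __init__(self, max_events=100, window_seconds=10):
          self.max_events = max_events
          self.window = window_seconds
          self.events = defaultdict(deque)

      def record(self, key):
          now = time.monotonic()
          q = self.events[key]
          q.append(now)
          # Drop invalidations that have aged out of the window.
          while q and now - q[0] > self.window:
              q.popleft()

      def tripped(self, key):
          return len(self.events[key]) > self.max_events

  # Usage: record() each invalidation; before forcing a re-fetch because a
  # value looks invalid, check tripped() and fall back to the cached copy.
  breaker = InvalidationBreaker(max_events=100, window_seconds=10)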


The second strangest part of this outage (to me) was that the cluster "was quickly overwhelmed by hundreds of thousands of queries a second." Does Facebook not have a way to curtail the number of queries being sent to its databases? I'm not a highly experienced programmer or DBA, but I've seen mod_perl websites that can do this in their API layer. It was engineered in once it was realized that the database cluster could only handle so many queries and connections at a time, and they didn't want to lock up the database servers.
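
The usual way an API layer does that kind of curtailment is a fixed number of "query slots" in front of the database, with excess requests shed instead of queued. A rough Python sketch of the idea (the names and limits are invented):

  import threading

  # Allow at most N in-flight queries to the database cluster; anything
  # beyond that is shed at the API layer instead of queued onto the DB.
  MAX_INFLIGHT_QUERIES = 200
  _db_slots = threading.BoundedSemaphore(MAX_INFLIGHT_QUERIES)

  class DatabaseOverloaded(Exception):
      pass

  def run_query(execute, *args, timeout=0.1):
      """`execute` stands in for whatever actually talks to the database."""
      if not _db_slots.acquire(timeout=timeout):
          raise DatabaseOverloaded("too many concurrent queries, try later")
      try:
          return execute(*args)
      finally:
          _db_slots.release()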

The strangest thing to me was that the clients had permission to essentially cause a race condition across the whole site. From what I understand, the client ran an API call that either forced a re-fetch of a key which apparently only needed to be fetched once (and thus could in theory have been staged in advance by a backend application, not running on the frontend site, that updates the cache or the database directly, whichever was more necessary), or, when the database connection failed due to the aforementioned thundering herd of QPS, triggered yet another re-fetch from the db (which again could have been prevented by a backend app pre-loading the new value). So, if my outsider's idea of how Facebook's code works is accurate, this could have been prevented either by "pre-fetching" the cache/database in the background without client API calls, or by not allowing client API calls to all modify [and read] a single key in the database. The latter seems less likely than the former, but possible.
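
If the "pre-fetch in the background" idea above were implemented naively, it might look something like the following; this is purely speculative, and the key name and helper objects are invented:

  def prewarm_config_key(db, cache, key="site_config_version", ttl=3600):
      """Push the new value into the cache before clients can ask for it.

      `db` and `cache` are placeholders for the real stores; the point is
      that deploy tooling writes the value, so no client request ever has
      to trigger a cold read against the database.
      """
      value = db.load(key)  # a single read, done by the background job
      cache.set(key, value, expire=ttl)
      return value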

(Sorry if I'm speaking out of line or as an uneducated FB user, but according to these comments and RJ's breakdown, this is how it appears to me.)


This is a common pitfall when people start using Memcache. Memcache puts less load on the DB, which means you can keep the same DB hardware and scale up your pageviews...until Memcache craps out and all that load is now back on the DB again :(


Does anybody know what DB their memcached was fronting? Oracle (in dedicated mode)?


Since most of their (original) codebase is/was PHP, I would say there is a high likelihood that their db is MySQL (though I guess you could refer to it as Oracle since they own it).


Facebook uses MySQL.


When stuff breaks, I find that this phenomenon can often make it hard to differentiate the source of your problem from its symptoms. We usually work it out by going through the logs and seeing which problems showed up first.

Are there any architectures/patterns/methods that can help make it easier to find the source of performance issues?


I often encounter systems that are designed for very short queries processing small numbers of rows, where the number of connections from the app servers is configured to be far greater than the number of CPUs on the database.

Since those systems do very little I/O, configuring connection pools to start with many more connections than the DB has CPUs, and to add even more when the existing connections are busy (i.e. the DB is getting slower, probably because it is already heavily loaded), is guaranteed to cause resource contention on the DB and to escalate the problem when something goes wrong.

It's a total waste, and yet it is the most common configuration in the world.
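
In concrete terms, the safer setup being argued for is a small fixed-size pool roughly on the order of the DB's CPU count, with callers waiting briefly or failing fast rather than opening more connections. For example, with SQLAlchemy (the numbers are illustrative, not a recommendation for any particular system):

  from sqlalchemy import create_engine

  # For a low-I/O, CPU-bound workload, cap the pool near the DB's core count
  # and refuse to grow it under load; callers wait briefly for a free
  # connection and then fail fast instead of adding contention.
  engine = create_engine(
      "mysql+pymysql://app:secret@db-host/app",  # placeholder DSN
      pool_size=16,       # roughly the number of DB CPUs, not app threads
      max_overflow=0,     # never open "just a few more" when things slow down
      pool_timeout=2,     # seconds to wait for a free connection
  )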


I've also heard this called a Cache Stampede, since it's commonly caused by the sudden invalidation of a cache, which was the case for this outage.
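
One mitigation often mentioned alongside locking for cache stampedes is probabilistic early expiration: each reader occasionally refreshes the value a little before its TTL runs out, so the expiry never hits every reader at the same instant. A sketch of that idea (the cache/db interfaces are placeholders):

  import math
  import random
  import time

  def fetch_with_early_refresh(cache, db, key, ttl=60, beta=1.0):
      """Probabilistic early recomputation (the "XFetch" idea).

      The cache stores (value, compute_cost, expiry_time); each reader
      refreshes early with a probability that grows as expiry approaches,
      so the whole herd never sees a hard miss at the same instant.
      """
      entry = cache.get(key)
      now = time.time()
      if entry is not None:
          value, delta, expiry = entry
          # -log(1 - random()) is an exponential random variable >= 0.
          if now + delta * beta * -math.log(1.0 - random.random()) < expiry:
              return value  # still comfortably fresh for this reader

      start = time.time()
      value = db.load(key)         # placeholder DB read
      delta = time.time() - start  # how long the recomputation took
      cache.set(key, (value, delta, time.time() + ttl), expire=ttl)
      return value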


I know this isn't Reddit, but when you say "thundering herd", do you mean the processes, or the comments on the story?



