Can you explain a little about the problem? It makes sense to me to store everything in the database and scale that. Horizontal partitioning, where possible, seems like a reasonable way to scale the database. Is this hard to do?
Am I missing something about why scaling frameworks like Rails or Django is hard?
On the topic of Twitter scaling, I can see how the heavy database activity involved could cause scaling issues. Can anyone provide insight into how they solved this, if it wasn't something like horizontal partitioning?
It makes sense to me to store everything in the database and scale that... Is this hard to do?
Is Larry Ellison a multibillionaire?
I think it's impossible to explain database scaling and performance tuning in a handful of paragraphs, and I'm certainly not the one to try. If I were to try and tell you exactly what it is about Twitter that makes engineers cry, I'd point out that every single page is dynamic, every user's main page requires a giant JOIN, there are lots of writes coming in all the time from every direction and writes are harder to scale, low latency is a requirement for many people, and there's no obvious axis along which to "partition" Twitter.

For example, the PlentyOfFish guy talked about how he could split up his databases based on geography -- it's overwhelmingly likely that people in South Bend, Indiana want to look for dates within fifty miles of South Bend, Indiana, rather than in Spokane. But on Twitter I can follow anyone and anyone can follow me, so the giant JOIN that builds my homepage has to span the entire dataset.

And, sure, you can build a cache for every user, but then every user who sends a tweet to 1000 followers triggers the update of one thousand caches, with one thousand internal messages to one thousand event queues... and when one machine full of user caches goes down, what then? It's not acceptable to drop the message on the floor for a subset of users.
Some folks will solve this, but they will be better hackers than I, they will spend a lot of money on hardware, and they will drink a lot of coffee. And it will take time.
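To make the tradeoff concrete, here's a rough sketch in Python of the two strategies described above -- building the home page with a big join-style scan at read time versus fanning each tweet out into per-follower caches at write time. Everything in it (the in-memory structures, the function names) is made up for illustration, not how Twitter actually does it.

    # Toy illustration only: in-memory stand-ins and made-up names,
    # not anything resembling Twitter's real architecture.
    from collections import defaultdict

    follows = defaultdict(set)    # user_id -> set of user_ids they follow
    all_tweets = []               # list of (author_id, timestamp, text)

    # Read-time approach: every home page load does the "giant JOIN",
    # scanning tweets from everyone the user follows.
    def home_timeline_by_join(user_id, limit=20):
        followed = follows[user_id]
        relevant = [t for t in all_tweets if t[0] in followed]  # spans the whole dataset
        return sorted(relevant, key=lambda t: t[1], reverse=True)[:limit]

    # Write-time approach: fan each new tweet out into a cached timeline per
    # follower, so reads are cheap but every tweet becomes as many cache
    # writes as the author has followers.
    cached_timelines = defaultdict(list)  # user_id -> precomputed home timeline

    def publish_tweet(author_id, timestamp, text):
        all_tweets.append((author_id, timestamp, text))
        followers = [u for u, followed in follows.items() if author_id in followed]
        for follower_id in followers:  # 1,000 followers => 1,000 cache updates
            cached_timelines[follower_id].insert(0, (author_id, timestamp, text))
        # If the machine holding some of these cached timelines is down, the
        # update has to be queued and retried somewhere; it can't be dropped.

    follows["alice"].add("bob")
    publish_tweet("bob", 1, "hello")
    print(home_timeline_by_join("alice"))  # [('bob', 1, 'hello')]
    print(cached_timelines["alice"])       # same tweet, precomputed at write time

Even in this toy version the asymmetry shows: the join-style read has to touch the whole tweet set, while the fan-out write multiplies into one cache update per follower, and every one of those updates has to survive machine failures.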
If I were to try and tell you exactly what it is about Twitter that makes engineers cry, I'd point out that every single page is dynamic, every user's main page requires a giant JOIN, there are lots of writes coming in all the time from every direction and writes are harder to scale, low latency is a requirement for many people, and there's no obvious axis along which to "partition" Twitter.
That's more along the lines of what I was looking for. Thanks. This sounds tricky...
That 'point' is where a world of troubles begins: it's the point at which the number of database updates per second exceeds the capacity of your hardware platform.
At this point you are faced, like massive sites such as YouTube, Flickr, and Facebook, with the option of database sharding: splitting your data up by value (e.g. users A-K on this server and L-Z on that one).
You can no longer treat the database as a black box that does all the magic of multi-table joins. Rails also does not provide any tools out of the box to help with sharding.
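For a sense of what sharding pushes onto the application layer, here's a minimal sketch of the "users A-K here, L-Z there" routing described above -- the server names are placeholders and there's no real connection handling:

    # Minimal sketch of value-based sharding; server names are placeholders.
    SHARDS = {
        "a_to_k": "db-server-1",  # in practice, a connection string or pool
        "l_to_z": "db-server-2",
    }

    def shard_for_user(username):
        # Route a user to a shard by the first letter of their name.
        first = username[0].lower()
        return SHARDS["a_to_k"] if "a" <= first <= "k" else SHARDS["l_to_z"]

    def load_user(username):
        server = shard_for_user(username)
        # The application now has to pick the server itself, e.g. run
        # "SELECT * FROM users WHERE username = ?" against `server`.
        return server, username

    print(shard_for_user("alice"))  # db-server-1
    print(shard_for_user("zoe"))    # db-server-2

The routing itself is trivial; the pain is everything downstream of it, because any query that used to join across the full user table now has to be split per shard and merged back together in application code.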