Why Can't Twitter Scale? Blaine Cook Tries To Explain

aschobel · on May 12, 2008

The meat in the comments of Blaine Cook's blog entry:

http://romeda.org/blog/2008/05/scalability.html#140141155247...

"Scaling Twitter as a messaging platform is pretty easy. See Mickaël Rémond's post on the subject. Scaling the archival, and massive infrastructure concerns (think billions of authenticated polling requests per month) are not, no matter what platform you're on. Particularly when you need to take complex privacy concerns into account."

Sounds like they are having the kinds of problems that Friendster had years back.

How come sites like MySpace don't have these issues? They also seem to have a pretty complex social graph.

wanderingmarker · on May 12, 2008

MySpace does have massive scaling problems, most directly related to the difficulty of partitioning a relational database.

http://www.baselinemag.com/c/a/Projects-Networks-and-Storage...

patrickg-zill · on May 12, 2008

I have to admit that I am lacking in clue about Twitter.

Are they handling more than 64GB of user-generated data per hour? If not, why not just store everything into RAM on a big 128GB RAM server and query that?

diego · on May 12, 2008

In terms of public messages, they certainly aren't. We are collecting their public timeline and it's about 400k messages per day, each one less than 200 bytes (if you count metadata). That's 80 MB of uncompressed data per day.

I don't know how much the private messages add to this, but it can't be an order of magnitude higher.

Perhaps they are continuously doing DB writes instead of keeping a write cache and storing content in batches, who knows.

michaelneale · on May 13, 2008

"Perhaps they are continuously doing DB writes instead of keeping a write cache and storing content in batches, who knows."

If they are using rails (and certainly they started with it) that is almost certainly what they are/were doing. Thats the Rails Way. Store everything, load it up, remember nothing at all.

Retric · on May 12, 2008

My guess is they have a really stupid architecture if they are having significant issues. This is probably due to reusing legacy code when they should have started from scratch. I am honestly temped to code something up this weekend to see if I can find out what there issue is.

keating · on May 12, 2008

I concur. I see all these startups where I know I can do much better on the technical side, but I'm just not enthusiastic enough about the subject matter to actually put the time in to do it.

It seems like a lot of these start-ups are the founders' first tries at something big, so they make a lot of mistakes; but since it's new and exciting to them, motivation wins out and they actually accomplish something.

I really admire their enthusiasm and stick-to-it-iveness.

xirium · on May 12, 2008

> first tries at something big

That'll be the Second-system effect ( http://en.wikipedia.org/wiki/Second-system_effect ).

If you've previously had a manager who vetoed your doohickies then there's a risk that some get implemented by your startup. Any of them could be a scalability nightmare.

mtts · on May 12, 2008

Using Ruby on Rails is an issue, I would think. You do not need a framework for storing and sending out 140 byte messages but if you do use one, its overhead will actively hurt performance.

The fact that they chose to use RoR anyway hints that they may not be one hundred percent technically competent.

raganwald · on May 12, 2008

Not sure I understand how to integrate what you are saying with what I read in the post. If RoR is dog-slow--let's pick 100x slower than something you would code in the weekend--but scales, then what they can serve with 100 servers = what you serve with one server. To double capacity, you add one server and they go from 100 to 200. And so it goes.

Scalability is a second-order issue: if you go from 1 server to 2, but they need to go from 100 servers to 400 servers to handle the same load, then not only are they slower than you are, but they can't scale as well as you can.

I get why using a framework for the UI and business logic could make them 100x slower than something custom-coded. And from the original post, I get why a conventional database+read cache may not be appropriate for a messaging application.

But what I don't get is the connection between RoR and scalability. Unless you are speaking of its default configuration, namely RoR+ActiveRecord+MySQL. Which speaks more to the architecture choice (tables, rows) than to the framework choice (views, models, controllers).

or am I missing something?????

mtts · on May 12, 2008

There is no connection. The fact that they're using RoR to build what is really a very, very simple web site suggests their problem is not knowing which tools to use, which is a problem you can not solve by throwing more hardware at it. Talk about scalability is a red herring in this case. The real issues are elsewhere.

raganwald · on May 12, 2008

Ah, so it's a case of "I believe their choice of Tool A is wrong for solving Problem B, thus although I cannot see what they have done with Architecture C to solve Problem D, I don't have a lot of confidence they made the right decisions."

Thanks for explaining your reasoning.

fendale · on May 12, 2008

I am pretty certain it was mentioned in a blog post recently (I cannot find it now) that Ruby on Rails is only used for a small part of what Twitter is doing - ie the front end, and there is a lot of non Rails or Ruby code in the background.

justindz · on May 12, 2008

Correct. Blaine Cook blogged (in the post this article riffs from) that they don't use ActiveRecord, for example.

I believe someone has made the defensible claim that most of their activity occurs via clients (twhirl, alert thingy) and SMS over the API rather than through the website. Unless they are using ActionController and ActionView to do the rendering and output there, it seems they would be using "traditional RoR" in a very limited capacity.

I'm not being an apologist, but I think if people are discussing Twitter and scaling for the purposes of learning something useful and not just framework bashing, language bashing or avoiding doing homework, we should probably avoiding talking as if they use the full stack in a significant way as it's a red herring.

fendale · on May 12, 2008

Well said - I remember reading somewhere before that the database was mentioned (as usual) as the problem area.

Someone else in the comments mentioned 400K messages a day, plus personal messages, say thats even another 400K.

I have zero experience working on a high traffic web application, but I work on a serious database application were we can push somewhere in the region of 1M transactions in a 6-8 hour window. Each of these transactions comes in the form of an XML string and results in as many as 20 plus selects and maybe 10 - 20 inserts.

Our application is build on top of Oracle and implemented in PLSQL - not too sexy, but it seems to get the job done.

The point I am trying to make is that with a bit of a caching layer it takes some serious throughput to reach the limits of database scalability (certainly with Oracle, where is where my experience lies).

brooksbp · on May 13, 2008

Database scaling isn't their problem. Sure they push lots of data around, but great programmers should be able to architect an optimal caching layer between the app and the database.

So now that metadata caching is out of scope here, we turn to frameworks, architectures, and application design. The issue Twitter has is in aggregation of their meta data. People who propose a distributed solution for Twitter obviously miss the inherent nature of Twitter; it's a centralized system. Twitter should remain as it is, but ditch RoR and become more modularized. This is where they start developing real systems; the type of systems they talk about in those mundane CS classes, like programming in C -- that stuff. Modularize the applicaion, develop systems that scream for aggregation, cache the hell out of everything, and start applying some computer science.

Twitter was born as one of those next-gen Web 2.0 "keep track of your friends" hip Ruby on Rails insertbuzzwordhere application. Now it poses actual architectural challenges. It's so similar to the evolution of Facebook. Just take a step back and look at it.

ericb · on May 13, 2008

You're plainly ignoring all the posts where the people involved say ROR is not the problem. Caching does not work if the data is different for everyone. Please stop hearing what you want to hear--that ROR is a trendy flash in the pan and not useful har har--as you read and instead take a step back and listen to what the people involved are actually saying.

yariv · on May 13, 2008

"Caching does not work if the data is different for everyone."

I think caching can work, but at a different level of granularity. Rather than cache a person's full timeline, which is composed of multiple sub-feeds (each of which requires a database query), cache the data from the sub-feeds themselves, then recombine them on every page load. This would significantly lower the number of database queries, as each cache element would be invalidated only when its "owner" sends a tweet. This solution would be much more CPU intensive on the application servers, though, and Ruby may not be the best tool for the job if that were the case.

ericb · on May 13, 2008

"Traditional page caching does not work" would have been a better way for me to phrase that. I don't mean to suggest that no caching anywhere in the stack (machine or software) will be used--that would be absurd. I only mean that the cheap and easy page caching used to scale most web apps where you keep pages or chunks of pages in a cache goes out the window for twitter.

brooksbp · on May 13, 2008

You simply cannot argue that caching doesn't work if the data is different for everyone. I understand that. It's aggregation. But the metadata lives within a database and it is imperative for that to be glossed over with a RAM based cache.

I really hope my initial post wasn't interpreted the way it was to validate that response. I don't take back anything I said--and I'm sure it's not what I want to hear. Architectures scale, not languages... Twitter simply cannot be distributed... it's not news, I've written about it earlier.

tdavis · on May 12, 2008

I am having an impossible time coming up with a unbiased and substantiated opinion on this issue. On the one hand, we know nothing about Twitter's architecture other than they're (to some extent) using a language that has a notoriously under-optimized interpreter and a framework with reported scaling issues (unless you do a lot of hacking to it).

On the other hand, there's gotta be something fishy going on in Twitter-ville. Although I agree "the idea that building a large scale web application is trivial or a solved problem is simply ridiculous," by now Twitter should have enough performance data to know exactly which part of the process is causing the high-load issues they're having. If we are to assume that after each outage they at least "throw more hardware at it," then, theoretically, it's a problem that horizontal scaling cannot solve and the issue is deep-rooted in the system -- somewhere in the basic architecture of it.

Is it Ruby? RoR? Poorly optimized queries? Improper caching? Lack of domain knowledge? Leprechauns? I don't know, neither does anyone else outside of Twitter... but I guess speculation can be entertaining.

jrockway · on May 13, 2008

I don't get why Twitter doesn't scale. It's just webmail, but with smaller messages and a simpler UI. Here's how twitter should work: every user should have a list of users following them. When they tweet, each follower gets a copy of that message in their personal inbox. A copy is also attached to the tweeter's account, so new followers can suck that copy in when they start following them.

That's it. Now, sending a message takes O(n) (n=followers) time, which is really cheap. On my machine, it takes about a second to create and sync 40,000 files (there's not much data, so replicating this via NFS wouldn't be that expensive either). With that out of the way, all you have to do is ls your "twitter directory" to see all of your friend's messages. This is another incredibly cheap operation. It's easy to distribute, and there's no locking.

Anyway, just look at the mail handling systems at huge universities and corporations. They scale fine, and they're much more complicated than twitter. Twitter is just a subset of e-mail, so it should be implement that way, not as a "SELECT * FROM tweets WHERE user IN (list, of, followers) ORDER BY date". That is the wrong approach because it makes reads (very common) expensive and writes (very uncommon) cheap. That's why twitter doesn't scale.

sutro · on May 13, 2008

Of course all of you Hacker News Monday morning quarterbacks have taken a website from launch to Twitter's current level of traffic, right? If so, please let us know which site that was so we can compare your experience and your decisions against those of the Twitter team. If not, perhaps you should go and do that first before sounding off on Twitter. No, I'm not a friend of Blane Cook's, nor am I a Twitter apologist -- I don't even use the service. But I do respect startup founders and builders far more than their critics, and I know that website architecture -- like most things -- is always easier in hindsight. For perspective, please read the Hot or Not story in "Founders at Work," then read Teddy Roosevelt's "Man in the Arena" quotation that Arrington is so fond of. (And yes I realize that reference is ironic given that Arrington started the pile-on against Blaine Cook. Arrington should try to remember that "it's not the critic that counts.")

dmose · on May 13, 2008

It's doubt it's the language. Sure, you might get more efficient throughput with another framework or language..but I think the core problem lies in how many polling API connections they have. If their API is based on their core DB and not polling off a read only copy..they have a serious design flaw.

API conections, given their nature of having to do a security check each time should be based on a slave copy read-only db which is near-realtime, or potentially dirty read but who cares, its for their API. The security lookup should be cached since you rarely change API security accounts more than once..if you do, then you update the caches.. etc.

Granted I'm no expert, but it just sounds like they're overloading their DB with polling.

Their web pages for each user should be cached since they're not really hit as much as the api or rss

I dont' care what anyone says, having THAT many API connections constantly polling their DB along with what blane said about having to do authentication requests on every singel hit is taxing on any setup you can put together.

Oh and if they have a single table holding all their tweets...big issues there. I'd have a 26 "tweet" tables, one for each letter of the alphabet and my switch in the business layer. Then simply ship tweets older than a month, or patition based on that.

gleb · on May 12, 2008

Common usage of word "scale" is that things continue to work well as you add load. Whether that's done by having the software fast enough to handle everything one computer, adding servers or adding pet monkeys is not important to the user.

Responding to reasonable complaints by using a different definition of word "scale," makes for a weak argument.

ojbyrne · on May 12, 2008

It also seems like he tries to deflect the argument by saying that people are saying Ruby (the language) doesn't scale, and then ridiculing them for that, when in fact people are saying Rails (the architecture) doesn't scale.

ks · on May 12, 2008

The author also says that it's perfectly acceptable for something to be 10x slower, because "a well-capitalized company" can afford 10x more servers... I am thinking of how to respond to that, but I don't know where to start

raganwald · on May 12, 2008

Here's a suggestion: compare it to using 10x as many programmers to write something in J2EE because it's easier to hire 10 J2EE drones than one RoR whiz (where "whiz" is understood to mean "person who has experience solving RoR problems").

It's all a question of what you believe to be the scarce resource. H=e obviously believes that for a well-founded company, server CPU cycles are not a scarce resource. This implies that he believes that RoR addresses some issue raised by something else being the scarce resource.

xirium · on May 12, 2008

There's also the issue of the timescale. Start with one programmer and server. You have the finances to purchase 10 additional servers or hire one additional programmer.

You can get 10 servers bought, delivered, installed and configured in one week. These servers can be deployed to do anything from load balancing, web caching, database caching, being database read slaves or application servers. Regardless of the quality of your architecture, 10 servers will probably add some scalability to your system. Furthermore, they can be re-purposed as the situation demands.

Hiring a better programmer (or just another warm body) takes longer. Furthermore, the improved code they produce isn't as flexible as surplus hardware.

Of course, the situation changes when you've got thousands of servers.

ivankirigin · on May 12, 2008

The web frontend could just be a static list of your contacts, with ajaxy grabbing the tweets from each. This way, you could cache each users tweets and avoid the issue of each user's page being different.

That's just one idea. I'm not really an expert.

systems · on May 12, 2008

I have several questions, because I feel there should be a better system approach to this twitter problem

is there any publicly available documentation for twitter's architecture?

did they use consultant help, did they contact SUN, IBM, Oracle or any other respected consultant when they started facing those problems?

i recommend you watch this video: http://www.infoq.com/presentations/qcon-voca-architecture-sp...

really ... we need to have a look at twitter architecture before we discuss this further

raganwald · on May 12, 2008

> SUN, IBM, Oracle or any other respected consultant

Are you implying that Sun, IBM, and Oracle are respected for the prowess of their consulting organizations? I'm sure they have some Fine People working there, however the view that "writing a seven-figure cheque to these companies guarantees a positive outcome" is not universally held.

goodkarma · on May 12, 2008

Unless you know what you need to scale to, you can't even begin to talk about scalability. How many users do you want your system to handle? A thousand? Hundred thousand? Ten million? Here's a hint: the system you design to handle a quarter million users is going to be different from the system you design to handle ten million users.

http://teddziuba.com/2008/04/im-going-to-scale-my-foot-up-y....

bluelu · on May 12, 2008

If he can't scale a simple message passing system, what can he scale then? It's not that they are doing rocket science at twitter. He probably was the wrong person for the job.

In the good old days, every message over ICQ was sent over their servers, and they probably had more messages to handle than twitter does nowadays. And there wasn't a single problem then.

quellhorst · on May 12, 2008

ICQ didn't archive a copy of every message sent.

bluelu · on May 12, 2008

This doesn't increase much the complexity...

Not thinking long about it (5 minutes), but this is how I would do it: Not going muhc into detail (I don't use twitter, just had a quick look at it).

1) 1 replicated system where you can fetch messages by id (complete body with who sent it, to whom, time, etc...) 2) User page: list of ids.... 3) Private messages: list of ids... 4) When a new message comes in, write it to a queue. Process those ids and append the ids to the different pages. Multiple processes doing this (each process has a subset of users with the complete list of people following those users). One could add another layer to do bulk inserts.

Could be easily done in Memcachedb. One page view takes x + 1 memcachedb requests (x number of items on page). One can still optimze this by caching (static html pages which are deleted when a page is updated for a user). When inserting, replace existing data by adding the ids.

Everything is nicely seperated. (Eg pages for user 1-10000 are on server 1, etc... Messages can be nicely sperated as well).

Any thoughts on this? To twitter: hire me not him ;)

jdavid · on May 13, 2008

i think the problem with twitter is that they did not use a real time architecture, like email, irc, or some other messaging platform. instead i think they write each page dynamically off of a db call based on who subscribed where. ie, you would have to start from scratch to scale it. i think the problem is that now that twitter, let the cat out of the bag, they are now chasing problems on an old architecture, and trying to scale that, while also trying to build a new version. it suppose its like running two tech companies.