I have to admit that I am lacking in clue about Twitter.
Are they handling more than 64GB of user-generated data per hour? If not, why not just store everything into RAM on a big 128GB RAM server and query that?
In terms of public messages, they certainly aren't. We are collecting their public timeline, and it's about 400k messages per day, each one under 200 bytes (even counting metadata).
That's 80 MB of uncompressed data per day.
I don't know how much the private messages add to this, but it can't be an order of magnitude higher.
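To put rough numbers on the "just use RAM" idea (a back-of-envelope sketch only, using the 400k/day and 200-byte figures above and the parent's hypothetical 128GB box):

    # Back-of-envelope, using the numbers quoted above.
    MESSAGES_PER_DAY = 400_000
    BYTES_PER_MESSAGE = 200            # including metadata
    RAM_BYTES = 128 * 1024 ** 3        # the parent's hypothetical 128GB server

    daily_bytes = MESSAGES_PER_DAY * BYTES_PER_MESSAGE
    print(daily_bytes / 10 ** 6, "MB per day")                 # 80.0 MB per day
    print(RAM_BYTES / daily_bytes / 365, "years to fill RAM")  # ~4.7 years

Even if private messages doubled the volume, one machine's RAM would hold years of data, which suggests raw storage size isn't the hard part.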
Perhaps they are continuously doing DB writes instead of keeping a write cache and storing content in batches, who knows.
"Perhaps they are continuously doing DB writes instead of keeping a write cache and storing content in batches, who knows."
If they are using Rails (and they certainly started with it) that is almost certainly what they are/were doing. That's the Rails Way: store everything, load it up, remember nothing at all.
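To make the batching idea concrete, here is a toy sketch (illustrative only, not anything Twitter has described; db.insert_many stands in for whatever bulk-insert call your database driver exposes):

    import threading

    class BufferedWriter:
        # Toy write cache: buffer incoming messages in memory and flush
        # them to the database in batches, instead of one INSERT per
        # message. A real system would also need durability (e.g. a
        # write-ahead log) so buffered messages survive a crash.
        def __init__(self, db, batch_size=500):
            self.db = db
            self.batch_size = batch_size
            self.buffer = []
            self.lock = threading.Lock()

        def write(self, message):
            with self.lock:
                self.buffer.append(message)
                if len(self.buffer) < self.batch_size:
                    return
                batch, self.buffer = self.buffer, []
            self.db.insert_many(batch)   # one round trip for 500 rows

The swap-under-lock matters: the slow database call happens outside the lock, so incoming writers aren't blocked during a flush.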
My guess is they have a really stupid architecture if they are having significant issues. This is probably due to reusing legacy code when they should have started from scratch. I am honestly tempted to code something up this weekend to see if I can find out what their issue is.
I concur. I see all these startups where I know I can do much better on the technical side, but I'm just not enthusiastic enough about the subject matter to actually put the time in to do it.
It seems like a lot of these start-ups are the founders' first tries at something big, so they make a lot of mistakes; but since it's new and exciting to them, motivation wins out and they actually accomplish something.
I really admire their enthusiasm and stick-to-it-iveness.
If you've previously had a manager who vetoed your pet doohickies, there's a risk that some of them get implemented at your startup. Any one of them could be a scalability nightmare.
Using Ruby on Rails is an issue, I would think. You do not need a framework for storing and sending out 140-byte messages, but if you do use one, its overhead will actively hurt performance.
The fact that they chose to use RoR anyway hints that they may not be one hundred percent technically competent.
Not sure I understand how to integrate what you are saying with what I read in the post. If RoR is dog-slow--let's say 100x slower than something you could code up over a weekend--but scales, then what they can serve with 100 servers = what you serve with one server. To double capacity, you add one server and they go from 100 to 200. And so it goes.
Scalability is a second-order issue: if you go from 1 server to 2, but they need to go from 100 servers to 400 servers to handle the same load, then not only are they slower than you are, but they can't scale as well as you can.
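To make that concrete, with made-up numbers that follow the 1 -> 2 vs. 100 -> 400 example:

    # Hypothetical growth curves: "you" scale linearly, while "they"
    # need servers proportional to load squared (100 -> 400 on a doubling).
    for load_multiple in (1, 2, 4, 8):
        you = 1 * load_multiple ** 1       # 1, 2, 4, 8 servers
        them = 100 * load_multiple ** 2    # 100, 400, 1600, 6400 servers
        print(f"{load_multiple}x load: you need {you}, they need {them}")

Being 100x slower is a constant factor you can buy your way out of; scaling superlinearly is not.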
I get why using a framework for the UI and business logic could make them 100x slower than something custom-coded. And from the original post, I get why a conventional database+read cache may not be appropriate for a messaging application.
But what I don't get is the connection between RoR and scalability. Unless you are speaking of its default configuration, namely RoR+ActiveRecord+MySQL. Which speaks more to the architecture choice (tables, rows) than to the framework choice (views, models, controllers).
There is no connection. The fact that they're using RoR to build what is really a very, very simple web site suggests their problem is not knowing which tools to use, which is a problem you cannot solve by throwing more hardware at it. Talk about scalability is a red herring in this case. The real issues are elsewhere.
Ah, so it's a case of "I believe their choice of Tool A is wrong for solving Problem B, thus although I cannot see what they have done with Architecture C to solve Problem D, I don't have a lot of confidence they made the right decisions."
I am pretty certain it was mentioned in a blog post recently (I cannot find it now) that Ruby on Rails is only used for a small part of what Twitter is doing - i.e. the front end - and there is a lot of non-Rails, non-Ruby code in the background.
Correct. Blaine Cook blogged (in the post this article riffs from) that they don't use ActiveRecord, for example.
I believe someone has made the defensible claim that most of their activity occurs via clients (twhirl, alert thingy) and SMS over the API rather than through the website. Unless they are using ActionController and ActionView to do the rendering and output there, it seems they would be using "traditional RoR" in a very limited capacity.
I'm not being an apologist, but I think if people are discussing Twitter and scaling for the purposes of learning something useful - and not just framework bashing, language bashing, or avoiding doing homework - we should probably avoid talking as if they use the full stack in a significant way, as it's a red herring.
Well said - I remember reading somewhere before that the database was mentioned (as usual) as the problem area.
Someone else in the comments mentioned 400K messages a day; add personal messages and say that's even another 400K.
I have zero experience working on a high-traffic web application, but I work on a serious database application where we can push somewhere in the region of 1M transactions in a 6-8 hour window. Each of these transactions comes in the form of an XML string and results in as many as 20-plus selects and maybe 10-20 inserts.
Our application is built on top of Oracle and implemented in PL/SQL - not too sexy, but it seems to get the job done.
The point I am trying to make is that with a bit of a caching layer, it takes some serious throughput to reach the limits of database scalability (certainly with Oracle, which is where my experience lies).
Database scaling isn't their problem. Sure they push lots of data around, but great programmers should be able to architect an optimal caching layer between the app and the database.
So, with metadata caching taken care of, we turn to frameworks, architectures, and application design. The issue Twitter has is in aggregating its metadata. People who propose a distributed solution for Twitter obviously miss its inherent nature: it's a centralized system. Twitter should remain as it is, but ditch RoR and become more modularized. This is where they start developing real systems; the type of systems they talk about in those mundane CS classes, like programming in C -- that stuff. Modularize the application, develop systems that scream for aggregation, cache the hell out of everything, and start applying some computer science.
Twitter was born as one of those next-gen Web 2.0 "keep track of your friends" hip Ruby on Rails insertbuzzwordhere application. Now it poses actual architectural challenges. It's so similar to the evolution of Facebook. Just take a step back and look at it.
You're plainly ignoring all the posts where the people involved say ROR is not the problem. Caching does not work if the data is different for everyone. Please stop hearing what you want to hear--that ROR is a trendy flash in the pan and not useful har har--as you read and instead take a step back and listen to what the people involved are actually saying.
"Caching does not work if the data is different for everyone."
I think caching can work, but at a different level of granularity. Rather than cache a person's full timeline, which is composed of multiple sub-feeds (each of which requires a database query), cache the data from the sub-feeds themselves, then recombine them on every page load. This would significantly lower the number of database queries, as each cache element would be invalidated only when its "owner" sends a tweet. This solution would be much more CPU intensive on the application servers, though, and Ruby may not be the best tool for the job if that were the case.
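Roughly like this, as a runnable sketch (all names here are illustrative, not anything from Twitter's actual codebase):

    import heapq
    import itertools
    from collections import defaultdict

    # Cache each user's own recent tweets once; rebuild a reader's
    # timeline on page load by merging the followed users' sub-feeds.
    tweet_cache = defaultdict(list)    # author -> [(timestamp, text), ...]

    def post_tweet(author, timestamp, text):
        # Only the author's cache entry changes; no per-follower
        # timeline needs to be invalidated.
        tweet_cache[author].append((timestamp, text))

    def timeline(followees, n=20):
        # Each cached sub-feed is chronological, so reversing it gives
        # newest-first; merge all of them and take the top n.
        feeds = (reversed(tweet_cache[u]) for u in followees)
        merged = heapq.merge(*feeds, key=lambda t: t[0], reverse=True)
        return list(itertools.islice(merged, n))

Posting touches exactly one cache entry, while reading does an n-way merge across everyone you follow - trading database queries for application-server CPU, which is exactly the tradeoff described above.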
"Traditional page caching does not work" would have been a better way for me to phrase that. I don't mean to suggest that no caching anywhere in the stack (machine or software) will be used--that would be absurd. I only mean that the cheap and easy page caching used to scale most web apps where you keep pages or chunks of pages in a cache goes out the window for twitter.
You simply cannot argue that caching doesn't work just because the data is different for everyone. I understand that; it's aggregation. But the metadata lives in a database, and it is imperative for that database to be fronted by a RAM-based cache.
I really hope my initial post wasn't interpreted the way it was to validate that response. I don't take back anything I said--and I'm sure it's not what I want to hear. Architectures scale, not languages... Twitter simply cannot be distributed... it's not news, I've written about it earlier.