I don't know about you, but to me this seems like a big blow to the reputation of Cassandra. I know a bunch of people (including myself) whose primary motivation for investigating Cassandra as a large-scale data store was "Twitter is using it, therefore it's got to be fine for what we want".
In my experience this was not the case -- after quite a lot of prototyping, we moved away from Cassandra. It scales brilliantly, but its query time has a pretty solid floor that I was unable to shift to speeds approaching what I needed: it never went up, but nor did it go down.
Of course, I am a total noob at distributed systems -- perhaps it's simple to get it to go faster, but Twitter's move away suggests not. (Facebook's use-case is quite different, and in any case they still use an awful lot of MySQL too.) The other systems they cite as currently using Cassandra in production are nowhere near the scale of the primary tweet store.
I don't think they're avoiding Cassandra as their tweet store because it's not capable of performing well. It's probably more because it would be a really long process migrating every tweet into Cassandra. I'm sure if they could, they would go back to the first day Twitter launched and use Cassandra as the tweet store.
As for query time, maybe your problem was I/O throughput? I've read Cassandra can be really slow if you have poor I/O throughput. This happens a lot in the cloud, where I/O is usually shared.
I think that, apart from the speculation about why Twitter does not want to move to a new store for tweets, the important takeaway here is what they (and other companies) do: experiment with different platforms, and then use what they feel is right for them.
Don't use something just because others are using or suggesting it. Download a few stores, take your time to understand how they work, and then make your pick. Time consuming? Not as time consuming as starting a big project with a system that is not a good fit.
1. Cassandra is very young! In particular, the design and implementation of its local storage and local indexing are immature.
2. Poor read performance is also due to the poor local storage implementation.
3. The local storage, indexing and persistence structures are not stable. They need to be re-designed and re-implemented. If Twitter moved its data to the current Cassandra, it would have to do another migration later for the new local storage, indexing and persistence structures.
4. There are many good techniques in Cassandra and other open-source projects (such as Hadoop and HBase), but they are not ready for production. Understand the details of these techniques and implement them in your own projects/products.
Every time I use Twitter, it is malfunctioning in a very noticeable and bizarre new way. I don't care about scale, how difficult it is to be Twitter, etc. -- I would not take their word on any technical matter without a tsp of salt.
Ha ha. Everything I say on this site is down voted recently. As if I fucking care.
Twitter:
- can't find my @ replies. Shows zero. Not correct.
- has been showing 2-5 copies of each message in my own history for the past week
- gave me 30 error screens in a row, then an hour later on their status blog "we've noticed elevated rates of errors". Like a constant stream of errors is normal for them, but elevated rates - time to do something!
- can't get their widgets/external code straight, redirects people to twitter.com/undefined and twitter.com/null
Etc, etc. Give me a break. Twitter is hopeless, and has always been hopeless in the tech department.
Interesting to see that they use two different db systems. I'm a bit worried about this. I believe it is just postponing an important decision: in the future they will have to choose one db system. Given Twitter's growth, wouldn't it be better to make that decision now and focus all resources on optimizing one system?
Why do they have to choose 1 database system? I never understand this reasoning. Pick the best tool for the job!
I for one whole-heartedly agree with their decision. Considering the issues they've had recently with their internal networks, it would be insanity to move the majority of their stack to a new technology.
MySQL scales just fine provided you are good at partitioning, and at least it's a known quantity.
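"Good at partitioning" usually means routing each key deterministically to one of a fixed set of shards, so that all of a user's rows live on a single machine. A minimal sketch of that idea (the shard names are placeholders, not real connection strings; a production router would also handle resharding):

```python
import hashlib

# Hypothetical shard map: in a real deployment these would be MySQL
# connection strings or host names.
SHARDS = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2", "mysql-shard-3"]

def shard_for(user_id: int) -> str:
    """Route a user's rows to a shard by hashing the key.

    Hashing the id (rather than taking modulo on the raw value) spreads
    sequentially assigned ids evenly across shards.
    """
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The same user always maps to the same shard, so per-user queries
# (e.g. a user's own timeline) stay on one box.
```

The catch, and the part that makes it hard, is everything this sketch omits: moving data when you add shards, and queries that span users.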
There's no reason a company can't use different tools for different jobs. When I worked at Amazon.com we used many storage and query systems: Oracle, MySQL, S3, BDB (and various simple distributed systems built on BDB), Dynamo, SimpleDB, memcached, etc. etc.
One size doesn't fit all. MySQL and Cassandra are radically different systems in both their query models and distribution models. Cassandra and HBase are similar in their data models (the differences are mandated by their respective distribution models), but are radically different in their distribution models. Entirely different systems are suited to entirely different applications.
To my mind, the whole NoSQL thing is primarily about "polyglot persistence" - building systems on more than one storage system. I think large scale apps that use more than one db system are becoming the norm. NoSQL means acknowledging that different storage systems are good for different features.
Meh. Many, many applications have at least two database systems: a caching layer, such as memcached, and a persistence layer, such as PostgreSQL. Twitter's basically just using Cassandra in place of memcached. This doesn't strike me as a dangerous or bizarre architecture.
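That two-layer split is usually wired up as cache-aside: read from the cache, fall back to the database on a miss, and invalidate on writes. A minimal sketch, with plain dicts standing in for memcached and PostgreSQL (in production you'd use the real client libraries):

```python
# Cache-aside sketch. `cache` plays the memcached role (fast, volatile),
# `db` plays the PostgreSQL role (slow, durable).
cache = {}
db = {"tweet:1": "hello world"}

def get(key: str):
    """Read-through: serve from cache, fall back to the database."""
    if key in cache:
        return cache[key]
    value = db.get(key)          # the expensive lookup on a miss
    if value is not None:
        cache[key] = value       # populate cache for the next reader
    return value

def put(key: str, value: str):
    """Write to the durable store, then invalidate the cache entry."""
    db[key] = value
    cache.pop(key, None)         # next get() re-reads the fresh value
```

Invalidate-on-write (rather than update-on-write) keeps the cache from serving a value that lost a race with a concurrent writer, at the cost of one extra miss.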
"Twitter's basically just using Cassandra in place of memcached"
Not sure about that. They also said they are using it for analytics. Actually, since Cassandra persists to disk, using it as a memcached replacement doesn't sound right.
For actual data storage they built this layer on top of MySQL (and Redis, and everything else that needs sharding): http://github.com/twitter/gizzard