I don't know about you, but to me this seems like a big blow to the reputation of Cassandra. I know a bunch of people (including myself) whose primary motivation for investigating Cassandra as a large-scale data store was "Twitter is using it, therefore it's got to be fine for what we want".
In my experience this was not the case -- after quite a lot of prototyping, we moved away from Cassandra. It scales brilliantly, but its query time has a pretty solid floor that I was unable to shift to speeds approaching what I needed: it never went up, but nor did it go down.
Of course, I am a total noob at distributed systems -- perhaps it's simple to get it to go faster, but Twitter's move away suggests not. (Facebook's use-case is quite different, and in any case they still use an awful lot of MySQL too.) The other systems they cite as currently using Cassandra in production are nowhere near the scale of the primary tweet store.
I don't think they're avoiding Cassandra as their tweet store because it's not capable of performing well. It's probably more because it would be a really long process migrating every tweet into Cassandra. I'm sure if they could, they would go back to the first day Twitter launched and use Cassandra as the tweet store.
As for query time, maybe your problem was I/O throughput? I've read Cassandra can be really slow if you have poor I/O throughput. This happens a lot in the cloud, where I/O is usually shared.
I think that, apart from the speculation about why Twitter does not want to move to a new store for tweets, the important takeaway here is what they (and other companies) do: experiment with different platforms, and then use what they feel is right for them.
Don't use something just because others are using or suggesting it. Download a few stores, take your time to understand how they work, and then make your pick. Time consuming? Not as time consuming as starting a big project with a system that is not a good fit.
1. Cassandra is very young! In particular, the design and implementation of its local storage and local indexing are immature.
2. Poor read performance is also due to the poor local storage implementation.
3. The local storage, indexing and persistence structures are not stable. They need to be re-designed and re-implemented. If Twitter moved its data to the current Cassandra, it would have to do another migration later for the new local storage, indexing and persistence structures.
4. There are many good techniques in Cassandra and other open-source projects (such as Hadoop and HBase), but they are not ready for production. Understand the details of these techniques and implement them in your own projects/products.
Every time I use Twitter, it is malfunctioning in a very noticeable and bizarre new way. I don't care about scale, how difficult it is to be Twitter, etc. -- I would not take their word on any technical matter without a tsp of salt.
Ha ha. Everything I say on this site is down voted recently. As if I fucking care.
Twitter:
- can't find my @ replies. Shows zero. Not correct.
- has been showing 2-5 copies of each message in my own history for the past week
- gave me 30 error screens in a row, then an hour later on their status blog "we've noticed elevated rates of errors". Like a constant stream of errors is normal for them, but elevated rates - time to do something!
- can't get their widgets/external code straight, redirects people to twitter.com/undefined and twitter.com/null
Etc, etc. Give me a break. Twitter is hopeless, and has always been hopeless in the tech department.
Interesting to see that they use two different db systems. I'm a bit worried about this. I believe it is just postponing an important decision: in the future they will have to choose one db system. Given Twitter's growth, wouldn't it be better to make that decision now and focus all resources on optimizing one system?
Why do they have to choose 1 database system? I never understand this reasoning. Pick the best tool for the job!
I for one whole-heartedly agree with their decision. Considering the issues they've had recently with their internal networks, it would be insanity to move the majority of their stack to a new technology.
MySQL scales just fine provided you are good at partitioning, and at least it's a known quantity.
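"Good at partitioning" usually means routing each key deterministically to one of a fixed set of shards, so that all of a user's rows live on a single machine. A minimal sketch of that idea (the shard names are placeholders, not real connection strings; a production router would also handle resharding):

```python
import hashlib

# Hypothetical shard map: in a real deployment these would be MySQL
# connection strings or host names.
SHARDS = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2", "mysql-shard-3"]

def shard_for(user_id: int) -> str:
    """Route a user's rows to a shard by hashing the key.

    Hashing the id (rather than taking modulo on the raw value) spreads
    sequentially assigned ids evenly across shards.
    """
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The same user always maps to the same shard, so per-user queries
# (e.g. a user's own timeline) stay on one box.
```

The catch, and the part that makes it hard, is everything this sketch omits: moving data when you add shards, and queries that span users.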
There's no reason a company can't use different tools for different jobs. When I worked at Amazon.com we used many storage and query systems: Oracle, MySQL, S3, BDB (and various simple distributed systems built on BDB), Dynamo, SimpleDB, memcached, etc. etc.
One size doesn't fit all. MySQL and Cassandra are radically different systems in both their query models and distribution models. Cassandra and HBase are similar in their data models (the differences are mandated by their respective distribution models), but are radically different in their distribution models. Entirely different systems are suited to entirely different applications.
To my mind, the whole NoSQL thing is primarily about "polyglot persistence" - building systems on more than one storage system. I think large scale apps that use more than one db system are becoming the norm. NoSQL means acknowledging that different storage systems are good for different features.
Meh. Many, many applications have at least two database systems: a caching layer, such as memcached, and a persistence layer, such as PostgreSQL. Twitter's basically just using Cassandra in place of memcached. This doesn't strike me as a dangerous or bizarre architecture.
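That two-layer split is usually wired up as cache-aside: read from the cache, fall back to the database on a miss, and invalidate on writes. A minimal sketch, with plain dicts standing in for memcached and PostgreSQL (in production you'd use the real client libraries):

```python
# Cache-aside sketch. `cache` plays the memcached role (fast, volatile),
# `db` plays the PostgreSQL role (slow, durable).
cache = {}
db = {"tweet:1": "hello world"}

def get(key: str):
    """Read-through: serve from cache, fall back to the database."""
    if key in cache:
        return cache[key]
    value = db.get(key)          # the expensive lookup on a miss
    if value is not None:
        cache[key] = value       # populate cache for the next reader
    return value

def put(key: str, value: str):
    """Write to the durable store, then invalidate the cache entry."""
    db[key] = value
    cache.pop(key, None)         # next get() re-reads the fresh value
```

Invalidate-on-write (rather than update-on-write) keeps the cache from serving a value that lost a race with a concurrent writer, at the cost of one extra miss.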
"Twitter's basically just using Cassandra in place of memcached"
Not sure about that. They also said they are using it for analytics. Actually, since Cassandra persists to disk, using it as a memcached replacement doesn't sound right.
For actual data storage they built this layer on top of MySQL (and Redis, and everything else that needs sharding): http://github.com/twitter/gizzard