I wonder if their dislike for Cassandra is based on previous versions pre-2.0. F...

bloodredsun · on April 16, 2013

Not if you want strong consistency. Cassandra's performance sucks in comparison with the likes of MongoDB or Couchbase when reading with strong consistency, since the clients have no idea of the server topology.

cnlwsu · on April 16, 2013

Umm, what? Cassandra is just as fast or faster (depending on both configuration and load) compared to MongoDB with consistent reads/writes. Definitely with writes, but reads get tricky.

bloodredsun · on April 16, 2013

Firstly, these sorts of applications are always going to be more read-heavy, so the reads are more important. Secondly, Cassandra cannot and will not be as good as something like Couchbase, since the client libraries are not aware of the server topology and so cannot make a direct connection to the server hosting the data. This means that, depending on your consistency requirements, Cassandra will range from merely okay to occasionally terrible, depending on whether you care about the 99th percentile. This behaviour was one of the reasons my company moved away from Cassandra.

This is probably the best benchmark of Cassandra/MongoDB/Couchbase: http://www.slideshare.net/renatko/couchbase-performance-benc...

pkolaczk · on April 16, 2013

This benchmark is pure marketing. Cassandra clients can be token-aware and can connect directly to the right node.
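To make the token-aware point concrete, here is a minimal sketch of how a client could pick the node that owns a key on a token ring. It is an illustration only: `TokenRing`, `token`, and the MD5-based hash are hypothetical stand-ins, not the code of Astyanax or any real driver, and Cassandra's actual partitioners and replication strategies are more involved.

```python
import bisect
import hashlib
from typing import Dict, List


def token(key: str) -> int:
    # Stand-in hash for illustration; real Cassandra partitioners differ.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class TokenRing:
    """Toy ring: each node owns the token range ending at its own token."""

    def __init__(self, node_tokens: Dict[str, int]):
        # Keep (token, node) pairs sorted so lookups are a binary search.
        self._ring = sorted((t, n) for n, t in node_tokens.items())
        self._tokens = [t for t, _ in self._ring]

    def replicas_for_token(self, t: int, rf: int = 1) -> List[str]:
        # First node whose token is >= t, wrapping around past the last node.
        start = bisect.bisect_left(self._tokens, t) % len(self._ring)
        count = min(rf, len(self._ring))
        # Assumption: replicas are simply the next nodes clockwise on the ring.
        return [self._ring[(start + k) % len(self._ring)][1] for k in range(count)]

    def replicas(self, key: str, rf: int = 1) -> List[str]:
        return self.replicas_for_token(token(key), rf)


ring = TokenRing({"node-a": 100, "node-b": 200, "node-c": 300})
owners = ring.replicas_for_token(150, rf=2)  # nodes to contact directly
```

A client that holds this mapping can send the request straight to a replica instead of going through an arbitrary coordinator, which saves a network hop on reads.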

If you want a scientific, independent, peer-reviewed NoSQL benchmark, read this: http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf

Cassandra is a clear winner here.

bloodredsun · on April 17, 2013

No, the benchmark is not pure marketing. Why would you claim that it is? Apart from Astyanax, which clients are token aware?

That paper is very useful so thanks for posting the link but it has a number of issues as I see it.

1) It considers Cassandra, Redis, VoltDB, Voldemort, HBase and MySQL. It does not cover either MongoDB or Couchbase.

2) Latency values are given as average and do not show p95/p99. In my experience, Cassandra in particular is susceptible to high latency at these values.

3) Even considering average values, the read latency of Cassandra is higher than you would see with either MongoDB or Couchbase.

4) Cassandra does not deal well with ephemeral data. There are issues while GC'ing large number of tombstones for example that will hurt a long running system.
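Point 2 is easy to illustrate: an average can look healthy while the tail is not. The sketch below uses made-up latencies (not measurements from either benchmark) and a hypothetical `percentile` helper based on the nearest-rank method.

```python
import statistics
from typing import List


def percentile(samples: List[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(samples)
    # Ceiling division in integers avoids floating-point rounding surprises.
    rank = max(1, -(-len(ordered) * p // 100))
    return ordered[rank - 1]


# Hypothetical sample: 97 fast requests and 3 slow ones, in milliseconds.
latencies = [2.0] * 97 + [250.0] * 3

mean = statistics.mean(latencies)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)
print(f"mean={mean:.2f}ms p95={p95}ms p99={p99}ms")
```

With this data the mean stays under 10 ms while the p99 is 250 ms, which is why a report of averages alone says little about tail latency.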

The long and short of it is that Cassandra is a fantastic system for write-heavy situations. What it is not good at is read-heavy situations where deterministic low latency is required, which is pretty much what the Pinterest guys were dealing with.

pkolaczk · on April 17, 2013

It is marketing, because Couchbase is a featured customer of Altoros, the company that did the benchmark. And the rule of thumb is: never trust a benchmark done by someone who is related to one of the benchmarked systems. Obviously they would not have published it if Couchbase had lost the benchmark. They would have had to be insane to do so.

Another reason it is marketing is that it lacks essential information on the setup of each benchmarked system. E.g., for Cassandra I don't even know which version they used, what the replication factor was, what consistency level they read data at, whether they enabled the row cache (which decreases latency a lot), etc. Cassandra has improved read throughput and latency by a huge factor since version 0.6 and is constantly improving, so the version really matters.

rbranson · on April 20, 2013

First, let me concede that Cassandra has had a storied history of terrible read performance. However, if the last time anyone looked at Cassandra for read performance was 0.8 or used size-tiered compaction, I'd encourage them to take another look.

The p95 latency issues were largely caused by GC pressure from having a large amount of relatively static data on-heap. In 1.2, the two largest of these, bloom filters and compression data, were moved off-heap. It's my experience that with 1.2, most of the p95 latency is now caused by network and/or disk latency, as it should be.

I'm not going to compare it with other data stores in this comment, but I'd encourage people to consider that Cassandra is designed for durable persistence and larger-than-RAM datasets.

As for #4, this is mostly false. Tombstones (markers for deleted rows/columns) CAN cause issues with read performance, but "issues while GC'ing large number of tombstones" is a bit of a hand-wavy statement. The situation in which poor performance would result from tombstone pile-up is if you have rows where columns are constantly inserted and then removed before GC grace (10 days). Tombstones sit around until GC grace, so effectively consider data you insert to live for at least 10 days, unless of course you do something about it.
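A rough sketch of the pile-up described above, assuming a simplified rule in which a tombstone can only be dropped once it is older than GC grace. The function names and the one-delete-per-hour workload are hypothetical, and real compaction applies further conditions before purging.

```python
from typing import List

HOUR = 60 * 60
GC_GRACE_SECONDS = 10 * 24 * HOUR  # the 10-day GC grace mentioned above


def is_purgeable(deleted_at: int, now: int, gc_grace: int = GC_GRACE_SECONDS) -> bool:
    """Simplified rule: a tombstone may be dropped once it is older than GC grace."""
    return now - deleted_at > gc_grace


def remaining_tombstones(deletion_times: List[int], now: int,
                         gc_grace: int = GC_GRACE_SECONDS) -> List[int]:
    # Tombstones that reads may still have to step over.
    return [t for t in deletion_times if not is_purgeable(t, now, gc_grace)]


# Hypothetical workload: one column deleted per hour in a single row for 30 days.
now = 30 * 24 * HOUR
deletes = list(range(0, now, HOUR))

with_default_grace = remaining_tombstones(deletes, now)
with_one_day_grace = remaining_tombstones(deletes, now, gc_grace=24 * HOUR)
print(len(with_default_grace), len(with_one_day_grace))
```

In this toy model, shortening GC grace directly shrinks the number of tombstones still attached to the row, which is the effect of the tuning people usually reach for.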

Usually people just tune the GC grace, as it's extremely conservative. It's also much better to use row-level deletes if possible. If the data is time-ordered and needs to be trimmed, a row-level delete with the timestamp of the trim point can improve performance dramatically. This is because a row-level tombstone will cause reads to skip any SSTables with max_timestamp < the tombstone. It also means compaction will quickly obsolete any superseded row-level tombstones.

Here's a graph of P99 latency as observed from the application for wide row reads (involving ~60 columns on average, CL.ONE) from a real 12-node hi1.4xlarge Cassandra 1.2.3 cluster running across 3 EC2 availability zones. The p99 RTT between these hosts is ~2ms.
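The SSTable-skipping rule for row-level tombstones can be sketched as a simple filter. This is a toy model of the behaviour described above, not Cassandra's read path; the `SSTable` class and `sstables_to_read` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SSTable:
    name: str
    max_timestamp: int  # newest write timestamp contained in this SSTable


def sstables_to_read(sstables: List[SSTable], row_tombstone_ts: int) -> List[SSTable]:
    # Everything in an SSTable whose max_timestamp is older than the row-level
    # tombstone is shadowed by it, so the read can skip that SSTable entirely.
    return [s for s in sstables if s.max_timestamp >= row_tombstone_ts]


tables = [
    SSTable("old-1", max_timestamp=1_000),
    SSTable("old-2", max_timestamp=2_000),
    SSTable("recent", max_timestamp=5_000),
]

# Trim the row at timestamp 3_000: only the newest SSTable still needs reading.
needed = sstables_to_read(tables, row_tombstone_ts=3_000)
print([s.name for s in needed])
```

Column-level deletes give the reader no such shortcut, since each tombstone only shadows its own column.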

http://i.imgur.com/WRdps3B.png

This also happens to be on data that is "ephemeral" as our goal is to keep it bounded at ~100 columns. The read:write ratio is about even. It has a mix of row and column-level deletes, LeveledCompactionStrategy, and the standard 10 day GC grace.

alexfernandez · on April 20, 2013

Cassandra is a real winner in that study only if you need on the order of 100K ops/sec. Otherwise high latency can be a killing factor.

zcam · on April 15, 2013

You probably mean pre-1.2, as 2.0 isn't out yet.

cnlwsu · on April 16, 2013

DataStax did call it 2.0 and now 3.0

pkolaczk · on April 16, 2013

DataStax Enterprise 2.0 shipped with Cassandra 1.0.
DataStax Enterprise 3.0 shipped with Cassandra 1.1.

DataStax Enterprise is a different product than Cassandra. Cassandra is one of its components, but there are more things bundled, e.g. Solr and Hadoop.