Visual Guide to NoSQL Systems

jbellis · on March 14, 2010

Submitted http://news.ycombinator.com/item?id=1191075 because a lot of people in this thread could use an explanation of what CAP actually means. :)

strlen · on March 14, 2010

I think P should be an S ("split"), a lot of people seem to confuse partition in the context of "partitioning a database" with a network partition ("split-brain" scenario).

Perhaps a better example to use should be a distributed chat room.

strlen · on March 14, 2010

This doesn't look entirely right to me.

Some comments:

Some RDBMS actual perform background asynchronous replication. If you're doing your reads from slaved replicas, you've got an "eventually consistent" system ("A{possibly, P}"). If you're only doing reads from the master node, you're no longer highly available. If you're "promoting" slaves in the case of a master failing, you're _AT_ best back to the case of eventual consistency (if the master comes back and the replication backlog is replayed correctly) or you might not be consistent at all. To echo others, let's call this "potential consistency".

Eventual consistency isn't something Amazon decided to make up, we've dealt with eventually consistent systems on a daily basis long before (DNS, usenet, some distributed file systems, etc...).

BigTable and co. are not "CP" and much more true to the "CA" definition than say MySQL replication or Postgres + Slony. They use protocols like Paxos with leases (BigTable) and Zab (used by ZooKeeper based systems). These are modified 3PC protocols: unlike 2PC protocols, they still guarantee fault tolerance ("A") even if the coordinator node fails; unlike pure 3PC protocols they offer certain guarantees of liveliness (e.g., through the use of leases) and even some degree of partition tolerance. Still, Google had to build Spanner on top of BigTable to provide partition tolerance / true multi-datacenter operation.

cx01 · on March 14, 2010

> If you're "promoting" slaves in the case of a master failing, you're again back to the case of eventual consistency.

I don't think so. If you use asynchronous replication and the master fails directly after acknowledging a new request but before forwarding it to the slaves, this request may be totally lost. So you'd have no consistency at all (or at least not eventual consistency).

strlen · on March 14, 2010

My bad. Editing the original comment. You're completely right, I forgot the case of master failing before all writes made to it are replicated.

gruseom · on March 14, 2010

Given that it's been proven that there is a tradeoff between C, A, and P, this triangle is a good idea, one of those ideas that is obvious as soon as you see it (but not before).

It's fascinating how much disagreement there is in the comments, here and in the OP, about where to place the various systems. We know that a system can't occupy more than one edge, yet apparently it's non-trivial in several cases to decide which. I wonder why? Are people are defining terms differently? Are some of the systems themselves poorly specified?

cx01 · on March 14, 2010

It's unfortunate, but a lot of people participating in these discussions have no clue. They read that CouchDB claims to support "ACID" and then conclude that it fulfills the C in CAP. Now in reality, CouchDB isn't ACID compliant because it doesn't even support transactions. Additionally the C in ACID is not equivalent to the C in CAP.

Or take a look at this sentence from the article "Consistent, Available (CA) Systems have trouble with partitioning and typically deal with it with replication and sharding.". I assume that instead of "partitioning" he meant "partitions". Even then, partitions can not be dealt with by sharding in any sense. And mentioning "replication" is totally misplaced here because obviously all of these data stores use some kind of replication.

by · on March 15, 2010

"Additionally the C in ACID is not equivalent to the C in CAP."

It would help if the people writing papers about CAP knew what the C in ACID meant before they wrote their papers.

Here is the relevant paragraph from the paper "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services" Created: 10/08/2003 14:42 http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pd...

"Most web services today attempt to provide strongly consistent data. There has been significant research designing ACID2 databases, and most of the new frameworks for building distributed web services depend on these databases. Interactions with web services are expected to behave in a transactional manner: operations commit or fail in their entirety (atomic), committed transactions are visible to all future transactions (consistent), uncommitted transactions are isolated from each other (isolated), and once a transaction is committed it is permanent (durable). It is clearly important, for example, that billing information and commercial transaction records be handled with this type of strong consistency."

Notice it says "... committed transactions are visible to all future transactions (consistent), ..."

This is not correct because the data can be overwritten by a subsequent transaction.

In this version of the same paper http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.... Created: 09/11/2007 01:17 the paragraph is the same except it says

"... transactions never observe or result in inconsistent data (consistent), ..."

If we go back to the people who seem to have coined 'ACID' in this paper "Principles of Transaction-Oriented Database Recovery" Theo Haerder and Adreas Reuter 1983. http://www.minet.uni-jena.de/dbis/lehre/ws2005/dbs1/HaerderR...

We see

"Consistency. A transaction reaching its normal end (EOT, end of transaction), thereby committing its results, preserves the consistency of the database. In other words, each successful transaction by definition commits only legal results."

and later they say

"... a database is consistent if and only if it contains the results of successful transactions. Such a state will hereafter be called 'transaction consistent' or 'logically consistent'. A transaction, in turn, must not see anything but effects of complete transactions (i.e., a consistent database in those parts that it uses), and will then, by definition, create a consistent update of the database."

that is for consistency the trasaction only has to have isolation and to commit consistent results, not 'observe' consistent data. This is because the database up to the point where the transaction starts has been created by a number of committed transactions against an originally empty database. By induction the data the transaction 'observes' must be consistent because the previous transactions have only committed consistent results.

I do not know what the C in CAP means.

cx01 · on March 15, 2010

> Notice it says "... committed transactions are visible to all future transactions (consistent), ..." > This is not correct because the data can be overwritten by a subsequent transaction.

It seems the paper is completely wrong here. AFAIK the C in ACID means that the integrity constraints in the database will always hold, i.e. no foreign key may reference a non-existing row. This would explain why they changed it in the later version of the paper.

Your reference to the Haerder, Reuter paper is interesting. It seems they kept the definition of consistency pretty vague.

by · on March 15, 2010

Yes, that seems correct. The Wikipedia entry explained consistency best for me

http://en.wikipedia.org/wiki/ACID#Consistency

which is as you say with the keys for normal databases.

It made clear for me the distinction between application level consistency, the bits that have to be enforced by the program, and database level consistency, which can vary depending on the sophistication of the database, and is enforced by the database. I hadn't really thought about that before. I suppose you could have a database that was less clever and left most of the consistency to the application but still be ACID compliant so long as you defined the scope of the database consistency carefully.

nathanh · on March 14, 2010

It's hard to often hard to reach a consensus because each system has mechanisms for compensating for features that they lack. Ex: RDBMSs often deal with partition intolerance using replication and sharding.

viraptor · on March 14, 2010

I'm not familiar with some of the solutions, but some of them got me wondering:

- Why is Tokyo Cabinet on the AP line? It's not distributed in any way by itself. It's protected by a single write lock, so it's definitely consistent. If they mean Tokyo Tyrant which provides a simple way of replication (even master-master replication), then...

- Why is MySQL only on AC line? If you do master-master setup with it, it could be on the AP line too. (and probably any relational db that does replication / clustering)

javery · on March 14, 2010

Partition Tolerant really means that it doesn't rely on replication/sharding which means something like Cassandra that is truly distributed.

viraptor · on March 14, 2010

Then TC is definitely in the wrong place, even if they meant TT, because it's simply replicated.

MySQL Cluster on the other hand.... It does have a consistent view works on many nodes and in some configurations doesn't depend on the number of data nodes. It's properly distributed, not replicated. Doesn't seem to violate any of C A P. What am I missing in this example?

cx01 · on March 14, 2010

Mysql Cluster performs 2phase commit for every transaction, so availability is not guaranteed.

sjs · on March 14, 2010

You can tune Riak to have whichever your qualities you need with its NRW values. N is the number of replicas of your data which can be specified per-bucket. R is the number of nodes that have to successfully respond to a read operation before it's considered to succeed overall, and W is the number of nodes that must respond to a write. R and W are specified with each read and write operation which is quite flexible.

If you want consistency you push R and W closer to N. If you need availability and partition tolerance then you lower R and W. You get to choose the mix of CAP that you need, for each bucket and each request.

I'm a Riak noob but this is my understanding of the system so far.

oomkiller · on March 14, 2010

Since Riak has tunable NRW parameters, it could be in any of these groups.

wanderr · on March 14, 2010

Can anyone explain to me how redis supports partitioning? I'm still learning but so far I haven't found any support for it.

strlen · on March 14, 2010

Partition tolerance in CAP means tolerance to a network partition. An example of a network partition is when two nodes can't talk to each other, but there are clients able to talk to either one or both of those nodes. If you've ever used IRC and experienced a netsplit, this is a great example of that.

A CA system guarantees strong consistency, at the cost of not being able to process requests unless all nodes are able to talk to each other. An AP system is able to function during the network split, while being able to provide various forms of eventual consistency.

There's several forms of eventual consistency. There's the "weak eventual consistency", where you may not be able to read your writes. In the context of Dynamo-based key/value stores this actually only occurs as a failure condition: when the first node in the preference list for a key isn't available.

Another form of eventual consistency is "read-your-writes" consistency, where by you use a quorum of R reads, W writes out of a total of N replicas and set R + W > N. In this case you're guaranteed to be able to read your writes even if multiple nodes in the preference list for a key fail (as long as you're able to meet a quorum).

cx01 · on March 14, 2010

You mean something like "store all keys with an even hash value on server X, and all keys with an odd hash value on server Y"? You can do this yourself in your application: Just hash the value and then send the request to the corresponding server.

wanderr · on March 15, 2010

How is that any different than how you would do it with MySQL? If both products have the same solution to the same problem (handle it in your application) then why aren't they on the same edge?

cx01 · on March 15, 2010

Read the comment by "strlen". Partitioning and network partitions are two totally different things.

wanderr · on March 15, 2010

I did, but I still don't see how redis and MySQL differ in how they handle either type.

cx01 · on March 16, 2010

If you want to partition (shard) your data, then this will work roughly the same in MySQL and Redis. But this article is only about partition tolerance. Neither MySQL nor Redis are inherently distributed, so it's not possible to generally state that either is partition-tolerant or not. The article is misleading here. You could use a replication strategy that is partition-tolerant or you could use a strategy that's not partition-tolerant.

wanderr · on March 17, 2010

Thanks, that's what I thought. I was hoping I was wrong. :P

jcdreads · on March 15, 2010

Oracle Coherence (formerly Tangosol) can be configured to be more or less anywhere you like on this triangle, including persisting everything to disk. Makes me sad that it's so expensive, and it chats a lot if you replicate fast-changing data across too many nodes, but man it sure is nice to have around.

javery · on March 14, 2010

I could be wrong but it seems like MongoDB isn't really partition tolerant since it relies on replication or sharding (just like an RDMS). I think it is more of an AC (with a caveat on the C)

mark_l_watson · on March 14, 2010

As you say, MongoDB has support for replica pairs that are partition tolerant, but part of this functionality is in client libs. For Ruby, you register both dual masters in a replica pair with the client library. This seems like a practical approach to me, or at least fairly easy to work with.

javery · on March 14, 2010

replica pairs are replication though, they aren't truly distributed. (they negotiate which one is master on launch)

mark_l_watson · on March 14, 2010

Nice! A well thought out reference. One thing: I wish that people would add RDF data stores to NoSQL discussions.

sid0 · on March 14, 2010

http://quotes.burntelectrons.org/683 comes to mind. :)

sabat · on March 14, 2010

Not entirely sure why Postgres, MySQL are in a NoSQL guide -- maybe the title should be different? Interesting chart to look at, though.

FWIW: yourtinnitus.com is down right now.

dws · on March 14, 2010

"Compare and Contrast." They're on the chart to show what your options are when you select CA from CAP, and to provide a basis for comparison with the other options.

davidw · on March 14, 2010

As a point of reference, presumably.