What's not addressed is that this may not be an advance over relational databases. In fact, when relational databases were first designed, databases like this were all-too-common, and RDBMSes rejected this kind of data model explicitly, for good design reasons. For instance, how do you select from or join to parts of the complex values inside a column?
> What's not addressed is that this may not be an advance over Relational databases
It's not. The relational model is much better, for many applications.
BUT.
The missing context is, the relational model cannot be sanely scaled out across multiple machines. Replication mostly works with some pain, but scaling writes is just a nightmare -- and what you have to do to partition your data across multiple rdbms nodes means giving up all that relational goodness. So what you end up with is neither of {easily scaled, relational}.
So if you can't have a scalable rdbms, the next best thing is a scalable key/columnfamily dbms. Something like Cassandra. Which, as a bonus, gives you significantly better per-node performance on modern hardware.
I'm curious: how easy is "easily scaled"? How much advance planning needs to be done about the eventual size of a Cassandra cluster?
I think I can see how you can add machines when you need more capacity, but what about when you don't need all that capacity anymore? How do you go about removing machines from the cluster, and how does all of the nicely scaled out content get rebalanced when you do?
> what about when you don't need all that capacity anymore
It's not really an interesting use case... for the same reason that, in every language I can think of, hashtables grow as you insert items but don't shrink as you remove them: if it got to size X once, it will probably do so again in the near future.
That said, decommissioning nodes will fall out naturally from our work on automatic load balancing for the 0.5 release.
You can update at the column granularity, so having a row of 100k columns is fine since you don't have to rewrite the whole thing to update a small part. (Cassandra rows have a sparse list of columns, so you can treat a row as a sorted set, not just a fixed-size vector. The model Digg describes in another article uses this: http://blog.digg.com/?p=966)
Columns are also indexed, so you can retrieve from large rows efficiently.
Finally, Cassandra adds the concept of a SuperColumn, which is a column that contains other columns.
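To make the model concrete, here's a toy in-memory sketch in Python. This isn't the Thrift client API, and the column family and data are made up; it just shows the shape of the data: a row is a sparse, sorted map of columns, and a SuperColumn adds one more level of nesting.

    # Toy model of the layout described above: each row is a sparse,
    # sorted map of columns, so updating one column never rewrites the row.
    from bisect import bisect_left, bisect_right

    class ColumnFamily:
        """row_key -> {column_name: (value, timestamp)}"""
        def __init__(self):
            self.rows = {}

        def insert(self, row_key, column, value, timestamp):
            # touches only this column, not the other 100k in the row
            self.rows.setdefault(row_key, {})[column] = (value, timestamp)

        def get_slice(self, row_key, start, finish):
            # columns are ordered by name, so a contiguous range is cheap
            cols = sorted(self.rows.get(row_key, {}).items())
            names = [name for name, _ in cols]
            return cols[bisect_left(names, start):bisect_right(names, finish)]

    # A SuperColumn is just one more level of nesting:
    # row_key -> {super_column_name: {column_name: (value, timestamp)}}

    comments = ColumnFamily()
    comments.insert("story:42", "2009-09-01!alice", "nice post", timestamp=1)
    comments.insert("story:42", "2009-09-02!bob", "agreed", timestamp=2)
    print(comments.get_slice("story:42", "2009-09-01", "2009-09-03"))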
I'll bite. There is nothing in the design of relational databases that specifies that a record must be fixed length. Most good dbs don't do this at all, simply using something akin to:
Each row in your db has:
    row size: row data
    for each column:
        cell size, cell data
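A minimal sketch of that length-prefixed layout, in Python. The 4-byte little-endian size fields are an assumption for illustration, not any particular engine's on-disk format:

    import struct

    def pack_row(cells):
        body = b""
        for cell in cells:                               # variable-length cells
            body += struct.pack("<I", len(cell)) + cell  # cell size, cell data
        return struct.pack("<I", len(body)) + body       # row size, row data

    def unpack_row(buf, offset=0):
        (row_size,) = struct.unpack_from("<I", buf, offset)
        pos, end, cells = offset + 4, offset + 4 + row_size, []
        while pos < end:
            (cell_size,) = struct.unpack_from("<I", buf, pos)
            cells.append(buf[pos + 4:pos + 4 + cell_size])
            pos += 4 + cell_size
        return cells

    row = pack_row([b"alice", b"alice@example.com", b"1979-03-14"])
    print(unpack_row(row))   # [b'alice', b'alice@example.com', b'1979-03-14']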
In multi-value databases, which the Cassandra SuperColumn seems to be mimicking somewhat, your cell data can contain more cells again, and can support huge hierarchies. I like the design of these systems: it keeps the relational model but stores all related data in one place, which dramatically increases speed because you are reading related data off the same disk block. Typically this means huge performance gains, since you read in large (multi-KB) blocks off disk anyway.
In my opinion, having written a few multidimensional indexing systems and DBs from scratch (including a multi-value DB), there is little preventing a relational database from becoming the best of both worlds. What is it about the relational model that you think makes it so "impossible" to scale?
I tend to think it's simply legacy code. Has anyone created a relational database with the goal of horizontal scalability from the outset?
My point is that a database designed specifically for scalability may push back the point at which CAP becomes relevant for all but the largest problems, e.g. via custom indexing, partitioning, and caching algorithms.
We see this with vertical-specific databases, which have a 1-3 order-of-magnitude performance advantage over traditional DBs.
People are just trying to store data in the most efficient system possible. Using RDBMS software to store data that is only queried by primary key, and that you know will never need to be aggregated or dynamically queried, can be overkill.
The guy who invented the normal forms also went on to invent analytical forms, which are the exact opposite: for example, the Entity-Attribute-Value (EAV) model, used by many, including the W3C Semantic Web.
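For anyone who hasn't run into it, EAV turns the wide table inside out: one long, narrow table of (entity, attribute, value) triples, much like an RDF store. A toy illustration (data made up):

    triples = [
        ("user:1", "name",  "Ada"),
        ("user:1", "email", "ada@example.com"),
        ("user:2", "name",  "Bob"),
    ]

    def attrs(entity):
        # reassemble an entity's "row" by filtering the triples
        return {a: v for e, a, v in triples if e == entity}

    print(attrs("user:1"))   # {'name': 'Ada', 'email': 'ada@example.com'}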
Cassandra is a scalable key/columnfamily database designed for supporting low-latency applications with vast amounts of data partitioned across many machines. (Facebook has 40TB on a 150-machine Cassandra cluster.)
CouchDB is a document database that supports two-way replication so you can re-sync after taking part of the data offline, but it's still designed around the concept of a single master that holds all the data.
> CouchDB is a document database that supports two-way replication so you can re-sync after taking part of the data offline, but it's still designed around the concept of a single master that holds all the data
To clarify, CouchDB's replication features are peer-based and ad hoc; there is no "master" replica. And you can take part or all of the data offline, or spread it geographically.
I think by single master you mean that CouchDB doesn't have support for partitioning a single logical database across machines, which is true (there are projects for large-scale partitioning with CouchDB, but it's not in the core yet).
Does a Cassandra cluster stay write-available in the event of a network partition?
If so, how does it reconcile writes when the partition heals? Last I looked, Cassandra doesn't use vector/logical clocks -- doesn't that risk data loss when the partition heals, if reconciliation is a simple last-write-wins based on physical timestamps? Does Cassandra use merkle trees for anti-entropy?
From what I can tell, although Cassandra claims to be write-fault-tolerant, the dependence on physical timestamps and the lack of the self-healing properties that merkle trees provide make me nervous about data loss and inconsistency when deploying it at scale.
> Does a Cassandra cluster stay write-available in the event of a network partition?
The client can specify whether it wants consistency (refuse writes if not enough write targets are there) or availability.
If it chooses availability, then Cassandra sends extra copies to nodes it _can_ reach, with a tag that specifies who the "real" destination is. When that node is reachable again it will be forwarded. ("Hinted handoff.")
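Roughly, the idea is this (a sketch of the mechanism, not Cassandra's actual code; the Node class and names here are invented for illustration):

    # If a replica is unreachable, a live node keeps the write plus a hint
    # naming the real owner, and forwards it once the owner comes back.
    class Node:
        def __init__(self, name):
            self.name, self.data, self.hints = name, {}, []

        def store(self, key, value):
            self.data[key] = value

    def write(key, value, replicas, live):
        for target in replicas:
            if target in live:
                target.store(key, value)
            else:
                stand_in = next(iter(live))             # any reachable node will do
                stand_in.hints.append((target, key, value))

    def deliver_hints(recovered, live):
        for holder in live:
            remaining = []
            for owner, key, value in holder.hints:
                if owner is recovered:
                    recovered.store(key, value)         # forward the parked write
                else:
                    remaining.append((owner, key, value))
            holder.hints = remaining

    a, b, c = Node("a"), Node("b"), Node("c")
    write("k", "v", replicas=[a, b, c], live={a, c})    # b is on the wrong side of the partition
    deliver_hints(b, live={a, c})
    print(b.data)                                       # {'k': 'v'}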
> how does it reconcile writes when the partition heals?
As you said, last-write-wins. The experience with Dynamo showed that most apps don't want to deal with explicit conflict resolution, and don't need it. (But, I suspect we will end up adding it as an option for those apps that do. In the meantime, if Cassandra isn't a good fit, we're not trying to hard-sell anyone. :)
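Last-write-wins is as simple as it sounds -- keep the copy with the newest (client-supplied) timestamp and drop the rest, which is exactly the trade-off being discussed:

    def reconcile(a, b):
        # a and b are (value, timestamp) copies of the same column
        return a if a[1] >= b[1] else b

    print(reconcile(("draft", 100), ("final", 250)))   # ('final', 250)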
> Does Cassandra use merkle trees for anti-entropy?
Not yet, but my co-worker Stu Hood is working on this. Should be part of the 0.5 release.
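For anyone unfamiliar with the technique, the general idea (not the actual 0.5 implementation) is to hash each key range, hash the hashes up to a root, and only compare subtrees whose hashes differ:

    import hashlib

    def h(data):
        return hashlib.sha1(data).hexdigest()

    def build(leaf_ranges):
        # returns levels of hashes, leaves first, root last
        level = [h(r) for r in leaf_ranges]
        levels = [level]
        while len(level) > 1:
            level = [h("".join(level[i:i + 2]).encode())
                     for i in range(0, len(level), 2)]
            levels.append(level)
        return levels

    def stale_ranges(tree_a, tree_b):
        # a real exchange walks down from the root, skipping matching subtrees;
        # comparing the leaf level directly shows the end result
        return [i for i, (x, y) in enumerate(zip(tree_a[0], tree_b[0])) if x != y]

    replica1 = [b"range0 data", b"range1 data", b"range2 data", b"range3 data"]
    replica2 = [b"range0 data", b"range1 stale", b"range2 data", b"range3 data"]
    t1, t2 = build(replica1), build(replica2)
    print(t1[-1] == t2[-1])       # False: roots differ, so something is out of sync
    print(stale_ranges(t1, t2))   # [1]: only range 1 needs repair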
> the dependence on physical timestamps and the lack of the self-healing properties
Whether the former is an issue is app-specific. As to the latter, I'm excited to get the merkle tree code in, too.
In the meantime, Cassandra _does_ do read repair and hinted handoff, so in practice it's what I would call "barely adequate." :)
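Read repair in miniature (replicas here are just dicts of key -> (value, timestamp), purely for illustration): ask several replicas, return the newest copy, and write it back to any replica that turned out to be stale.

    def read_with_repair(key, replicas):
        copies = [r.get(key) for r in replicas]
        newest = max((c for c in copies if c is not None),
                     key=lambda c: c[1], default=None)
        if newest is not None:
            for r in replicas:
                if r.get(key) != newest:
                    r[key] = newest        # bring the stale replica up to date
        return newest

    r1 = {"k": ("new", 2)}
    r2 = {"k": ("old", 1)}
    print(read_with_repair("k", [r1, r2]))   # ('new', 2)
    print(r2)                                # {'k': ('new', 2)} -- repaired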
People haven't stopped using mainframes. In fact sales of mainframes have been growing at a very healthy clip.
That said, commodity hardware is a lot cheaper than niche hardware due to increased competition and economies of scale. So any problem that can reasonably be tackled on commodity hardware is generally cheaper to do so.
However not all problems are a good fit for commodity hardware. Mainframes win if you need reliability guarantees, and win again for sustained high volume IO. Commodity hardware doesn't keep up. Similarly there are computational problems that require a lot of IO and communication between the processing nodes. Supercomputers beat clusters of commodity hardware for those problems.
But if you've got a computational problem and it doesn't fall into one of those narrow categories, commodity hardware will be cheaper.