> What's not addressed is that this may not be an advance over Relational databases
It's not. The relational model is much better, for many applications.
BUT.
The missing context is, the relational model cannot be sanely scaled out across multiple machines. Replication mostly works with some pain, but scaling writes is just a nightmare -- and what you have to do to partition your data across multiple rdbms nodes means giving up all that relational goodness. So what you end up with is neither of {easily scaled, relational}.
So if you can't have a scalable rdbms, the next best thing is a scalable key/columnfamily dbms. Something like Cassandra. Which, as a bonus, gives you significantly better per-node performance on modern hardware.
I'm curious: how easy does "easily scaled" mean? How much advance planning needs to be done about the eventual size of a Cassandra cluster?
I think I can see how you can add machines when you need more capacity, but what about when you don't need all that capacity anymore? How do you go about removing machines from the cluster, and how does all of the nicely scaled out content get rebalanced when you do?
> what about when you don't need all that capacity anymore
It's not really an interesting use case, for the same reason that in every language I can think of, hashtables grow as you insert items but don't shrink as you remove them: if it got to size X once, it will probably do so again in the near future.
That said, decommissioning nodes will fall out naturally from our work on automatic load balancing for the 0.5 release.
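To make the rebalancing question concrete, here is a toy consistent-hash ring in Python. It is only a sketch of the general technique, not Cassandra's actual partitioner or its load balancing code; the `Ring` class, the `token` helper, and the node names are made up for illustration. The point it shows is that when a node leaves, only the keys that node owned need a new home.

```python
import bisect
import hashlib


def token(key):
    """Map a string to a position on the ring (hypothetical helper)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")


class Ring:
    """Toy consistent-hash ring: each node owns the keys up to its token."""

    def __init__(self, nodes):
        self._tokens = sorted((token(n), n) for n in nodes)

    def owner(self, key):
        # First node token at or after the key's token, wrapping around.
        i = bisect.bisect_left(self._tokens, (token(key), ""))
        return self._tokens[i % len(self._tokens)][1]

    def remove(self, node):
        # Decommission: the departed node's range falls to its successor.
        self._tokens = [t for t in self._tokens if t[1] != node]


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
keys = [f"key{i}" for i in range(1000)]
before = {k: ring.owner(k) for k in keys}

ring.remove("node-b")
after = {k: ring.owner(k) for k in keys}

# Keys that were not on node-b stay exactly where they were.
moved = [k for k in keys if before[k] != after[k]]
```

In this sketch every entry in `moved` was previously owned by `node-b`, so removing a machine means streaming one node's data to its neighbour rather than reshuffling the whole cluster.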
You can update at the column granularity, so having a row of 100k columns is fine since you don't have to rewrite the whole thing to update a small part. (Cassandra rows have a sparse list of columns, so you can treat a row as a sorted set, not just a fixed-size vector. The model Digg describes in another article uses this: http://blog.digg.com/?p=966)
Columns are also indexed so you can retrieve from large rows efficiently too.
Finally, Cassandra also adds the concept of a SuperColumn, which is a column that contains other columns.
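A rough way to picture this, as a Python sketch rather than anything resembling Cassandra's real storage or client API (the `Row` class and its `set`/`slice` methods are invented here for illustration): a row is a sparse map of column name to value, kept sorted by name, so you can overwrite one column or read a contiguous slice without touching the rest.

```python
import bisect


class Row:
    """Toy wide row: a sparse, sorted map of column name -> value."""

    def __init__(self):
        self._names = []   # column names, kept in sorted order
        self._values = {}  # column name -> value

    def set(self, name, value):
        # Column-granularity update: only this one entry changes.
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start, end):
        # Binary search on the sorted names, then read just that range.
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, end)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


row = Row()
for i in range(100_000):
    row.set(f"col{i:06d}", i)

row.set("col000042", "updated")            # touch one column out of 100k
result = row.slice("col000040", "col000043")

# A "super column" in this toy model is just a column whose value is
# itself a set of columns.
profile = Row()
profile.set("email", "a@example.com")
row.set("profile", profile)
```

Here `result` holds the four columns from `col000040` through `col000043`, with the third one carrying the updated value.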
I'll bite. There is nothing in the design of relational databases that specifies that a record must be fixed length. Most good dbs don't do this at all, simply using something akin to:
Each row in your db has:
row size: row data
for each column:
cell size, cell data
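As a minimal sketch of that layout in Python (the 4-byte big-endian length prefixes and the `encode_row`/`decode_row` names are my own assumptions, not any particular database's on-disk format):

```python
import struct


def encode_row(cells):
    """Encode a list of byte strings as: row size, then (cell size, cell data)*."""
    body = b"".join(struct.pack(">I", len(c)) + c for c in cells)
    return struct.pack(">I", len(body)) + body


def decode_row(buf):
    """Inverse of encode_row: walk the length prefixes to recover each cell."""
    (row_size,) = struct.unpack_from(">I", buf, 0)
    cells, pos, end = [], 4, 4 + row_size
    while pos < end:
        (cell_size,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        cells.append(buf[pos:pos + cell_size])
        pos += cell_size
    return cells


cells = [b"a", b"", b"hello world"]   # variable-length, including an empty cell
encoded = encode_row(cells)
decoded = decode_row(encoded)
```

Nothing about this requires fixed-length records: each cell takes only its own length plus a small prefix.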
In multi-value databases, which the Cassandra SuperColumn seems to be mimicking somewhat, your cell data can contain more cells again, and can support huge hierarchies. I like the design of these systems: it keeps the relational model but stores all related data in one place, which dramatically increases speed because you are reading related data off the same disk block. Typically this means huge performance gains, since you read in large blocks (KB) off disk as it is.
In my opinion, having written a few multi-dimensional indexing systems and DBs from scratch (including a multi-value DB), there is little preventing a relational database from becoming the best of both worlds. What do you think it is about the relational model that makes it so "impossible" to scale?
I tend to think it's simply legacy code. Has anyone created a relational database with the goal of horizontal scalability from the outset?
My point is that a database designed specifically for scalability (e.g. with custom indexing, partitioning and caching algorithms) may push back the point at which CAP becomes relevant, for all but the largest problems.
We see this with vertical-specific databases, which have a 1-3 orders of magnitude performance advantage over traditional DBs.