Relational Data, Document Databases and Schema Design

gfodor · on July 5, 2010

Document databases are a passing fad, a backlash against the flaws in the currently available open source databases. These flaws have nothing to do with the relational model.

If there was an open source RDBMS that allowed instant/lazy schema migrations, direct map-reduce access to its data, was able to automatically partition and shard, and somehow magically had performant joins you would never see these document database come to be so popular. Document databases have no mathematical grounding analagous to the relational model; the reason is because they are a hackneyed application of the relational model that allows implemention to bleed even further into the model than current RDBMses do. For example, the fact a developer now will have to choose between "embedding" or joining documents together application side has absolutely zero different in terms of the true data model. The data model is unchanged in either scenario, it is just being materialized differently on disk for implementation issues.

Edit: To be clear, the way things should work is you should model your data as a normalized, clean model. Then, you should be able to annotate the data or provide alternative views that "embed" documents and so on to reduce cross-table joins. This is an optimization layer and the actual materialization of these views should be maintained by the database. It should be part of the query optimizer, and your joins should simply speed up once they can access the "embedded versions". The fact that you have to do this all manually and store the data yourself, and basically take on the role of the query optimizer as well, shows that we've traded C for assembly code in the db world with these document databases.

It's truly a sad state of affairs that peoples horrible experiences with whatever particular RDBMS they've used are causing them to react by using one of these document databases. Now, they've got a whole series of new problems. There is no free lunch until someone addresses some of the pain points like schema evolution in a way that doesn't sacrifice everything else.

herewego · on July 5, 2010

I disagree. You are talking about the relational model conceptually. The author is talking about the relational model as it is implemented in today's RDBMS solutions.

You elude to this in the last paragraph of your comment. Having a horrible experience horizontally scaling, for example, is ample reason enough to look into document databases. Largely because the current implementations of document databases allow for ease of partitioning in a way that the current implementations of RDBMS' don't.

The author doesn't imply that document datastores are free lunches. They're simply alternatives and he goes on to say that document datastores are better for some problem domains than others.

Also, you oversimplify the difficulties in implementing a highly-performant document datastore by implying that data should be normalized and that storing the data on disk is an optimization layer. This comment comes off as being written by someone who is ignorant of how document disk storage is implemented and why it is so fast. Doing what you suggest would very likely result in a datastore that was drastically slower than the currently implemented document datastores or even key/value datastores for that matter (MongoDB, CouchDB, Cassandra, etc).

Will we someday perhaps have the end-all-be-all of RDBMS's that merges the feature-set of current RDBMS's and the performance of document/key-value datastores? Maybe, but until then, they're alternatives to each other that generally succeed in providing solutions for the problems for which they were designed:

* RDBMS - consistency, transactions, predictable schema, etc. * Document/Key-Value - horizontal scalability, raw performance, flexible schema, etc.

Different solutions for different problem domains.

ljlolel · on July 5, 2010

Hi, I think that you are completely right. Schema evolution and easy partition/sharding are broken on current open-source RDBMS systems. Relational data has strong mathematical guarantees as well as useful normalization properties for ensuring consistency.

I have a question about what good solution you have to solve these problems. Are you suggesting that Oracle does this well as a commercial solution? Is anybody at all working on a next-generation RDBMS that makes schema migration truly easy?

For the sharding/partitioning question, is it even possible to make a relational data scheme that stays available and consistent in the case of network partition?

gfodor · on July 5, 2010

The solution is pretty clear, simply make it easy to do what the author suggests in their article, in a safe and repeatable way. You define your new schema, and you define the tranformation rules to run on the old data. In turn, the database applies these lazily. There'd be some formality necessary to prove certain operations could be applied lazily to provide a consistent view of the data. People are doing this by hand now; it should be part of the system. One can imagine a mathematical model that involves deferred relational operators. The constraints that would be imposed would probably be that certain operations would not be permissable until others complete. For example, it might be necessary, when running one of these lazy migrations, to "flush" the remaining untransformed rows to the end before you are able to apply another one. The upside is your migration scripts run instantly, becuase they have no side effects until the next transaction.

The CAP theorem I think often gets rolled up into these "NoSQL" discussions erroneously. One can still model data relationally but materialize it in a way that reduces ACID compliance to survive in the face of network partitions. Arguably this is a red herring though. In practice, the "network partition" problem to me seems to be only really worth worrying about if you are Google or Amazon and are running cross-data center data storage. For the majority of the world this is not something that needs to be worried about. (See: VoltDB's choice on the matter.)

Edit: By the way, I'm not saying any of this is easy. It's hard, really hard. (I couldn't do it.) But the point here is that it's easy to get caught up and think these 'new' document databases are solving these problems. They're not. They're putting you back down to the assembly code level and forcing you to do these things yourself. The whole point of the RDBMS is to provide an abstraction layer. They're removing this abstraction layer and heralding it as a step forward. It is, in a bizarre sense similar to how it's a "step forward" for you to be able to access the registers on your CPU, but it's surely a step backwards in many ways as well.

jchrisa · on July 5, 2010

The compelling thing about CouchDB is the ability to take a copy of the database offline, make changes, and then sync back up with other copies of the database later (or never).

If this isn't a network partition I don't know what is.

Of course thinking like this is totally unlike what most relational users are accustomed to, but this use case is only becoming more prominent as applications spread to diverse and occasionally connected devices.

I have a hard time imagining practical offline APIs without the full MVCC document model. Once you see this use case's value it's hard not to want to write all your apps so that they can be replicated offline for disconnected use.

mjw · on July 5, 2010

One relevant point here is that it's relatively easy to merge two branches of changes for a simple (key/value) data model, but harder to merge changes to relational data where there are additional constraints (like foreign keys) which must be preserved.

That said, just because it's easy to merge key/value data, doesn't mean that the results of the merge are necessarily going to be correct. The constraints still exist, they're just not explicit in the data model, meaning that when your network partition gets joined up again and it magically merges all those changes at the key-value level, you may be ending up with data that's broken in subtle, difficult-to-track-down ways because other implicit constraints weren't respected by the merge.

So, merging changes in a way that respects constraints which apply to that data, is a hard problem in general. Key/value stores, with their less-structured data model, sweep that problem under the carpet somewhat in order to make things easier in distributed settings. Relational databases suffer from trying to address the problem more fully.

gfodor · on July 5, 2010

This is a really important point. There seems to be this joy people are having from shedding their schema since they can just get stuff done and not have it get in the way. This should be just as much scary as liberating. You're basically flying on manual now, and there's nothing protecting your data integrity anymore.

dalore · on July 5, 2010

I disagree, RDF is gaining ground and is getting more popular. In the real world, schema changes are a fact of life and RDF handles it well. RDF also solves the relationship quandary you mention and even allows database (stores) to be distributed and disparate.

mjw · on July 5, 2010

RDF is based on a highly-normalised data model though--everything is broken down into binary relations.

This is essentially just a rephrasing of the relational model (modulo a relatively simple translation which adds surrogate keys in order to split up >2-ary relations which aren't normalisable any further, into binary relations)

mjw · on July 5, 2010

Couldn't agree more.

They key point which I think people need to realise is that document databases / key-value stores are essentially lower-level tools than relational databases.

Don't expect a free lunch from choosing lower-level tools. Choose them when you need the extra flexibility / performance / scalability, but be prepared for an increased risk of shooting yourself in the foot, and prepared to implement lots of lower-level pieces of the puzzle, which a high-level tool will handle correctly (if not as speedily) for you.

Rather like choosing C over (say) Python. Although I wouldn't push the analogy tooo far.

wanderr · on July 5, 2010

I stopped reading when he said consistency is in your hands (with noSQL). Since you have to manage relationships, and doing so requires several round trips to the datastore, and generally doesn't involve transactions, what happens when inserts 1 and 2 succeed, but insert 3 fails? What happens if you crash before insert 3? In high volume applications, these situations will arise and you will end up with inconsistent data. That doesn't prevent noSQL from being useful in some situations, but it's a far cry from putting control over consistency in your hands.

jchrisa · on July 5, 2010

CouchDB has full ACID on a single document, so you can avoid these scenarios by writing everything that needs to happen in a transaction into a single document. Modeling your data like this means you are also more robust against partial replications (when using offline mode).

It is possible to do banking this way (CouchDB is probably better for banking than relational systems). Here's an example:

http://books.couchdb.org/relax/reference/recipes

wanderr · on July 5, 2010

I'm not too familiar with CouchDB in particular or what exactly a "document" is. Let's say I have a site about books. I need to be able to look up books by author, title, isbn and users who have it in their catalogue. Is there a way in CouchDB to ensure consistency while maintaining those relationships?

jchrisa · on July 6, 2010

Yes, that's what the incremental map reduce indexes are used for. Each book would be a document, which you could query by any of those attributes. I'd store the fact that each user owns a book as it's own document, and then you can easily get a view of books by user, or of users by book.