*This is probably why it isn't ruling the world.* Technical merit and popularity...

dasil003 · on March 30, 2009

I sort of see where you're coming from in terms of the parity mismatch between RDBMSs and typical application code. Certainly we can avoid some coding headaches by just dumping things into a data store that is optimized for what we want to do with that data right now. But then you say that most applications have a well-defined data model and rarely run "queries". This is where I think you've gone terribly terribly wrong.

The value of a relational database is that it represents the underlying reality of the data in the most application-agnostic way. It's not about "big piles of data" or "making sense of your data"; relational databases are about making your data as expressive as possible. You're selling this idea that most applications only use data in a few predefined ways. I have to say that sounds like a complete pipe dream. Requirements change all the time. Reporting needs often are not even conceived until you have hundreds of megabytes of data. Let's not even get into multiple applications using the same database.

In every business I've ever been involved with, the data is always more valuable than the code, and it always outlives the code. Too much of the hype around these alternative database technologies is throwing the baby out with the bathwater. The idea that "most" applications don't need structured data just strikes me as incredibly naive and short-sighted. Far more applications need structured data than need to scale.

jrockway · on March 30, 2009

Well, let's think about this in more detail.

Let's say you have two types of data, customers and orders. Customers have many orders, an order belongs to a customer. This is easy enough with a typical relational database. You have a customers table and an orders table. You can join the tables and ask questions like "how many customers spent more than $300 last year?"
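A minimal sketch of the relational version, assuming SQLite through Python's `sqlite3` module. The table layout, column names, sample rows, and the helper `customers_over` are illustrative assumptions, not anything from the comment.

```python
import sqlite3


def customers_over(conn, threshold, year):
    """Count customers whose orders in `year` sum to more than `threshold`."""
    row = conn.execute(
        """
        SELECT COUNT(*) FROM (
            SELECT c.id
            FROM customers AS c
            JOIN orders AS o ON o.customer_id = c.id
            WHERE o.year = ?
            GROUP BY c.id
            HAVING SUM(o.total) > ?
        )
        """,
        (year, threshold),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        year INTEGER NOT NULL,
        total INTEGER NOT NULL  -- whole dollars, to keep the sketch simple
    );
    CREATE INDEX orders_by_year ON orders(year);
    """
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "alice"), (2, "bob"), (3, "carol")],
)
conn.executemany(
    "INSERT INTO orders (customer_id, year, total) VALUES (?, ?, ?)",
    [(1, 2008, 200), (1, 2008, 150), (2, 2008, 100), (3, 2007, 900), (3, 2008, 300)],
)

# "How many customers spent more than $300 last year?" with 2008 as last year.
big_spenders = customers_over(conn, 300, 2008)
print(big_spenders)
```

The join, grouping, and filtering are all stated declaratively; the engine decides how to use the index on `year`.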

Now let's consider the graph/object database equivalent. You have two classes, Order and Customer. A Customer has a set of Orders, and the Order has a customer. (Cycles are fine, this is a graph, not a tree.) Creating an order works with some method like $customer->new_order_for('some pants'). You store this in the database, and the graph structure is stored and indexed. (Usually, object databases index on class name, but you can always specify other conditions. This makes it basically equivalent to the relational database.) Note that this structure works very well; the in-memory relationship is the same as the in-storage relationship. You can also write the same query as with the relational database. Get all customers, find their orders from this year, and sum the totals. (Instead of writing SQL, you would just write a script here. You can index things like the order year to speed up the query, as well. Otherwise, it's O(n), but so is the relational database without an index. There is no magic after all.)
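A rough sketch of the object-graph version, in plain Python rather than any particular object database. The classes, the extra arguments to `new_order_for`, and the hand-built year index are hypothetical stand-ins for what a real object store would persist and index for you.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality, so cycles are harmless
class Order:
    customer: "Customer"
    item: str
    total: int
    year: int


@dataclass(eq=False)
class Customer:
    name: str
    orders: list = field(default_factory=list, repr=False)

    def new_order_for(self, item, total, year):
        # The order points back at its customer: a cycle, which is fine in a graph.
        order = Order(self, item, total, year)
        self.orders.append(order)
        return order


def big_spenders(customers, threshold, year):
    """The 'script instead of SQL': an O(n) walk over every customer and order."""
    return [
        c for c in customers
        if sum(o.total for o in c.orders if o.year == year) > threshold
    ]


def index_orders_by_year(customers):
    """A hand-rolled index on order year, analogous to a database index."""
    index = defaultdict(list)
    for c in customers:
        for o in c.orders:
            index[o.year].append(o)
    return index


def big_spenders_indexed(index, threshold, year):
    """Same question, but only touching the orders for the requested year."""
    totals = defaultdict(int)
    for o in index.get(year, []):
        totals[o.customer] += o.total
    return [c for c, total in totals.items() if total > threshold]


alice, bob, carol = Customer("alice"), Customer("bob"), Customer("carol")
customers = [alice, bob, carol]
alice.new_order_for("some pants", 200, 2008)
alice.new_order_for("a coat", 150, 2008)
bob.new_order_for("socks", 100, 2008)
carol.new_order_for("a suit", 900, 2007)

index = index_orders_by_year(customers)
print([c.name for c in big_spenders(customers, 300, 2008)])
```

Both functions answer the same question; the indexed one only shows that an index is an optimization you can add in either model.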

Anyway, there is no lack of flexibility with the graph database. If you want to query your data, you can. It's just less convenient, since you have to write a program to do it, instead of letting your database management engine do it. (This is actually not true in general; AllegroGraph has a querying engine based on Prolog.)

Back when I did data warehousing, we had to move all our data from the web app servers to a warehousing server in a specialized schema so that some GUI software could manipulate the data. Even though we used a relational database, we had to convert anyway. Using an object database would have made the app code simpler, and the warehousing code equally complex. So I think that would be a gain, not a loss.

dagheti · on March 30, 2009

The flexibility of the relational model is built on its bare simplicity: values + sets + logic. If you propose adding to this and giving up the benefits provided by these simplifications (simplicity of reasoning, ability to change the physical implementation without affecting the logical one [those not-magic indexes], declarative constraints, declarative queries), you need to have a really good reason.

The network model, if it means anything, lets you add pointers to the mix and changes how you reason about your data from sets to graphs. This makes it harder to declare constraints, check integrity, update sets of data, access your data in different ways, and reason about your queries. An imperative script is in no way as safe a program as a declarative query.

The clearest benefit of the network model, in my view, is that it lets you use your object code fairly seamlessly. Ok.

The problem is that this is a good trade-off for programs where you don't care primarily about your data, but about your code. I'd argue that isn't true for the majority of programs that have databases at all. The issues that kill your system years down the line and give you nightmares are not code issues, they are data issues. And the code you write to fix those problems... will end up re-implementing all those annoying "heavy" bits of RDBMSs that coders seem to hate.

jrockway · on March 30, 2009

> And the code you write to fix those problems... will end up re-implementing all those annoying "heavy" bits of RDBMSs that coders seem to hate.

Or you won't. I have many important KiokuDB applications in production. And they are not toy Web 2.0 things, they are important sites that perform important data analyses. I don't think any of the logic I had to write to support them was particularly difficult, and it was fewer lines of code than I would need to define my ORM classes.

> The flexibility of the relational model is built on its bare simplicity: values + sets + logic.

Simplicity is good. However, software is complex, and complexity has to live somewhere. Look at git, for example. The underlying model is simple and beautiful, blobs, trees, and commits. Wonderful. But, to make that beautiful model into a revision control system, thousands of lines of code had to be written. So while simplicity makes the bottom part of the system simpler, it didn't do much for the overall simplicity of the entire system.

I feel the same way about object databases. They are more complex than a relational database (unless you count things like replication and embedded scripting and ... as relational database features, which I don't), but they help me decrease the complexity of the code I see.

As an example, when I use an RDBMS, I have to maintain:

SQL schema + upgrades

ORM classes

Logic classes (for abstractions over multiple tables or data stores)

Random scripts to query the database and give me munged reports

With an object database, I only have to write the logic classes and the random scripts. The random scripts are slightly more complicated, but not significantly more. The main app is much simpler, and much easier to test. I also don't have to hack my data model into the relational model. (Ever represent a tree in a relational database? It's a hack.)
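For readers who have not hit the tree case, here is one common relational encoding, sketched with SQLite through Python's `sqlite3`: an adjacency list (each row stores its parent's id) walked with a recursive common table expression. The `nodes` table, its rows, and the `subtree` helper are illustrative assumptions; whether this counts as a hack is the commenter's judgment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nodes (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES nodes(id),  -- NULL for the root
        name TEXT NOT NULL
    );
    """
)
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?)",
    [(1, None, "root"), (2, 1, "a"), (3, 1, "b"), (4, 2, "a1"), (5, 4, "a1x")],
)


def subtree(conn, root_id):
    """Return (name, depth) for every node at or below `root_id`."""
    return conn.execute(
        """
        WITH RECURSIVE sub(id, name, depth) AS (
            SELECT id, name, 0 FROM nodes WHERE id = ?
            UNION ALL
            SELECT n.id, n.name, sub.depth + 1
            FROM nodes AS n JOIN sub ON n.parent_id = sub.id
        )
        SELECT name, depth FROM sub ORDER BY depth, name
        """,
        (root_id,),
    ).fetchall()


print(subtree(conn, 2))
```

In an object graph the same traversal is just following child references, which is the mismatch the comment is pointing at.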

jules · on March 30, 2009

> However, software is complex, and complexity has to live somewhere.

Exactly. This argument comes up in a lot of topics. For example, that pure functional programming is easier to reason about. True, but as soon as you have to simulate state it's more complex than just having state. This is also the reason that pure functional programming isn't automatically parallelizable. Data dependencies don't magically go away.

Same thing if you are simulating object database features in an RDBMS.

jrockway · on March 31, 2009

> For example, that pure functional programming is easier to reason about. True, but as soon as you have to simulate state it's more complex than just having state.

I actually disagree here. It's true that the traditional state model of "everything gets everything" makes writing code very easy. Everything can see and modify everything else, so you never have to worry about manually passing state around.

This makes the code easy to write, but very difficult to debug. A function can do something different every time you call it with the same arguments. That is hard to reason about.
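A tiny illustration of that difference, written in Python only for brevity. The function names and the counter are made up for the example.

```python
counter = 0  # shared mutable state that any code can see and modify


def next_id_hidden():
    """Same (empty) arguments, different result on every call."""
    global counter
    counter += 1
    return counter


def next_id_pure(state):
    """State is passed in and handed back, so equal inputs give equal outputs."""
    new_state = state + 1
    return new_state, new_state  # (result, updated state)


first, second = next_id_hidden(), next_id_hidden()
pure_first, pure_second = next_id_pure(0), next_id_pure(0)
print(first, second, pure_first, pure_second)
```

The second version is more tedious to call by hand, which is the cost that state abstractions in the data types are meant to hide.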

There are actually a few layers involved: the programming language, the data types, and the app code. In a non-functional language, you really ignore the data types and let your app code interact with the language. ("var foo = 42; ...; foo++")

In a language like Haskell, it's important to think about your data types. Your state abstraction will live there, and then your app code will not have to worry about the details of passing state around. So in the end, your app code looks the same in either paradigm.

An advantage of this approach is an increase in generality. If you just have some variables hanging around and some "stuff" that uses them, that's all you can ever have. If you interact with state via well-defined interfaces like the Monad or Applicative Functor, you can build abstractions that work on all types like this. This has worked very well for Haskell. (An example is the do syntax for monadic computation. Although you have functional purity, your program looks imperative. This, IMO, is the best of both worlds. There is some complexity under the hood, but you get code that's easy to read, write, and reason about.)

dagheti · on March 31, 2009

I agree with you here 100%. Maybe this is a good pathway to explaining why relational databases are good ideas in much the same way as having a programming strategy that uses a backbone of pure functions. This purity-dividend is exactly why the relational model is such a good strategy for database management.

In the same way Haskell lets you separate your pure functional code from your monadic code, giving you the ability to reason with referential transparency and to construct imperative machines where you must, the RDBMS approach applies total logic programming to database management. It's frustrating because your code cannot all live in one language as it can in Haskell, but the separation of pure from impure is exactly where the value is coming from.

The RDBMS focus on values and total computations (you know they will halt, making it even easier to reason about queries than non-total functions) allows you to isolate simple logic programs from the rest of your impure non-total code.

Navigational databases don't give the same benefits because they are neither value-based nor total. They are the imperative model of the database world, and though they are easy to write, they will end up difficult to debug, maintain, and keep coherent.

jrockway · on March 31, 2009

I don't think this is what I meant.

dagheti · on March 31, 2009

I agreed with your overall point, but I was expanding on part of what you were saying:

You seem to appreciate how Haskell allows you to separate your pure code from your non-referentially transparent monadic code. This gives you reasoning where you can have it, and you use monads where you can't.

I'm just saying that the same argument applies to databases: using a relational database allows you to separate your total code from your pure code. You use the value-based relational strategy where you can have it (necessary and sufficient for database management), and you use functional (or imperative) programming where you can't.

dagheti · on March 30, 2009

Ok so let's clarify this:

(SQL schema & upgrades + ORM classes + logic classes + random scripts) - (logic classes + random scripts) = SQL schema & upgrades + ORM classes

The question then becomes how do you handle the trade-off between being able to declare constraints (that evil schema) and the necessity to "map" how you access your data to your programming language.

If you "program" your constraints in your host language, you will need to either recreate the declarative system provided by an RDBMS or else create a system that will be fragile. Sure, your code protects you against inserting bad data, right? Well, what about updating? What about when you delete? What if your constraint references another value in your database? What if that one changes? Integrity is best declared, not guarded by your program.
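To make "declared" concrete, here is a small sketch using SQLite through Python's `sqlite3`. The schema, the sample rows, and the `rejected` helper are assumptions for illustration. The point is that one declaration covers the insert, update, and delete paths without any application code per path.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total INTEGER NOT NULL CHECK (total >= 0)
    );
    INSERT INTO customers VALUES (1, 'alice');
    INSERT INTO orders VALUES (10, 1, 250);
    """
)


def rejected(conn, sql):
    """Return True if the database refuses the statement on integrity grounds."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


results = {
    # Inserting an order for a customer that does not exist.
    "insert": rejected(conn, "INSERT INTO orders VALUES (11, 99, 10)"),
    # Re-pointing an existing order at a missing customer.
    "update": rejected(conn, "UPDATE orders SET customer_id = 99 WHERE id = 10"),
    # Deleting a customer that an order still references.
    "delete": rejected(conn, "DELETE FROM customers WHERE id = 1"),
    # Violating the CHECK constraint on the order total.
    "check": rejected(conn, "UPDATE orders SET total = -5 WHERE id = 10"),
}
valid_insert_rejected = rejected(conn, "INSERT INTO orders VALUES (12, 1, 5)")
print(results, valid_insert_rejected)
```

Guarding the same rules in host-language code would mean repeating the check in every code path that can write to either table.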

If your network model implements declarative constraints, then how do you define them? Those "random scripts" again? Suddenly all those performance benefits of navigation disappear when you're updating that data and checking those constraints.

I've seen many important production systems that used the network data model, and as a warning: it usually ends badly, be it COBOL or whatever fresh reinvention of the navigational database wheel.

jrockway · on March 31, 2009

> Integrity is best declared, not guarded by your program.

How do programs that don't use a database stay internally consistent?

dasil003 · on April 1, 2009

With much testing and reinventing of the wheel.

dasil003 · on March 31, 2009

I can definitely relate to the ugliness of ORM, and the elegance of object stores in an OOP environment. That's all fine and good, you can chop off the entire bottom of the stack, I get it.

My problem is data modeling with objects. I build my object architecture based on what I want to do with the data, but I've never been happy with it as a direct data model. It changes too quickly. Do I put things in an array or a hash? Well, it depends. Persisting these specific structures might be expeditious for the app at hand, but if I need to access the data in some other context then I could see complexity spiraling out of control very quickly.

A relational schema adds layers, yes, but having low-level structured data adds tremendous value. There's a parity mismatch to be sure, but I'd rather do those situps than have no structured data at all.

jrockway · on March 31, 2009

Edit: One other thing, representing polymorphic data in a relational database is a nightmare.

Edit2: Wait, how did this end up in a separate comment? I am confused :)