Performance is always a trade-off; you rarely get something for nothing. These o...

ubernostrum · on July 2, 2009

"I wonder if we just have a generation of programmers who can't think in SQL."

Maybe we have. I'm not so sure.

What I am sure of is that we've had multiple generations of developers who've seen that real-world data almost never fit into a uniform, enforceable schema, and so are happy to have solutions which accept that reality instead of demanding that developers and DBAs try to shoehorn the data into uniformity anyway.

gaius · on July 2, 2009

I wonder if we just have a generation of programmers who can't think in SQL.

I think that's probably true. SQL requires thinking declaratively, not imperatively, and in sets, not objects. Kids these days don't understand the difference between a table definition and a class and the difference between a row and an object. The declarative/functional style isn't really taught anymore; everyone just wants to learn Java and get a job.

If anything, ORMs are less "scalable" than SQL because under the hood the system is based on SQL, and in an effort to be completely generic, the ORM generates really bad SQL.

rmaccloy · on July 2, 2009

Indeed, "kids" (by which I assume you mean "assembly-line CS grads") these days are pretty much using Hibernate, ActiveRecord or (I guess?) LINQ. This is a good thing: The O/R mismatch exists. For good programmers ORMs are more productive in most cases (and you can write raw SQL for the exceptions); bad programmers have never gotten the relational model (and there's plenty of legacy code to prove it.)

However, the discussion about "kids" is a total red herring, because I'm pretty sure all of the people involved in talking up Tokyo Cabinet, memcachedb, Couch etc on one end and Cassandra, Hypertable or HTable on the other are fully aware of the difference between a class and a table. The truly clueless don't even know the discussion is happening.

The object/document database revival circa 2009 is about three things:

A. People are already using object-based access for almost everything, so if they're not using relational features it's trivial to drop in an object store and get better performance/less overhead.

B. On the low-end, sometimes a relational database is too much overhead -- if not in performance, in administration. (SQLite is cheap, sure, but sometimes you want just a disk-backed hash table.)

C. On the high-volume data-analytics end, RDBMSs don't perform well enough unless you shell out serious cash for Oracle or Greenplum, Aster Data, Teradata, etc., and even then they still can't handle the volumes MapReduce and Bigtable-alikes can.
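
To make the "just a disk-backed hash table" case in point B concrete, here is a minimal sketch using Python's standard-library `shelve` module. The file name, keys, and values are made up for illustration, and this is not a claim about how Tokyo Cabinet or any other store named in this thread works internally.

```python
import os
import shelve
import tempfile

def demo_disk_backed_hash(path):
    # First "session": write a couple of keys, then close the store.
    with shelve.open(path) as store:
        store["user:1"] = {"name": "alice", "karma": 42}
        store["user:2"] = {"name": "bob", "karma": 7}

    # Second "session": reopen the same file and read back by key.
    with shelve.open(path) as store:
        return store["user:1"]["karma"], sorted(store.keys())

# Hypothetical location: a fresh temp directory, so nothing existing is touched.
result = demo_disk_backed_hash(os.path.join(tempfile.mkdtemp(), "kv_demo"))
```

There is no schema, server, or account to set up, which is the administrative saving being described. The cost is that the only access path is the key.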

wvenable · on July 2, 2009

A. That is fine until the first time somebody has to write a report.

B. With the right product, administration is easy. MySQL is completely ubiquitous and something like SQL Server is easy. This certainly conflicts with point C.

C. Most of these people aren't doing high-volume anything. If you are doing high-volume stuff (like Google) then of course it makes sense to use something very specific to your task. But that doesn't mean Bigtable, for example, makes for a good generic solution.

rmaccloy · on July 2, 2009

If you anticipate having to write a report, obviously you should use the right tool. If you don't, use what's simple and gets the job done. It's not incredibly hard to switch later.

On the low-end, administration for MySQL is certainly not easy; for an app you build in a day or two on Rails or Django or Sinatra (or as a CLI tool for that matter) you can easily spend more time doing sysadmin work to provision, configure and maintain MySQL than you do writing software. (MySQL shared hosting isn't everywhere.) There are plenty of throwaway webapps out there that just aren't worth setting up a database server for.

On the high end, it's not a matter of what's easy; it's a matter of what's possible.

These are different use cases, and there are different software packages being advocated for them. No one who needs a disk-backed hash table is using Hypertable, and no one who needs massively parallel analytical capability is using Tokyo Cabinet. And no one's advocating reconsidering RDBMS use as a whole; just RDBMS use as the default choice.

rjurney · on July 2, 2009

I think the key insight in all this is that the same database you store your objects in... shouldn't be the same one you write reports from. You can have two. One object store for your application, and another analytic DB for your reports.

The complaint about SQL is that it is optimized for report writing, not application development. Fine, so split that out.

Which is happening. Which is working.

div · on July 2, 2009

Do you happen to know any articles on people splitting their data across 2 types of storage this way?

If there is any data that should be stored in 2 separate stores, I can see applications becoming messy rather fast in trying to replicate changes.

Unless of course this is usually done by having a clean separation between which data goes where, but I doubt if any domain can ever really be that easily split up.

rjurney · on July 2, 2009

Well, for starters - most companies don't run reporting queries on their production SQL database. They mirror, summarize, partition, index and cube a separate reporting DB, so that big/mean queries don't cause massive latency on their site/product/production system. Which isn't quite the same as using a different data-store altogether, but some kind of split is commonplace. Which means that some kind of difference between queries/applications in reports/production is already common. The key here, though, is that setting up an RDBMS that can handle analytics on even a moderately large data-set is a major task, can be complex/pricey, tends to use big iron, and only scales so far before it gets very, very expensive.

But, yes there are examples of what I just described. In practice, in many problem domains, most data of interest for reports does not change once it is written, so syncing up is not a major issue.

Streamy is a good example, I think. They use HBase for the front end, and run MapReduce jobs on the back end. http://wiki.apache.org/hadoop-data/attachments/HBase(2f)HBas... Another presentation is here: http://www.docstoc.com/docs/2996433/Hadoop-and-HBase-vs-RDBM... That is Hadoop and Hadoop, which is nice - but HBase is optimized for the front end and is fundamentally different than typical batch operation of Hadoop.

CouchDB sort of takes this approach, albeit with key/value and pre-defined and materialized map/reduce views on the same store. I think this dichotomy will become increasingly common, and will be less cumbersome than it currently is as the tools mature.

Key/Value for the front end and Map/Reduce on the back end makes a lot of sense for a lot of problems, since key/value is how many applications actually work. There is the added benefit that systems like these scale linearly on commodity hardware using FOSS, which can make them cost-effective and much simpler than scaling a traditional RDBMS as an analytic data-store. The upside to this is too good for these systems not to win a big chunk of the market. And you can have your SQL - albeit on top of MapReduce - in reports, where it belongs :)
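
As a toy illustration of that split, the sketch below keeps records in a plain dictionary standing in for the key/value front end, and runs a map step and a reduce step over all values as the back-end report. The names `kv_store`, `map_phase` and `reduce_phase` and the sample orders are hypothetical; real systems such as HBase and Hadoop distribute both halves across machines, which this single-process sketch does not attempt.

```python
from collections import defaultdict

# Hypothetical front-end store: one value per key, read and written by key only.
kv_store = {
    "order:1": {"product": "widget", "qty": 3},
    "order:2": {"product": "gadget", "qty": 1},
    "order:3": {"product": "widget", "qty": 2},
}

def map_phase(records):
    # Emit (key, value) pairs; here, the quantity sold per product.
    for record in records:
        yield record["product"], record["qty"]

def reduce_phase(pairs):
    # Group the emitted pairs by key and sum the values in each group.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Front end: a single lookup by key.
order = kv_store["order:2"]

# Back end: a batch pass over every stored value.
report = reduce_phase(map_phase(kv_store.values()))
```

The application path touches a single key, while the report scans everything. That difference in access pattern is why the two halves can be served by different systems.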

jrockway · on July 2, 2009

in an effort to be completely generic, the ORM generates really bad SQL.

Maybe your hand-rolled ORM does, but most off-the-shelf ORMs generate excellent SQL.

gaius · on July 2, 2009

It's a running joke where I work how bad Hibernate's SQL is compared to hand-written by experienced developers. Maybe it's "good enough" for some applications, but we don't really get out of bed for less than 5000 transactions/second...

jrockway · on July 2, 2009

There are potentially two problems:

Your database's query optimizer sucks, or

The overhead is of object inflation, not a slow SQL query.

gaius · on July 2, 2009

Hibernate generates optimal SQL in all situations for all databases and their different dialects? Really?

jrockway · on July 2, 2009

I'm just saying that the optimizer sucks if semantically identical queries don't execute the same instructions against the database.

I would certainly not be surprised if this happens in real life... but the solution is not to hand-code every SQL query, it's to fix the database.

gaius · on July 2, 2009

You speak as if writing SQL was some unpleasant or arduous task, whereas in reality it's just a DSL for data. In most cases an ORM is just another layer of complexity.

gnaritas · on July 2, 2009

> In most cases an ORM is just another layer of complexity.

That, unlike SQL, drastically reduces the amount of code the programmer is forced to write for the vast majority of applications that programmers write.

SQL can't do the one thing most programs actually need: give them the ability to select a starting point, and then navigate the conceptual graph of data as the user moves around the application. ORMs provide that, object databases provide that, SQL doesn't.

wvenable · on July 3, 2009

You're right, and that's what's great about ORMs! But if you're asking a question of your data, like "how many widgets did I sell today?" then navigating a conceptual graph of data is the slowest, most convoluted way of getting at that information.
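
A small sketch of that contrast, using Python's standard-library `sqlite3` with an in-memory database. The `orders` and `order_items` tables and their contents are invented for the example. The first function imitates graph navigation by issuing plain queries per order rather than using a real ORM, so it only approximates what an ORM would do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, day TEXT);
    CREATE TABLE order_items (order_id INTEGER, product TEXT, qty INTEGER);
    INSERT INTO orders VALUES (1, '2009-07-02'), (2, '2009-07-02'), (3, '2009-07-01');
    INSERT INTO order_items VALUES
        (1, 'widget', 3), (1, 'gadget', 1), (2, 'widget', 2), (3, 'widget', 9);
""")

def widgets_by_navigation(day):
    # Navigation style: load each order for the day, then walk to its items.
    total = 0
    orders = conn.execute("SELECT id FROM orders WHERE day = ?", (day,)).fetchall()
    for (order_id,) in orders:
        items = conn.execute(
            "SELECT product, qty FROM order_items WHERE order_id = ?", (order_id,)
        ).fetchall()
        total += sum(qty for product, qty in items if product == "widget")
    return total

def widgets_by_query(day):
    # One declarative statement: the database performs the join and the sum.
    row = conn.execute(
        """SELECT COALESCE(SUM(i.qty), 0)
           FROM orders o JOIN order_items i ON i.order_id = o.id
           WHERE o.day = ? AND i.product = 'widget'""",
        (day,),
    ).fetchone()
    return row[0]

sold_nav = widgets_by_navigation("2009-07-02")
sold_sql = widgets_by_query("2009-07-02")
```

Both functions return the same number. The navigation version issues one query per order and pulls every item into the application, while the aggregate version returns a single row.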

fhars · on July 2, 2009

"I wonder if we just have a generation of programmers who can't think in SQL."

And they don't even teach set theory in elementary school anymore like they did in the early seventies, which may be among the root causes of the malaise. Understanding things like relational databases and conditional probabilities is so much easier if you have been taught a solid foundation from early on.

Nelson69 · on July 2, 2009

> I wonder if we just have a generation of programmers who can't think in SQL. They're used to using ORMs (which are useful abstractions) but can't work at a lower level. Sending a query over the line to get exactly the results you want, and no more, and have it optimized and run entirely on server is pretty damn efficient.

Yeah, it's really obvious if you ever work with a really really senior DBA. I don't think it's the queries though, it's the modeling. I think a lot of software teams model their database sort of like they model data structures and then use various ORM tools to be the glue. DBAs tend to model the data in ways to make it most acceptable to the database and minimize the loss of any information, it's almost always more complex. The database has become an object persistence engine in a lot of cases that has some relational properties that may or may not be used. Instead of writing and managing files in the filesystem, you shove stuff into a database.

Just for starters, your ORM will model an object for a row of data, how many times do you get the whole row when you're really interested in just a column or two? Does your ORM let you just specify the parts of the row you're interested in or does it hydrate an object and populate all of the columns? (Those extra columns being copied does add up...) ORMs are a religious war, as a software engineer it's a really beautiful idea, in reality I've never seen one that really works well with the database, they're too softwarey.
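
The whole-row versus one-column question can be sketched at the SQL level with Python's standard-library `sqlite3`. The `users` table and its oversized `bio` column are hypothetical, chosen only to make the extra copying visible. Whether a given ORM emits the first or the second form depends on the ORM and how it is called.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT, bio TEXT)"
)
conn.execute(
    "INSERT INTO users VALUES (1, 'alice', 'alice@example.com', ?)",
    ("x" * 10000,),
)

# Full hydration: every column comes back, including the large bio.
full_row = conn.execute("SELECT * FROM users WHERE id = 1").fetchone()

# Projection: only the one column of interest is requested.
name_only = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()
```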

The other thing that seems to be rampant is the traditional 3 tier application model has kind of collapsed. It's not the case everywhere but I've seen it at more than a couple places where there is a persistence tier and then kind of a combined presentation/business tier. With a more traditional data model the business layer is absolutely critical and shows a lot of value, you might have to glue some more complicated queries together into objects inside a transaction rather than just hydrating a row from a table. When you use a database as a store for your data structures a business layer just doesn't seem to be as valuable.

If you just want keyed data storage, a relational database does become the wrong tool. I would think, and this might just be parochial thinking, that at some point keyed data storage would essentially reinvent the relational database as the problems grow in complexity.

trezor · on July 2, 2009

Just for starters, your ORM will model an object for a row of data, how many times do you get the whole row when you're really interested in just a column or two? Does your ORM let you just specify the parts of the row you're interested in or does it hydrate an object and populate all of the columns?

I don't know about (n)hibernate or other ORMs, but LINQ to SQL lets you specify exactly what you want, and when you inspect the SQL generated, it is usually quite efficient, although not 100% optimal for complex queries.

There is also the issue with nested objects (relations) and if they should be prefetched or not (to avoid in effect nested-loop type SQL), but that is perfectly controllable.
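
The prefetch question can be illustrated without any ORM. The sketch below uses Python's standard-library `sqlite3` and a hypothetical `run` wrapper that counts round trips. The `authors` and `posts` tables are invented. It shows the query pattern only, not how LINQ to SQL or any other ORM exposes the choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'ann'), (2, 'bo');
    INSERT INTO posts VALUES (1, 1, 'a1'), (2, 1, 'a2'), (3, 2, 'b1');
""")

query_count = 0

def run(sql, params=()):
    # Thin wrapper that counts how many statements are sent to the database.
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()

def lazy_titles():
    # One query for the parents, then one more per parent (the nested-loop pattern).
    result = {}
    for author_id, name in run("SELECT id, name FROM authors ORDER BY id"):
        rows = run(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id", (author_id,)
        )
        result[name] = [title for (title,) in rows]
    return result

def prefetched_titles():
    # A single join fetches parents and children together.
    # Note: an inner join omits authors with no posts; a LEFT JOIN would keep them.
    result = {}
    rows = run(
        "SELECT a.name, p.title FROM authors a "
        "JOIN posts p ON p.author_id = a.id ORDER BY a.id, p.id"
    )
    for name, title in rows:
        result.setdefault(name, []).append(title)
    return result

query_count = 0
lazy = lazy_titles()
lazy_queries = query_count

query_count = 0
eager = prefetched_titles()
eager_queries = query_count
```

With two authors the lazy version issues three statements and the join issues one. The gap grows with the number of parent rows.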

My biggest issue with ORM layers is transaction handling when you are doing some things in the DB, some things with data from this query and some with data from this other query. This can quickly promote what would in the DB be a simple transaction to a distributed transaction.

When used correctly ORMs are very nice tools indeed, but to use them efficiently, you have to know how they work and how the underlying DB they interface with works as well.

johnnybgoode · on July 2, 2009

> I wonder if we just have a generation of programmers who can't think in SQL. They're used to using ORMs (which are useful abstractions) but can't work at a lower level.

There is probably some truth to this, but another reason is that they want to avoid certain scaling hassles or costs.

Edit: I'm not saying this is necessarily a good thing.

trezor · on July 2, 2009

I fail to see how using a proven solution like relational databases involves hassles with scaling.