Mike Stonebraker: The "NoSQL" Discussion has Nothing to Do With SQL (acm.org)
51 points by neilc on Nov 5, 2009 | 35 comments



Guys, please check his Wikipedia page before starting some random rambling. This guy is a living legend and has several of the most prestigious awards. He also is/was an entrepreneur and large DB corp CTO. Oh, and a very early open source advocate and developer. Among many other things.

Don't say something stupid just because you like NoSQL. In fact, he was an early and strong critic of the one-size-fits-all commercial RDBMS solutions.

  [...] These papers presented reasons and experimental evidence
  that showed that the major RDBMS vendors can be outperformed
  by 1-2 orders of magnitude by specialized engines in the data
  warehouse, stream processing, text, and scientific database
  markets.
1-2 orders of magnitude, that is, roughly 10 to 100 times faster.

And in particular, if you didn't read at least 5 of his papers you can't pretend to have a clue about how database engines, and in particular the major RDBMS, work.

Thanks.

[A n00b DB engine writer, currently working on a non-SQL database]

PS: If I get to achieve 10% of what he did so far in his life I'd consider my life amazing. And he is still making new things. Respect!


I'm a bit shocked by your post. So because he is a living legend, one should not say anything before reading all the papers cited, even if his point is already clear from the article and one may have something meaningful to say even as a modest developer?

I'm going to say one for instance: the article's main point is about the fact that because of a threaded and disk-based architecture you can't get much faster than well engineered SQL architectures. For instance Redis is completely excluded from this reasoning, being in-memory and single-threaded.

Another point: where are the numbers? For instance I can trivially give proof that Redis can handle 150,000 operations/second per core on a decent Linux box. I want to know what's the performance of the SQL solutions cited in the article.

Another one: all the SQL solutions cited in the article are names I had never heard before today. NoSQL DBs are tar.gz files you can download for free and compile right now without paying anything.

Respect does not mean to shut up.


> the article's main point is about the fact that because of a threaded and disk-based architecture you can't get much faster than well engineered SQL architectures.

I disagree. His main point is that current SQL databases are slow not because they use SQL but because their implementations suck. He then lists 4 areas where the implementations spend most of the time and argues that all of them can be eliminated (how this should be done is explained in the linked paper).


Yep but I read in the article:

> Second, many No SQL systems are disk-based and retain a buffer pool as well as a multi-threaded architecture. This will leave intact two of the four sources of overhead above.

That's not always true.

For the other points, please give me the download link of this SQL database that can scale without problems (I want an opensource one since I'm a cheap startup).

Also scalable != fast; it only means that by adding more nodes I can scale, possibly almost linearly. But again, for cheap startups it also matters a lot how many boxes you are going to need, so it's important to have numbers about these superior SQL engines to do some math.

That said the article is interesting, as the idea is more or less that in theory it's possible to build SQL databases that are very scalable, and that our problems in the past were mostly about poor implementations. I guess this is true, but I can't imagine how an ACID SQL system is without overhead compared to a key-value store, even when the right technology is used for the implementation. The only fix for this is to show numbers.

Well also I don't think at all avoiding SQL is a bad idea, but the author wrote in this article he is going to show how the other NoSQL claim is false (the claim is that SQL sucks at modeling a lot of problems).

Anyway the author resembles a lot Adam from Battlestar Galactica and this is a good point.


> For the other points, please give me the download link of this SQL database that can scale without problems (I want an opensource one since I'm a cheap startup).

He didn't say that there are such systems, but that he expects them in the next years.

> so it's important to have numbers about these superior SQL engines to do some math.

In the paper he talks about speedups of 1-2 orders of magnitude compared to current OLTP systems.

> I can't imagine how an ACID SQL system is without overhead compared to a key-value store, even when the right technology is used for the implementation.

If he is right, then there should be no big difference (maybe 10-20%) between a SQL system and a key-value store, assuming that both offer the same ACID guarantees, because most of the complexity lies in the ACID guarantees and not in the data-storage itself.

> Well also I don't think at all avoiding SQL is a bad idea, but the author wrote in this article he is going to show how the other NoSQL claim is false (the claim is that SQL sucks at modeling a lot of problems).

He's going to do that in the next blog entry.


"I want to know what's the performance of the SQL solutions cited in the article."

Vertica will happily arrange for a demo, and they even have a fast track for getting a cluster up on EC2. With Teradata, Greenplum and the others, you're dealing with the traditional enterprise sales process (bleh).

"all the SQL solutions cited in the article are names I never heard before of today."

This is sort of the point. While they may only be represented in the enterprise market rather than in FOSS, there are linearly scalable database products available today. For businesses that have the funding, these can have huge advantages over the options being birthed by NoSQL (as well as drawbacks).

And the larger point is that the majority of the OSS world working on NoSQL (or Mysql Cluster for that matter) is somewhat ignorant of the parallel database research of the 80's. I'd suggest starting with the position paper "End of an Architectural Era", and then continuing with DeWitt's publications.


My summary: it's possible to jettison a lot of ACID requirements and implement sharding and still have SQL as the interface to your data.

My response: Sure, in theory, but for right now Redis/Tokyo Tyrant/etc actually exist and are free. They don't support a lot of what traditional databases consider required features, but the NoSQL movement is based around a recognition that many applications can sacrifice those in favor of performance and scalability.

Key/value stores are like C, barely disguised assembler that lets you shoot yourself in the foot, but is incredibly flexible. NoSQL is about acknowledging that sometimes that's the right tool for the job. This article doesn't address that choice at all, just handwaves about the wonderful technology that will handle all the problems automagically Real Soon Now. Point me at an SQL database I can use to handle my data workload of thousands of updates a second on a cheapie EC2 instance, that will be a discussion I can use.


There are also SQL based non-relational, non-acid, non (strictly speaking) database systems too, particularly for OLAP uses. Hive (and in the most recent versions/through a plugin, Pig too) provides SQL support on top of Hadoop.

While that's somewhat of a misplaced fit (SQL is based on relational algebra, why use relational algebra on a non-relational store?), it does provide a convenient interface for ad-hoc querying, particularly for non-engineers (e.g. business analysts).

Edit: addendum - Honestly, my own personal preference would be with either a language specifically designed for the non-relational store in question (e.g. Pig Latin) or a DSL in a general purpose programming language (e.g. Cascading or especially Clojure-based macros the Flightcaster team has created on top of Cascading).


> Sure, in theory, but for right now Redis/Tokyo Tyrant/etc actually exist and are free.

Yes, there's a question of whether we are talking about the right way to build systems, or about which system you ought to use in production tomorrow. Building a whole "movement" around the simple lack of some software is a little short-sighted, IMHO.

BTW, Stonebraker's VoltDB (www.voltdb.com) is a high-performance OLTP engine, and apparently an alpha release will be open-sourced shortly (a few months, I believe).

> the NoSQL movement is based around a recognition that many applications can sacrifice those in favor of performance and scalability.

You make it sound like the NoSQL people were the first to observe this. That is very far from the truth -- IMS, Codasyl, Berkeley DB, object-oriented DBs, etc. have all been around for a long time.


So, maybe you can clarify something for me.

Stonebraker cites Greenplum / Vertica / etc as examples of sql dbs that scale out. But all the ones he mentions are data warehouses that measure time from load to queriability in minutes. Not ms like the OLTP-focused nosql distributed systems.

And of course systems like RAC or pgcluster rely on a SAN, so that's not really playing the same game either.

Am I missing something? It feels to me like Stonebraker is cheating a little to make his point.


No, you're not missing anything. And I can say from experience that RAC doesn't even help. We are in the process of migrating from a RAC system to a NoSQL system. RAC doesn't scale either.


Could you give more detail about what you are doing?


Suffice to say that RAC has been nothing but trouble since we moved to it from DB2 for scale reasons. Rather than pay a consultant tons of money to help us get it stable after less than a year on it we are now migrating off to a non-relational solution. I can't give more detail than that really. Maybe later I'll be able to get permission to post a blog post about it.


If you are scaling without a single image (which RAC gives you) then that's not playing the game either. Sharding is just re-implementing parallel query and partition elimination which you get out of the box with the major RDBMSs, and sacrificing the ability to do joins across shards (which you CAN do with real RDBMSs).
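To make the "re-implementing parallel query and partition elimination" point concrete, here is a minimal sketch of hand-rolled sharding. Everything in it is an assumption for illustration: the shard count, the key format, and the helper names `shard_for`, `put`, `get` and `scan` come from no real product, and plain dicts stand in for separate nodes.

```python
# Hypothetical sketch of application-level hash sharding.
# Shard count, key names and helpers are illustrative assumptions.
import zlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # each dict stands in for one node


def shard_for(key: str) -> int:
    # Stable hash, so the same key always routes to the same shard.
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


def put(key: str, row: dict) -> None:
    shards[shard_for(key)][key] = row


def get(key: str):
    # Hand-written "partition elimination": only one shard is consulted.
    return shards[shard_for(key)].get(key)


def scan(predicate):
    # Hand-written "parallel query": every shard is visited and the
    # partial results are merged (sequentially here, for simplicity).
    results = []
    for shard in shards:
        results.extend(row for row in shard.values() if predicate(row))
    return results


for i in range(100):
    put(f"user:{i}", {"id": i, "country": "IT" if i % 2 else "US"})

one = get("user:42")                                  # touches a single shard
italians = scan(lambda row: row["country"] == "IT")   # touches all shards
```

A join between two sharded data sets would have to be written the same way, as application code that pulls rows from every shard and matches them itself, which is the cost the comment above is pointing at.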

Also OLTP without ACID isn't playing either.


Vertica's load performance is quite impressive actually. Teradata also has installations doing high volume OLTP, not just OLAP.


"You make it sound like the NoSQL people were the first to observe this. That is very far from the truth -- IMS, Codasyl, Berkeley DB, object-oriented DBs, etc. have all been around for a long time."

Yes. The whole "NoSQL movement" doesn't make any sense until you recognize it not as a brilliant technical development, but as a backlash against hordes of people who always answered "What is the best way to store X?" with loud shouts of "SQL! And you're a moron if you disagree!" without ever examining the nature of X.

(Yeah, I'm obviously exaggerating. But only a bit. Don't even pretend that there haven't been people running around and shouting this at every available opportunity.)

"NoSQL" is ultimately more about the observation that relational databases indeed are not the be-all, end-all answer to every problem ever. In a technical sense they're not even remotely new; what's new is the cracking of the SQL dogma in common perception, brought on by an increasing number of workloads that SQL databases just can't handle economically. (Which is to say, even if there exists a SQL DB and a DB server that may meet some need, SQL doesn't win if the server is actually more expensive than using a "NoSQL" solution.)


"SQL! And you're a moron if you disagree!"

No-one ever said this, because no-one who actually knows anything about databases would think you store anything with a query language! It makes as much sense as answering "what filesystem should we use?" with "C!".

> "NoSQL" is ultimately more about the observation that relational databases indeed are not the be-all

No, it's about thinking that MySQL is the be-all and end-all of relational database technology, whereas in feature terms it's 15-20 years behind the major players (indeed, only has basic functionality thanks to one of those players). When they come out with statements like "SQL doesn't scale" I look at RDBMSs storing >100T of data and/or handling >10,000 COMMITs/sec and I very quickly figure out who has actual experience and who doesn't.


Actually, there is a subtlety you've missed. We don't use relational databases. We use SQL, and endless variants thereof. I'd be a lot less grumpy about relational databases if there was a relational-but-not-SQL way of using them. Though this grumpiness has nothing to do with the "NoSQL" issues; I just hate being stuck on 1970s syntax.

(No, this is not a complaint about "sets". It is in fact the opposite, that it doesn't support the truly relational style of programming anywhere near well enough!)

"When they come out with statements like "SQL doesn't scale" I look at RDBMSs storing >100T of data and/or handling >10,000 COMMITs/sec and I very quickly figure out who has actual experience and who doesn't."

And you're gliding right over the question of what is being stored, how it is being accessed, on what timeframe it is being accessed, what is being done with it, and how much it costs vs. other options, which is something I rather made a point of pointing out. A $10 million solution to something that could have been solved with $0 and five hundred developer hours is not an acceptable solution. The fact that the expensive solution works isn't even relevant, really.

You know, all those pesky engineering details that SQL, sorry, RELATIONAL-uber-alles advocates always conveniently forget about. Nothing is free.


> We don't use relational databases. We use SQL, and endless variants thereof. I'd be a lot less grumpy about relational databases if there was a relational-but-not-SQL way of using them.

Have you tried LINQ?

> uber-alles advocates always conveniently forget about. Nothing is free

That is what I mean when I say sharding is just reimplementing features you get "for free" with a real database (in the sense that they're built in). There is also the cost of reinventing the wheel...


LINQ, so far as I know, backs to SQL, and thus inherits the inadequate relational aspects of SQL. The syntax is nicer, though, and also goes a long way towards demonstrating my point about the annoying inadequacies of SQL. Especially as I dig deeper into the functional world, the inability to meaningfully compose SQL statements is really getting annoying.
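As a rough illustration of what "composing" queries could look like, here is a small sketch in which each relational operator is an ordinary function from relation to relation, so query fragments can be named and chained. This is not LINQ and not any real library; `where`, `project`, `join` and `compose` are hypothetical helpers, and relations are just lists of dicts.

```python
# Hypothetical sketch: relational operators as composable functions.
# Relations are lists of dicts; all helper names are assumptions.

def where(rel, pred):
    # Selection: keep the rows that satisfy the predicate.
    return [row for row in rel if pred(row)]


def project(rel, *cols):
    # Projection: keep only the named columns.
    return [{c: row[c] for c in cols} for row in rel]


def join(left, right, on):
    # Naive nested-loop equi-join on one shared column.
    return [{**l, **r} for l in left for r in right if l[on] == r[on]]


def compose(*steps):
    # Chain relation -> relation functions into a single reusable query.
    def run(rel):
        for step in steps:
            rel = step(rel)
        return rel
    return run


users = [{"user_id": 1, "name": "ann"}, {"user_id": 2, "name": "bob"}]
orders = [
    {"user_id": 1, "total": 30},
    {"user_id": 1, "total": 5},
    {"user_id": 2, "total": 12},
]


def big_orders(rel):
    return where(rel, lambda row: row["total"] >= 10)


def with_names(rel):
    return join(rel, users, on="user_id")


def names_only(rel):
    return project(rel, "name", "total")


# Each fragment is reusable on its own, and the whole is just their chain.
report = compose(big_orders, with_names, names_only)
result = report(orders)
```

The point of the sketch is only that `big_orders` or `with_names` can be reused in other queries without string concatenation; whether a given SQL dialect lets you get close to this with views or CTEs is a separate question.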

I am in the weird position of being both a relational snob and thinking that relational databases aren't the solution to everything.


Actually many NoSQL DBs are not just about relaxing ACID requirements, but also about limiting functionality for easier scaling via DHT. They only support key/value set/get and not range scan, which is needed to implement SQL semantics.
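A toy sketch of why set/get is cheap on a hash-partitioned store while a range scan is not: hashing scatters adjacent keys across nodes, so a single-key lookup contacts one node but an emulated range scan has to ask all of them. `ToyDHT`, the node count and the `nodes_contacted` counter are assumptions made up for this example, not any real system's API.

```python
# Hypothetical sketch of a hash-partitioned key/value store.
# ToyDHT and its counters are illustrative assumptions only.
import zlib

NODES = 4


def node_for(key: str) -> int:
    # Hash partitioning: key order is deliberately not preserved.
    return zlib.crc32(key.encode("utf-8")) % NODES


class ToyDHT:
    def __init__(self):
        self.nodes = [dict() for _ in range(NODES)]
        self.nodes_contacted = 0

    def set(self, key, value):
        self.nodes[node_for(key)][key] = value

    def get(self, key):
        # One key, one node.
        self.nodes_contacted += 1
        return self.nodes[node_for(key)].get(key)

    def range_scan(self, lo, hi):
        # Not a native operation here: emulating it means asking every
        # node, filtering locally, and sorting the merged result.
        hits = []
        for node in self.nodes:
            self.nodes_contacted += 1
            hits.extend((k, v) for k, v in node.items() if lo <= k < hi)
        return sorted(hits)


dht = ToyDHT()
for i in range(20):
    dht.set(f"k{i:02d}", i)

dht.nodes_contacted = 0
value = dht.get("k07")
get_cost = dht.nodes_contacted       # nodes contacted for a point lookup

dht.nodes_contacted = 0
rows = dht.range_scan("k05", "k08")
scan_cost = dht.nodes_contacted      # nodes contacted for the range scan
```

With more nodes the point lookup still contacts one node, while the emulated scan grows with the cluster size.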

I think the best interpretation for NoSQL so far is "Not only SQL", which has interface implication as well.


What's DHT?


Distributed Hash Table.


The article starts with a description/definition of 'key-value stores' vs 'document stores'... but doesn't clarify the difference, if any.

When people say 'NoSQL', that doesn't preclude another kind of query language...

I assume NoSQL to mean, roughly:

- no traditional ANSI RDBMS SQL
- key-value store
- graph-like / tree-like data as first-class citizen
- map/reduce style operations
- bundled with a dynamic lisp/python/ruby/javascript style general programming language

You'd have to agree that 'NoSQL' is a movement - ie. many people vocalizing their shared emotional frustration at the limits of SQL. Not theoretical limits, but practical ease of use and subjective syntactic inelegance.

I personally identify with the movement, after having built traditional sql systems, preached data normalization etc., then slowly realising that this RDB approach is just really unwieldy and brittle.

Moving from XML to JSON, from C++ to lisp languages.. this makes the ugly syntax of SQL really stand out : as blatantly as COBOL or FORTRAN do to someone who has read K&R.

Maybe we haven't found the right replacement... but we still know there has to be a better way.

The term 'NoSQL' is subjective, undefined, fuzzy - but it's a label for a real problem that needs solving, and it groups together partial solutions.


Indeed Stonebraker is a living legend. Entrepreneurs may find it interesting that he is currently pitching VCs a commercial venture to augment his latest DB project, http://scidb.org/ (much like Cloudant is to Couch)


I also found this:

"VoltDB is an independent, Boston area-based company co-founded by DBMS R&D pioneer Mike Stonebraker and startup veteran Andy Palmer, working in stealth mode on the next-generation of OLTP DBMS."

So it's a biased legend. Honestly, if one is investing a lot in something like VoltDB, this buzz about NoSQL DBs, and especially the fact that they are starting to be actually used to get work done, can be a problem.


I believe this is the commercialization of H-Store, which has been featured here on HN before.


The title is correct; the article is confused and somewhat pointless. NoSQL is about the systems, not the query language, but it's not about RDBMS minus atomicity and consistency. It's about saying "this is not SQL, don't compare it to the RDBMS that you are used to querying with SQL".

Anyone talking about how you can implement SQL on top of NoSQL systems is missing the point. KV and Document stores are a different way of storing data.

SQL: express what you want to happen using relational notation, and it will magically happen. There are no real-world performance issues.

NoSQL: certain operations are fast, other operations are slow. This fact is reflected in the API.


The article isn't confused - he's using the term SQL in the same way you are (in your first paragraph, anyway) - he's referring to RDBMSs. His point is that many of the performance benefits of a KV/document-store based db can be achieved in an RDBMS without throwing away time-tested ideas like ACID guarantees.

I was confused by your post, however. You initially state that NoSQL is about the system/approach to storing data (and not the query language) but then proceed to contrast querying APIs.


> SQL: express what you want to happen using relational notation, and it will magically happen. There are no real-world performance issues.

I don't think that's true. Consider the CREATE INDEX statement: it's part of the API and it is intended for improving performance. True, an RDBMS will execute a query even if the programmer didn't issue CREATE INDEX beforehand, but this reflects the rigid separation ('data independence') of the logical and physical data model more than a lack of control (API exposure) over performance details.
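A small sketch of that separation, using Python's built-in sqlite3 module as a stand-in for an RDBMS (the table, column and index names are made up for the example). The same query returns the same rows before and after CREATE INDEX; only the plan the engine picks changes.

```python
# Sketch only: SQLite stands in for "an RDBMS"; the schema and the
# index name idx_users_email are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)

QUERY = "SELECT name FROM users WHERE email = ?"
ARGS = ("user500@example.com",)


def plan(sql, args):
    # The last column of each EXPLAIN QUERY PLAN row is the human-readable
    # detail; its exact wording varies between SQLite versions.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, args).fetchall()
    return " | ".join(row[-1] for row in rows)


result_before = conn.execute(QUERY, ARGS).fetchall()
plan_before = plan(QUERY, ARGS)   # a full scan of the table

# Physical tuning only: the logical query text above is left untouched.
conn.execute("CREATE INDEX idx_users_email ON users (email)")

result_after = conn.execute(QUERY, ARGS).fetchall()
plan_after = plan(QUERY, ARGS)    # now mentions idx_users_email

print(plan_before)
print(plan_after)
```

So the performance knob is exposed, but as a separate statement about the physical layout rather than as a different way of writing the query.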


I am much more interested in the upcoming post.

Even though I have not used the products mentioned, I have seen that RDBMSes can be configured and optimized, but usually it costs time or money, plus it can get quite complex. This complexity wouldn't be so bad if it allowed you to make the rest of your application simpler. Sadly, that usually isn't the case. You end up constantly dealing with these two different worlds in your application, and ORMs also bring their share of complexity when you do more sophisticated or customized things. And I am not even saying anything about when you need to change things.

I think one of the nice things about many of the "noSQL" solutions is that they keep things simple and under your control. You still need to do the complex stuff yourself, but it's never simple anyway. I am sure they don't bring a solution to everything, but it certainly is nice to see the area of databases moving again after many years of status quo. And the author is certainly one of the people pushing the field.


Apparently, as a supporter of the NoSQL movement, I would just like to state a small 'analogy'...

Imagine if someone from amongst our early ancestors hadn't thought of creating the wheel, where would we be?

And so it is with NoSQL - it is just an alternative way of looking at a problem, which in this case is that of databases, and it looks at it in terms of a non-relational model. Therefore, I would like to urge all who are 'hostile' towards it - just let us do what we want. If the results are not to your liking, what can we do?

I personally believe that by pursuing the principles on which NoSQL is based we will manage to achieve something new that may perhaps replace SQL.

Additionally, please do also give thought to the fact that Google's Bigtable, Amazon's Dynamo and Cassandra can be thought of as part of the NoSQL movement. Rebels are at first turned away from society in general - but later on society realises that the rebels had the correct idea. Change is always essential to progress, without which I would not have been typing this.

On another note (sorry for the detour), I am getting rather interested in Gopher these days. Anybody interested in Gopher?

[ I apologise for the non-tech response.]

[ Null Expulsion - nullexpulsion [at] gmail [dot] com - how long will it be until spammers figure out how to detour around this? ]


I think it's only a matter of time before this NoSQL movement morphs into a SPARQL movement.

http://en.wikipedia.org/wiki/SPARQL


Please refer to this independent and unbiased study that bolsters my point, a study which I authored.


I see nothing wrong in citing your own work. He does make a disclaimer at the end that he is involved in several RDBMS technology startups.



