The NoSQL movement (oreilly.com)
37 points by tux1968 on Feb 20, 2012 | 30 comments



The article is, IMO, total crap.

There are advantages to NoSQL in some cases. I am even happy to write about them. However, they are not really the advantages listed here.

The first problem is that SQL databases can in fact be distributed in some cases. Look at Postgres-XC for an example of write-scalable distributed shards that maintain a consistent, relational model.

The second is the focus on analytics. NoSQL analytics is very much a problematic area. Analysis typically requires very intensive searches over the entire data set, and NoSQL databases are not optimized for this. Consequently, analytical data tends to be slow to generate ad hoc, so instead it is built up and maintained as data is entered, rather than derived on demand from existing data (using the entered data as a single point of truth). This leads to a lack of flexibility, even though prepared reports load quickly. It isn't clear to me how different this is from summary tables maintained with triggers.
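For concreteness, here's roughly what I mean by a summary table maintained with triggers, sketched in PostgreSQL (the page_views/page_view_counts names are made up for illustration, and I'm ignoring the concurrent-insert race for brevity):

    -- Raw rows remain the single point of truth.
    CREATE TABLE page_views (page_id integer NOT NULL, viewed_at timestamptz NOT NULL);

    -- Pre-aggregated summary, kept current as data is entered.
    CREATE TABLE page_view_counts (page_id integer PRIMARY KEY, views bigint NOT NULL);

    CREATE FUNCTION bump_view_count() RETURNS trigger AS $$
    BEGIN
        UPDATE page_view_counts SET views = views + 1 WHERE page_id = NEW.page_id;
        IF NOT FOUND THEN
            INSERT INTO page_view_counts (page_id, views) VALUES (NEW.page_id, 1);
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER maintain_view_counts
        AFTER INSERT ON page_views
        FOR EACH ROW EXECUTE PROCEDURE bump_view_count();

You get the same trade-off the article attributes to NoSQL: canned reports are fast, but anything you didn't pre-aggregate still means scanning the raw table.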

Now NoSQL has some advantages:

1) Where you don't need ad hoc analytics, where you have highly defined functional requirements, and where you have well defined network protocols for interop, development is often faster, and performance is better.

2) I think it could be very interesting as a network-transparent back-plane, if you will, for various kinds of network services.


I can't even count how many mistakes this article makes. A random sampling:

- Relational databases were designed for a world where availability is unimportant, like transaction processing. Um, no. OLTP has four letters, not two, and the first are as important as the last. Tandem was providing five-9's systems in the '80s and '90s for ATM networks, lottery systems, airline reservations, credit cards, etc. Tandem is a fault-tolerant relational database in hardware.

- Up-front schema design is a poor fit in a world where data requirements are fluid: I don't think "up front" means what you think it means. You can change schemas on the fly nowadays; your schema design is no more up-front than your coding is.

- You can't have millions of columns in a relational database: True, and you wouldn't; you'd normalize that. This is an important difference, but it's no more a disadvantage of relational databases than "In a relational database, you'd join URLs with IP addresses, and maybe five other tables; this design isn't even conceivable in a NoSQL database" is a disadvantage of NoSQL databases.

- To optimize relational performance, you "do away with joins wherever possible": 1995 called, and it wants its MyISAM back.

- Two-phase commit is so obsolete, even banks don't use it: Of course they do. You still need a two-phase commit to make sure the other end got your data; whether "got your data" happens in the customer path or during reconciliation is a design decision. How, exactly, do you think they discover that you and your spouse both got the money?

- "relational databases were developed when distributed systems were rare and exotic at best." That's nothing; when Von Neumann machines were developed, we didn't even have transistors. Some legacy architectures keep on working.

- "absolute consistency isn't a hard requirement for banks": see above. Yes it is.

- "So the CAP theorem is historically irrelevant to relational databases: they're good at providing consistency, and they have been adapted to provide high availability with some success, but they are hard to partition without extreme effort or extreme cost." ... Wh... Bu... That's not even wrong.

- "consistency requirements of many social applications are very soft." I like the Facebook example from a recent article on causal consistency: I de-friend my boss and then post that I'm quitting. Certain kinds of consistency are in fact critical to social applications.

There are many good reasons to design around a NoSQL database instead of a relational one. This article provides fewer than zero of them.


I agree with you - this article does a very poor job of communicating the advantages of 'NoSQL' databases over traditional solutions. It spends too many words setting up a straw man (SQL == not partitioned, not highly available, not redundant, etc.) and too few actually making its case.

Lines like this are, at best, a gross oversimplification of relational database performance tuning: "But when you need to optimize performance, you look at the queries you actually perform, then merge tables to create longer rows, and do away with joins wherever possible."

And this: "We require sub-second responses to queries."

Really? Single big-box OLTP systems have many issues, but if you are getting query times greater than a second on a typical online database then you are either doing something wrong or have specific requirements. Neither of those things will be fixed by blindly picking 'NoSQL' over 'SQL'.

"any significant database needs to be distributed." I could easily list, off the top of my head, 100 significant databases that aren't distributed. Maybe some of them should be, but saying that "any significant database needs to be distributed" is a real stretch.

> There are many good reasons to design around a NoSQL database instead of a relational one. This article provides fewer than zero of them.

Yeah, that's the thing. An article a quarter of the length, with more research, less straw-man bashing, and less breathlessness, could have made the case for NoSQL much more effectively.


I also noticed the emphasis on analytics, which is actually very much a weak point of NoSQL. You can't do ad-hoc analytics in a NoSQL database without performance at least an order of magnitude worse than you'd get in a relational system. NoSQL only works for analytics when you know all the questions you want to ask ahead of time.


"absolute consistency isn't a hard requirement for banks": see above. Yes it is.

This one gave me a laugh. Thanks for pointing it out.

Let me see.... Your business does nothing but manage money, and you don't need absolute confidence about where that money is at any given point in time? Right..... In fact, accounting systems (and by extension ERP systems) are about the LAST place you'd want to use anything other than a relational database system.


Most banks use an eventually consistent message oriented architecture, not a transactional one.

At a low level banks use both absolute consistency (for physical transaction stores) and eventual consistency (for logical transaction implementations). The logical transaction implementation abstracts inter-bank and physical payment messaging.

It would be impossible to have absolute consistency in the logical transaction layer as transaction scopes would have to be open (i.e. locked) for days at a time in some cases. That simply doesn't scale.

Ultimately, banks have millions if not billions of pounds floating around outside traditional transactional stores all the time.


I agree that ATMs are a bad example of this. The issue there, though, is loose coupling of third-party financial networks, as opposed to a general relaxation of consistency for one's own financial records.

But loose coupling between third party payment networks (say debit card purchases over Cirrus) and the bank is not really the same problem as using Cassandra at Facebook.


Re-reading the article I see what the author is saying, namely that some lag between the ATM network and the bank's accounting system is permissible. However, you still need to have a single point of truth that is absolutely consistent. If you don't, the bank's accountants (and probably federal regulators too!) will be rather unhappy! So the author takes a reasonable observation and twists it beyond recognition......


>"absolute consistency isn't a hard requirement for banks": see above. Yes it is.

There's a paragraph in the (long) article about locally absolute / globally eventual consistency.

>"but they are hard to partition without extreme effort or extreme cost." ... Wh... Bu... That's not even wrong.

I don't know about Tandem, but another comment mentioned Postgres-XC so I checked it out. According to the wiki it's not globally consistent.[1] Care to elaborate on what isn't wrong, or name any (relational) alternatives to Postgres-XC?

"We need some research work for solutions on following issues.: Global constraint. Can we enforce unique or other constraint exclusion globally in multiple data nodes?" http://wiki.postgresql.org/wiki/Postgres-XC


I'm of two minds about consistency vs banks. On the one hand, yes, in the CAP sense, "consistency" is atomic consistency, and banks allow a form of eventual consistency. On the other hand, I'm not sure if the NoSQL sense of eventual consistency is strong enough for what banks need; how do NoSQL eventual-consistency models deal with conflicts that affect other tables? For instance, a quick Google shows that MongoDB offers "programmatic merge" - but would that also allow you to say "when merging my spouse's ATM transaction with mine in our checking account, also remove the spurious transaction in the bank's cash-on-hand account"?

Maybe I'm thinking too close to the metal; to get eventual consistency at a higher level, you still need atomic consistency at a lower level - even Paxos uses two-phase commit under the covers.

As for the partitioning, what "isn't wrong" is:

- He says the CAP theorem is historically irrelevant to relational databases, which is wrong, and that they have been "adapted" to provide high availability, which is kinda wrong, and then:

- He confuses "partition" in the "partition my database" sense (scale my database across multiple tables or nodes) with "partition" in the original CAP sense (do not permanently explode if a node goes offline), so he isn't even wrong - any multi-server database allows partitioning, by definition.

Also, I think you're misunderstanding that item on the wiki page; it sounds like Postgres-XC IS globally consistent, but that it can't yet support some important forms of consistency like UNIQUE constraints across nodes (you can only have a unique constraint within the same node). All nodes would see the duplicate rows, though, so it is globally consistent.


- availability means access to the data, not just uptime

- I have never seen changing schemas on the fly in an RDBMS programmatically (except admin functions). I have seen overloading of column types, however.

- "super column" data stores have a deistinct design advantage that you can not get from relational DBS.

- everyone that is using relational does the same thing when they get big data: denormalize. From materialized views all the way to sharding and replication.

- banks use transactions, but not as a two-phase commit. It is a single atomic record - not partial updates.

- we still have some species from the age of the dinosaurs, but the land is ruled by those that have evolved.

- I agree that banks should be consistent. Not sure what the author was saying here.

- CAP is relevant to ALL datastores. Consistency, Availability, Partition tolerance: pick any two.

- if your post does not get committed to Facebook, it may be important to YOU, but the application is designed more for availability


Of course you can change schemas on the fly. Just because ALTER TABLE goes ka-thunk on MySQL doesn't mean it can't be done in more robust databases.
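For example, in PostgreSQL a schema change is cheap and even transactional (a sketch; the table and columns are hypothetical):

    BEGIN;
    -- Adding a nullable column is a catalog-only change, no table rewrite.
    ALTER TABLE users ADD COLUMN signup_source text;
    -- DDL rolls back like anything else if something goes wrong.
    ALTER TABLE users DROP COLUMN legacy_flags;
    COMMIT;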

I'm not sure what you think two-phase commits are, but they ARE atomic transactions. They are essentially "Everyone cool with this transaction?" "Yep!" "GO."

The CAP gotcha is that you already have partition tolerance - if you didn't, your database would go "boom" when a node went offline. So, really, it's "pick one".

And believe it or not, Facebook moved some stuff from Cassandra to HBase in part because of its stronger consistency model.


You can change schemas on the fly. But no one does it in their application except as part of admin functions.

Two-phase commits are when you write a partial record to the DB and then clean up the transaction. When they complete successfully, they are a complete transaction, but during a failure they are not atomic through the database, because it takes extra application logic to clean up.

Partition tolerance is not when one node goes offline, it is when critical nodes (master) or up to n/2 fail.

For CAP, you can optimize for just Consistency, Availability, or Partition tolerance. But you can also pick two - you just cannot do all three.

I did not know Facebook went to HBase or why it did so - can you provide a link?


It looks like different people use different definitions of partition tolerance; the original meaning was:

"The network will be allowed to lose arbitrarily many messages sent from one node to another."

Which means that, if you have a network, you have partition tolerance, period.

But Stonebraker defines it differently:

"If there is a network failure that splits the processing nodes into two groups that cannot talk to each other, then the goal would be to allow processing to continue in both subgroups."

Great article here:

http://www.cloudera.com/blog/2010/04/cap-confusion-problems-...

As for two-phase commits, I'm not sure what system you've used that requires app knowledge, but (for instance) PostgreSQL does them automagically; you just do PREPARE TRANSACTION and COMMIT PREPARED, and if the transaction fails you need app-level error handling - but presumably you have that error handling even without two-phase commits.
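For reference, a minimal sketch of those commands (the transaction name and table are invented, and max_prepared_transactions has to be non-zero for this to work):

    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    -- Phase one: park the transaction durably; it survives a crash
    -- and shows up in pg_prepared_xacts.
    PREPARE TRANSACTION 'xfer_42';

    -- Phase two, once every participant has prepared successfully:
    COMMIT PREPARED 'xfer_42';
    -- ...or, if any participant balked:
    -- ROLLBACK PREPARED 'xfer_42';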

Here's an article on Facebook's move to HBase:

http://www.facebook.com/note.php?note_id=454991608919


Here's my theory.

NoSQL came about because smart people were exposed to relational technology in this order:

1. A university course or book that mostly focused on SQL and then normalised design

followed by

2. Using MySQL in production.

What's missing from this picture is learning the other halves:

1. Normalisation matters to transaction processing. Fast queries are another matter entirely and usually only get airily waved at in many books and university courses. I went through an entire semester without seeing "OLAP". Techniques I learned on the job were kept to the "Advanced Databases" course, which was only taught sporadically.

The idea that OLAP is some high mystery is just silly. It's join-beating, denormalising stuff, like NoSQL, just with decades of literature and code to back it up.

2. It also matters that MySQL is not the benchmark of relational technology performance or features.

My day job is working with Oracle databases. The price, the odd absence of useful features because It's Never Been Done That Way (I'm looking at you, primary key triggers and booleans-stored-as-char(1)) ... sometimes it's amazing that people pay so much for it. Then you see the manuals, the supporting tools[1] and the performance a half-decent DBA can massage out of it, and you get that this stuff isn't as bad as the sticker price says.

For my own work, postgresql is where it's at. But for a big site I'd look to DB2, Oracle RAC, Teradata, NonStop, Greenplum and on and on and on before betting on 5-year-old technology reinventing a 50-year-old paradigm that didn't work real well the first time around.

[1] Except SQL Developer. What a dog.


> Normalisation matters to transaction processing.

If you mean that normalization matters only for transaction processing, I dare disagree. Normalization matters for data sanity. Building and maintaining a complex and agile application is much easier on properly normalized data. Making sure one bit of data is stored in only one place, and is properly decoupled from the existence of other bits of data, is still the Right Thing to do. True, it is sometimes in direct tension with data access performance, but that is an optimization issue, which can be solved with denormalization, materialization or the use of some "NoSQL" storage.
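A tiny sketch of what I mean, in SQL (the tables are invented): the customer's name is stored exactly once, and the read-optimized shape is derived from it rather than hand-maintained in several places.

    CREATE TABLE customers (
        id   serial PRIMARY KEY,
        name text NOT NULL                          -- stored once, referenced everywhere
    );

    CREATE TABLE orders (
        id          serial PRIMARY KEY,
        customer_id integer NOT NULL REFERENCES customers(id),
        total       numeric NOT NULL
    );

    -- Denormalization as an optimization, derived from the normalized source of
    -- truth (make it a materialized view or a summary table if reads demand it).
    CREATE VIEW customer_totals AS
    SELECT c.id, c.name, sum(o.total) AS lifetime_value
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id, c.name;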

Anyway, in these NoSQL discussions I always wonder: as far as I know, Wikipedia is still using a purely relational data store, and its main data (an enormous dict of blobs) seems to be the perfect candidate for a non-relational storage. So then, if NoSQL is so good at this task, how come they haven't moved yet? Should they move? (Genuine question)


That's a complicated topic.

For the social web, people typically denormalize data because consistency is less important than personalization and/or read performance. At some point, it's just easier to write the same data in a redundant way than to query from a canonical source and transform it on the fly. Wikipedia's raison d'etre is to show the same data to everybody. In fact, it goes to great lengths to ensure that everybody sees the very last updated version, no matter what. So there's no great pressure to denormalize core services; in fact, quite the opposite.

The next part of your question is whether a document-oriented database would be better. I think it would be possible to write a wiki on top of a document store. You'd gain a lot from simplicity, although you'd lose certain kinds of flexibility.

But, this is not practical for Wikipedia at this point. For everything else that goes into rebuilding a page, or administration, there are plenty of traditional joins.

The software is very much married to SQL. MediaWiki, the software that powers Wikipedia, is open source and database agnostic. There are people running MediaWiki sites on pretty much every RDBMS you can name. While queries and updates are all abstracted away, the core concepts are all obviously SQL. The abstraction layer just gets rid of syntactical quirks and handles escaping.

That said, a typical MediaWiki installation is not well normalized either. MediaWiki is capable of hosting a lot of extensions that extend the behavior of the wiki. If the extension needs to persist data that's associated with existing tables, a typical strategy is for the plugin to maintain its own parallel tables of data, which reuse the same primary key.

Like a lot of successful websites, the MediaWiki culture is pragmatic above all else. SQL databases are used to persist data and the best you could say is that it's a hybrid strategy.


> If you mean to mean that normalization matters only for transactional processing, I dare disagree.

I think we are, actually, in agreement. The benefits of ACID are essential to OLTP. It also so happens that normalised data can be written faster, as it only needs to be written once per datum.

As for Wikipedia, I think there are two reasons. First would be path dependence, second would be that their data is (I'm guessing) easily partitionable.


I don't really buy that. I have built applications using BDB on the back-end. It works really well for some things. It just has other shortcomings, development-wise, that usually cancel out the benefits and then some.

The thing is that object-oriented and relational design are fundamentally different. It is very rare to find people who are very good (or even equally good) at both. They are entirely different design disciplines aimed at very different problems. What non-relational DBs do is free the developer from this second, very different discipline, using tools that usually perform very well for standard OO operations.

That's a powerful thing.

Of course it's also powerful to have a math engine which can take your stored data, digest it, and spit out a report based on criteria you hadn't thought of until 30 seconds ago. And iterating through a collection of objects? Not a good way to do that.


> That's a powerful thing.

I have been working with people who know relational databases and OO. And I have been working with people who could only do one of these things.

I know with which group I prefer to work.

If a technology is sold as a "don't worry: low barrier to entry" way to keep the challenged on board, then knowing it becomes a great indicator of a lack of competence.

Seriously, until now NoSQL advocates would argue from the specific technical merits of the different options we have, but "developers don't grok RDBMSs" is not an argument. According to some, developers also don't grok FizzBuzz.

If you can't wrap your mind around set theory you should also not iterate through a collection of objects.


True enough, but there is a reason why OpenLDAP's performance is atrocious when run with an SQL back-end.

My own viewpoint is that there are cases where the relational model really offers very little benefit and a lot in terms of cost, but that these are far narrower than the NoSQL guys suggest. I see NoSQL at its best being a niche tool.

Remember, with NoSQL there's one answer when your customer asks "can I tweak this report?" NoSQL? The answer is "No!" of course!


The survey of nosql technologies sounds more like a rationale for another O'Reilly bookshelf!

I mean, there are definitely arguments that can be made for alternatives to relational databases, but outside those special cases, at such an early stage of maturation and without standards, they are only worth pursuing by the hardy cowboy or the blissful novice.

As the nosql technologies do mature and standards emerge, I think it should be expected that they will be subsumed into existing database products as new features.


There's already some work on this in PostgreSQL, with hstore and JavaScript as a stored procedure language.
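For instance, hstore gives you schema-less key/value columns inside an ordinary table (a sketch; the table name and keys are made up):

    CREATE EXTENSION hstore;

    CREATE TABLE docs (
        id    serial PRIMARY KEY,
        attrs hstore              -- arbitrary key/value pairs, no predeclared columns
    );

    INSERT INTO docs (attrs) VALUES ('author => jdoe, tags => "nosql, postgres"');

    -- Query on any key; a GIN index keeps it fast.
    CREATE INDEX docs_attrs_idx ON docs USING gin (attrs);
    SELECT id FROM docs WHERE attrs -> 'author' = 'jdoe';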


There is no such thing as the NoSQL movement.... it's the "NoMySQL, and MySQL is the only SQL implementation in the world because we're all PHP users and have never heard of PostgreSQL" movement.

And it doesn't help that a lot of the NoSQL DBs out there have SQL-like query languages.


Having the two categories SQL and NoSQL is a bit like having the two categories "books" and "non-books." The latter category includes giraffes, planets, feelings, and windmills.

Okay, maybe it's not that bad, but NoSQL databases still make a huge non-homogeneous set of things.


The debate over NoSQL vs SQL datastores seems to carry many of the same overtones as the debate over static vs dynamic typing in programming languages, with similar arguments being made on both sides.


Every time someone says you can't do schema-less design as easily in an RDBMS, I cite this EAV article on Wikipedia.

http://en.wikipedia.org/wiki/Entity–attribute–value_model
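The gist of the pattern, as a sketch in SQL (invented names): one row per entity/attribute/value triple, so "columns" can be added at runtime without touching the schema.

    CREATE TABLE entity_attributes (
        entity_id integer NOT NULL,
        attribute text    NOT NULL,
        value     text    NOT NULL,
        PRIMARY KEY (entity_id, attribute)
    );

    INSERT INTO entity_attributes VALUES
        (1, 'name',  'widget'),
        (1, 'color', 'blue'),
        (2, 'name',  'gadget');    -- entity 2 simply has no 'color'

    -- Pivot back into a row shape when you need one.
    SELECT entity_id,
           max(CASE WHEN attribute = 'name'  THEN value END) AS name,
           max(CASE WHEN attribute = 'color' THEN value END) AS color
    FROM entity_attributes
    GROUP BY entity_id;

The cost is that you give up type checking and straightforward queries, which is why it's usually confined to the genuinely sparse parts of a schema.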


Or you can use an ORM that generates the schema on the fly. (Technically, you still have a schema in both cases.)


I love the ad for MS SQL server in the middle of the page.


After reading the article and comments here, I'm left even more confused. Anyone care to explain some use cases for NoSQL databases?



