Scalable ACID (dbmsmusings.blogspot.com)
109 points by ithkuil on Sept 1, 2010 | 29 comments



From a development standpoint, I don't quite understand the advantages of NoSQL over SQL. I get that you don't have to have a rigidly defined table structure, and that your data structure can change when your application changes, but I guess I'm one of those curmudgeons that prefers rigidity over permissiveness. I like that I have to think hard about my table structure and models before and as I code. I also rarely feel bogged down by SQL, although I've been using it for over a decade, so that helps.

As for scalability, as this post shows, I think the best thing to come out of NoSQL is the race for greater scalability it has inspired, and will continue to inspire, in SQL solutions.


"NoSQL" has come to symbolize a number of different things, depending on who you're talking to and when. I've seen all combinations of one or more of the following:

1. The rejection of what some see as an overly complicated and inflexible query language in favor of map-reduce phases or other application-side querying/processing.

2. The rejection of serialized transactions in favor of something like eventual consistency, often with application-specified conflict resolution.

3. Or it could be a rejection of the relational model (especially as it is popularly implemented) in favor of key-value, graph, document, column-oriented, etc. models.
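To make point 1 concrete, here's a toy sketch (my own example, not any particular system's API) of replacing a SQL GROUP BY with application-side map/reduce phases:

```python
# Toy illustration of point 1: instead of SQL's
#   SELECT status, COUNT(*) FROM orders GROUP BY status
# the application supplies map and reduce phases itself.
from collections import defaultdict

orders = [
    {"id": 1, "status": "shipped"},
    {"id": 2, "status": "pending"},
    {"id": 3, "status": "shipped"},
]

def map_phase(doc):
    yield (doc["status"], 1)

def reduce_phase(key, values):
    return sum(values)

groups = defaultdict(list)
for doc in orders:
    for key, value in map_phase(doc):
        groups[key].append(value)

counts = {key: reduce_phase(key, values) for key, values in groups.items()}
print(counts)  # {'shipped': 2, 'pending': 1}
```

Same answer as the SQL query would give, but the query logic lives in the application rather than in the database.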

If I read you and the authors right, I think you're thinking more along the lines of 3 and the authors are focussing more on 2.

I think both "sides" are just starting to come to grips with the idea that these design choices can be orthogonal (a relational DB without SQL? an eventually consistent SQL DB? a graph DB with 2-phase commit?) and choosing to reject one part of "the old way" doesn't mean you have to reject all of it. The upshot is, we're going to have a lot more tools to choose from for our own unique problems.

So, I'd say: don't worry about feeling like a curmudgeon! Rigidity will have its place in the beautiful gleaming pluralistic future of datastores that the SQL vs. NoSQL "debate" is building the road towards.


There are some times when the standard normal form is a real pain, for example, and your data fits better as a document or a simple key-value. I mentioned one such instance for me here: http://news.ycombinator.com/item?id=1637903


I think you're making the common mistake of confusing relational databases with entity-relationship modeling.

I can't really figure out your foo example, though; it needs to be a bit more specific.


This looks interesting, but I still can't tell how it relates to the CAP theorem. Are they sacrificing global consistency? Availability? Partition-tolerance? Or do they have a subtle reformulation of these constraints which allows them to "get around" the restrictions imposed by the CAP theorem?

Most of the popular NoSQL solutions were designed to run as large server clusters, possibly distributed across multiple data centers. They're designed to handle such problems as, "Our fiber backbone is down, and our database has split into a 5,000 machine cluster and a 7,000 machine cluster! And it's three days before Christmas!" (This is more or less the problem Amazon's database is designed to survive.)

For any given "NoSQL" database, you want to figure out how they handle this problem: Do they allow the 5,000 and 7,000 node datacenters to become temporarily inconsistent? Do they block writes to the entire database? Do they block writes to the 5,000 node database (because it doesn't have a quorum)?

Any of these options are valid—you just need to know what tradeoffs you're choosing.
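For a feel of the quorum option, here's a toy majority check (the node counts are from the example above; the logic is my own sketch, not any particular system's):

```python
# Toy majority-quorum check: after a partition splits a 12,000-node
# cluster into groups of 5,000 and 7,000, only the side holding a
# strict majority of the original cluster may accept writes.
def can_accept_writes(side_size, total_nodes):
    return side_size > total_nodes // 2

total = 12_000
print(can_accept_writes(5_000, total))  # False: minority side blocks writes
print(can_accept_writes(7_000, total))  # True: majority side stays available
```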


The author basically doesn't believe CAP represents a good model of the database tradeoffs:

"Over the past few weeks, in my advanced database system implementation class I teach at Yale, I’ve been covering the CAP theorem, its implications, and various scalable NoSQL systems that would appear to be influenced in their design by the constraints of CAP. Over the course of my coverage of this topic, I am convinced that CAP falls far short of giving a complete picture of the engineering tradeoffs behind building scalable, distributed systems."

http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-an...


Basically, it doesn't relate to the CAP theorem. ACID is largely orthogonal to CAP, the C notwithstanding. It sounds like the proof-of-concept sacrifices consistency (in the CAP sense): the model is one in which each node executes commands independently, but in a deterministic fashion, so they eventually look the same anyway.
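A toy sketch of that deterministic model (my own illustration, not the paper's code): give every replica the same ordered log of deterministic commands and they converge without coordinating.

```python
# Every replica applies the same ordered command log with no
# nondeterministic choices, so all replicas end in the same state.
def apply_log(log):
    state = {}
    for op, key, value in log:
        if op == "put":
            state[key] = value
        elif op == "add":
            state[key] = state.get(key, 0) + value
    return state

log = [("put", "x", 1), ("add", "x", 4), ("put", "y", 10)]
replica_a = apply_log(log)
replica_b = apply_log(log)
print(replica_a == replica_b)  # True: same log, same final state
print(replica_a)  # {'x': 5, 'y': 10}
```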


I'm fairly certain they are talking about a CA system, since locks must be acquired for the tuples involved in a transaction (even if only long enough to determine an ordering id for the transaction). In the presence of a partition, it wouldn't be safe to acquire these locks.


It's based on fully logically serializing all transactions, therefore it cannot tolerate partitions.


"Our view (and yes, this may seem counterintuitive at first), is that the problem with ACID is not that its guarantees are too strong (and that therefore scaling these guarantees in a shared-nothing cluster of machines is too hard), but rather that its guarantees are too weak, and that this weakness is hindering scalability."

That is indeed counter-intuitive at first, but they make a good case. I'm impressed with the idea, and hope to follow this more closely. Anyone know any other research on the subject?


I only skimmed the paper pointed to by the blog post, but one quote jumped out at me:

"Hence, in an era where disk reads caused wild variation in transaction length, allowing nondeterministic reordering was the only viable option."

In other words, when all your data is in ram (possibly distributed across a cluster of machines connected by a relatively fast network), the performance tradeoffs are very different than when your data is on rotating mechanical disks.

I don't know that I'm convinced that their approach is better, but they've certainly given me something to think about (and re-read in more detail later).


This sounds a lot like optimistic vs. pessimistic concurrency control, but over the set of all transactions in flight. You proceed assuming none of them are going to fail, but if any of them do, you're really screwed--you have to abort all of them and roll back to the last valid state to accept any more. Still, if we can stop rolling our own half-assed transactions over a set of feature-poor data stores, we'll avoid a lot of ugly problems.
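Roughly, the optimistic flavor looks like this (a toy version check of my own, not any specific store's implementation):

```python
# Toy optimistic concurrency control: read a version, do the work,
# and commit only if nobody else bumped the version in the meantime.
class Record:
    def __init__(self, value):
        self.value = value
        self.version = 0

def try_commit(record, read_version, new_value):
    if record.version != read_version:
        return False  # conflict: someone committed first, caller must retry
    record.value = new_value
    record.version += 1
    return True

rec = Record(100)
v = rec.version                   # transaction 1 reads version 0
rec.value, rec.version = 150, 1   # transaction 2 sneaks in and commits
print(try_commit(rec, v, 120))    # False: transaction 1 must abort/retry
```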


If I read the article correctly, one requirement of their system is that transactions have to be defined in such a way that all transactions succeed for some definition of success.


The idea does sound interesting - so it looks like they are trying to reduce the amount of network handshaking by imposing a stricter isolation.

I am not sure I am convinced by their results though. They say their deterministic system seems viable when comparing its performance to traditional systems under short in-memory transactions. That is a special case that is clearly in their favor though. In that situation, the amount of time spent in processing data is greatly reduced so the network overhead becomes much more significant - so the system that does less network communication will obviously win...

I guess it may potentially be good for in-memory database systems for stuff like OLTP apps (e.g. VoltDB and TimesTen), but then I think most OLTP apps are okay with a more relaxed isolation...


Daniel Abadi is one of the authors of the HStore paper that VoltDB is based on. It looks like the deterministic order system is using similar requirements and assumptions.


I think this would be better without the adversity. Nobody has to 'win' or 'lose' on 'either side' of 'the debate.' The whole way this is framed (by everyone, not just this) irks me.

Different tools for different cases. Polyglot persistence is the way forward.


Odd... I didn't find the tone of the article to be at all confrontational. The author gave a great summary of the NoSQL movement, its goals, and how his team has an idea that might accomplish many of those goals in a different way.


If there weren't so many "ZOMG NOSQL SUCKS" articles recently, I might agree with you, but "The poorly kept secret is that it's all about scale, if RDBMS-s would scale, nobody would NoSQL, and by the way, it's a lazy solution anyway!" in the first few paragraphs triggered my "I'm really getting sick of this" reaction.


"In our opinion, the NoSQL decision to give up on ACID is the lazy solution to these scalability and replication issues."

They had me up until they called NoSQL "lazy" and claimed NoSQL solutions "give up on ACID". They aren't lazy and a lot of them support ACID, or most of it.

From that point on this veers away from logic and moves towards a self-justified argument. Tools are just tools. If you need immediate consistency then go for an immediately consistent solution, if you don't then you should feel free to take advantage of the performance gain of eventual consistency.


The thing is that implementing ACID is, well, atomic - you either support it or you don't. Anything else is one of a series of different types of compromise. All currently available NoSQL systems I'm aware of compromise it very heavily, resulting in greatly increased programmer complexity for the types of applications that really require something that looks like ACID. I'd be interested to know which ones you think support ACID - as far as I'm aware none of them claim to support anything like it (with the possible exception of VoltDB, which has its own set of problems).


Scalaris has supported ACID transactions for years, but it never caught on because at scale you care more about availability than about having transactions: http://twitter.com/Werner/status/1008722501


If you're counting important things (money, inventory, etc), being up with the wrong answer can be worse than not being up.


"In summary, it is really hard to guarantee ACID across scalable, highly available, shared-nothing systems due to complex and high overhead commit protocols, and difficult tradeoffs in available replication schemes."

It's hard, but it's doable. I find it quite striking that no one ever seems to mention the two leading high performance parallel databases, HP Neoview and Teradata. Is it that people don't realize they exist?


Teradata is meant for data warehousing and analysis with lots of reads and few writes, not for OLTP that has a lot of writes.

Any DML that modifies a row in Teradata either locks the whole table if it doesn't know immediately where the row is, or locks the row hash if it does (if you give it the primary index value). Locking the whole table involves sending a command to all of the nodes and waiting for a response. That's why most writes are typically done as large batch loads at infrequent times.
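As a generic illustration (my own sketch, not Teradata's actual hashing scheme), the routing difference looks like this:

```python
# With a hash-partitioned table, knowing the primary index value pins
# the row to one node, so only that node needs a lock; without it,
# every node must be consulted.
NUM_NODES = 8

def nodes_to_lock(primary_index_value=None):
    if primary_index_value is None:
        return list(range(NUM_NODES))            # table-level: all nodes
    return [hash(primary_index_value) % NUM_NODES]  # row-hash: one node

print(len(nodes_to_lock()))            # 8 nodes involved
print(len(nodes_to_lock("order-42")))  # 1 node involved
```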


Neoview and Teradata are OLAP databases. OLTP is a different problem.


I can tell you a hilarious story about a former employer who bought a Teradata to run a trading system on...


Who uses serializable isolation level? I find that claim dubious, and I don't think ordering the way he talks about is better than read committed.


Academics writing papers is not the same thing as people actually solving problems in production.

Not to be ornery about this or anything.


Great production implementations can start from coders reading papers on a rigorously explored idea.



