How to beat the CAP theorem (nathanmarz.com)
183 points by icey on Oct 13, 2011 | 77 comments



This doesn't beat the CAP theorem. The approach can be quickly described as a log-based append-only database with read-repair code based on last-write-wins.

The system will have inconsistent views for the whole time a netsplit lasts. If I have two nodes, A and B, that both hold a piece of data related to a meeting M0 at 11 on Monday, and then a netsplit happens, the suggested strategy won't help with the following case:

Jim and Peter have access to node A, and Mary and Julie have access to node B. Because there is a netsplit between A and B, communication between them is impossible. However, Jim and Peter agree to move the meeting M0 to 3PM on Monday, and we call it {A,M1}. Mary and Julie, however, agree to move the meeting to 1PM on Tuesday. We call it {B,M1}. Now, the M1 record on A and B is no longer the same. If the netsplit isn't resolved before Monday, then inconsistent data will have cost us our team meeting!

When the netsplit is resolved and we consult the meeting, the approach of the blog post will pick either B's version of M1 or A's version of M1, based on the MapReduce recomputation. To the reader, one of them will have been discarded. This is why the last write wins -- it's the latest state in the log of events; the other one is ignored.
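
To make the failure mode concrete, here's a minimal sketch of last-write-wins resolution (the record fields and timestamps are made up for illustration, not taken from the post):

    from dataclasses import dataclass

    @dataclass
    class Record:
        key: str          # e.g. "meeting-M1"
        value: str        # e.g. "Monday 3PM"
        timestamp: float  # wall-clock time of the write

    def last_write_wins(a: Record, b: Record) -> Record:
        # After the netsplit heals, keep whichever write is later;
        # the other write is silently discarded.
        return a if a.timestamp >= b.timestamp else b

    # Jim and Peter's write on node A, Mary and Julie's write on node B:
    on_a = Record("meeting-M1", "Monday 3PM", timestamp=1000.0)
    on_b = Record("meeting-M1", "Tuesday 1PM", timestamp=1005.0)
    print(last_write_wins(on_a, on_b).value)  # "Tuesday 1PM" -- A's update is lost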

Last-write-wins is a default mode to be used on read-repair; it does nothing to solve the problem of losing consistency. During the netsplit, views of the data remain inconsistent.

The way to keep them consistent would have been to have A and B block writes. That would have kept Jim, Peter, Julie and Mary from writing new meeting times and made sure they were all presented with the same hour.

An easy way to test whether you beat the CAP theorem or not is to imagine your system staying functional during a week-long netsplit. Either it keeps on working with inconsistent data, or it has a way to keep it consistent. The CAP theorem tells us it's impossible to do both at once; the blog post didn't disprove that, but merely provided a way to resolve conflicts in the data automatically.

I would expect someone to 'beat' the CAP theorem as much as I would expect someone to 'beat' the Pythagorean theorem. The blog is interesting in its approach to making the CAP theorem simpler to handle, but falls short of its own stated goal of beating it.


I'm not sure if you're trolling here:

> The CAP theorem has been beaten.

> As you are about to see, a system like this not only beats the CAP theorem, but annihilates it.

The CAP theorem states that you can't have all three of strong consistency, partition tolerance, and availability at once. This is a proven fact (http://www.cs.cornell.edu/courses/cs6464/2009sp/papers/brewe...).

I don't see how your proposed system provides strong consistency. Eventual consistency is not strong consistency. Let's pretend the HDFS deployment in your batch workflow is partitioned from the rest of the system: sorry, but you won't be able to get strongly consistent data from it, and you'll have to either read stale data or simply block until the partition ends. This may be pedantic, but the CAP theorem is pretty specific.

Now, if you're saying something about the eventual consistency of logically monotonic facts in the system (see the CALM conjecture, now theorem: http://databeta.wordpress.com/2010/10/28/the-calm-conjecture...), which I think you are, then I agree that you have a system providing (even provable) eventual consistency. The basic idea is that if code is logically monotonic, meaning the set of facts it operates on continues to grow over time, then it will obey eventual consistency properties. If I'm not mistaken, your notion of "immutable data" is equivalent to restricting programs to operate on logically monotonic data (in your examples, this is accomplished by making data take the form "fact X is true at time T").
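
A rough sketch of the monotonicity idea (the fact tuples here are invented for illustration): if each replica only accumulates timestamped facts and merging is just set union, the merge is commutative, associative, and idempotent, so replicas converge no matter the order in which they sync.

    # Grow-only sets of timestamped facts on two replicas (hypothetical data).
    replica_a = {("sally", "lives_in", "chicago", 1), ("bob", "lives_in", "nyc", 2)}
    replica_b = {("sally", "lives_in", "chicago", 1), ("sally", "lives_in", "atlanta", 3)}

    # Merging is set union: the same result regardless of sync order.
    assert (replica_a | replica_b) == (replica_b | replica_a)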


I never said anywhere that it provides strong consistency. You still choose between availability and consistency. What you beat is the complexity the CAP theorem normally causes. You regain the ability to easily reason about your systems.

"There is another way. You can't avoid the CAP theorem, but you can isolate its complexity and prevent it from sabotaging your ability to reason about your systems."

"What caused complexity before was the interaction between incremental updates and the CAP theorem. Incremental updates and the CAP theorem really don't play well together; mutable values require read-repair in an eventually consistent system. By rejecting incremental updates, embracing immutable data, and computing queries from scratch each time, you avoid that complexity. The CAP theorem has been beaten."


> What you beat is the complexity the CAP theorem normally causes. You regain the ability to easily reason about your systems.

So instead of saying "The CAP theorem, while still being completely valid, doesn't have to be as annoying as you might think," you said "The CAP theorem [has been] annihilated."

Gotta love blogging.


Wow. This is one of the most blatant examples of taking things out of context I've ever seen. The "annihilated" sentence you're referencing is part of a thought experiment in the post about an ideal system that's infeasible but instructive!

Here's the proper context for that statement:

"The simplest way to compute a query is to literally run a function on the complete dataset. If you could do this within your latency constraints, then you'd be done. There would be nothing else to build.

Of course, it's infeasible to expect a function on a complete dataset to finish quickly. Many queries, such as those that serve a website, require millisecond response times. However, let's pretend for a moment that you can compute these functions quickly, and let's see how a system like this interacts with the CAP theorem. As you are about to see, a system like this not only beats the CAP theorem, but annihilates it.

The CAP theorem still applies, so you need to make a choice between consistency and availability. The beauty is that once you decide on the tradeoff you want to make, you're done. The complexity the CAP theorem normally causes is avoided by using immutable data and computing queries from scratch."


So what you're saying is... the CAP theorem has NOT been beaten.


Beaten != Eliminated


It's like beating cancer by learning to mitigate the downsides.


Isn't "beating cancer" typically reserved for people who go into remission and aren't going to be killed by their cancer?

The analogy doesn't hold. OP is still stuck with CAP.


He even said he annihilated the CAP theorem. You can't weasel out of it; the title and intro are linkbait.


Good write up, Nathan. Nonetheless, you started by saying you were going to challenge the basic assumptions of how data systems should be built, but the resulting system architecture is in fact very similar to a lot of data streaming systems that academia and industry have proposed and built over a number of years. An example of such a system is Google's Percolator [1].

In essence, you separate the system into two parts, a read-only part and a real-time part. CAP explains fundamental trade-offs in OLTP type of queries, whereas this system design is catered towards streaming, analytical queries. These are two very different contexts.

[1] http://research.google.com/pubs/pub36726.html


I was under the impression that Percolator was a purely incremental system. I'll have to give the paper another read to see if they're using a hybrid batch/realtime approach.

The system I described isn't limited to any particular "type" of query. It's a way of supporting arbitrary queries on arbitrary data, which encapsulates everything.

Most people assume mutable state and incremental updates when building data systems. This is in fact the approach that almost all databases take. These are the assumptions I challenged.


>Most people assume mutable state and incremental updates when building data systems. This is in fact the approach that almost all databases take. These are the assumptions I challenged.

These aren't assumptions. They are requirements for some systems. You drop the requirements -- you get breathing room and can build a different system.


Mutability and incremental updates are never requirements. Requirements have to do with performance and resource usage: X amount of space, Y amount of memory, Z amount of latency, etc.

Mutability and incremental updates are approaches. I showed a different approach with immutability at the core which leads to a system with some amazing properties.


I hope that examples like this and git make more people see the value of immutability. In the context of functional programming the same concept is called purity.


> The key is that data is immutable. Immutable data means there's no such thing as an update, so it's impossible for different replicas of a piece of data to become inconsistent.

I'm not following that point.

Suppose you have an event-sourcing data store as the OP suggests, and some nodes go out of sync. So one of the nodes records that Sally now lives in Atlanta, the other doesn't.

The query for Sally's location asks for the last update. One node says Atlanta, the other still says Chicago. What's that, if not inconsistent?

Only appending data means you don't corrupt old data, but you still can get conflicting responses from different parts of your system.


What you're describing is a query. "What is Sally's current location?"

Queries can be inconsistent during partitions. This is what eventual consistency means. But once the partition clears, the queries will be consistent again.

The data is immutable. Both "Sally lives in Atlanta as of time X" and "Sally lives in Chicago as of time Y" are true.

This is far, far different from databases based on mutable state. In order for a database like that to be eventually consistent, you need to do read-repair to enforce consistency. That's in addition to the other problems I brought up with mutable databases, such as a lack of human fault-tolerance.
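
As a rough illustration of that distinction (entity/attribute names invented, not a real schema), both location facts are kept, and "Sally's current location" is just a query derived from them:

    # Immutable facts: (entity, attribute, value, timestamp)
    facts = [
        ("sally", "location", "Chicago", 100),
        ("sally", "location", "Atlanta", 200),
    ]

    def current_location(facts, person):
        # The query picks the most recent fact; nothing is ever overwritten.
        relevant = [f for f in facts if f[0] == person and f[1] == "location"]
        return max(relevant, key=lambda f: f[3])[2] if relevant else None

    print(current_location(facts, "sally"))  # "Atlanta", once all facts have arrived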


So you would still have to deal with the case of someone reading inconsistent data and taking the wrong action as a result. If that action is only internal to the system you could go on and do a cleanup when the system becomes consistent again. If the action is external, you cannot.


Right. If full consistency is a requirement then you can still have it, at the cost of availability. Alternatively, since the dataset is immutable, it contains a history of everything that happened. So you can resolve problematic actions later on (this is similar to what banks do).

It's important to realize that the tradeoff between consistency and availability is a limitation of nature, not of our tooling.


So CAP is still a problem, not beaten.


If I only have one antique pocket watch and, during a partition, Alan buys the pocket watch while communicating with partition A and Bob buys the pocket watch while communicating with partition B, how do we get back to a consistent state without application code?


A good example to look at is banks. Banks are eventually consistent systems. An ATM allows you to withdraw funds (up to a set limit) even if it can't communicate with the bank. However, because banks keep full audit logs of all transactions (immutable data), they eventually discover that you took out too much money and charge you an overdraft fee.

http://en.wikipedia.org/wiki/Overdraft
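
Roughly, the mechanism looks like this (amounts are made up; this is just a sketch of reconciling from an append-only log):

    # Withdrawals accepted during the partition are appended to the log;
    # reconciliation later detects the overdraft instead of preventing it.
    transactions = [
        ("deposit", 100),
        ("withdraw", 80),   # accepted at ATM 1 while partitioned
        ("withdraw", 60),   # accepted at ATM 2 while partitioned
    ]

    balance = sum(amt if kind == "deposit" else -amt for kind, amt in transactions)
    if balance < 0:
        print(f"Overdrawn by {-balance}; charge an overdraft fee")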


This isn't really relevant to the pocket watch problem. This only works because the bank doesn't care about over-committing a finite resource (mostly because they can charge you that fee). My company processes transactions for prepaid credit cards, so any money that is overdrawn is essentially lost - it's important to understand the characteristics of the problem and that there really is no magic anti-CAP bullet.


Exactly right. The bank example just demonstrates an alternative approach to full consistency.


This is a use case where it sounds like you want full consistency, so in the realtime layer you would use a database that becomes unavailable during partitions.


I think that most database applications cannot be seen as a monotonically-growing collection of facts, but rather as a sequence of operations, and those operations don't necessarily commute.

Most of the operations do commute, or almost commute--there are edge cases involving, for example, balances or inventory falling to zero, and with side effects, duplicates, generated ids, timestamps, etc. I think it's difficult for these to be handled automatically, because of semantic issues. For side effects, there have to be compensating actions--charging back credit cards for orders not filled, for example, or sending an email saying, sorry, you're not actually getting the watch. For operations that don't commute, having a batch system isn't going to be adequate.

For actions that do commute, I don't see why having a batch system is necessary. It just means having a third opinion of what the value should be. Unless you have a defined down-time, you're introducing more consistency issues.

Also, "online" (as in OLTP), rather than "realtime" is more consistent with standard DBMS terminology.


The database does it when the partition ends. It's called eventual consistency. The database would use something along the lines of vector clocks.
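
For example, here's a minimal sketch of the vector-clock "ancestor" check such a database might rely on (illustrative only, not any particular database's implementation):

    # Clock A descends from clock B if every counter in A is >= the one in B.
    def descends(a: dict, b: dict) -> bool:
        return all(a.get(node, 0) >= count for node, count in b.items())

    v_read    = {"node1": 3, "node2": 5}
    v_stale   = {"node1": 3, "node2": 4}  # an ancestor: safe to repair
    v_sibling = {"node1": 4, "node2": 4}  # concurrent: a real conflict

    print(descends(v_read, v_stale))    # True  -> overwrite the stale replica
    print(descends(v_read, v_sibling))  # False -> concurrent writes, must be resolved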


The problem is that in this case the database can only apply some ad-hoc heuristic. In the case of the pocket watch, this will be: the first user to buy it gets the watch, and the second user gets annoyed by an email saying that "the watch we said you bought was actually bought by someone else". There's no magic bullet here - this may be acceptable for some use cases but will not be for others.


>Both "Sally lives in Atlanta as of time X" and "Sally lives in Chicago as of time Y" are true.

>This is far, far different from databases based on mutable state. In order for a database like that to be eventually consistent, you need to do read-repair to enforce consistency.

What databases are you talking about? Any specific examples?


Dynamo, Riak, Cassandra


So this is more complicated than your schema?:

http://wiki.basho.com/Replication.html#Read-Repair

"Read repair occurs when a successful read occurs — that is, the quorum was met — but not all replicas from which the object was requested agreed on the value. There are two possibilities here for the errant nodes:

1. The node responded with a “not found” for the object, meaning it doesn’t have a copy.

2. The node responded with a vector clock that is an ancestor of the vector clock of the successful read.

When this situation occurs, Riak will force the errant nodes to update their object values based on the value of the successful read."


Yes. First of all, not every algorithm is amenable to read-repair. Imagine, for example, storing a unique count in the database. There's no way to know how to combine divergent values in that case. (If the root value is 4, and you have two divergent values of 5, you have no idea if the increment was due to the same element or not. The right answer is either 5 or 6, but you have no idea).

More importantly, if you make a mistake, you corrupt the database. The system I described based on immutable data is human fault-tolerant, which is a critical property. If you mess up, you can always correct things.


>Imagine, for example, storing a unique count in the database. There's no way to know how to combine divergent values in that case. (If the root value is 4, and you have two divergent values of 5, you have no idea if the increment was due to the same element or not. The right answer is either 5 or 6, but you have no idea).

If 2 nodes are allowed to accept writes for the same "cell" independently, without synchronization, i.e. node A: 4->5, node B: 4->5->6, how would your schema work in this case? (Of course, any schema would work fine if only one node were allowed to master the "cell".)


I think the point here is that nodes don't accept these random writes; any error that's introduced into a system with this structure is fixed on recompute.


You use the timestamp to resolve the two values to see whether 5 or 6 is the latest.

Your system would have the same problem to resolve which data is the latest.


In his system neither partition would have written "5" or "6". Rather, the one on the left would have written a "+1", and the one on the right would have written a "+1", and you can tell whether these are the same "+1" or not. You only combine them when you do the query.
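
A rough sketch of that, with made-up element names: each side records which element it saw, and the unique count is derived at query time, so the two "+1"s are unambiguous.

    seen_on_a = {"u17", "u21", "u33", "u48", "u52"}  # A's "+1" was u52
    seen_on_b = {"u17", "u21", "u33", "u48", "u52"}  # B's "+1" was also u52

    print(len(seen_on_a | seen_on_b))  # 5 -- the two "+1"s were the same element;
                                       # had B seen u60 instead, this would be 6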


One is +1 (to 5) and the other one is +2 (to 6). Which one is the correct one?


This is the key point. You either set your writes to be synchronous or asynchronous, depending on whether you want consistency or high availability. Having no updates means there are fewer possibilities for conflicts when sites A and B are appended to simultaneously. This means robust replication even with async writes.


You add a version number or timing information to each update. Node A says Atlanta, with version 42. Node B says Chicago, with version 41. 42 > 41, so you go with Atlanta.


When you have a fault in the system so you can't access node A, node B returns Chicago with version 41. You proceed with it if your system's choice is Availability at the price of Consistency, or you wait for node A if you choose Consistency at the price of Availability.


OP adds confusion by calling data immutable. Data has a well-accepted common usage of being updatable. Event is the term the OP wants. An event is a snapshot of a fact that happened at a point in time; it cannot be changed and is immutable.

What OP described in general is the append-log based approach to storing and querying data. There are a whole bunch of databases and file systems built on the append-log idea. It's well worth the time to check out the research in this area to see the benefits and pitfalls of the approach.

Also check out the Complex Event Processing topic which bears similarity to what OP described.


Sorry, no. It is calling data mutable that is the confusion.

Try this exercise:

Identify something you consider a datum. Write it down. Note that it has more than one part. Now imagine it 'changing' and write that datum down. Now, identify the 'it' that (you would say) changed. Note that the 'it' part is the same in the two data. Let's call this 'it' part the identity. Now look at the part that is different. For discussion's sake let's say that part was 42 in the first case and 99 in the second. Did 42 change into 99, or was it replaced by it? Let's call 42 and 99 values. You'll not be able to describe something that 'changed', that incorporates both the identity and value, that doesn't involve some memory/disk/system that had a (just one!) 'place' for storing information about 'it', and it is that place that has had its contents replaced/changed, not the datum/fact.

Did Obama becoming president change the fact that Bush had been president? No. Could you make an information system that forgot Bush was president? Sure, but that's your fault, not a characteristic of data.

Finally, check out this 30 year old paper that contains this interesting section: UPDATE IN PLACE: A poison apple?

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108....

As long as we continue to use lax language like mutable/updateable data we will be without the tools to describe, think about, and build better systems that independently manipulate identity, value and place, and correctly handle time.
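
A toy sketch of that separation (names invented): the "place" is just an append-only log, values are immutable, and the identity's current value is derived rather than overwritten.

    history = []  # the append-only "place"; each entry is an immutable (time, value) fact

    def observe(t, value):
        history.append((t, value))  # 42 isn't changed into 99; a new fact is recorded

    observe(2001, ("president", "Bush"))
    observe(2009, ("president", "Obama"))

    print(max(history)[1])  # ('president', 'Obama') -- the identity's current value
    print(history[0])       # the old fact is still there, unchanged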


I'm sorry. I don't understand the point of the exercise.

First, what kind of data are you talking about? Are you talking about data value? Data record? Data object with identity?

Second, are you saying database records being mutable cannot model Bush was a president once Obama has become president?

I don't know the reason for the aversion to mutable data record/object. Both mutable and immutable data record/objects have their places in modeling real world objects.


> I don't know the reason for the aversion to mutable data record/object.

I've found that mutable data is a root of all kinds of evil. In fact, I'd guess that most bugs stem from a mis-management of mutable state.

I know that pretty much all of the bugs I see every day are the result of failure to coordinate state changes, or poorly-designed state changes.


Can you give some specific examples of bugs stemming from mismanagement of mutable state? Some examples of bugs resulting from the failure to coordinate state changes or from poorly-designed state changes? And how does immutable data solve the problems?


I'm not sure why you don't believe me :)

There are literally too many to list. But if I just had to pick some recent ones:

* A dropdown on a web form that displays a different value than is returned (failure to coordinate the displayed value and the "actual" value)

* Logins remain locked after resetting a password (failure to change a value from "locked" to "not locked", instead of "is the user locked?" being a function of history)

* If you navigate to a business object from a list, and perform an action (via HTTP POST) on that object, you can accidentally perform the same action on other objects from that list through refreshing or clicking multiple times (since the "current" object is in the session state)

* Bugs that only happen when you navigate to a page from a certain other page (because the destination page depends on certain session state that was not set, instead of being a simple function of URL parameters)


Not that I don't believe you. It's just that I want to learn from specific examples what kinds of problems people encounter when dealing with mutable data and how immutable data would solve them.

So how would immutable data solve the above problems? Thanks.


> OP adds confusion by calling data immutable. Data has a well accepted common usage of being updatable.

I find this point to be generally contentious. Certainly in the realm of programming languages, data is often held immutable. I think mutable vs. immutable data is an important distinction with wide-ranging implications, so I find being explicit here adds clarity rather than confusion.


I agree. Saying that data must be updatable is a truism. Of course the vast majority of systems must have updatable data, as that is a fundamental building block of common-sense, real-world reasoning—to the end user data is always mutable. But to the systems designer, thinking about data being immutable and reasoning from there allows deeper thinking that takes us to more interesting architectures.


The key insight is that an "update" is really a query getting the most recent data record for some entity/type.


Like I said, I was going to break your assumptions on data systems.

Treating data as mutable (whether you use MVCC or append logs or any other approach) leads to major complexity problems, as I detailed in the post.


Redefining the meaning of a general term just creates more confusion about the point you want to make. Why not use a precise definition?

The term "data" is really really general and it's bad to convey what you really mean. Are you talking about data value? Data object? Data record? Data version? Data event?

Quoting from your blog, "A piece of data is a fact that you know to be true at some moment of time." That's called an Event. That's a snapshot of a fact in time.

MVCC is one way to do transactions. The blog didn't deal with transactions at all, so I don't see the comparison.

Append-log based databases come the closest to what you described in avoiding changing data in place and recomputing/precomputing data at query time. It's a very simple mechanism. What complexity problem do you see with it?


If I am not completely mistaken, the pattern that's being hinted at is called Command-Query Responsibility Segregation (CQRS) in certain circles:

http://cqrsinfo.com/documents/cqrs-introduction/

While making the problem easier, I do not know if this actually "solves" the CAP theorem issues.


At a high level, CQRS is simply separating your reads from your writes. It follows from the observation that most web-related DB queries are reads, with far fewer writes. I don't know about solving the CAP theorem, but it does a very nice job of scaling horizontally.
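
In code, the shape is roughly this (a hypothetical sketch, not any particular framework's API):

    events = []      # write side: an append-only event log
    read_model = {}  # read side: a denormalized view rebuilt from the events

    def handle_command(user_id, new_email):  # command (write) path
        events.append(("EmailChanged", user_id, new_email))

    def project():  # keeps the read side up to date, possibly asynchronously
        for kind, user_id, email in events:
            if kind == "EmailChanged":
                read_model[user_id] = email

    def get_email(user_id):  # query (read) path never touches the write model
        return read_model.get(user_id)

    handle_command("u1", "sally@example.com")
    project()
    print(get_email("u1"))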


In practice, beating the CAP theorem inside the data center is achieved by using a 2-tiered system.

Have a set of lightweight controller processes which use a consistent majority voting algorithm (Paxos) to replicate the state of the cluster. Since it's lightweight, you can run 5 or 7 of them, meaning 2 or 3 can die. This is enough Availability in practice, per customer input.

Have a set of data nodes which actually store the data. Put groups of 3 of these in replica sets and use consistent replication (Paxos). If one or two of them go down, the controllers reconfigure the replica set, so you get Availability even if 2 of 3 go down. (Customers love that they get Consistency but 2 of 3 can go down.)

This scheme is used in my company's product. The idea comes from academic/Google papers, and is used in other products, too.

The scheme does not work for multi-datacenter use-cases. If you want that, you have to give up consistent replication (Paxos) inside the replica sets and use some kind of eventual consistency. This variation of the 2-layered scheme is used by one of the popular competing NoSQL systems.
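
To illustrate just the quorum arithmetic behind those numbers (a sketch only, not a Paxos implementation; node names are hypothetical):

    def majority(n: int) -> int:
        return n // 2 + 1

    for controllers in (5, 7):
        print(controllers, "controllers tolerate", controllers - majority(controllers), "failures")
    # 5 controllers tolerate 2 failures; 7 tolerate 3.

    # A 3-node replica set needs 2 nodes for a consistent write. If it drops
    # below a majority, the controller tier reconfigures it with fresh nodes.
    replica_set, alive = {"d1", "d2", "d3"}, {"d1"}
    if len(alive) < majority(len(replica_set)):
        replica_set = alive | {"d4", "d5"}  # reconfiguration decided by the controllers
    print(replica_set)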


>Beating the CAP theorem inside the data center is fairly easy in practice by using a 2-tiered system.

>Have a set of lightweight controller processes which use a consistent majority voting algorithm (Paxos, Consistency, Partition Tolerance) to replicate the state of the cluster. Since it's lightweight, you can run 5 or 7 of them, meaning 2 or 3 can die. This is more than enough Availability in practice, per customer input.

This isn't beating the CAP theorem. It's decreasing Partition Tolerance and thus getting more Availability while preserving Consistency, in full accordance with CAP.


I wrote my thesis on an immutable data store as a means to beat CAP. I didn't get quite as far as your other observations though, as mine was in the context of continuation web frameworks.

Very nice write-up however.


Thank you. Got a link to your thesis anywhere? Would love to read it.


It's not in a state I would like to make public yet. Sufficient to get my degree, but the quality is poor due to personal circumstances. However I sent you a link via your contact form.


Beating the CAP theorem by proving it true, i.e. by dropping the requirement for consistency? Sorry to be conservative and follow Aristotelian logic, but this title does not make sense.


Indeed, it's not quite "beating". "How to get full availability and eventual consistency (the best you can do under CAP) with no effort in the application layer" might have been a truer title. Nevertheless, this is something useful.


A few days ago I posted my frustration with Cassandra counters; this article articulates a solution elegantly:

'if you make a mistake or something goes wrong in the realtime layer, the batch layer will correct it.'

The resiliency of this protection, however, is limited by the time it takes to do a full re-processing of the historical data. And if there is a processing mistake in your real-time layer, it is likely to exist in your batch layer as well, thus invalidating the pre-computed results too.


But you can always correct the mistake and recompute things. Unlike systems based on mutable data, mistakes aren't permanent.


If the transaction log is kept for the mutable system, you can always replay the transactions to arrive at the correct state. Most systems truncate the transaction log after a checkpoint to save space. If space is not an issue, you can keep the transaction log from the beginning.


How does the system deal with real-world data that has cross-datum consistency requirements? Many systems modeling the real world have this issue: accounting, inventory, ...

I'm also annoyed by the repeated mantra that you must have partition tolerance. In fact, most naïve systems are designed without it, with some central authority that must always be accessible. (Any system that is strictly a tree with an essential root node is like that. Any partition that does not include the root node cannot do any work, thus the system is not partition tolerant.) You can then have CA (by requiring that all requests and updates must propagate to the root.)


You then do not have the A part of CA. Your root node is not accessible => your entire system is not accessible.


The system computes functions on all the data you have. That describes every possible data system.

In the realtime layer where you compensate for the last few hours of data, you'll need an incremental strategy to maintain that "cross-datum" consistency.


'How to tolerate CAP, according to application / system specifications..'


Great write up. You really nail the essence of a true data driven approach (DDA) to solving problems. Simplification of the data world view can lead to great things. The key to your batch computation example is proper data modeling IMO. This leads to flexibility which is crucial for a design to be successful. I feel people will stumble with the content of this post because of the lack of proper understanding of DDA.


This looks like Event Sourcing / CQRS. Is that correct?


There are similarities. It's really closer to functional programming than anything else.


Nice post! To me, the logical next step is a mathematical framework for proving that all major relational operators (and in partition intolerant mode, ACID) can be implemented using this approach -- if such a model can be cleanly laid out it could drive the future of database systems.


Thanks for the long and interesting article. Two comments, one an observation I'd be interested in hearing your response to, the second a bit of a devil's advocate question.

First, in your write-up, you describe data as singular, but of course, it's often a tuple. To take a concrete example, let's say that Sally, in order to make her move from Atlanta to Chicago, goes to U-Haul's website and reserves a 14' truck for Monday. She thinks a bit more and decides that she needs a bigger truck, so she goes to the site and changes her reservation to a 17' truck. After even more thinking, she decides she really won't be ready by Monday, so she changes her reservation date to Tuesday.

Let's call these three moments t0, t1 and t2. So her order has changed from:

t0: 14' truck, Monday

t1: 17' truck, Monday

t2: 17' truck, Tuesday

BUT, it so happens that U-Haul has implemented a datastore like you suggest, one that is only "eventually consistent", and when Sally made her change at t2, the node she was talking to hadn't heard yet about the switch from a 14' to a 17' truck. So the actual tuples that were recorded were:

t0: 14' truck, Monday

t1: 17' truck, Monday

t2: 14' truck, Tuesday

Now you're in a pickle: how can you tell that the record stored at t2 was partially out of date, versus the possibility that maybe Sally really did want to change back to a 14' truck at the same time she changed the date to Tuesday?

To solve this, you need to do two things: the table that records new items is different from the ones that record updates, in that the new item tables contain full tuples, while the update tables are radically normalized, to the extent that they'll usually have just two columns: a key and an atomic replacement value. So what we'd actually record is:

t0: 14' truck, Monday in the "new reservation" table

t1: 17' truck in the "reservation item change" table

t2: Tuesday in the "reservation date change" table

I don't have any knee-jerk opposition to a schema like that, but it seems to me you'll run into two issues:

- all the usual issues with extreme normalization, like performance and massive and complicated joins. And since, as I understand it, your realtime iterative db is consuming the same data stream as your main store, it will face this same performance penalty, at least in its input.

- your schema darn well better be in BCNF, or enforcing data constraints across the hyper-normalized update tables is going to get very hairy very fast. Like here, let's say that U-Haul has a policy that it never rents 17' trucks on Tuesdays (I don't know why, maybe they wash them on Tuesdays). Since when Sally made her switch to Tuesday, the node she was talking to thought she was still renting a 14' truck, it was happy to record the change. But when your batch process comes along and tries to resolve the two updates, what does it do? There's no obvious answer (going with the last consistent state -- 17' for Monday -- is certainly wrong, as we know that's no longer what Sally wants). In other words, you're back in the "read repair" complexity and hell you thought you'd avoided.
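
To make that concrete, here's a sketch of the hyper-normalized tables and the resolution step (table and field names are mine, purely illustrative):

    new_reservations = [{"id": "r1", "truck": "14ft", "date": "Monday", "t": 0}]
    truck_changes    = [{"id": "r1", "truck": "17ft", "t": 1}]
    date_changes     = [{"id": "r1", "date": "Tuesday", "t": 2}]

    def resolve(res_id):
        base = next(r for r in new_reservations if r["id"] == res_id)
        truck = max([c for c in truck_changes if c["id"] == res_id] + [base],
                    key=lambda c: c["t"])["truck"]
        date = max([c for c in date_changes if c["id"] == res_id] + [base],
                   key=lambda c: c["t"])["date"]
        return truck, date

    print(resolve("r1"))  # ('17ft', 'Tuesday') -- note that nothing here enforces
                          # the "no 17' trucks on Tuesday" rule across the tables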

Second point ... eh, I think I've well overspent my word budget. In summary: a devil's advocate would say that the realtime db you've called just an "optimization" is, in fact, the real db, since if the realtime and batch dbs differ, at least at any time other than when a new batch update lands, then you'll go with the value from the realtime db (if, during normal operation, batch says a count is 4 but realtime says 5, you report 5). So what you've really done is attached a fairly massive auditing system onto the same-old database design. Its job is to periodically check that no inconsistencies have been introduced between nodes. It's a clever auditing system, in that the immutable model means it may have a leg up on deciding what the right result is, but still just an auditing system.

tl;dr: won't this lead to hyper-normalization, and is it really a new db design or just a new kind of auditing system

edit: formatting, tl;dr


I would store the reservation date and truck length as separate data objects in the batch layer. You would tie them together with the reservation id. It isn't a big deal to do those joins since the batch system is good at that kind of large-scale computation and you've already accepted high latency in the batch layer.

I wrote an article about graph schemas on the batch layer here: http://nathanmarz.com/blog/thrift-graphs-strong-flexible-sch...

The realtime db only accounts for the last few hours of data. Unless you're doing a query that only needs the last few hours of data, you need to do a merge between the batch view and the realtime view. We constantly flush old data out of our realtime db so as to keep it small and fast.
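
For what it's worth, the merge is roughly this shape (keys and numbers invented for illustration):

    # The batch view covers everything up to its last run; the realtime view
    # covers only the data that has arrived since then.
    batch_view    = {"pageviews:example.com": 10400}  # precomputed from all data up to T
    realtime_view = {"pageviews:example.com": 250}    # incremental, data since T only

    def query(key):
        # Combine the precomputed batch result with the small realtime delta.
        return batch_view.get(key, 0) + realtime_view.get(key, 0)

    print(query("pageviews:example.com"))  # 10650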


While I don't have an answer on how to beat the CAP theorem... I know someone who is working on it as we speak, as well as many other issues associated with the cloud. View his tweet below:

@eric_brewer Eric Brewer I will be leading the design of the next gen of infrastructure at Google. The cloud is young: much to do, many left to reach. 10 May via web


Wow, this article is full of clear, coherent, well thought out points... and references!


I love that research is going into disproving already superseded theories.



