CAP, as stated, is a tautology. If you want to understand this better, please read our description of the CAP theorem (http://hyperdex.org/extras/). As formulated by Gilbert and Lynch, the CAP theorem really says, "If you must always read the latest value, and you must have any server serve any request, and clients are collocated with servers, then you cannot survive a partition." This is in the same class as, "If you separate all clients from servers, the clients cannot contact the servers."
If, instead, you make assumptions about what kinds of failures are actually possible, then you can do more. HyperDex assumes that at most f nodes fail at any given point in time. A failure could be the process crashing, the machine being turned off, or a partition occurring within the datacenter. So long as this assumption holds, HyperDex is able to survive the partition, make progress (remain available), and provide strong consistency while doing so.
By making additional assumptions, we are able to make additional guarantees. This is not unlike what other engineers do (warning: massive simplification). An engineer building a bridge can assume the bridge has to hold only so many cars and that each car has an upper bound on its weight. The product of these gives the total weight the bridge must withstand. If the engineer builds the bridge to withstand the expected weight, plus some margin, then the bridge will work in all anticipated scenarios. It will not, of course, withstand bumper-to-bumper traffic consisting of 18-wheelers filled with lead bricks. But again, it doesn't have to.
I was guilty, for a time, of flinging 'CAP' around. Matters are, I believe, both simpler and far more complex.
Here's the simplest scenario I can imagine. It's very similar to what my company deals with all the time.
Consider two network-distant datacenters. Requests can land on either one, and a request can be either a PUT or a GET.
We deal with phone calls, so we have to provide an exceptionally reliable and consistent service.
The problem: a PUT lands in datacenter A. All subsequent GETs have to be consistent with this PUT. Even if they land in datacenter B.
The rest of the problem: the datacenters have to be able to function independently. This means that we can't block a PUT while it syncs to the other datacenter. The other datacenter might be down, or there might be some network delay, or whatever.
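To make the tension concrete, here's a rough sketch of the only two obvious PUT strategies I can see (the classes and names are purely illustrative, not any product's API): the synchronous version keeps GETs consistent everywhere but ties write availability to the WAN, while the asynchronous version keeps the datacenters independent but lets a GET routed to the other datacenter return stale data.

    # Illustrative only: made-up classes, not a real client library.
    import queue

    class DictStore:
        """Stand-in for a datacenter-local store."""
        def __init__(self):
            self.data = {}
        def put(self, key, value):
            self.data[key] = value
        def get(self, key):
            return self.data.get(key)

    class SyncReplicatedStore:
        """Consistent: a GET in either datacenter sees the PUT.
        But the PUT blocks until the remote side acknowledges,
        so a WAN outage or delay stalls writes."""
        def __init__(self, local, remote):
            self.local, self.remote = local, remote
        def put(self, key, value):
            self.local.put(key, value)
            self.remote.put(key, value)    # blocks across the WAN

    class AsyncReplicatedStore:
        """Independent: the PUT returns immediately and a background
        shipper drains the queue. But a GET served by the other
        datacenter can be stale until the queue drains."""
        def __init__(self, local, remote):
            self.local, self.remote = local, remote
            self.pending = queue.Queue()   # drained asynchronously
        def put(self, key, value):
            self.local.put(key, value)
            self.pending.put((key, value)) # remote catches up later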
In my mind, it's simply impossible to have complete consistency and complete reliability when the redundant pieces are WAN-connected.(1) One can only approach and approximate this with more and more hardware, circuits, and development complexity.
Is my assertion incorrect? I'm unclear if HyperDex helps me with this problem.
By the way, I'm pretty excited about this product overall. If I wasn't in catch-up mode at work, I'd be hammering out a proof of concept project on it right away. It seems pretty compelling.
And thanks for this offering!
(1) I believe a 'thick client' can almost solve this problem. Consider some JavaScript web app: it can be developed to transparently handle various network-partition situations, more or less. But many of our requests allow absolutely no code on the client side; they are standard REST calls.
What CAP really says:
- if you cannot constrain your faults in any way,
- and your requests can be directed at any server,
- and you insist on serving every request,
- then you cannot possibly be consistent.
No shortage of systems that preemptively give up on C, A, and P, especially C.
End quote.
I think that's exactly correct.
Which leaves me with a nearly impossible problem. But it sure is fun to work on!
HyperDex performs best in a single-datacenter environment. You are right that synchronous replication between datacenters can (and likely will) be slow.
There's much ongoing work on this (my favorites are Walter and COPS from SOSP '11). We'll likely be throwing our hat into the ring soon as well.
You are right that pushing some logic to the "thick client" is one of the viable solutions.
Feel free to contact us as you work on your PoC. We'd love to hear about positive results and help with any rough patches you encounter.
Thanks for explaining. Your project looks very interesting; I will definitely be following it.
Can you explain some more about how node failures are handled, and whether there is a way to tune that? I.e., more about the phrase "at most f nodes fail at any given point in time".
Is that tunable, and how dynamic is it? (Set once, or a dynamic setting? Also, is it set per running cluster, per namespace, or per key-value pair?)
Also, how is this f number related to the total number of machines in the cluster and how is it related to the number of data replicas?
f is the number of failures that the system can tolerate. Typical systems can tolerate f failures with either (f+1) or (2f + 1) replicas.
HyperDex can tolerate f failures with f+1 replicas. This failure threshold is per region of the hyperspace. Within each region, servers are arranged in a chain. As servers fail, they are removed from the chain. As new servers come into the cluster, they are appended to the end of one or more chains. The phrase "at most f nodes fail at any given point in time" translates to "at least one node in the chain is alive". Notice that chains are of length f+1 while at most f nodes can fail, guaranteeing that at least one node is alive.
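In sketch form, the chain idea looks roughly like the following (this is plain chain replication, simplified; HyperDex's value-dependent chaining is more involved, and every name below is made up for illustration):

    # Simplified chain replication for one region of the hyperspace.
    # Illustrative only; HyperDex's value-dependent chaining differs.

    class Replica:
        def __init__(self, name):
            self.name = name
            self.data = {}
        def store(self, key, value):
            self.data[key] = value
        def load(self, key):
            return self.data.get(key)

    class Chain:
        def __init__(self, replicas):
            # For fault tolerance f, the chain holds f+1 replicas.
            self.nodes = list(replicas)

        def put(self, key, value):
            # Writes enter at the head and propagate down the chain;
            # once the tail has the value, every live replica has it.
            for node in self.nodes:
                node.store(key, value)

        def get(self, key):
            # Reads go to the tail, which only holds fully propagated
            # writes, so readers never see a value that could be lost.
            return self.nodes[-1].load(key)

        def remove_failed(self, node):
            # Failed servers are spliced out; with f+1 replicas and at
            # most f failures, at least one node survives per region.
            self.nodes.remove(node)

        def append_new(self, node):
            # New servers are appended to the end of the chain (state
            # transfer to the new tail is elided here).
            self.nodes.append(node)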
This will be a tunable parameter. Right now, it is set when a new space is created, but in an upcoming release it will be fully dynamic and changeable on a per-region (of the hyperspace) basis.
I'd recommend f values of at most three. If failures are totally independent, that will be sufficient to last for millions of years. Thus, f is independent of the total number of machines in the cluster (once you have a sufficient number of machines).
Most of the decent providers will help you improve failure independence. I know first hand that Linode is very accommodating for this.
I mention independent failures because f=3 is sufficient for this scenario for most deployments. If there is correlation between failures and you cannot move servers around, then you can compensate with a higher f.
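To put rough numbers on the independence claim above (the per-node downtime probability below is my own assumption, not a HyperDex figure):

    # Back-of-the-envelope only: assumes independent failures and a
    # made-up per-node downtime probability.
    f = 3
    replicas = f + 1     # chain length
    p_down = 0.001       # assumed chance a node is down at any instant
                         # (roughly 9 hours of downtime per year)

    # A region loses availability only if all f+1 replicas are down at
    # the same moment.
    p_region_down = p_down ** replicas
    print(f"P(all {replicas} replicas down at once) = {p_region_down:.0e}")
    # => 1e-12 -- effectively never, *if* failures are independent.
    # Correlated failures (same rack, switch, or power feed) break this
    # estimate, which is why failure domains, or a larger f, matter.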
In a future release (within the next six months), we'll be adding support for consistent snapshots that guarantee that HyperDex can withstand more than f failures with a bounded amount of data loss.
It's always possible to retrieve all data with an empty search and manually dump it into another form.
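In sketch form, such a dump might look like the following (the hyperclient module name, coordinator address, and "phonebook" space name are assumptions on my part; check the HyperDex docs for the exact client API):

    # Rough sketch of dumping a space via an empty search; module name,
    # address, and space name are assumptions, not confirmed details.
    import json
    import hyperclient

    c = hyperclient.Client('127.0.0.1', 1982)

    with open('backup.json', 'w') as out:
        # An empty predicate matches every object in the space, so the
        # search streams the whole dataset for re-import elsewhere.
        for obj in c.search('phonebook', {}):
            out.write(json.dumps(obj) + '\n')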