CAP, as stated, is a tautology. If you want to understand this better, please read our description of the CAP theorem (http://hyperdex.org/extras/). As formulated by Gilbert and Lynch, the CAP theorem really says, "If you must always read the latest value, and you must have any server serve any request, and clients are collocated with servers, then you cannot survive a partition." This is in the same class as, "If you separate all clients from servers, the clients cannot contact the servers."
If, instead, you make assumptions about what kinds of failures are actually possible, then you can do more. HyperDex assumes that at most f nodes fail at any given point in time. A failure could be the process crashing, the machine being turned off, or a partition occurring within the datacenter. So long as this assumption holds, HyperDex is able to survive the partition, make progress (remain available), and provide strong consistency while doing so.
By making additional assumptions, we are able to make additional guarantees. This is not unlike what other engineers do (warning: massive simplification). An engineer building a bridge can assume the bridge has to hold only so many cars and that each car has an upper bound on its weight. The product of these gives the total weight the bridge must withstand. If the engineer builds the bridge to withstand the expected weight, plus some margin, then the bridge will work in all anticipated scenarios. It will not, of course, withstand bumper-to-bumper traffic consisting of 18-wheelers filled with lead bricks. But again, it doesn't have to.
I was guilty, for a time, of flinging 'CAP' around. Matters are, I believe, both simpler and far more complex.
Here's the simplest scenario I can imagine. It's very similar to what my company deals with all the time.
Consider two network-distant datacenters. Requests can land on either one, and a request can be either a PUT or a GET.
We deal with phone calls, so we have to provide an exceptionally reliable and consistent service.
The problem: a PUT lands in datacenter A. All subsequent GETs have to be consistent with this PUT. Even if they land in datacenter B.
The rest of the problem: the datacenters have to be able to function independently. This means that we can't block a PUT while it syncs to the other datacenter. The other datacenter might be down, or there might be some network delay, or whatever.
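To make the tension concrete, here's a rough sketch of the only two obvious PUT strategies I can see (the classes and names are purely illustrative, not any product's API): the synchronous version keeps GETs consistent everywhere but ties write availability to the WAN, while the asynchronous version keeps the datacenters independent but lets a GET routed to the other datacenter return stale data.

    # Illustrative only: made-up classes, not a real client library.
    import queue

    class DictStore:
        """Stand-in for a datacenter-local store."""
        def __init__(self):
            self.data = {}
        def put(self, key, value):
            self.data[key] = value
        def get(self, key):
            return self.data.get(key)

    class SyncReplicatedStore:
        """Consistent: a GET in either datacenter sees the PUT.
        But the PUT blocks until the remote side acknowledges,
        so a WAN outage or delay stalls writes."""
        def __init__(self, local, remote):
            self.local, self.remote = local, remote
        def put(self, key, value):
            self.local.put(key, value)
            self.remote.put(key, value)    # blocks across the WAN

    class AsyncReplicatedStore:
        """Independent: the PUT returns immediately and a background
        shipper drains the queue. But a GET served by the other
        datacenter can be stale until the queue drains."""
        def __init__(self, local, remote):
            self.local, self.remote = local, remote
            self.pending = queue.Queue()   # drained asynchronously
        def put(self, key, value):
            self.local.put(key, value)
            self.pending.put((key, value)) # remote catches up later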
In my mind, it's simply impossible to have complete consistency and complete reliability when the redundant pieces are WAN-connected.(1) One can only approach and approximate this with more and more hardware, circuits, and development complexity.
Is my assertion incorrect? I'm unclear if HyperDex helps me with this problem.
By the way, I'm pretty excited about this product overall. If I wasn't in catch-up mode at work, I'd be hammering out a proof of concept project on it right away. It seems pretty compelling.
And thanks for this offering!
(1) I believe a 'thick client' can almost solve this problem. Consider some JavaScript web app: it can be developed to transparently handle various network-partition situations, more or less. But many of our requests allow absolutely no code on the client side; they are standard REST calls.
What CAP really says:
- if you cannot constrain your faults in any way,
- and your requests can be directed at any server,
- and you insist on serving every request,
- then you cannot possibly be consistent.
No shortage of systems that preemptively give up on C, A, and P, especially C.
End quote.
I think that's exactly correct.
Which leaves me with a nearly impossible problem. But it sure is fun to work on!
HyperDex performs best in a single-datacenter environment. You are right that synchronous replication between datacenters can (and likely will) be slow.
There's much ongoing work on this (my favorites are Walter and COPS from SOSP '11). We'll likely be throwing our hat into the ring soon as well.
You are right that pushing some logic to the "thick client" is one of the viable solutions.
Feel free to contact us as you work on your PoC. We'd love to hear about positive results and help with any rough patches you encounter.
Thanks for explaining. Your project looks very interesting; I will definitely be following it.
Can you explain some more about how node failures are handled, and whether there is a way to tune that? I.e., more about the phrase "at most f nodes fail at any given point in time".
Is that tunable, and how dynamic is it? (Set once, or a dynamic setting? Also, is it set per running cluster, per namespace, or per key-value pair?)
Also, how is this f number related to the total number of machines in the cluster and how is it related to the number of data replicas?
f is the number of failures that the system can tolerate. Typical systems can tolerate f failures with either (f+1) or (2f + 1) replicas.
HyperDex can tolerate f failures with f+1 replicas. This failure threshold is per region of the hyperspace. Within each region, servers are arranged in a chain. As servers fail, they are removed from the chain. As new servers come into the cluster, they are appended to the end of one or more chains. The phrase "at most f nodes fail at any given point in time" translates to "at least one node in the chain is alive". Notice that chains are of length f+1 while at most f nodes can fail, guaranteeing that at least one node is alive.
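In sketch form, the chain idea looks roughly like the following (this is plain chain replication, simplified; HyperDex's value-dependent chaining is more involved, and every name below is made up for illustration):

    # Simplified chain replication for one region of the hyperspace.
    # Illustrative only; HyperDex's value-dependent chaining differs.

    class Replica:
        def __init__(self, name):
            self.name = name
            self.data = {}
        def store(self, key, value):
            self.data[key] = value
        def load(self, key):
            return self.data.get(key)

    class Chain:
        def __init__(self, replicas):
            # For fault tolerance f, the chain holds f+1 replicas.
            self.nodes = list(replicas)

        def put(self, key, value):
            # Writes enter at the head and propagate down the chain;
            # once the tail has the value, every live replica has it.
            for node in self.nodes:
                node.store(key, value)

        def get(self, key):
            # Reads go to the tail, which only holds fully propagated
            # writes, so readers never see a value that could be lost.
            return self.nodes[-1].load(key)

        def remove_failed(self, node):
            # Failed servers are spliced out; with f+1 replicas and at
            # most f failures, at least one node survives per region.
            self.nodes.remove(node)

        def append_new(self, node):
            # New servers are appended to the end of the chain (state
            # transfer to the new tail is elided here).
            self.nodes.append(node)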
This will be a tunable parameter. Right now, it is set when a new space is created, but in an upcoming release it will be fully dynamic and changeable on a per-region (of the hyperspace) basis.
I'd recommend f values of at most three. If failures are totally independent, that will be sufficient to last for millions of years. Thus, f is independent of the total number of machines in the cluster (once you have a sufficient number of machines).
Most of the decent providers will help you improve failure independence. I know first hand that Linode is very accommodating for this.
I mention independent failures because f=3 is sufficient for this scenario for most deployments. If there is correlation between failures and you cannot move servers around, then you can compensate with a higher f.
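To put rough numbers on the independence claim above (the per-node downtime probability below is my own assumption, not a HyperDex figure):

    # Back-of-the-envelope only: assumes independent failures and a
    # made-up per-node downtime probability.
    f = 3
    replicas = f + 1     # chain length
    p_down = 0.001       # assumed chance a node is down at any instant
                         # (roughly 9 hours of downtime per year)

    # A region loses availability only if all f+1 replicas are down at
    # the same moment.
    p_region_down = p_down ** replicas
    print(f"P(all {replicas} replicas down at once) = {p_region_down:.0e}")
    # => 1e-12 -- effectively never, *if* failures are independent.
    # Correlated failures (same rack, switch, or power feed) break this
    # estimate, which is why failure domains, or a larger f, matter.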
In a future release (within the next six months), we'll be adding support for consistent snapshots that guarantee that HyperDex can withstand more than f failures with a bounded amount of data loss.
It's always possible to retrieve all data with an empty search and manually dump it into another form.
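In sketch form, such a dump might look like the following (the hyperclient module name, coordinator address, and "phonebook" space name are assumptions on my part; check the HyperDex docs for the exact client API):

    # Rough sketch of dumping a space via an empty search; module name,
    # address, and space name are assumptions, not confirmed details.
    import json
    import hyperclient

    c = hyperclient.Client('127.0.0.1', 1982)

    with open('backup.json', 'w') as out:
        # An empty predicate matches every object in the space, so the
        # search streams the whole dataset for re-import elsewhere.
        for obj in c.search('phonebook', {}):
            out.write(json.dumps(obj) + '\n')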