Thanks for explaining. Your project looks very interesting. I will be definitely...

rescrv · on June 25, 2012

f is the number of failures that the system can tolerate. Typical systems can tolerate f failures with either (f+1) or (2f + 1) replicas.

HyperDex can tolerate f failures with f+1 replicas. This failure threshold is per region of the hyperspace. Within each region, servers are arranged in a chain. As servers fail they are removed from the chain. As new servers come into the cluster they are appended to the end of one or more chains. The "less than f nodes fail at any given point in time" translates to "at least one node in the chain is alive". Notice that chains are of length f+1 while only f nodes can fail, guaranteeing that one node is alive.

This will be tunable parameter. Right now, it is set when a new space is created, but in an upcoming release this will be totally dynamic and changeable on a per-region (of the hyperspace) basis.

I'd recommend f values of at most three. If failures are totally independent, this will be sufficient to last for millions of years. Thus, it is independent of the number of machines (once you have a sufficient number of machines).

DennisP · on June 25, 2012

"If failures are totally independent"...that's sometimes a big if, especially on cloud providers.

I didn't see anything in the documentation about backups...what are the options?

rescrv · on June 25, 2012

Most of the decent providers will help you improve failure independence. I know first hand that Linode is very accommodating for this.

I mention independent failures because f=3 is sufficient for this scenario for most. If there is correlation between failures and you cannot move servers around then you can compensate with a higher f.

In a future release (within the next six months), we'll be adding support for consistent snapshots that guarantee that HyperDex can withstand more than f failures with a bounded amount of data loss.

It's always possible to retrieve all data with an empty search and manually dump it into another form.

DennisP · on June 25, 2012

Cool. A related question...what if >f machines go down, but you can bring them back up? Do you lose data or just lose availability for a while?

rdtsc · on June 26, 2012

Thanks again for explaining and great work. Looks like a very exciting project!