HyperDex: A searchable distributed key-value store (hyperdex.org)
44 points by cemerick on June 25, 2012 | 41 comments



FWIW, I find your website suffers from the same problem as Riak's, which is to say, it is entirely too focused on the product rather than on how _I_ use the product. It reads like it's written for people who'd be interested in writing their own data store, rather than for people like most of us, who just want to CRUD in some shape or another.

I feel like I'm reading a white paper.

Redis' website is the exact opposite. Similarly, MongoDB's website, while a far cry from Redis', is still much, much better than Riak's.


Thanks, this is a valid point.

Do take a look at our basic tutorial: http://hyperdex.org/doc/tutorial/

Then the more advanced one showing asynchronous operations, atomic operations, and fault tolerance: http://hyperdex.org/doc/tutorial-advanced/

And finally the rich DB interface that supports lists, sets and dictionaries, each with supporting asynchronous and atomic operations: http://hyperdex.org/doc/tutorial-datastructure/

to see which, if any, of the features and API might be a good fit for your specific needs.
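
To give a quick flavor right here, the basic flow from the first tutorial looks roughly like this (a minimal sketch; the coordinator address and the 'phonebook' space are placeholders, and minor API details may differ from the current bindings):

  import hyperclient

  # Connect to the coordinator (host/port are placeholders).
  c = hyperclient.Client('127.0.0.1', 1234)

  # PUT an object under its key; the non-key attributes are the
  # secondary attributes that hyperspace hashing makes searchable.
  c.put('phonebook', 'jsmith1',
        {'first': 'John', 'last': 'Smith', 'phone': 6075551024})

  # A key-based GET always returns the latest PUT (linearizable).
  print(c.get('phonebook', 'jsmith1'))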


Was trying to figure out where it lies in the CAP trade-off space. Found this set of slides

http://hyperdex.org/slides/2012-04-13-HyperDex.pdf :

---

* Consistent: linearizable. GET returns latest PUT, always

* Available: in the presence of < f failures

* Partition-Tolerant: for partitions with < f nodes

---

So it seems they have strong C and a tunable trade-off between A and P?

Then they take a jab at the popular interpretation of the CAP theorem and claim they have a work-around:

----

   Working around the [CAP] theorem:
     * Constrain the failure size
     * Redirect clients to majority partition
     * Profit: Retain all of C, A, P
   Realistic for a modern data center

   CAP misses the point. The real tradeoff is between C, A, and Performance.
---

Not sure what they meant here. Anyone understood this better?


CAP, as stated, is a tautology. If you want to understand this better, please read our description of the CAP theorem (http://hyperdex.org/extras/). As formulated by Gilbert and Lynch, the CAP theorem really says, "If you must always read the latest value, and you must have any server serve any request, and clients are collocated with servers, then you cannot survive a partition." This is in the same class as, "If you separate all clients from servers, the clients cannot contact the servers."

If, instead, you make assumptions about what kind of failures are actually possible, then you can do more. HyperDex assumes that fewer than f nodes fail at any given point in time. A failure could be the process crashing, the machine being turned off, or a partition occurring within the datacenter. So long as this assumption holds, HyperDex is able to survive the partition, make progress (remain available), and provide strong consistency while doing so.

By making additional assumptions, we are able to make additional guarantees. This is not unlike what other engineers do (warning: massive simplification). An engineer building a bridge can assume the bridge can hold only so many cars and that each car has an upper bound on its weight. The product of these gives the entire weight the bridge must withstand. If the engineer builds the bridge to withstand the expected weight, plus some extra, then the bridge will work in all anticipated scenarios. It will not, of course, withstand bumper-to-bumper traffic consisting of 18-wheelers filled with lead bricks. But again, it doesn't have to.


I was guilty, for a time, of flinging 'CAP' around. Matters are, I believe, both simpler and far more complex.

Here's the simplest way I can imagine a scenario. This is very similar to what my company deals with all the time.

Given two network-distant datacenters. Requests can land on either datacenter. A request can be a PUT or a GET.

We deal with phone calls, so we have to provide an exceptionally reliable and consistent service.

The problem: a PUT lands in datacenter A. All subsequent GETs have to be consistent with this PUT. Even if they land in datacenter B.

The rest of the problem: datacenters have to be able to function independently. This means that we can't block a PUT while it syncs to the other datacenter. The other datacenter might be down, or there might be some network delay, or whatever.

In my mind, it's simply impossible to have complete consistency and complete reliability when the redundant pieces are WAN-connected. (1) One can only approach and approximate this, with more and more hardware, circuits, and development complexity.

Is my assertion incorrect? I'm unclear if HyperDex helps me with this problem.

By the way, I'm pretty excited about this product overall. If I wasn't in catch-up mode at work, I'd be hammering out a proof of concept project on it right away. It seems pretty compelling.

And thanks for this offering!

1) I believe a 'thick client' can, almost, solve this problem. Consider some Javascript web app. It can be developed to transparently handle various network partition situations. More or less. But many of our requests allow absolutely no code on the client side. They are standard REST calls.


Ok, I found these slides and I believe we are on the same page, so to speak:

http://hyperdex.org/slides/2012-04-13-HyperDex.pdf

Quote:

What CAP really says:

  - if you cannot constrain your faults in any way,
  - and your requests can be directed at any server,
  - and you insist on serving every request,
  - then you cannot possibly be consistent.

No shortage of systems that preemptively give up on C, A, and P, especially C

End quote.

I think that's exactly correct.

Which leaves me with a nearly impossible problem. But it sure is fun to work on!


HyperDex performs best in a single-datacenter environment. You are right that synchronous replication between datacenters can (and likely will) be slow.

There's much ongoing work on this (my favorites are Walter and COPS from SOSP '11). We'll likely be throwing our hat into the ring soon as well.

You are right that pushing some logic to the "thick client" is one of the viable solutions.

Feel free to contact us as you work on your PoC. We'd love to hear about positive results and help with any rough patches you encounter.


Thanks for explaining. Your project looks very interesting. I will be definitely following it.

Can you explain some more about how node failures are handled, and whether there is a way to tune that? I.e., more about the phrase "less than f nodes fail at any given point in time".

Is that tunable, and how dynamic is it? (Set once, or changeable at runtime? And is it set per running cluster, per namespace, or per key-value pair?)

Also, how is this f number related to the total number of machines in the cluster and how is it related to the number of data replicas?


f is the number of failures that the system can tolerate. Typical systems can tolerate f failures with either (f+1) or (2f+1) replicas.

HyperDex can tolerate f failures with f+1 replicas. This failure threshold is per region of the hyperspace. Within each region, servers are arranged in a chain. As servers fail, they are removed from the chain. As new servers come into the cluster, they are appended to the end of one or more chains. The "less than f nodes fail at any given point in time" assumption translates to "at least one node in the chain is alive". Notice that chains are of length f+1 while only f nodes can fail, guaranteeing that at least one node is alive.
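
To make the arithmetic concrete, here is a toy model (illustration only, not HyperDex code) of a per-region chain with f+1 replicas surviving f failures:

  # Toy model of a per-region replica chain (illustration only).
  class Chain(object):
      def __init__(self, servers):
          self.servers = list(servers)    # head ... tail, length f+1

      def fail(self, server):
          # Failed servers are simply removed from the chain.
          self.servers.remove(server)

      def join(self, server):
          # Recovering/new servers are appended at the tail.
          self.servers.append(server)

      def alive(self):
          # The region remains available while any node survives.
          return len(self.servers) >= 1

  f = 2
  chain = Chain(['s1', 's2', 's3'])    # f+1 = 3 replicas
  chain.fail('s1')
  chain.fail('s3')                     # f = 2 failures
  assert chain.alive()                 # one node left -> still available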

This will be a tunable parameter. Right now, it is set when a new space is created, but in an upcoming release it will be totally dynamic and changeable on a per-region (of the hyperspace) basis.

I'd recommend f values of at most three. If failures are totally independent, this will be sufficient to last for millions of years. Thus, it is independent of the number of machines (once you have a sufficient number of machines).


"If failures are totally independent"...that's sometimes a big if, especially on cloud providers.

I didn't see anything in the documentation about backups...what are the options?


Most of the decent providers will help you improve failure independence. I know first hand that Linode is very accommodating for this.

I mention independent failures because f=3 is sufficient for this scenario for most deployments. If there is correlation between failures and you cannot move servers around, then you can compensate with a higher f.

In a future release (within the next six months), we'll be adding support for consistent snapshots that guarantee that HyperDex can withstand more than f failures with a bounded amount of data loss.

It's always possible to retrieve all data with an empty search and manually dump it into another form.


Cool. A related question...what if >f machines go down, but you can bring them back up? Do you lose data or just lose availability for a while?


Thanks again for explaining and great work. Looks like a very exciting project!


Looks interesting. Is it like a MongoDB + Redis in one?

If I understood it well, if the coordinator dies, does everything die?

Any plans to provide pub/sub?

And/or capped collections (by number (keep 1000) or TTL (keep 24h))?


Great, let me answer your questions one by one:

HyperDex is a next-generation NoSQL store, so it kind of resembles traditional NoSQL stores like Mongo, Cassandra, Riak, and Dynamo, with a rich interface similar to that of Redis, but it offers much stronger properties than all of these systems.

It differs from Mongo in that it provides much stronger consistency and fault-tolerance guarantees. Mongo's default will gladly pretend that an operation was committed even before it was seen by any server. HyperDex differs from Redis in that it shards its data across a network and is generally designed from the ground up for a networked environment.

Both Redis and Mongo can and will return stale results following a failure, whereas HyperDex will always return consistent results -- it guarantees something called "linearizability" for key-based operations, which roughly means that time will never go backwards from the point of view of any client. And in spite of offering stronger properties, HyperDex is faster than both Mongo and Redis on industry-standard benchmarks.

The HyperDex coordinator sounds like a singular entity, but it is in fact a redundant, replicated service. The Paxos algorithm (provided by the ConCoord implementation here: http://openreplica.org) ensures that the coordinator as a whole can survive failures of some of its replicas.

Building a pub/sub system on top of HyperDex would be an excellent project.

HyperDex does not support "capped collections" out of the box, but it would be trivial to implement them with a background thread that prunes the database. The data store supports sorted queries, so you can say "return the top 1000 objects sorted by insertion time", or sort by whatever other metric you like. And you can delete groups of objects. These operations are implemented efficiently.
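
For instance, a capped collection could be approximated like so (a hypothetical sketch; the 'inserted_at' attribute, the key-attribute name, and the sorted-search signature here are assumptions, not the exact API):

  import time
  import threading

  def prune(client, space, cap):
      # Hypothetical pruner: walk objects newest-first by an assumed
      # 'inserted_at' attribute and delete everything past the cap.
      kept = 0
      for obj in client.sorted_search(space, {}, 'inserted_at', 'max'):
          kept += 1
          if kept > cap:
              client.delete(space, obj['key'])   # key attribute name assumed

  def prune_forever(client, space, cap, interval=60):
      # Run the pruner periodically in a background daemon thread.
      def loop():
          while True:
              prune(client, space, cap)
              time.sleep(interval)
      t = threading.Thread(target=loop)
      t.daemon = True
      t.start()

  # Usage (host/port and space name are placeholders):
  #   import hyperclient
  #   prune_forever(hyperclient.Client('127.0.0.1', 1234), 'messages', 1000)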

Hope these help.


This looks interesting, but can someone explain the benefits over memcache/Redis?

Is "hyperspace hashing" storing multiple copies, so it's like storing records on multiple shards of a database?

And "enables lookups of non-primary data attributes": how useful is this, actually? Is this a big step forward for NoSQL?

The site doesn't seem to be giving me a real-world case where HyperDex solves existing problems better than other things... maybe I just can't find it?


Memcache is a caching solution that stores binary blobs. HyperDex stores data persistently and offers a wide variety of types such as lists, sets, and maps.

Redis offers many datastructures as well, but it has a limited architecture. If you want to run multiple Redis instances, you must run them in a master-slave configuration with no guarantees when the master fails. HyperDex can withstand such faults while guaranteeing linearizable semantics. Further, we've got some changes in the pipeline that will enable us to support every datastructure operation Redis supports (BRPOPLPUSH, I'm looking at you) with horizontal partitioning across multiple machines.

Hyperspace hashing stores multiple copies, but this is configured by the user. In our applications we've had at most two copies.

You can find more about looking up non-primary data attributes in the tutorial (http://hyperdex.org/doc/tutorial/#creating-a-new-space). It enables you to search over attributes of an object without having to maintain secondary indices.
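
For example (a sketch in the tutorial's style; the attribute names are placeholders and the range-predicate syntax is an assumption):

  import hyperclient
  c = hyperclient.Client('127.0.0.1', 1234)   # placeholders

  # Equality search on a secondary attribute -- no index to maintain.
  for obj in c.search('phonebook', {'last': 'Smith'}):
      print(obj)

  # A range predicate (syntax sketched) constrains one dimension of
  # the hyperspace, so only servers owning matching regions are asked.
  for obj in c.search('phonebook', {'phone': (6075550000, 6075559999)}):
      print(obj)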

We've built a sample threaded-discussion application on top of HyperDex. We're also in talks with some astronomers who would like to use it to analyze data from a radio telescope. The geometric nature of hyperspace hashing makes it easy to perform queries such as, "retrieve all objects in this cone through space." For any application where you perform secondary attribute search, HyperDex is the way to go. For other applications where you want high throughput and low latency with strong consistency, HyperDex is a good bet.


I'd be very interested in your threaded-discussion sample. Couldn't find it on your site; do you have it online somewhere?


Here it is: http://gibbr.org/. An undergrad implemented this on HyperDex in about half a semester.


We have a demo of an earlier version deployed on http://gibbr.org/. We currently are not releasing the source, but we may make an example application using the same design we use in the real app.


Cool. An article on how it's designed might be just as good.


Ever thought about using some kind of locality-sensitive hashing, or maybe a neural network, to map your properties into the table space? My intuition says your partitioning scheme would still work and similarity search could be improved.
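
For concreteness, by LSH I mean something like signed random projections (a generic sketch, not HyperDex code), where similar attribute vectors tend to land in the same bucket:

  import numpy as np

  def lsh_code(vec, planes):
      # Signed random projections: each hyperplane contributes one bit;
      # vectors separated by a small angle agree on most bits.
      return tuple(int(np.dot(vec, p) >= 0) for p in planes)

  rng = np.random.RandomState(42)
  planes = rng.randn(16, 3)          # 16 hyperplanes over 3 attributes

  a = np.array([1.0, 2.0, 3.0])
  b = np.array([1.1, 2.1, 2.9])      # close to a -> mostly the same code
  print(lsh_code(a, planes))
  print(lsh_code(b, planes))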


Would be more believable if you gave some examples of the tradeoffs that went into the design, example use cases, strengths and weaknesses, etc.

As it stands, it just reads: "It's faster!! It's better!! It's webscale!"


We have a full 14-page paper (http://hyperdex.org/papers/hyperdex.pdf) describing the tradeoffs that went into the design and where the system's strengths and weaknesses lie.

Our statements that it is faster derive directly from our observations and evaluation in the paper.


That may be so, but people might dismiss everything as BS before ever getting to the paper.


A real hacker would never be dismissive of code and claims backed by an open git repo and documentation.

That said, HyperDex outperforms previous key-value stores and provides stronger guarantees than they do, thanks to two architectural differences.

First is a new way to distribute data called "hyperspace hashing," whose description requires a picture and can be found here: http://hyperdex.org/about/. This is quite different from the simple bucketing performed by memcached, Dynamo, Cassandra, Riak, MongoDB, and others. See the latest slide set for an illustration of the differences.
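
Boiled down to a toy sketch (illustration only, not the real implementation): each attribute hashes to one coordinate, the coordinates select a region of a multidimensional space, and regions are assigned to servers. A search that fixes a subset of attributes only touches the regions consistent with those coordinates:

  # Toy hyperspace hashing: 3 attributes, each dimension split into
  # two halves -> 8 regions (illustration only).
  DIMS = ('first', 'last', 'phone')
  SPLITS = 2

  def coord(value):
      return hash(value) % SPLITS

  def region(obj):
      # The object's attributes jointly pick one region.
      return tuple(coord(obj[a]) for a in DIMS)

  def candidate_regions(query):
      # A partial query leaves unspecified dimensions free, so it
      # matches a subspace of regions rather than all of them.
      regions = [()]
      for a in DIMS:
          if a in query:
              regions = [r + (coord(query[a]),) for r in regions]
          else:
              regions = [r + (i,) for r in regions for i in range(SPLITS)]
      return regions

  print(region({'first': 'John', 'last': 'Smith', 'phone': 555}))
  print(candidate_regions({'last': 'Smith'}))   # 4 of 8 regions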

Second, HyperDex maintains replicas through a technique known as "value-dependent chaining." This enables the system to keep the replicas in sync without having to pay high overheads for coordination.


"A real hacker would never be dismissive of code and claims backed by an open git repo and documentation"

Time is limited, the stream of new NoSQL projects is less so.

I'm sure the project is interesting, but the webpage comes off as bombastic.


You asked for a technical description, and also mentioned that you were unwilling to read a long, detailed technical description, and asked for a "TL;DR." I hope the short summary was useful. If you have technical feedback, we'll be very happy to engage further.


Thanks, I see that I came off as overly negative, sorry.

The things you link to, and an article (that I haven't read yet), put things in another light :)

I still think your webpage needs some work though. I twice dismissed hyperdex as "just another nosql/storage project full of s"#¤". I'm probably not the only one.

It's difficult to put my finger on why it comes off as it does, but it might be the density of buzzwords in the first few sentences.

What I actually want to know when looking at new projects is:

  - Where does it fit in the nosql "space"? Is it k.v.? graph? something else? 
  - Why is it different? Why do we need yet another .. [1]
(Just to be clear, I'm not asking you to answer this here, but IMO you should answer it on your homepage.)

edit: [1] http://nosql-database.org/


Thanks for the positive feedback. We've been working on improving the documentation (and soon, the website). Your feedback certainly helps.


Additionally, this looks like a repost anyways: http://news.ycombinator.com/item?id=3622888


How hard is it to add APIs for other languages? (In my case I'm thinking JVM, since I'm getting into Clojure.)


We have complete Java bindings (https://github.com/rescrv/HyperDex/commit/1c4e98d88517a63bf0...). Does this help for Clojure?

Our base bindings are in C. Any language which can wrap C can make bindings.


Awesome, yes.


Do you have examples of use in production?

I could not find any on your site. Adding such info would be beneficial.


We have worked with LinkedIn to use HyperDex for some of their analytics work.


This looks very interesting; however, I do wish they would stop doling out the kool-aid on (almost) every page. Instead of telling me why they are "this much faster than xyz," all I see are some graphs and numbers, which isn't particularly helpful because the benchmarks could be flawed!


The code is available on GitHub. If the benchmark is flawed, help us fix it.

Here are some ways it could be flawed:

* MongoDB and Cassandra were set to default consistency levels. They are not configured to wait for the majority of replicas to respond. HyperDex waits until data is committed at all replicas. This clearly biases the benchmark away from HyperDex.

* MongoDB and Cassandra operate on the key for the search benchmark, while HyperDex operates (solely) on secondary attributes. Another bias against HyperDex.

Most of our gain in the benchmarks comes from high GET throughput and high SEARCH throughput, although PUTs are competitive as well.

We are faster than MongoDB because they do idiotic stuff they don't have to. When I was investigating why HyperDex is faster, I looked solely at client libraries (since MongoDB's default config just writes to socket buffers, or buffers it in userspace). HyperDex has one function to create the request packet, one function that enqueues it with a constant number of operations, and one function to flush the queue. Once a request is created, it is not copied until the kernel moves it to a socket buffer. MongoDB, on the other hand, bounces through half a dozen different layers, some of which perform memmove to compact the data, keeping it contiguous in memory. While I've not examined the whole MongoDB code base, I suspect that it's more of the same. I can tell you first hand that the same diligence paid to making the HyperDex client efficient was paid at all layers of the HyperDex stack.

Edit: Trying to get bullets to work


"all I see are some graphs and numbers"

What the hell?

You are free to disagree[1] with their results or point out flaws in their methodology. But how on earth can you complain about them providing real numbers and the code used to generate them?

This is rare enough; most other databases provide only the kool-aid, without any numbers whatsoever (cf. MemSQL).

[1] https://groups.google.com/d/msg/redis-db/N4oy3lCngsU/NQUwf12...


Thanks for the note. We were flabbergasted to read the parent comment complaining about "graphs and numbers."

Note that the issues raised by the Redis developer boil down to the following:

* The benchmark measures a primitive that Redis does not provide, so Redis looks slow: This may be true. The strength of a system lies in how well its primitives handle unanticipated uses. Undoubtedly, a system with a "do_benchmark()" interface would achieve the best results, but this is not a good way to build systems. For the record, HyperDex's interface is at the same level of abstraction as Redis's in this case.

* The benchmark compares "single-core Redis" to "multi-core HyperDex." It is true that HyperDex was designed from the ground up for a networked, multi-core system. Redis is not sharded and seems to work best when co-located on the same host as its clients. If your data fits on one host and clients are on the same machine, you should probably use Redis and not HyperDex. As for the complaint, we would have used something other than "single-core Redis" if such a thing existed. Our emails to Salvatore asking for an alternative binary went unanswered -- he chose to respond with blog entries instead of code.

* The benchmark is not real: The benchmark in question is the Yahoo Cloud Serving Benchmark. It's not something we made up. One can only imagine the kind of criticism we would get if we had actually constructed our own benchmarks. YCSB is an industry-standard benchmark commonly used to evaluate the performance of key-value stores.

These kinds of issues are really easy to resolve without resorting to noise on blogs and HN: we urge everyone to test their own apps against the git repo. We worked hard to make HyperDex the best key-value store out there with the strongest properties, and we hope you find it useful.


Just a further note: Although HyperDex is multithreaded by default, when benchmarking it against Redis, we disabled all but one thread from serving network requests.

Edit: reword for clarity.



