TAPIR: A new high-performance, transactional key-value store (github.com/uwsyslab)
134 points by iyzhang on Jan 14, 2016 | 28 comments



Great to see fresh ideas! One thing I don't like in the presentation is that TAPIR is presented as doing better than what is out there without stating the conditions. First off, there's quite a bit of hand-waving when it comes to leadership bottlenecks; please assume that sane sharding is occurring. I'm also not entirely sure that transactions spanning partitions are something unique compared to, say, Paxos-based systems.

The abort rate vs. contention numbers seem off. Looking at the paper, all the nodes are in a single datacenter. I would love to see these numbers with TAPIR spread across larger geographical regions. My suspicion is that higher latency will hurt the abort rate more than it would with a strong leader. What about poor clock synchronization conditions? What about testing with a variety of client latencies? Since the client is effectively acting as a leader in TAPIR, it is, in some ways, contending with other clients, so the abort rate may be correlated with client latency. I don't think high-latency clients would see this same correlation with a strong leader. I wish more of the compromises were presented.


The leader bottleneck will continue to exist even with great sharding, because the leader simply processes more messages than the replicas do. By eliminating the leader, TAPIR allows each shard to support more throughput.
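
To make that concrete, here is a rough back-of-envelope (my own numbers, not the paper's) for messages handled per client operation, assuming Multi-Paxos with n replicas versus a TAPIR-style scheme where the client broadcasts directly to all replicas:

    def paxos_leader_msgs(n):
        # leader: request in, (n-1) accepts out, (n-1) acks in, reply out
        return 1 + (n - 1) + (n - 1) + 1

    def paxos_follower_msgs(n):
        return 2  # one accept in, one ack out

    def ir_replica_msgs(n):
        return 2  # one request in from the client, one reply out

    for n in (3, 5):
        print(n, paxos_leader_msgs(n), paxos_follower_msgs(n), ir_replica_msgs(n))
    # n=3: the leader handles 6 messages vs. 2 for everyone else; n=5: 10 vs. 2

The exact counts depend on the protocol variant, but the point is that the leader's load grows with n while everyone else's stays constant.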

The paper has an evaluation for multi-data-center replication in Figure 12. We assume that the clients are web servers, so they are always close to one of the replicas, but not all of them. The result we found is basically that TAPIR performs better in the multi-data-center case, except when the leader is in the same data center as the client. So it depends on whether you can always guarantee that the leader is in the same data center as the client.

The abort rate continues to essentially track the latency needed for commit. So, TAPIR reduces the abort rate compared to OCC because it reduces the commit latency. At very high contention, locking is likely to make slightly more progress, but no system with strong consistency will be able to provide high performance there. If you are interested in some other ways to optimize for the high contention case, take a look at our work on Claret: http://homes.cs.washington.edu/~bholt/projects/claret.html
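
A toy model of that relationship (my own back-of-envelope, not the paper's analysis): under OCC, a transaction aborts if a conflicting write lands during its commit window, so the abort rate rises with commit latency.

    import random

    def abort_rate(window_ms, p_conflict_per_ms=0.02, trials=20000, seed=0):
        # Chance that at least one conflicting write arrives during the
        # window; roughly 1 - (1 - p)^window_ms.
        rng = random.Random(seed)
        aborts = sum(
            any(rng.random() < p_conflict_per_ms for _ in range(window_ms))
            for _ in range(trials))
        return aborts / trials

    for w in (1, 2, 5):  # e.g. one RTT vs. two RTTs vs. a slow WAN commit
        print(w, round(abort_rate(w), 3))

When contention (p) is small, halving the commit window roughly halves the abort rate, which is why cutting commit latency cuts aborts.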

We also tested with high clock skew. The paper notes, "with a clock skew of 50 ms, we saw less than 1% TAPIR retries." Since clients can use the retry timestamps to sync their clocks, a retry only adds an extra round trip, which still leaves TAPIR with the same latency as a conventional system, even under extremely high clock skew.
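
A hypothetical sketch of that retry loop (the names here are illustrative, not TAPIR's actual API):

    import time

    class Replica:
        def __init__(self):
            self.latest_ts = 0.0  # highest timestamp accepted so far

        def prepare(self, proposed_ts):
            if proposed_ts > self.latest_ts:
                self.latest_ts = proposed_ts
                return "OK", None
            # Reject, but suggest a timestamp that would have succeeded.
            return "RETRY", self.latest_ts

    class Client:
        def __init__(self, clock_skew_s):
            self.skew = clock_skew_s  # how far off this client's clock is

        def commit(self, replica):
            status, retry_ts = replica.prepare(time.time() + self.skew)
            if status == "RETRY":
                # One extra round trip, retrying just past the suggested
                # timestamp; this also resyncs the client's notion of time.
                status, _ = replica.prepare(retry_ts + 1e-6)
            return status

    r = Replica()
    print(Client(0.0).commit(r))    # OK on the first try
    print(Client(-0.05).commit(r))  # 50 ms behind: one RETRY, then OK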


Yes, assumptions and pre-conditions would be quite useful. So much academic work has failed in the real world due to a mismatch between what was expected and what actually occurred. Or between what its users expected and what it was built for.


The key insight: "... existing transactional storage systems waste work and performance by incorporating a distributed transaction protocol and a replication protocol that both enforce strong consistency. Instead, we show that it is possible to provide distributed transactions with better performance and the same transaction and consistency model using replication with no consistency."

The full paper is here: http://delivery.acm.org/10.1145/2820000/2815404/p263-zhang.p...


Interesting. Can you elaborate on this? I am not sure what "inconsistent replication" means.


An analogy that immediately comes to mind is the difference between running a VPN in 'TCP over TCP' mode vs. 'TCP over UDP' mode.

TCP itself is a reliable transport over unreliable media. So running TCP on top of TCP means running two sets of reliability algorithms, ultimately doing more work than is needed. Running TCP over UDP (where UDP is unreliable) means you still get reliability from the TCP overlay, but you don't need to worry about the UDP layer, since it can fail and the TCP overlay's algorithms will fix up the data stream.


It means that operations don't get executed in the same order on every replica. If the system's state depends on the ordering of operations, then each replica might end up in a different state for the same set of operations.

"Consistent replication" would be using a protocol like Paxos to have the replicas decide on a single order of operations.


> If the system's state depends on the ordering of operations, then each replica might end up in a different state for the same set of operations.

Which is almost always the case. Except, of course, if you build your data as a growing "collection of knowledge", where the order of facts doesn't matter. But this is cheating, since you're implicitly bolting an ordering mechanism on top of the system in this case.
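
For what it's worth, here is the order-independent case in miniature (a sketch, not anyone's actual system): a grow-only set of facts converges regardless of delivery order, because set union is commutative, associative, and idempotent.

    facts = [("alice", "likes", "tapirs"), ("bob", "likes", "paxos")]

    replica_a = set()
    for f in facts:
        replica_a.add(f)

    replica_b = set()
    for f in reversed(facts):
        replica_b.add(f)

    assert replica_a == replica_b  # delivery order didn't matter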


Huh... isn't that required for being "transactional"? I should probably read the paper...


No, that was our key insight. Basically, you can think of it this way: using something like Paxos requires all replicas to run the same code, so every replica checks for transaction conflicts. However, you actually only need one replica to detect a conflict, so there is no reason for every replica to run the same checks; that is wasted work. Instead, we just make sure that each conflict is checked by at least one replica.
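
A rough sketch of that idea (my reading, with made-up names, not TAPIR's actual code): each replica checks conflicts against only what it has seen, but any two quorums intersect, so two conflicting transactions always share at least one replica, and that replica votes to abort one of them.

    class Replica:
        def __init__(self):
            self.prepared_keys = set()

        def prepare(self, keys):
            if self.prepared_keys & keys:
                return "ABORT"  # this replica saw the conflict
            self.prepared_keys |= keys
            return "OK"

    replicas = [Replica() for _ in range(3)]  # quorums of 2 (majority)

    def try_commit(quorum_ids, keys):
        votes = [replicas[i].prepare(keys) for i in quorum_ids]
        return "COMMIT" if all(v == "OK" for v in votes) else "ABORT"

    # Two conflicting transactions hit different but overlapping quorums:
    print(try_commit([0, 1], {"x"}))  # COMMIT
    print(try_commit([1, 2], {"x"}))  # ABORT: replica 1 caught the conflict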

The other interesting conclusion is that there are other workloads like this. For example, it is possible to build a reliable and better performing lock server in this way as well (and there is an implementation in the github repo). So, you'd get something similar to Chubby, but where the latency to the server is only a single round-trip in the common case.


"An error occurred while processing your request"


I found this video a couple of links in; the speaker, I think, is Irene Zhang, who is listed on many of the commits in that repo: https://www.youtube.com/watch?v=yE3eMxYJDiE


It would be helpful to mention that the talk is about this topic as well, not just who's speaking.


And also the poster of this item, presumably.


Yep, that's Irene.


"enabling TAPIR to provide the same transaction model and consistency guarantees as existing systems, like Spanner, with better latency and throughput."

Oh, hell yeah! Now that's great stuff. Can't wait to see the next step done by this or another team: building an alternative to the F1 RDBMS that Google built on Spanner. It would give CockroachDB some competition.


I'm the author of GoshawkDB.

I've just been watching the talk on this at https://www.youtube.com/watch?v=yE3eMxYJDiE. GoshawkDB has a very similar messaging and replication design. In fact, in some places, it appears that GoshawkDB's design is a little simpler.

There are obviously many differences too: for example, GoshawkDB runs transactions on the client, GoshawkDB uses Paxos Synod instead of 2PC, and GoshawkDB clients only connect to one server, so there are two extra hops. But that's a constant, so from a scaling point of view it should behave the same.

One of the biggest differences is that GoshawkDB uses vector clocks (which can grow and shrink) rather than loosely synchronized clocks.
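
For anyone unfamiliar, a minimal vector-clock sketch (illustrative only; GoshawkDB's real clocks can also shrink, which this toy version omits):

    def vc_increment(vc, node):
        vc = dict(vc)
        vc[node] = vc.get(node, 0) + 1
        return vc

    def happened_before(a, b):
        keys = a.keys() | b.keys()
        return all(a.get(k, 0) <= b.get(k, 0) for k in keys) and a != b

    a = vc_increment({}, "n1")  # {'n1': 1}
    b = vc_increment(a, "n2")   # n2 acts after seeing a: {'n1': 1, 'n2': 1}
    c = vc_increment({}, "n3")  # independent event on n3

    print(happened_before(a, b))                         # True: a precedes b
    print(happened_before(a, c), happened_before(c, a))  # False False: concurrent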

This TAPIR work does look great; I had no idea that it was ongoing. I'll read through the paper too. It's great that GoshawkDB and TAPIR have so many design ideas in common.


Very interesting. I went to search for your stuff only to find I had this bookmark:

https://news.ycombinator.com/item?id=10778946

So, I knew it was interesting and possibly great work but didn't have any time to look at it. I'll move it up the backlog a bit. Maybe at least give the papers and stuff a good skimming tonight. :)

Note: Parallel, ongoing, similar work going unnoticed is a huge problem in both IT and INFOSEC. I have a meme here about how we constantly reinvent the same solutions, make the same mistakes, or move at a slower pace due to lack of connections with similar work. I keep proposing something like a free and open combination of the ACM or IEEE, with members who are academics, FOSS coders, pros... people who contribute at least in writing. Stash papers, code, and forums there, so the odds of serendipity increase. Thoughts?


Heh, definitely! I could ramble for hours about how there are so many disincentives to share anything until launch; the whole industry really isn't set up to actually make progress. However, multiple independent co-invention of stuff is still valuable from a validation point of view. The worst bits I see are when something from 20 years ago gets reinvented. I have emailed Irene just to see if there's anything or any way to collaborate.


Great! It's awesome to see you taking the initiative! Might be an exception in the making.

I'm about to leave for work, but a quick question. What I want to see is an open equivalent of Google's F1 RDBMS: the best one. Does yours already provide its attributes, is it more like Spanner instead, or what? Aside from CockroachDB, where is OSS on an F1 competitor?


GoshawkDB doesn't have an SQL engine currently, so in that way it's probably not comparable with F1. GoshawkDB stores and accesses an object graph. Hopefully the howtos on the website will help people get into the mindset.

I'm not sure if it's worth trying to compare anything to Spanner or F1 because unless I'm mistaken, no one outside of Google can use F1 or Spanner - they're not released. So who knows what the performance of these things actually is? There's no way to verify the claims made in any Google paper.


"So GoshawkDB doesn't have an SQL engine currently, so in that way it's probably not comparable with F1. GoshawkDB stores and accesses an object graph. Hopefully the howtos on the website will help people get into the mindset."

Gotcha.

"I'm not sure if it's worth trying to compare anything to Spanner or F1 because unless I'm mistaken, no one outside of Google can use F1 or Spanner - they're not released. So who knows what the performance of these things actually is? There's no way to verify the claims made in any Google paper."

I think it's worthwhile for these reasons:

1. They describe it in enough detail in their papers for comparisons or maybe clones to be made.

2. That's led to competitors or open-source clones of tech before. Remember MapReduce?

3. They've deployed it operationally.

4. Since when has Google been unable to deliver backend tech it claims to run? I thought their rep was solid on that stuff.

So, Google already has a strongly consistent, scalable, multi-datacenter RDBMS with NoSQL-like performance. If it's good on enough workloads, that's the best thing I've heard in that category since NonStop SQL. The GPS thing, while a brilliant hack, might be hard for adoption; it's an improvement area the CockroachDB people are already targeting. A full clone or competitor could rock the DB world as a great alternative to Oracle clusters or NonStop for mission-critical workloads where some delays are tolerable.


Thank you for publishing the code for your paper. It's an excellent contribution by itself, and should help people understand and evaluate the ideas. Are you going to publish the TLA+ specification as well?


Reminds me of HyperDex, which has consistent replication through value-dependent chaining but only enforces ordering of commits when needed.


You claim it's high-performance, yet you provide no benchmarks.

How does it compare with Redis, Aerospike, Tarantool, Couchbase/Membase, Memcached, VoltDB, LevelDB, Kyoto Cabinet, Riak, Cassandra, RocksDB, LMDB, Neo4j, HBase, ArangoDB, Voldemort, or FoundationDB?


Figure 14 in the paper gives comparisons with MongoDB, Redis and Cassandra using YCSB. The git repo has the YCSB bindings for TAPIR, so you can run TAPIR against any of the other systems that also have YCSB bindings.


Thanks. If it really outperforms Cassandra, it's pretty impressive.


I don't think there are any claims that it did. The paper and the talk both made it clear that Cassandra outperformed it; I think MongoDB is the only one it beat. But I think the point is that it performs well while being consistent, not that it's faster than "weakly consistent" storage systems like Cassandra.



