John here from VoltDB. We really enjoyed working with Kyle on this project. We’ve got some content on our website related to this work if you’re hungry for more, including a blog post, a FAQ on transactions and consistency and more detail on the issues Kyle found in VoltDB 6.3 that have been fixed in 6.4. It can all be found here: https://voltdb.com/jepsen
There are also a number of us here to answer any questions about this Jepsen work or about VoltDB generally.
You and your team's resolve and success in tackling these bugs is awesome.
It's interesting to read about some of the relatively "unusual" choices VoltDB has made as well. Like this one:
> network partitions cause minority nodes to shut down permanently; operator intervention is required to restore full service
...combined with requiring majorities of current rather than original clusters to continue. This is, for human operational interactions, really interesting. There's a big history of self-healing systems getting a little carried away and masking or sometimes even triggering bigger issues. Making signoff to rejoin nodes a required operation seems like it makes a whole category of "flapping" problems a non-issue. Kudos for making these "unusual" choices; this is a great horizon to explore :)
I hadn't heard of VoltDB until now, but have been following Jepsen for a while. I wish more projects did what you did. Congrats on the increased reliability.
It seems like several of the issues aphyr found were due to optimizations the product was making. Fixing them in a point release suggests they weren't huge changes with massive ramifications. I don't know what the key benchmarks for software like VoltDB are. Are these operations (relatively) rare in normal deployments of VoltDB? Did these optimizations make more of a difference in previous versions of VoltDB?
The stale and uncommitted reads were the result of an optimization. It seemed like a harmless optimization to make at the time, and we were wrong. v6.4 allows you to pick strong serializability (the default) or the old behavior at startup.
For 100% read workloads, the impact to maximum throughput can be significant. That's a pretty uncommon workload for us though. It's likely a single-digit percentage problem at 50% reads.
Two nice things about VoltDB users: 1) They are often very write-heavy; 99% read-write transactions isn't uncommon. 2) Few are running anywhere near the maximum throughput of their machines. Most size their clusters for appropriate storage and redundancy, not throughput.
The lost write issues weren't optimizations, just implementation bugs. Sigh.
Second, the consensus algorithm we use today is used for cluster membership, schema changes and other agreement purposes, but not for the main transactional path. However, in versions before 3.0 it was.
At the time, we wanted to make really different tradeoffs than something like Paxos. Namely, we wanted to be able to run with two copies, rather than the three that Paxos/Raft require. We also wanted crazy throughput and low performance variation, which are hard things to get with Paxos/Raft. The main tradeoff is that our system doesn't work well across high-latency links, which is ok with us.
Ultimately we added a second protocol layer for transactions in 3.0, which made the transaction path independent of clock skew. This layer actually looks a lot like Raft, but differs in some key ways.
We may even someday switch to Raft for the core agreement stuff, but there's a lot of work to be done for that to happen.
We have done some initial work on TLA+, but it may be a while before we finish anything. If someone wants to help, there's probably a publishable paper in it.
Cool. I am always a little wary of distributed algorithms that don't go through that level of validation. But I recognize that few commercial firms do that, although AWS is a standout. I had the same criticism of ZooKeeper for a long time. Glad to hear you have at least considered it.
Why would I use VoltDB? Serious question; this is the first time I've heard of VoltDB.
I mean, I can use a Postgres or MSSQL cluster for SQL needs, and/or Riak/... for KV, and/or Cassandra/... for column-indexed storage, and/or ... etc. Why should I look at VoltDB?
Well, one thing we can say definitively today is that VoltDB offers strong serializability, which almost no other clustered system does, and the ones that do are slow.
But my more typical answer is the combination of throughput and transactional logic. No other system does both as well as VoltDB. This comes up a lot in policy enforcement, fraud detection, online gaming, ad tech, billing support, and more.
Potential readers, not sure whether or not to make the slog: Do.
This is the most effective explanation (for me, anyway) of the difference between "serializable" and "linearizable" of any of aphyr's blogs so far. They've been keywords in that little cladistic tree of consistency models he draws for a while now, but with this explanation and the examples, I finally grok what they mean.
I'm not so sure. The definition of strict serializable/linearizable given in this post is weaker than what most people use. Most people require that linearizable execution must happen as if it matches a global clock. Simply requiring that the effects of a transaction happen within the start/end time of the partition executing the transaction does not guarantee this. Most distributed databases (from what I can tell, including VoltDB) only guarantee that for operations which read/write the same keys. Non-conflicting transactions execute without any synchronization overhead - and that's a good thing! But in the presence of side channels, you need a truly global clock (like Spanner) to achieve strict serializability.
> Most people require that linearizable execution must happen as if it matches a global clock. Simply requiring that the effects of a transaction happen within the start/end time of the partition executing the transaction does not guarantee this.
I'm not sure what you mean by "within the start/end time of the partition executing the transaction". Nothing in the informal definition I provided mentioned partitions, or even process-local orders (that'd be sequential consistency) so ... yeah, I dunno where you got this from. I agree that linearizability is a global real-time constraint, and I use that sense in the post and in the Knossos verifier.
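To make the real-time constraint concrete, here's a toy sketch (not Knossos; the Op record, the timestamps, and the two-operation history are invented for illustration) of the stale-read case: a write completes before a read begins, yet the read returns the old value. Treating each operation as a one-statement transaction, some legal order exists (read before write), so the history is serializable; no legal order consistent with real time exists, so it is not linearizable.

```java
// Brute-force toy checker: is there a legal total order of the operations,
// optionally constrained to respect real-time precedence?
import java.util.*;

public class RealtimeDemo {
    // invoke/complete are the times the client issued and observed each op.
    record Op(String kind, int value, long invoke, long complete) {}

    public static void main(String[] args) {
        // P1 writes 1 during [0, 10]; P2 reads 0 during [20, 30].
        // The read begins strictly after the write has completed.
        List<Op> history = List.of(
            new Op("write", 1, 0, 10),
            new Op("read",  0, 20, 30));

        System.out.println("serializable?  " + existsLegalOrder(history, false)); // true
        System.out.println("linearizable?  " + existsLegalOrder(history, true));  // false
    }

    static boolean existsLegalOrder(List<Op> h, boolean respectRealtime) {
        return permute(new ArrayList<>(h), new ArrayList<>(), respectRealtime);
    }

    // Try every order of the remaining ops; with rt, never place an op before
    // another op that completed before it was invoked.
    static boolean permute(List<Op> rest, List<Op> prefix, boolean rt) {
        if (rest.isEmpty()) return legal(prefix);
        for (int i = 0; i < rest.size(); i++) {
            Op candidate = rest.get(i);
            if (rt && violatesRealtime(candidate, rest, i)) continue;
            List<Op> nextRest = new ArrayList<>(rest);
            nextRest.remove(i);
            List<Op> nextPrefix = new ArrayList<>(prefix);
            nextPrefix.add(candidate);
            if (permute(nextRest, nextPrefix, rt)) return true;
        }
        return false;
    }

    // Some other unplaced op finished before this candidate was even invoked,
    // so placing the candidate now would invert real-time order.
    static boolean violatesRealtime(Op candidate, List<Op> rest, int idx) {
        for (int j = 0; j < rest.size(); j++) {
            if (j == idx) continue;
            if (rest.get(j).complete() < candidate.invoke()) return true;
        }
        return false;
    }

    // Replay the order against a sequential register starting at 0:
    // every read must return the last written value.
    static boolean legal(List<Op> order) {
        int reg = 0;
        for (Op op : order) {
            if (op.kind().equals("write")) reg = op.value();
            else if (op.value() != reg) return false;
        }
        return true;
    }
}
```

Knossos does roughly this kind of search, just with a real model of the datatype and much better pruning.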
> Most distributed databases (from what I can tell, including VoltDB) only guarantee that for operations which read/write the same keys.
You raise an interesting question: if I verify only that operations on a single key are linearizable, have I also verified that operations on systems of independent keys are linearizable? The answer, as far as I know, is yes: linearizability is a local (or "composable") property. From Herlihy & Wing (https://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf):
> Unlike alternative correctness conditions such as sequential consistency [31] or serializability [40], linearizability is a local property: a system is linearizable if each individual object is linearizable. Locality enhances modularity and concurrency, since objects can be implemented and verified independently, and run-time scheduling can be completely decentralized.
This is a commonly cited property in the literature, and has been proven several ways--for instance, see Lin's recent constructive proof (http://arxiv.org/abs/1412.8324). There is research showing linearizable systems vary in the probability distribution of outcomes (https://arxiv.org/pdf/1103.4690.pdf), but this does not affect safety.
However, your comment led me to Charron-Bost & Cori 2003 (http://www.worldscientific.com/doi/abs/10.1142/S012962640300...), whose abstract claims a counterexample system of two linearizable objects whose composed system is nonlinearizable. I haven't found the full text yet, and I'm not familiar with their sense of "The Global Time Axiom", so it's possible their finding is still consistent with "linearizability is composable". Not sure.
In any case, the multi-key tests in this analysis do perform single-key transactions (as well as multi-key transactions), and verify that their composed system is fully linearizable. Because the state space for composed systems is larger, these tests aren't as good at finding bugs--but if composability turns out not to hold, I can use this strategy more often.
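For the curious, here's a toy sketch of what that per-key strategy looks like; the KeyedOp record, the history, and the brute-force register checker are invented for illustration (a real test hands each projection to a full checker like Knossos). By the locality result above, the composed history is linearizable iff every per-key projection is.

```java
// Split a multi-key history into per-key sub-histories and check each one.
import java.util.*;
import java.util.stream.Collectors;

public class LocalityDemo {
    record KeyedOp(String key, String kind, int value, long invoke, long complete) {}

    public static void main(String[] args) {
        List<KeyedOp> history = List.of(
            new KeyedOp("x", "write", 1, 0, 10),
            new KeyedOp("y", "write", 2, 5, 15),
            new KeyedOp("x", "read",  1, 20, 30),
            new KeyedOp("y", "read",  2, 25, 35));

        // Project the composed history onto one sub-history per key,
        // preserving each op's real-time bounds.
        Map<String, List<KeyedOp>> perKey = history.stream()
            .collect(Collectors.groupingBy(KeyedOp::key));

        boolean allOk = perKey.values().stream()
            .allMatch(LocalityDemo::linearizableRegister);
        System.out.println("every key linearizable, hence composed history too: " + allOk);
    }

    // Brute force: does some real-time-respecting total order replay legally
    // against a single register starting at 0?
    static boolean linearizableRegister(List<KeyedOp> h) {
        return search(new ArrayList<>(h), 0);
    }

    static boolean search(List<KeyedOp> rest, int reg) {
        if (rest.isEmpty()) return true;
        for (int i = 0; i < rest.size(); i++) {
            KeyedOp c = rest.get(i);
            boolean blocked = false;
            for (int j = 0; j < rest.size(); j++)
                // Anything that completed before c was invoked must be ordered first.
                if (j != i && rest.get(j).complete() < c.invoke()) blocked = true;
            if (blocked) continue;
            if (c.kind().equals("read") && c.value() != reg) continue; // illegal read
            List<KeyedOp> next = new ArrayList<>(rest);
            next.remove(i);
            int nextReg = c.kind().equals("write") ? c.value() : reg;
            if (search(next, nextReg)) return true;
        }
        return false;
    }
}
```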
> But in the presence of side channels, you need a truly global clock (like Spanner) to achieve strict serializability.
As I understand it, Spanner's global clocks are a performance optimization, not a correctness condition. If linearizability required a global clock, Zookeeper (http://static.cs.brown.edu/courses/cs227/archives/2012/paper...) and Raft (https://raft.github.io/raft.pdf) wouldn't be able to provide linearizable semantics. It is, of course, possible that these papers are wrong, in which case I encourage you to publish!
(I've since skimmed Charron-Bost & Cori, and it shows that linearizability is not composable when there does not exist a total order of invocation and response events. This might be of use in relativistic scenarios with... accelerating spacecraft which still need to perform linearizable computation? I don't think it's particularly relevant to clocks down here on the geoid.)
I will update the docs to acknowledge that consistency guarantees may be compromised if the relative speed between servers in a cluster or clients is a nontrivial fraction of the speed of light.
VoltDB is a really fun platform. I used it around the v1 and v2 era. Incredibly high tx rate (on a few-hundred-dollar server 5 years ago you could get 150K tx/sec). Being able to use SQL is fantastic. The original research papers (Volt has diverged significantly since, I understand) are good reads[1].
It's a good fit any time you're considering storing a bunch of data in-memory for performance reasons. (Like telecom routing info.) Instead of writing a custom daemon, just spit it into VoltDB. Get replication, performance, etc. for free! Very neat.
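To give a flavor of what "just spit it into VoltDB" looks like, here's a rough sketch of a stored procedure for a hypothetical telecom routing table. The table and column names (routes, prefix, trunk) are invented for this example; the VoltProcedure / SQLStmt / voltQueueSQL pattern is VoltDB's standard Java procedure API, and the procedure would be partitioned on the prefix column in the DDL.

```java
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

public class UpdateRoute extends VoltProcedure {

    // SQL is declared up front so VoltDB can precompile and plan it.
    public final SQLStmt find = new SQLStmt(
        "SELECT trunk FROM routes WHERE prefix = ?;");
    public final SQLStmt upsert = new SQLStmt(
        "UPSERT INTO routes (prefix, trunk) VALUES (?, ?);");

    // The whole procedure runs as one transaction on the partition that owns
    // this prefix; no client-side locking needed.
    public VoltTable[] run(String prefix, String trunk) {
        voltQueueSQL(find, prefix);
        voltQueueSQL(upsert, prefix, trunk);
        return voltExecuteSQL(true); // 'true' marks the final batch
    }
}

// Client side, roughly: client.callProcedure("UpdateRoute", "+4479", "trunk-7");
```

Each invocation is a single round trip and runs serially at its partition, which is where both the throughput and the "get transactions for free" part come from.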
Sadly the open source version isn't very ACID. They dropped the D from the community edition. So you can scale out, but if any node dies, you're toast. There are still some uses: where you're running a transient or easily rebuildable dataset, or where you can manually run multiple full nodes (though I guess you'd need to implement cluster failover yourself).
I guess it shows that it's hard to make a living off of open source products if they're really great. I've heard this from other open-source companies: the product's fantastic, but no one pays. So offer a taste as open source, basically a demo/trial, and get users to upgrade to the commercial version.
This post is all about the failings of version 6.3 but it buries the lede:
> Version 6.4 includes fixes for all the issues discussed here: stale reads, dirty reads, lost updates (due to both partition detection races and invalid recovery plans), and read-only transaction reordering are all fixed, plus several incidental bugs the VoltDB team identified. After 6.4, VoltDB plans to introduce per-session and per-request isolation levels for users who prefer weaker isolation guarantees in exchange for improved latency. VoltDB’s pre-6.4 development builds have now passed all the original Jepsen tests, as well as more aggressive elaborations on their themes. Version 6.4 appears to provide strong serializability: the strongest safety invariant of any system we’ve tested thus far. This is not a guarantee of correctness: Jepsen can only demonstrate faults, not their absence. However, I am confident that the scenarios we identified in these tests have been resolved. VoltDB has also expanded their internal test suite to replicate Jepsen’s findings, which should help prevent regressions.
I read the top part and thought 'oh, another system that fails to meet their claims' but it's pretty impressive that they did the work to fix it. Nice job.
[offtopic] May I ask what was used to create that handwritten-but-not-really font in the images? Is there a program or app that allows using multiple glyphs for the same character?