> When it is applied to the algorithms, it means that an algorithm with the shortest implementation is simpler.
That is misleading. Kolmogorov complexity is the length of the shortest program (in a pre-defined language) that produces a given object. So, if the shortest program that produces Algorithm A is smaller than the shortest program that produces Algorithm B, then Algorithm A is less Kolmogorov-complex ("simpler") than Algorithm B.
This does not mean you can take two existing implementations (in C, say) and compare the implementation length and declare one is "simpler," unless you are claiming that both implementations are as short as possible. Since Kolmogorov complexity is not computable, that seems like a tall order.
Maybe they are right that Single-Decree Paxos is simpler (either in the sense of Kolmogorov complexity or in some other sense, who knows), but invoking Kolmogorov complexity here seems totally unwarranted -- it doesn't add anything substantive.
Thank you! I'll rework this paragraph to be correct. I wanted to make an observation that the given data (two attempts to implement key-value storages while keeping the length of the program as short as possible) favour Gryadka, but of course it's wrong to make strong statements based on just one data point.
I would suggest just dropping Kolmogorov in general, and just referencing the human idea that shorter is generally going to be simpler. I wouldn't have a problem sticking with you through that. Sure, I can feed that to my pedant mill as with anything else, but since it's not really the core idea I'll roll with you on it.
Agreed. The section basically boils down to "despite maybe looking simpler, Single-Decree Paxos is still tricky." I don't think something particular and formal like Kolmogorov complexity even fits the tone there.
Right. There's no strict value to higher or lower complexity in the Kolmogorov sense, so you might as well assign the value to something more tangible (like pseudocode terseness or something)
The problem isn't just the incomputability of Kolmogorov complexity, but the fact that Kolmogorov complexity applies only to finite strings or things that can be meaningfully mapped to them. In particular, it doesn't apply directly to abstract algorithms or programs with multiple implementations.
Really? Do you have an example of a substantial set of programs where that holds? I don't see a priori how it could work unless you can do exhaustive search over the space of all programs and determine if they halt and return the right answer.
Formally, there are quite a few, and they are quite simple to show. For instance, for the string 'a', the shortest Python program which can produce this string is obviously print('a'), since 'a' has only one character.
Additionally, if you limit your language to a recursive language, you can compute complexity (which is no longer Kolmogorov) directly. Simply begin enumerating all programs in order of length, then run them and check the output to see if it is that string. While this is by no means efficient, it works for recursive functions since they must halt. For Kolmogorov complexity, there isn't really a notion of inputs, merely that some particular string should be produced as output.
For recursively enumerable (i.e. Turing complete) languages, the former method will not work because a program might run forever.
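To make the enumeration idea concrete, here's a minimal sketch in Python over a made-up, always-halting toy language (the instruction set and helper names are mine, purely for illustration). Within this toy language the search is exact; mapped back to a richer language it only gives an upper bound:

```python
from itertools import product
from typing import Optional

# A made-up, always-halting toy language: a program is a string of instructions.
#   'a' / 'b' -> append that character to the output
#   '2'       -> append a copy of the current output (doubling it)
# Every program halts, so enumerating programs by length always terminates
# for any target reachable within max_len.
ALPHABET = "ab2"

def run(program: str) -> str:
    out = ""
    for op in program:
        out = out + out if op == "2" else out + op
    return out

def shortest_program(target: str, max_len: int = 12) -> Optional[str]:
    """Enumerate programs in order of length; return the first one producing target."""
    for length in range(1, max_len + 1):
        for ops in product(ALPHABET, repeat=length):
            prog = "".join(ops)
            if run(prog) == target:
                return prog
    return None  # not reachable with programs up to max_len

print(shortest_program("a"))     # 'a'
print(shortest_program("abab"))  # 'ab2' beats the literal 'abab'
```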
Interesting. This way you get an upper bound on the Kolmogorov complexity of any given string. Do you know how good this approximation is in the worst/best/average case? This sounds like the kind of question that would have been investigated.
The hard part here is not finding a set of programs that always halt (primitive recursive functions can compute anything you'd ever want to run on large inputs), but proving that they are correct (I'm pretty sure equivalence of primitive recursive functions is undecidable).
Edit: but if someone gives you the primitive recursive Kolmogorov complexity of a program, you can check it by running all shorter programs on all inputs until you find a counterexample for each of them. So it is semi-decidable.
Edit to edit: This would even work for the general Turing-machine definition of Kolmogorov complexity.
> The hard part here is not finding a set of programs that always halt [...], but proving that they are correct (I'm pretty sure equivalence of primitive recursive functions is undecidable).
This may be true, but:
If you use primitive recursive functions (instead of Turing-complete ones) because of practicality, with the same reasoning you can cap the inputs at some insanely large number. Then, these functions still "can compute anything you'd ever want to run on large inputs".
In that setting, equivalence is decidable, because you can simply run both functions over the finite set of all possible inputs.
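For what it's worth, a minimal sketch of that argument in Python, with the cap shrunk to something actually testable:

```python
from typing import Callable

def equivalent_up_to(f: Callable[[int], int], g: Callable[[int], int], cap: int) -> bool:
    """Decide equivalence of two total functions over the finite domain 0..cap."""
    return all(f(x) == g(x) for x in range(cap + 1))

# Two different-looking implementations of the same function.
iterative = lambda n: sum(range(n + 1))
closed_form = lambda n: n * (n + 1) // 2

print(equivalent_up_to(iterative, closed_form, cap=10_000))  # True on the capped domain
```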
I have the feeling that this article is just bashing strong-master consensus protocols. But the truth is, yes, they incur a penalty for electing a master.
However, this really gets amortized in most workloads if the leader changes only rarely. Additionally, in an environment with a good network connection between nodes (a few ms), you can set the timeout to be much less than a few seconds (it could actually be less than a second); that way you have a shorter unavailability window.
There's another point he touches on, about the unavailability of the whole cluster when the leader is down. Really, this isn't something dependent on the protocols, but on the applications. If you have one paxos/raft group per replica, you actually only get a small unavailability. Additionally, even consistent reads do not need a living master to be possible.
It's wrong. If you're fine with stale reads then you don't need a living master, but if you want a guarantee that the read value is up-to-date then a living master is necessary.
It may not be up to date at the time of actually READING the data, but it will be up to date as of the moment the client requested it. You can actually achieve this using timestamps. (So basically, it's semi-stale data, you're right, because you have a guarantee about the timestamp at which it was up to date, and that timestamp can be recent.)
(Not talking about etcd, check out the spanner paper)
In order to get a linearizable read in Spanner, you do need to have a heartbeat from the leader of the Paxos tablet for the data you are reading in order to know that all data has been replicated up to the time at which you are taking your read. This is not free, but can be cheaper than active communication. (referred to as t_safe in the paper)
As I understood it, if you are actually reading at some point in time (let's say 100 ms ago), then the non-leader node can check for itself whether it was in contact with the master after that point in time and give you an answer.
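Something like this sketch, where safe_time plays the role of the paper's t_safe (the names and structure here are made up for illustration, not Spanner's actual design):

```python
class FollowerReplica:
    """Illustrative only: a follower answers a read at timestamp read_ts without the
    leader, as long as nothing committed at or below read_ts can still be missing."""

    def __init__(self):
        self.versions = {}    # key -> list of (commit_ts, value), appended in order
        self.safe_time = 0.0  # replica is complete for all timestamps <= safe_time

    def apply(self, key, value, commit_ts, new_safe_time):
        self.versions.setdefault(key, []).append((commit_ts, value))
        self.safe_time = max(self.safe_time, new_safe_time)

    def read_at(self, key, read_ts):
        if read_ts > self.safe_time:
            raise RuntimeError("not safe yet: wait for replication or ask the leader")
        visible = [v for ts, v in self.versions.get(key, []) if ts <= read_ts]
        return visible[-1] if visible else None
```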
TrueTime is a very special case. I have not seen anyone else with the same hardware infrastructure required to provide guarantees related to wall-clock time. Spanner also trades off transaction latency to create the wall-clock time guarantees, so it's not magically free.
It isn't, I haven't said it is. The point is, there are disadvantages you are writing about, that aren't really disadvantages of the protocol itself, but disadvantages of the common implementations.
Yes, that's true: if all the communication between agents goes through the database, then the Monotonic Reads consistency level is equivalent to linearizability.
If there is communication outside the database, then this level of consistency isn't enough.
It depends on the implementation. If you have a replication system that generates a monotonic sequence ID for each update, and you have all writes block on a majority of replicas acking an update, then you can just read from a majority and pick the result with the highest monotonic identifier. This will not be consistent unless the leadership mechanism demands that leadership is chosen based on the same criteria.
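A rough sketch of that read path (names are invented; it assumes each replica exposes a get(key) returning the acked (seq_id, value) pair):

```python
def quorum_read(replicas, key):
    """Read from a majority and return the value with the highest monotonic sequence id.
    Only sound if every write was acknowledged by a majority under that same id."""
    majority = len(replicas) // 2 + 1
    responses = []
    for replica in replicas:
        try:
            responses.append(replica.get(key))   # (seq_id, value)
        except ConnectionError:
            continue                             # tolerate unreachable replicas
        if len(responses) >= majority:
            break
    if len(responses) < majority:
        raise RuntimeError("could not reach a majority")
    return max(responses, key=lambda r: r[0])[1]
```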
> If you have one paxos/raft group per replica, you actually only get a small unavailability
It isn't a small unavailability, it's an unavailability of the whole replica. If you don't have a lot of data then it's the whole cluster :)
Of course, we can introduce something like virtual replicas and eventually end up with a replica per key, which is almost Single-Decree Paxos but with an overhead for log compaction and snapshotting.
Well, the unavailability is around 1/(number of machines in cluster) with a sufficient number of replicas. That said, it's small relative to a whole-cluster unavailability, and with small leader timeouts it gets much better than you described in your article.
Btw, thanks for answering my comments actively, I really appreciate it!
It is my understanding that the motivation in seeking out consensus algorithms with strong leaders (or equivalent), as opposed to horizontally weighted peer-to-peer ones, is the performance penalty imposed by the latter in the general case. Structuring the protocol as a dissemination from the 'leader' node down to the followers, as opposed to a bottom-up approach, fares better when your leader is long-lived, circumventing the cost of determining 'consensus' every single time. It's readily apparent that this would lead to a performance penalty in the pathological case, as is demonstrated here, when the leader node is taken down repeatedly -- but I'm skeptical that this is true for the workloads that systems like coreos/etcd and cockroachdb/cockroach were intended to handle.
The gist of it is that there are realistically 2 quorums to be had: one for read/write to a cluster, and one for leader election. Assuming leader election is a relatively uncommon event (as it is in Paxos), you can require more quorum nodes for leader election while allowing fewer quorum nodes for read/write. This means you get strong consistency with better performance, as the sketch below illustrates.
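Roughly, the sizing rule is just that the two quorums must overlap (a toy check in the spirit of what's described above, not any particular system's code):

```python
def quorums_intersect(n_nodes: int, q_election: int, q_read_write: int) -> bool:
    """Every election quorum must share a node with every read/write quorum,
    otherwise a freshly elected leader could miss acknowledged writes."""
    return q_election + q_read_write > n_nodes

# With 5 nodes: demand 4 votes to elect a leader, but only 2 acks per write.
print(quorums_intersect(5, 4, 2))  # True  -> safe, and the common path is cheaper
print(quorums_intersect(5, 3, 2))  # False -> unsafe, the quorums can be disjoint
```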
Distributed systems are one of my favorite things. I literally spent part of an evening with a beer reading a small handful of papers like this one. If you want an absolute treasure trove of similar works, visit the blog of Adrian Colyer:
> I literally spent part of an evening with a beer
Just one? :)
Consensus algorithms are _fascinating_! And +1 for https://blog.acolyer.org/ , I'll often read his blog over lunch.
Another great paper is https://research.google.com/archive/paxos_made_live.html which talks about what it's like to realistically run Paxos. There's a ton of corner cases, engineering realities, and optimizations to be done to make Paxos work well in practice.
Another good resource is the Raft GitHub page[1] which links to the paper, has an interactive visualization, and a plethora of talks by various people.
Raft is the backbone of a lot of open-source systems. I'd be curious to hear from any Googlers in the know whether there are deficiencies in Raft that lead to continued use of Paxos, or whether it's experience (and already battle-tested code) with Paxos that leads them to continue deploying Paxos-backed systems.
If you've got something battle-tested, there isn't a lot of value in ripping it all up if it works within the given business requirements. Also, they figured out how to implement Multi-Paxos, which is known for being difficult.
Btw, Gryadka uses an idea of quorums with non-standard size to change the configuration of the cluster, see http://rystsov.info/2016/01/05/raft-paxos.html#details1. I came up with this idea independently of Howard when I misread the Vertical Paxos paper :)
yup, I've been closely following Heidi's work for some time now - non-intersecting consensus groups, still hoping someone beats me to a Go implementation so I don't have to.
+1 for Adrian Colyer's blog, constantly amazed at the sheer volume and depth of his analyses.
The raft paper states that its use of an explicit leader election was chosen for understandability, not performance. Leaderless systems can potentially improve performance, especially when replicas are far apart (so you don't have the extra network hop to the leader before you can get started). The main drawback IMHO to ePaxos (as a hopefully representative leaderless algorithm) is not performance but its complexity, since it intrudes into the application to track dependencies and conflicts between proposed commands.
If I squint at the scheme proposed here, it looks like ePaxos taken to the extreme of a single key/value per paxos group, so conflict tracking becomes trivial. It has demonstrated good performance in the case of uncontended writes; I'd be curious to see how it behaves under contention, or when a downed node rejoins the cluster.
> It is my understanding that the motivation in seeking out consensus algorithms with strong leaders (or equivalent) as opposed to horizontally weighted peer-to-peer ones is due to the performance penalty imposed by the latter in the general case.
Yeah, that was my understanding as well but without having read the paper in question (as of yet) I'm entirely short on the particular technical details of the trade-offs.
I didn't read the whole thing carefully, but...
Leader Election + Terminating Reliable Broadcast is practically equivalent to building the Weakest Failure Detector and is the most feasible way of doing so. So, there are good reasons for doing leader election. This is why almost every serious consensus system ends up doing it whether they know they need to or not.
If you can't afford stale reads then you should ask Etcd (with ?quorum=true) to wait for confirmation from a majority of followers before acknowledging a read. It makes Etcd do one round trip for reads (just like Gryadka).
The same is applicable to other products (you can't rely on time in distributed systems, unless you're Google; consequently you can't rely on read leases).
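To make the quorum read above concrete, a sketch assuming the etcd v2 HTTP API on the default local port (the key name is made up):

```python
import requests

# quorum=true makes Etcd confirm the read through the quorum protocol instead of
# answering from possibly-stale local state, at the cost of an extra round trip.
resp = requests.get(
    "http://127.0.0.1:2379/v2/keys/config/leader",
    params={"quorum": "true"},
)
print(resp.json())
```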
etcd didn't implement leases; it just assumed that the last Raft election was still good.
You can elect a leader for a set period of time (say, 10 seconds) and serve strong reads for a lesser period of time (5 seconds) if you have reasonable assumptions about how well your local oscillators work and can avoid jumps. If you don't trust your local clock to any level of accuracy, why do you trust your local CPU?
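A minimal sketch of that split (illustrative names only; it leans on a monotonic clock rather than wall-clock time):

```python
import time

ELECTION_TERM = 10.0  # seconds followers promise not to elect another leader
READ_LEASE = 5.0      # shorter window during which the leader answers reads locally

class LeaseHolder:
    def __init__(self):
        self.lease_start = None

    def on_elected(self):
        # time.monotonic() ignores wall-clock jumps (NTP steps, manual resets),
        # so only oscillator drift and scheduling pauses remain as risks.
        self.lease_start = time.monotonic()

    def can_serve_local_read(self) -> bool:
        if self.lease_start is None:
            return False
        return time.monotonic() - self.lease_start < READ_LEASE

    def term_expired(self) -> bool:
        return (self.lease_start is None
                or time.monotonic() - self.lease_start >= ELECTION_TERM)
```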
Making implicit assumptions about the environment is wrong.
An instance may be running in a virtual environment where time freezes are possible. A human may make an error and roll back the time to 1970.
It's impossible to eliminate all these factors, so yes, I don't trust time, but I do trust the CPU.
Maybe CockroachDB is doing it correctly but the terrible default settings make this optimisation negligible because when the leader dies the system hangs for 12 seconds.
When looking for a per-key strongly consistent but still highly available datastore I found Hibari, which uses chain replication. Really interesting tech that tries to solve the same problem: http://www.snookles.com/scott/publications/erlang2010-slf.pd...
Hibari does have a master orchestrator, similar to the GFS master server, but it's only needed in reconfiguration events.
I think this overstates the instability of strong leader-based consensus algorithms, or at least over-generalizes Raft's instability to apply to all consensus algorithms with a stronger leader.
In Raft, it's possible for multiple nodes to prevent leader election progress by being overly aggressive when requesting votes. It is also possible for a follower to knock out a perfectly healthy leader by being too quick to time out the leader and start a new term.
Both of these limitations stem from the simplicity of Raft's leader election algorithm. To compensate, most Raft implementations I've seen have more conservative follower timeouts that extend the time to detect leader failure and elect a new one.
It's possible for a more optimized algorithm to get sub-second latencies for detecting and re-electing a leader, even in a high-latency (e.g. geo-distributed) environment. In other words, well within the commit window for the replica set based on network hop latencies.
Also, while the latency for individual writes in single-decree paxos can be closer to strong leader protocols, it is non-trivial to achieve the same level of throughput that is possible in Raft et al when writing to an ordered log, as you cannot start a paxos instance for a new log entry until all prior instances have been resolved. Raft can just add new values in the next append call (or just spam out appends for new messages w/o waiting for the replies for previous ones).
IME, I'd say both single-decree paxos and raft are probably equivalent in terms of understandability, but raft is a better base on which to build a fast high-throughput consensus protocol.
> cannot start a paxos instance for a new log entry until all prior instances have been resolved
It's wrong. In Gryadka (Single-Decree Paxos) all the keys are independent, so it's possible to update them at the same time without blocking.
Gryadka's throughput is comparable to Etcd's on the same type of machines (4720 vs 5227 rps), and I never optimized for it (my goal was to fit 500 lines), so it's also wrong that it's "non-trivial to achieve the same level of throughput" -- I did it by accident.
So I don't understand why Raft is a better base to build a fast high-throughput consensus protocol.
This is true, but comes at the cost of not being able to preserve linearizability across arbitrary keys, and you are still limited by the throughput of updates to a single key.
Fair enough, though at that point, you're layering on additional complexity and getting further away from the goal of simplicity. (And to point 3, while this is true, using a log means this problem can be dealt with a lot later, and there are solutions that don't involve giving up linearizability, such as a dedicated set of log replicas apart from data partitions, or implementing something like Calvin.)
My main point was that the deficiencies of strong-leader-based consensus protocols are overstated, and despite a (minor IMO) level of additional starting complexity, a raft-like protocol is going to be quite a bit simpler than a paxos-based protocol of equivalent capability.
1. I am technically pretty strong but I have no idea what this paper is about.
2. So many people know what this is about that it shot up to #1 on HN.
Can someone give a pointer (a link or two) to the lay, interested audience here about what the field IS? Just a sort of intro guide for someone who knows about programming and math, but has never heard the term Paxos.
I am curious, and I am sure many others are as well.
Wikipedia is pretty good. The page for "consensus algorithm" has links for Paxos and Raft.
The basic idea is how to coordinate multiple independent agents over an unreliable network. For example, multiple servers trying to manage a shared database. Lots of HN people know about this because it's a key building block of reliable and scalable cloud computing.
These consensus algorithms are awesome! They are the backbones of many highly available systems in modern DC's today. It's how people manage to get any sleep at all when taking care of systems that require very reliable databases.
Raft is a consensus algorithm that is touted as being easy to understand. It is well specified, compared to Paxos, which leaves many implementation details up to the implementer; Raft is laid out fairly explicitly on page 4 of this paper:
https://raft.github.io/raft.pdf
Consensus algorithms allow us to build reliable castles out of constantly failing servers (sand). Systems like zookeeper, etcd, chubby, consul, and others use these algorithms to achieve high availability AND strong consistency (linearizability, the strongest possible) despite up to a majority of the cluster failing.
> For example, instead of using a distributed log as an Event Sourcing backend for building a key/value storage as an interpretation of an ordered stream of updates we can run a distributed log instance per key.
That is a key insight. I often wonder if people who implemented Raft and discarded Paxos right off the bat knew this? Also I think "Paxos Made Live" scared everyone away from Paxos for a long time.
But what is often missing is that Google implemented a distributed log system. Paxos doesn't do a distributed log by default; it just deals with reaching consensus on a value. In practice there is often a need for a log, but not always. If a distributed log is not needed, Paxos becomes less scary.
Good to see more leaderless consensus protocol implementations, and that Raft isn't the be-all and end-all of all consensus problems.
One advantage of leader-based consensus algorithms not mentioned in the article is the ability to more easily implement read leases for faster reads.
Read leases can allow for fresh reads without having to run them through the quorum protocol, by trading off availability (due to leader failure) and also correctness in certain edge cases (since read leases depend on the ability of individual machines to measure time deltas with reasonable accuracy, which may not hold in some weird VM scenarios).
I've been curious to know if one could encode performance constraints -- or at least probabilistic bounds -- in a model and have the checker invalidate a design that steps over them.
Paxos's complexity is often overstated. I made a simple implementation of Paxos for a key-value store as a proof of concept to demonstrate how you can simplify Paxos by removing leader election, multi-master management, and log garbage collection.
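For a sense of scale, the acceptor half really is tiny; here's a generic sketch (not my actual code, with networking and durability left out):

```python
class Acceptor:
    """The acceptor side of Single-Decree Paxos for one register (e.g. one key)."""

    def __init__(self):
        self.promised = 0          # highest ballot this acceptor has promised
        self.accepted = (0, None)  # (ballot, value) of the last accepted proposal

    def prepare(self, ballot):
        if ballot <= self.promised:
            return ("nack", self.promised)
        self.promised = ballot
        # Hand back the last accepted proposal so the proposer must adopt its value.
        return ("promise", self.accepted)

    def accept(self, ballot, value):
        if ballot < self.promised:
            return ("nack", self.promised)
        self.promised = ballot
        self.accepted = (ballot, value)
        return ("accepted", ballot)
```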
How heavily have you tested it? I haven't tried implementing Paxos myself, but anecdotally, it's very hard to get it completely right. And when it's being used as key low-level infrastructure, it has to be completely right.
This blog post we're discussing agrees: "I planned to finish it in a couple of days, but the whole endeavor lasted a couple of months, the first version has consistency issues, so I had to mock the network and to introduce fault injections to catch the bugs. [...] Single Decree Paxos seems simpler than Raft and Multi-Paxos but remains complex enough to spend months of weekends in chasing consistency."
Your approach of using many small Paxos instances is very interesting, though! I'd love to see some real world performance comparisons.
Indeed pocdb is not intended to be a fully "ready" implementation. There are likely liveness bugs (that can be mitigated by periodically calling work_state_machine on every write outstanding), but it is unlikely to have safety bugs stemming from protocol misunderstandings (any safety bug is likely a small typo).
My main projects using Paxos are Replicant[1] and Consus[2], both of which have consumed significantly more time to get Paxos correct.
I think this is caused mainly by the fact that Leslie Lamport first wrote "The Part-Time Parliament", which was said to be complex because of the language he used there. I think this is the main cause, as there are now a lot of materials available for understanding Paxos.
JS vs sparse C++? Cut comments and change coding style and you can easily get close to 500 lines.
One thing I think is worth touching on is the pipelining ability of multi-paxos that is missing when you have a single register with the Synod protocol. For key-value operations, it's not a problem. For a true replicated state machine, this can hinder performance.
In Replicant[1] I added pipelining to ensure Paxos was unlikely to be a bottleneck to replicated state machines. For complex state machines, the state machine becomes CPU bound.
It's strange that I can't find anything on implementing a CAS register this way. It seems like a relatively straightforward combination of single-decree paxos and the ABD algorithm. Of course, reasoning about distributed algorithms is never simple, and this still isn't...
I didn't get anything out of this blog-post except a bunch of numbers based on the default parameters (such as leader election timeout) of various systems. Here is the link to the EPaxos paper it purports to discuss though:
I don't have a deep understanding of how ZAB works but based on the documentation (initLimit) and prior experience with ZooKeeper, ZAB has the same issues as Raft when a leader dies.