Neat Algorithms: Paxos

krat0sprakhar · on Dec 28, 2014

This is up for debate but this[0], IMHO, is pretty much the gold standard of explaining distributed algorithms.

[0] - http://thesecretlivesofdata.com/raft/

nacs · on Dec 28, 2014

Thanks the presentation, helps a lot.

Quick question though. On this slide ( http://i.imgur.com/m02CMxx.png ) it shows a network split condition and shows how the 2 split networks will eventually negotiate and the 3 node split wins because it had a majority while the 2 node side's uncommitted changes are thrown out.

What happens if the split happens right down the middle (3 active nodes on each side instead of the 2 and 3)? Wouldn't both sides elect leaders that both have majorities with committed data?

on Dec 28, 2014

[deleted]

nacs · on Dec 28, 2014

Ah, thanks for the response. I didn't realize the majority encompassed the entire cluster and not just a majority within the split.

Retr0spectrum · on Dec 29, 2014

Surely that means that none of the nodes will reach a concensus, so the cluster cannot be updated?

Chris_Newton · on Dec 29, 2014

Isn’t that what we need with Raft, in order to guarantee reaching eventual consistency after reconciling any pending changes when a division into multiple partitions is healed?

For example, suppose we have two nodes and they are placed into separate partitions. If we allow an exact 50% share to count as a “majority” then both nodes can potentially continue to accept and acknowledge updates during the division, because each is capable of updating a majority of the nodes in the cluster (in this case, just itself, but the same argument applies if we have larger numbers of nodes in each half). However, when the division is healed, a simple check to see which leader node has the higher term count and allow its log to take precedence is no longer sufficient.

The problem is that we have violated the State Machine Safety guarantee described in the Raft paper (that is, we can have different nodes that have different committed entries at the same index in their respective logs). We now have no way to resolve the conflict, because those updates were already committed by some of the nodes in the cluster and have been acknowledged by their respective leader nodes, so now we can’t just append any extra committed log entries from one node to another node’s log to get back to a consistent state.

matchu · on Dec 29, 2014

My first thought for solving that problem would be to always have an odd number of nodes, but I'm interested in how Raft actually handles it, too.

deathanatos · on Dec 29, 2014

> My first thought for solving that problem would be to always have an odd number of nodes, but I'm interested in how Raft actually handles it, too.

You don't need an odd number of nodes. Raft requires that a majority of the cluster be able to communicate in order to make forward progress. In the event that a majority can't communicate, no changes to the cluster can be made. This isn't particularly unique to Raft; any distributed consensus algorithm that wishes to maintain consistency has this: if your cluster is split in such a way that a majority can't communicate, you can't both accept writes and prevent the cluster from becoming "split-brained", i.e., having two states, one on each side of the split. If a majority of nodes can communicate, you know that your side of the split is unique in this regard, and can thus keep going. All other splits, not having a majority, will not be able to accept writes.

A note about majorities: The definition of majority is "_More than half_ (50%) of some group"; in a cluster of size 5, this is at least 3. In a cluster of size 6, this is at least _4_. Because it is more than half (and not just half), even sized clusters are just fine in Raft: in a 3/3 split in a size 6 cluster, neither side has majority by definition.

matchu · on Dec 29, 2014

Yeah, so that's the problem we're talking about: the presentation implied that a split network can still accept writes, because one side will have a majority. But what if neither side does? Is the cluster just in read-only mode?

But, as I rephrase the problem, I think I observe that splitting the network in two is a rather general case of partitioning. If we partition each node to be on its own, then of course nobody has a majority, so an odd number of nodes doesn't solve the general partition problem.

matchu · on Dec 29, 2014

edit: And by "general case of partitioning" I mean exactly the opposite. Whoops.

nacs · on Dec 29, 2014

I assumed the same thing from that response.

Does the cluster just remain in a read-only mode during that time? Maybe someone else knows.

DAddYE · on Dec 28, 2014

I agree, I would also add that Paxos seems exceptionally hard to get it right or at least this was what I saw so far in popular projects (i.e. Cassandra) that use it.

atmosx · on Dec 28, 2014

That's impressive, thanks for sharing.

jackmaney · on Dec 28, 2014

Neat. Thanks!

ahelwer · on Dec 28, 2014

Software engineers love Paxos because it takes something very complex (a distributed system) and makes it equivalent to working with a single machine: you only ever talk to the leader. It gives you redundancy at the expense of performance.

Paxos is used to achieve something called Strong Consistency, where each node sees the same message in the same order. If you think of each node as a deterministic state machine, they are guaranteed to end up in the same state after responding to the same sequence of messages. It's nice and intuitive, but requiring global synchronization on every write is terrible for performance.

Other consistency schemes exist. A popular one is Eventual Consistency, where writes are made immediately at any node (not just the leader) and the system is expected to synchronize in the background and "converge" to the same state. However, this can result in merge conflicts: if you're editing a document in collaboration with other users, what if you edit a word in a paragraph while another user deletes that entire paragraph? Does the system resolve this automatically, or require user assistance? The answer to this question varies according to system requirements. I think most HN users have experienced the joys of resolving merge conflicts.

A newer model is something called Strong Eventual Consistency, which is similar to Eventual Consistency but merge conflicts are impossible by design: every update to the system must be commutative, associative, and idempotent with other updates. It is not always possible to design your system this way. These systems are implemented with Conflict-Free Replicated Data Types (or ad-hoc equivalents) and have excellent liveness/throughput/performance characteristics compared to Strong Consistency.

CRDTs are not as simple as Paxos. You're forced out of the cozy one-system world and your system must deal with two nodes concurrently holding different values. For most applications, magic Paxos dust is all you need. For others, CRDTs are an excellent tool.

steventhedev · on Dec 28, 2014

I strongly suggest Shapiro's paper[0] on CRDTs. In a nutshell, the only really problematic data type are sequences such as arrays or strings. There are some specialized approaches specifically for those, and you can always fall back on a LWW conflict resolution.

In general, I like to think of Paxos as an approach that uses LWW for all value types.

[0] - http://pagesperso-systeme.lip6.fr/Marc.Shapiro/papers/RR-695...

ahelwer · on Dec 28, 2014

Shapiro's name is on about seven different CRDT papers :) it makes citations difficult for the Wikipedia article. Personal opinion, this[0] 2011 paper is probably the best one to read, where Shapiro's and Baquero's teams finally joined forces and put out a good comprehensive paper on the subject. The one you linked focuses a bit too heavily on TreeDoc in lieu of good treatment of CRDT theory. Their survey of known CRDTs[1] is also worth reading.

[0] https://hal.inria.fr/file/index/docid/609399/filename/RR-768...

[1] https://hal.inria.fr/inria-00555588/document

steventhedev · on Dec 29, 2014

Oops. I had meant to link the survey. His lecture[0] on the subject is quite approachable, for those who prefer visuals to papers.

[0] http://youtu.be/ebWVLVhiaiY

otis_inf · on Dec 29, 2014

IMHO 'Eventual consistency' and related 'consistencies' are not really giving a consistent state, as they're more about a promise that the system will reach a consistent state 'eventually', but when that happens is unknown: in a volatile database, the system could never reach a consistent state, due to the lack of linearizability. See: http://hackingdistributed.com/2013/03/23/consistency-alphabe...

ahelwer · on Dec 29, 2014

Wow! Great blog post. I've been looking for something along these lines for a while. The author is correct that Wikipedia's coverage of consistency is difficult to follow, and I don't yet have a deep enough understanding to contribute.

nwmcsween · on Dec 28, 2014

Can you prove authorship with CRDTs? scuttlebutt[0] moved from a CRDT implementation to a log based structure of signed messages[1].

[0]: https://github.com/dominictarr/scuttlebutt

[1]: https://github.com/ssbc/secure-scuttlebutt

ahelwer · on Dec 28, 2014

I'm not sure what you mean by proving authorship, could you explain?

emin-gun-sirer · on Dec 28, 2014

This well-illustrated post is technically about the core Synod agreement protocol in Paxos. Building a consistent distributed service on top requires additional scaffolding and infrastructure. Typically, people layer on a system that implements a "replicated state machine (RSM)" on top, which maintains the illusion of a single consistent object, even though it is composed of distributed replicas.

Also keep in mind that Raft, Zab, and View-Stamped replication (in reverse chronological order) are alternatives to the Synod protocol in Paxos. These protocols differ from Paxos by employing a different leader-election mechanism and slightly different way of maintaining their invariants.

There have been many Paxos variants. This site [1] shows the various Paxos variants over a timeline and points out their contributions.

Those of you interested in building replicated state machines using Paxos should take a look at OpenReplica [2]. It is a full Multi-Paxos implementation that takes any Python object and makes it distributed and fault-tolerant, like an RPC package on steroids.

[1] http://paxos.systems/

[2] http://openreplica.org/faq/

ahelwer · on Dec 28, 2014

It looks like you are one of the developers of OpenReplica?

emin-gun-sirer · on Dec 28, 2014

Yes, it's an open-source project from my research group at Cornell.

amelius · on Dec 28, 2014

Also interesting: [1]

> Raft is a consensus algorithm that is designed to be easy to understand. It's equivalent to Paxos in fault-tolerance and performance. The difference is that it's decomposed into relatively independent subproblems, and it cleanly addresses all major pieces needed for practical systems. We hope Raft will make consensus available to a wider audience, and that this wider audience will be able to develop a variety of higher quality consensus-based systems than are available today.

[1] https://raftconsensus.github.io/

SamReidHughes · on Dec 29, 2014

However, see http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-857.pdf about Raft's pitfalls in environments where the network is being difficult.

samirmenon · on Dec 28, 2014

This is a great walkthrough of Raft:

http://thesecretlivesofdata.com/raft/

ash · on Dec 29, 2014

> Honest-to-goodness real-life implementations of Paxos can be found at the heart of … Google’s magnificent Spanner database…

I'm not sure about Spanner and Paxos. Sebastian Kanthak said during his Google Spanner talk:

"If you've been to the Raft talk this morning, our Paxos implementation is actually closer to the Raft algorithm than to what you'd read in the Paxos paper, which is… if you haven't read it, don't read it, it's horrible." (at 7:43)

http://www.infoq.com/presentations/spanner-distributed-googl...

Animats · on Dec 29, 2014

Nice animations.

How do you keep a broken or hostile node from advancing the sequence number to the end of the sequence number space?

There's an algorithm from one of Butler Lamson's grad students at MIT which fixes this, but it seems to require one more message per cycle. (http://pmg.csail.mit.edu/~castro/thesis.pdf) That paper later appears as a Microsoft Research paper on how to make an NFS-like file system with this consensus properly. Did Microsoft ever put that in a product?

lordnacho · on Dec 28, 2014

Having looked for a few minutes, it really reminds me of the routing protocols used for distributing routes in networks. (Also Layer 2 stuff IIRC). There you also find heartbeats, elections, etc.

Is there a connection?

Also, does it have anything to do with the Byzantine Generals Problem?

ahelwer · on Dec 28, 2014

Heartbeats (is this system up?) and leader election (which server should I talk to?) are common components of any distributed system. Byzantine generals is a much different problem. Whereas Paxos gets the majority of nodes (a quorum) to agree on a certain value, Byzantine Generals deals with unanimous agreement in a system with network link failures. Furthermore, in Byzantine Generals, some of the generals can be traitors actively working against the others. Unanimous agreement isn't all that useful or interesting, but production distributed systems should handle Byzantine failures. This covers innocent events such as corrupted data/messages and software bugs all the way up to hackers trying to take over your system through a compromised node. Variants of paxos exist which handle or mitigate Byzantine failures.

neilc · on Dec 28, 2014

> production distributed systems should handle Byzantine failures

I don't think this is true, at least not as stated. You should certainly be thinking about more than crash-stop failures, but full-blown Byzantine fault tolerance is rarely warranted in practice. Empirically, the number of systems that use non-Byzantine vs. Byzantine agreement is probably on the order of 100:1.

ahelwer · on Dec 28, 2014

Good point. Depends on your threat model. Most systems can get away without full-blown Byzantine fault tolerance. You're probably right about the 100:1 ratio for production systems.

sinwave · on Dec 28, 2014

Well for one, both Paxos and "Byzantine Generals" were authored by Leslie Lamport (latest Turing award winner IIRC).

But Paxos has gentler assumptions about adversaries than in the Byzantine Generals problem.

subbu · on Dec 29, 2014

The animations are a bit fast. Would've been great if the reader could control them. Easier to learn.

_almosnow · on Dec 28, 2014

> Side note: it’s important that no two proposers ever use the same sequence number, and that they are sortable, so that they truly reference only one proposal, and precedence between proposals can be decided using a simple comparison.

They are moving the core problem into a different domain. Worst explanation of PAXOS ever... nice animations though.

Edit: 'Worst explanation' is just an exaggeration, obv. It is nice, but doesn't explain really important issues.

zerker2000 · on Dec 29, 2014

The requirement is that they are sortable, nothing about the numbers reflecting the actual order of proposals (inasmuch as uncommitted actions in a distributed system can be considered to be ordered). Concatenating system time and node number upholds this property: 4:01-A and 4:01-B produced by nodes A and B respectively are distinct and numerically sortable, and both in turn "predate" 4:13-C attempted later.

fitshipit · on Dec 28, 2014

> Know Paxos? Stealth-mode big data startup is hiring founding engineer

lol