The Bug in Paxos Made Simple

benschulz · on Nov 18, 2021

Quoting from the paper's description of Phase 1:

> (b) If an acceptor receives a prepare request with number n greater than that of any prepare request to which it has already responded, then it responds to the request with a promise not to accept any more proposals numbered less than n and with the highest-numbered proposal (if any) that it has accepted.

It says an acceptor must respond with the highest-numbered proposal (if any) that it has accepted.

How is acceptor C going to do that after step 9? That's where the bug is introduced, I think, not anywhere in the paper.

ignoramous · on Nov 18, 2021

mjb (author) links to an SO post which goes:

> If you look at _Lamport's paxos proofs_ he treats an accept as a promise... But this is not pointed out in _Paxos Made Simple_. In fact, it appears Lamport took great pains to specify that an accept was not a promise.

> The problem is when you combine the weaker portions of both variants; as the OP did and several implementations do. Then you run into this catastrophic bug.

That is, Lamport's proofs and Paxos Made Simple contradict each other ever so slightly to trip everyone all at the same time:

> What's more I looked through several proprietary and open-source paxos implementations and they all had the bug submitted by the OP!

https://stackoverflow.com/questions/29880949/contradiction-i...

Lamport himself acknowledges as much: https://www.youtube.com/watch?t=4398&v=8-Bc5Lqgx_c

benschulz · on Nov 18, 2021

Lamport acknowledges the report, not the error. I'm still not convinced there even is an error.

The state after step 9 should be the same as after step 7, i.e. `A(-:-,100) B(100:b,100) C(100:b,-)` because C needs to retain the "highest-numbered proposal" it accepted, not the one it accepted last. That means the nine steps outlined in the post/on StackOverflow do not, by themselves, demonstrate any problem.

So what additional steps are missing/what alternative steps actually produce an inconsistency/divergence?

ignoramous · on Dec 1, 2021

There isn't an "error" / "bug" in either papers when one looks at them in isolation. Together, they seem to have a very subtle difference in the algorithm to trip most distributed systems practioners.

nvarsj · on Nov 18, 2021

It's a misunderstanding of the paper I think. The blog post author and SO post both make the mistake of altering the Paxos Made Simple algorithm to allow sending phase 2 messages to any majority set of acceptors. But that doesn't work unless you also alter the P2b wording so that an acceptor also treats prior accepts as promises. I can see there is a slight ambiguity in the reading of the PMS paper itself that would make you think this is okay, but saying it is a bug in the paper is a bit of stretch.

masto · on Nov 18, 2021

Knowing nothing about P or modeling distributed systems, I wish the author had shown their code and an example of how the model checker found the problem.

mprime1 · on Nov 18, 2021

Paxos made simple is a human-friendly condensed recap of the original Paxos paper which came with a 20-page TLA+ spec.

As one can imagine, shrinking something down and making it easy is a lossy process. The text of the ‘recap’ paper can be ambiguous.

Calling this out as a ‘bug’ seems like sensationalism.

jorangreef · on Nov 18, 2021

I like the notation:

  A(-:-,100) B(100:b,100) C(100:b,-)

Just reading to that line, it was clear: what was going to go boom, and what the solution(s) would be.

newbie789 · on Nov 18, 2021

Wait the author says that this isn’t a bug in Paxos but one that he intentionally and arbitrarily introduced? The ‘bug’ here is maybe a bit of ambiguity from a book?

Edit: I appreciate being corrected on this.

_t4za · on Nov 18, 2021

> The Bug in Paxos Made Simple

Emphasis mine, the bug is not in the algorithm but in the way it is described in the paper. I can see how the author's choice of words is a bit confusing here.

erikerikson · on Nov 18, 2021

Actually, it is not in the paper either. Instead it is in the blog post author's reading of the paper.

> So if you don't quite follow the letter of the text about acceptor selection, and then do follow the text about how acceptors handle accept messages, then you end up with the bug

Note the "don't quite"

masklinn · on Nov 18, 2021

But per the linked SO thread this is a very easy mistake to make. If the paper is not perfectly clear and does not call attention to this subtetly, I would say that is still an issue with the paper.

Then again I don’t subscribe to the idea that doxumented misbehaviour is fine do ymmv.

newbie789 · on Nov 18, 2021

Ah, thanks. I was confused by the title.