Hacker News new | past | comments | ask | show | jobs | submit login
The Bug in Paxos Made Simple (brooker.co.za)
98 points by mjb on Nov 17, 2021 | hide | past | favorite | 13 comments



Quoting from the paper's description of Phase 1:

> (b) If an acceptor receives a prepare request with number n greater than that of any prepare request to which it has already responded, then it responds to the request with a promise not to accept any more proposals numbered less than n and with the highest-numbered proposal (if any) that it has accepted.

It says an acceptor must respond with the highest-numbered proposal (if any) that it has accepted.

How is acceptor C going to do that after step 9? That's where the bug is introduced, I think, not anywhere in the paper.


mjb (author) links to an SO post which goes:

> If you look at _Lamport's paxos proofs_ he treats an accept as a promise... But this is not pointed out in _Paxos Made Simple_. In fact, it appears Lamport took great pains to specify that an accept was not a promise.

> The problem is when you combine the weaker portions of both variants; as the OP did and several implementations do. Then you run into this catastrophic bug.

That is, Lamport's proofs and Paxos Made Simple contradict each other ever so slightly to trip everyone all at the same time:

> What's more I looked through several proprietary and open-source paxos implementations and they all had the bug submitted by the OP!

https://stackoverflow.com/questions/29880949/contradiction-i...

Lamport himself acknowledges as much: https://www.youtube.com/watch?t=4398&v=8-Bc5Lqgx_c


Lamport acknowledges the report, not the error. I'm still not convinced there even is an error.

The state after step 9 should be the same as after step 7, i.e. `A(-:-,100) B(100:b,100) C(100:b,-)` because C needs to retain the "highest-numbered proposal" it accepted, not the one it accepted last. That means the nine steps outlined in the post/on StackOverflow do not, by themselves, demonstrate any problem.

So what additional steps are missing/what alternative steps actually produce an inconsistency/divergence?


There isn't an "error" / "bug" in either papers when one looks at them in isolation. Together, they seem to have a very subtle difference in the algorithm to trip most distributed systems practioners.


It's a misunderstanding of the paper I think. The blog post author and SO post both make the mistake of altering the Paxos Made Simple algorithm to allow sending phase 2 messages to any majority set of acceptors. But that doesn't work unless you also alter the P2b wording so that an acceptor also treats prior accepts as promises. I can see there is a slight ambiguity in the reading of the PMS paper itself that would make you think this is okay, but saying it is a bug in the paper is a bit of stretch.


Knowing nothing about P or modeling distributed systems, I wish the author had shown their code and an example of how the model checker found the problem.


Paxos made simple is a human-friendly condensed recap of the original Paxos paper which came with a 20-page TLA+ spec.

As one can imagine, shrinking something down and making it easy is a lossy process. The text of the ‘recap’ paper can be ambiguous.

Calling this out as a ‘bug’ seems like sensationalism.


I like the notation:

  A(-:-,100) B(100:b,100) C(100:b,-)
Just reading to that line, it was clear: what was going to go boom, and what the solution(s) would be.


Wait the author says that this isn’t a bug in Paxos but one that he intentionally and arbitrarily introduced? The ‘bug’ here is maybe a bit of ambiguity from a book?

Edit: I appreciate being corrected on this.


> The Bug in Paxos Made Simple

Emphasis mine, the bug is not in the algorithm but in the way it is described in the paper. I can see how the author's choice of words is a bit confusing here.


Actually, it is not in the paper either. Instead it is in the blog post author's reading of the paper.

> So if you don't quite follow the letter of the text about acceptor selection, and then do follow the text about how acceptors handle accept messages, then you end up with the bug

Note the "don't quite"


But per the linked SO thread this is a very easy mistake to make. If the paper is not perfectly clear and does not call attention to this subtetly, I would say that is still an issue with the paper.

Then again I don’t subscribe to the idea that doxumented misbehaviour is fine do ymmv.


Ah, thanks. I was confused by the title.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: