Hacker News new | past | comments | ask | show | jobs | submit | nano_o's comments login

The bug in the Paxos Made Simple paper is that Lamport forgot to mention that, upon accepting a proposal, an acceptor also implicitly promises not to accept any proposal in lower ballots. It's discussed at length here: https://stackoverflow.com/questions/29880949/contradiction-i...


In your example, you can just add a variable that is incremented at every step and then use it to state your invariant that convergence must happen within 5 steps.
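The trick can be sketched outside TLA+ too. Here's a toy Python state-space exploration (the system and the bound are made up for illustration): we add an auxiliary step counter to the state, which turns "convergence within 5 steps" into a plain invariant on that counter.

```python
# Toy system (hypothetical): x is halved each step, "converged" means x == 0.
# The auxiliary counter `steps` lets us state bounded convergence as an
# invariant: any non-converged reachable state must have steps < bound.

def next_states(state):
    x, steps = state
    if x == 0:                      # converged: no further transitions
        return []
    return [(x // 2, steps + 1)]

def check(initial_x, bound=5):
    # Explore all states reachable from (initial_x, 0) and check the invariant.
    frontier = [(initial_x, 0)]
    while frontier:
        x, steps = frontier.pop()
        assert x == 0 or steps < bound, f"not converged within {bound} steps"
        frontier.extend(next_states((x, steps)))

for x0 in range(32):                # halving any x0 < 32 takes at most 5 steps
    check(x0)
print("convergence within 5 steps holds for all x0 < 32")
```

A model checker like TLC does the same thing: the counter makes the bounded-liveness claim checkable as an ordinary safety property.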

Sometimes you can encode properties that might initially seem hard to state in TLA+ in a similar way. Lamport has a recent paper explaining how to do that for hyperproperties such as information-flow security: http://lamport.azurewebsites.net/pubs/pubs.html?from=https:/...


With fancy cryptography (e.g. zero-knowledge proofs), you can allow the public to verify that the tally is correct while keeping votes secret. It's called end-to-end verifiable voting, and this is a good introduction: https://www.microsoft.com/en-us/research/publication/end-end...

The idea is that people post their votes publicly, but encrypted, and there's a procedure (based on zero-knowledge proofs) that allows anyone to check that an encrypted vote is correctly accounted for in the tally without decrypting said vote. If sufficiently many voters post their encrypted votes (and if it's not too predictable who will do so), a wrong tally will be detected with high probability.
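The core "check the tally without decrypting votes" idea can be illustrated with Pedersen commitments (this is a toy sketch, not the actual scheme from the paper, and the group parameters are tiny demo values): the product of all committed votes is a commitment to their sum, so anyone can check a claimed tally against the public commitments. Real systems additionally need zero-knowledge proofs that each hidden vote is 0 or 1.

```python
p, q = 23, 11          # toy group: order-q subgroup of Z_p^* (demo values only)
g, h = 4, 9            # two generators of that subgroup

def commit(vote, r):
    # Pedersen commitment: hides `vote` behind randomness `r`.
    return (pow(g, vote, p) * pow(h, r, p)) % p

votes = [1, 0, 1]
rands = [2, 3, 5]
board = [commit(v, r) for v, r in zip(votes, rands)]   # public bulletin board

# Anyone can multiply the public commitments...
product = 1
for c in board:
    product = (product * c) % p

# ...and check the claimed tally against them, given the aggregate randomness.
tally, agg_r = sum(votes), sum(rands) % q
assert product == commit(tally, agg_r)
print("tally", tally, "verified against", len(board), "hidden votes")
```

The homomorphic property (product of commitments = commitment to the sum) is what makes the tally publicly checkable while each individual vote stays hidden.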


The Stellar Network relies on nodes declaring quorum slices, which can be understood as trust relationships, but it implements a single global blockchain. So I don't think it's very similar.

Edit: one similarity is that token issuers in Stellar can remain authoritative on their token.


You might find the Sledgehammer tool for Isabelle quite interesting. It has been using machine learning techniques to find proofs automatically since at least 2013. It uses previous proofs to learn how to select facts to send to off-the-shelf automated provers in order to discharge a goal. See e.g. http://isabelle.in.tum.de/~blanchet/mash2.pdf
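The fact-selection step can be sketched in a few lines (names and data here are made up, and MaSh's actual learners are naive Bayes and k-NN over richer features): represent each known fact by the set of symbols it mentions, and rank facts by weighted overlap with the current goal, so only the top-k are handed to the automated provers.

```python
from math import log

# Hypothetical fact library: fact name -> symbols it mentions.
facts = {
    "add_comm":   {"add", "nat"},
    "add_assoc":  {"add", "nat"},
    "rev_rev":    {"rev", "list"},
    "map_append": {"map", "append", "list"},
}

def rank(goal_symbols, k=2):
    # Weight rare symbols higher (inverse-document-frequency style).
    n = len(facts)
    def idf(s):
        df = sum(1 for syms in facts.values() if s in syms)
        return log((n + 1) / (df + 1)) + 1
    scores = {
        name: sum(idf(s) for s in goal_symbols & syms)
        for name, syms in facts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(rank({"add", "nat"}))   # the two arithmetic facts score highest
```

The point is that the learner only filters the fact base; the actual proof search is still done by conventional automated provers on the selected facts.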

One issue I have with it, and that automated proof tools based on ML are going to have to solve, is that it's quite unpredictable. Even if it finds stunning proofs from time to time, it is hard to use an unpredictable tool efficiently.


Great idea and great work!

A couple of nitpicks: it would be nice to see what happens when the leader fails. Optimizing for the case of a stable leader might have an impact on recovery time.

Another important aspect of fault-tolerance is whether you can really survive any minority crashing. For example, if only the strictly necessary number of nodes keep up with the leader, and most of those then crash, the system will have a really hard time recovering because of the backlog accumulated at the slow nodes, which must catch up before the system can continue operating.
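Here's the arithmetic behind that concern, with hypothetical numbers: in a 5-node cluster (f = 2), the leader commits an entry once f+1 = 3 nodes have it, so the other f nodes can lag arbitrarily far behind while the system still "tolerates" f failures on paper.

```python
cluster, f = 5, 2
log_positions = {"n1": 1000, "n2": 1000, "n3": 1000,   # kept up with leader
                 "n4": 100,  "n5": 100}                # lagging minority

# An entry is committed once f+1 nodes have it: the (f+1)-th highest position.
committed = sorted(log_positions.values(), reverse=True)[f]
print("committed up to entry", committed)

# Worst tolerated failure: f of the up-to-date nodes crash.
for crashed in ["n1", "n2"]:
    del log_positions[crashed]

# Any new quorum of f+1 = 3 must now include a laggard, which first has
# to replay the backlog before the system can make progress again.
backlog = committed - min(log_positions.values())
print("entries to replay before recovery:", backlog)
```

So a benchmark that lets the lagging minority fall far behind can report great steady-state throughput while hiding a long recovery stall in exactly the failure scenario the protocol is supposed to tolerate.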

A performance number that does not take those things into account may not be very realistic. Nevertheless the idea is pretty good.


Doesn't Multi-Paxos already have stable leaders? My understanding was that the innovation here was to relay prepare/promise/accept/accepted across a random relay network.


Yes, it's a nitpick. The comparison to Multi-Paxos seems fair because it makes similar assumptions (unless re-configuring the relay network after a leader failure is somehow difficult, but I wouldn't expect that).

My point is that it would be nice to have benchmarks that take the issues I brought up into account and measure what happens in the worst failure scenarios a protocol is supposed to tolerate. Otherwise we get a false sense of what performance can be achieved if one really cares about fault-tolerance.

This small issue does not diminish the main contribution of the paper in any way.


Also, it seems they proved a behavioral equivalence property: any user-space program has exactly the same behaviors when running on the C+assembly implementation of the OS (6500 lines) and when running on the abstract machine specified by the high-level specification of the OS (450 lines). Their specification also includes liveness properties (e.g. no deadlocks or livelocks).

For comparison, seL4 has verified behavioral refinement between implementation and specification, termination of all syscalls, several security properties (non-interference and information-flow properties among threads), worst-case execution times, and other things; but seL4 does not support fine-grained concurrency.


There is an interesting paper that recently explored the de-facto C standard: Into the Depths of C: Elaborating the De Facto Standards, by Memarian et al. http://www.cl.cam.ac.uk/~km569/into_the_depths_of_C.pdf


The article seems to be based on the following paper, published in 2012: "Cosmic Rays Don't Strike Twice: Understanding the Nature of DRAM Errors and the Implications for System Design". Newer research on the topic has since been published, for example this year: "Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field" (from Facebook) and "Memory Errors in Modern Systems: The Good, The Bad, and The Ugly".


It seems similar to the primary-backup instance of the Vertical Paxos family. In the primary-backup Vertical Paxos, one can tolerate f faults with f+1 replicas as long as a reliable external reconfiguration master is there to replace failed replicas and make sure everyone agrees on the configuration. Here the external reconfiguration master would be ZooKeeper and the primary-backup protocol the ISR protocol.

http://research.microsoft.com/en-us/um/people/lamport/pubs/v...
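A minimal sketch of that primary-backup pattern (all names and the failure-detection mechanism here are hypothetical): f+1 replicas can survive f failures only because an external reconfiguration master removes failed replicas from the active set, so writes never wait on a dead node.

```python
class Master:
    """External reconfiguration master (ZooKeeper's role in Kafka's ISR)."""
    def __init__(self, replicas):
        self.active = list(replicas)   # the agreed-upon configuration
    def report_failure(self, r):
        self.active.remove(r)          # reconfigure: drop the failed replica

class Replica:
    def __init__(self, name):
        self.name, self.log, self.alive = name, [], True

def write(master, value):
    # A write succeeds once every replica in the *current* configuration
    # has it; failed replicas are reported and removed first.
    for r in list(master.active):
        if not r.alive:
            master.report_failure(r)
    for r in master.active:
        r.log.append(value)
    return len(master.active) > 0

replicas = [Replica("a"), Replica("b")]    # f = 1, so f + 1 = 2 replicas
m = Master(replicas)
write(m, "v1")
replicas[0].alive = False                  # replica "a" crashes
write(m, "v2")                             # still succeeds on "b" alone
print([r.name for r in m.active], replicas[1].log)
```

The safety burden is entirely on the master: it must be reliable itself (hence Paxos-based or ZooKeeper-based) and everyone must agree on the current configuration, otherwise two disjoint "active sets" could diverge.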


Yeah, there are lots of similarities to Vertical Paxos. It also shares a lot with van Renesse and Schneider's classic Chain Replication paper (https://www.usenix.org/legacy/event/osdi04/tech/full_papers/...), with the use of a configuration master to configure a chain that can't solve uniform consensus by itself.

That's not a criticism at all, though. There are a lot of good design and systems ideas there, and they are well explained in this post.


I'm glad you enjoyed the post. The chain replication work indeed relies on a Paxos-based master service for reconfiguration, but I believe that's the only similarity; the replication scheme is otherwise different.


Thanks for the comment, the vertical paxos reference is indeed relevant here.

