> For this reason, formal proofs of correctness are valuable for distributed algorithms.
I have a massive amount of respect for Martin and his work, but I think that this emphasis is the wrong one (if our goal is to increase the correctness of deployed distributed algorithms).
Instead, I like to think (building on the work of folks like Ankush Desai, an AWS colleague and the creator of P) that the process of specification, and its products, are more valuable than the proof itself. Model checking, whether exhaustive or stochastic, is valuable for checking the properties of specifications, but pushing the last step into a full proof is often not worth the additional effort. Making proof the focus and the goal tends to turn most programmers off the whole topic.
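To make the "exhaustive" flavour of model checking concrete, here's a minimal Python sketch of what a checker like TLC does, in miniature: enumerate every reachable state of a spec and assert a safety property in each one. The toy lock spec, state encoding, and all the names here are invented for illustration; a real tool derives this from a TLA+ or P specification.

```python
from collections import deque

# Toy spec: two processes and one lock. Each process is "idle" or
# "crit"; lock is True when held. Every step is atomic.
INIT = (("idle", "idle"), False)

def replace(t, i, v):
    return t[:i] + (v,) + t[i+1:]

def next_states(state):
    (pcs, lock), succ = state, []
    for i in (0, 1):
        if pcs[i] == "idle" and not lock:        # atomic test-and-set
            succ.append((replace(pcs, i, "crit"), True))
        elif pcs[i] == "crit":                   # release the lock
            succ.append((replace(pcs, i, "idle"), False))
    return succ

def mutex(state):
    pcs, _ = state
    return pcs != ("crit", "crit")

# Exhaustive model checking in miniature: BFS over every reachable
# state, asserting the safety property in each one.
seen, frontier = {INIT}, deque([INIT])
while frontier:
    s = frontier.popleft()
    assert mutex(s)
    for t in next_states(s):
        if t not in seen:
            seen.add(t)
            frontier.append(t)
```

A stochastic checker would instead take long random walks through `next_states`, trading exhaustiveness for scalability to much larger state spaces.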
I think we should be saying to folks "If you pick up these tools you'll end up with a crisp specification that'll help you move faster at development time, extremely clear documentation of your protocol, and a deeper understanding of what your protocol does" instead of "if you pick up these tools you'll end up with the mathematical artifact of a proof". This has been a big shift in my own thinking. When I first picked up Promela, Alloy, and TLA+/TLC/TLAPS, my own focus was on proof (and remained so until 2014-ish, when I spent a lot of time talking to Chris Newcombe and Leslie Lamport as we were writing the "Formal Methods at AWS" CACM paper). Over time I've found the journey much more valuable than the artifact, and tools like PlusCal and P, which make specification more approachable, more valuable than tools with less familiar syntax and semantics.
We shouldn't be surprised that few people can make progress when we ask them to understand both Paxos and Isabelle at the same time! Maybe if that's your focus as a PhD student, then it's OK.
> The real challenge in verifying distributed algorithms is to come up with the right invariant that is both true and also implies the properties you want your algorithm to have. Unfortunately, designing this invariant has to be done manually.
This is true. Invariants are hard. But maybe I'm more optimistic: I think there are a lot of very real systems with straightforward invariants. Take Lamport's three invariants for consensus (section 2.1 of Paxos Made Simple), for example: https://lamport.azurewebsites.net/pubs/paxos-simple.pdf Similarly, invariants like linearizability, and even formal definitions of isolation.
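As a toy illustration of the flavour of those consensus invariants, here's a hedged Python sketch of "at most one value is chosen", where chosen means voted for by a majority of acceptors in some ballot. The vote encoding and the `chosen_values` helper are made up for this example, not taken from the paper.

```python
# Three acceptors; a value is "chosen" when a majority vote for it
# in the same ballot. votes maps (ballot, acceptor) -> value.
ACCEPTORS = {"a1", "a2", "a3"}

def chosen_values(votes):
    chosen = set()
    for b in {ballot for (ballot, _) in votes}:
        for v in set(votes.values()):
            voters = {a for (bb, a), vv in votes.items()
                      if bb == b and vv == v}
            if len(voters) * 2 > len(ACCEPTORS):
                chosen.add(v)
    return chosen

# A run respecting Lamport's invariants: later ballots keep the value.
good = {(1, "a1"): "x", (1, "a2"): "x",
        (2, "a2"): "x", (2, "a3"): "x"}
assert chosen_values(good) == {"x"}

# A run violating them: ballot 2 proposes a different value, and the
# safety invariant "at most one value is chosen" breaks.
bad = {(1, "a1"): "x", (1, "a2"): "x",
       (2, "a2"): "y", (2, "a3"): "y"}
assert len(chosen_values(bad)) > 1
```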
The research I would love to see is work to synthesize invariants from formalisms like Adya's or Crooks' work on database isolation levels. I think that's a very tractable project, and would be super useful to practitioners.
I agree that for most engineers, getting into formal proof is intimidatingly hard and perhaps not a good use of time, because lighter-weight specification languages and model-checking can help develop a good understanding of a system's behaviour with a much lower time investment.
But the intended audience for this post was not engineers working on production systems; it was intended for researchers studying distributed algorithms (e.g. consensus algorithms), and researchers who already know proof assistants and want to know how to use their skills to verify distributed algorithms. Note that this post appeared on a blog called "Machine Logic", not "Practical Distributed Systems Engineering"!
I have personally got a lot of value out of formal verification because the distributed algorithms I work on (CRDTs) are sometimes so subtle that without a proof of correctness, I simply don't believe that they are correct. I have several times developed algorithms that I believed to be correct, only later to find out that I was wrong. Here's a case study of one such example: https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-969.pdf
For people like myself, who are designing such algorithms, I believe that formal proof is worth the effort. But most people are not in this situation.
Are your focus and Dr. Kleppmann's mutually exclusive? I ask because it seems like the Isabelle formulation emerges from the protocol itself. The invariant is the only added work, yes?
I'm effectively asking if the proof generation could replace the model checking as a way of verifying that you have appropriately specified your protocol. Big information exchange protocols can be confusing!
(Maybe I'm misunderstanding what an invariant is... I'm thinking it's the property proved true by induction (and would go something along the lines of "if and only if the good has been sold, then the merchant has received payment for the good"))
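That reading is roughly right: an inductive invariant is a property that holds in the initial state and is preserved by every transition, so by induction it holds in every reachable state. A minimal Python sketch of the buyer/seller example above, with an invented two-boolean state (this is purely illustrative, not from the post):

```python
from itertools import product

# State: (sold, paid). Invariant I: sold if and only if paid.
def inv(state):
    sold, paid = state
    return sold == paid

def transitions(state):
    sold, paid = state
    # The only transition: the sale and the payment happen atomically.
    return [(True, True)] if not sold else []

# Base case: the initial state satisfies the invariant.
assert inv((False, False))

# Inductive step: from ANY state satisfying the invariant, every
# transition leads to a state that still satisfies it. This is the
# shape of the manual induction a proof assistant makes you do.
for state in product([False, True], repeat=2):
    if inv(state):
        for nxt in transitions(state):
            assert inv(nxt)
```

Note the inductive step quantifies over all states satisfying the invariant, not just the reachable ones; finding an invariant strong enough for that is exactly the hard manual part the post describes.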
This is great stuff. I did something similar in a 2020 paper[1] using Coq. I did not find Coq particularly suited for this type of work. Isabelle/HOL looks promising.
One thing to note is that these types of verifications can become difficult when the systems are layered. Even the simplest real-world systems are layered once you factor in the underlying transport for the messages. The TLC paper provided a way for distributed systems to be composed in a way that reduces the complexity of the proofs for each component.
Yeah, a distributed system with proofs that it works under some specific assumptions: this is the future. It's a hard and complicated future, because the tools and the expertise aren't here yet, but it is definitely the future.
"Even though in reality processes may run in parallel, we do not need to model this parallelism since the only communication between processes is by sending and receiving messages, and we can assume that a process finishes processing one event before starting to process the next event. Every parallel execution is therefore equivalent to some linear sequence of execution steps. Other formalisations of distributed systems, such as the TLA+ language, also use such a linear sequence of steps."
Ha-hah! Proving that you can treat "processing one event" as globally atomic was the funnest thing in grad school.
My perspective from industry is that we're not really suffering for a lack of formal verification, or anything at the specification level. Basically all of the meaningful problems that need help and attention are with the implementations. I've been on both sides, and I can say it's much harder to implement a bug free distributed system for large scale production use than to specify one.
Enjoyed this. The underlying formalism seems based (I think?) on concurrent, independently-executing processes that are stimulated by messages. That has some similarities with the Actor model [0], albeit Actors support concurrent, intra-process message handling whereas this seems sequential at the individual process level. So maybe more like Communicating Sequential Processes [1].
For anyone more knowledgeable on the topic: is that right? And is that model common across other proof systems like TLA?
Second (bonus) question: why separate messages from user input?
> And is that model common across other proof systems like TLA?
Yes, it's pretty common, and is a really nice abstraction.
TLA+ doesn't use that model, although a lot of TLA+ code is implemented that way. TLA+'s model is basically "shared memory" where any step can access or change any state in any way it likes. That's super flexible, but can feel a little too flexible.
P, on the other hand, is based on a model of state machines that communicate with messages. See this example for a clear illustration of how that works: https://p-org.github.io/P/tutorial/twophasecommit/
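For a rough feel of that model, here's a minimal Python sketch. The `Machine` class and scheduler below are invented for illustration and are not P's actual API: each machine owns a mailbox, and a scheduler runs one handler at a time to completion, which is exactly the linear-sequence-of-atomic-steps view discussed above.

```python
from collections import deque

class Machine:
    """A state machine that communicates only via messages."""
    def __init__(self, name):
        self.name, self.inbox = name, deque()

    def send(self, target, msg):
        target.inbox.append((self, msg))

    def handle(self, sender, msg):
        raise NotImplementedError

class Pinger(Machine):
    def __init__(self, name):
        super().__init__(name)
        self.done = False

    def handle(self, sender, msg):
        if msg == "pong":
            self.done = True

class Ponger(Machine):
    def handle(self, sender, msg):
        if msg == "ping":
            self.send(sender, "pong")

ping, pong = Pinger("ping"), Ponger("pong")
ping.send(pong, "ping")  # kick off the protocol

# Scheduler: pick any machine with a pending message, run its handler
# atomically, repeat. A model checker would explore every such choice.
machines = [ping, pong]
while any(m.inbox for m in machines):
    m = next(m for m in machines if m.inbox)
    sender, msg = m.inbox.popleft()
    m.handle(sender, msg)

assert ping.done
```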
I'm interested in learning how to formally express distributed systems but have limited time. Does anyone have any insight into which would be better to learn - TLA+/PlusCal (e.g. Hillel Wayne's courses) or Isabelle/HOL?