> What Formal Specification Is Not Good For: We are concerned with two major classes of problems with large distributed systems: 1) bugs and operator errors that cause a departure from the logical intent of the system, and 2) surprising ‘sustained emergent performance degradation’ of complex systems that inevitably contain feedback loops. We know how to use formal specification to find the first class of problems. However, problems in the second category can cripple a system even though no logic bug is involved. A common example is when a momentary slowdown in a server (perhaps due to Java garbage collection) causes timeouts to be breached on clients, which causes the clients to retry requests, which adds more load to the server, which causes further slowdown. In such scenarios the system will eventually make progress; it is not stuck in a logical deadlock, livelock, or other cycle. But from the customer's perspective it is effectively unavailable due to sustained unacceptable response times. TLA+ could be used to specify an upper bound on response time, as a real-time safety property. However, our systems are built on infrastructure (disks, operating systems, network) that do not support hard real-time scheduling or guarantees, so real-time safety properties would not be realistic. We build soft real-time systems in which very short periods of slow responses are not considered errors. However, prolonged severe slowdowns are considered errors. We don’t yet know of a feasible way to model a real system that would enable tools to predict such emergent behavior. We use other techniques to mitigate those risks.

Delay, cycles, feedback; [complex] [adaptive] nonlinearity.
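That retry loop is easy to reproduce in a toy discrete-time simulation. A minimal Python sketch (every constant below is invented for illustration, not taken from any real system): a brief "GC pause" ages requests past the client timeout, and the resulting retries then hold offered load above server capacity long after the pause ends.

```python
CAPACITY = 100            # requests the server can complete per tick
ARRIVALS = 90             # fresh client requests per tick (below capacity)
TIMEOUT = 2               # clients retry any request older than this (ticks)
GC_PAUSE = range(10, 13)  # ticks during which the server is stalled

queue = []                # age, in ticks, of each pending request (FIFO)

for tick in range(30):
    # Server: complete up to CAPACITY of the oldest requests, unless stalled.
    served = 0 if tick in GC_PAUSE else CAPACITY
    queue = queue[served:]

    # Clients: age the pending requests; anything past the timeout gets
    # retried, adding duplicate work while the original is still queued.
    queue = [age + 1 for age in queue]
    retries = sum(1 for age in queue if age > TIMEOUT)
    queue += [0] * (ARRIVALS + retries)

    print(f"tick {tick:2d}  queue={len(queue):6d}  retries={retries:6d}")
```

No step in this system is logically wrong, yet after the three-tick pause the queue grows without bound: exactly the sustained emergent degradation the quote describes.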
Formal methods, TLA+ included, also can't prevent side channels in unverified hardware and firmware; at best they work around them. But that's a different layer.
> This raised a challenge; how to convey the purpose and benefits of formal methods to an audience of software engineers? Engineers think in terms of debugging rather than ‘verification’, so we called the presentation “Debugging Designs” [8]. Continuing that metaphor, we have found that software engineers more readily grasp the concept and practical value of TLA+ if we dub it:
>
> Exhaustively testable pseudo-code
> We initially avoid the words ‘formal’, ‘verification’, and ‘proof’, due to the widespread view that formal methods are impractical. We also initially avoid mentioning what the acronym ‘TLA’ stands for, as doing so would give an incorrect impression of complexity.
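"Exhaustively testable pseudo-code" can be taken almost literally: a model checker runs every possible interleaving of a design's steps and checks an invariant in each one. A minimal Python sketch of that idea on the classic lost-update bug (TLC does the same thing, only over enormously larger state spaces):

```python
# Toy exhaustive checker: run every interleaving of two "pseudo-code"
# processes, each doing read-then-write on a shared counter, and check
# the invariant that both increments survive.
from itertools import combinations

def run(schedule):
    """Execute one interleaving; schedule is a tuple of process ids."""
    shared = 0
    local = {0: 0, 1: 0}
    step = {0: 0, 1: 0}
    for pid in schedule:
        if step[pid] == 0:       # step 1: read the shared counter
            local[pid] = shared
        else:                    # step 2: write back the incremented value
            shared = local[pid] + 1
        step[pid] += 1
    return shared

# Every interleaving of two 2-step processes: choose which 2 of the
# 4 slots belong to process 0.
violations = 0
for slots in combinations(range(4), 2):
    schedule = tuple(0 if i in slots else 1 for i in range(4))
    final = run(schedule)
    if final != 2:               # invariant: no increment is lost
        violations += 1
        print("lost update under schedule", schedule, "-> shared =", final)
print(violations, "of 6 interleavings violate the invariant")
```

Four of the six interleavings lose an update; an exhaustive search finds all of them, which is precisely the "debugging designs" pitch.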
Isn't there a "hello world with vector clocks" tutorial? A simple, formally verified hello-world kernel module built with each of the candidate methods would be demonstrative, but then don't you need to model the kernel itself with abstract distributed concurrency primitives too?
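Not aware of a single canonical one, but a vector-clock hello world fits in a screenful. A minimal Python sketch (plain in-process objects stand in for distributed processes, so nothing kernel-level needs modeling at this stage):

```python
# Minimal vector clocks: each process keeps one counter per process id.
# Local events tick the process's own slot; receives merge the sender's
# stamp element-wise, then tick.

class VectorClock:
    def __init__(self, pid, n):
        self.pid = pid
        self.clock = [0] * n

    def tick(self):              # local event
        self.clock[self.pid] += 1
        return list(self.clock)  # return a stamp (copy) for this event

    def send(self):              # stamp an outgoing message
        return self.tick()

    def recv(self, stamp):       # merge an incoming stamp, then tick
        self.clock = [max(a, b) for a, b in zip(self.clock, stamp)]
        self.tick()

def happened_before(a, b):
    """True iff the event stamped a causally precedes the event stamped b."""
    return all(x <= y for x, y in zip(a, b)) and a != b

# Two processes; p0 sends one message to p1.
p0, p1 = VectorClock(0, 2), VectorClock(1, 2)
e1 = p0.tick()           # [1, 0]
msg = p0.send()          # [2, 0]
e2 = p1.tick()           # [0, 1]  -- concurrent with e1
p1.recv(msg)             # p1's clock becomes [2, 2]
print(happened_before(e1, p1.clock))                      # True
print(happened_before(e1, e2), happened_before(e2, e1))   # False False
```

The last line is the payoff: neither order holds between e1 and e2, so the clocks prove the two events were concurrent.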
From https://news.ycombinator.com/item?id=40980370:
> - [ ] DOC: learnxinyminutes for tlaplus
> TLAplus: https://en.wikipedia.org/wiki/TLA%2B
> awesome-tlaplus > Books, (University) courses teaching (with) TLA+: https://github.com/tlaplus/awesome-tlaplus#books
FizzBee, Nagini, deal-solver, z3, dafny: https://news.ycombinator.com/item?id=39904256#39938759
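Of those, z3 arguably has the shortest hello world, since it is a constraint solver rather than a full specification language. A minimal sketch using the z3-solver Python package (constraint values chosen arbitrarily for illustration):

```python
# pip install z3-solver
from z3 import Int, Solver, sat

x, y = Int("x"), Int("y")
s = Solver()
s.add(x > 0, y > 0, x + y == 12, y == 2 * x)  # assert the constraints
if s.check() == sat:                          # ask z3 for a satisfying model
    print(s.model())                          # a model such as [x = 4, y = 8]
```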
"Industry forms consortium to drive adoption of Rust in safety-critical systems" (2024) https://news.ycombinator.com/item?id=40680722
awesome-safety-critical: https://github.com/stanislaw/awesome-safety-critical