- Helps you understand complex abstractions & systems clearly.
- It's extremely effective at communicating the components that make up a system with others.
Let get give you a real practical example.
In the AI models there is this component called a "Transformer". It under pins ChatGPT (the "T" in ChatGPT).
If you are to read the 2018 Transfomer paper "Attention is all you need".
They use human language, diagrams, and mathematics to describe their idea.
However if your try to build you own "Transformer" using that paper as your only resource your going to struggle interpreting what they are saying to get working code.
Even if you get the code working, how sure are you that what you have created is EXACTLY what the authors are talking about?
English is too verbose, diagrams are open to interpretation & mathematics is too ambiguous/abstract. And already written code is too dense.
TLA+ is a notation that tends to be used to "specify systems".
In TLA+ everything is a defined in terms of a state machine. Hardware, software algorithms, consensus algorithms (paxos, raft etc).
So why TLA+?
If something is "specified" in TLA+;
- You know exactly what it is — just by interpreting the TLA+ spec
- If you have an idea to communicate. TLA+ literate people can understand exactly what your talking about.
- You can find bugs in an algorithms, hardware, proceseses just by modeling them in TLA+.
So before building Hardware or software you can check it's validity & fix flaws in its design before committing expensive resources only to subsequently find issues in production.
Is that a practical example? Has anyone specified a transformer using TLA+? More generally, is TLA+ practical for code that uses a lot of matrix multiplication?
It’s really not, TLA+ works best for modeling state machines with few discrete states and concurrent systems. It can find interesting interleaving of events that would leave to a violation of your system properties
> What Formal Specification Is Not Good For: We are concerned with two major classes of problems with large distributed systems: 1) bugs and operator errors that cause a departure from the logical intent of the system, and 2) surprising ‘sustained emergent performance degradation’ of complex systems that inevitably contain feedback loops. We know how to use formal specification to find the first class of problems. However, problems in the second category can cripple a system even though no logic bug is involved. A common example is when a momentary slowdown in a server (perhaps due to Java garbage collection) causes timeouts to be breached on clients, which causes the clients to retry requests, which adds more load to the server, which causes further slowdown. In such scenarios the system will eventually make progress; it is not stuck in a logical deadlock, livelock, or other cycle. But from the customer's perspective it is effectively unavailable due to sustained unacceptable response times. TLA+ could be used to specify an upper bound on response time, as a real-time safety property. However, our systems are built on infrastructure (disks, operating systems, network) that do not support hard real-time scheduling or guarantees, so real-time safety properties would not be realistic. We build soft real-time systems in which very short periods of slow responses are not considered errors. However, prolonged severe slowdowns are considered errors. We don’t yet know of a feasible way to model a real system that would enable tools to predict such emergent behavior. We use other techniques to mitigate those risks.
Formal methods including TLA+ also can't/don't prevent or can only workaround side channels in hardware and firmware that is not verified. But that's a different layer.
> This raised a challenge; how to convey the purpose and benefits of formal
methods to an audience of software engineers? Engineers think in terms of debugging rather than ‘verification’, so we called the presentation “Debugging Designs” [8]
. Continuing that metaphor, we have
found that software engineers more readily grasp the concept and practical value of TLA+ if we dub it:
Exhaustively testable pseudo-code
> We initially avoid the words ‘formal’, ‘verification’, and ‘proof’, due to the widespread view that formal
methods are impractical. We also initially avoid mentioning what the acronym ‘TLA’ stands for, as doing
so would give an incorrect impression of complexity.
Isn't there a hello world with vector clocks tutorial? A simple, formally-verified hello world kernel module with each of the potential methods would be demonstrative, but then don't you need to model the kernel with abstract distributed concurrency primitives too?
- It cannot compile to working code
- Steep learning curve
Opportunities for TLA+
- Helps you understand complex abstractions & systems clearly.
- It's extremely effective at communicating the components that make up a system with others.
Let get give you a real practical example.
In the AI models there is this component called a "Transformer". It under pins ChatGPT (the "T" in ChatGPT).
If you are to read the 2018 Transfomer paper "Attention is all you need".
They use human language, diagrams, and mathematics to describe their idea.
However if your try to build you own "Transformer" using that paper as your only resource your going to struggle interpreting what they are saying to get working code.
Even if you get the code working, how sure are you that what you have created is EXACTLY what the authors are talking about?
English is too verbose, diagrams are open to interpretation & mathematics is too ambiguous/abstract. And already written code is too dense.
TLA+ is a notation that tends to be used to "specify systems".
In TLA+ everything is a defined in terms of a state machine. Hardware, software algorithms, consensus algorithms (paxos, raft etc).
So why TLA+?
If something is "specified" in TLA+;
- You know exactly what it is — just by interpreting the TLA+ spec
- If you have an idea to communicate. TLA+ literate people can understand exactly what your talking about.
- You can find bugs in an algorithms, hardware, proceseses just by modeling them in TLA+. So before building Hardware or software you can check it's validity & fix flaws in its design before committing expensive resources only to subsequently find issues in production.