Turmoil, a framework for developing and testing distributed systems (tokio.rs)
284 points by zbentley on Aug 17, 2023 | 28 comments



I've been working on implementations of classic algorithms in distributed computing, and used Turmoil for testing correctness in message-passing / HTTP systems [0].

Overall, my experience has been positive. When it works, it's great. A pattern I've been following is to have a single fixture that returns a simulation of the system under a standard configuration, for example N replicas of an atomic register, so that each test looks like: 1. Modify the simulation with something like `turmoil::hold("client", "replica-1")`. 2. Submit a request to the server. 3. Make an assertion about either the response, or the state of the simulation once the request has been made. For example, if only some replicas are faulty, the request should succeed, but if too many replicas are faulty, the request / simulation should time out.
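In sketch form, one of those tests looks roughly like the following. This is written from memory of turmoil's Builder/host/client docs rather than copied from my repo, so the port, the echo logic, and the timeout are placeholders:

    use std::time::Duration;
    use tokio::io::{AsyncReadExt, AsyncWriteExt};
    use turmoil::{net, Builder};

    fn main() {
        let mut sim = Builder::new()
            .simulation_duration(Duration::from_secs(10))
            .build();

        // Stand-in "replica": echoes whatever bytes a client sends it.
        sim.host("replica-1", || async {
            let listener = net::TcpListener::bind(("0.0.0.0", 9999)).await?;
            loop {
                let (mut stream, _) = listener.accept().await?;
                let mut buf = [0u8; 8];
                let n = stream.read(&mut buf).await?;
                stream.write_all(&buf[..n]).await?;
            }
        });

        // 1. Hold messages between client and replica, 2. submit a request,
        // 3. assert the request stalls out while the fault is in place.
        sim.client("client", async {
            turmoil::hold("client", "replica-1");

            let attempt = tokio::time::timeout(
                Duration::from_secs(2),
                net::TcpStream::connect(("replica-1", 9999)),
            )
            .await;

            assert!(attempt.is_err(), "request should stall while messages are held");
            Ok(())
        });

        sim.run().unwrap();
    }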

One of the things I have found difficult is that when a test fails, it can be hard to tell if my code is wrong, or if I am using Turmoil incorrectly. I've had to do some deep-dives into the source in order to fully understand what happens, as the behavior sometimes doesn't line up with my understanding of the documentation.

[0] https://github.com/kaymanb/todc/tree/main/todc-net


That's great to hear that you've been using turmoil for this type of work. I'm one of the authors and we'd love to hear about your experience and what we can do to improve things. Either a GitHub issue or reaching out on Discord works great.

We've discussed improving the tracing experience, and even adding visualizations, but it hasn't been prioritized yet.


This seems to be specifically for distributed systems written in Rust, though it's not entirely clear from the post, so I'm not sure.


This is built on the tokio.rs ecosystem so it is Rust-specific.


It may also be some kind of small CloudSim alternative but in Rust instead of Java. I’m not sure.

https://en.m.wikipedia.org/wiki/CloudSim


Always good to see more examples of this sort of project; making it straightforward to test how systems behave with unreliable networks [1] should make building resilient systems easier.

I came across Microsoft's Coyote project [2] a while back, which seems similar to this, though for C#. Has anyone here tried that out? If so, what have your experiences with it been like?

[1] Also known as any network ever.

[2] https://github.com/microsoft/coyote


I used it to test an idea I had to implement the Fast Paxos optimization on top of CASPaxos and it was quite useful for that: https://github.com/ReubenBond/fast-caspaxos (the code is bad, it's just an experiment)


My team used Coyote to test their distributed service against network race conditions. It requires a little bit of setup to ensure that all components that typically run on separate machines can run in a single process, and that inter-process communication happens through interfaces that can be stubbed out during testing.

I designed a series of workshops to teach these ideas internally at Microsoft. You can find the source code used in the workshop at https://github.com/microsoft/coyote-workshop - the repo could use better README files, but I'm sharing it nonetheless in case the setup helps inspire your own use of Coyote.


On first read this looks almost completely useless for actual distributed systems.

So first off, distributed systems use different software for different tasks. These will not be/were not written in Rust, and even if they were, they may not use Tokio (which is a library I could go into a rant about as well...).

Second, the very premise of the library, namely that it simulates the components and serializes everything into a single deterministic thread, is already wrong. Distributed system issues are difficult because they are non-deterministic, the faults happen rarely, and they exhibit Heisenbug behaviour: the more diagnostics you add to the system, the more likely you are to serialize the events and therefore hide the bug. This library starts off with serialization, effectively hiding most if not all issues that can happen in a real system. What is the point of this?

So what would an actually useful distributed testing framework look like? It would be almost the exact opposite of turmoil:

1. It would handle an actual distributed system with different components written in different languages

2. It would inject payloads that are likely to tease out races/consistency violations

3. It would inject failures at various layers (network, storage etc.) to test how the system recovers

4. It would provide useful diagnostics, and a way to reduce the failure scenarios to the extent possible (again, we're talking about non-deterministic failures, so this is not easy)


> distributed systems use different software for different tasks. These will not be/were not written in Rust,

Obviously, that's not always true. Having a way to do deterministic testing in a single process is helpful.

> Distributed system issues are difficult because they are non-deterministic, the faults happen rarely,

You do not want non-determinism in your tests just because the real world has non-determinism. How else are you going to write repeatable tests of scenarios involving concurrency?

You can build fuzzing on top of a deterministic simulator, allowing you to reproduce failures. This is a strategy that FoundationDB has used to great effect.

https://apple.github.io/foundationdb/testing.html
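The shape of that strategy is simple enough to sketch: loop over seeds, and any failing seed becomes a deterministic repro. Here `run_one_simulation` is a hypothetical stand-in for whatever builds and drives a single seeded, deterministic sim and checks its invariants:

    // Hypothetical stand-in: build one deterministic simulation from `seed`,
    // drive it to completion, and return Err if any invariant was violated.
    fn run_one_simulation(seed: u64) -> Result<(), String> {
        // ... construct the sim with an RNG seeded from `seed`, run it ...
        let _ = seed;
        Ok(())
    }

    fn main() {
        // Fuzz by sweeping seeds; any failure is reproducible by re-running its seed.
        for seed in 0..10_000u64 {
            if let Err(violation) = run_one_simulation(seed) {
                panic!("invariant violated with seed {seed}: {violation}");
            }
        }
    }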


Reading about it, the FoundationDB testing does look very good. If you control all of the atomic events and build a fuzzer on top, you can indeed end up with reproducible scenarios testing concurrency, up to the granularity of those atomic events.

In the async scenarios, as with Tokio and Flow (which I just learned about, cheers), the atomicity extends to the scheduler yields, which is already pretty good. It cannot test e.g. atomic memory operations/memfences etc., but it can test in a fairly fine-grained way.

However, my original point still stands: you can only do this if you control all of the system within your particular scheduler. In real world scenarios you are more likely to encounter:

1. Some database written in C

2. Some storage layer maintained by AWS/Azure/GCP

3. Some messaging layer written in Go

4. Failover modes of the Kubernetes/Hashicorp stack

5. Your node.js webserver named "backend"

6. Some performance-sensitive component written in Rust

This sort of testing just does not apply.


> Obviously, that's not always true. Having a way to do deterministic testing in a single process is helpful.

It is, but then you're not testing a distributed system.

> You do not want non-determinism in your tests just because the real world non-determinism. How else are you going to write repeatable tests of scenarios involving concurrency?

You can't, but you can get close. E.g. with fuzzing you can generate the scenarios with some ordering constraints, so if you replay the scenario a couple of times you are likely to reproduce the failure. It's a balancing act: too much serialization and your tests are useless, too little serialization and the state space becomes too large.


FoundationDB has published some informative content on how they test their actor-based distributed system in C++; it might be worth a watch for inspiration.

https://www.youtube.com/watch?v=4fFDFbi3toc


Cool, will be interested to see how this develops! tokio's loom framework has been a big help in testing some tricky concurrency code I've worked on.

Folks interested in this space might also be interested in the system I spend most of my time working on: Shadow. It also performs deterministic simulation of a network of hosts, but it intercepts network and system interactions at the syscall level via seccomp. As such it can work with binaries compiled from ~any language, usually without any code modification or special compilation. https://shadow.github.io/


See Also: Loom (https://github.com/tokio-rs/loom), a concurrency permutation testing tool.
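For anyone who hasn't tried it: a loom test wraps the concurrent code in `loom::model`, and loom then re-executes the closure under every relevant interleaving of the loom-provided threads and atomics. A minimal sketch, roughly the counter example from loom's own docs:

    use loom::sync::atomic::{AtomicUsize, Ordering};
    use loom::sync::Arc;
    use loom::thread;

    #[test]
    fn concurrent_increments_are_not_lost() {
        loom::model(|| {
            let counter = Arc::new(AtomicUsize::new(0));

            // Two threads each bump the counter; loom explores the interleavings.
            let handles: Vec<_> = (0..2)
                .map(|_| {
                    let counter = counter.clone();
                    thread::spawn(move || {
                        counter.fetch_add(1, Ordering::SeqCst);
                    })
                })
                .collect();

            for handle in handles {
                handle.join().unwrap();
            }

            assert_eq!(counter.load(Ordering::SeqCst), 2);
        });
    }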


I used https://docs.rs/shuttle/latest/shuttle/ (which was directly inspired by Loom) for testing a concurrent garbage collector I wrote in Rust. It was really useful for finding pathological thread orderings that broke GC invariants, and I'd definitely use a similar system for any concurrent or distributed code I'd write in the future - these types of exercising systems and deterministic reproducers are great.


Used it to check lock- and wait-free algorithms. Great tool! One really tough bug, however, needed TLA+ to be found; running loom over multiple days was apparently not enough.


Leslie didn't steal his Turing award ;)


While deterministic execution for concurrent/distributed systems is great, there's also value in fuzzing the timing and order of asynchronous events to reveal race condition bugs in testing.


How does this compare to Jepsen?


Nice library!

I found myself struggling with this in Python with asyncio - is there something similar in that space?

Every time, I wish I was using Rust.


With asyncio you can already run multiple servers/clients/whatever in the same event-loop/thread/process. Although I don't understand what this library does at all, I don't think you would need anything special in the asyncio case. I've written tests for multiple distributed applications using the asyncio approach in a relatively straightforward manner.


> Turmoil is a framework for testing distributed systems [written in Rust]. It provides deterministic execution by running multiple concurrent hosts within a single thread. It introduces "hardship" into the system via changes in the simulated network. The network can be controlled manually or with a seeded rng.


Very cool - the Tokio team just keeps cranking out good stuff! Thank you!!!


Can someone explain what this does under the hood? I didn't find the text on their website very helpful. (Also, I don't know Rust)


I also thought the example could do with fleshing out, but from:

> Testing distributed systems is hard. Non-determinism is everywhere (network, time, threads, etc.)

I assume that, at a very high level, what it does is: introduce random delay between components, give them slightly different fake times when they ask for system time, put them in different threads & perhaps introduce (additional) pre-emption etc., and anything else to simulate 'etc.'.



I, in turn, was reminded of the (quite fun, for 1 play-through) 2016 game about mining oil: https://en.wikipedia.org/wiki/Turmoil_(2016_video_game)



