I wrote the initial version of Toxiproxy back in 2014, but Jacob Wirth took it well beyond that during his internship at Shopify. It came out of a need to write integration tests for the resiliency work we were doing at Shopify back then. [1] We didn't want someone to suddenly re-introduce a hard dependency on e.g. Redis on Shopify's storefronts. The initial prototype was a shell script that used lsof(1) and gdb(1) to close the file descriptors of the various connections. But besides that being dodgy, we also needed to simulate latency and make sure it worked on everyone's macOS laptops (otherwise e.g. tc(1) would have been intriguing). I wrote a little more about the history of Toxiproxy on Twitter. [2] It's stable and has proxied everything in dev and CI at Shopify for over half a decade.
Anyone know of anything like this but at the TCP level? I would love to have a way of simulating network partitions and different message delays for distributed algorithms implemented in Elixir. In an ideal world I'd be able to hook the Elixir send/receive primitives to intercept messages between processes, even on a single node.
Precisely tc and netem. To give an idea of how powerful and tunable netem can be: we had a piece of network gear that lost messages if they arrived too close together in time (less than N microseconds apart). Probably some 'copy packet during interrupt' stuff. We couldn't change anything in the applications or the network gear. The solution was (on the sender side) to 1) classify the specific packet sequence with tc and 2) delay the second message in the sequence. It's a long command line, but it does the job perfectly. It's marvelous.
+1 for tc. Used to run a network testing lab of about 12 racks of network gear and tc was my go-to for simulating network level performance impacts for both small and large test scenarios.
The dummynet system facility permits the control of traffic going through the various network interfaces, by applying bandwidth and queue size limitations, implementing different scheduling and queue management policies, and emulating delays and losses.
...
The dummynet facility was initially implemented as a testing tool for TCP congestion control by Luigi Rizzo, as described on ACM Computer Communication Review, Jan.97 issue.
Ah, I see, my bad. If I understand correctly, though, it's mostly for hooking client/server architectures? I'm more interested in things like hooking comms between replicas of a replicated database or an implementation of Paxos. Would it be usable for that? For example, can I take a list of processes running on different ports and set up a proxy per connection (assuming they are connected all-to-all)? Maybe I should just RTFM :)
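For what it's worth, here is a rough sketch of what "a proxy per connection" could look like for an all-to-all setup. It only builds the proxy definitions (the name/listen/upstream/enabled fields that Toxiproxy's HTTP API uses for a proxy); the replica names, the ports, and the `base_listen_port` numbering scheme are made up for illustration, and each replica would still have to be configured to dial its peers through the generated listen addresses rather than directly.

```python
from itertools import permutations


def mesh_proxies(replica_ports, base_listen_port=20000, host="127.0.0.1"):
    """Build one proxy definition per directed replica pair (src -> dst).

    replica_ports maps a (hypothetical) replica name to the port it listens on.
    Listen ports are assigned sequentially from base_listen_port, which is an
    arbitrary choice for this sketch.
    """
    proxies = []
    names = sorted(replica_ports)
    for offset, (src, dst) in enumerate(permutations(names, 2)):
        proxies.append({
            "name": f"{src}_to_{dst}",
            # src connects here instead of connecting to dst directly
            "listen": f"{host}:{base_listen_port + offset}",
            # the proxy forwards the traffic to the real dst port
            "upstream": f"{host}:{replica_ports[dst]}",
            "enabled": True,
        })
    return proxies


# Three hypothetical replicas give 3 * 2 = 6 directed links.
replicas = {"a": 5001, "b": 5002, "c": 5003}
for proxy in mesh_proxies(replicas):
    print(proxy["name"], proxy["listen"], "->", proxy["upstream"])
```

Using one proxy per direction means a toxic can be applied to a -> b without touching b -> a, which is useful for asymmetric partitions. The resulting definitions could then be created through the client or CLI of your choice.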
Thank you for the offer! If you're looking for something small, I just created issue #1. Other than that, I'm open to new ideas and anything that would make it easier to use.
I recently used it to simulate a network breakdown between an application and its database server, to study how the application behaves. There are various clients you can use with Toxiproxy to set "toxics" (e.g. a delay in the network traffic), but I found the CLI better suited to my needs. The other clients (e.g. for Node.js) can help you write automated tests that check resiliency.
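To make the "toxic" idea a bit more concrete, here is a minimal sketch that talks to Toxiproxy's HTTP API directly instead of going through a client library. It assumes a Toxiproxy server on its default API address (localhost:8474) and an already-created proxy; the proxy name "postgres" and the helper names are hypothetical.

```python
import json
import urllib.request


def latency_toxic(latency_ms, jitter_ms=0, stream="downstream", toxicity=1.0):
    """Build the JSON body for a latency toxic (delay in ms, +/- jitter)."""
    return {
        "type": "latency",
        "stream": stream,      # "upstream" or "downstream"
        "toxicity": toxicity,  # fraction of connections affected
        "attributes": {"latency": latency_ms, "jitter": jitter_ms},
    }


def add_toxic(proxy_name, toxic, api="http://localhost:8474"):
    """POST the toxic to an existing proxy. Needs a running Toxiproxy server."""
    request = urllib.request.Request(
        f"{api}/proxies/{proxy_name}/toxics",
        data=json.dumps(toxic).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example (assumes a proxy named "postgres" already exists):
# add_toxic("postgres", latency_toxic(1000, jitter_ms=100))
```

Keeping the payload builder separate from the HTTP call makes it easy to check the toxic definition in a test without a server running.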
The other thing that caught my interest was the concept of "gamedays". It's really about randomly introducing "problems" into the production system and keeping the support staff who manage application incidents on their toes. (More about it in this talk: https://www.youtube.com/watch?v=TTfWpHuCJXk)
I've used Toxiproxy to reproduce so many issues that I can't thank the authors enough for this awesome tool! I also found a Docker-based UI for adding toxics; I just wish it shipped out of the box or as another binary/brew package. The CLI is awesome for people who have mastered it, but for beginners it's one more thing to learn, and having a UI solves that.
This looks great! I'm building a tool that runs on a system with flaky network issues, and this looks like a way to simulate those issues on a dev server instead of guessing what's happening or interrupting the production machine.
[1]: https://shopify.engineering/building-and-testing-resilient-r...
[2]: https://twitter.com/Sirupsen/status/1455622640727728137