
Microsoft Azure Storage | Seattle, WA | Onsite or Remote (US only)

Our team in Azure Storage is applying automated reasoning and validation techniques to generate hundreds of thousands of tests that check for data integrity and durability issues, and to scalably validate a combinatorial number of feature interactions, catching regressions and subtle bugs that only surface under rare conditions. As a Software Engineer on this team, you will get to use tools to specify and check the integrity, correctness, and performance guarantees of a large and critical Azure service; you will also get to build and improve the tools themselves and learn about the existing state-of-the-art tools in this space. The lessons you learn will be broadly applicable to any distributed system and will help accelerate your growth as a disciplined and thoughtful distributed systems engineer.

Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

The official job posting is not live yet, so please reach out to me directly at imnaseer@microsoft.com and include "Automated Reasoning and Validation Position" in the subject.


My team used Coyote to test our distributed service against network race conditions. It requires a bit of setup: all components that typically run on separate machines must be able to run in a single process, and inter-process communication must happen through interfaces that can be stubbed out during testing.
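To make that concrete, here's a rough sketch of the kind of seam involved (all names here are hypothetical, not from our actual service): production components talk to each other through an interface, and the test swaps in an in-memory implementation so everything runs in one process under Coyote's control.

    using System;
    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    // The seam: production code wraps real networking behind this
    // interface; tests swap in an in-process stub.
    public interface IMessageChannel
    {
        Task SendAsync(string destination, byte[] payload);
    }

    // In-memory stub used only under test: messages become plain
    // async calls, so Coyote's scheduler can explore the ways
    // sends and receives interleave.
    public class InMemoryChannel : IMessageChannel
    {
        private readonly ConcurrentDictionary<string, Func<byte[], Task>> handlers = new();

        public void Register(string component, Func<byte[], Task> handler) =>
            this.handlers[component] = handler;

        public Task SendAsync(string destination, byte[] payload) =>
            this.handlers[destination](payload);
    }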

I designed a series of workshops to teach these ideas internally at Microsoft. You can find the source code used in the workshop at https://github.com/microsoft/coyote-workshop - the repo could use better README files, but I'm sharing it nonetheless in case the setup helps inspire your own use of Coyote.


Check out Coyote (an evolution of P), which allows you to "model check" .NET code to find hard-to-find race conditions in your implementation. We have used it very successfully on a number of projects at Microsoft. I should probably do more public talks on it to get the word out. Finding subtle bugs in your implementation using tools like Coyote is easier than most people realize.
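To give a flavor, here's a minimal sketch of what a Coyote test can look like (the Counter class is a made-up example; with recent Coyote versions you run the compiled assembly through the coyote rewrite and coyote test CLI steps):

    using System.Threading.Tasks;
    using Microsoft.Coyote.Specifications;
    using Microsoft.Coyote.SystematicTesting;

    public class Counter
    {
        public int Value;

        // Racy read-modify-write: the await introduces a scheduling
        // point between the read and the write.
        public async Task IncrementAsync()
        {
            int read = this.Value;
            await Task.Yield();
            this.Value = read + 1;
        }
    }

    public static class CounterTests
    {
        // Coyote runs this method repeatedly, systematically exploring
        // interleavings of the two tasks until an assertion fails or
        // the exploration budget runs out.
        [Test]
        public static async Task TestConcurrentIncrements()
        {
            var counter = new Counter();
            var t1 = counter.IncrementAsync();
            var t2 = counter.IncrementAsync();
            await Task.WhenAll(t1, t2);

            Specification.Assert(counter.Value == 2,
                $"Lost an update: expected 2, got {counter.Value}.");
        }
    }

Coyote quickly finds the interleaving where both tasks read the same value and one increment is lost.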


TLA+ and Coyote pair quite well actually: verify the high-level design in TLA+, then use Coyote to catch subtle bugs as you translate that design into an implementation.


Coyote (a concurrency exploration tool for .NET programs) can be used to do something "similar". My team often writes tests that set up focused concurrency between different APIs; Coyote explores the different ways that concurrency can unfold, and the tests assert a strong set of invariants that must hold at the end. It's not TLA+, by design: it checks our actual code base rather than a model of it. But it's quite effective, it's very teachable since developers are already familiar with writing C# tests, and it does a good job at catching safety and liveness bugs, including concurrency and distributed-systems bugs.
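A sketch of the shape such a test takes (RowStore and its methods are hypothetical stand-ins for a real service's APIs):

    using System.Threading.Tasks;
    using Microsoft.Coyote.Specifications;
    using Microsoft.Coyote.SystematicTesting;

    public static class RowLifecycleTests
    {
        // Focused concurrency between two different APIs (Update vs.
        // Delete). Coyote explores the interleavings, and the end-state
        // assertions encode the invariants the service must uphold.
        [Test]
        public static async Task UpdateRacesWithDelete()
        {
            var store = new RowStore(); // hypothetical service under test
            await store.CreateAsync("key", "v0");

            var update = store.UpdateAsync("key", "v1");
            var delete = store.DeleteAsync("key");
            await Task.WhenAll(update, delete);

            // Depending on which operation wins, the row is either gone
            // or holds the updated value -- but never a corrupt state.
            var row = await store.ReadAsync("key");
            Specification.Assert(row == null || row == "v1",
                $"Unexpected final state: {row}");
        }
    }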


Visualizations do help a lot when model checkers and concurrency-schedule exploration tools like Coyote find bugs. Coyote includes the ability to visualize traces if you express your concurrency using actors (see https://microsoft.github.io/coyote/#concepts/actors/state-ma...)
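For a sense of what expressing your concurrency with actors looks like, here's a minimal state-machine sketch in the style of the Coyote actor programming model (the Ping/Pong events and Server machine are made-up examples):

    using Microsoft.Coyote.Actors;

    internal class Ping : Event
    {
        public readonly ActorId Client;
        public Ping(ActorId client) => this.Client = client;
    }

    internal class Pong : Event { }

    internal class Server : StateMachine
    {
        // Declarative states and handlers like these are what the
        // runtime records, and what the trace visualizer renders.
        [Start]
        [OnEventDoAction(typeof(Ping), nameof(HandlePing))]
        private class Ready : State { }

        private void HandlePing(Event e) =>
            this.SendEvent((e as Ping).Client, new Pong());
    }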

Coyote also allows you to implement your own "logger", through which you can emit enough information to construct some cool visualizations. I had a lot of fun visualizing an implementation of Paxos using Coyote (then P#); screenshot at https://ibb.co/TTk2hYb


Loom's style of exploration (and that of aws shuttle, mentioned below) can be quite effective. Coyote is the equivalent library in the .NET world (formerly known as P#) and comes with a bunch of exploration "schedulers", from random exploration to probabilistic and even an experimental reinforcement-learning-based one. They released an animation-rich video today (see https://news.ycombinator.com/item?id=26718273). It's worth watching even if you're using Rust, as Loom and aws shuttle are conceptually similar.


A bunch of content on Coyote has been shared before as well, but the quality of the animations in this introduction, and the way they explain how it works under the hood, was excellent.


Hi HN,

The tweet links to a couple of tutorials showing how to use Coyote to find concurrency bugs in an extremely simple CRUD service. Developers write simple unit tests whose concurrency Coyote explores to find bugs. You might be surprised by how many interesting concurrency tests you can write for even the simplest of CRUD services.

https://microsoft.github.io/coyote/#tutorials/first-concurre... https://microsoft.github.io/coyote/#tutorials/test-concurren...
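The flavor of bug these tutorials go after is the classic check-then-act race. A rough sketch, not the tutorials' actual code (AccountService is a hypothetical stand-in):

    using System.Threading.Tasks;
    using Microsoft.Coyote.Specifications;
    using Microsoft.Coyote.SystematicTesting;

    public static class CreateAccountTests
    {
        // Two clients try to create the same account concurrently. If
        // Create is implemented as "read, then write if absent" without
        // synchronization, Coyote will find an interleaving where both
        // requests report success.
        [Test]
        public static async Task ConcurrentCreateIsSafe()
        {
            var service = new AccountService(); // hypothetical CRUD service
            var r1 = service.CreateAsync("alice");
            var r2 = service.CreateAsync("alice");
            bool[] results = await Task.WhenAll(r1, r2);

            Specification.Assert(results[0] ^ results[1],
                "Exactly one concurrent create should succeed.");
        }
    }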


Shameless plug: http://imnaseer.net/paxos-from-the-ground-up.html

I worked on an explanation of Paxos that starts with a simple but incorrect implementation of the protocol. The bug is then fixed and the protocol refined, which leads to another bug. Interestingly, after fixing 6 or 7 bugs we arrive at the actual working implementation of Paxos. The reader, having walked the path that leads to the protocol, (hopefully) understands its nuances better than if they had just read a description of it from the get-go.

Bonus: The explanation uses simulated visual runs as well, e.g. http://imnaseer.net/paxos-from-the-ground-up.html?section=3&...


Concurrency is hard. This was a great investigation into the bug using TLA+ by the author, which suggested simplifications to the code in addition to the bug fix.

This reminded me of another reader/writer concurrency bug, posted by Tom Cargill on the c2 wiki, caused by surprising behavior of Java monitors; Hillel Wayne showed how to discover and then fix it using TLA+ at https://www.hillelwayne.com/post/augmenting-agile/

We were able to reproduce and fix the same bug using Coyote as well, and documented the experience at https://cloudblogs.microsoft.com/opensource/2020/07/14/extre... I was heartened that our experiment reproduced the bug in actual working C# code without the need for an additional modeling step. TLA+ is undoubtedly very powerful for modeling and catching concurrency bugs, but I hope tools like Coyote get wider adoption and are implemented for more languages, so that bugs in critical infrastructure code can be caught in a scalable and repeatable way while staying within a single language for implementation _and_ testing.

