Monitoring-Driven Development (benjiweber.co.uk)
42 points by henrik_w on March 3, 2015 | 10 comments



Yeah, I agree. Rob Ewaschuk at Google has an excellent document called something like "my philosophy on monitoring". It's fresh in my mind because I've been using it as a kind of social proof for improvements I've been making to the monitoring of our application.

Symptom-based vs cause-based monitoring, he calls it, and I find the terminology works really well.
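
Not from Rob's doc, but a rough sketch of the distinction in Python (the URL, thresholds, and check names here are made up for illustration):

    import os
    import time
    import urllib.request

    # Symptom-based check: measures what users actually experience
    # (did the request succeed, how long did it take), not why it failed.
    def check_symptom(url="https://example.com/search", slo_seconds=1.0):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                ok = resp.status == 200
        except Exception:
            ok = False
        return ok and (time.monotonic() - start) <= slo_seconds

    # Cause-based check: measures an internal resource that might explain
    # a symptom, but high load alone doesn't mean users are hurting.
    def check_cause(load_threshold=8.0):
        one_minute_load, _, _ = os.getloadavg()
        return one_minute_load < load_threshold

Roughly: page on the first kind, keep the second kind around for diagnosis.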

He doesn't mention what you allude to, though: this approach works very nicely alongside teams whose development practices see them iterate on features.



Shameless plug for something I've written recently on the same topic: I spent a few months researching a "philosophy" on monitoring and alerting in production environments and wrote up some (related but different) conclusions:

https://www.scalyr.com/community/guides/zen-and-the-art-of-s...

https://www.scalyr.com/community/guides/how-to-set-alerts

My focus was more on system-level monitoring/alerting vs application-level, but the philosophies are similar.


Similar to what Steve Yegge said in his unintentionally published platforms rant:

- monitoring and QA are the same thing. You'd never think so until you try doing a big SOA. But when your service says "oh yes, I'm fine", it may well be the case that the only thing still functioning in the server is the little component that knows how to say "I'm fine, roger roger, over and out" in a cheery droid voice. In order to tell whether the service is actually responding, you have to make individual calls. The problem continues recursively until your monitoring is doing comprehensive semantics checking of your entire range of services and data, at which point it's indistinguishable from automated QA. So they're a continuum.

https://plus.google.com/+RipRowan/posts/eVeouesvaVX
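
To make that concrete, here's a hypothetical sketch (the db/cache/payment_client objects are stand-ins, not any particular framework) of the cheery "I'm fine" endpoint versus a check that actually exercises the service:

    # Shallow liveness check: only proves the process can answer HTTP.
    def health_shallow():
        return {"status": "ok"}

    def _safe(fn):
        try:
            fn()
            return True
        except Exception:
            return False

    # Deeper check: exercises the same dependencies a real request would,
    # so "ok" means the service can actually do its job. Push this far
    # enough (real queries, semantic validation of the results) and it
    # becomes automated QA, which is Yegge's point about the continuum.
    def health_deep(db, cache, payment_client):
        checks = {
            "db": _safe(lambda: db.execute("SELECT 1")),
            "cache": _safe(lambda: cache.set("healthcheck", "1")),
            "payments": _safe(lambda: payment_client.ping()),
        }
        status = "ok" if all(checks.values()) else "degraded"
        return {"status": status, "checks": checks}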


This was a great rant and I wish I could have access to all those "learnings". Are there any books or blog posts with that knowledge already? The best resources I know are Martin Fowler's posts on the subject...


I'm not aware of anything really in-depth.

There is one paper I like about operational issues in general: https://www.usenix.org/legacy/event/lisa07/tech/full_papers/... It lists a lot of criteria that must be met for a system to be highly automated.


That material is great, thanks! The style reminds me of c2 wiki.


I've always followed the "metrics-driven development" school of thought. If you make capturing "application telemetry" part of deploying and running your application, you make it super easy to monitor via a more or less full-on integration test using production traffic. API response time > 300ms for > 5 seconds? PagerDuty!
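
Something like this rough sketch, say (the 300ms/5s numbers are the ones above; the alert hook is a placeholder, not any real PagerDuty API):

    import time
    from collections import deque

    THRESHOLD_SECONDS = 0.3  # "API response time > 300ms"
    WINDOW_SECONDS = 5       # "... for > 5 seconds"

    _samples = deque()  # (timestamp, duration) pairs from production traffic

    def record_response_time(duration, now=None):
        """Record one response time; alert if every sample in the
        last WINDOW_SECONDS exceeded the threshold."""
        now = time.time() if now is None else now
        _samples.append((now, duration))
        # Drop samples that have aged out of the window.
        while _samples and _samples[0][0] < now - WINDOW_SECONDS:
            _samples.popleft()
        if _samples and all(d > THRESHOLD_SECONDS for _, d in _samples):
            page_someone("API slow: all samples in the last %ss over %dms"
                         % (WINDOW_SECONDS, THRESHOLD_SECONDS * 1000))

    def page_someone(message):
        # Placeholder for the real alerting integration.
        print("ALERT:", message)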


Interesting. My tentative 39,000-foot understanding of Booking.com is that they run what might be thought of as an A/B-test-driven development shop: you know your code is busted when it makes less money. It appears to be working out for them, so there must be something to it.

(Rumor also has it that they're actively hostile towards more traditional automated testing and test-driven development, which I assume is a prime reason they didn't want me to work there, so I can't tell you more...)


I'd tend to agree. I've found many times that when a system is hard to monitor, it's also difficult to manage more generally, and that a rearchitecture is in order.



