I'm a developer who's dealt with heavy operational loads for years. Some random things that help:
- Metric-driven design
-- Define your metrics and success criteria up front, as graphs on a dashboard, and build your system so that it's clear from the dashboard that the system is meeting its success criteria (a sketch of this follows the list)
- Testing
-- Unit test your business logic
-- For every logic bug, write a test which fails, fix the bug, and verify that the test succeeds (regression test; see the sketch after this list)
-- Integration test all your APIs and the "success criteria"
--- try to keep them implementation-agnostic
-- Load test for latency-sensitive components
--- The results of load tests can be published as metrics on your dashboard, since latency is often a success criterion
-- Continuous Delivery pipeline
--- Not only does this improve productivity and stability, it is also super useful for emergency deployments: you know quickly whether your fix is breaking anything, and you can add regression tests quickly
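A minimal sketch of the metric-driven bullet above, assuming the Python prometheus_client library (Prometheus comes up later in the thread); the metric names and thresholds are invented for illustration:

```python
# "Success criteria as metrics": expose them so the dashboard can graph them directly.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram("checkout_request_seconds",
                            "Time spent handling a checkout request")
REQUEST_ERRORS = Counter("checkout_request_errors_total",
                         "Checkout requests that failed")

@REQUEST_LATENCY.time()          # latency success criterion, e.g. p99 < 250 ms
def handle_checkout():
    if random.random() < 0.01:   # stand-in for real business logic
        REQUEST_ERRORS.inc()     # error-rate success criterion, e.g. < 0.1%
        raise RuntimeError("checkout failed")

if __name__ == "__main__":
    start_http_server(8000)      # Prometheus scrapes /metrics on this port
    while True:
        try:
            handle_checkout()
        except RuntimeError:
            pass
        time.sleep(0.1)
```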
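And a sketch of the regression-test workflow from the testing bullets, using pytest; the function and the bug are hypothetical:

```python
# Regression-test workflow: write the test so it fails first, fix the bug,
# verify it passes, and keep the test so the bug cannot silently return.
def apply_discount(price_cents: int, percent: int) -> int:
    """Return the discounted price, rounding down to whole cents."""
    # The original (buggy) version returned the discount instead of the
    # discounted price.
    return price_cents - (price_cents * percent) // 100

def test_apply_discount_returns_price_not_discount():
    # Failed before the fix: the buggy version returned 100 (the discount).
    assert apply_discount(1000, 10) == 900
```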
Before any of this: make sure you can build, deploy, and roll back your software quickly and easily. All too often people leave this to the end, when really it should be the first thing you do.
Then iterate and add all the stuff mentioned above.
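One way to make that concrete, sketched under the assumption of a release-directory-plus-symlink layout; the paths and the build step are placeholders, not a prescription:

```python
# Deploy = unpack into a fresh release dir and atomically swap a symlink.
# Rollback = point the symlink back at the previous release.
import subprocess
import time
from pathlib import Path

RELEASES = Path("/srv/myapp/releases")   # hypothetical layout
CURRENT = Path("/srv/myapp/current")

def _switch_to(release: Path) -> None:
    tmp = CURRENT.with_suffix(".tmp")
    tmp.unlink(missing_ok=True)
    tmp.symlink_to(release)
    tmp.replace(CURRENT)                  # atomic switch on POSIX

def deploy(artifact: Path) -> None:
    release = RELEASES / time.strftime("%Y%m%d%H%M%S")
    release.mkdir(parents=True)
    subprocess.run(["tar", "-xzf", str(artifact), "-C", str(release)], check=True)
    _switch_to(release)

def rollback() -> None:
    releases = sorted(RELEASES.iterdir())
    _switch_to(releases[-2])              # assumes "current" is the newest release
```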
> Liberally sprinkle assertions throughout your code (especially if data quality is more important than resilience)
If you want resilience, make an upper layer that will recover your service on failure, deploy it on redundant hardware, add strict alerts for when recovery is impossible; and still sprinkle assertions through your code.
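A rough sketch of that upper recovery layer; the worker command, restart threshold, and alerting hook are all placeholders:

```python
# Supervisor that restarts the worker on failure and escalates when
# automatic recovery clearly isn't working.
import subprocess
import time

MAX_RESTARTS_PER_HOUR = 5   # invented threshold

def page_someone(message: str) -> None:
    print(f"ALERT: {message}")          # stand-in for a real pager/alerting call

def supervise(cmd: list[str]) -> None:
    restarts: list[float] = []
    while True:
        started = time.time()
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return                      # clean exit, nothing to recover
        restarts = [t for t in restarts if time.time() - t < 3600] + [started]
        if len(restarts) > MAX_RESTARTS_PER_HOUR:
            page_someone(f"{cmd[0]} is crash-looping; automatic recovery failed")
            return
        time.sleep(5)                   # brief back-off, then restart

if __name__ == "__main__":
    supervise(["python", "worker.py"])  # hypothetical worker process
```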
- Set up automatic notification of failure conditions in those metrics. People shouldn't have to stare at and interpret graphs to know something is wrong.
-- and make sure, if those notifications are emails, that they are not triggered every single time the error occurs, potentially flooding you with thousands of emails.
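For example, something along these lines, where the send function and the window length are stand-ins:

```python
# Collapse repeated occurrences of the same failure into at most one
# notification per window, instead of one email per occurrence.
import time

NOTIFY_WINDOW_SECONDS = 15 * 60          # invented: one email per issue per 15 min
_last_sent: dict[str, float] = {}

def send_email(subject: str) -> None:
    print(f"email: {subject}")           # stand-in for a real mail call

def notify_once_per_window(error_key: str, subject: str) -> None:
    now = time.time()
    if now - _last_sent.get(error_key, 0.0) >= NOTIFY_WINDOW_SECONDS:
        _last_sent[error_key] = now
        send_email(subject)              # first occurrence in this window
    # otherwise drop it; the metrics still record every occurrence
```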
What do people use for this kind of thing? I'm aware of Bosun (https://bosun.org/) and Prometheus (https://prometheus.io/). Both can alert based on aggregated metrics, using rich rules such as values moving away from historical averages by a certain threshold.
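Bosun and Prometheus express those rules in their own languages; the underlying check looks roughly like this in Python, with the window and threshold invented for illustration:

```python
# "Alert when a value moves away from its historical average": compare the
# latest sample against the mean and standard deviation of a trailing window.
from statistics import mean, stdev

def deviates_from_history(samples: list[float], threshold_sigmas: float = 3.0) -> bool:
    history, latest = samples[:-1], samples[-1]
    if len(history) < 2:
        return False                      # not enough history to judge
    spread = stdev(history) or 1e-9       # avoid dividing by zero on flat data
    return abs(latest - mean(history)) / spread > threshold_sigmas

# Example: a latency series that suddenly jumps.
print(deviates_from_history([102, 99, 101, 100, 98, 250]))  # True
```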
A quick plug for Pushover, which has completely changed how I deal with alerts like that. I used to have them sent to me via email, but now the critical / urgent stuff actually buzzes my phone & their API is really easy to use:
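A minimal sketch of what a Pushover call looks like with the requests library; the token and user key are placeholders you get from your own account:

```python
# Send a push notification via Pushover's message endpoint.
import requests

def push_alert(message: str, title: str = "prod alert") -> None:
    response = requests.post(
        "https://api.pushover.net/1/messages.json",
        data={
            "token": "APP_TOKEN_HERE",    # placeholder application token
            "user": "USER_KEY_HERE",      # placeholder user key
            "title": title,
            "message": message,
        },
        timeout=10,
    )
    response.raise_for_status()

# push_alert("disk usage on db-1 above 90%")
```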
Stage 4. Acknowledge that your stuff (and software in general) is now and always buggy somewhere, because it's just too complex for humans to write correctly, and whatever your efforts it will malfunction anyway.
State 5. Never change anything or do any useless activities or upgrades unless you really have no other choice and there's clear demonstrated value in the change.
Stage 6. Get someone else to do it for you. Like the silly enthusiastic junior dev who is willing to sacrifice his weekend for some stupid problem.
Stage 7. Switch career to something sensible such as a baker.
> State 5. Never change anything or do any useless activities or upgrades unless you really have no other choice and there's clear demonstrated value in the change.
State 6. Realize that painful steps need to be automated and repeated often. Then each upgrade has far fewer changes, and it's easier to locate the source of errors.
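A small sketch of automating that check so drift stays visible, using pip purely as the example package manager:

```python
# Run this on a schedule to see how far dependencies have drifted,
# so each upgrade stays small and frequent.
import json
import subprocess

def outdated_packages() -> list[dict]:
    out = subprocess.run(
        ["pip", "list", "--outdated", "--format=json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)

if __name__ == "__main__":
    for pkg in outdated_packages():
        print(f'{pkg["name"]}: {pkg["version"]} -> {pkg["latest_version"]}')
```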
I'll second this. At one of the places I currently work, we have the same codebase split between Node.js 0.10.40 and Node 6.2.something, because the devs were scared of the incremental changes and of having to fix small things along the way. It's a nightmare now, and transitioning the final pieces off is proving to be a massive wallop of tech debt.
If you have a team that's actively working on the code, incremental upgrades to the dependencies should be pretty comfortable. But if you've got some legacy thing that no one really knows, that's when you need to start being careful about updating little bits.
I once worked with a CIO that had a policy that we should be no more than one version behind current on software. The idea was that most vendors will have an easy migration plan to current, but beyond that you usually had to upgrade twice with all the attendant pain.
It's not fun, but as a consequence you don't get stuck with something that you absolutely can't maintain.
< State 5. Never change anything or do any useless activities or upgrades unless you really have no other choice and there's clear demonstrated value in the change.
---
> Stage 5. Never change anything or do any useless activities or upgrades unless you really have no other choice and there's clear demonstrated value in the change.
"a much more experienced SRE later came in to work with the team on making the same service operate better, and I got to see what he did and what his process for improving things looked like"
Operations is an expertise that you can develop, or you can decide that it isn't what you want to be doing. Just like machine learning, games, databases, or operating-systems internals: everybody can tackle a tiny project, but only a minority of people will like it and be good enough at it to make it their life's work.
Ops is definitely something you want to learn about from others - you don't want to be learning all the hard lessons yourself, particularly ones about robust data storage...
There's a lot of "doing things the Ops way" stuff that's clearly super-helpful, if not essential, if you're seriously shooting for "five nines". On the other hand, I've seen it go wrong: for example, redundant servers behind a load balancer yielding dramatically worse real-world reliability than a single instance of the backend service managed on its own (while the load balanced setup was also a nightmare to debug...).
Are there any good guides that focus less on "best practices" and more on trade-offs (especially from the point of view of someone who's more interested in a simple route to 4-nines than going much beyond that)?
Build with distribution and fault tolerance in mind. You should be able to knock out a random server and still run with no visible effect.
Monitoring, with alerts when certain thresholds are broken.
Build a continuous delivery pipeline, and keep working on it until you're so confident that deployments become non-events that can happen at any time. See automated testing.
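A sketch of the "knock out a random server" point: try each redundant backend with a short timeout rather than depending on any single instance; the URLs are placeholders:

```python
# Client-side failover across redundant backends.
import requests

BACKENDS = [
    "http://app-1.internal:8080",   # hypothetical redundant instances
    "http://app-2.internal:8080",
    "http://app-3.internal:8080",
]

def fetch_with_failover(path: str) -> requests.Response:
    last_error = None
    for base in BACKENDS:
        try:
            response = requests.get(base + path, timeout=2)
            response.raise_for_status()
            return response             # first healthy backend wins
        except requests.RequestException as exc:
            last_error = exc            # this instance is down; try the next
    raise RuntimeError("all backends failed") from last_error
```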
- Logging
-- Be liberal with logging
-- Log internal state changes (cache evictions, refreshes, connections opening/closing, etc.)
-- Add lots of logging in complicated business logic (there will be bugs, and logs will make them easy to find; see the sketch after this list)
- Liberally sprinkle assertions throughout your code (especially if data quality is more important than resilience)
- Be conservative with dependencies
-- Think carefully before using that new sexy library
--- Do the authors make backwards-compatible changes?
--- Is it well-maintained?
--- Is it likely to be abandoned?
-- Think carefully before coupling systems together with code dependencies
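A short sketch of the logging bullets above, using the standard logging module; the cache class and field names are invented:

```python
# Log internal state changes (misses, evictions, refreshes) so that later
# debugging doesn't require reconstructing what the cache was doing.
import logging
import time

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(name)s %(message)s")
log = logging.getLogger("cache")

class TtlCache:
    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self.store: dict[str, tuple[float, object]] = {}

    def get(self, key: str):
        entry = self.store.get(key)
        if entry is None:
            log.info("miss key=%s", key)
            return None
        stored_at, value = entry
        if time.time() - stored_at > self.ttl:
            log.info("evict key=%s age=%.1fs", key, time.time() - stored_at)
            del self.store[key]
            return None
        return value

    def put(self, key: str, value) -> None:
        log.info("refresh key=%s", key)
        self.store[key] = (time.time(), value)
```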