What I Learned Managing Site Reliability for Some of the Busiest Gambling Sites

jedberg · on April 4, 2017

I'm going to have to respectfully disagree with a big chunk of this article. Documentation is generally a waste of time unless you have a very static infrastructure, and run books are the devil.

You should never use a run book -- instead you should spend the time you were going to write a run book writing code to execute the steps automatically. This will reduce human error and make things faster and more repeatable. Even better is if the person who wrote the code also writes the automation to fix it so that it stays up to date with changes in the code.

At Netflix we tried to avoid spending a lot of time on documentation because by the time the document was done, it was out of date. Almost inevitably any time you needed the documentation, it no longer applied.

I wish the author had spent more time on talking about incident reviews. Those were our key to success. After every event, you review the event with everyone involved, including the developers, and then come up with an action plan that at a minimum prevents the same problem from happening again, but even better, prevents an entire class of problems from happening again. Then you have to follow through and make sure the changes are actually getting implemented.

I agree with the author on the point about culture. That was absolutely critical. You need a culture that isn't about placing blame but finding solutions. One where people feel comfortable, and even eager, to come out and say "It was my fault, here's the problem, and here's how I'm going to fix it!"

wink · on April 4, 2017

There's stuff where you can either spend weeks or months automating it - or live with the fact that your oncall engineer has to use a runbook a few times per year to fix the problem.

Also your Netflix example - how many people do you have there? Probably the smallest of teams is bigger than my company's whole engineering department. We're running a whole company and not a few services. (I'm absolutely not trying to discourage what you're doing - but I strongly feel it's a different ballgame.) The smaller your team with too many widespread responsibilities (not: make THIS service be available 99.9%, but make ALL services available 90%, then 95%, ...) the more you only automate what happens frequently. And yes, we try to first a basic runbook when something happens - and only if the same thing happens repeatedly, we automate it.

jedberg · on April 4, 2017

We were doing this with a team of sometimes one or two reliability engineers, but we were cheating, because our company culture meant that the engineers who built the systems are responsible for keeping them running, so they would invest their engineering time in fixing the problems along with us.

I personally found that runbooks were even worse for small size teams (like our four person reddit team) because they would get out of data even quicker than at the bigger places due to the rapidly changing environment.

I wrote down thread that if all of your deployment is automated than it is much easier to automate remediation, because you just change your deployment to fix the problem, as long as you can redeploy quickly.

wink · on April 4, 2017

I agree on one level (literally, not figuratively) - if you have awesome (or even bearable :P) deployment for your software stack. We can do that.

On the other level, there's stuff we see as core infrastructure (for example Hardware, or even some parts of OpenStack running on that hardware) - of course there are also downtimes and emergencies and dumpster fires - but they are pretty much unique little snowflakes and the repeating ones happen a few times per year. There simply is nothing "to deploy". Maybe one can argue that "runbook" is not 100% correct, sometimes it's a runbook including debug info.

But it's not turtles all the way down, and I stand to my point there's stuff where the cost to benefit ratio totally ends up at "automate it away" and there's other stuff where it's the opposite.

toomuchtodo · on April 5, 2017

> We were doing this with a team of sometimes one or two reliability engineers, but we were cheating, because our company culture meant that the engineers who built the systems are responsible for keeping them running, so they would invest their engineering time in fixing the problems along with us.

What advice would you give for an org where the engineers who build systems are not responsible for keeping them running, and everyone on a (much smaller comparatively) infrastructure team is (which is slowly turning into an SRE team by necessity)?

Anecdotally, I've found documentation to be useless; despite documentation being of a high quality, no one refers to it, even after iterating to continue to add information, make it more relevant, streamline, etc.

jedberg · on April 5, 2017

My advice would be to push as hard as you can to change the culture, or you'll be drowned. Engineers will not make it a priority to fix anything that causes outages because they will be evaluated on feature velocity, not uptime.

If you can make the company culture focus on uptime, or get engineers involved in remediation, then you'll be better off.

If you can't do that, try to at least push for the Google model: The engineers are responsible for uptime of their product until they can prove that it is stable and has sufficient monitoring and alerting, and then they can turn it over to SRE, with the caveat that it will go back to the engineers if it gets lower in quality.

j_s · on April 5, 2017

Or push to change the culture to match the one given in the article, where documentation is important and kept updated?

toomuchtodo · on April 5, 2017

Thanks for taking the time to reply, really appreciate the guidance.

chimeracoder · on April 5, 2017

> What advice would you give for an org where the engineers who build systems are not responsible for keeping them running

I don't think that's a healthy or sustainable culture for a company, certainly not one that's expecting to grow.

It might be sufficient for a company that has a small technical team and isn't looking to grow (think: the "tech department" for a company in another industry), but not for a company where engineering or technology is the primary focus.

toomuchtodo · on April 5, 2017

Thanks for the reply!

bbrazil · on April 4, 2017

> I personally found that runbooks were even worse for small size teams (like our four person reddit team) because they would get out of data even quicker than at the bigger places due to the rapidly changing environment.

My experience is that you can get regularly updated runbooks, but they're at the wrong level of abstraction.

They'll discuss some odd one-off failure that happened to trigger a given alert, rather than the general class of problems that this alert is trying to catch.

These days I consider looking at a runbook to be an act of desperation, that'll I'll only perform after attempting to debug from first principles.

problems · on April 4, 2017

Depends on the exact case, but I've managed to convert many-step deployments into single-step ansible playbooks in not much longer than it takes to run them once.

Even at our super tiny 3-person operation this has saved a massive amount of time - it's even more valuable when it's panic time because a server went down due to disk or ISP failure and you know you have an ansible script that can get a new server up in 3 minutes flat while you grab the backup.

BuuQu9hu · on April 4, 2017

Documentation is the only defense against tribal knowledge. If your turnover is low, you might not notice this, but that doesn't mean that it isn't biting you.

Not all processes can be automated today. Some can only be automated tomorrow. Some can only be automated after some blocking functionality is added. Documentation is how you plan your automation.

emorse · on April 4, 2017

As far as this discussion goes, I think that code is documentation. If you need to figure out how something works, going directly to the automation code will give you the right answer. If the code is difficult to understand, that's more of a quality issue, and the same problem happens with word documents.

chimeracoder · on April 5, 2017

> As far as this discussion goes, I think that code is documentation.

"Code as documentation" can be an okay answer if the question is "what is the behavior of this system"? But it's a bad answer for the question "what is the intended behavior of this system, and what assumptions does it run under?"

Looking back at a piece of code months (or years) later and not knowing if a particular edge case that you're running into was actually the intended function or not is not particularly fun.

The power of writing documentation is not just in the end product; it's that it serves as a forcing function for the developers to confront their own thought processes and make them explicit. It's possible to write code that makes all of its assumptions explicit and clearly states its contracts up-front, but in practice, it almost never happens without significant external documentation (whether that comes in the form of explicit code docs, whitepapers, or ad-hoc email threads and Slack conversations that need to be pieced together after-the-fact).

rhizome · on April 5, 2017

One of Fred Brooks's principles is that the manual is the spec, that there should not be any behavior in the software that is not in the documentation, and the documentation comes first.[1] A kind of old-school TDD.

1. https://en.wikipedia.org/wiki/The_Mythical_Man-Month#The_man...

robomc · on April 5, 2017

Code as documentation only works if all your infrastructure is managed programatically, there are good tests, the code is good, and overall complexity is low.

I mean, if your downstream partner calls up to say that the custom analytics feed from your xyz service is returning null data, but not erroring, and the guy that implemented that feed (with rolled eyes, thinking it was an inelegant hacky concession to a noisy customer) left in 2015, where do you even start? How much code from how many codebases and configuration management repos are you going to have to read through just to work out what kind of problem you've got?

Some type of high level documentation - what services / products exist, what infrastructure does each use, how is each one managed and tested, what is worrisome about it or what has tended to go wrong in the past - is going to help a lot.

mianos · on April 4, 2017

In the nineties.

zwischenzug · on April 4, 2017

'Documentation is generally a waste of time unless you have a very static infrastructure'

I definitely agree with that, and it's partly a corollary of 'documentation is expensive and requires costly maintenance'.

Run books/checklists are mostly implemented really really badly.

Automation is the ideal, but is costly, and itself requires maintenance.

Most of the steps we had to perform did not lend themselves to automation, also.

jedberg · on April 4, 2017

> Automation is the ideal, but is costly, and itself requires maintenance.

I would contend that the cost of automation is about the same as the cost of documentation plus the cost of having to manually do the work over an over. It's just a cost borne up front instead of over time. But to your point in the article, you have to have a culture that supports bearing that up front cost.

> Most of the steps we had to perform did not lend themselves to automation, also.

I don't understand how that is possible? Could you give an example of a task that can't be automated?

zwischenzug · on April 4, 2017

Your arguments contradict themselves - if the infrastructure is changing then there's little point spending the time automating your response to its failure.

Most attempts I've seen to automate tasks flounder for this reason.

Also, there's no point automating something that happens once a year - the cost will exceed the benefit. Hence the comments about decided which tasks to automate with a backlog and metrics.

Most anything _could_ be automated, eg 'Phone the customer and ask them to engage their network team', 'look for similar-looking strings', 'tcpdump the port, and see if the output is irregular', but is it really worth the effort to? That's where the backlog came in.

BTW I wrote a whole framework for automation

ianmiell.github.io/shutit/

and it's taught me a lot about the hidden costs of automation...

jedberg · on April 4, 2017

> Your arguments contradict themselves - if the infrastructure is changing then there's little point spending the time automating your response to its failure. Most attempts I've seen to automate tasks flounder for this reason.

If everything in your infrastructure is deployed with code, then automation is simply the act of making sure the infrastructure matches the deployment described in the code. It's true that automation is just as difficult if deployments are manual. Then remediation becomes changing the deployment code to fix a previously unknown problem, instead of manually fixing a problem. This then gains you the advantage of the problem being fixed in perpetuity.

So they aren't in contradiction if your remediation and deployment are the same process, because then it by definition is always up to date.

I guess I should add the caveat that deployment should be quick enough to solve problems via redeployment.

eropple · on April 4, 2017

FWIW, I agree with your overall notion of where one should go, but it's worth noting that you're describing an extremely optimistic world for probably 90% to 95% of shops. Which is not to say that it's a bad idea at all to get there, but incremental steps are important and they're what I feel are being described in this article (setting aside that that kind of pervasive automation wasn't really a thing outside of bleeding-edge shops at the time he was describing).

Shameless plug for people who are feeling inferior while reading this thread: as it happens, my consulting gigs revolve pretty heavily around doing exactly what 'jedberg has described, and it's super worth getting to that point. ;) Email's in the profile if you're looking to get there.

geofft · on April 4, 2017

Most of the environments I've worked at have been so far from being deployable with code that it would take you significant time and expense to get there, and in the meantime, you need to keep everything working. You need a reliability engineering team that knows how to triage problems and keep the existing system working while they automate what is feasible to automate. I'd probably agree that this is a separate mindset and skillset from keeping an infrastructure-as-code deployment running, and it's unfortunate that we don't have good terms to differentiate them. A lot of the SRE mindset applies, but not all of it.

For instance, if you've got machines at customer sites where you manage the software, commercial filers or appliances where you don't really have a shell and you certainly can't spin up a VM, etc., your machines aren't fungible. You can't just redeploy a machine, and just about every problem on the machine is going to be a new one. You want the runbook so that it enumerates how to stop and start things around a human intervention (and yeah, as much of that as possible should be automated), but you can't automate the whole thing without changing your architecture.

If you have the option of starting with an architecture where you don't have these problems, by all means take that option! Maybe do cloud-hosted SaaS instead of maintaining on-prem software for customers, or use some fancy cloud storage API instead of a physical old-school filer. But people who have the old-school architectures need to keep things running smoothly, too.

munro · on April 4, 2017

> Your arguments contradict themselves - if the infrastructure is changing then there's little point spending the time automating your response to its failure.

OP said automating runbooks, and any fixes to production--so the code becomes the documentation of how to deploy, and what exactly was done to remedy an error. OP did not say automating failure responses.

Which brings up OP's proposed question...

>> I don't understand how that is possible? Could you give an example of a task that can't be automated?

Could you give a concrete example? There's a vague reference in the article about automating responses for encoding errors which sounds interesting. It sounds like the system generates lots of server errors, and it's easier/cheaper to communicate them to customers directly instead of making code fixes.

ekimekim · on April 4, 2017

> I don't understand how that is possible? Could you give an example of a task that can't be automated?

"Check the logs for service X (they're here <link>) and look for anything related to the issue"

"If the user impact is high, write an update to the status page detailing the impact and an estimated time to recovery"

The value of a runbook is that it can make use of human intelligence in its steps. No-one is arguing that you shouldn't be automating things like "if the CPU usage is > 90%, spin up another instance and configure it".

jedberg · on April 4, 2017

> "Check the logs for service X (they're here <link>) and look for anything related to the issue"

I have a long missive about how logs are useless and shouldn't be kept, but that's for another time. I'll summarize by saying that if you have to look at logs, then your monitoring has failed you.

> If the user impact is high, write an update to the status page detailing the impact and an estimated time to recovery"

I guess technically that would be a step in a runbook, that's fair. Although in my case that was left to PR to do based on updates to the trouble tickets. :)

> The value of a runbook is that it can make use of human intelligence in its steps

I'd rather human intelligence be spent on triage by reading the results of automated diagnosis and coding up remediation software than on repeating steps in a checklist.

Sure, there are uses for checklists of things to check, but even that should be automated through the ticket system at the very least, which I no longer consider a runbook, but I guess some might still consider that a runbook.

always_good · on April 4, 2017

    > I have a long missive about how logs are useless
    > and shouldn't be kept, but that's for another time.
    > I'll summarize by saying that if you have to look 
    > at logs, then your monitoring has failed you.

Eh, I'm extremely wary of people who think they have a silver bullet to everything and can speak in absolutes like this.

I have to say, you have a tendency on HN to chime in from the peanut gallery and be a bit unrelenting and even combative because jedberg does things differently.

jedberg · on April 4, 2017

> I have to say, you have a tendency on HN to chime in from the peanut gallery and be a bit unrelenting and even combative because jedberg does things differently.

That's a fair critique, and thank you for pointing it out. I try to always back up what I say with the reasons for what I say, but sometimes I get lazy or don't have time to write it all out. I too worry about folks who speak in absolutes, although in this case I happen to actually believe it.

The medium isn't always the best way to have a deep technical discussion unfortunately.

icebraining · on April 4, 2017

I have a long missive about how logs are useless and shouldn't be kept, but that's for another time. I'll summarize by saying that if you have to look at logs, then your monitoring has failed you.

How does that work?

jedberg · on April 4, 2017

First we have to make the distinction between logs and metrics. Logs are unstructured or loosely structured text, whereas metrics are discrete items that can be put into a time series database.

If you emit metrics as necessary to a time series database, then you should be able to build alerting based on the time series metrics. Your monitoring systems should be good at building alerts based on a stream of metrics and visualizing the time series data.

Sometimes you might have to look at the visualizations to find something, but ideally you then set up an alert on the thing you looked at so you have the alert for the next time it happens. A great monitoring system lets you turn graphs into alerts right in the interface, so if you're looking at a useful graph you can make an alert out of it.

Sometimes logs can be useful, but only after your monitoring system has told you which system is not behaving, and then you can turn on logs for that system until you've solved the problem, but you shouldn't need access to old logs, because if the problem was only in the past, then it's not really a problem anymore, right? If you have an ongoing problem, then maybe have the logs on for that service while you're investing that problem, but then turn them off again.

But having a ton of logs always generating and being stored tend to be fairly useless in practice with a good time series database at hand.

vacri · on April 4, 2017

Logs have a much, much, much lower barrier to entry than a fully-complete time-series monitoring system that covers everything.

Likewise, turning logs on only after you've seen a problem means you miss out on troubleshooting the root cause of it - if there was a spike of badness this morning but you don't have logs for it, you're missing out on diagnostic information that may have protected you from repeats of that spike in future.

I've also had business guys want to analyse things like access logs in ways that they didn't know previously. Logs provide a datastore of historical activity, which in smaller shops is a cheap data lake.

Perhaps the 'no logs' thing works for your setup, but I think it's bad general advice. And your position is not that logs are useless ("turn on logs for that system until you've solved the problem"), but that retaining logs are useless - quite a significant difference between the two.

jedberg · on April 4, 2017

> And your position is not that logs are useless ("turn on logs for that system until you've solved the problem"), but that retaining logs are useless - quite a significant difference between the two.

That's an important distinction, one that I agree with, and I should make clearer.

Logs do have a purpose, but I'm not sure that retaining them does.

Sure, for a very small shop, throw them on a disk, use awk, sed, grep, and perl to look through them, and call it a day. But once you get to the point of "spinning up a cluster of log servers" or something like it, I'd say you're probably better off investing in monitoring instead.

jsmthrowaway · on April 4, 2017

Having a corpus of live "test" requests for rebuilding an API is a compelling enough use case to justify infinite retention, in my opinion. Perhaps relevantly to your prior gig: "did we handle that weird Sony Bravia API call from five years ago in this redesign? Yes we did." I'm implying a little more structure to logging, though, because it sounds like you're thinking about random strings emitted from binaries (which I think are also approachable, to an extent). This overlaps with PII and such, though, and there is no one-size-fits-all answer.

More than once I've run the entire corpus of requests to a system, ever, through a dummy rebuild as a pretty great integration test. It's a powerful SRE tool. Spelunking through all historical data is just icing on that cake, honestly. As the author says, SRE is basically just an information factory; I'd be haaaaaard pressed to agree with you on throwing away a lot of information -- you don't know what you don't know until you want to know it -- and betting all-in on monitoring. Retaining logs is not the hardest problem SRE deals with, either, but SREs turn around and force unrealistic latency requirements on the query side (I see a lot of ELK deploys running into this).

You have to look at it as an Oracle. Oh, great Oracle of a pile of meaningless logs, cook off this map/reduce and tell me an interesting number that I can put in a Keynote for executives. Definitely not dashboarding from logs data. That's an impedance mismatch that Google gets away with because of the nature of their logging.

user5994461 · on April 5, 2017

Splunking is cool.

Nonetheless, splunk is the most expensive software license on the planet. More than Oracle, yes.

jsmthrowaway · on April 5, 2017

I meant spelunking, where they got their name. But yes, you're not wrong. (I've heard of $millions.)

zwischenzug · on April 5, 2017

Some envs require logs to be kept for regulatory purposes.

Interestingly, we never used a cluster of log servers. I was always skeptical of their utility. It was grep, plus some hand-rolled utility scripts to interrogate. One was a thing of beauty I spent years on, which saved us a ton of time.

user5994461 · on April 5, 2017

> Logs have a much, much, much lower barrier to entry than a fully-complete time-series monitoring system that covers everything.

A monitoring system has a lower barrier to entry.

http://datadoghq.com/ => will do ALL of that and much more. You can deploy it in a few hours to thousands of hosts, no problem.

Direct competitor: http://signalfx.com/

Have no money to pay for high quality tool? graphite + statsd will do the trick for basic infrastructure. However it's single host, doesn't scale and only basic ugly graphs are supported.

qohen · on April 5, 2017

only basic ugly graphs are supported.

That's what Grafana [0][1] is for -- i.e. creating nicer displays for Graphite.

However it's single host, doesn't scale

It may take some effort, but it can be done, and much of the heavy-lifting seems to have been done and been made available as open-source.

Here's a blog post from Jan. 2017 [2] from a gambling site about scaling Graphite.

And here's a talk [3] from Vladimir Smirnov at Booking.com from Feb. 2017 about scaling Graphite -- their solution is open-source (links in the talk and slides available at the link):

This is our story of the challenges we’ve faced at Booking.com and how we made our Graphite system handle millions of metrics per second.

(And this [4] is an older, but more comprehensive, look at various approaches to scaling Graphite from the Wikimedia people with the pros and cons listed).

[0] https://grafana.com/

[1] https://github.com/grafana/grafana

[2] http://engineering.skybettingandgaming.com/2017/01/13/graphi...

[3] https://fosdem.org/2017/schedule/event/graphite_at_scale/

[4] https://wikitech.wikimedia.org/wiki/Graphite/Scaling

user5994461 · on April 5, 2017

Funny. Quoting companies from my home town.

It can be done but at what costs? Better get a tool that gets the job done out of the box and does it well.

For starters, if you're operating in the cloud, you cannot get servers with FusionIO drives and top notch SSD. That limits your ability to scale vertically.

vacri · on April 5, 2017

Datadog for thousands of hosts costs tens of thousands of dollars per month. That is not a low barrier. But if you're arguing for external vendors, then Papertrail is to logs what Datadog is to metrics. And having tried both Papertrail is easier and quicker to set up (though Datadog isn't difficult).

Similarly, basic infrastructure health is not giving you the same sort of information (what the software is actually doing) that logging does. In order to do time-series monitoring of your software rather than your system, you need to spend time thinking about what metrics you need to track and how you're going to obtain them.

I run both an ELK stack and a Prometheus stack, and I find they're good for different things.

otisg · on April 6, 2017

Any modern monitoring service will provide you with volume discounts if you have hundreds or thousands of servers.

Since you mention Papertrail specifically in the context of costs - Papertrail is actually a bit pricey relative to the competition. For example, compare https://papertrailapp.com/plans to https://sematext.com/logsene/#plans-and-pricing . I think Sematext Logsene is 2-3 times cheaper than Papertrail.

Lastly, I was at a Cloud Native Conference in Berlin last week. A lot of people have the same setup as you - ELK for logs + Prometheus for metrics. We're running Sematext Cloud where we ship both our metrics and our logs, so we can easily switch between metrics and logs much more easily, correlate, and troubleshoot faster. Seems a bit simpler than ELK+Prometheus...

user5994461 · on April 5, 2017

thousands of hosts costs hundreds of thousands of dollars. a few thousands more is a low barrier.

Sumologic, papertrail, logentries (and many more) for cloud logs. Graylog or ELK or Splunk for self hosted logs.

However, logs should never be send to the cloud, they are too sensitive information to outsource. Server metrics + stats are more reasonable.

Agree, stats and logs cover different things. Need both.

otisg · on April 6, 2017

Both metrics and logs are needed for the full picture (plus alerts and a few other things, like RUM and error/exception capture, both on front-end and backend).

Saying that logs should never be sent to the cloud is overly absolute. Some logs should indeed stay behind firewall, but lots of organizations have logs that can be shipped out to services whose features derive all kinds of interesting insights from logs.

jakozaur · on April 6, 2017

Sumo Logic does both logs and metrics at reasonable cost. The log part is state of the art, the metric is recent addition.

Discalimer: I work at Sumo Logic and enjoy it.

spyrosg · on April 4, 2017

Given the low cost of long-term storage, if you have any auditing requirement, it's worth it to store lots of logs.

jedberg · on April 4, 2017

What about the cost of moving the logs to storage, and the infrastructure required to move them around and put them in storage? Especially if you have a micro services architecture.

Also the cost of the infrastructure to search the logs and view the logs.

Zalastax · on April 4, 2017

Remember that there are different sizes of deployments. For sites like Reddit or Netflix, logs becomes a burden since there's so much data. For smaller deployments it's possible that all logs can be aggregated on a single machine quite easily, and aggregating the logs is far more enjoyable than SSH:ing to each separate machine.

> Sometimes logs can be useful, but only after your monitoring system has told you which system is not behaving, and then you can turn on logs for that system until you've solved the problem, but you shouldn't need access to old logs, because if the problem was only in the past, then it's not really a problem anymore, right?

Some things happen rarely, but can still have large impact. E.g. Imagine a once a day job of moving files which fails twice a month, rendering those files inaccessible.

otterley · on April 5, 2017

Many shops don't keep logs around because we want to. We keep them because we have to. You might see things differently someday if you host data for customers and have to follow SOC2 and HIPAA requirements...

deskamess · on April 4, 2017

Automated diagnosis sounds like a time saver... What is your source material for automated diagnosis? i.e., how was it trained? Are there cases where an automated diagnosis could not be made for an incident and if so was manual recourse possible? How would you retrain the 'diagnosing app' to handle the new case?

jedberg · on April 4, 2017

> What is your source material for automated diagnosis? i.e., how was it trained?

Incident reviews. If something happened that wasn't covered, then it is added as an outcome of the incident review.

> Are there cases where an automated diagnosis could not be made for an incident and if so was manual recourse possible?

For sure. Manual recourse was to dig in and figure it out either with the command line or the monitoring system or whatever else.

> How would you retrain the 'diagnosing app' to handle the new case?

In most cases the "diagnosing app" was a dashboard on the monitoring system, with a set of relevant graphs, so you would add a new graph. There was also a tool that correlated graphs, so you could add a new graph and correlation.

gdulli · on April 4, 2017

> I'm going to have to respectfully disagree with a big chunk of this article. Documentation is generally a waste of time unless you have a very static infrastructure, and run books are the devil.

I estimate there's maybe 25% of the industry that's ignorant of "best practices", 70% that follow them dogmatically, and 5% that use them as guidelines, evaluating each situation on its own merits and choosing what makes sense.

I feel like it's more frustrating to deal with the 70% than the 25%. For some people documentation can only sound like a virtuous thing. But documentation can do harm as soon as it gets out of sync with reality.

jsmthrowaway · on April 4, 2017

> You should never use a run book -- instead you should spend the time you were going to write a run book writing code to execute the steps automatically.

While I hesitantly agree despite it seeming counterintuitive (you make a good case), I'd contend the code can take a form that looks runbook-y. I've had success in my organizations with Jupyter notebooks with documentation mixed throughout. Sometimes you do need a human, and in those cases having the documentation update live with the state of the world was huge for comprehension, particularly when you're centrally executing the notebook. Each step is something like:

> 0) Blurb about what's going on, warnings, etc.

> 1) Call into your automation code with one well-named entry point, like reboot_all_frontend_servers().

> 2) Display the relevant results immediately under that cell.

Then you can step, yadda yadda. Idempotency of the steps is key. I have a vision for the operations bible taking such a form, with each thing-ops-needs-to-do represented with a notebook, but that might be unattainable -- you might be correct about that. Even still, mixing documentation and code seems to potentially push that barrier just a little further back. As a few people told you elsewhere in this thread (and, I think you know), complete automation is a big ask outside of a subset of maybe a dozen valley companies, even at a small property level. Until then, giving humans the tools to reliably do their job, like mechanized dynamic (not static) documentation, might be useful enough to not discard entirely.

vacri · on April 4, 2017

> Documentation is generally a waste of time unless you have a very static infrastructure

I find that while documentation suffers from 'detail rot' quite fast, it does help with higher-level stuff like the rough outline of how things are organised and where they live, or why decisions were made to do -foo-.

joncrocks · on April 4, 2017

I have a question. The fact that an incident review occurs suggests that there was some kind of incident, and one that presumably wasn't anticipated and automatically handled.

What does one do in that kind of situation where one has a novel problem and needs to do something about it?

Should the organisation keeps all of the details of how to do things and how things work in their heads (and presumably ensuring that enough people are available whenever you might have an issue), or work out how things work/what does what on the fly? (or maybe a mix of the two?)

jedberg · on April 4, 2017

Well, ideally the person who wrote the code is fixing it, and should know how it works. Or at least a close teammate.

But if you don't have that luxury, then it means you're hoping that the documentation is up to date, but chances are that if you don't have enough resources to have on call coverage from someone very familiar with the area that's having a problem, then there is a good chance they didn't have time to write or update documentation.

So ideally the person who wrote it has it in their head, or you're figuring it out on the fly and hoping you have good comments in the code and good metrics.

zwischenzug · on April 6, 2017

> Well, ideally the person who wrote the code is fixing it, and should know how it works. Or at least a close teammate.

What if the code is 15 years old, and everyone involved with writing it has left?

closeparen · on April 4, 2017

A skilled operator reading a runbook would seem less likely to exacerbate the mess than a naively written script. The operator can think, "wait a minute, this isn't right." Maybe they won't when they're sleeping off a night of drinking at 3am, but they could. The Python interpreter can't.

To make the script less naive, you'd need a development environment with the same infrastructure components as production, and the freedom to make them fail so you can develop your script against their failure states.

dastbe · on April 4, 2017

Then have that environment? If you know what you need and it isn't unrealistic, then get/make it.

Spooky23 · on April 4, 2017

I'd argue that your incident review/debrief process is where you are doing your documentation after the fact.

aanm1988 · on April 5, 2017

I'm working on a system with a stupid amount of poorly documented/undocumented moving parts. It's all tribal knowledge and blocking trying to hunt down the right person to bug.

Thaxll · on April 4, 2017

Documentation is really important especially for systems that will run for years, you will get new employee that have no ideas about how things work.

CodeWriter23 · on April 5, 2017

In my experience, trusting five year old documentation is a huge risk in and of itself. Having been in the position you put forth, you can use the documentation as a road map of the developer or architect's intent at a certain point in the past, but you have to start reading code in case it was changed and the documentation was not updated.

zwischenzug · on April 6, 2017

Which is why maintaining it is a critical investment.

CodeWriter23 · on April 7, 2017

How do you know it is up to date? The only thing you can assume is it isn't.

zwischenzug · on April 10, 2017

If the process is correct, and properly followed, then it must be up to date.

otterley · on April 5, 2017

What about making updated documentation a required gate for the release process? Wouldn't that keep it from going stale?

mfonda · on April 4, 2017

I'm a big fan of checklists as well; however, working through a checklist is still a manual process with room for human error. I've gone down the road of creating checklists, then realized that many of the items would be much better automated. For example, suppose a list item was "Ensure command X is called with arguments -a, -b, and -c, then command Y is called". This could all be wrapped into a simple script that calls these commands, eliminating the potential for human error. I've found that as I create checklists, they often turn into a list of things I really need to automate.

Cyranix · on April 4, 2017

Definitely agree — and that's not a knock against checklists.

Externalizing a process that previously existed only in someone's head is a win. Creating a checklist is a straightforward framework for such externalizing, and allows you to separate the question "how do I accomplish my goal?" from the question "how do I express this with code?" (like writing pseudocode before real code).

Whether a process is automated or manual, there's always room for error if the surrounding context changes. When an automated process is annotated _like_ a checklist, I find that I get the best of both worlds: minimal affordance for human error with a clearly described thought process to fall back on in the event of a problem.

(It's also not terribly uncommon to have steps that can't be fully automated, like authenticating to a VPN with 2FA under certain security frameworks...!)

MaulingMonkey · on April 5, 2017

> I'm a big fan of checklists as well; however, working through a checklist is still a manual process with room for human error.

I've had at least one checklist where, by following it religiously, human error was rare - and by paying proper attention, what errors did occur were noticeable and correctable.

Handed off the checklist to someone else (at management's request - it was eating up some of my time) and the human error rate went to nearly 100%. Great guy, good at his main job, just not fastidious about checklist discipline. I ended up creating an incredibly brittle, constantly breaking, broken-as-hell time sinking set of scripts to automate the process on our build servers, because that was less work. Sigh.

rb808 · on April 4, 2017

Great article. Documentation is always tough - I still haven't found the right solution. Its nice to see some SRE people that care, most 1st/2nd line support I've worked seem to just escalate anything non-trivial which is frustrating, but also useful.

If SREs are too good developers lose touch with production and get lazy.

zhengyi13 · on April 4, 2017

Reflecting on my time in frontline support, 1st level folk have limited skills, limited resources in terms of time and tooling, and in particular tend to have pretty tight metrics applied to them. They pretty much have to escalate quickly, or they'll be yelled at, or worse.

Some of them flat out don't belong in Support. But poor management and poor metrics drive behavior in unhelpful directions, too.

zwischenzug · on April 4, 2017

Yeah, we were lucky in that we only had engineers with years of experience on call. They had to have had dev experience, too.

This followed an attempt to run a more 'standard' support service (before my time) with specialist support staff. That failed badly, mostly because customers and devs hated that they added little value to the process.

It was a long time before people were allowed to be recruited direct to support again, and by then the leaders were all former devs->tech leads.

elorant · on April 4, 2017

I wish there were more articles like this for sites that are considered gray areas like gambling or porn.

kbenson · on April 4, 2017

For porn, Youporn has had some technical (or technical PR) submissions multiple times over the years.

"Youporn.com is now a 100% Redis Site" (126 comments, 5 years ago)[1]

"How YouPorn Uses Redis: SFW Edition" (95 comments, 4 years ago)[2]

"YouPorn: Symfony2, Redis, Varnish, HA Proxy... (Keynote at ConFoo 2012)" (49 comments, 5 years ago)

There's more in the HN search.

1: https://news.ycombinator.com/item?id=3597891

2: https://news.ycombinator.com/item?id=6137087

3: https://news.ycombinator.com/item?id=3750060

systems · on April 4, 2017

i am always surprised, that gambling sites and similar business with questionable morality

dont face hard time recruiting top engineers , i always though this would be a major concern

emidln · on April 4, 2017

I used to do work in this space. As a freelancer, everything comes down to timeliness and money. Adult sites, in my case, always had money, always paid on time, and always had more work. I don't particularly care what you believe in, but I do care that I get paid for my work. For what it's worth, I don't believe adult affiliate networks I worked for were any more shady than the survey/adtech companies I've worked for.

zwischenzug · on April 4, 2017

In the UK it is less of an issue, because the industry is well-regulated and gambling (esp sports betting) is seen as an acceptable pastime akin to social drinking.

Occasionally people expressed disquiet, but since the main alternative in London is working for banks, there wasn't a great deal of choice. Personally I don't see the excessive advertising we are subject to as much better for society than gambling being available, but hey.

user5994461 · on April 5, 2017

> Occasionally people expressed disquiet, but since the main alternative in London is working for banks, there wasn't a great deal of choice.

Did both. Gambling is very similar to finance.

Turns out that finance pays more and treats their employees better.

zwischenzug · on April 5, 2017

Have done both also, with the same experience... we may have worked for the same orgs :)

SilasX · on April 4, 2017

I was just in London, and it looks like they also plaster any gambling ads with "here's how you get help for a gambling addiction".

It also probably helps that the major player is called "ladbrokes"!

brokenmachine · on April 10, 2017

I saw a documentary about Pokies that suggested that this framing of gambling as being an individual's problem instead of a problem with an industry that is specifically designed to use weaknesses in human psychology to extract the maximum amount of money from them was a deliberate tactic to prevent useful legislation being passed, because it's then up to the gambler with the problem rather than the institution of gambling to control their gambling.

Never mind that those people are literally the least qualified to control their gambling addiction.

If anyone's interested, it was called "Ka-ching! Pokie Nation" http://www.abc.net.au/tv/programs/kaching-pokie-nation/

mdekkers · on April 5, 2017

...working for banks...

Arguably worse than gambling or porn

iand · on April 4, 2017

They often have the most challenging problems around scale, security and performance. The "moral" aspect is a continuum and there are tens of thousands of engineers working in fields that some would call questionable (adtech, intelligence/surveillance, copyright enforcement, credit rating, hyper capitalist companies like Uber etc.)

tomjen3 · on April 4, 2017

I am a person employed at what you would call a gambling site.

I don't have a problem selling a good that is honest and that people are free not to consume. I also wouldn't have an issue with working for a company that provided porn or drugs, when they become legal.

I would have a problem working for an isp and selling peoples data.

taude · on April 4, 2017

As opposed to company like Uber being run by a first-class suspect of a human?

fasteo · on April 4, 2017

Honest question.

Do you think that they have "questionable morality" by the business nature (gambling, porn, etc) or for how they run the business (ex. dark patterns) ?

systems · on April 4, 2017

i had business nature in mind

EnFinlay · on April 4, 2017

Many engineers are in the business of providing dopamine hits to users, most aren't in businesses considered "questionable morality".

Gracana · on April 4, 2017

What's a little gambling compared to developing surveillance technology, capturing and selling customer data, writing missile guidance software, you name it... Our industry excels at solving problems, and "there is a moral dilemma separating me from cash" is a problem we're especially good at solving.

Taylor_OD · on April 4, 2017

I did a bit of recruiting for a, not quite porn, porn site in the past. I'm in the Midwest so that probably affected it but I did run into a decent number of people who didn't want to work in the industry or at that company.

clamprecht · on April 4, 2017

If the NSA can recruit some of the best mathematicians and hackers, I don't think porn/gambling sites should have any problem recruiting.

shanemhansen · on April 4, 2017

I suspect more people on this site wanted to grow up to be code breakers and spies than porn purveyors.

jtreminio · on April 4, 2017

Gambling itself isn't immoral.

solotronics · on April 4, 2017

some people think it is

morality is subjective

akvadrako · on April 5, 2017

If morality is subjective is itself up for debate. I personally think morality is absolute. There is only one system of morality which isn't arbitrary and allowing morals to be arbitrary strips the word of any meaning.

brokenmachine · on April 10, 2017

Do a google search for "moral dilemma" and you will find many examples where it is very difficult to come up with simplistic black and white opinions.

Such dilemmas leave me with the idea that it is very naive to think there is some kind of absolute morality.

akvadrako · on April 10, 2017

That's because you are using an inconsistent set of morals. It's like saying "there is a largest number" and "for every x, there exists the larger number x + 1".

So most things people consider morals cannot all be simultaneously morals. Without self-consistanty you will end up with paradoxes like you mention.

brokenmachine · on April 10, 2017

OK, this is sure to be a waste of time, but I'll bite...

http://psychopixi.com/uncategorized/25-moral-dilemmas/

Let's just start at the top. Please give your absolute morality solution that has "self-consistanty" (sic) to "The trapped mining crew".

> Heather is part of a four-person mining expedition. There is a cave-in and the four of them are trapped in the mine. A rock has crushed the legs of one of her crew members and he will die without medical attention. She’s established radio contact with the rescue team and learned it will be 36 hours before the first drill can reach the space she is trapped in.

She is able to calculate that this space has just enough oxygen for three people to survive for 36 hours, but definitely not enough for four people. The only way to save the other crew members is to refuse medical aid to the injured crew member so that there will be just enough oxygen for the rest of the crew to survive.

Should Heather allow the injured crew member to die in order to save the lives of the remaining crew members?

akvadrako · on April 11, 2017

I'm not going to explain my whole moral system here and derive all the consequences out of it relevant to this scenario.

Maybe if you point out what aspect of my claim you take issue with, I could answer it briefly.

brokenmachine · on April 12, 2017

Well I've already said that I think it's naive to think there could be an absolute morality, and I've also given you a specific example of a situation where I suspect your moral system probably falls down.

You've just declined to offer any explanation of what your system would propose in that situation.

So I guess the aspect of your claim that I doubt is the mere existence of your self-consistent and absolute moral system at all?

akvadrako · on April 12, 2017

Okay, there are 2 parts:

1. The consistency: it's just one rule, so it should be able to drive decisions without much ambiguity.

2. The "absoluteness": if it's unique as the only workable self-consistent moral system, that should suffice. Perhaps there are others, though I suspect they have one of these features:

a. they drive equivalent decisions FAPP

b. they are unstable and eventually won't be able to continue; either the hosts die or move onto another system

I admit I haven't proven this to be the case, but I suspect it's true, or at least something along these lines. Certainly the opposite hasn't been proven.

brokenmachine · on April 18, 2017

>1. The consistency: it's just one rule

...Which is?

jldugger · on April 4, 2017

> The team I joined was around 5 engineers (all former developers and technical leaders), which grew to around 50 of more mixed experience across multiple locations by the time I left.

Unfortunately, it's difficult for the audience to determine whether the author's success is attributable to the philosophy in the post, or the 10x growth in staff.

zwischenzug · on April 4, 2017

The quantity of issues dealt with went up more than 10x over that time, and the experience of the people dealing with them dropped.

Whether that was due to me, the things we did, or other reasons is of course open to debate. This is just my view from where I sat.

stephengillie · on April 4, 2017

The author attributes both his success, and the team's growth, to his documentation philosophy. It does make some sense - improving process documentation for the most common issues will help ensure new team members resolve these issues consistently and effectively, both increasing their utility to the team and their personal morale. How much is directly attributable to him is open to debate.

zwischenzug · on April 4, 2017

The team's growth was partly because of its perceived success at managing operations, but also because of large growth in customer base and application sprawl. The business model we had depended on fast releases and customers who refused to pay for much-needed testing or QA, but were happy to pay for the support we provided. Not saying that's a good thing, but it was the plate we were served.