Incident management at Google – adventures in SRE-land (googleblog.com)
250 points by kungfudoi on Feb 27, 2017 | 43 comments



I find this bit to be particularly insightful:

"Can I handle this? What if I can’t?" But then I started to work the problem in front of me, like I was trained to, and I remembered that I don’t need to know everything — there are other people I can call on, and they will answer. I may be on point, but I’m not alone

It might be because I'm currently training people in this realm, and this is one of their biggest fears, or maybe because it was my biggest fear, but it's so true. We're a team. We're here to help. At least if your SRE org is any good. Never be afraid to ask for help, and never be afraid to admit you don't know something or that it might be outside of your comfort zone.

I'll take someone who is willing to learn and readily admits knowledge deficits over someone who doesn't any day of the week. Great book they're working on, great article on this. So many gems, but this one stuck out for me, and it's pretty relevant to me right now.


I notice this a lot with the team I work with. Our team's tickets are not assigned automatically: there's simply a queue and people are encouraged to grab what they can. Unfortunately, what I've often seen is that people see something in the description that they're not familiar with and refuse to touch the ticket because they don't want to ask for help, which means that it languishes in the queue. The end result is that the same person ends up taking the same kind of ticket over and over because they're the only one who has any familiarity with the program in question.


I've seen this be the case in other places. We used to have a queue-based system. Then we went to an auto-assignment process. One of the things it has done is open up communication on our team, since when we did it, we made it policy that if you aren't familiar with something that gets assigned to you, you first ask for help before trading (we have a formal mechanism for soliciting this).

1. It encourages everyone to be aware that we're here to help

2. It encourages learning: the number of requests for help decreases over time once experience and familiarity ramp up

3. It exposes everyone to different types of work.

There are exceptions, of course. Our P0 bucket has a dedicated set of people who handle it, and they are hand-picked because those are house-on-fire situations that need the experience. It's also the one we put the juniors on when they know the ropes and are ready to take on the critical tasks so they can advance themselves (it's good experience for career and personal growth, I feel).

The other thing I like about this is that, as a manager, I can actively encourage the behavior by claiming a ticket or helping someone with a ticket. I really want that culture effect to happen from the top down.


I've had similar experiences in the past. In my situation it was because the Seniors were... unhelpful to the Juniors, to put it nicely. This is where having a solid group of helpful Seniors can help everyone work better.


It's not mentioned in the article, but there is an underlying point that affects hiring for roles like this: you need people who can and will admit they don't know everything and will ask for help rather than wing it.

"Rock stars" are downright dangerous, as are people who prefer to make things up rather than admit ignorance.

A new SRE doesn't need to know everything (and can't). But he absolutely needs to be curious and willing to ask for help.


It's an interesting dilemma in my mind. I remember reading in the SRE book about how SREs are required to have both depth and breadth. Seems like a nearly impossible target to hit IMO, so how are you supposed to reconcile deep/broad abilities with humility when hiring for SREs?


Our litmus test is this. I don't know if it gets to the heart of the question, but I feel it does.

We look at, essentially (but not unequivocally), these things:

1. Proven experience and desire to learn is a must. One of the best I have ever worked with came from a place where they worked pilot projects and had to manage all their infrastructure themselves as the developers. No certs, no formal SRE experience (i.e., that's not why he was employed previously). One of the best. He loved the work, and I could tell he learned so much doing this. It doesn't have to be this extreme, but having a proven interest in this line of work is top priority.

2. Is their depth better than their breadth? It is correct that you need a LOT of breadth, however I value the depth first. I'd rather someone have, say, a medium amount of breadth on the different technologies out there and a lot of depth on core subjects, like container management (this happens... everywhere nowadays) or cluster management. I don't need you to know every single implementation of this in depth, though. I need you to at least know one implementation of this in depth. I can build on that.

3. Because of the first 2, I need someone who is team oriented, as always.


(I'm an SRE at Google. My opinions are my own.)

> core subjects, like container management (this happens...everywhere nowadays) or cluster management

Curiously, these are subjects which most Google SREs won't know much about. One team deals with all that stuff as a service so the rest of us can get on with something else.

What would I pick out as our core skill sets? Ignoring technology-specific details that won't apply anywhere else: troubleshooting a system that you don't understand (reverse-engineering it as you go), and non-abstract large system design.


Mostly it meant SREs tended to be older than other engineers at Google (I was an SRE there for a while), I think by an average of nearly a decade.

Broad experience, depth on a few topics. It's not impossible at all, it just takes time.

(edit: Note this may have changed since I left the company in 2009. It's been a while!)


Depth + breadth is another way of saying experience. Program long enough and you'll learn all sorts of little nasty things about garbage collection and permissions and TCP packet headers and faulty JSON parsers. All of it crystallizes into those little moments when you think "I've seen this shit before".


I always reconciled this as a T shaped graph instead of a box. It's important to have depth in the service area you can control and a breadth of understanding across your dependencies since Google builds service oriented architectures.


You don't; you just wait and struggle to find the 0.00x% of people who fit the role.

There is a reason that this is an impossible-to-fill role.


The hard part is to know what you know and what you don't know. It has nothing to do with "inside a team and feeling safe". If you do not know what you know and what you don't know, then you will eventually make the most naive mistakes.

I myself have deleted the data file of a production MySQL server because I had no idea what I was doing. I needed to call teammates at 12am to learn how to take care of the mess I created.


I agree, it can be complicated. That's why I like the auto-assign ticket system (and not putting absolutely new people on P0 duty until they have warm feet). It allows my team, at least, to experience a breadth of issues, and if a ticket needs to move, because of time or because of efficiency, it can be traded in relative real time.

However, having it be policy to ask first, then trade or have someone else own it, means that essentially, as a team, we can own that issue with you as you learn. It helps a lot come crunch time.

Mistakes will always happen though. If they are genuine, with an earnest/logical thought process behind them, and not from a lack of caring or trying, it doesn't matter in the grand scheme. I'd rather have my entire production server go down on an error like that than deal with a situation where the person was just being negligent in their duties and wasn't willing to be humble.

In short, you'll be fine :)


I agree about being willing to admit you don't know something. I've found it's something a lot of people have a hard time doing.

I wrote down some thoughts about this recently: http://zalberico.com/essay/2017/02/21/asking-questions.html


Goes for any job really; know what you do/don't know, and know what you should/shouldn't know.

For some things, you should be able to trust others.

It's funny, because it boils down to "don't lie" (to yourself or to others).


And this is why the current US president scares the shite out of me. There are a bunch of jobs with a breadth and depth requirement where this is the reality. The first people liars lie to are themselves (and I'm not being glib: I've done a heap of fraud cases). How does that person ask for help, not knowing the limits of their competence?


>there are other people I can call on, and they will answer. I may be on point, but I’m not alone

that sounds really nice


The coolest thing I took away from the SRE book was this progression of system operations from manual, to scriptable, to automated, to a fourth category I hadn't even known existed: autonomous. The idea that you can keep moving up this hierarchy of exception management beyond even chef and puppet, and systems will be able to heal themselves, is a pretty cool one.

As a manager, this made the concept of 20% time a lot more clear. These are people with the knowledge and incentive to build a hierarchy of systems that progressively remove risk from their work. This is in fact their primary business objective. And we need to make sure they have time to do that, vs working them to death with manual remediation. It's a great lesson.

Incidentally, Stackdriver contains a simple alerting and incident management tool that's really nice to use. Hopefully it gets more robust as time goes on and larger and more complex orgs move to their cloud. Edit: not Outalator.


If anything it proves how much mismanagement / wasted potential software organizations have had in the last 20 years. Full automation should be the natural progression of our trade but I fear most companies stall after 5ish years due to turnover, brain-drain, re-organizations, acquisitions, management incompetency etc. Google on the other hand has always had a seemingly never ending pool of resources and talent to keep pushing the barrier further. Fortunately they give a lot back to the community in the form of books, talks, and projects such as Kubernetes (a poor man's Borg). However I fear that with all things commercial it will lead to an oligarchy where companies like Google, Facebook, and Uber, are just that far ahead of the curve nobody else will ever catch up.


(I'm a Google SRE. My opinions are my own.)

That's not what our 20% time is for, and 20% is way too small a number for that purpose. "20% time" (the way we use the term) is for personal/career growth/scratching itches.

Time spent on building systems that make our service better is my primary job. Manual remediation ("toil") is something to be tracked as a dangerous antipattern that must not be allowed to take over.

Toil and oncall response should be less than 20% of my time, together. At least half my time should go into engineering projects. If the level of toil is in excess of 50% of team activity then I would expect only percussive intervention to get the team out of this situation.
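
As a rough illustration of the time budget described in this comment, here is a minimal Python sketch. The function name `check_time_budget`, the category keys (`toil`, `oncall`, `projects`), and the sample hours are all hypothetical; only the 20% and 50% thresholds are taken from the comment itself, and this is not a description of any tool Google uses.

```python
def check_time_budget(hours):
    """Compare a breakdown of hours per category against the thresholds quoted above.

    `hours` maps a category name to hours spent. The keys "toil", "oncall",
    and "projects" are assumed names for this sketch.
    """
    total = sum(hours.values())
    if total <= 0:
        raise ValueError("no hours recorded")

    toil = hours.get("toil", 0) / total
    reactive = (hours.get("toil", 0) + hours.get("oncall", 0)) / total
    projects = hours.get("projects", 0) / total

    findings = []
    if reactive >= 0.20:
        # "Toil and oncall response should be less than 20% of my time, together."
        findings.append("toil + oncall at or above 20%")
    if projects < 0.50:
        # "At least half my time should go into engineering projects."
        findings.append("engineering projects below 50%")
    if toil > 0.50:
        # Toil in excess of 50% is the situation that calls for intervention.
        findings.append("toil above 50%")
    return findings


# Hypothetical 40-hour weeks: one healthy, one dominated by toil.
healthy = {"toil": 4, "oncall": 2, "projects": 24, "other": 10}
overloaded = {"toil": 24, "oncall": 4, "projects": 8, "other": 4}
print(check_time_budget(healthy))
print(check_time_budget(overloaded))
```

The checks are kept separate on purpose: a team can miss the project target without being anywhere near the 50% toil level, and the two situations call for different responses.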


Great comment, thanks for the clarification. Wasn't trying to say that 20% is a magic number, just that it cemented the idea for me that engineering time, and self-directed engineering time, is incredibly valuable for everyone that can be justified and should be zealously protected.


How do you track how much time you spend on manual work vs projects? Is there an in-house tracking tool?


I'm a Google employee, opinions are my own.

The incident management tool is not Outalator. Outalator is the pager queue management tool. The incident management tool is for manually creating incidents that have much broader visibility than Outalator does.

As someone who has been incident commander a few times, incidents tend to have broader impact beyond your immediate team or owned jobs.


Thanks, my bad. Would you say the Stackdriver incidents tool was influenced by SRE experiences?


The Stackdriver incidents tool existed before Stackdriver was bought by Google. They do have similarities though.


> progression of system operations from manual, to scriptable, to automated, to a fourth category I hadn't even known existed: autonomous.

Completely agree. The first "eureka" moment is when you define all your infrastructure in code in something like Terraform. Magically, networking, firewall rules, disks, and instances are all provisioned and dependencies calculated. It is quite a breakthrough from running CLI commands or using the web interface to allocate infrastructure.

Plug: I wrote a blog post on getting started with Terraform and Google Compute Engine for those interested https://blog.elasticbyte.net/getting-started-with-terraform-...


>> from manual, to scriptable, to automated, to a fourth category I hadn't even known existed: autonomous.

What's the difference between automated and autonomous?



Autonomous does not require any human interaction, not even to trigger a script. In contrast, "automated" systems might still require a human to start the process or have a very simplistic trigger, such as based on the time of day.
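
One way to picture the distinction drawn in this comment is the minimal Python sketch below. Everything in it is hypothetical (the function names, the dictionary of service health flags, the `inject_failures` callback); it is one reading of "automated" versus "autonomous", not how any particular system implements either.

```python
def restart_unhealthy(services):
    """Automated step: restarts every unhealthy service, but only when something calls it."""
    restarted = [name for name, healthy in services.items() if not healthy]
    for name in restarted:
        services[name] = True  # stand-in for a real restart
    return restarted


def autonomous_controller(services, inject_failures, ticks):
    """Autonomous wrapper: observes state itself and decides when to remediate."""
    actions = []
    for tick in range(ticks):
        inject_failures(services, tick)   # the world changes on its own
        if not all(services.values()):    # the system notices the problem
            actions.append((tick, restart_unhealthy(services)))
    return actions


def fail_db_at_tick_2(services, tick):
    """Hypothetical failure source used only for this example."""
    if tick == 2:
        services["db"] = False


services = {"web": True, "db": True}
print(autonomous_controller(services, fail_db_at_tick_2, ticks=5))
```

In this reading, `restart_unhealthy` on its own is the automated piece: it removes the manual work but still waits for a person or a timer to run it. The controller is what makes the behavior autonomous, because the decision to act comes from the observed state rather than from an outside trigger.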


FYI - it's linked to in the post, but in case it's not obvious, they have posted the SRE book for free at https://landing.google.com/sre/book.html

I'd highly recommend it if you're in the Ops field. Probably the best book out there on current large scale Ops practices.


It's a great book, very well written and fair. But the first time I read it I suffered a certain amount of zealotry: "Google is amazing! I should rewrite everything to be more like them!". Really a subcaption everyone should keep in mind is that the book defines how Google built systems for Google. YMMV.


Currently reading the book, and could see the zealotry come out, but they mention, repeatedly, that your own systems may not need the level of service that Google built-in, or your teams may not be big enough to justify, etc. I don't think it is quite fair to indicate that they didn't give that thought. I totally agree that it is a great book and should be encouraged throughout ops orgs, devops roles, and developers who want to plan for the future, but that not every company needs the Google way to 100%, or maybe even 50% (as you indicated as well).


Also keep in mind that even the mighty Google has some rusty and pointy internal tools. There's no reason to mention any crappy things in a book about good practices ;).


Maybe it's just me but I found the constant in-line plugs for the book to be distracting -- footnotes would have been better.

Interesting write-up though


I had a similar reaction. I was a bit irked by it b/c it felt very pushy towards the SRE book and broke me out of the flow of the article a few times. 10/10 on the book though, would recommend anyone to read it.


Is it just me, or are we seeing a trend, almost before the "new" role of SRE has become mainstream, that SRE is turning into support technicians because that is what is needed at most places that are not Google scale? The devaluation of the sysadmin took some time; this is happening much faster. What will be the next title when SRE can't get you a decent salary anymore? And why don't we see the same with the SWE role? Is it just that business leaders see Ops as a cost no matter what name it has?


Anyone able to compare and contrast Google's "Wheel of Misfortune" with Netflix's "Chaos Monkey" both in terms of the systems that enable them and the operations that relate to them?


They're unrelated. Wheel of Misfortune is just a role-playing replay of a previous incident as a training exercise. Someone will grab (or simulate) logs and dashboards from the incident and then play GM for the wheel of misfortune at a future team meeting. Someone who isn't familiar with the incident will be designated "on-call". They'll state what they want to do and the GM will tell them or show them what they see when they do those things.

Chaos Monkey is actually taking down production systems to make sure the system as a whole stays up when those individual pieces fall. Google does have (manual, not automatic) exercises doing similar things called DiRT (Disaster Recovery Testing), but it's not related to the SRE training exercise.

(standard disclaimer: Google employee, not speaking for company, all opinions are my own, etc.)


(I'm an SRE at Google. My opinions are my own.)

WoMs are a training exercise, intended to build familiarity with systems and how to respond when oncall. A typical WoM format is a few SREs sat in a room, with a designated victim who is pretending to be oncall. The person running the WoM will open with an exchange a bit like this (massively simplified):

"You receive a page with this alert in it showing suddenly elevated rpc errors (link/paste)"

"I'm going to look at this console to see if there was just a rollout"

"Okay, you see a rollout happened about two minutes before the spike in rpc errors"

"I'll roll that back in one location"

"rpc errors go back to normal in that location"

...etc

(Depending on the team and quality of simulation available, some of this may be replaced with actual historical monitoring data or simulated broken systems)

The "chaos monkey" tool, as I understand it, is intended to maintain a minimum level of failure in order to make sure that failure cases are exercised. I've never been on a team which needed one of those: at sufficient scale and development velocity, the baseline rate of naturally occurring failures is already high enough. We do have some tools like that, but they're more commonly used by the dev teams during testing (where the test environment won't be big enough to naturally experience all the failures that happen in production).
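
To make the testing use case in the last sentence concrete, here is a minimal Python sketch of fault injection in a test environment. It is not Chaos Monkey and not any Google tool: `FlakyBackend`, `fetch_with_retry`, and the failure rate are hypothetical names and values chosen for illustration.

```python
import random


class FlakyBackend:
    """Hypothetical test double that fails a configurable fraction of calls."""

    def __init__(self, failure_rate, seed=0):
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so test runs are repeatable
        self.calls = 0

    def fetch(self, key):
        self.calls += 1
        if self.rng.random() < self.failure_rate:
            raise ConnectionError(f"injected failure for {key!r}")
        return f"value-for-{key}"


def fetch_with_retry(backend, key, attempts=3):
    """Client code under test: retries on ConnectionError, then gives up."""
    last_error = None
    for _ in range(attempts):
        try:
            return backend.fetch(key)
        except ConnectionError as err:
            last_error = err
    raise last_error


# A small test environment rarely fails on its own, so the failure is forced.
always_down = FlakyBackend(failure_rate=1.0)
try:
    fetch_with_retry(always_down, "user:42")
except ConnectionError:
    print("gave up after", always_down.calls, "attempts")
```

The point of the wrapper is the one made in the comment: production at sufficient scale supplies failures for free, while a test environment has to manufacture them so that the retry and give-up paths actually run.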


Chaos Monkey is more like Google's DiRT

http://queue.acm.org/detail.cfm?id=2371516


and Dust


Not a thing anymore, and hasn't been for several years, i.e. name consolidation. Also, a good chunk of DiRT is now continuous and automated (not autonomous though).

Disclaimer: I work at Google and ran the DiRT team for a few years incl. incident management itself.



