Handling Human Error in the Datacenter

lsc · on Aug 11, 2008

A problem everywhere I've worked, especially in places where the rack is 'organically grown', is power cables getting accidentally pulled when other servers are added or removed.

On all servers that I have physical access to, I use zipties on both ends of the power cable. You need a knife to unplug anything. One problem, at least, solved.

Generally, I categorize mistakes as 'mistakes of knowledge' (I did the wrong thing because I believed something that was incorrect) and 'mistakes of inattention' (I knew it was the wrong thing to do, but I wasn't paying attention and did it anyhow).

Generally, you don't make the same mistake of knowledge twice, so I don't worry about them much. They happen, but they only happen once. Learning, we call it.

Mistakes of inattention are much worse, in my opinion. Without further action, I will almost certainly repeat a mistake of inattention.

The idea is that every time you make a 'mistake of inattention' you put in place a procedure that will prevent the mistake.

ajross · on Aug 11, 2008

That's hard, though, because a naive application of that philosophy results in a smothering bureaucracy where even simple tasks become hideously expensive. The trick is architecting the process to avoid the possibility of mistakes in the first place (or better: architecting the system being implemented such that it's tolerant of mistakes being made).

What you describe is the operations equivalent of a software guy hacking a one-line patch around a bug. Too many of those without any cleanup or refactoring and you start introducing new bugs while fixing old ones.

lsc · on Aug 12, 2008

Changing the process is exactly what I am talking about here... not so much changing, though, as creating. Usually, when you tell a SysAdmin to do something (install a server, remove a server, etc.) she just does it. I mean, most of this is pretty simple stuff. My suggestion is that when you see a mistake of inattention, you add process so the problem does not occur a second time. Set up a checklist for that process or, as in the power cord example, change your process for installing new servers.

The alternative is to just fire people when they make dumb mistakes... but for me, that is unjustifiably expensive, especially when compared with the cost of adding process.

Sure, if you control the app you can set things up such that losing one physical server isn't a problem, but that is usually a lot more expensive than a few zipties. One thing I've learned in my years as a computer janitor is that a fix I can put in place now with materials I have on hand gets done. A fix I suggest to the devs, even if they agree it is a good idea, rarely gets done.

Process, when it includes humans, is bureaucracy. And yeah, you need to make sure it doesn't get ridiculous. However, having a checklist for, say, installing a server or replacing a bad disk is still, I believe, in the realm of 'good bureaucracy.' When you get paged at 4am after a late party, you can't count on being sharp and remembering everything. You have to accept that, especially for emergency maintenance, sometimes your admins are not at 100%.

You also need to accept that things you don't expect will happen. You can't always rely on process. I'm just saying that you can use process to avoid making the same mistake twice.

seiji · on Aug 12, 2008

Never use zip ties in a data center. Use velcro strips instead.

I don't want people trying to pop zip ties with knives near my cables carrying production traffic.

lsc · on Aug 12, 2008

The difference between zipties and velcro is that velcro is best for organizational binding, while zipties are structural and should largely be considered permanent. You shouldn't cut a ziptie holding a cable that would take down production any more than you should move a rack while the production servers in it are still running. If it is the sort of thing you might want to move while the servers are running (say, ethernet, or a bundle of cables that goes to more than one server) then, like you said, you should use velcro or something else temporary.

lsc · on Aug 12, 2008

A velcro strip isn't going to keep a power plug from getting jerked out of a server by a careless admin.

sharjeel · on Aug 12, 2008

Also, if your scripts have any dev-mode features for testing (such as cleaning up some database values and regenerating them, removing some files, etc.), make sure that you are unable to execute them on production, or that some sort of confirmation is required.

I had a script on my server that clustered stories from different news sources. The script also had some test methods that deleted all the clustered data and rebuilt it. I once accidentally ran the "cleanup method" on the prod server, and it was a disaster because somehow a cascading delete took place. I had to go to the replay log to get everything back, which took hours of effort under a lot of pressure. Since then, every one of my scripts asks for confirmation twice before running any such test method on the prod server.
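
A minimal sketch of that kind of double confirmation, in Python. This is not the original script: the host names in `PRODUCTION_HOSTS`, the `confirm_destructive` helper, and the `rebuild_clusters` function are all hypothetical names made up for illustration, and it assumes production machines can be recognized by host name.

```python
import socket
import sys

# Assumption: production machines are identified by host name.
# These names are placeholders; substitute your own.
PRODUCTION_HOSTS = {"prod-web-1", "prod-db-1"}


def is_production(hostname):
    """Return True if the given host name is on the production list."""
    return hostname in PRODUCTION_HOSTS


def confirm_destructive(action, hostname=None, ask=input):
    """Return True only if the destructive action may proceed.

    On non-production hosts the action is allowed without prompting.
    On production hosts the operator must confirm twice: first by
    typing "yes", then by typing the host name itself.
    """
    hostname = hostname or socket.gethostname()
    if not is_production(hostname):
        return True

    first = ask(f"{action} on PRODUCTION host {hostname}. Continue? [yes/no] ")
    if first.strip().lower() != "yes":
        return False

    # Second confirmation: typing the host name is hard to do on autopilot.
    second = ask(f"Type the host name ({hostname}) to confirm: ")
    return second.strip() == hostname


def rebuild_clusters():
    """Hypothetical test method that wipes and regenerates derived data."""
    if not confirm_destructive("Delete and rebuild all clustered data"):
        sys.exit("Aborted: confirmation not given.")
    # ... the destructive cleanup and rebuild would go here ...
```

Asking for the host name on the second prompt, rather than a second "yes", is a design choice: it forces you to read which machine you are actually on.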

sysop073 · on Aug 11, 2008

For his last suggestion about coloring the terminal background, it might be easier to just color the name of the machine in the prompt.

e.g.: http://i34.tinypic.com/5ecthx.jpg
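
One way this could be set up, sketched in Python as a small helper that prints a bash `PS1` value. It assumes bash, and it assumes (purely for illustration) that production host names start with "prod"; the function names and the color choices are made up here, not taken from the screenshot.

```python
import socket

# ANSI color codes: bold red for production, bold green for everything else.
RED = "1;31"
GREEN = "1;32"


def host_color(hostname):
    """Pick a color code for a host.

    Assumption: production hosts are named with a "prod" prefix.
    """
    return RED if hostname.startswith("prod") else GREEN


def bash_ps1(hostname=None):
    """Build a bash PS1 string with only the host name colored.

    \\[ and \\] mark the escape sequences as non-printing so bash
    computes the prompt width correctly; \\u, \\h and \\w are bash's
    own escapes for user, host and working directory.
    """
    hostname = hostname or socket.gethostname()
    color = host_color(hostname)
    return rf"\u@\[\e[{color}m\]\h\[\e[0m\]:\w\$ "


if __name__ == "__main__":
    print(bash_ps1())
```

If the script were saved as, say, `prompt.py` (a hypothetical name), a line like `PS1="$(python3 prompt.py)"` in `.bashrc` would apply it at login.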

slackerIII · on Aug 11, 2008

Ah, changing the color of the prompt is a much better idea. Thanks.

a-priori · on Aug 11, 2008

You should also look into software such as Puppet to reduce the amount of manual administration you have to do.