https://twitter.com/DEVOPS_BORAT/status/41587168870797312 "To make error is huma...

IgorPartola · on May 16, 2014

@DEVOPS_BORAT is actually very insightful in about 1/5 tweets. Snide for sure, but there are quite a few good points in there if you read carefully:

"In devops we have best minds of generation are deal with flaky VPN client."

"Single point of failure in private cloud is of usually Unix guy with neckbeard."

These are gold.

Edit: based on the above advice I once grew out a neckbeard while going through a multi-month rollout of a large product. It itched like crazy, but I did work much faster to get rid of it.

joezydeco · on May 17, 2014

"In devops is turtle all way down but at bottom is perl script."

ejain · on May 17, 2014

I thought Perl is what is holding the turtles together?

mst · on May 17, 2014

It's turtles.pl all the way down.

tragic · on May 17, 2014

For sibling post:

"Turtles all the way down" is "a jocular expression of the infinite regress problem in cosmology posed by the "unmoved mover" paradox."[1]

[1] http://en.wikipedia.org/wiki/Turtles_all_the_way_down

crashandburn4 · on May 17, 2014

So, I feel like I might be being stupid and not getting something, but what is turtle? I can't find a programming language that seems to be related to it.

count · on May 17, 2014

http://en.wikipedia.org/wiki/Turtles_all_the_way_down

jpwgarrison · on May 17, 2014

Count is correct, but I did have a flash to my childhood: http://en.wikipedia.org/wiki/Logo_(programming_language)

crashandburn4 · on May 18, 2014

That was my first thought too but I figured they couldn't be referring to that. :)

Havoc · on May 17, 2014

>best minds of generation are deal with flaky VPN client

So true. I'm on the receiving side of this..."No you can't work on that multi-million deadline project of yours...the only way to fix the VPN is to re-image the machine back at head office [an international flight away]". Me..."Could you repeat that?" And that's a Cisco Enterprise VPN...(turns out IT was right...re-image & avoid conflicting software is the only solution). So much for Cisco...

philtar · on May 17, 2014

Are you calling yourself one of the best minds of our generation?

Havoc · on May 17, 2014

Hardly. No I meant on the receiving side of techs trying to fix VPNs.

c0nsumer · on May 17, 2014

Professionally I deal with much of the fallout from problems such as yours, and I lead the techs who do this kind of work. It really sucks, but for many problems like this the choice becomes spend-four-hours-reimaging-the-machine or spend-unknown-period-of-time-trying-to-fix-new-problem. The latter would be great if it was less than four hours, but it's often not, and until that time you / the user are without a machine.

After an hour or so of troubleshooting it's usually better to go with the reimaging, since all you / the user wants is to get back to working.

Ideally I try to get the entire broken machine captured and the user issued a new, fixed machine because then a fix can be developed and documented, but for those who end up in a new failure mode, it sucks. And with something like the Cisco VPN Agent? That's not uncommon at all...

Havoc · on May 17, 2014

>spend-four-hours-reimaging-the-machine or spend-unknown-period-of-time-trying-to-fix-new-problem

Definitely. In our case it's 8 hours minimum though for a re-image. Somehow the FDE makes pulling the old data off the machine slow.

You've got my sympathies though - I'd not like to be the one doing the IT in these cases. Can't be fun troubleshooting IT with that kind of time pressure.

c0nsumer · on May 18, 2014

Thank you. It really, honestly is hard on our tech because they feel the pressure from all sides. Eight hours sounds rough for a reimage. I think ours are... maybe two or three? We've done a lot of work to get the reimage time down, and Win7 (WIMs) have made this really nice.

If this is something that smells of a bigger problem (or has been seen elsewhere) then I push for them to get the user a wholly new machine, capturing the old one for analysis. If the user is given an upgraded machine, then there is usually little resistance, even with the downtime that'll be incurred.

On the upside, if the issue can be reproduced readily, from this we can almost always get root cause and put a systemic fix in place. If it's sporadic... Well... I'm sure you understand how it goes trying to fix something that you can't yet reproduce. ;)

(I'd love to troubleshoot your slow data backup issue... That's the stuff I rather enjoy.)

Havoc · on May 18, 2014

>I'd love to troubleshoot your slow data backup issue... That's the stuff I rather enjoy.

I'm not directly involved with the tech side so I don't know the details. I gather they pull the old data off the disk using some offline low-level tool though (like you would for hard drive damage recovery). Between that and the encryption it's somehow very slow. No idea why it's like that though.

>get the user a wholly new machine

I wish it was the same here. They just give loan machines :/

Anderkent · on May 17, 2014

Is it that surprising? I'd bet the average devops is in the top quartile, if not the top 10%.

IgorPartola · on May 17, 2014

Is that a statistics joke?

Anderkent · on May 19, 2014

I guess it depends on where your line for 'best minds of the generation' lies. If it's the top 25%, I wouldn't be surprised that many software devs / devops people lie in that category.

Nexxxeh · on May 17, 2014

At the risk of asking a potentially dumb question, why can they only reimage the machine at head office?

Havoc · on May 17, 2014

Not dumb at all. This is a professional service firm, so there is no real head office per se, but rather your "home office" - I just simplified it a bit for HN purposes.

Couple of reasons. Each country rolls their own custom image. Plus I need an office that has the encryption keys for the full disk encryption. Plus only 3 offices globally carry copies of my data (used when they can't pull the data off the hdd).

If I'm flying anyway I might as well go to home office - I know they have all the required stuff for my laptop.

tetha · on May 17, 2014

Same for TheCodelessCode. A lot of these are cryptic and weird, but some are pure gold. Especially since no one understands the koans until they fall flat on their face just like the student does and a huge floodlight turns on.

appplemac · on May 17, 2014

In startup we gamify operations: 3 infrastructure failures, and ops team are out.

brokenparser · on May 17, 2014

> These are gold.

But not English, I can't make sense of them.

toomuchtodo · on May 16, 2014

Devops Borat: This not mistake. This how we test recovery operations in production.

ibisum · on May 17, 2014

This one goes on the first page of my playbook as of today.

shill · on May 16, 2014

Worst mistake is always happen in batch. --Devops Borat

https://twitter.com/DEVOPS_BORAT/status/251885078794366976

wiredfool · on May 16, 2014

The most frightening movie ever for sysadmins -- The Sorcerer's Apprentice.

jerf · on May 16, 2014

My favorite personal variant: "To err is human. To screw up a million times per second, you need a computer."

antmldr · on May 17, 2014

>Frankly, I'm surprised things like this don't happen more often.

They do. This happened to the largest bank in Australia in mid-2012[1]. Very similar circumstances. I've been told that SCCM's UI doesn't help here: apparently, when nothing is selected, the default is to apply the action to all devices managed by SCCM. Someone more familiar with SCCM may want to correct me here.

[1] http://delimiter.com.au/2012/07/30/disastrous-patch-cripples...
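To illustrate the failure mode described above, here is a minimal sketch in Python of the safer default: an empty selection is refused instead of being read as "everything". This is not SCCM's API or actual behaviour; `resolve_targets`, `EmptySelectionError`, and the `allow_all` flag are hypothetical names made up for the example.

```python
from typing import Iterable, List, Optional


class EmptySelectionError(ValueError):
    """Raised when a deployment is requested without explicit targets."""


def resolve_targets(
    selected: Optional[Iterable[str]],
    all_devices: Iterable[str],
    allow_all: bool = False,
) -> List[str]:
    """Return the devices a deployment should touch.

    An empty or missing selection is an error, not a synonym for
    "everything". Targeting the whole fleet needs allow_all=True.
    """
    fleet = list(all_devices)
    if allow_all:
        # The caller asked for the whole fleet explicitly.
        return fleet
    chosen = list(selected or [])
    if not chosen:
        raise EmptySelectionError(
            "no devices selected; refusing to default to all devices"
        )
    # Reject names that are not in the managed fleet.
    unknown = [d for d in chosen if d not in fleet]
    if unknown:
        raise ValueError(f"unknown devices: {unknown}")
    return chosen
```

With something like this, a fleet-wide rollout only happens when someone types `allow_all=True` on purpose, and a forgotten selection fails loudly instead of reimaging every machine.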

mey · on May 16, 2014

I wish that account was still updating.

misnome · on May 17, 2014

I found an alternative in https://twitter.com/BigDataBorat, not quite as insightful but still occasionally funny

reitanqild · on May 17, 2014

For anyone who hasn't read it yet the last tweet on that account is gold.

ar7hur · on May 16, 2014

"In startup we are practice Outage Driven Infrastructure."

wiml · on May 16, 2014

I'm not sure that's so unlike Netflix's Chaos Monkey.

timClicks · on May 16, 2014

Propagation of a software mistake is what appears to have caused the Gmail outage of 2011 http://youtu.be/eNliOm9NtCM?t=28m49s

laumars · on May 16, 2014

I think it does happen often but isn't as well reported. I certainly know of more than one place that's suffered from this kind of accident (thankfully not places where I personally work, so I've not had to deal with the fallout; these are places where friends or family of mine work).