"To make error is human. To propagate error to all server in automatic way is #devops."
Frankly, I'm surprised things like this don't happen more often. Kudos for the incident management. Also a big plus for having working backups, it seems.
@DEVOPS_BORAT is actually very insightful in about 1 in 5 tweets. Snide for sure, but there are quite a few good points in there if you read carefully:
"In devops we have best minds of generation are deal with flaky VPN client."
"Single point of failure in private cloud is of usually Unix guy with neckbeard."
These are gold.
Edit: based on the above advice I once grew out a neckbeard while going through a multi-month rollout of a large product. It itched like crazy, but I did work much faster to get rid of it.
So, I feel like I might be being stupid and not getting something, but what is turtle? I can't find a programming language that seems to be related to it.
>best minds of generation are deal with flaky VPN client
So true. I'm on the receiving side of this... "No, you can't work on that multi-million deadline project of yours... the only way to fix the VPN is to re-image the machine back at head office [an international flight away]". Me... "Could you repeat that?" And that's a Cisco Enterprise VPN... (turns out IT was right... re-image & avoid conflicting software is the only solution). So much for Cisco...
Professionally I deal with much of the fallout from problems such as yours, and I lead the techs doing this kind of work. It really sucks, but for many problems like this the choice becomes spend-four-hours-reimaging-the-machine or spend-unknown-period-of-time-trying-to-fix-new-problem. The latter would be great if it took less than four hours, but it often doesn't, and until then you / the user are without a machine.
After an hour or so of troubleshooting it's usually better to go with the reimaging, since all you / the user wants is to get back to working.
Ideally I try to get the entire broken machine captured and the user issued a new, fixed machine because then a fix can be developed and documented, but for those who end up in a new failure mode, it sucks. And with something like the Cisco VPN Agent? That's not uncommon at all...
>spend-four-hours-reimaging-the-machine or spend-unknown-period-of-time-trying-to-fix-new-problem
Definitely. In our case it's 8 hours minimum though for a re-image. Somehow the FDE makes pulling the old data off the machine slow.
You've got my sympathies though - I'd not like to be the one doing the IT in these cases. Can't be fun troubleshooting IT with that kind of time pressure.
Thank you. It really, honestly is hard on our techs because they feel the pressure from all sides. Eight hours sounds rough for a reimage. I think ours are... maybe two or three? We've done a lot of work to get the reimage time down, and Win7 (WIMs) has made this really nice.
If this is something that smells of a bigger problem (or has been seen elsewhere) then I push for them to get the user a wholly new machine, capturing the old one for analysis. If the user is given an upgraded machine, then there is usually little resistance, even with the downtime that'll be incurred.
On the upside, if the issue can be reproduced readily, from this we can almost always get root cause and put a systemic fix in place. If it's sporadic... Well... I'm sure you understand how it goes trying to fix something that you can't yet reproduce. ;)
(I'd love to troubleshoot your slow data backup issue... That's the stuff I rather enjoy.)
>I'd love to troubleshoot your slow data backup issue... That's the stuff I rather enjoy.
I'm not directly involved with the tech side so I don't know the details. I gather they pull the old data off the disk using some offline low-level tool though (like you would for hard drive damage recovery). Between that and the encryption it's somehow very slow. No idea why it's like that though.
>get the user a wholly new machine
I wish it was the same here. They just give loan machines :/
I guess it depends on where your line for 'best minds of the generation' lies. If it's the top 25%, I wouldn't be surprised that many software devs / devops people lie in that category.
Not dumb at all. This is a professional service firm, so there is no real head office per se, but rather your "home office" - I just simplified it a bit for hn purposes.
Couple of reasons. Each country rolls their own custom image. Plus I need an office that has the encryption keys for the full disk encryption. Plus only 3 offices globally carry copies of my data (used when they can't pull the data off the hdd).
If I'm flying anyway I might as well go to home office - I know they have all the required stuff for my laptop.
Same for TheCodelessCode. A lot of these are cryptic and weird, but some are pure gold. Especially since no one understands the koans until they fall flat on their face just like the student does and a huge floodlight turns on.
>Frankly, I'm surprised things like this don't happen more often.
They do. This happened to the largest bank in Australia mid 2012[1]. Very similar circumstances. I've been told that SCCM's UI doesn't help here: something about the default action, when nothing is selected, being to apply it to all devices managed by SCCM. Someone more familiar with SCCM may want to correct me here.
I think it does happen often but isn't as well reported. I certainly know of more than one place that's suffered from this kind of accident (thankfully not places I personally work, so I've not had to deal with the fallout; these are places where friends or family work).
"To make error is human. To propagate error to all server in automatic way is #devops."
Frankly, I'm surprised things like this don't happen more often. Kudos for the incident management. Also a big plus for having working backups, it seems.