"To make error is human. To propagate error to all server in automatic way is #devops."
Frankly, I'm surprised things like this don't happen more often. Kudos for the incident management. Also a big plus for having working backups, it seems.
@DEVOPS_BORAT is actually very insightful in about 1/5 tweets. Snide for sure, but there are quite a few good points in there if you read carefully:
"In devops we have best minds of generation are deal with flaky VPN client."
"Single point of failure in private cloud is of usually Unix guy with neckbeard."
These are gold.
Edit: based on the above advice I once grew out a neckbeard while going through a multi-month rollout of a large product. It itched like crazy, but it did make me work much faster so I could get rid of it.
So, I feel like I might be being stupid and not getting something, but what is turtle? I can't find a programming language that seems to be related to it.
>best minds of generation are deal with flaky VPN client
So true. I'm on the receiving side of this..."No, you can't work on that multi-million-dollar deadline project of yours...the only way to fix the VPN is to re-image the machine back at head office [an international flight away]". Me..."Could you repeat that?" And that's a Cisco Enterprise VPN...(turns out IT was right...re-imaging & avoiding the conflicting software is the only solution). So much for Cisco...
Professionally I deal with much of the fallout from problems such as yours, and with leading techs who do this kind of work. It really sucks, but for many problems like this the choice becomes spend-four-hours-reimaging-the-machine or spend-unknown-period-of-time-trying-to-fix-new-problem. The latter would be great if it was less than four hours, but it's often not, and until that time you / the user are without a machine.
After an hour or so of troubleshooting it's usually better to go with the reimaging, since all you / the user wants is to get back to work.
Ideally I try to get the entire broken machine captured and the user issued a new, fixed machine because then a fix can be developed and documented, but for those who end up in a new failure mode, it sucks. And with something like the Cisco VPN Agent? That's not uncommon at all...
>spend-four-hours-reimaging-the-machine or spend-unknown-period-of-time-trying-to-fix-new-problem
Definitely. In our case it's 8 hours minimum though for a re-image. Somehow the FDE makes pulling the old data off the machine slow.
You've got my sympathies though - I'd not like to be the one doing the IT in these cases. Can't be fun troubleshooting IT with that kind of time pressure.
Thank you. It really, honestly is hard on our tech because they feel the pressure from all sides. Eight hours sounds rough for a reimage. I think ours are... maybe two or three? We've done a lot of work to get the reimage time down, and Win7 (WIMs) have made this really nice.
If this is something that smells of a bigger problem (or has been seen elsewhere) then I push for them to get the user a wholly new machine, capturing the old one for analysis. If the user is given an upgraded machine, then there is usually little resistance, even with the downtime that'll be incurred.
On the upside, if the issue can be reproduced readily, from this we can almost always get root cause and put a systemic fix in place. If it's sporadic... Well... I'm sure you understand how it goes trying to fix something that you can't yet reproduce. ;)
(I'd love to troubleshoot your slow data backup issue... That's the stuff I rather enjoy.)
>I'd love to troubleshoot your slow data backup issue... That's the stuff I rather enjoy.
I'm not directly involved with the tech side so I don't know the details. I gather they pull the old data off the disk using some offline low-level tool though (like you would for hard drive damage recovery). Between that and the encryption it's somehow very slow. No idea why it's like that though.
>get the user a wholly new machine
I wish it were the same here. They just give out loaner machines :/
I guess it depends on where your line for 'best minds of the generation' lies. If it's the top 25%, I wouldn't be surprised that many software devs / devops people lie in that category.
Not dumb at all. This is a professional services firm, so there is no real head office per se, but rather your "home office" - I just simplified it a bit for HN purposes.
A couple of reasons. Each country rolls their own custom image. Plus I need an office that has the encryption keys for the full disk encryption. Plus only 3 offices globally carry copies of my data (used when they can't pull the data off the HDD).
If I'm flying anyway I might as well go to home office - I know they have all the required stuff for my laptop.
Same for TheCodelessCode. A lot of these are cryptic and weird, but some are pure gold. Especially since no one understands the koans until they fall flat on their face just like the student does and a huge floodlight turns on.
>Frankly, I'm surprised things like this don't happen more often.
They do. This happened to the largest bank in Australia in mid-2012[1]. Very similar circumstances. I've been told that SCCM's UI doesn't help here - something about the default action, when nothing is selected, being to apply the task to all devices managed by SCCM. Someone more familiar with SCCM may want to correct me here.
I think it does happen often but isn't as well reported. I certainly know of more than one place that's suffered from this kind of accident (thankfully not places where I personally work, so I've not had to deal with the fallout; they're places where friends or family work).
Snark and sarcasm aside, I am impressed with the level of detail that the IT department is sharing; it is refreshing to see such a disaster being discussed so openly and honestly, while at the same time treating customers like adults.
At one place I worked, in the days of XP, the 'index server'(?) had a problem and uninstalled all the application programs. The basic OS was there, but MS Office, MSIE, all the doodads just got removed as each machine logged in.
This was a small college, so the IT guys just went round explaining and told us to log out and log in again. Applications re-installed. No data loss and so no shouting.
Stuff happens. We did 'assignment action planning' that morning: mind maps, essay plans, and research ideas. Results were better than normal anyway.
It's not standard across the board, but I've found academic institutions to be more honest about their technological mistakes (outside of large-scale breaches) than the private sector.
I worked at a company for several years that provided software to the IT groups in the higher ed vertical. What I found is there are two types of people in higher-ed IT (and they often congregate at different campuses):
First are the people who believe in the mission. They are really good, and are willing to take a cut in pay for some combination of social good, great working environment, etc. These kinds of people tend to be forthright about problems.
The second group are people who would struggle with the demands of the normal corporate world. They are getting paid less in higher ed and are worth what they are getting paid.
When I was in college, I worked for the IT department. While there's politics and bullshit no matter where you go, the politics and bullshit in Ed IT was not that noticeable where I was.
Furthermore, the profs were always happy to see me coming because I fixed their broken stuff without pointing any fingers.
Yes, shit happens. It's how you solve it and make sure that it doesn't happen again that's important. With the way they're communicating, they seem to be on top of their game.
Snark and sarcasm somewhat aside, it's not like they could tell the users about the remediation progress through their intranet, so they probably didn't have many options besides posting it on the Internet for all to see.
Yup. There's reason for some initial amusement but much respect for openly taking care of the problem.
This will win them appreciation and confidence in the aftermath of this disaster.
This reminds me of my undergrad CPSC days. The CPSC department had their own *nix-based mainframe system that was separate from the rest of the University. The sysadmin was a pretty smart guy who was making less than a third of what he could get in industry. Eventually he got fed up and left. About a week or two later the servers had a whole cascade of failures that resulted in everyone losing every last bit of work they'd done over the weekend (This was a weekend near the end of the semester when everyone was in crunch mode).
Long story short, the sysadmin was hired back and paid more than most of the profs. Academia may tend to skimp on salaries for certain positions, but sysadmins probably shouldn't be one of them.
You know what? Fuck them. Fuck higher education completely. Undervalued and underpaid is the name of the game for any important IT roles in that shit hole of an industry.
To be fair, a lot of the people working in support roles in academia are pretty much unemployable in the real world. They show up at 10 am, take constant smoke breaks all day, and leave at 3 pm.

When I was in physics (which used the main university servers for most things) we had a sysadmin who was in charge of some printers and a couple of server boxes. He had inherited those boxes from a former student who set them up, but he was functionally illiterate in managing them. At one point I needed a package installed. Not only could he not figure out how to install a package on an Ubuntu server on his own, he couldn't do it with emailed instructions either. I had to go up and physically stand over him, telling him what to click on and what to type. To make matters worse, he was so hard to actually catch "in the office" that I had to have the department secretary (whose office he was next to) alert me when he showed up.

Not surprisingly, the functions of those servers were soon moved to desktop machines in various offices. As far as I know he's still working there, though. He's a union employee and it would be a ride through deepest, hottest hell to get rid of him.
Note: I am not saying all university support staff are like this. Some definitely are though, and they're probably the reason why good people sometimes find it hard to be properly remunerated in academia.
> Note: I am not saying all university support staff are like this. Some definitely are though, and they're probably the reason why good people sometimes find it hard to be properly remunerated in academia.
Certainly not everyone is like him - but I'd wager every university has at least a couple of people like him (we definitely had one, again, in physics)
Reminds me of some emails that went out at my old university during a cluster outage, and got progressively more informal as the night went on, detailing people leaving dinners with extended families, a growing sense of desperation, etc. The last email might as well have ended with "Tell my wife I love her."
It was both direct and funny enough that I was only mildly annoyed that the cluster was down.
> A Windows 7 deployment image was accidently sent to all Windows machines, including laptops, desktops, and even servers. This image started with a repartition / reformat set of tasks.
Wow. That is very unfortunate, to say the least...
> As soon as the accident was discovered, the SCCM server was powered off – however, by that time, the SCCM server itself had been repartitioned and reformatted.
I wouldn't feel bad. I guffawed at most of this story.
Not in a "haha, what a bunch of morons. Serves those jerks right!" kind of way, but more in a "oh dear, that's the worst thing that can possibly happen! Oh no it gets worse??". I've been through IT catastrophes (and caused a couple myself) and I could easily see this happening to me. Still, it's funny as anything.
Sometimes these projects get bogged down because of three important problems outside of the control of IT. First off, you need to get in touch with all the upstream vendors to get updates to any sort of custom software that has compatibility problems with newer versions of Windows; Vista/7 got a lot more strict about giving admin access, for example, which may cause problems with the updates. Second, you've got to keep in mind the training costs. There are a lot of users who may be brilliant financial minds that can make numbers dance and bow to their whims, but get terribly locked up if an icon changes. Doesn't make them horrible people by any means, but you've got to keep it in consideration when planning a rollout. Finally, you have to keep in mind the petty turf wars. If Joe in Accounting gets the upgrade to 7 before Bob in Legal, Bob in Legal may feel slighted and start raising a holy shitstorm, even if he's scheduled to be upgraded a week later. Upgrades are ugly, no matter when they happen. Sometimes that proactive upgrade project takes many years just because of all the moving parts involved.
There was a similar catastrophe at Jewel-Osco stores many years ago. Nightly, items added to the store POS were merged back with the main item file at each store location. The format of the merged data was exactly the same as loading a new file, except the first statement would be /EDIT instead of /LOAD.
One of the programmers decided to eliminate some code by combining the two functions, with a switch to control whether /LOAD or /EDIT was used for the first statement.
There was a bug in the program, and the edits were sent down as loads.
A guy I knew, Barry, was the main operator that night. He started getting calls from the stores after around 10 of them had been reloaded with 5 or 6 items.
Barry said that day was the first time he got to meet the president of the company.
Since a reformat was done to the affected machines, does this mean that researchers' datasets, drafts of papers, and other IP were lost? Or were researchers' machines not affected?
In my experience with campus networks, home directories are never stored locally on any remotely-administered machines. Any specially-configured researcher's machine that stored data locally would not have been subscribed to get the automatically deployed OS images.
>"As soon as the accident was discovered, the SCCM server was powered off – however, by that time, the SCCM server itself had been repartitioned and reformatted." //
If the SCCM server was pushed the "update" too, then there doesn't seem to be much hope for the other machines? Surely no rule should be able to format the server running the ruleset; that seems like a failsafe failure there, at least.
None of the storage servers should have been storing the user data on the same volume as the OS the way a client machine would. So the network-mounted home directories should be intact and ready to use once the server OS is reinstalled. And while I don't know how SCCM works, I'd be surprised if this image push was affecting anything other than the primary physical drive (a wipe-all, populate-one recipe would be too obviously wrong and dangerous, right?).
Deletion and formatting don't necessarily destroy data; they just destroy the pointers to the data. If they're lucky the data can be recovered via software utilities (undelete) with backfill from backups. If they're unlucky then important, un-backed-up data has been overwritten, and those people are going to be SOL.
Are there actual backup procedures out there that foresee and automate the restoration of wiped drives and partitions? I might be wrong, but I doubt it's something that should even be considered.
Yes, there are a lot of options, both commercial and non-commercial, for full drive backups (then you restore using those and the incrementals). Do that for your provisioning servers and you can redeploy a lot of the infrastructure based on that.
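To make that concrete, here's a minimal sketch (purely illustrative - made-up paths, file-level rather than block-level, and ignoring real-world details like deleted files and boot sectors) of the "full image plus incrementals" restore idea:

```python
import shutil
from pathlib import Path

# Hypothetical layout: one full snapshot plus dated incremental directories,
# each holding only the files changed since the previous snapshot.
BACKUP_ROOT = Path("/backups/provisioning-server")    # assumed path
RESTORE_TARGET = Path("/restore/provisioning-server") # assumed path

def restore(backup_root: Path, target: Path) -> None:
    full = backup_root / "full"
    incrementals = sorted(p for p in backup_root.glob("incr-*") if p.is_dir())

    # Start from the full image of the file tree...
    shutil.copytree(full, target, dirs_exist_ok=True)

    # ...then replay each incremental in order; later copies overwrite earlier files.
    for incr in incrementals:
        shutil.copytree(incr, target, dirs_exist_ok=True)

if __name__ == "__main__":
    restore(BACKUP_ROOT, RESTORE_TARGET)
```

Real imaging tools work at the block level and handle boot sectors, deletions, and open files; the point here is just the restore order.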
I see, thank you. Somehow I was under the impression that automating and scaling the tedious work of analyzing and restoring boot sectors and the like couldn't be done. I suppose it's easier to plan for, though, if you restore a whole drive from a backup image rather than restoring a random set of files by going on a sector hunt?
> Findlay University did not release the name of the company that made the mistake but is working with the business' insurance company to pay for it to be fixed.
Findlay apparently doesn't value transparency as much as Emory.
> The university says grass was killed on as many as 54 of the campus' 72 acres.
Makes you wonder what chemicals are going into lawns generally.
One of the lessons from my college days (informally acquired, take with appropriate quantities of salt) was that walking barefoot was as much a risk for chemical exposure as puncture wounds.
Not quite as disastrous, but when I was at university the resident administrators configured the entire site's tftp server (everything was netbooted Suns) to boot from the network. This was fine until there was a site-wide power blip and it was shut down. When it came back it couldn't tftp to itself to boot because it wasn't booted yet (feel the paradox!). Cue 300 angry workstation users descending on the computer centre with pitchforks and torches because their workstations couldn't boot either...
Bad stuff doesn't just happen to Windows networks.
* As soon as the accident was discovered, the SCCM server was powered off – however, by that time, the SCCM server itself had been repartitioned and reformatted.
I was just watching the "What’s New with OS Deployment in Configuration Manager and the Microsoft Deployment Toolkit" session from TechEd and hit the section on the "check readiness" option which MS have added to SCCM 2012 in R2. It sounds like having this as part of the task sequence at Emory would have, at the very least, stopped this OS push from hosing all the servers.
Reading that just made me feel sick to my stomach, and my heart goes out to the poor gal/guy who pushed "Go" on that one. Shit happens, but a screw-up that big can be devastating to one's psyche.
I _very_ nearly did this whilst working for a university back in the early noughties. Luckily I managed to get to the server before the "advert" activated and wiped out everything. It was so easy to do that I am surprised it is still possible. I feel for their pain, but it does sound like they are doing a good job of mopping up. I did allow myself a snort of laughter when I read the bit about the server being re-imaged as well. That is pretty darn impressive carpet-bombing of the entire campus.
As soon as the accident was discovered, the SCCM server was powered off – however, by that time, the SCCM server itself had been repartitioned and reformatted.
I guess that's what the robot apocalypse is gonna look like.
Isn't this more the fault of the system architect than the guy who accidentally fired the bad deploys?
It's similar to a database firehose: If you accidentally start deleting all data you should have a quick working backup ready to quickly bring the dead box up to production.
I don't know. This could very well be a case of not much more than a bad drag and drop in SCCM. It's not quite that simple, but I'm not sure this is some custom process they set up.
Any tool that allows you to easily perform the antithesis of its function without making it abundantly clear what clicking the OK button will do is fundamentally broken.
I've built a few systems for deploying Windows... and the last thing that every one of them did before writing a new partition table and laying down an image was to check for existing partitions and require manual intervention if any were found.
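Roughly this shape, as an illustration only (not the actual code, and assuming a raw device path and 512-byte sectors): look for an MBR or GPT signature and refuse to proceed automatically if one is found.

```python
import sys

MBR_SIGNATURE = b"\x55\xaa"   # bytes 510-511 of sector 0
GPT_SIGNATURE = b"EFI PART"   # start of LBA 1 (assumes 512-byte sectors)

def disk_looks_partitioned(device: str) -> bool:
    """True if the disk already carries an MBR or GPT signature."""
    with open(device, "rb") as disk:
        head = disk.read(1024)  # first two sectors
    return head[510:512] == MBR_SIGNATURE or head[512:520] == GPT_SIGNATURE

if __name__ == "__main__":
    device = sys.argv[1]  # e.g. /dev/sda, or \\.\PhysicalDrive0 on Windows
    if disk_looks_partitioned(device):
        # Stop dead; a human has to approve wiping a disk that isn't blank.
        sys.exit(f"{device} already has a partition table - manual intervention required")
    print(f"{device} looks blank; safe to partition and image")
```

A blank disk gets imaged unattended; anything with an existing partition table gets kicked back to a person.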
> As soon as the accident was discovered, the SCCM server was powered off – however, by that time, the SCCM server itself had been repartitioned and reformatted.
"As soon as the accident was discovered, the SCCM server was powered off – however, by that time, the SCCM server itself had been repartitioned and reformatted."
I asked my friend attending Emory right now, and he didn't even realize anything was going on. He says the Emory IT department has a notorious reputation on campus for being regularly terrible, mostly because of an unreliable internet connection.
However, it looks like they handled this accident the best they could! Perhaps this accident would not have happened at a more reliable IT department.
Disasters as well as mistakes are unavoidable, such is life. A hallmark of a competent organization is how they handle the situation and recover from disasters or mistakes.
So far all the signs indicate they are doing great at recovering. I just hope there won't be onerous processes and restrictions afterward, driven by a "make sure it won't happen again" stance.
My roommate works at the Emory library and has had a fun, slow week there, coming home early many days because no one could do any work. They were apparently also given laptops as an interim solution, but those somehow also wiped themselves eventually (?).
Poor IT people... just as they're starting to get a handle on the actual situation, it starts blowing up on the internet.
Funny how they mention iTunes as one of the "key components" that are restored first, whereas Visio, Project, and the Adobe applications are relegated to a second round.
Presumably iTunes is part of their base system image for all workstations, along with Office, Firefox, Adobe Reader, and the like. In other words, a basic set of software to handle standard officework tasks. iTunes is free and IT would probably rather distribute it everywhere than have people trying to install it themselves (or calling the helpdesk to get someone with administrator rights to do it). They then offer additional applications on an as-needed basis to individuals and departments with specific tasks. So the designers who do print publications and the faculty who teach digital art might get the Adobe suite, while people in Facilities who plan construction will get Project. This keeps licensing costs down and simplifies systems according to their uses.
Generally you're going to keep your base images in SCCM limited to software that's only infrequently updated. Otherwise somebody has to update the entire image every time an update gets pushed out. Instead, you package them up and deploy the apps on top of the base image at install time. It takes a little longer to deploy, but it takes less admin time to manage since the actual installs are automated anyway.
Not hard to imagine. The others likely have specialty licenses and so aren't as easily distributed to everyone. In addition, Adobe software itself wasn't working earlier ;-)
You know how in movies you need at least two people to bring their special secret keys, plug them in, and turn them at once to enable a self-destruct sequence?
That is a real principle in interface design - if something would be really, really bad to activate unintentionally, make it really, really hard to activate.
If you design a nuclear missile facility, you don't put the "launch nukes" button right next to "check email" and "open facebook".
Same way it shouldn't be easy for users to delete or corrupt their data by accident due to some omnipotent action innocently shoved right in between other trivial actions.
I wouldn't blame the person who triggered this re-imaging process. I'd blame those who designed the re-imaging interface, to allow it to happen so easily by accident.
In my experience, the key is that the UI makes it clear exactly what you're doing. What I mean is, instead of a button that says "Start Imaging", it should be "Start Imaging of the 12,600 computers this rule applies to". Of course, that's a lot more work for the programmer, so it's never done.
It also helps to have sensible conventions for naming hosts and groups. If you need to select a subset of machines then you are sometimes going to make a mistake as simple as getting a wildcard pattern wrong. Instead, have groups with explicit and obvious names that require no memory to understand.
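A toy illustration (made-up host names) of how an over-broad wildcard quietly drags servers into what was meant to be a clients-only selection:

```python
from fnmatch import fnmatch

hosts = ["lab-pc-01", "lab-pc-02", "lab-printsrv", "lab-fileserver"]

# Intended: just the lab PCs. The sloppy pattern also catches both servers.
sloppy  = [h for h in hosts if fnmatch(h, "lab-*")]
careful = [h for h in hosts if fnmatch(h, "lab-pc-*")]

print(sloppy)   # ['lab-pc-01', 'lab-pc-02', 'lab-printsrv', 'lab-fileserver']
print(careful)  # ['lab-pc-01', 'lab-pc-02']
```

With explicit group names ("lab-pcs", "lab-servers") there's no pattern to get wrong in the first place.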
But then that leads to people automatically clicking "Yes" to the "Are you sure..." dialog. Though even I would pause at "Are you sure you want to reformat 12,500 machines including this one?" ;-)
Even that is hardly enough. There should be a physical (well, virtually physical) obstacle to launching a high-stakes command.
The system should be able to assess the scope of a task, and ask you to confirm 10 times if it has to, in blinking red dialogs, to make sure you really want to do what you are doing.
Of course, it's crucial that "clicking 10 times" is not the default behavior for any trivial action. Or boredom and the subsequently formed mechanical 10-click habit of the operator will kill the effectiveness of this approach...
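As a rough sketch of that (hypothetical names and threshold, nothing to do with how SCCM actually works): state the blast radius up front, and once it's above some size make the operator type the machine count back rather than just clicking through.

```python
def confirm_deployment(targets, danger_threshold=50):
    """Spell out the scope of the task and scale the friction with it."""
    count = len(targets)
    print(f"This task sequence will repartition and reimage {count} machines.")

    if count <= danger_threshold:
        return input("Proceed? Type 'yes' to continue: ").strip() == "yes"

    # Above the threshold a reflexive 'yes' isn't enough: typing the count
    # back forces the operator to actually read the number.
    typed = input(f"Type the number of machines ({count}) to confirm: ").strip()
    return typed == str(count)

if __name__ == "__main__":
    machines = [f"host-{i:05d}" for i in range(12600)]  # hypothetical targets
    if confirm_deployment(machines):
        print("Deployment would start here.")
    else:
        print("Aborted - nothing was deployed.")
```

The threshold check keeps the heavy friction off everyday, small-scope tasks, which is exactly the concern about the 10-click habit above.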
"To make error is human. To propagate error to all server in automatic way is #devops."
Frankly, I'm surprised things like this don't happen more often. Kudos for the incident management. Also a big plus for having working backups, it seems.