A similar thing (catastrophic aircon failure due to a flood in a crap colocated DC) happened to us too, before we shifted to AWS. Photos from the colo were pretty bizarre - fans balanced on random boxes, makeshift aircon ducting made of cardboard and tape, and some dude flailing an open fire door back and forth all day to get a little bit of fresh air in. Wild to see in 2010-ish with multi-million-dollar customers.
We ended up having to strategically shut servers down as well, but the question of what was critical, where it sat in the racks, and what was next to it was incredibly difficult to answer. And kinda mind-bending - we'd been thinking of these things as completely virtualised resources for years, so suddenly having to consider their physical characteristics as well was a bit of a shock. Just shutting down everything non-critical wasn't enough - there were still now-critical, non-redundant servers sitting next to each other and overheating.
All we had to go on was an outdated racktables install, a readout of the case temperature for each machine, and a map of which machine was connected to which switch port, which loosely corresponded to position in the rack - none of it completely accurate. In the end we got the colo guys to send photos of the front and back of the racks and (though not everything was well labelled) we were able to make some decisions and get things stable again.
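(For anyone wondering what "a readout of the case temperature" amounts to in practice, it's roughly the sort of thing you can scrape out of the BMCs. A minimal sketch of that kind of poll, assuming ipmitool and IPMI-over-LAN are available - host names, addresses and credentials here are made up, not our actual setup:)

```python
# Rough sketch only: pull inlet/ambient temperatures from a few BMCs via
# ipmitool over LAN. Host list, credentials and sensor names are illustrative;
# assumes ipmitool is installed and IPMI-over-LAN is enabled on the servers.
import os
import subprocess

BMCS = {"app-01": "10.0.0.11", "db-01": "10.0.0.12"}  # hypothetical name -> BMC IP

def inlet_temp(bmc_ip: str) -> str:
    out = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", bmc_ip,
         "-U", os.environ.get("IPMI_USER", "admin"),
         "-P", os.environ.get("IPMI_PASS", ""),
         "sdr", "type", "Temperature"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Typical line: "Inlet Temp | 04h | ok | 7.1 | 38 degrees C"
    for line in out.splitlines():
        if "Inlet" in line or "Ambient" in line:
            return line.split("|")[-1].strip()
    return "no inlet/ambient sensor found"

for name, ip in BMCS.items():
    print(f"{name}: {inlet_temp(ip)}")
```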
One server that was critical but that we couldn't get to run cooler we got lucky with - we were able to pull out the server below it and (without shutting it down) have the on site engineer drop it down enough to crack the lid open and get some cool air into it, which kept it running (albeit with no redundancy and on the edge of thermal shutdown).
We came really close to a major outage that day, one that would have cost us dearly. I know it sounds like a total shambles (and it kinda was) but I miss those days.
> have the on site engineer drop it down enough to crack the lid open
Took me four reads to find any way to read that other than "we asked some guy who doesn't even work for us to throw it on the ground repeatedly until the cover cracks open", like that Zoolander scene.
In our defence, he offered. It had hit hour 6 of both the primary and the backup aircon being down, on a very hot day - everyone was way beyond blame and the NOC staff were basically up for any creative solution they could find.
Wait, you didn't mean "he repositioned it a couple of levels down in the rack to make some room above, so he could unscrew the cover and crack it open a bit like a grand piano"?
I find it’s much less stressful to rescue situations where it wasn’t your fault to begin with. Absent the ability to point fingers at a vendor, crises like that are a miserable experience for me.
> A similar thing (catastrophic aircon failure due to a flood in a crap colocated DC) happened to us too, before we shifted to AWS. Photos from the colo were pretty bizarre - fans balanced on random boxes, makeshift aircon ducting made of cardboard and tape, and some dude flailing an open fire door back and forth all day to get a little bit of fresh air in. Wild to see in 2010-ish with multi-million-dollar customers.
I'd have considered calling in a few friends from the fire brigade or the local civil protection service.
It's not an emergency, sure. But if you want a scenario for your trainees to figure out how to ventilate a building with the force of a thousand gasoline-driven fans, with nobody complaining and no danger to anyone... well, be my guest, because I can't hear you anymore. Those really big fans are loud AF, seriously.
And, on a more serious note, you could show those blokes how a DC works: where the power goes, what the components do, how to handle an uncontrolled fire in the various areas. It would be a major benefit to the local firefighters.