Damn, every emergency power supply I have encountered (the big ones with fuel and hundreds of batteries) has failed to start when it had to... Why is that?
I've seen places that /did/ test their backup power - but they got failures anyway, because of faults the test didn't reveal.
For example you switch off the data centre circuit breakers and everything fails over to generators just fine. Test successful, right?
Then when there's a real outage you have problems because the operations team's computers have all gone off, so they can't migrate load to a different data centre. It didn't happen in testing, because they aren't in the data centre so their kit isn't connected to the breakers you turned off.
Or it turns out the wireless APs aren't on UPSes. Or it turns out there's a switch in a closet somewhere that isn't on a UPS. Or they tested for a single loss of power, but when the mains power toggles on and off every 30 seconds the UPS batteries get run down. Or they need to top up the generator and they discover you can't get fuel delivered at 9pm on a Friday. Or the generator doesn't recharge the UPS, but you have to turn off the generators to refuel them. Or a guy had a standalone UPS for his desktop, but his monitor wasn't connected as the UPS only came with IEC C13 power cables and his monitor needed IEC C5...
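To put rough numbers on the flapping-mains case, here is a minimal Python sketch. The model and every figure in it are assumptions for illustration only (a UPS with 10 minutes of runtime at load that takes 8 hours to recharge, with linear charge and discharge); real batteries will behave differently.

```python
from typing import Optional


def minutes_until_empty(runtime_min: float, recharge_hours: float,
                        on_s: float = 30.0, off_s: float = 30.0,
                        limit_hours: float = 24.0) -> Optional[float]:
    """Simulate mains toggling off/on; return minutes until the UPS battery
    is empty, or None if it survives the whole simulated period.

    Hypothetical linear model: the battery drains at a constant rate while
    mains is off and recharges at a (much slower) constant rate while on.
    """
    charge = 1.0                                   # fraction of a full battery
    drain_per_s = 1.0 / (runtime_min * 60.0)       # discharge rate on battery
    charge_per_s = 1.0 / (recharge_hours * 3600.0) # recharge rate on mains
    t = 0.0
    while t < limit_hours * 3600.0:
        # Mains off: the load runs from the battery.
        charge -= drain_per_s * off_s
        t += off_s
        if charge <= 0.0:
            return t / 60.0
        # Mains back on: the battery recharges, capped at full.
        charge = min(1.0, charge + charge_per_s * on_s)
        t += on_s
    return None


# Mains flapping every 30 seconds: recharge barely offsets the drain.
print(minutes_until_empty(runtime_min=10, recharge_hours=8))
# One 30-second blip per hour: the battery recovers fully each time.
print(minutes_until_empty(runtime_min=10, recharge_hours=8, on_s=3600))
```

Under these assumed numbers the recharge during each 30-second "on" window is tiny compared to the drain during each "off" window, which is why a test with a single clean outage says little about this case.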
One place I worked at found this out during the big storm in the UK: the UPS worked fine and all the Telecom Gold machines stayed up - but they had forgotten to put the modems on the UPS :-)
While at $bigco we halted testing of generation equipment because it was sending DCs offline more often than it kept them up. Lawyers were involved, things got ugly.
I'm completely unfamiliar with electrical generators/power generation, so take this question in the spirit of ignorance:
Is there not a way to test generators without actually having them power the live datacenter infrastructure? I mean, simulate the exact generation and load requirements that the generators will face?
I don't know if it's feasible to dump all that power to ground or whatever, but that way you could test the generator under full load at will and identify issues without impacting the (operationally) live datacenter itself.
This equates a bit in my head with the 'verifying the backup' bit you might get in software, whereas actually using the generator to power the live operational datacenter equipment would be more like 'restore from backup'.
> Is there not a way to test generators without actually having them power the live datacenter infrastructure? I mean, simulate the exact generation and load requirements that the generators will face?
That would exercise transfer switches in addition to generators. Transfer switches are always energized, except when the equivalent of a big red mechanical switch is flipped into the "OFF" position. When it is in the off position, neither mains nor generator is going to provide power to the customer. The biggest power consumer is actually the cooling system. If an HVAC system stops functioning in a typical data center, the temperature quickly rises to Arizona desert levels, destroying metric tons of equipment. Transfer switches have a small chance (say 1%) of not going back to the correct state after a flip when power is restored. Normally that is not a big deal, because you really need to lose power under full load often to get bitten by it, and if you are losing power that much you should probably address it with the utility company. However, if you are doing your full test once a week, in one year you would introduce 52 power failures.
Say your transfer switch is stuck in the wrong position. Now you need to shut down all heat-generating equipment so that when you drop power to fix/replace the failed component you don't melt your data center... Congratulations, your data center now has to go offline.
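The arithmetic behind the "52 power failures" point can be sketched in a few lines of Python. This assumes the 1% figure above (which was only a "say") and that each transfer is an independent event; both are simplifications.

```python
def p_at_least_one_failure(per_transfer_failure: float, transfers: int) -> float:
    """Probability that at least one of `transfers` independent transfers
    leaves the switch in the wrong state."""
    return 1.0 - (1.0 - per_transfer_failure) ** transfers


# Assumed 1% chance of a stuck switch per transfer, as in the comment above.
for label, tests_per_year in [("weekly", 52), ("monthly", 12), ("quarterly", 4)]:
    p = p_at_least_one_failure(0.01, tests_per_year)
    print(f"{label:>9} full-load tests: {p:.1%} chance of a stuck switch per year")
```

With those assumptions, weekly full transfers give roughly a 40% chance per year of at least one stuck switch, against around 11% for monthly tests, which is the trade-off being described.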
You can if you have to. But then you're really only doing a fancy simulation.
I accompanied my dad (power engineer) to a water purification plant where they were testing new equipment for the backup generator. Their weekly tests there involved moving the entire plant to the diesel generator and running it off backup power for a couple of hours (once you start a big generator you have to let it run or it won't last long).
Potential problems for your generator that a resistor bank won't capture include power factor (phase shift from a motor or switching supply), harmonics (from switching power supplies), and startup transients (from every power supply's capacitors).
All these things can trip the generator, or worse, burn it out.
So if you can't test with the real load, supersize it!
P.S. Every test is a simulation of reality. At Fukushima the diesel generators flooded. Lesson - the unknown reason that'll knock out your grid can knock out your backup
P.P.S. If you can, gently turning the load back on is very beneficial. Don't flip the master switch that controls all your load; switch on part of your load, wait a while for the system to stabilize, then switch on the next part.
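To make the power factor point concrete, here is a small Python sketch of the standard relationship between real, apparent, and reactive power. The 800 kW load and 0.8 power factor are made-up example numbers, not figures from the plant described above.

```python
import math


def apparent_power_kva(real_kw: float, power_factor: float) -> float:
    """Apparent power S = P / PF that the generator must actually supply."""
    if not 0.0 < power_factor <= 1.0:
        raise ValueError("power factor must be in (0, 1]")
    return real_kw / power_factor


def reactive_power_kvar(real_kw: float, power_factor: float) -> float:
    """Reactive power Q = sqrt(S^2 - P^2), which a resistor bank never draws."""
    s = apparent_power_kva(real_kw, power_factor)
    return math.sqrt(max(s * s - real_kw * real_kw, 0.0))


# Hypothetical example: the same 800 kW of real power, two kinds of load.
for name, pf in [("resistor bank", 1.0), ("motors and switching supplies", 0.8)]:
    s = apparent_power_kva(800.0, pf)
    q = reactive_power_kvar(800.0, pf)
    print(f"{name}: {s:.0f} kVA apparent, {q:.0f} kvar reactive")
```

A purely resistive test load at 800 kW only asks the generator for 800 kVA, while the same real power at a 0.8 power factor needs about 1000 kVA. Harmonics and startup transients are not captured by this calculation at all.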
> P.S. Every test is a simulation of reality. At Fukushima the diesel generators flooded. Lesson - the unknown reason that'll knock out your grid can knock out your backup
Well, the lesson there was more that it's stupid to put your diesel generators deep in the ground when they have to withstand sudden sea level rises. (I think it's never a good idea to do that. I've seen generators put in specially protected places even deep inside Germany, just because some cautious people thought they could still be flooded by groundwater, etc.)
The generators were placed low to ensure they would not be disabled by a major earthquake, as shaking intensity increases with height above ground.
The basement was, according to plan, protected by a seawall.
The height of the seawall failed to take into account the fact that at a subduction zone, as pressure builds, the land-side plate rises, and when the earthquake relieving that pressure strikes, the land falls, by as much as several meters.
That was one element among others, but it proved sufficient to kick off the Fukushima disaster, given the other factors.
In hindsight, placing the generators at ground level in an elevated location might have been a better bet. Or locating the entire generating plant further upslope.
> In hindsight, placing the generators at ground level in an elevated location might have been a better bet. Or locating the entire generating plant further upslope.
Yeah, well, that's what I meant. I mean they were buried deep...
There were even studies that this was dumb:
And basically TEPCO knew that. They were just too lazy (it was probably too costly) to do something about it (e.g. placing them higher or building watertight bunkers, submarine style).
Besides the generators there was also the human failure part. (Most of the time, human failure happens. I mean, if I do not automate something, I might do it right three times and then fail hard the fourth time...)
You know all that cooling equipment data centers tend to have?
Dumping power to ground is also known as an electric furnace.
Now you have twice as much heat to move, but you only have the usual cooling system. Toasty servers are sad servers. Toasty engineers are dead engineers.
Ok your reply makes no sense to me (see first line of my original comment).
I don't understand why you'd be putting a generator test load through the servers in addition to the normal power supply??? Why would you do that???
I would have thought that you have normal operation going on in the datacenter, normal cooling infrastructure, normal power coming in. Then the generators are turned on with their built-in cooling and deliver power to the test area, not the datacenter, so the test area is the only additional thing you'd need to cool.
Seems like dealing with the heat would be not that hard, but perhaps I have too much faith in engineers? :-)
Even just a massive heating element in a tank of water, giant kettle style would work wouldn't it? Big kettle mind you, but big tank too. Seems like a cheap way to test the generator at full load for an extended test?
Your test would only test the generators. It would not test the transfer switches, any equipment in between, or whether everything you think is connected to the backup generator is actually connected.
Interesting. I've worked at a few hospitals in the UK (not in IT) and all of them do a generator test once a week. However they do it, only one of them lost all power (for a couple of seconds) every week. It was quite disruptive, but I guess cutting the main power and checking the auxiliary power kicks in is a better test than simply powering up the generators.
We had a generator that ran a two hour test every Thursday. It ran fine one Thursday, the next day we had a power outage and it failed to start because a capacitor went bad.
Lack of testing. Ideally you throw a switch once a month or so to make sure the UPSes work and the generators power on, but I can imagine it's something you'd rather not do: if it doesn't work, you've just intentionally borked things for your customers. Still, it is something that datacenters should do: do it a few times before you have any customers, then keep doing it. Maybe offer a discount (if it's a new datacenter for an existing provider, like AWS or whatever) for the first couple of months, indicating that emergency power has not yet been tested properly.
There's tooling available and running at some of the larger companies (like Netflix) called Chaos Monkey, which triggers random outages constantly to determine if the system is resilient and self-healing.
We always had our generators cycle on once a week for 20 minutes or so without assuming load. You don't want the (in our case propane) engines sitting too long without being run.