OVH Incident in Strasbourg (ovh.com)
311 points by fvv on Nov 9, 2017 | 202 comments



More info on Twitter from OVH's CEO: https://twitter.com/olesovhcom

and on https://twitter.com/ovh_support_en

"SBG: ERDF is trying to find out the default. 2 separated 20kV lines are down. We are trying to restart 2 generators A+B for SBG1/SG4. 2 others generators A+B work in SBG2. 1 routing room is in SBG1, the second in SBG2. Both are down. "

"An incident is ongoing impacting our network. We are all on the problem. Sorry for the inconvenience."

"SBG: 1 gen restarted."

"RBX: all optical links 100G from RBX to TH2, GSW, LDN, BRU, FRA, AMS are down."


BTW this seems to be a better status page than the one submitted to HN (which is 404ing)

http://status.ovh.com/


The status page was down during the outage.


If so, then it's just like Amazon's status page during the AWS outage [1].

Pro-tip: self-hosting status page is maybe not the best idea.

[1] https://twitter.com/awscloud/status/836656664635846656?lang=...


Like the old Red Dwarf episode:

Lister: What's the damage Hol?

Holly: I don't know Dave. The damage report machine has been damaged.



It said the database was corrupted, so yeah, probably routing configuration.

(edit: parent's post was edited after I posted)


It started with all our SBG servers going down simultaneously. Approximately 1h later all our RBX servers went down as well including the OVH status page and all other OVH web applications. Either their SBG and RBX data centers are somehow connected or those are indeed two independent incidents.


SBG was a power failure, followed by a generator failure. Not sure if they set up their network infrastructure in a way that this could spread but I find it hard to imagine that the outage in SBG triggered RBX going completely black. Unless of course they store their configuration files in SBG.


It seems RBX was an unrelated failure corrupting the router configuration (@olesovhcom).

SBG going down due to a quadruple power failure (both grid connections and both generators) is quite spectacular.


I moved away from OVH after I paid 3 months in advance (~$300) for a server which burned down after 1 1/2 months. They did not issue any refunds (data, blood, sweat and tears were lost that day). I had been an OVH customer for 12 years.

Today, I'm glad to have moved away all my production environments as well.


I'm still at OVH (support is reasonable and prices cheap) but would never trust one provider with all my infrastructure. In the end it always turns out that there is a single point of failure, even if it's just the billing department. Using two providers protects you from that, and if you choose ones with good peering and free traffic, keeping both in sync is relatively easy.

That's exactly the problem with services like AWS: traffic is too costly to keep production live with another provider as well.


Spot on.

Having one AWS account scares the crap out of me as well. It’s never a good thing if all your eggs are in one basket.

My money is on stuff spread across Bytemark, Linode and DigitalOcean with a DR plan involving mostly automatic recovery.

AWS doesn’t get a look in as it is extremely costly to port away from anything that isn’t bare metal and pipes.


How do you do fail-over?


Mirror your data onto another provider continuously (log shipping/rsync), then switch DNS.

Ansible works well for this stuff as it lets you can the task of “quick, get me a production environment up on Linode!”

If you can afford some downtime you don’t need a hot standby: just roll everything out onto a new provider and you’re done.

I’ve done this on very large scale environments and small ones and it’s achievable even for small organisations. The killer is to avoid anything you can’t run on bare metal servers.
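Roughly, a minimal sketch of the moving parts (assuming rsync over SSH; the hostnames, paths and the DNS step are placeholders for whatever you actually run):

    #!/usr/bin/env python3
    """Sketch of the mirror-then-switch-DNS approach described above."""
    import socket
    import subprocess

    PRIMARY = "primary.example.com"   # placeholder hosts
    STANDBY = "standby.example.com"
    DATA_DIRS = ["/var/lib/app", "/etc/app"]

    def mirror_to_standby():
        # Run this from cron: copy data to the standby host (log shipping works too).
        for path in DATA_DIRS:
            subprocess.run(
                ["rsync", "-az", "--delete", path + "/", f"backup@{STANDBY}:{path}/"],
                check=True,
            )

    def primary_is_up(port=443, timeout=5):
        # Crude health check: can we open a TCP connection to the primary?
        try:
            with socket.create_connection((PRIMARY, port), timeout=timeout):
                return True
        except OSError:
            return False

    def switch_dns(record, target):
        # Placeholder: call your DNS provider's API (or run your Ansible playbook) here.
        print(f"would point {record} at {target}")

    if __name__ == "__main__":
        mirror_to_standby()
        if not primary_is_up():
            switch_dns("www.example.com", STANDBY)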


I also take this approach. Get a decent DNS provider that is independent of your infrastructure and take care of your Ansible scripts. This way, in an emergency you run a one-liner, have a new production server running, and change the DNS settings.


In addition I'd suggest getting a second domain in a TLD operated by another company in a different country than the primary TLD, and teaching your customers/users that both are valid. This protects you from three things:

1) your DNS provider having issues (even Route53 sometimes has them, https://mwork.io/2017/03/14/aws-route53-dns-outage-impacts-l...)

2) legal issues, when one of your domains gets seized or the provider gets pressured into cancelling it, as has happened with The Pirate Bay, Sci-Hub and friends, gambling sites, sites with user-generated content that may be illegal or frowned upon in some countries, or (in the pressured-cancellation case) the Nazi site Daily Stormer (although I'm glad it's down, it's a perfect example of what can happen in a very short time frame)

3) (edit, after suggestion below) the entire TLD going down because the TLD DNS provider has issues, which also happens from time to time.


Good advice. We had some serious trouble when .io went down a few weeks back.


My experience with OVH VPSes is that they are cheap and reasonably reliable. But when one started failing, support never helped me get it fixed.

Just fire up another one, install everything on there, switch off the old one and forget about it. This means never paying for more than one month in advance.


Just some good advice: never ever trust a single point/server without backups. Do backups, do snapshots (check if and how you can do that), and store them somewhere else. Aim for automated reinstalls with e.g. Ansible so that you can move easily. It's an investment at the beginning and while running the server, but if the stuff is critical you should always do it.
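For the "store them somewhere else" part, even something this small is better than nothing. A sketch only: the paths and the off-site host are placeholders, and a real setup would add encryption and retention:

    #!/usr/bin/env python3
    """Nightly backup sketch: tar the important paths and ship them off-box."""
    import datetime
    import subprocess

    PATHS = ["/etc", "/var/lib/app"]                 # placeholder paths
    REMOTE = "backup@offsite.example.org:backups/"   # somewhere NOT at the same provider

    def backup():
        stamp = datetime.date.today().isoformat()
        archive = f"/tmp/backup-{stamp}.tar.gz"
        subprocess.run(["tar", "czf", archive, *PATHS], check=True)
        subprocess.run(["scp", archive, REMOTE], check=True)

    if __name__ == "__main__":
        backup()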


Whatever.

Servers break. Providers go down.

It happens to all of them. Whenever I read comments like yours I wonder where you moved your servers to, and if you'll move them again when that goes down too.


Damn, every emergency power supply I have encountered (the big ones with fuel and hundreds of batteries) always fails to start when it has to... Why is that?


People not actually testing emergency equipment.


I've seen places that /did/ test their backup power - but they got failures anyway, because of faults the test didn't reveal.

For example you switch off the data centre circuit breakers and everything fails over to generators just fine. Test successful, right?

Then when there's a real outage you have problems because the operations team's computers have all gone off, so they can't migrate load to a different data centre. It didn't happen in testing, because they aren't in the data centre so their kit isn't connected to the breakers you turned off.

Or it turns out the wireless APs aren't on UPSes. Or it turns out there's a switch in a closet somewhere that isn't on a UPS. Or they tested for a single loss of power, but when the mains power toggles on and off every 30 seconds the UPS batteries get run down. Or they need to top up the generator and they discover you can't get fuel delivered at 9pm on a Friday. Or the generator doesn't recharge the UPS, but you have to turn off the generators to refuel them. Or a guy had a standalone UPS for his desktop, but his monitor wasn't connected as the UPS only came with IEC C13 power cables and his monitor needed IEC C5...


One place I worked at found this during the big storm in the UK: the UPS worked fine and all the Telecom Gold machines stayed up - but they had forgotten to put the modems on the UPS :-)


This is why chaos monkeys are a good thing. Regular drills too.

But you want to test the things you haven't thought about.


While at $bigco we halted testing of generation equipment because it was sending DCs offline more often than it kept them up. Lawyers were involved, things got ugly


I'm completely unfamiliar with electrical generators/power generation, so take this question in the spirit of ignorance:

Is there not a way to test generators without actually having them power the live datacenter infrastructure? I mean, simulate the exact generation and load requirements that the generators will face?

I don't know if it's feasible to dump all that power to ground or whatever, but that way you could test the generator under full load at will and identify issues without impacting the (operationally) live datacenter itself.

This equates a bit in my head with the 'verifying the backup' bit you might get in software, whereas actually using the generator to power the live operational datacenter equipment would be more like 'restore from backup'.

I don't know if it's possible though.


> Is there not a way to test generators without actually having them power the live datacenter infrastructure? I mean, simulate the exact generation and load requirements that the generators will face?

That would exercise the transfer switches in addition to the generators. Transfer switches are always energized, except for the time when the equivalent of a big red mechanical switch is flipped into the "OFF" position. When it is in the off position, neither mains nor generator is going to provide power to the customer. The biggest power consumer is actually the cooling system. If the HVAC system stops functioning in a typical data center, the temperature quickly rises to Arizona-desert levels, destroying metric tons of equipment. Transfer switches have a certain probability (say 1%) of not going back to the correct state after a flip when power is restored. Normally that is not a big deal, because you'd need to lose power under full load often to get bitten by it, and if you are losing power that often you should probably take it up with the utility company. However, if you are doing your full test once a week, in one year you introduce 52 power failures.

Say your transfer switch is stuck in the wrong position. Now you need to shut down all heat-generating equipment so that when you drop power to fix/replace the failed component you don't melt your data center... Congratulations, your data center now has to go offline.
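To put a rough number on that (taking the 1% figure at face value and assuming independent weekly tests): the probability of at least one bad flip per year is 1 - 0.99^52 ≈ 41%, versus roughly 1 - 0.99^12 ≈ 11% for monthly tests. That's the trade-off between test coverage and self-inflicted outages.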


You can if you have to. But then you're really only doing a fancy simulation.

I accompanied my dad (power engineer) to a water purification plant where they were testing new equipment for the backup generator. Their weekly tests involved moving the entire plant to the diesel generator and running it off backup power for a couple of hours (once you start a big generator you have to let it run or it won't last long).

Potential problems for your generator that a resistor bank won't capture include: power factor (phase shift from a motor or switching supply), harmonics (from switching supplies), and startup transients (from every power supply's capacitors).

All these things can trip the generator, or worse, burn it out.

So if you can't test with the real load, supersize it!

P.S. Every test is a simulation of reality. At Fukushima the diesel generators flooded. Lesson - the unknown reason that'll knock out your grid can knock out your backup

P.P.S. If you can, gently turning the load back on is very beneficial. Don't flip the master switch that controls all your load - flip part of your load, wait a while for the system to stabilize, then flip the next part.


> P.S. Every test is a simulation of reality. At Fukushima the diesel generators flooded. Lesson - the unknown reason that'll knock out your grid can knock out your backup

Well, the lesson there was more that it's stupid to put your diesel generators deep in the ground when they need to withstand sudden sea-level surges. (I think it's never a good idea to do that; I've seen purpose-built rooms for them even deep inside Germany, just because some cautious people worried that they could still be flooded by ground water, etc.)


The generators were placed low to ensure they would not be disabled by a major earthquake, as shaking intensity increases with height above ground.

The basement was, according to plan, protected by a seawall.

On a subduction zone, as pressure builds, the land-side plate rises, and when the earthquake relieving that pressure strikes, the land falls -- by as much as several meters.

The seawall's height failed to account for this.

That, among other elements, proved sufficient to kick off the Fukushima disaster.

In hindsight, placing the generators at ground level in an elevated location might have been a better bet. Or locating the entire generating plant further upslope.


> In hindsight, placing the generators at ground level in an elevated location might have been a better bet. Or locating the entire generating plant further upslope.

Yeah, well, that's what I meant. They were buried deep... There were even studies showing that this was dumb:

- https://news.usc.edu/86362/fukushima-disaster-was-preventabl...

- https://en.wikipedia.org/wiki/Fukushima_Daiichi_Nuclear_Powe... (section end)

- http://carnegieendowment.org/2012/03/06/why-fukushima-was-pr...

and basically TEPCO knew that. They were just too lazy (it was probably too cost-intensive) to do something about it (i.e. placing them higher or building watertight bunkers, U-boat style).

Besides the generators there was also the human-failure part. (Most of the time it's human failure. If I don't automate something, I might do it right three times and then fail hard on the fourth...)


> dump all the power to ground

You know all that cooling equipment data centers tend to have?

Dumping power to ground is also known as an electric furnace.

Now you have twice as much heat to move, but you only have the usual cooling system. Toasty servers are sad servers. Toasty engineers are dead engineers.


Ok your reply makes no sense to me (see first line of my original comment).

I don't understand why you'd be putting a generator test load through the servers in addition to the normal power supply??? Why would you do that???

I would have thought that you have normal operation going on in the datacentre, normal cooling infrastructure, normal power coming in. Then the generators turn on with their built-in cooling and deliver power to the test area and not the datacenter - which is then the only additional area you'd need to cool.

Seems like dealing with the heat would be not that hard, but perhaps I have too much faith in engineers? :-)

Even just a massive heating element in a tank of water, giant kettle style would work wouldn't it? Big kettle mind you, but big tank too. Seems like a cheap way to test the generator at full load for an extended test?


Your test would only test the generators. It would not test the transfer switches, or whether everything you think is connected to the backup generator is actually connected, and any equipment in between.


What you are talking about is a 'load bank'. It's basically a massive hair dryer. Many data centres have these on the roof for exactly this purpose.


Interesting. I've worked at a few hospitals in the UK (not in IT) and all of them do a generator test once a week. However they do it, only one of them lost all power (for a couple of seconds) each week. It was quite disruptive, but I guess cutting the mains power and checking that the auxiliary power kicks in is a better test than simply powering up the generators.


Sounds like you're doing it on the cheap.


It happens: our generators were tested recently, and both failed. Testing matters; hope is not a plan.

Same goes for backups: you don't have backups until you've tried to restore them.
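One way to make that a habit - a sketch only, assuming the backups are plain tarballs and you keep a short list of files that must exist after a restore; adapt it to whatever your backup tool actually produces:

    #!/usr/bin/env python3
    """Restore the latest backup into a scratch directory and sanity-check it."""
    import tarfile
    import tempfile
    from pathlib import Path

    BACKUP_DIR = Path("/backups")   # placeholder location
    EXPECTED = ["etc/app/app.conf", "var/lib/app/db.sqlite3"]   # placeholder files

    def latest_backup():
        archives = sorted(BACKUP_DIR.glob("backup-*.tar.gz"))
        if not archives:
            raise SystemExit("no backups found - that is already a finding")
        return archives[-1]

    def verify(archive):
        with tempfile.TemporaryDirectory() as scratch, tarfile.open(archive) as tar:
            tar.extractall(scratch)   # restore into a throwaway directory
            missing = [f for f in EXPECTED if not (Path(scratch) / f).is_file()]
        if missing:
            raise SystemExit(f"{archive.name}: missing {missing}")
        print(f"{archive.name}: restore test passed")

    if __name__ == "__main__":
        verify(latest_backup())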


We had a generator that ran a two hour test every Thursday. It ran fine one Thursday, the next day we had a power outage and it failed to start because a capacitor went bad.


Lack of testing; ideally you throw a switch once a month or so to make sure the UPSes work and the generators power on, but I can imagine it's something you'd rather not do: if it doesn't work, you've just intentionally borked things for your customers. Still, it is something that datacenters should do - do it a few times before you have any customers, and keep doing it. Maybe offer a discount (if it's a new datacenter for an existing provider, like AWS or whatever) for the first couple of months, indicating that emergency power has not yet been tested properly.

There's tooling available and running at some of the larger companies (like Netflix) called Chaos Monkey, which triggers random outages constantly to determine if the system is resilient and self-healing.
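The core idea fits in a few lines. This is a toy sketch of the concept, not Netflix's actual tool; the host list is a placeholder and it defaults to a dry run:

    #!/usr/bin/env python3
    """Toy chaos test: pick one host at random and knock it over, on a schedule."""
    import random
    import subprocess
    import sys

    HOSTS = ["web1.example.com", "web2.example.com", "db1.example.com"]   # placeholders

    def break_something(dry_run=True):
        victim = random.choice(HOSTS)
        print(f"chaos target: {victim}")
        if dry_run:
            print("dry run - not actually rebooting")
            return
        # Real run: reboot the victim over SSH and let monitoring/failover prove itself.
        subprocess.run(["ssh", f"root@{victim}", "reboot"], check=False)

    if __name__ == "__main__":
        break_something(dry_run="--for-real" not in sys.argv)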


We always had our generators cycle on once a week for 20 minutes or so without assuming load. You don't want the (in our case propane) engines sitting too long without being run.


It's a hard problem to solve. We're talking about machines that are required to deliver massive load but that aren't even used once in a given year.


Survivorship bias. Press releases don't get made when equipment works as expected.


UPDATE: not all datacenters are down. It seems that way in Europe because OVH routing hasn't been updated, so from our point of view everything is down, but really it is not :)


I think they don't know themselves what works. The CEO said GRA is down while I can access it without issues, but that could depend on where you try to connect from.


Where did he say GRA was down?


Looks like he deleted the tweet but he wrote earlier that GRA was down with the others while BHS was up. I'd guess he is in RBX and couldn't reach GRA so assumed it was down until they realised that the RBX-GRA line is down.

EDIT: He did indeed delete the tweet, [1] is the url in case anyone knows a website archiving them quickly enough.

[1] https://twitter.com/olesovhcom/status/928536233311076353


I have several servers in GRA which were unreachable for two hours. Might have been due to the routing or DDoS-protection infrastructure.


Network and RBX are UP again: https://twitter.com/olesovhcom/status/928556358353539072 (but SBG's datacenters are still being restarted)


No it isn't.


Servers appear up; just the website is struggling (probably everyone logging in at the same time to file a ticket and complain).


Wow, yesterday I was playing with their public cloud because I'm considering choosing them. I had some connection problems with my private networking there (deleted it more than once) and opened a ticket. If it was me... sorry, haha. Not a good advertisement, but it can happen to anyone.


Huh, I was just looking at them too. Contrary to popular opinion, I kinda prefer it when these things happen before I sign up, so that in the post-mortem whatever architectural failure led to the outage is usually corrected and you get a stronger service.

Usually.


OVH are very open about their outages and root causes. Even routine maintenance tasks are cataloged with updates at http://status.ovh.net

As a more technical user it's nice to have providers that give this information rather than the boilerplate "Issue with an upstream provider" over and over.


Not saying this is the case here, but that's also sometimes where you spot amateurism and should run away. I remember a hosting provider a long time ago (15y) who was storing its backups on the same machine as the main data. Guess how I found out!


Oh man that sounds bad! You're right, sometimes these events do expose the inability to handle failure and you're right that that means walking away.

Thankfully many times we're reminded that there are good people out there working hard against difficult constraints and they finally get their chance to do things 'correctly' in the wake of the SHTF.


Wouldn't this require that there be a very limited amount of things that can go wrong?

Unless by stronger you mean some kind of a change in mentality over care for the service, but that would probably have a limited lifespan until it's back to normal.


Personally, I like to know how a service fails and recovers.

If a service never fails then I don't know how well they can recover. I don't know anything about their failure mode.

In this case, I'm learning how OVH handles failure modes, how well they handle it, etc.

I can observe how they will treat such things in the future.


Hey I would suggest not completely disregarding OVH. Their prices are good and their network is usually very reliable and fast. You simply won't find a North American hosting provider with those prices and reliability. You won't find one at double OVH prices, either (and I have tried).


No, I'm not completely disregarding OVH; I know about their network and reliability in the past. I will only rethink my disaster strategy in the future.


I signed up with ovh.com.au last month and have been testing stuff on it. Down too.


Can't wait to read the detailed follow-up on this in a few days; it is always interesting to see how such major outages happen.


Not "all datacenters". Only 2 of them. They have 22, not counting all the POPs.


9 by their count (7xRBX+2xSBG). When this was posted, the CEO wrote that everything except BHS (Canadian DC) was offline (tweet now deleted). Presumably he was in RBX and noticed that they couldn't reach any of the other DCs. So it looked like all DCs were down for a while.


9 buildings, 2 locations.


This affects DNS as well, since domaindiscount24 (a rather large registrar in Germany) happens to host all three of their nameservers with OVH.

Just in case you wonder why your sites don't work, even if you host them somewhere else.


This seems like really poor planning on domaindiscount24's part.


This seems like something I'd expect from someone called domaindiscount24.


The status page is up again http://status.ovh.net/

I paste the report so far:

-------------

FS#15162 — SBG

Attached to Project— Network

Task Type: Incident

Category: Strasbourg

Status: In progress

Percent Complete: 0%

Details

We are experiencing an electrical outage on Strasbourg site.

We are investigating.

Comments (2)

Comment by OVH - Thursday, 09 November 2017, 10:55AM

SBG: ERDF repared 1 line 20KV. the second is still down. All Gens are UP. 2 routing rooms coming UP. SBG2 will be UP in 15-20min (boot time). SBG1/SBG4: 1h-2h

Comment by OVH - Thursday, 09 November 2017, 12:04PM

Traffic is getting back up. About 30% of the IP are now UP and running.

-------------

VPSes are still marked as read in the dashboard. I can't access mine.


More:

Comment by OVH - Thursday, 09 November 2017, 12:44PM

Everything is back up electrically. We are checking that everything is OK and we are identifying still impacted services/customers.

Comment by OVH - Thursday, 09 November 2017, 13:25PM

Hello, Two pieces of information,

This morning we had 2 separate incidents that have nothing to do with each other. The first incident impacted our Strasbourg site (SBG) and the 2nd Roubaix (RBX). In SBG we have 3 datacentres in operation and 1 under construction. In RBX, we have 7 datacentres in operation.

SBG: In SBG we had an electrical problem. Power has been restored and services are being restarted. Some customers are UP and others not yet. If your service is not UP yet, the recovery time is between 5 minutes and 3-4 hours. Our monitoring system allows us to know which customers are still impacted and we are working to fix it.

RBX: We had a problem on the optical network that allows RBX to be connected with the interconnection points we have in Paris, Frankfurt, Amsterdam, London, Brussels. The origin of the problem is a software bug on the optical equipment, which caused the configuration to be lost and the connection to be cut from our site in RBX. We handed over the backup of the software configuration as soon as we diagnosed the source of the problem and the DC can be reached again. The incident on RBX is fixed. With the manufacturer, we are looking for the origin of the software bug and also looking to avoid this kind of critical incident.

We are in the process of retrieving the details to provide you with information on the SBG recovery time for all services/customers. Also, we will give all the technical details on the origin of these 2 incidents.

We are sincerely sorry. We have just experienced 2 simultaneous and independent events that impacted all RBX customers between 8:15 am and 10:37 am and all SBG customers between 7:15 am and 11:15 am. We are still working on customers who are not UP yet in SBG. Best, Octave


Btw, note for those who use OVH as an ISP like me (this is a thing in France): your connection works, only the DNS servers do not.

Fix (debian-like):

    sudo apt-get install bind9
Then put in /etc/resolv.conf, if it's not already there:

    nameserver 127.0.1.1
This runs a local nameserver that you use directly for resolving.

Oh, obviously, you need resolving to install the resolver :) Hope you have a 4g connection available.

Alternatively, you can just use google dns:

    nameserver 8.8.8.8
    nameserver 8.8.4.4


My OVH dedicated servers seem fine. Webservers, ssh, all working. All ones in Canada.


My vserver also seems fine, uptime 528 days. The control panel seems to be down however.


Mine too. All ones in Roubaix.


All of our dozen or so bare metal boxes are up in GRA as well as all of our cloud instances. However object storage is down.


They now posted their explanation [1] but I don't buy it. I find it hard to believe that the RBX incident happened shortly after the SBG incident without any connection between the two. They should have redundant networking (at least that's what they say), so one corrupted DB in RBX shouldn't have brought down the whole DC (or 7 DCs according to their system). Maybe they pulled corrupt data from SBG because it was down, but I don't believe that at the same time as a power failure, two redundant network nodes got corrupted without anyone noticing. Otherwise, wouldn't that mean that one hardware issue can also bring down a whole region?

[1] http://status.ovh.net/?do=details&id=15162&PHPSESSID=7220be2...


Some servers in GRA still appear to work if that's of any help. All data centres offline at once sounds more like an attack than a power failure in one location. According to them, there was a power failure in SBG but I don't see how that should affect routing in data centres several hundred miles away.

https://twitter.com/olesovhcom/status/928541667283623936

EDIT: Maybe related to the Cisco issue?

https://blogs.cisco.com/security/cisco-psirt-mitigating-and-...


It seems more likely that their data centers aren't quite as isolated as they thought they'd be. The outage also appears to be limited to their locations in Europe.


I always wondered why they promote having 6 datacentres in Roubaix when google maps shows that they're all within 50m. Can't be too much redundancy there.


It's their original site. The 6 DCs in Roubaix are probably there for more storage capacity, not for redundancy.


Then they could call it one DC. A DC can have more than one building. But by saying you have several data centres you imply redundancy, similar to AZs at AWS. And (at least to Amazon), two AZs are far enough away from each other so that one building could blow up without affecting the other one.


The two issues are isolated.

One datacenter suffered a massive power failure while the other had network equipment fail at roughly the same time.


The network equipment failure was due to a software error[1]. I would not rule out some kind of cascading failure triggered by the initial outage until the cause of the bug is known, though OVH seems comfortable with such a statement.

[1]: http://travaux.ovh.net/?do=details&id=28256


> According to them, there was a power failure in SBG but I don't see how that should affect routing in data centres several hundred miles away.

According to them, they do their routing in SBG, so it's plausible that it could lead to all of their network being down.


How is that even a thing? Don't you have separate routing for every switch? Otherwise you don't have redundancy? I'd even expect data centres to use different hardware for networks to avoid having a single point of failure.


I have a GRA VPS running fine as well. Can't access the control panel or anything else really, though.


Maybe an attack related to the recent OVH price increase: https://news.ycombinator.com/item?id=15596282


Doesn't look like a DDoS though; it would have to be a targeted attack on the infrastructure.

Their explanation will be really interesting. All those data centres are pretty useless if they have a single point of failure.


Details here: http://travaux.ovh.net/?do=details&id=28244

Apparently, the root cause of that issue is a critical software bug in Cisco NCS 2000 transponders.


> "Diagnosis: All the transponder cards we use, ncs2k-400g-lk9, ncs2k-200g-cklc, are in "standby" state. One of the possible origins of such a state is the loss of configuration. So we recovered the backup and put back the configuration, which allowed the system to reconfigure all the transponder cards."

Their interfaces lost their configuration, and they re-applied configuration, and state came back. This does not equal critical software bug.

> "One of the solutions is to create 2 optical node systems instead of one. 2 systems, that means 2 databases and so in case of loss of configuration, only one system is down. If 50% of the links go through one of the systems, today we would have lost 50% of the capacity but not 100% of links."

This is a crap mitigation. They're still depending on the same hardware and process that led to the first outage, only now there's more of it, so there's more chances to fail.

If they had continuous configuration automation they would have detected when the router's state changed, identified the missing bits, and applied configuration.

"New" routers (as in, since 2011) have APIs and can even run code directly on the router in order to fulfill these requirements. Cisco has multiple white papers, and even provides complete products to manage and certify configuration is applied as desired, even in cloud-agnostic multi-tier networks. Even on old routers, practically all config management solutions out there have plugins to manage Cisco routers.

It's also ridiculous that they had no access to remote hands. This is IT 101.
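For illustration, the kind of continuous configuration automation mentioned above can start as small as this - a sketch using the netmiko library against an IOS-style device, with placeholder credentials (the optical NCS gear involved here is managed differently, so treat it as the general idea rather than OVH's setup):

    #!/usr/bin/env python3
    """Sketch: detect config drift on a router and re-apply the golden config."""
    import difflib
    from pathlib import Path

    from netmiko import ConnectHandler   # pip install netmiko

    DEVICE = {                      # placeholder device and credentials
        "device_type": "cisco_ios",
        "host": "edge1.example.net",
        "username": "automation",
        "password": "secret",
    }
    GOLDEN = Path("golden/edge1.cfg")   # the known-good configuration

    def drift(conn):
        running = conn.send_command("show running-config").splitlines()
        wanted = GOLDEN.read_text().splitlines()
        return list(difflib.unified_diff(running, wanted, lineterm=""))

    if __name__ == "__main__":
        conn = ConnectHandler(**DEVICE)
        delta = drift(conn)
        if delta:
            print("\n".join(delta))
            # Re-apply the golden config (or just page a human, if you prefer).
            conn.send_config_set(GOLDEN.read_text().splitlines())
        conn.disconnect()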


Not "all"!

Maybe their main DCs, or their largest, but not all of them. I have virtual servers in their Quebec DC (BHS) and they haven't gone down since the last time I rebooted them.


I have 30+ servers on OVH. All are online.


This happens to Internap almost weekly... I always wondered why they never make it in the news.




A website that I host on ovh is up: https://sudokugarden.de/

ovh.com looks down for me too.

You can check it's hosted by OVH:

$ whois $(dig sudokugarden.de +short)


It's in the Gravelines datacenter, which is not actually down contrary to what the initial reports said (only Strasbourg and Roubaix are, and for two different reasons).


Huh. I have services active on two dedicated machines from OVH in Canada, and I was logged into both via SSH all night, and didn't have any interruption at all.


Looks like only their routing / network was down. My servers just came back up and haven't experienced any power outage.


Same here - my server was merely unreachable, not down.


I just moved 2 apps to OVH, so this was totally unexpected. My apps have been unavailable for more than 7 hours.

Does this happen often with OVH?


My server hosted on OVH had some problems (DNS lookups) but has stayed up and works fine right now.

EDIT: Hosted in EU.


Now this status page is down as well. Sucks to be them right now =/ (I'm in Europe)


I'm trying to find an ETA for solving the issue, but they haven't posted one on Twitter.

Does anybody know the ETA?



They haven't posted any ETA yet.


They're coming back up


30 mins they say


My OVH servers in Canada and Australia are running fine.

My OVH servers in France are all inaccessible.


Seems to be back up. Quite a large disruption for OVH. Hope we get a postmortem


This has to be one of the least informative status pages I have ever seen.


Looks like they are starting to come back up. My VPS is accessible again.


I had never heard of this company til I saw this post. Shrugged, thought, "huh, wonder who that's affecting."

Opened up Age of Empires II....no connection. Go to website for game servers..."Our provider, OVH, is down...."

Go figure.


OVH is a giant hosting company in Europe, providing dedicated servers to quite a lot of companies. Half the internet here was borked. A translation service, a few developer tools, etc...

Sure, it's less dramatic than AWS going down, but it still hits, and hard.


OVH is rather popular in Europe at least (alongside Hetzner and 1&1).


The Canadian data center is also a very low cost way to serve the US. I'm an OVH fan. They have their quirks, but the pricing is great. You just make sure you compensate for their quirks with backups and DR plans.


I also like them from a privacy viewpoint.

My data is with me in Europe and the company that has my data is in Europe with me too. If I were using DO or AWS, my data might be in Europe but control over the data would be in the US, a free-for-all for the three-letter agencies given the lacking privacy laws there.


not all of them, I have some servers in Canada, working OK


Sydney is fine.


I imagine Mr Good Guy at OVH telling some others:

"guys we have a single point of failure in our architecture with SBG, maybe we should...

- naaah it's fine, we do not have time nor resources"

Then shit happens.

edit: I have no idea what is happening exactly, but OVH being what it is, it seems extremely weird that all datacenters "can" go down at the same time, and it looks like a serious architecture problem to me (or backup systems, like generators, not being correctly tested... whatever). I am really curious about the eventual explanation of what exactly happened.

edit2: Why all the downvotes? Even the status page of OVH is down; do not tell me that is good design. We are not here to be charitable, but realistic.


It's OVH: the hardware is good, the DDoS protection is good, the prices are high

but

support/administration does not work well. I have a lot of really weird stories with them, from plugging a keyboard into our server to reboot it (for no reason) to taking down a server for a requested maintenance, only to notice after 4 hours of downtime that they had not asked their bosses whether they were even allowed to perform the requested maintenance (and then not getting permission to do so after another 2 hours...)

For me it feels like there are some really deep issues somewhere in the whole administration that make incidents like this no real surprise

Problem is, most other providers don't work any better, so...

Everyone makes mistakes, let's just hope they learn from it.


> It's OVH: the hardware is good, the DDoS protection is good, the prices are high

The prices are high? Compared to what? Cheap is their raison d'être.


Their prices are pretty amazing for some of the configurations you can put together, especially since you can pay a large setup fee in exchange for a lower monthly rate.

Just the bandwidth alone would cost us 3x what we pay at OVH if we were with one of the big cloud providers.


Compared to other dedicated server hardware; I'm not talking about business cloud infrastructure, no idea about that.

Sorry if that caused confusion.


Who's cheaper on the dedicated side? Hetzner can be a bit cheaper than soyoustart, but I'm not aware of anyone else (with reasonable quality).


I've tried Hetzner but the network peering to Telekom is subpar compared to OVH. On Hetzner I got about 40Mbps up/down to my local computer, while on OVH I can easily load my DSL to 100% without issues.


OVH is one of the few with direct peering to Telekom; most don't want to pay for that. Otherwise Hetzner is good - their own network is limited but peering works reasonably well. But apart from them I'm not aware of anyone with cheaper dedicated prices than soyoustart.


I was considering their entire line... soyoustart, kimsufi, and ovh. There's not much cheaper than them. Hetzner in some cases, but not all.


Well, Hetzner for example.


> - naaah it's fine, we do not have time nor resources"

Yup, been there multiple times in smaller hosting companies.

It's basically how it goes. They don't get serious about outages until revenue is severely affected and the brand damaged, they don't get serious about security until there's been a big breach or sales are lost because of lack of certification.


Calling OVH a "smaller hosting company" (which you did do indirectly) is rather funny.


The person you're replying to was relating their experiences at hosting companies that are smaller than OVH. How does that imply that OVH is a smaller hosting company?


to me it's ambiguous; it could be either of

> in smaller hosting companies [like ovh]

> in smaller hosting companies [than ovh]


Unless they meant "smaller hosting companies [than OVH]"?


Yes, a comma would have made it slightly clearer:

> Yup, been there multiple times, in smaller hosting companies.

I find it funny that anyone would think I would or could refer to them as small (and get away with it)


This happens all the time. Every single thing that you see in software development happens in network engineering and data center engineering, except that, where in development senior people who write software can at least guesstimate complexities and present a marginally unified front against the unreasonable expectations of execs, that is pretty much never the case in neteng or dcops, as those who develop software cannot wrap their heads around the complexities of working with physical hardware.

It is rather counter-intuitive. In neteng and dcops, "I don't know and I cannot find out. I can only attempt to mitigate what I think might have caused it for next time" is a very reasonable answer to 99% of the "why did this happen?" questions, because in order to replicate the situation and test the theory, one needs to recreate the same problem again at the same scale.

This also means that certain things cannot be tested. Most generator tests are garbage - turning on a generator and running it without production load delivered over the transfer switch does not test anything other than that one can turn on a generator and run it. The problem typically is not that the generator (also: is there one generator, or a first and a second generator? Why is there no generator bank for a non-monkey-sized company?) does not start - the problem is that over time the transfer switch develops a fault, and unlike generators it is not possible to test a transfer switch in a way where, in the event of a test failure, the customers won't lose power - unless the data center is designed from the beginning to deliver A and B power over separate circuits to every single customer, and every single customer has per-system (not per-rack) transfer switches.

Of course it costs a lot more money, something that companies are reluctant to spend.


Any decent sr. network engineer or architect should be able to design you a network and explain the pros and cons, risks, and future scalability.

If someone doesn't know why a failure occurred and they can't find out, then they aren't looking hard enough.


I think I'm a pretty decent senior network engineer but I've been hit by firmware bugs more than once. No matter how much redundancy you build in, there's always shit that can go wrong and there will always be things that you just can't foresee (like the bug in this case). The software on these network devices and optical gear is written by humans, like all software, and is not perfect, like all software.

(In one case I experienced, the vendor reassured me that what I reported was not even technically possible -- until one of their engineers flew out and witnessed it firsthand.)


Please tell me about this fabulous senior network engineer and where I can obtain a dozen of them.


That seems a little uncharitable.


If the SBG issue really triggered the outage for the whole network I find it hard to believe that no one saw that problem beforehand. They probably thought that this was too unlikely to happen or that there are other failovers but never tested them properly.

No expert in the field, but that's the first time I can remember a provider of that size losing connection to most of their data centres at once. That can happen with one product (e.g. the S3 failure), but datacentre switches should work even if the rest is on fire.


> No expert in the field, but that's the first time I can remember a provider of that size losing connection to most of their data centres at once.

It happened to GCE last year[1], though it only lasted 18 minutes.

[1]: https://status.cloud.google.com/incident/compute/16007?post-...


But that's one product, not the whole DC. As I understand the post, other services worked correctly during that time.


In this case, I wouldn't be too hard on them. It appears they lost their main power line and the backup power line, both generators failed, and one generator has now been restarted.


So what? Losing main power is a standard case for any DC. That's why you have generators. Even a generator failure is nothing out of the ordinary. But no generators in a DC working kind of indicates that they don't test them as often as you would expect.

They just announced that they want to be a "hypercloud" provider on the scale of AWS and Google Cloud. I really hope that a power failure in Virginia couldn't bring down all of AWS.


When I've seen things like this before, it's often been the switchover hardware that fails, not the generator as such. It's much harder to test that, as you don't want to tell your customers "sorry your server went down, we were just testing if the switchover worked and it didn't".


I worked for a company doing mostly on-premises (wireless) carrier software years back but they still wanted our own server room to be 3 nines for reasons lost to me now. They had installed new electrical circuits high on the walls so that we could survive minor flooding incidents.

So our lead Ops guy is unplugging half the redundant power supplies and plugging them into the new circuits, but a few critical servers are still single-PSU, on UPS units. This is how he discovers that one of our UPSes has rotted and only has about five seconds of reserve power in it. Big outage, no bueno.


Exactly that, we too suffered a power loss at our DC that was due to faults in the power supervising and switch-over circuits; we test both generators weekly.


It doesn't matter how much testing you do – unfortunately some things can still fail.

I'm much more interested in why the failure cascaded to other data centres – that's exactly what shouldn't happen.


Didn’t AWS have a cascading failure last year because of a missing interlock on a script to take part of a data center offline? They discovered that it takes a lot longer than they remembered to reintroduce that many machines to the cluster. They came away from that incident with two or three action items.


The generators did work but they failed. Both of them.

I can not imagine they weren't tested.

But even the most rigorous testing can never reduce the total failure risk to 0. It seems OVH just got very very unlucky.


What is the point of backup generators if you do not verify that they work every so often? I have a very hard time believing that they actually tested that they worked, given that not one but both of them failed.


To be fair, people do usually test generators on a monthly schedule. The problem you find is that it's getting colder now, so any problems are suddenly amplified. They might have been fully tested a couple of weeks ago.


Generators in critical situations like this are usually provided with crankcase heaters, which keep the oil at temperature and make them easy to start. Also, any emergency power system should be (and usually is) tested monthly, if only to keep the fuel lines from going bad.


Data centers can go down while testing the generators too.

There was a DC in California a few years back that had three generators fail during a scheduled test and one of their server rooms had a blackout because of it.


This is a very good point. In all the companies I've worked in we've had more UPS outages than line outages as well.


If your generators are not reliable in the cold, the issue is the placement of them. This is something basic to account for.


Right, but that was obviously just an example.

Their generators are almost certainly tested frequently. But there could be any number of causes underlying the failure, and unfortunately sometimes failure does happen.


The problem is usually not "not reliable in the cold" but rather that the generators are X years old and the temperature in Europe is now changing from "mostly warm" to "warm during the day and ice-cold at night", finally aiming for "ice-cold all day", which means any exposed equipment will go through rather severe temperature swings.

While generators are usually able to handle this with sufficiently low failure risk, the risk is increased by the changing temperatures.


It wasn't even freezing overnight in Strasbourg [1]; I doubt that the cold could cause any effects at 5°C. Maybe they were run once a month but never tested under load? Two 20kV lines failing would have put them at near maximum load immediately - perhaps they weren't designed for that.

[1] https://www.accuweather.com/en/fr/strasbourg/131836/daily-we...


Improbable. Even in a mostly-normal office building, the monthly generator test consisted essentially of "cut the mains power and see that no impact is perceived as UPS, batteries, and autostarted generators bear the load, in their turn for ~2 hrs total."


Europe sees that cycle every year, and has for more than a century, so if they cannot handle that, they should be inside a heated room.


Yes and no. What the generator vendor says when they sell you the generator and what actually happens 5 years down the line are two different things.


I wonder if the generators or switching hardware had some kind of stupid IoT thing in them that required a network connection? That'd be one for:

https://twitter.com/internetofshit


What is SBG? A CDN of some sort?


It's the location of one of their data centers. SBG for Strasbourg.


It's a collection of datacenters in Strasbourg.


Ah, gotcha.

So, that one datacenter caused all other datacenters to die..?


It's supposed to be two separate incidents: power going down in Strasbourg, and fiber network equipment going down in Roubaix (the main center of OVH's network) due to a "software bug".

It's explained here https://twitter.com/olesovhcom/status/928587258583748609 in French, they might post an English-language translation soon.


Thanks!


It was a power outage. Their own generators did not work. This explains why everything is down, but there is surely something bad in their architecture.


A power outage in all locations? They claim to have "22 datacenters on 4 continents"…

EDIT: The title on HN is misleading; summary from their CEO here – https://twitter.com/olesovhcom/status/928592231807713280


To make error is human. To propagate error to all server in automatic way is #devops. - @devopsborat


just in case you followed the wrong devops borat, it's @DEVOPS_BORAT (the other one is a spam bot).





I was translating that to Hungarian so long ago


Borat is human, a spambot is devops.


Pfft network engineers have been at it way longer :p


Network failures are generally easier to recover from though. You just need to reconfigure the key nodes and everything is back in business.

If your CD pipeline deploys corrupt apps, even your production database can get compromised, forcing a full restore, no matter how long that might take.

Thankfully, I haven't had to witness that yet.


Title is misleading. Only RBX and SBG were affected.

06:15 UTC: SBG servers failed.

OVH network weathermap: http://weathermap.ovh.net

Btw. First post: https://news.ycombinator.com/item?id=15660524


My system in Lille has also been up without problems, at least since I got in at work.


Sorry, I didn't see the timestamps! You were first! :D


It would seem an ID of 15660524 would be submitted before 15660556, no?


Likely. But there's also a number of reasons why this may not be the case.


We're nerds, so how about checking the facts(?). Please explain to me if there has been some kind of time anomaly lately, or?

My submit timestamp: 07:21:25

Your submit timestamp: 07:28:23

About the title. I did consider the title for a while, because I wasn't sure how bad the situation was. But from my own independent monitoring system I did see that RBX and SBG servers were unavailable. Of course I also did some basic trouble shooting and confirmation work before posting.

Btw. Right now, there's some network traffic present on SBG network. Let's hope that the systems are soon up'n'running.


Sorry, I didn't know how to check the timestamps.

Yeah, I expected this to clear up in a matter of minutes.

Now it seems to be a shitstorm of historic proportions...


I'm waiting for the postmortem. I'm very curious to see what the root cause of all this mess was. As usual, there were probably several overlapping causes.


Who cares?


Someone with access might wish to update the title of this post, because all OVH datacenters are definitely not down.


Yes, very important. I have two dedicated machines in different racks in Canada that were serving requests all night. I was also logged into both machines via SSH all night with no interruption.


Exactly, the German datacenter is operational without any problems so far.


Thanks, we've updated the headline.


But no one knows which DCs and services are down. They lost their internal network and have no idea themselves.


I'm pretty sure they know exactly which services are down.

Even if they didn't, clearly many services are up and running normally, so saying "all datacenters are down" is just a lie.


Trending on Twitter with the hashtag #OVHGATE

https://twitter.com/hashtag/OVHGATE?src=hash


We selected their three data center EU region precisely because they were three separate data centers, so not happy. This is clearly bad design.

I think we're now going to have to look into multi-provider options. The only way to be solidly up is to be hosted by more than one company at more than one data center.

I've also heard stories of billing nightmares where you get locked out of a cloud provider account, so that's another thing.


>locked out of a cloud provider account

I guess this is already a reason on its own. It, among other problems, is what happens when we go from small "local" providers you can actually call to automated global providers that cannot provide immediate support even if they tried.


That's the problem with this trade-off. The small providers tend to have good support and someone you can reach if anything goes wrong, but they won't have experts on site 24/7. The large ones have dozens of them on call at all times but lack in support (unless you pay a lot).



