App Engine down

davidjgraph · on Oct 26, 2012

Before the doom and gloomers come out, this is the first time since leaving beta I can remember it happening.

We left AWS about 18 months ago after one of the outages and switched to GAE. I've counted 3-4 big downtimes for AWS compared to this one on GAE. That's still a good decision (for now)....

acdha · on Oct 26, 2012

One thing to remember: this took down all of app engine for at least an hour. AWS has had only 17 minutes of downtime affecting all of us-east this year (that network glitch a couple days after PyCon) - the rest of it has been a subset of the service amplified by people rediscovering that they weren't as redundant as they thought.

The correct less to draw is that any one point of infrastructure is a risk, so you need to scale wide. This is possible to do with AWS regions, or other providers - even internal bare iron if you're so inclined, but impossible to do with GAE because you're committed to a single-vendor API as well as their infrastructure.

stickfigure · on Oct 26, 2012

GAE applications are distributed across multiple data centers[1], so in theory you get "scale wide" automatically. Unfortunately it looks like there was some sort of flaw in the architecture. I believe this is the first systemwide failure of the HRD.

The real question is: Can you and your ops team build a "scale wide" system better than Google?[2] How much effort are you willing to put into it, when those development resources could be put into making features instead?

[1] For apps which use the high-replication datastore. Old (deprecated) master/slave apps are served out of a single datacenter.

[2] This is a reasonably serious question to ask. In GAE, you're one app among many, so a you-specific scaling solution might be easier and more robust than the generalized one Google builds. But not necessarily.

acdha · on Oct 26, 2012

I completely agree that Google is likely to do better than many teams, modulo your second point (which I completely agree with - generic is much harder than specific). For me it really just comes down to the lock-in aspect: with GAE if you decide that Google isn't taking the platform in the right direction for your business you're looking at something close to rewriting your application. This is far from the most likely outcome - although pricing can be interesting - but it's the kind of thing you really want to consciously acknowledge, similar to the way Netflix has accepted the risks of AWS by investing heavily in failure management tools.

latchkey · on Oct 26, 2012

I keep hearing the lock-in argument over and over and I'm not quite sure that there is a solid basis for it beyond a level of paranoia. It's fine to be paranoid, I just don't want it to hold me back unnecessarily.

Looking at things more closely, the only thing that you're truly locked into with GAE is the esoteric nature of the datastore. This isn't any worse than picking say MySQL vs. Oracle or Riak vs. Mongo. Most applications end up depending on some sort of specific functionality of each database that they are written for. While it would be difficult to migrate to another solution for storing your data, it wouldn't be impossible.

There is no way to predict what direction any product might take in the future. Look at the way that Oracle is treating MySQL now. Tons of vendor lockdown there.

The only reason to migrate away from GAE would be if you find out that your application doesn't work well on it (pricing, scalability, etc) or if Google decides to kill GAE entirely. Hopefully you do the analysis of your application before you decide to use GAE (ie: you can't blame GAE for you deciding to use it) and with a 3 year deprecation promise, I'm pretty confident that it will be around for a while longer.

riffraff · on Oct 26, 2012

> The only reason to migrate away from GAE would be if you find out that your application doesn't work well on it (pricing, scalability, etc) or if Google decides to kill GAE entirely. Hopefully you do the analysis of your application before you decide to use GAE (ie: you can't blame GAE for you deciding to use it)

This seem too simplistic. Off the top of my head, you may leave because:

* you want to do something new in your app that is not possible/effective to do within GAE * you find another option that is cheaper * you find that some of the assumptions and architectural decisions you took on day 0 no longer hold * you did the initial analysis, and it was wrong

I do agree with GAE being awesome and lock-in not being so bad, but I doubt the "you have it all figured out before building it so it's going to work forever" idea.

stickfigure · on Oct 27, 2012

FWIW, I frequently want to do things that are not possible/effective on GAE. And so I do it - several of my GAE apps communicate with services I set up in rackspace cloud (30-40 ms of latency away). It would probably be even lower running in Google's cloud service, but I haven't gotten an invite yet.

GAE is not "all or nothing". You can still run exotic services in other hosts. Or, for that matter, use GAE for specific services in your "other cloud" app. You get two bills. Not much of a downside.

MatthewPhillips · on Oct 26, 2012

> Looking at things more closely, the only thing that you're truly locked into with GAE is the esoteric nature of the datastore. This isn't any worse than picking say MySQL vs. Oracle or Riak vs. Mongo.

Sure it is. If you pick the wrong DB your only locked into that DB. You pick GAE and you're locked into GAE's DB.... and GAE. I can move my MySql db to another cloud provider.

marekmroz · on Oct 26, 2012

@latchkey I think you are missing the point the MatthewPhillips is making. With GAE's DB there is nowhere to move, so if you want to move you would have to migrate your data to a different DB engine. If your hosting/cloud provider uses something standard like MySQL you can find another provider or roll your own if you decide to migrate out.

latchkey · on Oct 26, 2012

Your analogy of moving to another cloud provider is incorrect. Unless you've been 100% database agnostic, you can't just migrate your schema and data from say MySQL to Postgres.

With the datastore on GAE, you can get your data out of it and move it into something like MongoDB. I'd argue that it would probably be less code to move to MongoDB because the datastore has all sorts of esoteric issues that you have to code around (like the way that entity groups and transactions are handled).

In terms of the rest of your application, it is just a standard webapp in whatever platform you choose. The .war file I have for GAE will run just fine in Tomcat. The only real lockin is the way you store your data.

sktrdie · on Oct 26, 2012

With the Java API you can write code that is standard (JDO/JPA for database stuff), that works across any Java environment. So it doesn't lock you in if you write your code in a smart way.

Nitramp · on Oct 29, 2012

Except that writing code using JDO/JPA is a nightmare, at least on Datastore. So that's almost certainly not the smart way. I'd rather rewrite my app three times - go with Objectify (simple, easy, maps well to Datastore concepts) and just deal with the rewrite if it happens.

rbanffy · on Oct 26, 2012

> you're committed to a single-vendor API as well as their infrastructure.

Not quite. You can roll out your own App Engine platform with AppScale:

http://appscale.cs.ucsb.edu/

Sadly, TyphoonAE (http://code.google.com/p/typhoonae/) seems abandoned.

tedroden · on Oct 26, 2012

It's one of the first times the whole service has been down, but parts of the service go down at least once a week. memcache, task queues are "elevated" with regularity and urlfetch is frequently down totally. ("elevated" generally means unusable).

Of course master/slave even has scheduled downtime.

lyddonb · on Oct 26, 2012

Yep. We've never seen anything like this before.

namank · on Oct 26, 2012

Same. Minor updates and issues pop up with GAE but I've never experienced an outage such as this in my two years of use.

Wonder what's happening.

mikegioia · on Oct 26, 2012

I have yet to experience downtime with RackspaceCloud and I've been using them for like 3 years.

dchest · on Oct 26, 2012

I had network issues with my two virtual machines on RackspaceCloud about a year or so ago (something was wrong with their routing). They were resolved quickly, though.

jmsduran · on Oct 26, 2012

Me too.

davis_m · on Oct 26, 2012

I think this is larger than just GAE.

http://internettrafficreport.com/namerica.htm

It seems like large portions of the internet are down.

EwanToo · on Oct 26, 2012

Internet Traffic Report, while a nice concept, is unfortunately very misleading.

Their sample size is extremely small, and most of those are permanently down.

Have a look through their list of north american routers and find one of them where packet loss has gotten worse as their main overall graph for packet loss would suggest - I've just been through them all and couldn't find one.

calinet6 · on Oct 26, 2012

Their baseline values are very misleading.

But their relative metrics can still be useful.

For example, it's very inaccurate to say that 51% of the internet is down.

But it's precise to say that packet loss among the working nodes has increased about 30% in the last 24 hours, and sharply.

EwanToo · on Oct 26, 2012

But that's likely 2 or 3 nodes, not a meaningful sample

obituary_latte · on Oct 26, 2012

Agreed. I was surprised to see only one[1] of the "down" routers with any packet loss over the last 24 hours.

That said, it's still interesting how the overall traffic trends so sharply downwards. I wonder if they have more data than they are showing in the graphs.

[1]http://internettrafficreport.com/history/190.htm

jre · on Oct 26, 2012

It seems like the individual router graphs most recent update is 10/26 00:00 whereas the overall graph has been updated at 10/26 08:30.

untog · on Oct 26, 2012

Yes, Tumblr is experiencing difficulties and AFAIK they don't run anything on App Engine.

neanderdog · on Oct 26, 2012

May not be anything or may be..

I noticed a couple of days ago that some of our dns entries were mysteriously removed from level 3 servers which out of old habit are used for resolution (some of the ip's go back to uu-net/worldcom/mci)

Now the interesting bit is they were for private subnet ip's. They're working fine everywhere else.

Today the last of their dns servers removed the entries so I had users go to google (8.8.8.8) and all's well with our apps.

Level 3's entries for our external stuff is there, just the private subnet stuff is removed.

If others do this too and resolve with level 3....

edit: just found this: http://tracker.outages.org/reports/view/59

X-Istence · on Oct 26, 2012

And here we have another website that does something similar:

http://www.internetpulse.net

And according to them all routes are up and running just fine,not only that bit ping times aren't elevated.

j_baker · on Oct 26, 2012

My money is on Hurricane Sandy: http://www.nhc.noaa.gov/refresh/graphics_at3+shtml/084338.sh...

sparkinson · on Oct 26, 2012

Now those are some interesting graphs.

FiloSottile · on Oct 26, 2012

The internet is burning :O Well, seriously what could be a root cause that affects so many nodes?

bsaul · on Oct 26, 2012

only 51% of north american internet is working ?? Am i reading this correctly ?

EwanToo · on Oct 26, 2012

You're reading it correctly, it's just telling you the wrong thing.

51% of a very small self-selecting set of machines are down, mostly small businesses or home users which have obviously been shutdown or renamed.

If internettrafficreport automatically removed all devices which returned no results for 7 days, they'd be a slightly more useful resource.

seldo · on Oct 26, 2012

As Ewantoo mentioned, ITR is pretty unreliable. We have found it works as a relative measure -- when their packet loss rises, we do indeed see consumer traffic fall, though usually not by the amount their graphs would indicate.

aidos · on Oct 26, 2012

Add Dropbox to that list.

cfontes · on Oct 26, 2012

Great find, I did not know this site.

fidotron · on Oct 26, 2012

It's time we remembered the whole strength of the internet was that it was distributed and we avoided introducing single points of failure. We have ended up using vast amounts of infrastructure for no reason other than developer convenience (often with respect to security), when having local direct connections is often more suitable than shooting everything into the cloud.

magicalist · on Oct 26, 2012

Huh? Historically, almost all content on the internet has had a single point of failure. Moving "into the cloud" is moving to a distributed model, and generally app engine will protect you from single points of failure. As others have noted, if you distribute your own self-managed servers across the globe, you end up with a very similar system, but now you have to manage it.

What appears to be happening here is that you're still vulnerable because the C&C infrastructure ultimately has a single command source, and so can be vulnerable if some code is pushed that affects the whole system. Your homegrown cloud will suffer from the same vulnerability, it may just be more or less easy to manage depending on how specialized your needs are compared to the more general requirements of GAE.

Edit: actually, maybe I missed what you're advocating with "local direct connections". This might make more sense from a user's perspective: if everyone ran their own little cloud, a failure may bring down reddit, but not reddit, heroku, pinterest, etc simultaneously. That's actually an interesting point, but I'm not sure if it really matters if they sync up their downtimes (since they would still have downtime, and maybe more or less of it depending on how much they could afford to invest into managing their distributed solution). I'm also not sure if that really solves the problem, since there are other concentrations in the network, just less visible ones (there are a fairly small number of major datacenters around the world, for instance, and managing your own colocated server doesn't matter if the whole building goes dark).

I do agree that at the very least we need to maintain an ecosystem of "cloud" providers, however.

philwelch · on Oct 27, 2012

It's distributed for you, but with an oligopoly of cloud providers it's considerably less distributed overall.

mhurron · on Oct 26, 2012

The resiliency of the internet has nothing to do with the uptime of any service running at endpoints. The resiliency is about the resiliency of the network, a router goes down, there are paths around it, not necessarily paths to every endpoint hanging off it or the services they offer.

digeridoo · on Oct 26, 2012

We shifted from services that fail often independently, to services failing rarely all at once. Clearly the former is going to be more noticeable and has greater societal impact, but as a business I'll take the latter any day.

lurker14 · on Oct 26, 2012

Which is better? Having a day of downtime each year, or not launching at all?

0xbadcafebee · on Oct 26, 2012

Which is better? Using a fallacious comparison to suggest cloud computing is the only viable option, or comparing the pros and cons of different computing models to choose the best one for you?

davidkatz · on Oct 26, 2012

While the argument was perhaps coming on a bit too strong, it's hard to deny the ease of deployment on cloud services. It's probably a safe bet to say that for most early stage startups the cloud is a good move.

0xbadcafebee · on Oct 26, 2012

I'm at a loss in these discussions. I don't understand this developer-point-of-view.

Can you specifically give me examples of why using a cloud provider is better for a startup than, for example, using a couple desktops in your garage?

You can't say it's because of backups because the cloud doesn't provide a backup (unless you purchase an extra data backup solution with your cloud provider?). And correct me if i'm wrong, but you still have to set up your development environment on your local computer to write the code, install libraries to test with, etc.

What exactly are the steps involved in "deploying" that you couldn't do on your laptop, or a VPS?

dpritchett · on Oct 26, 2012

The Germans referred to a Schwerpunkt (focal point and also known as Schwerpunktprinzip or concentration principle) in the planning of operations; it was a center of gravity or point of maximum effort, where a decisive action could be achieved. Ground, mechanised and tactical air forces were concentrated at this point of maximum effort whenever possible. By local success at the Schwerpunkt, a small force achieved a breakthrough and gained advantages by fighting in the enemy's rear. It is summarized by Guderian as “Klotzen, nicht kleckern!” (literally "boulders, not blots" and means "act powerfully, not superficially"). [1]

For a product prototype, the initial primary goal is "get it online so that we can start validating our assumptions". System administration skills and in-house server administration teams are valuable but not necessary.

[1] http://en.wikipedia.org/wiki/Blitzkrieg#Schwerpunkt

gtaylor · on Oct 26, 2012

This is a much more succinct explanation than my own. This is the point I was trying to make.

Even given that I have a good bit of sysadmin skills, I am needed more as a software developer right now in the early goings. I expect, as you've pointed out, that priorities will change with time and growth. We may even move to bare metal eventually, if we find ourselves needing and able to do so.

Excellent analogy.

0xbadcafebee · on Oct 26, 2012

So to summarize you both: You want a rapid development platform that doubles as a production system and costs nothing to maintain. That does sound useful!

gtaylor · on Oct 26, 2012

What? Did you read any of what was written? Point-by-point breakdown for you:

> You want a rapid development platform

No, we want low-maintenance infrastructure.

> that doubles as a production system

It is a production system. It does successfully serve many thousands of users for us every day. We've yet to have an outage that wasn't our own fault.

> and costs nothing to maintain

What? I specifically said we're willing to pay more not to have to spend as much time on infrastructure.

It's OK if you're too set in your ways to even attempt to level with alternative points of view, but at least try to read a little more thoroughly. And maybe admit that you're not willing to budge, so nobody wastes time trying to explain an alternative point of view.

0xbadcafebee · on Oct 28, 2012

I wasn't dismissing your point of view. Heroku is what I described. AWS, to a certain extent, is what I described. And I said it was a production system, too; you didn't have to over-emphasize what I had already said, as if I didn't just say it. "Costs nothing to maintain" is in comparison to paying to maintain it yourself. But thanks for knee-jerking.

gtaylor · on Oct 26, 2012

> Can you specifically give me examples of why using a cloud provider is better for a startup than, for example, using a couple desktops in your garage?

For us (a small three-man team), the full portfolio of AWS services lets us shove responsibility for some of our infrastructure off to Amazon, which saves us loads of time (EC2, ELB, S3, CloudFront, SQS, Route53, SES, and some light DynamoDB in our case). Even with their repeat EBS issues, we've engineered around the common failure points, and do so with a tiny number of self-managed VMs relative to our traffic. Even if we were to go down every few months, our hosted services do better than a one-man devops team could ever do on his own in his garage, or even in a co-lo. Though AWS is not expensive, we gladly pay up in order to focus on our own software. Sure beats hiring another person, in our case.

The common counter-argument is that if we managed it ourselves, that we'd have the ability to resolve outages on our own, because it'd be our responsibility. That is just the thing, we don't want it to be our responsibility, we've got one ops guy (me), and we need to be iterating on our product fast at this point. We'll instead plan for failure, design our architecture to continue operating under some degree of failure, and keep shipping.

Not to say that the "cloud" is a silver bullet. It's not. However, especially from the developer point of view, it lets our entire (tiny) team stay more focused on product development.

Addendum: I've managed servers in my basement, in co-lo, and in the "cloud". Each of these routes is more (or less) appropriate in various cases, but AWS has been a boon to our particular usage case.

robin_reala · on Oct 26, 2012

For what it’s worth, Heroku’s PG Backups addon is free of charge.

https://addons.heroku.com/pgbackups

bgentry · on Oct 26, 2012

We also perform our own backups regardless of PG Backups and use Continuous Protection to protect recent data, even in the event of irrecoverable volume failure: https://devcenter.heroku.com/articles/heroku-postgres-docume...

fidotron · on Oct 26, 2012

It's not just the cloud itself, but your connection into it. I'm more concerned that my connection to the internet is lost so many times a day (traveling underground etc.) and that this is enough to prevent many devices in my vicinity from acting in a coherent way. They should be able to synchronize among themselves without the central intermediary.

Obviously this doesn't happen because it's hard, but also companies have a vested interest in piping all sorts of data through them for analytics purposes. This is not in the interest of the users at all.

prodigal_erik · on Oct 26, 2012

That depends on how much long-term reputation damage you take from having availability that low before a large audience. Which in turn depends on how important your service really is—"can't play my game" is drastically different than "my landlord didn't receive my payment".

minm · on Oct 26, 2012

Point well made. We operate our tonido relay server across the world using linode, softlayer, herztner and it currently supports couple of million devices and half a million users.In the last 2 years our downtime is less than 30 minutes. It is nil for end users since they migrate to the nearest relay server hosted offered by a different provider.

SPOF,security and control are the major issues with Iaas and pass offerings.

SPOF, security and control are the major problems for the Iaas and pass.

chrisfarms · on Oct 26, 2012

> "App Engine is currently experiencing serving issues. The team is actively working on restoring the service to full strength. Please follow this thread for updates."

-- Max Ross (Google) maxr@google.com via googlegroups.com

https://groups.google.com/forum/?fromgroups=#!topic/google-a...

daave · on Oct 26, 2012

And they've sent the all-clear:

At this point, we have stabilized service to App Engine applications. App Engine is now successfully serving at our normal daily traffic level, and we are closely monitoring the situation and working to prevent recurrence of this incident.

This morning around 7:30AM US/Pacific time, a large percentage of App Engine’s load balancing infrastructure began failing. As the system recovered, individual jobs became overloaded with backed-up traffic, resulting in cascading failures. Affected applications experienced increased latencies and error rates. Once we confirmed this cycle, we temporarily shut down all traffic and then slowly ramped it back up to avoid overloading the load balancing infrastructure as it recovered. This restored normal serving behavior for all applications.

We’ll be posting a more detailed analysis of this incident once we have fully investigated and analyzed the root cause.

Regards,

Christina Ilvento on behalf of the Google App Engine Team

https://groups.google.com/forum/#!topic/google-appengine-dow...

abhijitr · on Oct 26, 2012

Meanwhile... Gmail etc are working quite fine. So the claim that if you build on GAE you "take advantage of the same infrastructure used for Google services!!" starts to ring a bit hollow.

Ironlink · on Oct 26, 2012

Or perhaps they are older than the affected infrastructure and just haven't been rewritten. Or perhaps they happen to not use the affected resource. In fact, affected services include developers.google.com and code.google.com, and to a lesser extent youtube.com.

hugoc · on Oct 26, 2012

Well http://developer.android.com/ is down, so google does use GAE for some of its properties?

adolph · on Oct 26, 2012

Maybe they mean == instead of ===.

Tobu · on Oct 26, 2012

I know the claim of running Amazon.com on AWS was initially bogus, but did anyone notice trouble on that site with the recent AWS outages? It would be instructive to know how much they have integrated it since.

rytis · on Oct 26, 2012

I wonder if anyone ever believed this claim to be true...

wildmXranat · on Oct 26, 2012

Or if the definition of the word 'same' is somewhat fluid enough to get away with it

cyberpanther · on Oct 26, 2012

I'm seeing a bunch of Google properties also. Maybe they are running on app engine? Like https://developers.google.com/

libria · on Oct 26, 2012

Also,

http://dartlang.org

http://golang.org,

http://code.google.com/codejam/

http://www.chromeexperiments.com/

Gotta give 'em props for dogfooding.

moepstar · on Oct 26, 2012

Props, really?

That's what i _expect_ them to do - otherwise i can't see anyone trusting/using them if even they themselves avoid their own product(s)..

Sami_Lehtinen · on Oct 26, 2012

http://sites.google.com/ was down too. I got monitoring report, but I first assumed it's false. Because false down reports are much more common than Google services being actually down.

cilvento · on Oct 26, 2012

At about 7:30am US/Pacific time this morning, Google began experiencing slow performance and dropped connections from one of the components of App Engine. Many App Engine applications are experiencing slow responses and an inability to connect to services. We currently show that a majority of App Engine users and services are affected. We are actively working on restoring service as quickly as possible.

We are posting regular updates to our downtime-notify list here: https://groups.google.com/forum/?fromgroups=#!topic/google-a...

Thanks, Christina, Google App Engine Product Manager

kjhughes · on Oct 26, 2012

What's the earliest sign of trouble you've had?

Pingdom reports my GAE-hosted site has been down since 2012-10-26 10:37:38 EST, a bit over an hour now.

UPDATE: My site is back. Delayed report from Pingdom says site came back online after 50 minutes. Performance is sketchy still. We're probably not in the clear yet.

At least we can now get to the status dash:

http://code.google.com/status/appengine

warrentr · on Oct 26, 2012

I think the first tweet i saw was 7:35 PST

jsdalton · on Oct 26, 2012

It's really quite remarkable (to be honest, inexcusable is probably a better word) that their status page is failing as well. My expectations for a company with Google's resources and infrastructure are a lot higher than that.

Nothing on their Twitter account either: https://twitter.com/app_engine

A poor handling of a systems failure in my opinion.

rdwallis · on Oct 26, 2012

If you subscribe to:

https://groups.google.com/forum/?hl=en&fromgroups#!forum...

They'll email you when issues occur and info becomes available.

It took about 30 minutes after the crash for me to receive an email which seems very reasonable.

IheartApplesDix · on Oct 26, 2012

If you find that reasonable, I have a slice in Brooklyn to sell you.

Yoms · on Oct 26, 2012

Latest update:

"At approximately 7:30am Pacific time this morning, Google began experiencing slow performance and dropped connections from one of the components of App Engine. The symptoms that service users would experience include slow response and an inability to connect to services. We currently show that a majority of App Engine users and services are affected. Google engineering teams are investigating a number of options for restoring service as quickly as possible, and we will provide another update as information changes, or within 60 minutes."

https://groups.google.com/forum/?fromgroups=#!topic/google-a...

brutuscat · on Oct 26, 2012

Status reports from the mailing list https://groups.google.com/d/topic/google-appengine-downtime-...

debacle · on Oct 26, 2012

I'm really happy I don't host in the cloud. How quickly are the cost savings of cloud computing obliterated by PR, customer service, and system administration time when an outage like this occurs?

tomgallard · on Oct 26, 2012

Surely hosting yourself exposes you to just as much, if not more risk? Problem in the datacentre where you're co-lo'd, or one of your servers blows up?

I think people not trusting the cloud is similar to how people feel safer driving their cars then taking a plane. The stats say the plane's safer, but people prefer being in control. People like the idea of being in control of their servers, even if that means there's hundreds of extra things that can go wrong compared to a cloud provider.

We also get a lot more publicity when a cloud provider has an outage as LOTS of sites go down at once. Hardly anyone notices when service X who self-host go down for a few hours...

Alan01252 · on Oct 26, 2012

I agree with you. From my past experience any data center is subject to risks. I've witnessed:

  Power failures.
  Cross site links being cut due to engineering works.
  Over heating due to air conditioning failures.
  Flooding

And I've experienced all the above from a very large, very well known, very expensive data center company based in London.

acdha · on Oct 26, 2012

This is true of every data center I've worked with. Also network providers: everyone has downtime and sometimes you learn the hard way that despite being written into your contract someone took the cheap way and ran your “redundant” fiber through the same conduit which a backhoe just tore up.

debacle · on Oct 26, 2012

We're coloed across three datacenters spanning the US (one might be in TO I think) and if a datacenter were to go down, we have a hot backup that's no more than 12 hours stale.

The only real manual maintenance that we've got is a rolling reimaging of servers based on whatever's in version control, which usually takes a few hours twice a year, but we'd probably do that if we were in the cloud anyway.

When you can script away 90% of your system administration tasks, hosting in the cloud doesn't really make a ton of sense.

tomgallard · on Oct 26, 2012

But a DNS based failover is still going to take an hour or so to propagate right (given that a lot of browsers/proxies/DNS servers don't respect TTL very well at all)? And then you end up with a system with stale data, and the mess of trying to reconcile it when your other system comes back up.

I'd take an hour long Appengine outage once a year over that anytime!

0xbadcafebee · on Oct 26, 2012

Your name server or stub resolver is what respects DNS TTL, not your browser or proxy. Everyone - including people hosting on AWS - needs to be able to fail over DNS, if the AWS IP you're using is in a zone that just went down, for example.

Any time you have an outage you need to contact your service provider to get an estimate of downtime. If they can't give you one, assume it'll take forever and cut the DNS over. The worst case is some of your users will start to come back online slowly. If you don't cut over, the worst case is all your users are down until whenever the service provider fixes it, and you get to tell your users "we're waiting for someone else to deal with it", which won't make them very happy.

12 hour stale data sounds kind of long to me. 4 hours sounds more reasonable.

codeka · on Oct 27, 2012

I've seen plenty of crappy ISP DNS servers ignore TTL values and cache DNS entries for many hours longer than they're supposed to. Unfortunately, it's all too common.

stickfigure · on Oct 26, 2012

When you can script away 90% of your system administration tasks, hosting in the cloud doesn't really make a ton of sense.

How big is your ops team? I'm guessing it's more than 0.

debacle · on Oct 26, 2012

Ops team? We're a two man operation with occasional contractors.

stickfigure · on Oct 26, 2012

In that case, what is the ratio of "time spent doing ops-related tasks" vs "time spent developing new features" in your company? Please offer an honest evaluation. Everything has a cost; I'm genuinely curious about data points other than my own.

debacle · on Oct 26, 2012

I probably spend no more than an hour a week on ops, and most of that is reading emails from our service providers.

stickfigure · on Oct 26, 2012

Today, maybe, assuming a calm ocean and no scaling issues. But I don't believe you spent an hour a week setting up your three data centers, backups, failover procedure, etc.

debacle · on Oct 26, 2012

The backup script was written in a night, and the most complicated part about failover is remembering to sync data when the outage is over.

0xbadcafebee · on Oct 26, 2012

The "safety" of the cloud is about two things: 1. trusting your service provider, and 2. redundancy.

You have to trust your cloud provider. They control everything you do. If their security isn't bulletproof, you're screwed. If their SAN's firmware isn't upgraded properly to deal with some performance issue, you're screwed. If their developers fuck up the API and you can't modify your instances, you're screwed. You have to put complete faith in a secret infrastructure run for hundreds of thousands of clients so there's no customer relationship to speak of.

That's just the "trust" issue. Then there's the issue of actual redundancy. It's completely possible to have a network-wide outage for a cloud provider. There will be no redundancy, because their entire system is built to be in unison; one change affects everything.

Running it yourself means you know how secure it is, how robust the procedures are, and you can build in real redundancy and real disaster recovery. Do people build themselves bulletproof services like this? Usually not. But if you cared to, you could.

modoc · on Oct 26, 2012

It really depends on how many servers and how good your sys admins are. If you have 1 server in a cheap colo, then yes, the cloud is probably better. If you have a GOOD hosting provider, and build out a redundant at every tier cluster, you can easily beat the major cloud providers' uptime.

We run about 11 client clusters across ~250 servers across 3 data centers in the US and Europe. Each of our client's uptime is very very close to 100%, and we've NEVER lost everything, even for 1 second.

bsaul · on Oct 26, 2012

It's funny, i've got the exact opposite reasoning : it's those moments where i can really appreciate the fact that i'm using the cloud : 1/ I don't have to spend the night debugging or replacing broken hardware 2/ It doesn't cost me any time, any additional resource, any support upgrade, any hardware. 3/ No one can blame me or anybody in my team for the fact that it's not working.

I don't feel like i'm lacking control, i feel like somebody else is taking care of that really annoying shit that happens all the time, no matter how well you design your system.

modoc · on Oct 26, 2012

If you have paying clients, they WILL blame you and your team for the fact it's not working. They don't care who/what the underlying infrastructure is.

Also a good hosting company will handle identifying/fixing/replacing bad hardware for you.

calinet6 · on Oct 26, 2012

Not to mention sufficient redundancy will ensure that you never see many effects from those hardware failures/power outages/floods/fires/anything else with any reasonable probability.

"The cloud is great because I can blame someone else" is obviously a tenuous argument.

lurker14 · on Oct 26, 2012

You customers will not be mollified if you tell them that the service you sold is down due to a subcontractor failure.

untog · on Oct 26, 2012

I think there is an interesting debate to be had here. When your site goes down at the same time as Tumblr/Reddit/one of the big guys, the damage might not actually be that high. In the minds of many, "the internet" is probably quite broken.

debacle · on Oct 26, 2012

Depends on the site. Most of the traffic we cater to is not referrer driven.

ceejayoz · on Oct 26, 2012

My colocation provider (Frontier, telecom in 27 states) went down for an hour and a half last night. It's hardly unique to the cloud.

hartleybrody · on Oct 26, 2012

I don't think you move to the cloud for cost savings. More like time and headache savings.

foolery · on Oct 26, 2012

Yup, all of the HRD apps are down. But the M/S apps are working.

jis · on Oct 26, 2012

Well, my one M/S app is also down.

aderaynal · on Oct 26, 2012

I have 2 M/S apps and 3 HRD. They are all down :(

cfontes · on Oct 26, 2012

Mine is also down... :( any one got any position about it ?

vitorarins · on Oct 26, 2012

my one M/S app is up and running.

aidos · on Oct 26, 2012

Dropbox [0] is showing a 500 for me to. I've very confused as to what has just happened to the internet...

[0] https://www.dropbox.com/

tomnewton · on Oct 26, 2012

My Google contact said that 'SRE are all over it. Hope to have more details soon.' but that was about 30 minutes ago.

Does tumblr.com use app engine? They're down...

tszming · on Oct 26, 2012

@tumblr (https://twitter.com/tumblr/status/261840787350896640)

Tumblr is experiencing network problems following an issue with one of our uplink providers. We will return to full service shortly.

libria · on Oct 26, 2012

Hm, bad week for the Cloud. Can't even get to the status page; hopefully it's not hosted on App Engine.

So going forward, what's the best way to protect against cloud downtime? Have a hot/standby failover with a different provider? Prepare customers' expectations for the possibility of server outages? Do a ton of research, pay $$$ for lots of nines uptime, and lambast the host when they don't deliver?

bad_user · on Oct 26, 2012

Downtimes happen regardless, unless you have a lot of money and talent to spend on your own infrastructure and even then it's hard to beat cloud providers like Amazon, or Google, who have more resources and knowledge than you do.

The greatest thing about cloud-hosting is that you can just sit by and let them fix it. It usually takes about half an hour, or a couple of hours if the outage is severe, but usually less than the time it takes for an update of DNS records (unless you've got some proxy in front of your IPs, which would be another point of failure).

And then, even with these severe outages, the overall monthly uptime is still better than %99.9 and it's really hard to beat that, so just relax and let them fix it.

acdha · on Oct 26, 2012

There's no such thing as "cloud downtime" - it's still servers, data centers, networks, same as everything else.

You need to decide how much uptime you're willing to pay for, how much your service can degrade for how long, and methodically address each level of the hierarchy between you and your customers – and it might be the case that you decide that the ongoing costs of your engineering support for e.g. wide geographic separation just aren't sustainable at the level your customers are willing to pay, particularly if you have something like a CDN helping keep your site partially responsive during less than catastrophic failures.

davidjgraph · on Oct 26, 2012

I'd say the answer depends on how fast GAE recovers. If you're building redundancy over multiple clouds, if there's a lot of data:

1) It's very complex and expensive 2) You're looking at DNS to hot failover, in most cases.

If GAE can recover in less than 30 minutes and sticks to, say, one outage a year, you just can't justify the kind of cost you're looking at with 2 (seriously, it's a lot of cash).

josh2600 · on Oct 26, 2012

Build redundancy into your software to deal with single provider failure.

Achshar · on Oct 26, 2012

That is not always a feasible option, specially for young projects with limited capital.

bsaul · on Oct 26, 2012

I would love it so much to see people at google showing all the internal tools they're using to detect and solve this kind of issues. I can only imagine a war room with screens all over the place showing gigantic amount of red flashing lines :) Hope it doesn't last for long though, i was just praising what a good choice app engine has been so far 10 minutes ago...

hugofierro · on Oct 26, 2012

I hope it's not due to DiRT Exercises (SRE Disaster Recovery Test). Looking forward to reading the post-mortem report!

notreadbyhumans · on Oct 26, 2012

It's a bit nuts that they're hosting the status pages on the same infrastructure.

albumedia · on Oct 26, 2012

GAE has been very good since I started using it. They entire internet seems to be slow...even HN

jeremi23 · on Oct 26, 2012

Now even the Google AppEngine status page is down.

vanwilder77 · on Oct 26, 2012

The Google app engine page down as well!

0xbadcafebee · on Oct 26, 2012

Here's a zen koan for you:

If every website on the internet is hosted in the cloud, and the cloud goes down, is there an internet?

cfontes · on Oct 26, 2012

things are starting to run again...

My site is back up :D SLOW but up.

ams6110 · on Oct 26, 2012

passpack.com seems to be affected.

singingwolfboy · on Oct 26, 2012

www.howfuckedisappengine.com

on Oct 26, 2012

[deleted]

whalesalad · on Oct 26, 2012

Whoa did not expect to receive such negative feedback on this post. Was really looking for feedback from the community on a replacement. I'll post this elsewhere I guess.

tomgallard · on Oct 26, 2012

I think the best place would be in an 'Ask HN' thread. I think the community felt your post was off topic- and that's why it was downvoted

0xbadcafebee · on Oct 26, 2012

I can not bear with a wall of text about a note-taking app. Could you pastebin that?