Before the doom-and-gloomers come out: this is the first time since GAE left beta that I can remember this happening.
We left AWS about 18 months ago after one of the outages and switched to GAE. I've counted 3-4 big downtimes for AWS compared to this one on GAE. That's still a good decision (for now)....
One thing to remember: this took down all of App Engine for at least an hour. AWS has had only 17 minutes of downtime affecting all of us-east this year (that network glitch a couple days after PyCon) - the rest has affected a subset of services, amplified by people rediscovering that they weren't as redundant as they thought.
The correct lesson to draw is that any one piece of infrastructure is a risk, so you need to scale wide. This is possible to do with AWS regions, or other providers, or even internal bare iron if you're so inclined, but impossible to do with GAE because you're committed to a single-vendor API as well as their infrastructure.
GAE applications are distributed across multiple data centers[1], so in theory you get "scale wide" automatically. Unfortunately it looks like there was some sort of flaw in the architecture. I believe this is the first systemwide failure of the HRD.
The real question is: Can you and your ops team build a "scale wide" system better than Google?[2] How much effort are you willing to put into it, when those development resources could be put into making features instead?
[1] For apps which use the high-replication datastore. Old (deprecated) master/slave apps are served out of a single datacenter.
[2] This is a reasonably serious question to ask. In GAE, you're one app among many, so a you-specific scaling solution might be easier and more robust than the generalized one Google builds. But not necessarily.
I agree that Google is likely to do better than many teams, modulo your second point (and I agree with that one too: generic is much harder than specific). For me it really comes down to the lock-in aspect: with GAE, if you decide that Google isn't taking the platform in the right direction for your business, you're looking at something close to rewriting your application. That's far from the most likely outcome (although pricing can be interesting), but it's the kind of risk you really want to consciously acknowledge, similar to the way Netflix has accepted the risks of AWS by investing heavily in failure-management tools.
I keep hearing the lock-in argument over and over and I'm not quite sure that there is a solid basis for it beyond a level of paranoia. It's fine to be paranoid, I just don't want it to hold me back unnecessarily.
Looking at things more closely, the only thing that you're truly locked into with GAE is the esoteric nature of the datastore. This isn't any worse than picking say MySQL vs. Oracle or Riak vs. Mongo. Most applications end up depending on functionality specific to whichever database they were written for. While it would be difficult to migrate your data to another solution, it wouldn't be impossible.
There is no way to predict what direction any product might take in the future. Look at the way Oracle is treating MySQL now; tons of vendor lock-in there.
The only reason to migrate away from GAE would be if you find out that your application doesn't work well on it (pricing, scalability, etc.) or if Google decides to kill GAE entirely. Hopefully you do the analysis of your application before you decide to use GAE (i.e., you can't blame GAE for your decision to use it), and with a 3-year deprecation promise, I'm pretty confident that it will be around for a while longer.
> The only reason to migrate away from GAE would be if you find out that your application doesn't work well on it (pricing, scalability, etc.) or if Google decides to kill GAE entirely. Hopefully you do the analysis of your application before you decide to use GAE (i.e., you can't blame GAE for your decision to use it)
This seems too simplistic. Off the top of my head, you might leave because:
* you want to do something new in your app that is not possible/effective to do within GAE
* you find another option that is cheaper
* you find that some of the assumptions and architectural decisions you took on day 0 no longer hold
* you did the initial analysis, and it was wrong
I do agree with GAE being awesome and lock-in not being so bad, but I doubt the "you have it all figured out before building it so it's going to work forever" idea.
FWIW, I frequently want to do things that are not possible/effective on GAE. And so I do: several of my GAE apps communicate with services I set up in Rackspace Cloud (30-40 ms of latency away). It would probably be even lower running in Google's cloud service, but I haven't gotten an invite yet.
GAE is not "all or nothing". You can still run exotic services in other hosts. Or, for that matter, use GAE for specific services in your "other cloud" app. You get two bills. Not much of a downside.
> Looking at things more closely, the only thing that you're truly locked into with GAE is the esoteric nature of the datastore. This isn't any worse than picking say MySQL vs. Oracle or Riak vs. Mongo.
Sure it is. If you pick the wrong DB, you're only locked into that DB. You pick GAE and you're locked into GAE's DB... and GAE. I can move my MySQL db to another cloud provider.
@latchkey I think you are missing the point that MatthewPhillips is making. With GAE's DB there is nowhere to move, so if you want to move you would have to migrate your data to a different DB engine. If your hosting/cloud provider uses something standard like MySQL, you can find another provider or roll your own if you decide to migrate out.
Your analogy of moving to another cloud provider is incorrect. Unless you've been 100% database agnostic, you can't just migrate your schema and data from say MySQL to Postgres.
With the datastore on GAE, you can get your data out of it and move it into something like MongoDB. I'd argue that it would probably be less code to move to MongoDB because the datastore has all sorts of esoteric issues that you have to code around (like the way that entity groups and transactions are handled).
In terms of the rest of your application, it is just a standard webapp in whatever platform you choose. The .war file I have for GAE will run just fine in Tomcat. The only real lockin is the way you store your data.
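For what it's worth, the "get your data out" part really is mechanical. A rough sketch (not anything from the parent's app; "Order" is a hypothetical kind name) that walks every entity of one kind with the low-level Datastore API and dumps its properties for import into MongoDB or anything else:

    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;
    import com.google.appengine.api.datastore.Query;
    import java.util.Map;

    public class DatastoreExport {
        public static void dumpKind(String kind) {
            DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
            Query q = new Query(kind); // e.g. dumpKind("Order")
            for (Entity e : ds.prepare(q).asIterable()) {
                Map<String, Object> props = e.getProperties();
                // Serialize the key plus properties however the target store expects
                // (JSON lines, CSV, direct Mongo inserts, ...).
                System.out.println(e.getKey() + " -> " + props);
            }
        }
    }

The hard part of a migration isn't the export loop, it's rewriting the code that leaned on entity groups and Datastore transactions, which is the parent's point.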
With the Java API you can write standard code (JDO/JPA for the database stuff) that works across any Java environment. So it doesn't lock you in if you write your code in a smart way.
Except that writing code using JDO/JPA is a nightmare, at least on Datastore. So that's almost certainly not the smart way. I'd rather rewrite my app three times - go with Objectify (simple, easy, maps well to Datastore concepts) and just deal with the rewrite if it happens.
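For anyone who hasn't used it, Objectify really is a thin layer over the Datastore. A minimal sketch in roughly the Objectify 4 fluent style (entity and field names made up; a real app registers entities once at startup and sets up the ObjectifyFilter):

    import static com.googlecode.objectify.ObjectifyService.ofy;

    import com.googlecode.objectify.ObjectifyService;
    import com.googlecode.objectify.annotation.Entity;
    import com.googlecode.objectify.annotation.Id;

    @Entity
    public class Greeting {
        @Id Long id;      // auto-allocated by the Datastore if left null
        String message;

        public static void demo() {
            ObjectifyService.register(Greeting.class);

            Greeting g = new Greeting();
            g.message = "hello";
            ofy().save().entity(g).now();   // synchronous put

            Greeting loaded = ofy().load().type(Greeting.class).id(g.id).now();
            System.out.println(loaded.message);
        }
    }

Compare that with wiring up a JDO PersistenceManagerFactory and fighting owned-relationship semantics, and the parent's preference makes sense.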
It's one of the first times the whole service has been down, but parts of the service go down at least once a week. Memcache and task queues show "elevated" error rates with regularity, and urlfetch is frequently down entirely ("elevated" generally means unusable).
Of course master/slave even has scheduled downtime.
I had network issues with my two virtual machines on Rackspace Cloud about a year or so ago (something was wrong with their routing). They were resolved quickly, though.
Internet Traffic Report, while a nice concept, is unfortunately very misleading.
Their sample size is extremely small, and most of those are permanently down.
Have a look through their list of North American routers and try to find one where packet loss has gotten worse the way their main overall packet-loss graph would suggest; I've just been through them all and couldn't find one.
Agreed. I was surprised to see only one[1] of the "down" routers with any packet loss over the last 24 hours.
That said, it's still interesting how the overall traffic trends so sharply downwards. I wonder if they have more data than they are showing in the graphs.
I noticed a couple of days ago that some of our DNS entries were mysteriously removed from Level 3's servers, which out of old habit we use for resolution (some of the IPs go back to UUNET/WorldCom/MCI).
Now the interesting bit: the entries were for private-subnet IPs. They're working fine everywhere else.
Today the last of their DNS servers dropped the entries, so I had users switch to Google (8.8.8.8) and all's well with our apps.
Level 3's entries for our external stuff are there; just the private-subnet entries are gone.
If others do this too and resolve with Level 3....
As Ewantoo mentioned, ITR is pretty unreliable. We have found it works as a relative measure -- when their packet loss rises, we do indeed see consumer traffic fall, though usually not by the amount their graphs would indicate.
It's time we remembered that the whole strength of the internet was that it was distributed and avoided single points of failure. We've ended up using vast amounts of infrastructure for no reason other than developer convenience (often with respect to security), when local direct connections are frequently more suitable than shooting everything into the cloud.
Huh? Historically, almost all content on the internet has had a single point of failure. Moving "into the cloud" is moving to a distributed model, and generally app engine will protect you from single points of failure. As others have noted, if you distribute your own self-managed servers across the globe, you end up with a very similar system, but now you have to manage it.
What appears to be happening here is that you're still vulnerable because the C&C infrastructure ultimately has a single command source, and so can be vulnerable if some code is pushed that affects the whole system. Your homegrown cloud will suffer from the same vulnerability, it may just be more or less easy to manage depending on how specialized your needs are compared to the more general requirements of GAE.
Edit: actually, maybe I missed what you're advocating with "local direct connections". This might make more sense from a user's perspective: if everyone ran their own little cloud, a failure may bring down reddit, but not reddit, heroku, pinterest, etc simultaneously. That's actually an interesting point, but I'm not sure if it really matters if they sync up their downtimes (since they would still have downtime, and maybe more or less of it depending on how much they could afford to invest into managing their distributed solution). I'm also not sure if that really solves the problem, since there are other concentrations in the network, just less visible ones (there are a fairly small number of major datacenters around the world, for instance, and managing your own colocated server doesn't matter if the whole building goes dark).
I do agree that at the very least we need to maintain an ecosystem of "cloud" providers, however.
The resiliency of the internet has nothing to do with the uptime of any service running at its endpoints. The resiliency is about the network itself: if a router goes down, there are paths around it, though not necessarily paths to every endpoint hanging off it or to the services they offer.
We shifted from services that fail often independently, to services failing rarely all at once. Clearly the former is going to be more noticeable and has greater societal impact, but as a business I'll take the latter any day.
Which is better? Using a fallacious comparison to suggest cloud computing is the only viable option, or comparing the pros and cons of different computing models to choose the best one for you?
While the argument was perhaps coming on a bit too strong, it's hard to deny the ease of deployment on cloud services. It's probably a safe bet to say that for most early stage startups the cloud is a good move.
I'm at a loss in these discussions. I don't understand this developer-point-of-view.
Can you specifically give me examples of why using a cloud provider is better for a startup than, for example, using a couple desktops in your garage?
You can't say it's because of backups, because the cloud doesn't provide backups (unless you purchase an extra data-backup solution with your cloud provider?). And correct me if I'm wrong, but you still have to set up your development environment on your local computer to write the code, install libraries to test with, etc.
What exactly are the steps involved in "deploying" that you couldn't do on your laptop, or a VPS?
The Germans referred to a Schwerpunkt (focal point and also known as Schwerpunktprinzip or concentration principle) in the planning of operations; it was a center of gravity or point of maximum effort, where a decisive action could be achieved. Ground, mechanised and tactical air forces were concentrated at this point of maximum effort whenever possible. By local success at the Schwerpunkt, a small force achieved a breakthrough and gained advantages by fighting in the enemy's rear. It is summarized by Guderian as “Klotzen, nicht kleckern!” (literally "boulders, not blots" and means "act powerfully, not superficially"). [1]
For a product prototype, the initial primary goal is "get it online so that we can start validating our assumptions". System administration skills and in-house server administration teams are valuable but not necessary.
This is a much more succinct explanation than my own. This is the point I was trying to make.
Even given that I have a good bit of sysadmin skills, I am needed more as a software developer right now in the early goings. I expect, as you've pointed out, that priorities will change with time and growth. We may even move to bare metal eventually, if we find ourselves needing and able to do so.
So to summarize you both: You want a rapid development platform that doubles as a production system and costs nothing to maintain. That does sound useful!
What? Did you read any of what was written? Point-by-point breakdown for you:
> You want a rapid development platform
No, we want low-maintenance infrastructure.
> that doubles as a production system
It is a production system. It does successfully serve many thousands of users for us every day. We've yet to have an outage that wasn't our own fault.
> and costs nothing to maintain
What? I specifically said we're willing to pay more not to have to spend as much time on infrastructure.
It's OK if you're too set in your ways to even attempt to level with alternative points of view, but at least try to read a little more thoroughly. And maybe admit that you're not willing to budge, so nobody wastes time trying to explain an alternative point of view.
I wasn't dismissing your point of view. Heroku is what I described. AWS, to a certain extent, is what I described. And I said it was a production system, too; you didn't have to over-emphasize what I had already said, as if I didn't just say it. "Costs nothing to maintain" is in comparison to paying to maintain it yourself. But thanks for knee-jerking.
> Can you specifically give me examples of why using a cloud provider is better for a startup than, for example, using a couple desktops in your garage?
For us (a small three-man team), the full portfolio of AWS services lets us shove responsibility for some of our infrastructure off to Amazon, which saves us loads of time (EC2, ELB, S3, CloudFront, SQS, Route53, SES, and some light DynamoDB in our case). Even with their repeat EBS issues, we've engineered around the common failure points, and do so with a tiny number of self-managed VMs relative to our traffic. Even if we were to go down every few months, our hosted services do better than a one-man devops team could ever do on his own in his garage, or even in a co-lo. Though AWS is not cheap, we gladly pay up in order to focus on our own software. Sure beats hiring another person, in our case.
The common counter-argument is that if we managed it ourselves, we'd have the ability to resolve outages on our own, because it'd be our responsibility. That's just the thing: we don't want it to be our responsibility. We've got one ops guy (me), and we need to be iterating on our product fast at this point. We'll instead plan for failure, design our architecture to continue operating under some degree of failure, and keep shipping.
Not to say that the "cloud" is a silver bullet. It's not. However, especially from the developer point of view, it lets our entire (tiny) team stay more focused on product development.
Addendum: I've managed servers in my basement, in a co-lo, and in the "cloud". Each of these routes is more (or less) appropriate in various cases, but AWS has been a boon for our particular use case.
It's not just the cloud itself, but your connection into it. I'm more concerned that my connection to the internet is lost so many times a day (traveling underground etc.) and that this is enough to prevent many devices in my vicinity from acting in a coherent way. They should be able to synchronize among themselves without the central intermediary.
Obviously this doesn't happen because it's hard, but also companies have a vested interest in piping all sorts of data through them for analytics purposes. This is not in the interest of the users at all.
That depends on how much long-term reputation damage you take from having availability that low before a large audience. Which in turn depends on how important your service really is—"can't play my game" is drastically different than "my landlord didn't receive my payment".
Point well made. We operate our Tonido relay servers across the world using Linode, SoftLayer, and Hetzner; they currently support a couple of million devices and half a million users. In the last 2 years our downtime has been less than 30 minutes, and it has been nil for end users, since they migrate to the nearest relay server hosted by a different provider.
SPOF, security, and control are the major issues with IaaS and PaaS offerings.
> "App Engine is currently experiencing serving issues. The team is actively working on restoring the service to full strength. Please follow this thread for updates."
-- Max Ross (Google) maxr@google.com via googlegroups.com
At this point, we have stabilized service to App Engine applications. App Engine is now successfully serving at our normal daily traffic level, and we are closely monitoring the situation and working to prevent recurrence of this incident.
This morning around 7:30AM US/Pacific time, a large percentage of App Engine’s load balancing infrastructure began failing. As the system recovered, individual jobs became overloaded with backed-up traffic, resulting in cascading failures. Affected applications experienced increased latencies and error rates. Once we confirmed this cycle, we temporarily shut down all traffic and then slowly ramped it back up to avoid overloading the load balancing infrastructure as it recovered. This restored normal serving behavior for all applications.
We’ll be posting a more detailed analysis of this incident once we have fully investigated and analyzed the root cause.
Regards,
Christina Ilvento on behalf of the Google App Engine Team
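The "shut off all traffic, then ramp it back up slowly" recovery described in that update is a standard way to break a cascading-failure loop: let everything back in at once and the backed-up clients immediately re-bury the recovering servers. A toy sketch of the general idea, not anything Google actually runs:

    public class TrafficRamp {
        // Toy admission controller: after a full stop, admit a growing fraction of
        // requests over a fixed window instead of letting everything in at once.
        private final long rampStart = System.currentTimeMillis();
        private static final long RAMP_MILLIS = 30L * 60 * 1000; // e.g. a 30-minute ramp

        /** Fraction of requests to admit right now, from 0.0 up to 1.0. */
        private double admitFraction() {
            double elapsed = System.currentTimeMillis() - rampStart;
            return Math.min(1.0, elapsed / RAMP_MILLIS);
        }

        /** True if this request should be served, false if it should be shed. */
        public boolean admit() {
            return Math.random() < admitFraction();
        }
    }

The requests that get shed see errors either way; the point is to keep the admitted ones fast enough that the system stops oscillating.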
Meanwhile... Gmail etc. are working just fine. So the claim that if you build on GAE you "take advantage of the same infrastructure used for Google services!!" starts to ring a bit hollow.
Or perhaps they are older than the affected infrastructure and just haven't been rewritten. Or perhaps they happen to not use the affected resource.
In fact, affected services include developers.google.com and code.google.com, and to a lesser extent youtube.com.
I know the claim of running Amazon.com on AWS was initially bogus, but did anyone notice trouble on that site with the recent AWS outages? It would be instructive to know how much they have integrated it since.
http://sites.google.com/ was down too. I got a monitoring report, but at first I assumed it was false, because false down reports are much more common than Google services actually being down.
At about 7:30am US/Pacific time this morning, Google began experiencing slow performance and dropped connections from one of the components of App Engine. Many App Engine applications are experiencing slow responses and an inability to connect to services. We currently show that a majority of App Engine users and services are affected. We are actively working on restoring service as quickly as possible.
Pingdom reports my GAE-hosted site has been down since 2012-10-26 10:37:38 EST, a bit over an hour now.
UPDATE: My site is back. Delayed report from Pingdom says site came back online after 50 minutes. Performance is sketchy still. We're probably not in the clear yet.
It's really quite remarkable (to be honest, inexcusable is probably a better word) that their status page is failing as well. My expectations for a company with Google's resources and infrastructure are a lot higher than that.
"At approximately 7:30am Pacific time this morning, Google began experiencing slow performance and dropped connections from one of the components of App Engine. The symptoms that service users would experience include slow response and an inability to connect to services. We currently show that a majority of App Engine users and services are affected. Google engineering teams are investigating a number of options for restoring service as quickly as possible, and we will provide another update as information changes, or within 60 minutes."
I'm really happy I don't host in the cloud. How quickly are the cost savings of cloud computing obliterated by PR, customer service, and system administration time when an outage like this occurs?
Surely hosting yourself exposes you to just as much risk, if not more? A problem in the datacentre where you're co-lo'd, or one of your servers blows up?
I think people not trusting the cloud is similar to how people feel safer driving their cars than taking a plane. The stats say the plane's safer, but people prefer being in control. People like the idea of being in control of their servers, even if that means there are hundreds of extra things that can go wrong compared to a cloud provider.
We also get a lot more publicity when a cloud provider has an outage, as LOTS of sites go down at once. Hardly anyone notices when service X, which self-hosts, goes down for a few hours...
This is true of every data center I've worked with. Also network providers: everyone has downtime and sometimes you learn the hard way that despite being written into your contract someone took the cheap way and ran your “redundant” fiber through the same conduit which a backhoe just tore up.
We're coloed across three datacenters spanning the US (one might be in TO I think) and if a datacenter were to go down, we have a hot backup that's no more than 12 hours stale.
The only real manual maintenance that we've got is a rolling reimaging of servers based on whatever's in version control, which usually takes a few hours twice a year, but we'd probably do that if we were in the cloud anyway.
When you can script away 90% of your system administration tasks, hosting in the cloud doesn't really make a ton of sense.
But a DNS-based failover is still going to take an hour or so to propagate, right (given that a lot of browsers/proxies/DNS servers don't respect TTLs very well at all)? And then you end up with a system with stale data, and the mess of trying to reconcile it when your other system comes back up.
I'd take an hour-long App Engine outage once a year over that any time!
Your name server or stub resolver is what respects DNS TTL, not your browser or proxy. Everyone - including people hosting on AWS - needs to be able to fail over DNS, if the AWS IP you're using is in a zone that just went down, for example.
Any time you have an outage you need to contact your service provider to get an estimate of downtime. If they can't give you one, assume it'll take forever and cut the DNS over. The worst case is some of your users will start to come back online slowly. If you don't cut over, the worst case is all your users are down until whenever the service provider fixes it, and you get to tell your users "we're waiting for someone else to deal with it", which won't make them very happy.
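What makes that call easier in practice is a dumb external health check of your own, so you know within a minute or two that you're down instead of hearing it from users while you debate cutting over. A minimal sketch (the URL, threshold, and interval are all made up; run it from a box outside the affected provider):

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class HealthCheck {
        private static final String PRIMARY = "http://www.example.com/healthz";
        private static final int FAILURES_BEFORE_ALERT = 3;

        public static void main(String[] args) throws InterruptedException {
            int consecutiveFailures = 0;
            while (true) {
                if (isUp(PRIMARY)) {
                    consecutiveFailures = 0;
                } else if (++consecutiveFailures >= FAILURES_BEFORE_ALERT) {
                    // Page a human: they call the provider, get an ETA,
                    // and decide whether to cut DNS over to the standby.
                    System.err.println("PRIMARY DOWN for " + consecutiveFailures + " checks");
                }
                Thread.sleep(30 * 1000); // check every 30 seconds
            }
        }

        private static boolean isUp(String url) {
            try {
                HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
                conn.setConnectTimeout(5000);
                conn.setReadTimeout(5000);
                return conn.getResponseCode() == 200;
            } catch (Exception e) {
                return false;
            }
        }
    }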
12 hours of stale data sounds kind of long to me. 4 hours sounds more reasonable.
I've seen plenty of crappy ISP DNS servers ignore TTL values and cache DNS entries for many hours longer than they're supposed to. Unfortunately, it's all too common.
In that case, what is the ratio of "time spent doing ops-related tasks" vs "time spent developing new features" in your company? Please offer an honest evaluation. Everything has a cost; I'm genuinely curious about data points other than my own.
Today, maybe, assuming a calm ocean and no scaling issues. But I don't believe you spent an hour a week setting up your three data centers, backups, failover procedure, etc.
The "safety" of the cloud is about two things: 1. trusting your service provider, and 2. redundancy.
You have to trust your cloud provider. They control everything you do. If their security isn't bulletproof, you're screwed. If their SAN's firmware isn't upgraded properly to deal with some performance issue, you're screwed. If their developers fuck up the API and you can't modify your instances, you're screwed. You have to put complete faith in a secret infrastructure run for hundreds of thousands of clients so there's no customer relationship to speak of.
That's just the "trust" issue. Then there's the issue of actual redundancy. It's completely possible to have a network-wide outage for a cloud provider. There will be no redundancy, because their entire system is built to be in unison; one change affects everything.
Running it yourself means you know how secure it is, how robust the procedures are, and you can build in real redundancy and real disaster recovery. Do people build themselves bulletproof services like this? Usually not. But if you cared to, you could.
It really depends on how many servers you have and how good your sysadmins are. If you have 1 server in a cheap colo, then yes, the cloud is probably better. If you have a GOOD hosting provider and build out a cluster that's redundant at every tier, you can easily beat the major cloud providers' uptime.
We run about 11 client clusters on ~250 servers across 3 data centers in the US and Europe. Each of our clients' uptime is very, very close to 100%, and we've NEVER lost everything, even for 1 second.
It's funny, I've got the exact opposite reasoning: it's in those moments that I can really appreciate the fact that I'm using the cloud:
1/ I don't have to spend the night debugging or replacing broken hardware
2/ It doesn't cost me any time, any additional resource, any support upgrade, any hardware.
3/ No one can blame me or anybody in my team for the fact that it's not working.
I don't feel like I'm lacking control; I feel like somebody else is taking care of the really annoying shit that happens all the time, no matter how well you design your system.
If you have paying clients, they WILL blame you and your team for the fact it's not working. They don't care who/what the underlying infrastructure is.
Also a good hosting company will handle identifying/fixing/replacing bad hardware for you.
Not to mention sufficient redundancy will ensure that you never see many effects from those hardware failures/power outages/floods/fires/anything else with any reasonable probability.
"The cloud is great because I can blame someone else" is obviously a tenuous argument.
I think there is an interesting debate to be had here. When your site goes down at the same time as Tumblr/Reddit/one of the big guys, the damage might not actually be that high. In the minds of many, "the internet" is probably quite broken.
Hm, bad week for the Cloud. Can't even get to the status page; hopefully it's not hosted on App Engine.
So going forward, what's the best way to protect against cloud downtime? Have a hot/standby failover with a different provider? Prepare customers' expectations for the possibility of server outages? Do a ton of research, pay $$$ for lots of nines uptime, and lambast the host when they don't deliver?
Downtime happens regardless of where you host. Unless you have a lot of money and talent to spend on your own infrastructure you won't avoid it, and even then it's hard to beat cloud providers like Amazon or Google, who have more resources and knowledge than you do.
The greatest thing about cloud hosting is that you can just sit back and let them fix it. It usually takes about half an hour, or a couple of hours if the outage is severe, but generally less than the time it takes for DNS record updates to propagate (unless you've got some proxy in front of your IPs, which would be another point of failure).
And then, even with these severe outages, the overall monthly uptime is still better than 99.9%, and that's really hard to beat, so just relax and let them fix it.
There's no such thing as "cloud downtime" - it's still servers, data centers, networks, same as everything else.
You need to decide how much uptime you're willing to pay for, how much your service can degrade for how long, and methodically address each level of the hierarchy between you and your customers – and it might be the case that you decide that the ongoing costs of your engineering support for e.g. wide geographic separation just aren't sustainable at the level your customers are willing to pay, particularly if you have something like a CDN helping keep your site partially responsive during less than catastrophic failures.
I'd say the answer depends on how fast GAE recovers. If you're building redundancy over multiple clouds and there's a lot of data:
1) It's very complex and expensive
2) You're looking at DNS to hot failover, in most cases.
If GAE can recover in less than 30 minutes and sticks to, say, one outage a year, you just can't justify the kind of cost you're looking at with option 2 (seriously, it's a lot of cash).
I would love to see people at Google show all the internal tools they're using to detect and solve these kinds of issues. I can only imagine a war room with screens all over the place showing gigantic amounts of red flashing lines :)
Hope it doesn't last long, though; I was praising what a good choice App Engine has been, just 10 minutes ago...
Whoa did not expect to receive such negative feedback on this post. Was really looking for feedback from the community on a replacement. I'll post this elsewhere I guess.