Half the internet is down, Amazon.com is flapping, the AWS website doesn't have any assets, and status.aws has a green checkmark with a little (i), 40 minutes after the problems started?!
I love AWS, but they really need to improve their procedures for communicating during outages. If your parent company's billion-dollar site can be affected in any way on the night before Black Friday, and even your own site is down, and you are not acknowledging that the service is FUBAR - you have a problem.
Exactly this. In our experience, Amazon's status page doesn't reflect the actual outages we are having, CloudWatch often sends false positives (reporting systems as down when, in fact, they are not), and SNS messages get lost in the ether.
We've built our systems to recover from failure states once we know there is a problem. AWS's inability to do that reliably is forcing us to own the problem ourselves, and as a result we will probably migrate away to cheaper boxes in the cloud.
UPDATE: RESOLVED, according to the AWS status page: "6:24 PM PST Between 4:12 PM and 6:02 PM PST, users experienced elevated error rates when making DNS queries for CloudFront distributions. The service has recovered and is operating normally." [2]
--
CloudFront DNS is hosed. Doing a DNS lookup on my CloudFront distribution fails, and many folks on Twitter are seeing the issue too [1]. Maybe a failed upgrade or something; their whois info was updated today at 2014-11-26T16:24:49-0800.
[~]$ host d1cg27r99kkbpq.cloudfront.net
Host d1cg27r99kkbpq.cloudfront.net not found: 2(SERVFAIL)
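If you want to script that check rather than eyeball the host output, here's a minimal sketch using only Python's standard library (the hostname is just the distribution from above; substitute your own):

    import socket
    import sys

    def resolves(hostname):
        """Return True if the hostname currently resolves to at least one address."""
        try:
            return len(socket.getaddrinfo(hostname, 443)) > 0
        except socket.gaierror as exc:
            # SERVFAIL and NXDOMAIN both surface here as socket.gaierror;
            # the exact errno depends on the local resolver.
            print("lookup failed: %s" % exc, file=sys.stderr)
            return False

    if __name__ == "__main__":
        host = sys.argv[1] if len(sys.argv) > 1 else "d1cg27r99kkbpq.cloudfront.net"
        print("OK" if resolves(host) else "FAILED")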
The AWS status page not showing anything isn't unusual. They'll probably update in half an hour to describe it as a partial failure, then revise it to an "all OK" 10 minutes before it's actually fixed, then retroactively downgrade it to a minor quality-of-service disruption.
Not that I've noticed they tend to lie through their teeth on the status page or anything....
Apparently, "DNS resolution errors" are considered neither a "service disruption" nor a "performance issue", but instead are categorized as an "informational message."
The AWS status page shows CloudFront DNS issues, which is reflected in our page's assets not loading. Kinda makes me wish we were using something like https://github.com/etsy/cdncontrol/ but that's a fight for another day!
<title type="text">Informational message: DNS Resolution errors </title>
<link>http://status.aws.amazon.com</link>
<pubDate>Wed, 26 Nov 2014 17:00:39 PST</pubDate>
<guid>http://status.aws.amazon.com/#cloudfront_1417050039</guid>
<description>We are currently investigating increased error rates for DNS queries for CloudFront distributions. </description>
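Since that snippet is straight out of the status page's RSS feed, you can poll the feed instead of refreshing the page. A rough sketch with the standard library; the feed URL below is an assumption based on how the page links its per-service feeds, so verify it before depending on it:

    import urllib.request
    import xml.etree.ElementTree as ET

    # Assumed feed URL; check the RSS links on status.aws.amazon.com for the real one.
    FEED = "http://status.aws.amazon.com/rss/cloudfront.rss"

    def latest_items(url=FEED):
        with urllib.request.urlopen(url, timeout=10) as resp:
            root = ET.fromstring(resp.read())
        # Each <item> carries the title/pubDate/description fields shown above.
        for item in root.iter("item"):
            yield "%s  %s" % (item.findtext("pubDate", default=""),
                              item.findtext("title", default=""))

    if __name__ == "__main__":
        for line in latest_items():
            print(line)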
One wonders if this is the same matter currently afflicting a certain infamous torrent site ("...downtime appears to be a routing issue as the site is still reachable in most parts of the world"): http://torrentfreak.com/the-pirate-bay-goes-down-locally-141...
It's a decent idea, but it would be better to do it client-side. For this sort of event, knowing that the name resolves on the server doesn't give you any confidence that it will resolve for the client.
What would be really nice is if you could specify a fallback host in your DNS prefetch, and the browser would make it "just work."
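You can at least get a bit closer to the client's view by querying a few public resolvers directly instead of trusting the server's own stub resolver. A rough sketch, assuming dnspython 2.x is installed; it still isn't the same as knowing what each client's resolver actually sees:

    import dns.exception
    import dns.resolver   # pip install dnspython

    # A couple of public resolvers to probe; both predate this outage.
    RESOLVERS = {"google": "8.8.8.8", "opendns": "208.67.222.222"}

    def check(name):
        for label, ip in RESOLVERS.items():
            r = dns.resolver.Resolver(configure=False)
            r.nameservers = [ip]
            try:
                answers = r.resolve(name, "A", lifetime=5)
                print("%-8s OK   %s" % (label, [a.address for a in answers]))
            except dns.exception.DNSException as exc:
                print("%-8s FAIL %s" % (label, exc))

    check("d1cg27r99kkbpq.cloudfront.net")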
It's flapping in some locations and seems to be hard down in others. From a service monitoring CloudFront on various geographically distributed Nagios monitors (times are PST):
Columbus:
[11-26-2014 16:46:24] SERVICE ALERT: public-www;CDN - Logo;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 2960 bytes in 0.165 second response time
[11-26-2014 16:45:34] SERVICE ALERT: public-www;CDN - Logo;CRITICAL;SOFT;1;Name or service not known
[11-26-2014 16:39:24] SERVICE ALERT: public-www;CDN - Logo;OK;SOFT;3;HTTP OK: HTTP/1.1 200 OK - 2960 bytes in 0.030 second response time
[11-26-2014 16:38:34] SERVICE ALERT: public-www;CDN - Logo;CRITICAL;SOFT;2;Name or service not known
[11-26-2014 16:37:34] SERVICE ALERT: public-www;CDN - Logo;CRITICAL;SOFT;1;Name or service not known
[11-26-2014 16:25:24] SERVICE ALERT: public-www;CDN - Logo;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 2960 bytes in 0.030 second response time
[11-26-2014 16:24:34] SERVICE ALERT: public-www;CDN - Logo;CRITICAL;SOFT;1;Name or service not known
[11-26-2014 16:21:24] SERVICE ALERT: public-www;CDN - Logo;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 2960 bytes in 0.066 second response time
[11-26-2014 16:20:24] SERVICE ALERT: public-www;CDN - Logo;CRITICAL;SOFT;1;Name or service not known
Portland:
[11-26-2014 16:49:40] SERVICE ALERT: public-www;CDN - Logo;CRITICAL;SOFT;1;Name or service not known
[11-26-2014 16:43:40] SERVICE ALERT: public-www;CDN - Logo;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 2960 bytes in 0.148 second response time
[11-26-2014 16:42:40] SERVICE ALERT: public-www;CDN - Logo;CRITICAL;SOFT;1;Name or service not known
[11-26-2014 16:39:40] SERVICE ALERT: public-www;CDN - Logo;OK;HARD;3;HTTP OK: HTTP/1.1 200 OK - 2960 bytes in 0.186 second response time
[11-26-2014 16:21:40] SERVICE ALERT: public-www;CDN - Logo;CRITICAL;HARD;3;Name or service not known
[11-26-2014 16:20:40] SERVICE ALERT: public-www;CDN - Logo;CRITICAL;SOFT;2;Name or service not known
[11-26-2014 16:19:41] SERVICE ALERT: public-www;CDN - Logo;CRITICAL;SOFT;1;Name or service not known
Santa Clara:
[11-26-2014 16:24:26] SERVICE ALERT: public-www;CDN - Logo;CRITICAL;HARD;3;Name or service not known
[11-26-2014 16:23:25] SERVICE ALERT: public-www;CDN - Logo;CRITICAL;SOFT;2;Name or service not known
[11-26-2014 16:22:25] SERVICE ALERT: public-www;CDN - Logo;CRITICAL;SOFT;1;Name or service not known
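For anyone not fluent in Nagios logs: the SOFT/HARD label and trailing counter reflect the retry logic; a failure stays SOFT while Nagios re-checks, and only becomes HARD (and fires notifications) after max_check_attempts consecutive failures. Roughly, assuming max_check_attempts of 3 as the counters above suggest:

    MAX_CHECK_ATTEMPTS = 3   # assumed from the ";SOFT;1/2" and ";HARD;3" counters above

    def failure_label(consecutive_failures):
        """Toy model of the CRITICAL labels above, not actual Nagios code; recoveries
        are logged the same way (OK;SOFT;n or OK;HARD;n) depending on which state they clear."""
        if consecutive_failures < MAX_CHECK_ATTEMPTS:
            return "CRITICAL;SOFT;%d" % consecutive_failures   # retrying, no alert yet
        return "CRITICAL;HARD;%d" % MAX_CHECK_ATTEMPTS         # confirmed, notifications fire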
6:24 PM PST Between 4:12 PM and 6:02 PM PST, users experienced elevated error rates when making DNS queries for CloudFront distributions. The service has recovered and is operating normally.
This was linked elsewhere in the thread (https://github.com/etsy/cdncontrol/), and Etsy seemed to be up through the entire thing. So perhaps they're doing something right :)
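For context, cdncontrol manages traffic across multiple CDN providers (by adjusting weights at the DNS layer, if I remember right). A much cruder, application-level version of the same idea, with made-up placeholder hostnames, might look like:

    import socket
    import urllib.request

    # NOT cdncontrol's API; hostnames are made-up placeholders for illustration.
    CDN_HOSTS = [
        "dXXXXXXXXXXXXX.cloudfront.net",      # primary CDN (placeholder)
        "assets.backup-cdn.example.com",      # secondary CDN (placeholder)
    ]

    def healthy(host, probe_path="/health.gif"):
        try:
            socket.getaddrinfo(host, 443)     # does the name even resolve?
            req = urllib.request.Request("https://%s%s" % (host, probe_path), method="HEAD")
            with urllib.request.urlopen(req, timeout=5) as resp:
                return resp.status == 200
        except Exception:
            return False

    def pick_asset_host():
        for host in CDN_HOSTS:
            if healthy(host):
                return host
        return CDN_HOSTS[-1]   # nothing healthy; degrade to the last option

    # Templates would then build asset URLs from pick_asset_host() instead of a
    # hard-coded CloudFront hostname.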
I'm curious whether the root cause of this is a software failure or an external attack. AWS is clearly at a point in size and competence where hardware failure is unlikely to be the root cause.
> AWS is clearly at a point in size and competence where hardware failure is unlikely to be the root cause.
The opposite might also be true: Amazon might now have reached a point in size where they can't scale any further without losing visibility and control over part of their hardware.
We were first alerted to an issue by our monitoring at 4:20 PM Pacific, 40 minutes before AWS updated their status page. It's infuriating that AWS doesn't communicate more promptly and transparently when an issue like this occurs.
We have business-level support through AWS Activate, and it's not really any different - we could submit a ticket and get a phone call back within the 1-hour guarantee, but I'm not sure that would get you much more information or faster resolution for an issue like this. The really big customers presumably have contacts within the AWS team who give them extra information in cases like this.
I had an Urgent-level issue just two days ago (1-hour SLA), also with business-level support through AWS Activate, and it took them THREE hours to get back to me, with me simultaneously waiting on one chat and two phone lines trying to get in contact with anyone I could.
Finally, chat came through, but only after a wait far longer than what their SLA guarantees.
Point being that support probably isn't going to get you much either, especially given that they aren't holding to the 1-hour SLA.
Trust me, bare metal doesn't get you around this kind of mess; it just leads to a different set of problems. When your core router loses both its brains because of a malfunctioning line card, all bets are off.