Half the internet is down, Amazon.com is flapping, the AWS website doesn't have any assets, and status.aws has a green checkmark with a little (i), 40 minutes after the problems started?!
I love AWS, but they really need to improve their procedures for communicating during outages. If your parent company's billion-dollar site can be affected in any way on the night before Black Friday, and even your own site is down, and you are not acknowledging that the service is FUBAR - you have a problem.
Exactly this. In our experience, Amazon's status page doesn't reflect the actual outages we are having, CloudWatch often sends false positives (reporting systems as down when, in fact, they are not), and SNS messages get lost in the ether.
We've built our systems to recover from failure states once we know there is a problem. AWS's inability to do that reliably is forcing us to own the problem ourselves, and as a result we will probably migrate away to cheaper boxes in the cloud.
UPDATE: RESOLVED, according to the AWS status page: "6:24 PM PST Between 4:12 PM and 6:02 PM PST, users experienced elevated error rates when making DNS queries for CloudFront distributions. The service has recovered and is operating normally." [2]
--
CloudFront DNS is hosed. Doing a DNS lookup on my CloudFront distribution fails, and many folks on Twitter are seeing the issue too [1]. Maybe a failed upgrade or something; their whois info was updated today at 2014-11-26T16:24:49-0800.
[~]$ host d1cg27r99kkbpq.cloudfront.net
Host d1cg27r99kkbpq.cloudfront.net not found: 2(SERVFAIL)
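If you want to script that check rather than eyeball the host output, here's a minimal sketch using only Python's standard library (the hostname is just the distribution from above; substitute your own):

    import socket
    import sys

    def resolves(hostname):
        """Return True if the hostname currently resolves to at least one address."""
        try:
            return len(socket.getaddrinfo(hostname, 443)) > 0
        except socket.gaierror as exc:
            # SERVFAIL and NXDOMAIN both surface here as socket.gaierror;
            # the exact errno depends on the local resolver.
            print("lookup failed: %s" % exc, file=sys.stderr)
            return False

    if __name__ == "__main__":
        host = sys.argv[1] if len(sys.argv) > 1 else "d1cg27r99kkbpq.cloudfront.net"
        print("OK" if resolves(host) else "FAILED")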
The AWS status page not showing anything isn't unusual. They'll probably update in half an hour to describe it as a partial failure, then revise it to an "all OK" 10 minutes before it's actually fixed, then retroactively downgrade it to a minor quality-of-service disruption.
Not that I've noticed they tend to lie through their teeth on the status page or anything....
Apparently, "DNS resolution errors" are considered neither a "service disruption" nor a "performance issue", but instead are categorized as an "informational message."
The AWS status page shows CloudFront DNS issues, which is reflected in our page's assets not loading. Kinda makes me wish we were using something like https://github.com/etsy/cdncontrol/ but that's a fight for another day!
<title type="text">Informational message: DNS Resolution errors </title>
<link>http://status.aws.amazon.com</link>
<pubDate>Wed, 26 Nov 2014 17:00:39 PST</pubDate>
<guid>http://status.aws.amazon.com/#cloudfront_1417050039</guid>
<description>We are currently investigating increased error rates for DNS queries for CloudFront distributions. </description>
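Since that snippet is straight out of the status page's RSS feed, you can poll the feed instead of refreshing the page. A rough sketch with the standard library; the feed URL below is an assumption based on how the page links its per-service feeds, so verify it before depending on it:

    import urllib.request
    import xml.etree.ElementTree as ET

    # Assumed feed URL; check the RSS links on status.aws.amazon.com for the real one.
    FEED = "http://status.aws.amazon.com/rss/cloudfront.rss"

    def latest_items(url=FEED):
        with urllib.request.urlopen(url, timeout=10) as resp:
            root = ET.fromstring(resp.read())
        # Each <item> carries the title/pubDate/description fields shown above.
        for item in root.iter("item"):
            yield "%s  %s" % (item.findtext("pubDate", default=""),
                              item.findtext("title", default=""))

    if __name__ == "__main__":
        for line in latest_items():
            print(line)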
One wonders if this is the same matter currently afflicting a certain infamous torrent site ("...downtime appears to be a routing issue as the site is still reachable in most parts of the world"): http://torrentfreak.com/the-pirate-bay-goes-down-locally-141...
It's a decent idea, but it would be better to do it client-side. For this sort of event, knowing that the name resolves on the server doesn't give you any confidence that it will resolve for the client.
What would be really nice is if you could specify a fallback host in your DNS prefetch, and the browser would make it "just work."
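You can at least get a bit closer to the client's view by querying a few public resolvers directly instead of trusting the server's own stub resolver. A rough sketch, assuming dnspython 2.x is installed; it still isn't the same as knowing what each client's resolver actually sees:

    import dns.exception
    import dns.resolver   # pip install dnspython

    # A couple of public resolvers to probe; both predate this outage.
    RESOLVERS = {"google": "8.8.8.8", "opendns": "208.67.222.222"}

    def check(name):
        for label, ip in RESOLVERS.items():
            r = dns.resolver.Resolver(configure=False)
            r.nameservers = [ip]
            try:
                answers = r.resolve(name, "A", lifetime=5)
                print("%-8s OK   %s" % (label, [a.address for a in answers]))
            except dns.exception.DNSException as exc:
                print("%-8s FAIL %s" % (label, exc))

    check("d1cg27r99kkbpq.cloudfront.net")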
It's flapping in some locations and seems to be hard down in others. From a service monitoring CloudFront on various geographically distributed Nagios monitors (times are PST):
Columbus:
[11-26-2014 16:46:24] SERVICE ALERT: public-www;CDN - Logo;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 2960 bytes in 0.165 second response time
[11-26-2014 16:45:34] SERVICE ALERT: public-www;CDN - Logo;CRITICAL;SOFT;1;Name or service not known
[11-26-2014 16:39:24] SERVICE ALERT: public-www;CDN - Logo;OK;SOFT;3;HTTP OK: HTTP/1.1 200 OK - 2960 bytes in 0.030 second response time
[11-26-2014 16:38:34] SERVICE ALERT: public-www;CDN - Logo;CRITICAL;SOFT;2;Name or service not known
[11-26-2014 16:37:34] SERVICE ALERT: public-www;CDN - Logo;CRITICAL;SOFT;1;Name or service not known
[11-26-2014 16:25:24] SERVICE ALERT: public-www;CDN - Logo;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 2960 bytes in 0.030 second response time
[11-26-2014 16:24:34] SERVICE ALERT: public-www;CDN - Logo;CRITICAL;SOFT;1;Name or service not known
[11-26-2014 16:21:24] SERVICE ALERT: public-www;CDN - Logo;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 2960 bytes in 0.066 second response time
[11-26-2014 16:20:24] SERVICE ALERT: public-www;CDN - Logo;CRITICAL;SOFT;1;Name or service not known
Portland:
[11-26-2014 16:49:40] SERVICE ALERT: public-www;CDN - Logo;CRITICAL;SOFT;1;Name or service not known
[11-26-2014 16:43:40] SERVICE ALERT: public-www;CDN - Logo;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 2960 bytes in 0.148 second response time
[11-26-2014 16:42:40] SERVICE ALERT: public-www;CDN - Logo;CRITICAL;SOFT;1;Name or service not known
[11-26-2014 16:39:40] SERVICE ALERT: public-www;CDN - Logo;OK;HARD;3;HTTP OK: HTTP/1.1 200 OK - 2960 bytes in 0.186 second response time
[11-26-2014 16:21:40] SERVICE ALERT: public-www;CDN - Logo;CRITICAL;HARD;3;Name or service not known
[11-26-2014 16:20:40] SERVICE ALERT: public-www;CDN - Logo;CRITICAL;SOFT;2;Name or service not known
[11-26-2014 16:19:41] SERVICE ALERT: public-www;CDN - Logo;CRITICAL;SOFT;1;Name or service not known
Santa Clara:
[11-26-2014 16:24:26] SERVICE ALERT: public-www;CDN - Logo;CRITICAL;HARD;3;Name or service not known
[11-26-2014 16:23:25] SERVICE ALERT: public-www;CDN - Logo;CRITICAL;SOFT;2;Name or service not known
[11-26-2014 16:22:25] SERVICE ALERT: public-www;CDN - Logo;CRITICAL;SOFT;1;Name or service not known
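For anyone not fluent in Nagios logs: the SOFT/HARD label and trailing counter reflect the retry logic; a failure stays SOFT while Nagios re-checks, and only becomes HARD (and fires notifications) after max_check_attempts consecutive failures. Roughly, assuming max_check_attempts of 3 as the counters above suggest:

    MAX_CHECK_ATTEMPTS = 3   # assumed from the ";SOFT;1/2" and ";HARD;3" counters above

    def failure_label(consecutive_failures):
        """Toy model of the CRITICAL labels above, not actual Nagios code; recoveries
        are logged the same way (OK;SOFT;n or OK;HARD;n) depending on which state they clear."""
        if consecutive_failures < MAX_CHECK_ATTEMPTS:
            return "CRITICAL;SOFT;%d" % consecutive_failures   # retrying, no alert yet
        return "CRITICAL;HARD;%d" % MAX_CHECK_ATTEMPTS         # confirmed, notifications fire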
6:24 PM PST Between 4:12 PM and 6:02 PM PST, users experienced elevated error rates when making DNS queries for CloudFront distributions. The service has recovered and is operating normally.
This was linked elsewhere in the thread (https://github.com/etsy/cdncontrol/), and Etsy seemed to be up through the entire thing. So perhaps they're doing something right :)
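For context, cdncontrol manages traffic across multiple CDN providers (by adjusting weights at the DNS layer, if I remember right). A much cruder, application-level version of the same idea, with made-up placeholder hostnames, might look like:

    import socket
    import urllib.request

    # NOT cdncontrol's API; hostnames are made-up placeholders for illustration.
    CDN_HOSTS = [
        "dXXXXXXXXXXXXX.cloudfront.net",      # primary CDN (placeholder)
        "assets.backup-cdn.example.com",      # secondary CDN (placeholder)
    ]

    def healthy(host, probe_path="/health.gif"):
        try:
            socket.getaddrinfo(host, 443)     # does the name even resolve?
            req = urllib.request.Request("https://%s%s" % (host, probe_path), method="HEAD")
            with urllib.request.urlopen(req, timeout=5) as resp:
                return resp.status == 200
        except Exception:
            return False

    def pick_asset_host():
        for host in CDN_HOSTS:
            if healthy(host):
                return host
        return CDN_HOSTS[-1]   # nothing healthy; degrade to the last option

    # Templates would then build asset URLs from pick_asset_host() instead of a
    # hard-coded CloudFront hostname.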
I'm curious whether the root cause of this is a software failure or an external attack. AWS is clearly at a point in size and competence where hardware failure is unlikely to be the root cause.
> AWS is clearly at a point in size and competence where hardware failure is unlikely to be the root cause.
The opposite might also be true: Amazon might now have reached a point in size where they can't scale any further without losing visibility and control over part of their hardware.
We were first alerted to an issue by our monitoring at 4:20 PM Pacific, 40 minutes before AWS updated their status page. It's infuriating that AWS doesn't communicate more promptly and transparently when an issue like this occurs.
We have business-level support through AWS Activate, and it's not really any different - we could submit a ticket and get a phone call back within the 1-hour guarantee, but I'm not sure that would get you much more information or faster resolution for an issue like this. The really big customers presumably have contacts within the AWS team who give them extra information in cases like this.
I had an Urgent-level issue just two days ago (1-hour SLA), also with business-level support through AWS Activate, and it took them THREE hours to get back to me, with me simultaneously waiting on one chat and two phone lines trying to get in contact with anyone I could.
Finally, chat came through, but only after a wait far longer than what their SLA guarantees.
Point being that support probably isn't going to get you much either, especially given that they aren't holding to the 1-hour SLA.
Trust me, bare metal doesn't get you around this kind of mess; it just leads to a different set of problems. When your core router loses both its brains because of a malfunctioning line card, all bets are off.