Amazon goes down; takes S3, Salesforce, Target, and others with them.

cperciva · on Dec 24, 2009

And people told me I was crazy for hard-coding IP addresses for s3.amazonaws.com and sdb.amazonaws.com into the Tarsnap code... :-)

akl · on Dec 24, 2009

You are crazy for doing that - one instance of downtime doesn't justify ignoring all the advantages DNS brings.

cperciva · on Dec 24, 2009

When I made that design decision I wasn't considering the possibility of DNS outages at all; I was just thinking in terms of "there's a huge number of places between me and ultradns where someone could insert a spoofed DNS response".

jacquesm · on Dec 24, 2009

What if Amazon gets a new netblock assigned?

Can you forcibly upgrade the software in that case?

cperciva · on Dec 24, 2009

Of course I can upgrade the software; and there's nothing forcible about it. We're talking about code I'm running on my server here...

jacquesm · on Dec 25, 2009

Ah, my bad; sorry, I thought you meant in the Tarsnap client. Then I really don't know why people would make a fuss over you hardcoding the IPs in there. All you need to do is keep an eye out in case they change them (which you could even automate).

cperciva · on Dec 25, 2009

"All you need to do is keep an eye out in case they change them"

Exactly, and that's what I do. (With the caveat that I look for AWS endpoints being taken out of service, not for changes in what DNS tells me.)

derefr · on Dec 24, 2009

Are there DNS servers that support versioning? The best solution I could imagine would simply be to set Tarsnap to normally use the current DNS records for S3, but be able to roll back to the last valid zone record if it encounters an update that makes the servers stop resolving.

cperciva · on Dec 24, 2009

The Tarsnap client only talks to the Tarsnap server, but I do cache that lookup in order to avoid problems with glitchy DNS resolution.

I could have the Tarsnap server cache DNS lookups if I was only worried about working around DNS outages -- but as I said, that wasn't something I was considering at all when I made the decision to eschew DNS.

maw · on Dec 24, 2009

Some DNS caches will continue to try giving out expired records if no newer ones can be found. Unclear that it's worth it, though.
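
A minimal sketch of that "keep serving the expired record" behaviour, done in the application rather than in a DNS cache. Everything here is an assumption for illustration: the `LastGoodResolver` name, the `ttl` default, and the injectable `lookup` and `clock` parameters are hypothetical, and this is not how any particular DNS cache or Tarsnap does it. Note that it only helps with outages; a spoofed answer would be cached just as happily as a real one.

```python
import socket
import time


class LastGoodResolver:
    """Resolve host names, falling back to the last good answer on failure."""

    def __init__(self, lookup=None, ttl=300.0, clock=time.monotonic):
        # lookup: callable(host) -> list of address strings (hypothetical hook,
        # mainly so the class can be exercised without real DNS).
        self._lookup = lookup or self._system_lookup
        self._ttl = ttl
        self._clock = clock
        self._cache = {}  # host -> (addresses, time fetched)

    @staticmethod
    def _system_lookup(host):
        # Ask the system resolver; keep only the distinct address strings.
        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in infos})

    def resolve(self, host):
        now = self._clock()
        cached = self._cache.get(host)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]  # still fresh, no lookup needed

        try:
            addresses = self._lookup(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            addresses = []

        if addresses:
            self._cache[host] = (addresses, now)
            return addresses
        if cached is not None:
            return cached[0]  # expired, but better than nothing
        raise LookupError(f"no current or cached addresses for {host}")
```

With the defaults, `LastGoodResolver().resolve("s3.amazonaws.com")` goes through the system resolver; once a name has resolved at least once, later lookup failures return the old addresses instead of raising.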

baq · on Dec 24, 2009

you can expect such downtimes every Christmas. why? ask yourself how much protection money UltraDNS was told to pay.

justinsb · on Dec 24, 2009

Awesome ... was that for security or for redundancy? Did you have a solution for what would happen if the IP address changed?

cperciva · on Dec 24, 2009

Mostly security -- DNS is one of the worst offenders when it comes to protocols with security problems, both in terms of protocol issues and bugs in implementations.

Amazon doesn't move endpoints very often, and I round-robin requests to several endpoints; so in the rare cases an AWS endpoint changes I see a slight decrease in Tarsnap performance and can make the Tarsnap server stop using the dead endpoint before any users are likely to notice.
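
A rough sketch of the round-robin scheme described above, for readers who want to see the shape of it. This is not Tarsnap's code: the `EndpointPool` class and its method names are made up for illustration, and the addresses used with it below are documentation-only placeholders (192.0.2.0/24), not real AWS endpoints.

```python
class EndpointPool:
    """Round-robin over a fixed list of IP addresses, skipping dead ones."""

    def __init__(self, addresses):
        if not addresses:
            raise ValueError("need at least one address")
        self._addresses = list(addresses)
        self._dead = set()
        self._next = 0  # index of the next address to try

    def pick(self):
        """Return the next live address in rotation."""
        count = len(self._addresses)
        for _ in range(count):
            address = self._addresses[self._next]
            self._next = (self._next + 1) % count
            if address not in self._dead:
                return address
        raise RuntimeError("no live endpoints left")

    def mark_dead(self, address):
        """Stop handing out an address, e.g. after it is taken out of service."""
        self._dead.add(address)


pool = EndpointPool(["192.0.2.1", "192.0.2.2", "192.0.2.3"])
first_round = [pool.pick() for _ in range(3)]
pool.mark_dead("192.0.2.2")
after_removal = [pool.pick() for _ in range(4)]
```

Requests sent to a dead endpoint before it is marked simply fail or time out, which matches the "slight decrease in performance" above: only a fraction of requests land on it, and the rest are unaffected.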

esja · on Dec 24, 2009

Hmmm... should I trust my backups to generic software vendor X, or to someone who clearly takes nothing for granted... tough choice. Signing up now.

justinsb · on Dec 24, 2009

It is DNS. If you put the EC2/S3 address into /etc/hosts, the services work fine. It is apparently affecting lots of other big websites as well (Target, Salesforce), because they all outsource to UltraDNS.

shaddi · on Dec 24, 2009

NANOG chatter confirming it is an issue with UltraDNS. Seems to be west coast related.

EDIT: Potentially a DoS attack. From NANOG: "We have some DNS providing type customers (not UltraDNS) receiving a few million packets/sec of UDP/53 DoS traffic, starting at about the same time as the UltraDNS problems. No clue if it's related, but it certainly sounds suspicious. :)"

aristus · on Dec 24, 2009

I'm very surprised that they rely on a single vendor. But I guess DNS is one of those things you don't think about until it fails.

metachor · on Dec 24, 2009

The point of using UltraDNS is that they provide fast "real-time" failover of DNS routing. This is used, for example, for fault tolerance where the IP address responding to a domain name might change due to failure scenarios or load balancing (where a different server is now primary responder to the domain name, maybe located in a different data center or country). Infrastructure-wise, UltraDNS is kind of like the Akamai of DNS, instead of content distribution.

justinsb · on Dec 24, 2009

I thought DNS was supposed to try backup servers automatically... any DNS experts able to explain what's going on? Some of the UltraDNS servers are returning (correct) values, others are simply not responding.

lsc · on Dec 24, 2009

yeah, uh, if you are smart, you have a secondary DNS provider. But that really requires you to manage it yourself. the problem was that many people outsource, which usually means going with only one provider. (now, UltraDNS does have a good setup, they probably aren't a bad choice for a provider, but having only one is just plain stupid.)

boredguy8 · on Dec 24, 2009

Never let truth get in the way of a good headline.

artagnon · on Dec 24, 2009

Haha.. true. What else can we expect from TechCrunch?

jseifer · on Dec 24, 2009

It doesn't say in the TC article and I can't really tell from this thread. How long was Amazon actually down?

ggrot · on Dec 24, 2009

I wonder how much money Amazon loses per minute 2 days before Christmas. Ouch.

rlpb · on Dec 24, 2009

I don't think that Amazon itself will lose much, since they're well-known enough that most people who fail to reach them will retry later.

I think the small fry using S3 and EC2 will be the ones who are actually hit by this.

cookiecaper · on Dec 24, 2009

I don't know, I don't think it'd be as much as 4 - 7 days before Christmas. Most people know that it's too late to order from Amazon or any other online-only store by the night of Dec. 23. It's probably more money than they'd lose normally, but maybe not that significant?

I have no data to back that up, it's pure speculation.

alain94040 · on Dec 24, 2009

At $20B in annual revenue, and assuming Christmas is a big chunk, I'm guessing $2B (10%) comes in during the last two weeks; 2 minutes would then be worth about $200,000?

I think my 10% number is too low, so maybe $500K for two minutes? But this is not profit, just revenue.
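
For anyone who wants to check the back-of-envelope numbers, here is the arithmetic spelled out. The $20B annual figure and the 10% share are the guesses from this comment, not reported data, and the function name is just for illustration; revenue is also assumed to be spread evenly over the two weeks, which it certainly isn't.

```python
def revenue_for_window(period_revenue, period_days, window_minutes):
    """Revenue falling in a window, assuming it is spread evenly over the period."""
    minutes_in_period = period_days * 24 * 60
    return period_revenue * window_minutes / minutes_in_period


annual_revenue = 20e9       # guessed annual revenue, in dollars
two_weeks_in_days = 14
outage_minutes = 2

# Guess 1: 10% of annual revenue arrives in the last two weeks.
low_estimate = revenue_for_window(0.10 * annual_revenue, two_weeks_in_days, outage_minutes)
print(f"${low_estimate:,.0f}")   # $198,413, i.e. roughly $200,000

# Guess 2: what share of annual revenue does $500K per two minutes imply?
minutes_in_two_weeks = two_weeks_in_days * 24 * 60
implied_share = 500_000 / outage_minutes * minutes_in_two_weeks / annual_revenue
print(f"{implied_share:.1%}")    # 25.2%
```

So the $500K figure amounts to assuming about a quarter of the year's revenue lands in those two weeks.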

wglb · on Dec 24, 2009

Do you think it might catch up once it is available again? It is a pretty serious shopping destination, and possibly people retry.

robryan · on Dec 24, 2009

It really depends on the downtime: if it was more than, say, an hour, a lot of people would probably buy elsewhere, but you wouldn't think a small amount of downtime would affect sales.

notmyname · on Dec 24, 2009

surely they mean "Amazon goes down and takes the Internet with it"

slig · on Dec 24, 2009

Only if TC was hosted there.

Sam_Odio · on Dec 24, 2009

It looks like a DNS issue, since while http://amazon.com isn't working for me http://72.21.207.65 is (kind of).

EDIT: "No A records were found for amazon.com" http://www.zoneedit.com/lookup.html?host=amazon.com&type...

mattiss · on Dec 24, 2009

When Rackspace goes down and takes TechCrunch with them, the whole INTERNET goes down. When Amazon goes down, Salesforce, Target, and others...

tinio · on Dec 24, 2009

It looks to be back up now.

newhouseb · on Dec 24, 2009

Why wouldn't such a company run their own name servers? I understand it's "yet another thing to maintain," but I've set up BIND before... didn't seem that bad.

evgen · on Dec 24, 2009

This is one of those services that a dedicated provider can sometimes do better than internal IT. UltraDNS, for example, has secure secondaries colocated with large ISPs so you get some good protection against cache poisoning attacks. Everything they do you could do yourself, but it would cost you a lot more than what they charge. (full disclosure: I am a customer and until this evening I was quite happy with their service and reliability)

newhouseb · on Dec 24, 2009

"Colocated with large ISPs so you get some good protection against cache poisoning attacks"

Ah yes, that's not something I factored into my equation. Thanks!

count · on Dec 24, 2009

Amazon IS a large ISP though...

andrewvc · on Dec 24, 2009

Yep, they are, to all the tenants in AWS datacenters.

My apartment, however, can currently only talk to Time Warner, which talks to Level 3 and then, finally, Amazon. I believe what the parent meant was that UltraDNS is colo'd with end-user ISPs like Time Warner, Verizon, etc.

newhouseb · on Dec 24, 2009

Right, a lot of DNS attacks rely on the attacker replying to a request faster than the actual DNS server and there's no one closer to the end user than ISPs.

rbranson · on Dec 24, 2009

It's relatively trivial to outsource. It's one of those services that's easy to measure, quantify, and manage (from an outsourced perspective). There's also a bit more to it than that. The Anycast routing can be quite difficult to set up and maintain. It's virtually useless outside of a very small set of protocols (DNS being one of them), so it wouldn't make sense for Amazon to bring that kind of talent in-house for something like DNS.

kierank · on Dec 24, 2009

"The Anycast routing can be quite difficult to set up and maintain."

Anycast's not particularly difficult to set up and maintain.

lsc · on Dec 24, 2009

the mistake was outsourcing to only one provider. it's easy enough to set up a BIND slave elsewhere that automatically transfers the zone from your primary provider.

rbranson · on Dec 24, 2009

I'm not sure that "set up a BIND slave" is quite that easy for someone with as much traffic as Amazon.com.

lsc · on Dec 24, 2009

I'm sure they could hire the muscle to do it.

I'm just saying, relying on just one company is usually a bad idea.

tybris · on Dec 24, 2009

Internet can be so fragile.