We're coloed across three datacenters spanning the US (one might be in TO I thin...

tomgallard · on Oct 26, 2012

But a DNS-based failover is still going to take an hour or so to propagate, right? A lot of browsers/proxies/DNS servers don't respect TTL very well at all. And then you end up with a system with stale data, and the mess of trying to reconcile it when your other system comes back up.

I'd take an hour-long App Engine outage once a year over that anytime!

0xbadcafebee · on Oct 26, 2012

Your name server or stub resolver is what respects DNS TTL, not your browser or proxy. Everyone - including people hosting on AWS - needs to be able to fail over DNS, if the AWS IP you're using is in a zone that just went down, for example.

Any time you have an outage you need to contact your service provider to get an estimate of downtime. If they can't give you one, assume it'll take forever and cut the DNS over. The worst case is some of your users will start to come back online slowly. If you don't cut over, the worst case is all your users are down until whenever the service provider fixes it, and you get to tell your users "we're waiting for someone else to deal with it", which won't make them very happy.

12 hours of stale data sounds kind of long to me. 4 hours sounds more reasonable.
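
The cutover rule described above (no estimate from the provider means assume the outage lasts forever and fail over) can be sketched in a few lines of Python. This is only an illustration: the function name `should_cut_over` and the `max_tolerable_minutes` threshold are hypothetical and not something the comment specifies; the comment only covers the "no estimate" case.

```python
from typing import Optional


def should_cut_over(provider_eta_minutes: Optional[float],
                    max_tolerable_minutes: float) -> bool:
    """Decide whether to point DNS at the standby site.

    provider_eta_minutes: the provider's downtime estimate, or None
        if they could not give one.
    max_tolerable_minutes: hypothetical threshold (an assumption of
        this sketch) for how long an outage you would sit through.
    """
    if provider_eta_minutes is None:
        # No estimate: assume it will take forever and cut over.
        return True
    # With an estimate, cut over only if it exceeds what you can tolerate.
    return provider_eta_minutes > max_tolerable_minutes


if __name__ == "__main__":
    print(should_cut_over(None, 60))
    print(should_cut_over(30, 60))
    print(should_cut_over(240, 60))
```

Treating a missing estimate as an unbounded outage keeps the decision simple: the only case where you wait is when the provider commits to a time you can live with.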

codeka · on Oct 27, 2012

I've seen plenty of crappy ISP DNS servers ignore TTL values and cache DNS entries for many hours longer than they're supposed to. Unfortunately, it's all too common.
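
To make the effect concrete, here is a rough sketch of how a resolver that enforces its own minimum cache time stretches the stale window after a cutover. The model (a resolver caching for the larger of the record's TTL and its own floor) and all the numbers are assumptions for illustration, not measurements of any real ISP.

```python
from typing import Iterable


def effective_cache_seconds(record_ttl: int, resolver_min_ttl: int = 0) -> int:
    """How long one resolver keeps the old answer.

    A well-behaved resolver has resolver_min_ttl == 0 and honors the
    record's TTL. A misbehaving one is modeled here as clamping the
    TTL up to its own floor (an assumption of this sketch).
    """
    return max(record_ttl, resolver_min_ttl)


def worst_case_stale_seconds(record_ttl: int,
                             resolver_min_ttls: Iterable[int]) -> int:
    """Longest time any resolver in the set may serve the old IP."""
    return max(effective_cache_seconds(record_ttl, floor)
               for floor in resolver_min_ttls)


if __name__ == "__main__":
    # Hypothetical: a 5-minute TTL, and resolvers with floors of
    # 0 seconds (compliant), 1 hour, and 4 hours.
    floors = [0, 3600, 14400]
    print(worst_case_stale_seconds(300, floors))
```

Under this model, lowering the record's TTL only helps users behind compliant resolvers; the tail is set by whichever resolver has the largest floor.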

stickfigure · on Oct 26, 2012

When you can script away 90% of your system administration tasks, hosting in the cloud doesn't really make a ton of sense.

How big is your ops team? I'm guessing it's more than 0.

debacle · on Oct 26, 2012

Ops team? We're a two-man operation with occasional contractors.

stickfigure · on Oct 26, 2012

In that case, what is the ratio of "time spent doing ops-related tasks" vs "time spent developing new features" in your company? Please offer an honest evaluation. Everything has a cost; I'm genuinely curious about data points other than my own.

debacle · on Oct 26, 2012

I probably spend no more than an hour a week on ops, and most of that is reading emails from our service providers.

stickfigure · on Oct 26, 2012

Today, maybe, assuming a calm ocean and no scaling issues. But I don't believe you spent an hour a week setting up your three data centers, backups, failover procedure, etc.

debacle · on Oct 26, 2012

The backup script was written in a night, and the most complicated part about failover is remembering to sync data when the outage is over.