> We used multiple availability zones and it didn't prevent downtime at all.
Can you explain this a little more? Amazon says this only affected one AZ, and they specifically note:
> For this event, customers that were running their applications across multiple
> Availability Zones in the Region were able to maintain availability throughout
> the event.
Apart from one internal project which mistakenly had all its app server instances in -2b (ooops!), all my production mobile app backends are spread across the 3 Sydney AZs. That's a few dozen EC2 app servers across about 15 projects.
My monitoring reported a worst case of 57 seconds of degraded connectivity - an instance in -2b went offline and the ELB didn't take it out of the rotation very quickly, so the app running on that backend saw interruptions, but only while requests waited for the timeouts. Crashlytics and GA crash reporting didn't bat an eyelid. I had under 70 users active at the time; a third of them may have seen a minute or less of loading spinner if they'd fired off a UI-blocking API call during those 57 seconds. I'm not looking _super_ closely, but nothing I'm monitoring apart from EC2 - like RDS, S3, ELB, SNS - showed _any_ glitches (I'd _probably_ have caught even single-digit-second problems for _some_ of that).
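For what it's worth, I suspect that ~minute window was mostly down to the classic ELB health check defaults (30-second interval and a couple of failed checks before removal, if I remember right) rather than anything exotic. Something like the sketch below tightens that window; it's boto3 against a classic ELB, and the load balancer name and health check target are placeholders, not our actual config:

```python
import boto3

# Classic ELB client for the Sydney region.
elb = boto3.client("elb", region_name="ap-southeast-2")

# Check every 10s and pull an instance after 2 consecutive failures,
# so a dead backend is out of rotation in roughly 20 seconds instead
# of close to a minute with the defaults.
elb.configure_health_check(
    LoadBalancerName="my-app-elb",          # placeholder name
    HealthCheck={
        "Target": "HTTP:80/healthcheck",    # placeholder check path
        "Interval": 10,
        "Timeout": 5,
        "UnhealthyThreshold": 2,
        "HealthyThreshold": 3,
    },
)
```

The obvious trade-off: a tighter interval and threshold pulls a sick instance faster, but also makes the ELB twitchier about transient blips.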
I'm actually quite happy with how everything went - we don't go to any particular heroic lengths to ensure HA or uptime; we just follow recommended best practice, and at least in this outage that worked out fine for us (except for that project where all the app servers were in -2b, and I'm happy to wear that as our fuckup).
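By "recommended best practice" I basically mean the usual multi-AZ shape: an auto scaling group spread across the three Sydney AZs behind an ELB, with ELB health checks driving instance replacement. A minimal sketch of the idea in boto3 (the group, launch config and ELB names are made up, not our literal setup):

```python
import boto3

asg = boto3.client("autoscaling", region_name="ap-southeast-2")

# One ASG spanning all three Sydney AZs, so losing -2b leaves the
# other two zones serving traffic while replacements come up.
asg.create_auto_scaling_group(
    AutoScalingGroupName="mobile-backend-asg",    # placeholder
    LaunchConfigurationName="mobile-backend-lc",  # placeholder
    MinSize=3,
    MaxSize=6,
    DesiredCapacity=3,
    AvailabilityZones=[
        "ap-southeast-2a",
        "ap-southeast-2b",
        "ap-southeast-2c",
    ],
    LoadBalancerNames=["my-app-elb"],             # placeholder
    HealthCheckType="ELB",       # replace instances the ELB marks unhealthy
    HealthCheckGracePeriod=300,
)
```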