I don't know the last time I had an instance go down. Not because it doesn't happen, but because it's sufficiently unimportant that we don't alert on it. Our ASG just brings up another to replace it.
Many applications won't be as resilient. That's the trade-off. We don't have a single stateful application. RDS/Redis/Dynamo/SQS are all managed by someone else. We had to build things differently to accommodate that constraint, but as a result we have ~35M active users and only 2 ops engineers, who spend their time building automation rather than babysitting systems.
If you lean in completely, it's a great ecosystem. Otherwise it's a terrible co-lo experience.
Many applications won't be as resilient. That's the trade-off. We don't have a single stateful application. RDS/Redis/Dynamo/SQS are all managed by someone else. We had to build things differently to accommodate that constraint, but as a result we have ~35M active users and only 2 ops engineers, who spend their time building automation rather than babysitting systems.
If you lean in completely, it's a great ecosystem. Otherwise it's a terrible co-lo experience.