
No, the answer is to keep all of your VMs stateless and use autoscaling with appropriate health checks, even if you only have a min/max of 1.



Describe a health check that can detect any possible hardware problem.

The error rate on the machines was higher in both cases, but many requests still succeeded. Amazon certainly didn't detect an issue right away either.


Is there really no way to record metrics (even custom metrics populated into CloudWatch via the CloudWatch Logs agent) and, once errors exceed a certain threshold, bring up another instance and kill the existing one? If you could detect the sporadic errors, there must be some way to automate the response.

I’m assuming this isn’t a web server; if it is, it’s even simpler.
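
To make the idea concrete, here is a minimal sketch of the threshold rule in plain Python. It is not CloudWatch or Auto Scaling code: the class name, window size, threshold, and minimum sample count are all hypothetical values chosen for illustration, and in practice the "replace" decision would be wired to whatever alarm and instance-replacement mechanism you use.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks recent request outcomes and flags an instance for replacement.

    All defaults are illustrative assumptions, not values from the thread.
    """

    def __init__(self, window=200, threshold=0.05, min_samples=50):
        # Keep only the most recent `window` outcomes (True = success).
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, ok):
        """Record the outcome of one request."""
        self.outcomes.append(bool(ok))

    def error_rate(self):
        """Fraction of failed requests in the current window."""
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def should_replace(self):
        """True once there are enough samples and the error rate
        is strictly above the threshold."""
        enough = len(self.outcomes) >= self.min_samples
        return enough and self.error_rate() > self.threshold
```

The minimum sample count is there so that one failed request right after startup does not trigger a replacement on its own.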


A statistical rule moves you into the realm of deciding what rate of false positives and false negatives you'll tolerate. In this case that decision would be based on data from exactly two incidents, which is obviously a bit fraught.
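
A rough sketch of that trade-off, assuming requests fail independently at a fixed rate (a simplification) and a rule of the form "replace the instance if at least k of the last n requests failed". The function names and any error rates you plug in are hypothetical; the thread gives no actual numbers.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(at least k failures in n independent requests), each failing
    with probability p (binomial upper tail)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def false_positive_rate(n, k, healthy_error_rate):
    """Chance that a healthy machine trips the rule in one window."""
    return prob_at_least(k, n, healthy_error_rate)


def false_negative_rate(n, k, faulty_error_rate):
    """Chance that a faulty machine does NOT trip the rule in one window."""
    return 1.0 - prob_at_least(k, n, faulty_error_rate)
```

Raising k lowers the false positive rate and raises the false negative rate. Picking k and n requires estimates of both the healthy and the faulty error rates, and two incidents give very little to estimate the latter from.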



