Good morning (posted from a throwaway for reasons I'll describe).

I feel for you greatly here, and I commend your openness about how the data restoration caused 6 hours of data loss. I too work in a critical area where losing even minutes of DB data is bad.

We just had our own test event recently. We make sure that we can fail everything over and run on all secondaries. I found out how that went: we failed. The problem is that I only found out after the fact. Due to the secrecy, not even the teams knew why things failed the way they did. I had to piece it together from disjointed hearsay, and now I believe I have a coherent picture.

So yes, when I read your post mortem and RCA, it reminded me greatly of what happened here as well. But we can all learn from your example. As for me, I'm posting this from a throwaway because doing otherwise would likely put my job at risk.




I agree that 6 hours is way too much.



