I have a very nice quote from a discussion I remember:
"If you need to get up at 3AM to keep services running, you're doing something wrong."
You can make sure that most services in the *NIX world take care of themselves while you're away, without using any fancy SaaS or PaaS offering.
Heck, you can even do failovers with heartbeatd. Even with a serial cable if you feel fancy.
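To make that concrete, here's a rough sketch of both halves: the init system babysitting a process locally, and a VRRP daemon floating a service IP between two boxes. This assumes systemd and keepalived (rather than the old Linux-HA heartbeat), and every name and address below is made up:

    # /etc/systemd/system/myapp.service -- hypothetical unit
    [Unit]
    Description=My service, restarted automatically on failure

    [Service]
    ExecStart=/usr/local/bin/myapp
    Restart=on-failure
    RestartSec=5

    [Install]
    WantedBy=multi-user.target

    # /etc/keepalived/keepalived.conf -- hypothetical two-node failover
    vrrp_instance VI_1 {
        state MASTER            # BACKUP on the peer node
        interface eth0
        virtual_router_id 51
        priority 100            # lower on the peer node
        advert_int 1
        virtual_ipaddress {
            192.0.2.10/24       # floating service IP clients connect to
        }
    }

If the box holding the MASTER role disappears, the peer claims the address after a few missed advertisements and nobody has to get out of bed.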
Bonus: This is the first thing I try to teach anyone in system administration and software engineering. "Do your work with good quality, so you don't have to wake up at 3AM".
I'd pick a call in the morning any time, given that the cause of the call occurs rarely and the alternative is to spend a lot of time automating things, with the possibility of blowing things up in a much bigger way. If a situation like [0] had happened to me at night, I'd happily give up some sleep and do a manual standby server promotion (or no promotion at all) rather than spend days recovering from diverged servers that the Raft kool-aid was supposed to save me from.
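For the PostgreSQL case, the manual promotion being described is basically a one-liner once you've decided the old primary is dead and staying dead. A sketch, assuming PostgreSQL 12+ and a made-up data directory:

    # on the standby, once you're sure the old primary is gone for good
    pg_ctl promote -D /var/lib/postgresql/15/main   # path is hypothetical
    # or, from psql on the standby (PostgreSQL 12+):
    #   SELECT pg_promote();
    # then repoint clients (DNS entry, floating IP, connection strings) at the new primary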
I'm not against your point of view, to be honest. It's a perfectly rational and pragmatic way to act.
I'm also not advocating that "complete, complex automation" is the definitive answer to this problem. On the contrary, I advocate "incremental automation", which solves a single problem in a single step. If well documented, it works much better and more reliably in the long run and can be maintained with ease.
Quoting John Gall:
> A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work.
I'm a good engineer, but I'm not qualified as a system administrator. I know that I will do something wrong, and I don't have the time to learn everything.
You can try to build and test redundancy and contingency management, and you can lower the frequency of surprises through good choices.
But you're still going to get woken up at 3am sometimes. Things break, in unexpected ways. Maybe the hot spare didn't actually work when a raid set started rebuilding onto it. Maybe third party software did something unexpected. Or maybe something broke and your failover didn't actually work because of subtle configuration drift since the last test.
We have a standard routine called the restart test: we reboot the machine the normal way, but in the middle of a workload, to see how it behaves. Also, if the system is critical, sometimes we just yank the power cables to see what happens.
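For anyone who can't physically pull cables, a rough software approximation of the same torture test is a clean reboot under load plus a forced reboot that skips syncing entirely (assumes the magic sysrq key is enabled):

    # clean reboot in the middle of a workload
    systemctl reboot

    # harsher variant: reboot immediately, without syncing or unmounting,
    # which is roughly what a yanked power cable looks like to the software
    echo 1 > /proc/sys/kernel/sysrq
    echo b > /proc/sysrq-trigger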
Normally all plausible scenarios are tested and systems are thoroughly tortured before being put into production.
It also helps that the majority of our servers are cattle, not pets, so a missing one can wait until morning. All the "pet" servers have active and tested failover, so they can wait until morning too.
We once had a problem with a power outage when our generators failed to kick in, so we lost the whole datacenter. Even in that case we were able to return to fully operational in two hours or less.
I forgot to add: we install from well-tested templates, so installations have no wiggle room configuration-wise. If something is working, we can replicate it pretty reliably.
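The comment doesn't say what their templating looks like; one common way to get that "no wiggle room" property is to keep the whole host definition declarative, for example a cloud-init user-data file baked into the template image (contents here are made up):

    #cloud-config
    packages:
      - postgresql
      - keepalived
    write_files:
      - path: /etc/myapp/app.conf        # hypothetical application config
        content: |
          listen = 0.0.0.0:8080
    runcmd:
      - systemctl enable --now postgresql keepalived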
But you probably don't yank power on critical things mid-load after making a trivial change. Excessive testing breeds its own risks.
But it's really, really easy to bungle a trivial change now and then.
In the past 10 years, I've been woken up three times. One was from third-party software having a certificate that we didn't know about expiring; one was from a very important RAID set degrading and failing to auto-rebuild to the hot spare (it was RAID-10, so we didn't want to leave it with a single copy of one stripe any longer than necessary); and one was from a bad "trivial change" that actually wasn't. I don't see how you can get to a rate much lower than this if you are running critical, 24x7 infrastructure.
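The certificate case in particular is the kind of thing that's cheap to turn into a daytime warning instead of a 3AM page. A minimal sketch with openssl, with the hostname and 30-day threshold made up:

    # exit non-zero (and print a warning) if the served cert expires within 30 days
    echo | openssl s_client -connect example.internal:443 -servername example.internal 2>/dev/null \
      | openssl x509 -noout -checkend 2592000 \
      || echo "WARNING: certificate for example.internal expires within 30 days"

Run from cron or a monitoring check, that turns a surprise outage into an ordinary ticket.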
Doesn't look like you have much experience with what you're talking about - there is no such thing as heartbeatd; it's called keepalived (or Pacemaker, if you prefer unnecessarily complex solutions). No ops person would even misspell that.
Having just set up HA Postgres with Patroni, I disagree. Honestly, I think we should've just stuck with a single Postgres server.
Sure, you can have an orchestration tool to "make sure everything is running, and respond to failures", but that's yet another tool that can break, be misconfigured, etc.
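For a sense of how many moving parts that adds: each node ends up running Patroni next to Postgres, talking to a distributed config store and exposing a REST API for health checks. A heavily trimmed, hypothetical patroni.yml (not a working config) looks roughly like this:

    scope: demo-cluster
    name: pg-node-1
    restapi:
      listen: 0.0.0.0:8008
      connect_address: 10.0.0.11:8008
    etcd3:                      # the DCS (etcd/Consul/ZooKeeper) is yet another cluster to run
      hosts: 10.0.0.21:2379,10.0.0.22:2379,10.0.0.23:2379
    postgresql:
      data_dir: /var/lib/postgresql/15/main
      listen: 0.0.0.0:5432
      connect_address: 10.0.0.11:5432
      authentication:
        replication:
          username: replicator
          password: change-me

Every one of those pieces (the DCS quorum, the REST endpoints, the replication credentials) can itself break or drift, which is exactly the trade-off being described.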
"If you need to get up at 3AM to keep services running, you're doing something wrong."
You can make sure that most of the services in *NIX world to take care of itself while you're away without using any fancy SaaS or PaaS offering.
Heck, even you can do failovers with heartbeatd. Even with a serial cable if you feel fancy.
Bonus: This is the first thing I try to teach anyone in system administration and software engineering. "Do your work with good quality, so you don't have to wake up at 3AM".