
There's a very nice quote I remember from a discussion:

"If you need to get up at 3AM to keep services running, you're doing something wrong."

You can make sure that most services in the *NIX world take care of themselves while you're away, without using any fancy SaaS or PaaS offering.
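
Plain systemd alone will already restart a crashed daemon for you; a minimal sketch, with a hypothetical service name and binary path:

    # /etc/systemd/system/myapp.service -- hypothetical unit
    [Unit]
    Description=Example service that restarts itself on failure
    After=network-online.target

    [Service]
    ExecStart=/usr/local/bin/myapp
    Restart=on-failure
    RestartSec=5

    [Install]
    WantedBy=multi-user.target

Enable it once with "systemctl enable --now myapp" and systemd babysits it from then on.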

Heck, you can even do failovers with heartbeatd. Even over a serial cable if you feel fancy.
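
A rough sketch of a floating-IP failover config (this is keepalived syntax; the interface, router ID and addresses are placeholders):

    # /etc/keepalived/keepalived.conf -- all values are placeholders
    vrrp_instance VI_1 {
        state MASTER              # BACKUP on the standby box
        interface eth0
        virtual_router_id 51
        priority 150              # lower (e.g. 100) on the standby
        advert_int 1
        virtual_ipaddress {
            192.0.2.10/24         # floating service IP
        }
    }

Whichever node wins the VRRP election holds the floating IP; if it goes silent, the standby takes the address over within a few seconds.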

Bonus: This is the first thing I try to teach anyone in system administration and software engineering. "Do your work with good quality, so you don't have to wake up at 3AM".




I'd pick a call in the morning any time, given that the cause of the call occurs rarely and the alternative is to spend a lot of time automating things, with the possibility of blowing things up in a much bigger way. If a situation like [0] had happened to me at night, I'd happily take some time off my sleep and do a manual standby server promotion (or no promotion at all) rather than spend days recovering from diverged servers that the Raft kool-aid was supposed to save me from.

[0] https://github.blog/2018-10-30-oct21-post-incident-analysis/
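
For a Postgres standby, for example, that manual promotion is essentially a one-liner (the data directory path is a placeholder):

    # run on the standby once you've decided it should take over
    pg_ctl promote -D /var/lib/postgresql/data
    # or, on PostgreSQL 12+, from a psql session on the standby:
    #   SELECT pg_promote();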


I'm not against your point of view, to be honest. It's perfectly rational and pragmatic to act this way.

I'm also not advocating that "complete, complex automation" is the definitive answer to this problem. On the contrary, I advocate "incremental automation", which solves a single problem in a single step. If well documented, it works much better and more reliably in the long run and can be maintained with ease.

Quoting John Gall:

> A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work.


I'm a good engineer, but I'm not qualified as a system administrator. I know that I will do something wrong, and I don't have the time to learn everything.

So I'd rather pay amazon.


So instead of learning systems you have to learn the AWS spaghetti of microservices.


You can try and build and test redundancy and contingency management, and you can lower the frequency of surprises through good choices.

But you're still going to get woken up at 3am sometimes. Things break, in unexpected ways. Maybe the hot spare didn't actually work when a raid set started rebuilding onto it. Maybe third party software did something unexpected. Or maybe something broke and your failover didn't actually work because of subtle configuration drift since the last test.


We have a standard routine called the restart test: we reboot the machine in a normal way, but in the middle of a workload, to see how it behaves. Also, if the system is critical, sometimes we just yank the power cables to see what happens.

Normally, all plausible scenarios are tested and systems are well tortured before being put into production.
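
A rough sketch of that kind of torture test (the load generator, BMC host and credentials are all placeholders):

    # kick off a workload, then reboot cleanly in the middle of it
    stress-ng --cpu 4 --vm 2 --timeout 10m &
    sleep 120 && systemctl reboot

    # the "yank the power cables" variant, done via the BMC instead of by hand
    ipmitool -I lanplus -H bmc.example.org -U admin -P secret chassis power off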

It also helps that the majority of our servers are cattle, not pets, so a missing one can wait until morning. All our "pet" servers have an active and tested failover, so they can wait until morning too.

We once had a power outage where our generators failed to kick in, so we lost the whole datacenter. Even in that case we could return to fully operational in two hours or less.

I forgot to add: we install from well-tested templates, so installations have no wiggle room configuration-wise. If something is working, we can replicate it pretty reliably.
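
A rough sketch of what template-based provisioning can look like with plain libvirt tooling (the template and guest names are made up):

    # clone a new guest from a golden, already-tested template image
    virt-clone --original golden-template --name app-01 --auto-clone
    virt-sysprep -d app-01 --hostname app-01   # reset machine-specific bits
    virsh start app-01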


Sure, this is typical of well-run environments.

But you probably don't yank power on critical things mid-load after making a trivial change. Excessive testing breeds its own risks.

But it's really, really easy to mess up a trivial change now and then.

In the past 10 years, I've been woken up three times. One was from third-party software having a certificate that we didn't know about expiring; one was from a very important RAID set degrading and failing to auto-rebuild to the hot spare (it was RAID-10, so I didn't want to leave it with a single copy of one stripe any longer than necessary); and one was from a bad "trivial change" that actually wasn't. I don't see how you can get to a rate much lower than this if you are running critical, 24x7 infrastructure.
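
The certificate surprise, at least, is the cheap one to guard against; a cron-able check along these lines (the host name and threshold are placeholders) usually turns it into a daytime ticket instead of a 3am page:

    # exit non-zero if the cert presented on :443 expires within 14 days
    echo | openssl s_client -connect thirdparty.example:443 -servername thirdparty.example 2>/dev/null \
      | openssl x509 -noout -checkend $((14*24*3600))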


Doesn't look like you have much experience with what you're talking about: there is no such thing as heartbeatd; it's called keepalived (or Pacemaker, if you prefer unnecessarily complex solutions). No ops person would even misspell that.


Sorry, you're right. I confused it with its cousin, which is indeed called heartbeatd [0].

I'm new to Linux and system administration. I've only been using Linux for 20 years and managing systems for 13.

[0]: https://www.manpagez.com/man/8/heartbeatd/


Having just set up HA Postgres with Patroni, I disagree. Honestly, I think we should've just stuck with a single Postgres server.

Sure, you can have an orchestration tool to "make sure everything is running, and respond to failures", but that's yet another tool that can break, be misconfigured, etc.
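
For a sense of scale, the day-to-day interaction with that extra layer looks roughly like this (the config path and cluster name are made up):

    # show which node Patroni currently considers the leader
    patronictl -c /etc/patroni/patroni.yml list

    # controlled switchover to a replica, e.g. before maintenance
    patronictl -c /etc/patroni/patroni.yml switchover my-cluster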


> If you need to get up at 3AM to keep services running

I just turn off the phone until I get up. That way I don't have to get up at 3 AM; I don't even know they were down until five hours later. :)



