
SSH is not disallowed, because there are still cases where it is necessary to SSH in to service faulty hardware. However, SSH is disallowed for making ad hoc changes to any system, unless the change is made by removing, installing, or upgrading a configuration OS package. It would be better to say that any commands which manipulate system state by hand are disallowed. In the worst-case scenario, every system is a throwaway: since the configuration is 100% OS packaged, I can just re-instantiate a server, be it physical or a VM. I don't have AWS or hosting trouble because I didn't fall for the external hosting hype, and with everything system engineered, I run everything on-premise for peanuts. AWS and Azure just cannot compete with me on cost.
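
For concreteness, a minimal sketch of how "configuration is 100% OS packaged" could be verified: a check that every file under /etc belongs to some installed package. This assumes an RPM-based distribution, which the comment above doesn't actually specify; on Debian-family systems the equivalent query is dpkg -S.

    #!/usr/bin/env python3
    """Check that every file under /etc is owned by some installed package,
    i.e. that configuration really is 100% OS-packaged. Assumes an RPM-based
    distribution; on Debian-family systems the equivalent query is dpkg -S."""

    import os
    import subprocess

    def unowned_files(root="/etc"):
        unowned = []
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # `rpm -qf` exits non-zero when no package owns the file.
                # (One process per file is slow; fine for an illustration.)
                result = subprocess.run(
                    ["rpm", "-qf", path],
                    stdout=subprocess.DEVNULL,
                    stderr=subprocess.DEVNULL,
                )
                if result.returncode != 0:
                    unowned.append(path)
        return unowned

    if __name__ == "__main__":
        strays = unowned_files()
        for path in strays:
            print("NOT PACKAGED:", path)
        raise SystemExit(1 if strays else 0)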



I'm left with just questions at this point. Assume production, not development or UAT or staging or integration, etc.

How do you enforce that kind of compliance on SSH sessions? How do you audit for SSH sessions that invoke commands that mutate the state of the system? How do you account for configuration drift in the event that an SSH session mutates machine state outside of those compliance and auditing mechanisms?
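
As a point of reference, one very rough way to answer the drift question is to ask the package database itself what has changed. The sketch below assumes RPM-based hosts (not stated above) and simply wraps rpm -Va, which reports files whose size or digest no longer match what the owning package shipped.

    #!/usr/bin/env python3
    """Flag configuration drift by asking the package database what changed.
    `rpm -Va` reports files whose size, digest, mode, etc. no longer match
    what the owning package shipped, which is exactly the residue an ad hoc
    SSH session tends to leave behind. RPM-based systems assumed."""

    import subprocess

    def config_drift():
        # rpm -Va exits non-zero when anything fails verification,
        # so don't use check=True here.
        proc = subprocess.run(["rpm", "-Va"], capture_output=True, text=True)
        drifted = []
        for line in proc.stdout.splitlines():
            parts = line.split()
            if len(parts) < 2:
                continue
            flags, path = parts[0], parts[-1]
            # '5' = digest changed, 'S' = size changed, "missing" = file gone.
            if flags == "missing" or "5" in flags or "S" in flags:
                drifted.append(path)
        return drifted

    if __name__ == "__main__":
        for path in config_drift():
            print("DRIFT:", path)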

Why are you SSHing in to curate installed packages on machines manually instead of letting a deployment agent/service take care of that?
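
For illustration only, here is the kind of converge step a deployment agent would run on a schedule instead of a human over SSH. The manifest path and the "cfg-" package naming convention are hypothetical, as is the dnf/RPM tooling; nothing above says which distribution or agent is in use.

    #!/usr/bin/env python3
    """Toy converge step of the kind a deployment agent runs on a schedule,
    so nobody has to curate packages over SSH. Assumes an RPM/dnf host, a
    hypothetical manifest at /etc/desired-packages.txt, and a hypothetical
    "cfg-" naming convention for the configuration packages the agent is
    allowed to manage. Real agents add scheduling, reporting and rollback."""

    import subprocess

    MANIFEST = "/etc/desired-packages.txt"   # hypothetical: one package name per line
    PREFIX = "cfg-"                          # hypothetical config-package naming convention

    def installed_config_packages():
        out = subprocess.run(
            ["rpm", "-qa", "--qf", "%{NAME}\n"],
            capture_output=True, text=True, check=True,
        ).stdout
        return {name for name in out.splitlines() if name.startswith(PREFIX)}

    def converge():
        with open(MANIFEST) as fh:
            desired = {line.strip() for line in fh if line.strip()}
        current = installed_config_packages()
        to_install = sorted(desired - current)
        to_remove = sorted(current - desired)   # only cfg-* packages are ever removed
        if to_install:
            subprocess.run(["dnf", "-y", "install", *to_install], check=True)
        if to_remove:
            subprocess.run(["dnf", "-y", "remove", *to_remove], check=True)

    if __name__ == "__main__":
        converge()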

Furthermore, what kind of environment are you operating in where your response to a hardware fault is to SSH in instead of immediately evacuating the workloads from the faulty machine, replacing it with a new one, and taking the machine to a triage location for further investigation?
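
A sketch of that evacuate-first flow is below, with deliberately fake helpers: the drain, inventory, and ticketing calls are stand-ins for whatever orchestration a given shop actually has, and here they only print so the sketch stays self-contained.

    #!/usr/bin/env python3
    """Sketch of the evacuate-first flow described above. The three helpers
    are stand-ins for real orchestration, inventory and ticketing APIs."""

    def drain_workloads(host: str) -> None:
        print(f"draining workloads from {host}")    # e.g. stop scheduling, migrate jobs

    def mark_out_of_service(host: str) -> None:
        print(f"marking {host} out of service")     # inventory/CMDB update

    def open_triage_ticket(host: str) -> None:
        print(f"opening triage ticket for {host}")  # hardware goes to the bench

    def handle_hardware_fault(host: str) -> None:
        # Order matters: get the data and traffic off first, investigate later.
        drain_workloads(host)
        mark_out_of_service(host)
        open_triage_ticket(host)

    if __name__ == "__main__":
        handle_hardware_fault("rack42-node07")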

Do you operate at a small enough scale that leaving a faulty machine active on the network, installing packages by hand, and SSHing into individual machines is actually sensible?

Do you have a lot of false hardware alarms where the response is to SSH in, run a few commands, and then bail when the going looks good? What kind of monitoring practices do you employ?


SSH in and get caught messing with the system manually, and you get fired. That keeps everyone honest.

The environment is close to 90,000 systems. We service the hardware because it is configured redundantly; for example, the / filesystem is always on a mirror. The physical systems are system engineered and configured to withstand multiple hardware faults without any loss of service.
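
As an aside, a mirrored root of the kind described above can be health-checked mechanically. The sketch below assumes Linux md software RAID (the comment doesn't say whether the mirror is md, hardware RAID, or something else) and simply parses /proc/mdstat.

    #!/usr/bin/env python3
    """Health check for a mirrored root, assuming Linux md (software RAID);
    hardware RAID or ZFS mirrors need their own tooling. In /proc/mdstat a
    healthy two-way mirror shows [UU]; a degraded one shows [U_] or [_U]."""

    import re

    def degraded_arrays(mdstat_path="/proc/mdstat"):
        degraded = []
        current = None
        with open(mdstat_path) as fh:
            for line in fh:
                name = re.match(r"^(md\d+)\s*:", line)
                if name:
                    current = name.group(1)
                    continue
                status = re.search(r"\[([U_]+)\]", line)
                if current and status and "_" in status.group(1):
                    degraded.append(current)
        return degraded

    if __name__ == "__main__":
        bad = degraded_arrays()
        if bad:
            print("degraded arrays:", ", ".join(bad))
        else:
            print("all md arrays healthy")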


You keep saying these things, but I'm less and less convinced that you have a significant hand in how these things are "system engineered," as you put it. I'm also concerned by how few of my questions you actually answered.

"SSH in and get caught messing with the system manually" is an extremely hand-wavey answer, especially in an environment of O(1e6) machines. I'd expect such an environment to have a rather significant degree of automated compliance and audit controls in place.

You'll also note that I didn't say not to service machines. I asked why you would prefer to leave a machine with an implied, potentially data-destroying fault in the rack rather than immediately swap it out with a new machine that has been verified not to destroy data. The servicing part comes into play here in order to mark the previously faulty machine as no longer faulty.

In particular, rack space is expensive, and certifying a machine as fit for service again can take a long time, so it's a bit of a waste of real estate to leave a machine that can't serve production traffic in place when you can quickly swap it out with one that can.

Furthermore, redundancy doesn't help when you have an HBA or an Ethernet controller or a memory controller or a CPU or a piece of PCIe interconnect hardware or a hard disk with a firmware bug that corrupts data. At that point, your redundancy becomes a liability, and you could wind up propagating corrupt data throughout your system.
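
The usual counter to that failure mode is an end-to-end checksum above the redundancy layer: record a digest when data is written and refuse to replicate or serve anything that no longer matches. The toy below illustrates the idea in a few lines; ZFS, Btrfs, and most distributed stores do the same thing at much finer granularity.

    #!/usr/bin/env python3
    """Toy end-to-end checksum: verify a block's digest before replicating
    it, so a corrupt copy produced by buggy firmware is never mirrored."""

    import hashlib

    def write_block(data: bytes) -> tuple[bytes, str]:
        """Return the block plus the digest to store alongside it."""
        return data, hashlib.sha256(data).hexdigest()

    def safe_to_replicate(data: bytes, expected_digest: str) -> bool:
        """Verify before propagating, so corruption is caught, not copied."""
        return hashlib.sha256(data).hexdigest() == expected_digest

    if __name__ == "__main__":
        block, digest = write_block(b"customer record 42")
        corrupted = block[:-1] + b"?"                # simulate a firmware bit flip
        print(safe_to_replicate(block, digest))      # True
        print(safe_to_replicate(corrupted, digest))  # False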

This all said, I'll agree that the louder and more obvious hardware faults like disk crashes are relatively easy to cope with, so in those cases, I'd likely also leave the machine in the rack and just perform the respective component swaps. The place where I predict we'll disagree is whether to evacuate the machine's workloads prior to servicing.

So, again, I'll assert that you likely have less of a hand in the engineering of these things than you're letting on. That's nothing to be ashamed of, but it does make your responses seem less genuine when you try to pass off your experience working in such an environment as expertise in designing such an environment.


I have single-handedly designed entire data centers, irrespective of whatever impression you might get from my responses, which are admittedly terse because I'm usually tapping them out on a mobile phone, and that starts to severely piss me off really quickly. Like now. Which is why I'm going to stop right here.





