SSH in and get caught messing with the system manually, and you get fired. Keeps everyone honest.
The environment is close to 90,000 systems. We service the hardware because it is configured redundantly; for example, the / filesystem is always on a mirror. The physical systems are system engineered and configured to withstand multiple hardware faults without any loss of service.
You keep saying these things, but I'm less and less convinced that you have a significant hand in how these things are "system engineered," as you put it. I'm also concerned by how few of my questions you actually answered.
"SSH in and get caught messing with the system manually" is an extremely hand-wavy answer, especially in an environment of O(1e5) machines. I'd expect such an environment to have a rather significant degree of automated compliance and audit controls in place.
You'll also note that I didn't say not to service machines. I asked why you would prefer to leave a machine with an implied, potentially data-destroying fault in the rack rather than immediately swap it out for a new machine that has been verified not to destroy data. The servicing part comes into play here in order to mark the previously faulty machine as no longer faulty.
In particular, rack space is expensive, and certifying a machine as fit for service again can take a long time, so it's a bit of a waste of real estate to leave a machine that can't serve production traffic in place when you can quickly swap it out with one that can.
Furthermore, redundancy doesn't help when you have an HBA, an Ethernet controller, a memory controller, a CPU, a piece of PCIe interconnect hardware, or a hard disk with a firmware bug that corrupts data. At that point, your redundancy becomes a liability, and you could wind up propagating corrupt data throughout your system.
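To make that concrete, here's a toy sketch in Python. Everything in it is made up for illustration: `faulty_controller` stands in for a hypothetical HBA that flips one bit, and `mirrored_write` is a stand-in for a mirror, not any real storage API. It only shows that both legs of a mirror can agree with each other and still be wrong, and that you need something like a checksum taken before the data hits the hardware to notice.

```python
import hashlib
from typing import Callable, Tuple


def faulty_controller(data: bytes) -> bytes:
    """Hypothetical buggy HBA: flips the lowest bit of the first byte."""
    if not data:
        return data
    return bytes([data[0] ^ 0x01]) + data[1:]


def mirrored_write(data: bytes,
                   controller: Callable[[bytes], bytes]) -> Tuple[bytes, bytes]:
    """Toy mirror: the same controller output lands on both legs."""
    on_disk = controller(data)
    return on_disk, on_disk


def checksum(data: bytes) -> str:
    """SHA-256 of the payload, used as an end-to-end integrity check."""
    return hashlib.sha256(data).hexdigest()


payload = b"production record"
expected = checksum(payload)  # computed before the data touches the hardware

leg_a, leg_b = mirrored_write(payload, faulty_controller)

# The mirror looks healthy: both legs hold identical bytes.
mirror_consistent = leg_a == leg_b

# The end-to-end check disagrees: what is on disk is not what was written.
data_intact = checksum(leg_a) == expected

print("mirror legs agree:", mirror_consistent)
print("data matches original:", data_intact)
```

In this toy model the mirror comparison passes and the checksum comparison fails, which is the case where redundancy alone gives you false comfort.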
This all said, I'll agree that the louder and more obvious hardware faults like disk crashes are relatively easy to cope with, so in those cases, I'd likely also leave the machine in the rack and just perform the respective component swaps. The place where I predict we'll disagree is whether to evacuate the machine's workloads prior to servicing.
So, again, I'll assert that you likely have less of a hand in the engineering of these things than you're letting on. That's nothing to be ashamed of, but it does make your responses seem less genuine when you try to pass off your experience working in such an environment as expertise in designing such an environment.
I have single-handedly designed entire data centers, irrespective of which impression you might get from my responses, which are admittedly terse because I'm usually on a mobile phone tap-tapping them out, and that starts to severely piss me off really quickly. Like now. Which is why I'm going to stop right here.