
HPC admin here (and possibly managing a system topology similar to the one in their room).

In heterogeneous system rooms, you can't stuff everything into a virtualization cluster with shared storage and migrate things on the fly, treating every physical server as cattle whose VMs you can just herd from host to host.

A SLURM cluster is easy. Shut down all the nodes and the controller will say "welp, no servers to run the workloads, I'll wait until they come back." Storage systems are not that easy: they have strict bring-up ordering, controller dependencies, volume dependencies, service dependencies, etc.
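
To illustrate the ordering problem, here's a minimal sketch (the service names and the dependency graph are made up, not our actual stack): model bring-up as a topological sort over service dependencies, and shutdown as the reverse.

    import graphlib  # stdlib, Python 3.9+

    # Hypothetical dependency map: service -> services it must wait for.
    deps = {
        "metadata-controllers": [],
        "storage-targets": ["metadata-controllers"],
        "filesystem-mounts": ["metadata-controllers", "storage-targets"],
        "slurm-controller": ["filesystem-mounts"],
        "compute-nodes": ["slurm-controller"],
    }

    # TopologicalSorter takes {node: predecessors}; static_order()
    # yields a valid bring-up order. Reverse it for shutdown.
    order = list(graphlib.TopologicalSorter(deps).static_order())
    print("bring-up: ", " -> ".join(order))
    print("shutdown:", " -> ".join(reversed(order)))

Get one edge wrong and the mounts hang or the volumes come up dirty.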

There are also servers which can't be virtualized because they're hardware-dependent, latency-sensitive, or simply consume all the resources of the box they run on.

We also have some pet servers, and some cattle. Some servers get a "pfft" when they fail; for others we scramble, for various reasons. We know which server runs which service by its hostname, and we never stand up a pet server without the team's knowledge. So if something important goes down, everyone can at least attend to the OS or the hardware it's running on.

Even in a cloud environment, you can't move a vSwitch VM wherever you want, because you can't put the root of a fat SDN tree on every node. Even the most flexible infrastructure has firm parts that support the flexibility; it's impossible otherwise.

Lastly, not knowing which servers are important is a big no-no. We had "glycol everywhere" incidents and serious heatwaves, and all we say is, "we can't cool room down, scale down". Everybody shuts the servers they know they can, even if somebody from the team is on vacation.
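
What makes that drill workable is a shared inventory with a criticality tag, so anyone can tell what's safe to power off. A rough sketch (the inventory format, the tag scale, and the power-off command are all hypothetical; swap in your own BMC/IPMI tooling):

    import csv
    import subprocess

    # Hypothetical inventory: hostname,criticality
    # 1 = must stay up, 2 = ask first, 3 = safe to stop in a heat event.
    with open("inventory.csv") as f:
        hosts = list(csv.DictReader(f))

    for host in hosts:
        if int(host["criticality"]) >= 3:
            print(f"powering off {host['hostname']}")
            # Soft power-off over SSH; check=False so one dead host
            # doesn't stop the rest of the scale-down.
            subprocess.run(
                ["ssh", host["hostname"], "sudo", "systemctl", "poweroff"],
                check=False,
            )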

Being a sysadmin is a team game.



