"Custom DHCP config"?!? You do realize DHCP stands for "dynamic host configuration protocol"? That's exactly what it was designed for... And if you can't get "working PXE" (PXE is implemented in firmware, by the way), you might want to brush up on the basics of putting together an IT infrastructure, or this moment might be a good time for a career switch out of IT... There are no "Kickstart manifests", it's a ruleset. There is no "RPM registry" (this isn't Windows™️) or anything of the sort. There are no "custom spec files" as it's a formalized format with grammar and lexicography which the OS's software management subsystem understands - it's like putting a VHS tape into a video cassette recorder, modular and standardized.
Day two is where a software deployment server and clearly defined change management process according to the department of defense capability maturity model come into play.
The PXE client is implemented in firmware, but for that client to have anything to talk to, there's a sizable number of moving parts to configure just right in order for it to work well.
Those moving parts usually include telling your dhcpd to hand out additional options (the TFTP server address and boot filename) when clients obtain IP leases. This is in tandem with getting tftpd running, writing your ks.cfg, and writing a suitable pxelinux.cfg so that you can boot into Anaconda to install the OS.
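To make that concrete, the dhcpd side usually boils down to pointing clients at the TFTP server and a bootloader, and the pxelinux side to handing Anaconda a kernel command line that points at your ks.cfg. A minimal sketch, with addresses, paths, and URLs as placeholders:

    # dhcpd.conf fragment (sketch)
    subnet 10.0.0.0 netmask 255.255.255.0 {
      range 10.0.0.100 10.0.0.200;
      option routers 10.0.0.1;
      next-server 10.0.0.10;        # the host running tftpd
      filename "pxelinux.0";        # bootloader served over TFTP
    }

    # /var/lib/tftpboot/pxelinux.cfg/default (sketch)
    # inst.ks/inst.repo are the newer Anaconda boot options; older releases use ks=/repo=
    DEFAULT install
    LABEL install
      KERNEL vmlinuz
      APPEND initrd=initrd.img inst.ks=http://10.0.0.10/ks.cfg inst.repo=http://10.0.0.10/os/ ip=dhcp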
This is all assuming that you even have the appropriate access to do these things in our increasingly cloud-centric world. Amazon, Microsoft, and Google all won't let you use PXE to provision VMs. On top of that, IPv6-only networks pose a problem for netboot installations: PXE doesn't work so well there unless you make the conscious decision to use DHCPv6 instead of stateless autoconfiguration.
--
Now, this all said, you're on something resembling the right track with respect to OS packages. There's a common pattern (at least in my experience) in the public cloud world where you use something like Packer to build a machine image for provisioning a VM in, e.g., EC2. I've used Kickstart to great effect here in conjunction with some rules in the build system for autogenerating specfiles to hand off to rpmbuild.
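For reference, the Packer end of that pattern can be as small as one builder plus a shell provisioner that installs the RPMs your build system produced; a rough sketch, where the AMI ID, region, and package names are placeholders and the base image is assumed to have been built from your Kickstart:

    # build.pkr.hcl (sketch)
    source "amazon-ebs" "app" {
      region        = "us-east-1"
      instance_type = "t3.small"
      source_ami    = "ami-0123456789abcdef0"   # placeholder; e.g. a base AMI built from your Kickstart
      ssh_username  = "ec2-user"
      ami_name      = "myapp-base"
    }

    build {
      sources = ["source.amazon-ebs.app"]

      # Bake in the application and its configuration as RPMs produced by
      # rpmbuild from the (auto)generated specfiles.
      provisioner "shell" {
        inline = ["sudo yum -y install myapp myapp-config"]
      }
    }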
Combine that with a set of functional and integration tests (the apps I wrote during this period of my life were in Ruby, so the tests were in Ruby, but the technology doesn't matter; you could just as well use ksh93 or pdksh) to ensure that the image you built is what you expect, and it's actually rather killer.
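The smoke tests don't have to be fancy, either; something along these lines in ksh93 is a reasonable sketch (package and service names are placeholders):

    #!/bin/ksh93
    # Minimal image smoke test: verify the expected packages are installed
    # and the application service is enabled.
    fail=0
    for pkg in myapp myapp-config; do
        rpm -q "$pkg" >/dev/null 2>&1 || { print -u2 "missing package: $pkg"; fail=1; }
    done
    systemctl is-enabled myapp >/dev/null 2>&1 || { print -u2 "service myapp not enabled"; fail=1; }
    exit $fail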
The only things that yum/rpmbuild truly lack that Nix gives us for free are (1) a package build environment, on Linux at least, that's hermetic all the way down to libc; (2) the ability to install more than one release of a particular library at a time without confusing the dynamic linker machinery; and (3) a way to signal when to rebuild certain packages based on changes to their dependencies.
Apropos deployment and running in production, the intent then is to keep it immutable (you disallow SSH as a general rule after day one, right?) and for your change management process to consider the VM as a whole deployment unit. Want to deploy updated configuration? Deploy new VMs. Want to deploy a new software release? Deploy new VMs.
SSH is not disallowed, because there are still cases where it is necessary to ssh in order to service faulty hardware. However, ssh is disallowed for making any ad hoc changes to any system, unless that is done by removing, installing, or upgrading a configuration OS package. It would be better to say that any commands which manipulate system state by hand are disallowed. In the worst-case scenario, every system is a throwaway one: since the configuration is 100% OS packaged, I can just re-instantiate a server, be it physical or VM. I don't have AWS or hosting trouble because I didn't fall for the external hosting hype; with everything system engineered, I run everything on-premise for peanuts. AWS or Azure just cannot compete with me on cost.
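For illustration, a "configuration OS package" can be as small as a specfile that installs a handful of files under /etc and nothing else; a minimal sketch, with names, versions, and paths as placeholders:

    # webproxy-config.spec (sketch)
    Name:           webproxy-config
    Version:        42
    Release:        1%{?dist}
    Summary:        Managed configuration for the web proxy tier
    License:        Proprietary
    BuildArch:      noarch
    Source0:        squid.conf

    %description
    Configuration for the web proxy tier, installed, upgraded, and rolled
    back like any other OS package.

    %install
    install -D -m 0644 %{SOURCE0} %{buildroot}%{_sysconfdir}/squid/squid.conf

    %files
    %config %{_sysconfdir}/squid/squid.conf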
I'm left with just questions at this point. Assume production, not development or UAT or staging or integration, etc.
How do you enforce that kind of compliance on SSH sessions? How do you audit for SSH sessions that invoke commands that mutate the state of the system? How do you account for configuration drift in the event that an SSH session mutates machine state outside of those compliance and auditing mechanisms?
Why are you SSHing in to curate installed packages on machines manually instead of letting a deployment agent/service take care of that?
Furthermore, what kind of environment are you operating in where your response to a hardware fault is to SSH in instead of immediately evacuating the workloads from the faulty machine, replacing it with a new one, and taking the machine to a triage location for further investigation?
Do you operate at a small enough scale that leaving a faulty machine active on the network, installing packages by hand, and SSHing to individual machines is actually sensible?
Do you have a lot of false hardware alarms where the response is to SSH in, run a few commands, and then bail when the going looks good? What kind of monitoring practices do you employ?
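To be concrete about the auditing question above: the usual starting point I've seen is something like auditd rules that record every command an interactive (SSH) session executes, with the records shipped off-box and reconciled against approved change requests. A minimal sketch:

    # /etc/audit/rules.d/commands.rules (sketch)
    # Log every execve() from a real logged-in user so the records can be
    # shipped off the box and diffed against approved changes.
    -a always,exit -F arch=b64 -S execve -F auid>=1000 -F auid!=unset -k user_commands
    -a always,exit -F arch=b32 -S execve -F auid>=1000 -F auid!=unset -k user_commands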
SSH in and get caught messing with the system manually - get fired. Keeps everyone honest.
The environment is close to 90,000 systems. We service the hardware because it is configured redundantly; for example, the / filesystem is always on a mirror. The physical systems are system engineered and configured to withstand multiple hardware faults without any loss of service.
You keep saying these things, but I'm less and less convinced that you have a significant hand in how these things are "system engineered," as you put it. I'm also concerned by how few of my questions you actually answered.
"SSH in and get caught messing with the system manually" is an extremely hand-wavey answer, especially in an environment of O(1e6) machines. I'd expect such an environment to have a rather significant degree of automated compliance and audit controls in place.
You'll also note well that I didn't say not to service machines. I'd asked why you would prefer to leave a machine with a potentially data-destroying fault in the rack rather than immediately swap it out for a new machine that has been verified not to destroy data. The servicing part comes into play here in order to mark the previously faulty machine as no longer faulty.
In particular, rack space is expensive, and certifying a machine as fit for service again can take a long time, so it's a bit of a waste of real estate to leave a machine that can't serve production traffic in place when you can quickly swap it out with one that can.
Furthermore, redundancy doesn't help when you have an HBA or an ethernet controller or a memory controller or CPU or a piece of PCIe interconnect hardware or a hard disk with a firmware bug that corrupts data. At that point, your redundancy becomes a liability, and you could wind up propagating corrupt data throughout your system.
This all said, I'll agree that the louder and more obvious hardware faults like disk crashes are relatively easy to cope with, so in those cases, I'd likely also leave the machine in the rack and just perform the respective component swaps. The place where I predict we'll disagree is whether to evacuate the machine's workloads prior to servicing.
So, again, I'll assert that you likely have less of a hand in the engineering of these things than you're letting on. That's nothing to be ashamed of, but it does make your responses seem less genuine when you try to pass off your experience working in such an environment as expertise in designing such an environment.
I have single-handedly designed entire data centers, whatever impression you might get from my responses, which are admittedly terse because I'm usually on a mobile phone tap-tapping them out, and that starts to severely piss me off really quickly. Like now. Which is why I'm going to stop right here.
> "Custom DHCP config"?!? You do realize DHCP stands for "dynamic host configuration protocol"? That's exactly what it was designed for...
Usually used for assigning network configuration dynamically. I'd venture a guess that most people never touch their DHCP configuration, aside from assigning DNS and IP pools.
And good luck getting your hosting provider to give you raw access to their DHCP config.
> And if you can't get "working PXE" (PXE is implemented in firmware, by the way), you might want to brush up on the basics of putting together an IT infrastructure, or this moment might be a good time for a career switch out of IT...
There is a lot more to stable PXE than having a client. Never mind the insanity of effectively giving anyone who can answer DHCP broadcasts on your network root access to all new servers.
> There are no "Kickstart manifests", it's a ruleset.
Different words, same thing.
> There is no "RPM registry" (this isn't Windows™️) or anything of the sort.
Correct, guess I've done too much Docker recently. Meant RPM repository. Unless you're shipping around your RPMs manually? In which case, I refer back to the day two question marks.
> There are no "custom spec files" as it's a formalized format with grammar and lexicography which the OS's software management subsystem understands - it's like putting a VHS tape into a video cassette recorder, modular and standardized.
So.. the same as Nix definitions then? Except with Nix you don't have to worry about whether the system was set up for VHS or Betamax.
> Day two is where a software deployment server and clearly defined change management process according to the department of defense capability maturity model come into play.
So.. completely unrelated to the process you've advocated so far?
Not unrelated; they work in tandem. Once the systems are provisioned, the software deployment server is used to mass-deploy components and bundles (OS packages) and is integrated into the change management process (for example, no deployment to production without an approved change request identifier).
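In other words, the deployment tooling refuses to touch production unless it is handed an approved change request. A hypothetical gate in ksh93 might look like this, where cm-query stands in for whatever interface your change management system actually exposes:

    #!/bin/ksh93
    # Hypothetical pre-deployment gate: refuse to deploy to production without
    # an approved change request identifier. cm-query is a placeholder CLI.
    CR_ID="$1"
    [[ -n "$CR_ID" ]] || { print -u2 "usage: deploy.ksh <change-request-id>"; exit 2; }
    status=$(cm-query --id "$CR_ID" --field status 2>/dev/null)
    [[ "$status" == "APPROVED" ]] || { print -u2 "change request $CR_ID is not approved"; exit 1; }
    # ...hand off to the software deployment server here...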
And you need to write custom rpmspecs and kickstart manifests. Do those not count as code?
Kickstart also doesn't help you with any day-two ops concerns.
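For what it's worth, even a minimal kickstart file is code you write, review, and version; a sketch, where the partitioning, password hash, and package set are placeholders:

    # ks.cfg (sketch)
    text
    lang en_US.UTF-8
    keyboard us
    timezone UTC --utc
    rootpw --iscrypted $6$REPLACE_WITH_REAL_HASH
    network --bootproto=dhcp --activate
    zerombr
    clearpart --all --initlabel
    autopart
    bootloader --location=mbr

    %packages
    @core
    myapp
    myapp-config
    %end

    %post
    systemctl enable myapp
    %end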