
If I'm running on the cloud, why do I care how PXE boot works? How is that helping me? Yes, it's a lost knowledge/skill area, but it's also not needed.

Similarly, I've forgotten how to configure HAProxy, and setting up Nginx would take me a few days to re-learn and optimize, but I don't need those skills anymore. Just like I don't need to know how to install Oracle/PostgreSQL from scratch, set up RAID on my computer, compile the kernel, etc. to do my day job.

I'm also not debugging the assembly coming out of the compiler anymore, looking for compiler bugs. (The one time I've done that in my 25-year career, I just pinged one of the engineers who worked heavily in that area; he filed a bug with Sony 10 minutes later and told me to just add some int ++/-- to shift the compiler bug out of the way. Sony fixed it a month later.)

Nor am I a wizard at adding/removing drivers on Win98 or setting up a token ring LAN.

Germane knowledge comes and goes. My job isn't to know PXE boot or any other specific tech; it's to ship product...




I think you missed the point: they were saying most infra devs might not know how to do both on-prem and cloud, so using cloud becomes less of a choice if cloud skills are the only skills in your org.


That's a very good question. Well, maybe PXE specifically isn't such a great example, but still, this is a good question because the answer isn't obvious.

First, cloud vendors don't really invent the infrastructure from zero. They typically adapt existing solutions to their needs. And they aren't infallible: every now and then there'll be a bug or an edge case they didn't handle. Alternatively, there will be weird restrictions that you, the customer, would have to go fish for in a substantial volume of quite opaque documentation in order to protect yourself from those unfortunate edge cases.

I don't know whether AWS uses PXE boot internally. But suppose they did: that would mean some specific details of the network configuration enable the use of PXE. There could also be pathological interactions between the "special" (boot, bootloader) partitions and PXE boot. In a VM-based deployment you wouldn't normally care about the boot partition, since some VM platforms can boot your kernel directly without a bootloader in the image, so you may go a long time without ever discovering what your bootloader is configured to do and how it may interact with PXE boot.
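To illustrate the "boot the kernel without a bootloader in the image" point, here's a minimal sketch using QEMU's direct kernel boot; the kernel, initrd, and disk image file names are hypothetical placeholders. Whatever bootloader is installed inside the disk image is simply bypassed:

    # Sketch: direct kernel boot with QEMU, bypassing whatever bootloader
    # (GRUB, syslinux, ...) lives inside the disk image.
    # vmlinuz, initrd.img, and disk.img are hypothetical placeholders.
    import subprocess

    subprocess.run([
        "qemu-system-x86_64",
        "-kernel", "vmlinuz",                        # kernel supplied from outside the image
        "-initrd", "initrd.img",                     # matching initramfs
        "-append", "console=ttyS0 root=/dev/vda1",   # kernel cmdline, normally the bootloader's job
        "-drive", "file=disk.img,if=virtio",
        "-nographic",
    ], check=True)

A setup like this is exactly how a bootloader inside the image can stay misconfigured for years without anyone noticing.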

Suppose at some point AWS upgrades something in their PXE server, or in the code it runs on clients... and it breaks your stuff. Because this would most likely affect all of your VMs, the effect could be devastating...

But... you didn't even know PXE boot existed. You had no idea that you had to look for some subsection X.Y.Z in the appendix to the user manual that warned you against a particular bootloader configuration, one that had stayed dormant for a long time.

It's not hard to imagine how this might go financially...

Second, any solution provider aims to sell to many customers -- the more the better; that's how the solution becomes profitable. Naturally, the provider wants to find as homogeneous a group of customers as possible, so that with as little effort as possible they can cover most of them. But reality tends to work against this desire by producing customers who aren't like each other. So some generalization must happen, which means that very specific, very individual customer needs will likely not be covered by the service provider. Now, suppose you, the client, know how the underlying technology works, and thus are able to defeat the provider's unwillingness to accommodate your very specific needs... or you don't. Typically this ends up being not a deal-breaker but an inconvenience. Often it's money you'll pay for a service you don't need, or resource consumption that could have been avoided.

I can't think of a good example with PXE, but here's something I crossed paths with recently at my kid's birthday party: I met a guy who works at GitLab, and that reminded me of a grudge I'd held against GitLab for some time. GitLab is an example of a service that replaces the in-house IT/Ops who'd otherwise manage the company's repository.

GitLab, at least initially, gained popularity due to the CI bundled into the service. This CI knows how to run jobs. These jobs produce logs. The logs can be obtained through the API, if necessary. But... if you do this before the job has finished, there's no API call that lets you page through the log. So if you want to display the log as it's being generated, you poll, and each subsequent response gives you the whole log so far (0..N characters, with N ever increasing). This is hugely wasteful, and is in part the reason their logging configuration puts a hard limit on log size... Bottom line, their logging API is bad. But you won't abandon the service because the logging API is bad. You'll suck it up, especially since they give you a bunch of free stuff... You'll just pay a bit extra if you don't need the free stuff.
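To make the waste concrete, here's a minimal polling sketch against GitLab's job trace endpoint (GET /projects/:id/jobs/:job_id/trace); the host, project/job IDs, and token below are placeholders. With no offset or paging parameter, every poll re-downloads everything the job has logged so far:

    # Sketch: tailing a running GitLab CI job by polling its trace endpoint.
    # gitlab.example.com, the project/job IDs, and the token are hypothetical.
    import time
    import requests

    URL = "https://gitlab.example.com/api/v4/projects/1234/jobs/5678/trace"
    HEADERS = {"PRIVATE-TOKEN": "glpat-xxxxxxxxxxxxxxxxxxxx"}

    seen = 0
    while True:
        resp = requests.get(URL, headers=HEADERS)    # returns the *entire* log so far
        resp.raise_for_status()
        log = resp.text
        print(log[seen:], end="", flush=True)        # diff client-side; the bytes were still transferred
        seen = len(log)
        time.sleep(5)

The total bytes transferred grow roughly quadratically with log length, which is presumably part of why the hard size limit exists.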


With the PXE boot issue... Amazon has millions of servers, and they roll changes out to hardware on roughly a 5-year cycle for versions/SKUs of compute. So anything hardware-related will be detectable when your software doesn't work on the new instance type/SKU/class. I've had this happen multiple times over 10 years, and I always just canary out new instance types; in fact I set up our systems to make this simple. You can easily try a new type out and hand the hardware back if it doesn't work. We reported the issues every time. I believe AWS scrapped one SKU due to an issue we reported. They also pulled SKUs back from us because another customer realized they were broken in a way that would eventually impact everyone.
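A minimal sketch of that "canary a new instance type, hand it back if it misbehaves" workflow, assuming boto3; the AMI ID, the instance type, and run_workload_check() are hypothetical stand-ins for your own image and smoke test:

    # Sketch: launch one canary of a new instance type, smoke-test it,
    # and hand the hardware back (terminate) if the test fails.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    def run_workload_check(instance_id: str) -> bool:
        """Hypothetical placeholder: point real canary traffic / smoke tests here."""
        return False

    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # placeholder AMI
        InstanceType="m7g.large",          # the new SKU under test
        MinCount=1,
        MaxCount=1,
    )
    instance_id = resp["Instances"][0]["InstanceId"]
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

    healthy = False
    try:
        healthy = run_workload_check(instance_id)
    finally:
        if not healthy:
            ec2.terminate_instances(InstanceIds=[instance_id])   # give the hardware back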

When it comes to software, Amazon again has millions of machines, plus regions and availability zones. AWS is very, very slow and deliberate about how they roll out software. They go sub-AZ (shard, cell), then AZ, then region... and they can and do split on other axes as well. Rolling out a broken PXE boot change would be caught pretty quickly by an increase in failed boots or under-utilized instances (they didn't boot right) and rolled back.
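The wave-based pattern being described looks roughly like the toy sketch below; the wave names, bake time, threshold, and helper functions are all made up for illustration and are not AWS's actual tooling:

    # Sketch: staged rollout with a boot-failure guardrail and rollback.
    # deploy(), rollback(), and get_failed_boot_rate() are hypothetical
    # stand-ins for real deployment tooling and fleet metrics.
    import time

    WAVES = ["one-cell", "one-az", "one-region", "all-regions"]
    FAILED_BOOT_THRESHOLD = 0.001    # abort if >0.1% of boots fail in a wave
    BAKE_SECONDS = 3600              # let each wave soak before widening the blast radius

    def deploy(scope: str) -> None:
        print(f"deploying to {scope}")

    def rollback(scopes: list[str]) -> None:
        print(f"rolling back {scopes}")

    def get_failed_boot_rate(scope: str) -> float:
        return 0.0                   # placeholder for a real metrics query

    deployed: list[str] = []
    for wave in WAVES:
        deploy(wave)
        deployed.append(wave)
        time.sleep(BAKE_SECONDS)
        if get_failed_boot_rate(wave) > FAILED_BOOT_THRESHOLD:
            rollback(deployed)       # undo every wave touched so far
            break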

All of this is aided by many, many of AWS's customers treating servers like cattle, not pets, so whacking and restarting a small % of hosts isn't super detrimental and likely won't even be noticed. AWS can actually be extra cruel to internal teams: we have a few concepts internally that push users to be even more ephemeral than normal.

As for how people learn PXE boot? Experience or on-the-job training. I have no idea if AWS still uses PXE boot; I suspect they have a specialized setup, boot VMs directly, etc. There's hardware security with the hypervisors that you don't get on normal machines. I did PXE boot a Linux desktop in the office the other day to install Linux, so the corp side does use it.

I'm not really sure what point you're getting at, though. The cloud providers are doing things at a scale most people aren't used to thinking about, and the problems are often quite different, as are the solutions. Are you rolling your own custom hardware with Foxconn (https://www.importgenius.com/suppliers/foxconn-aws)? Do you roll your own custom silicon (https://www.amazon.jobs/en/landing_pages/annapurna%20labs)? Would you be solving PXE boot issues if you had those capabilities?





