CoreOS: Boot on Bare Metal with PXE (coreos.com)
145 points by philips on Sept 11, 2013 | 55 comments



For anyone who does not know what PXE ("pixie") boot is: it is an environment for booting computers over a network interface, independently of data storage devices (like hard disks) or installed operating systems. [1] In the BIOS, rather than booting from CD or hard disk, you select Network.

A very simplified explanation: the PXE-enabled network card (almost all modern network cards, desktop and server, support PXE) can make a DHCP request outside of any operating system, download a kernel and initial ram disk via TFTP, and then boot it, for example an OS installer, a live CD, or CoreOS.

    +-----> #1 PXE makes DHCP request, 
    |          redirects to TFTP server,
    |          loads kernel/initial ram disk via TFTP
    |
    |  +--+ #2 PXE boots kernel/initial ram disk 
    |  |
    +  v
  +------------------+
  | PXE Network Card |
  |------------------|
  |     server       |
  |   hardware/os    |
  +------------------+

So, with a little infrastructure (DHCP, TFTP, and elbow grease), you can boot many things over the network without even having a hard drive or CD-ROM in the machine. I use this often for installing new desktops and servers: just boot into the PXE menu, select the installer I want, and then use something like Puppet to configure the machine as needed. A typical RHEL install takes ~5 minutes.

In summary, it looks like CoreOS provides its kernel and initial ram disk [2], so you can boot the machine over the network without actually installing anything. Everything runs out of RAM, just like a live CD, most likely with the option to use a disk for persistent storage of your images, etc.
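
To make that concrete, a pxelinux.cfg/default entry for it would look roughly like this. Just a sketch: the image file names and kernel arguments are my guesses, so check the CoreOS PXE docs [2] for the real ones.

    # pxelinux.cfg/default -- hypothetical CoreOS entry; file names/arguments are assumptions
    default coreos
    prompt 1
    timeout 15

    label coreos
        kernel coreos_production_pxe.vmlinuz
        append initrd=coreos_production_pxe_image.cpio.gz coreos.autologin

Drop those two files into your TFTP root next to pxelinux.0 and the machine boots straight into RAM.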

ps. It is common to mount persistent storage over NFS in these environments too, or, in an HPC environment, Lustre or something like that.

[1] http://en.wikipedia.org/wiki/Preboot_Execution_Environment

[2] http://coreos.com/docs/pxe/


"Back in the day" (pre-PXE), we had rooms full of machines -- without hard drives -- that booted completely across the network and mounted all of their filesystems via NFS.

https://en.wikipedia.org/wiki/Diskless_node

Over the last few years, I've wondered why server vendors don't ship servers with some type of flash-based storage (or similar) -- perhaps 4-8 GB -- that's large enough to hold an installation of (for example) VMware ESXi (or another hypervisor) and its related configuration files, leaving any local storage exclusively for VMs. Alternatively, you could boot the hypervisor from this "onboard storage" and access all data across the network (e.g. NFS, iSCSI, SAN) and not have any HDDs whatsoever in the server.


Server vendors have shipped that for quite a while now; Dell, for example, has offered servers with ESXi pre-installed on an SD or CF card built into the system for at least five years.


Well, there's the solution of putting a USB flash drive inside the server's case (optionally securing it with duct tape). In fact, a USB flash drive is the recommended medium for booting FreeNAS.


I've seen recent servers ship with an SD or MicroSD slot on the motherboard too. Probably a bit more secure than a USB stick, which can come unplugged easily.


SmartOS is often booted from a USB flash drive too.


I've been looking at an mSATA SSD and a low-profile PCIe adapter card for our Ceph cluster, to free up the OS/journal drive slots for more 4 TB spinning disks.


If you want an SSD on PCIe, any reason why you are not looking at SSDs built into PCIe cards? E.g. the OCZ Vector or RevoDrive PCIe cards, but there are several other alternatives too.


How are you finding ceph?


We've been using it for about 15 months now without any problems with RADOS. Last winter we had some data loss with CephFS and needed to rebuild the filesystem from backup, but CephFS is unsupported so it was somewhat expected. I think the issues came from the Linux kernel client (circa the 3.2 kernel iirc), so we switched over to the FUSE client instead and have been on that since.

The good news is that performance is much, much better with 0.61 than with prior releases, both for RADOS and CephFS. We'll probably upgrade to 0.67 in the next few weeks, and it's probably time to upgrade to the 3.10 kernel as well for btrfs fixes and to kick the tires on the kernel fs client again.


VMware has a product to do exactly this (Auto Deploy). It PXE boots ESXi and can configure it automatically.


Many of them do. Many newer HP blades, for example, come with an SD card inside to do exactly that with either local or network storage.


It is possible to PXE boot a full-blown OS, not just an OS installer, live CD, or CoreOS.

The product usually used to do this is Citrix Provisioning Services (PVS): http://support.citrix.com/proddocs/topic/technologies/pvs-pr... It is essentially a nice GUI that ties into PXE and TFTP and streams an image from a disk file.

On a 1Gb network I have simultaneously booted over 300 physical machines in under 2 minutes.

For those curious, unfortunately PVS is packaged with XenDesktop, so you have to buy XenDesktop to get PVS. Citrix does not seem willing to separate the two into different products, even though I have deployed PVS for several customers without even downloading the XenDesktop bits.


DRBL can do this for Linux (if you want some software to set it up for you). You can actually accomplish this with nothing more than the Linux kernel if you're clever: https://www.kernel.org/doc/Documentation/filesystems/nfs/nfs...
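
The just-the-kernel route really is only a few kernel command line parameters. A rough sketch (the server address and export path are made up, and the kernel needs NFS root and IP autoconfiguration built in):

    # appended to the kernel command line, e.g. in your pxelinux.cfg entry
    root=/dev/nfs nfsroot=192.168.1.1:/export/rootfs ip=dhcp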


Thanks, I had not heard of DRBL. After reading about it, it sounds pretty much like PVS, though PVS does have some additional features I don't see in it. Will have to try it out! DRBL is free software, though.


The product usually used to do this is Linux...


Yea I meant "I usually" :-p


Can the BIOS detect a wireless network, or do I need to have an Ethernet cable plugged in?


There might be some motherboards with built-in WiFi chips that support PXE with configurable access point and security settings. Maybe try a Duck Duck Go search for wifi PXE.

There's no way for the BIOS to know your WPA key automatically, of course, and in general, you'll need wired Ethernet for PXE.


Since I asked for this in the original CoreOS thread, let me be the first to say thanks. I think stateless immutable servers that boot from the network and run from RAM are going to be a great base layer to build on.

BTW, if people haven't tried PXE booting before, it's pretty easy with dnsmasq. You can basically read the sample config file and uncomment a few lines. I recommend experimenting with PXE in Vagrant or on a separate physical network to avoid breaking your production DHCP.
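
For the curious, the whole thing boils down to something like this (a sketch; the interface, address range, and paths are examples for a scratch network, not defaults):

    # /etc/dnsmasq.conf -- minimal PXE setup; values are examples
    interface=eth1
    dhcp-range=192.168.50.100,192.168.50.199,12h
    # boot file the PXE ROM fetches over TFTP
    dhcp-boot=pxelinux.0
    enable-tftp
    tftp-root=/srv/tftp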


> BTW, if people haven't tried PXE booting before, it's pretty easy with dnsmasq. You can basically read the sample config file and uncomment a few lines.

You don't even need a config file. Here's a command line snippet I have saved for testing PXE installs:

    $ sudo dnsmasq -hdq -i en0 -p0 --enable-tftp --tftp-root=`pwd` -Mpxelinux.0 -F10.10.10.100,10.10.10.199
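
(For reference: -F sets the DHCP range, -i the interface, -M the boot file handed to clients, --enable-tftp/--tftp-root serve files out of the current directory, and -p0 disables the DNS side so it won't fight your resolver.)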


So this snippet makes the machine you run it from into the PXE Server?


Yes.


> I think stateless immutable servers that boot from the network and run from RAM are going to be a great base layer to build on.

Heh, we were doing this 20-25 years ago. It's funny sometimes how I.T. seems to circle back around.


>> stateless immutable servers [emphasis added]

>Heh, we were doing this 20-25 years ago.

It's possible, but I'd be surprised (unless you mean X servers in the crazy X-windows server-is-really-what-any-sane-person-would-call-a-client sense).

Doing this kind of thing is pretty new for servers. Even massive HPC clusters tend to boot from persistent disk.


Maybe not 25 years ago... but it was common in the early '00s. This is not new tech. I participated in a project to do this with FreeBSD and NFS, I think around '03 or so. It was a smallish cluster (under 100 nodes) built to handle a simple FreeBSD/Apache/MySQL/PHP stack for a growing real estate ASP. Even then, it was not a new thing; there was plenty of documentation from other folks doing it.

I mean, it didn't go as smoothly as PXE booting does today; we ended up needing to reflash all our network cards to make it work, something I have not had to do in the last 5 years.


The NCD X terminals I used to work on did this (they actually ran a BSD Unix kernel under the hood, I believe). It was also popular for some Sun diskless workstations -- although the protocol was bootp instead of PXE.


> although the protocol was bootp instead of pxe

Well, they used bootp followed by tftp, and so did most other UNIX workstations of that time period. DHCP is effectively a superset of bootp -- it uses the same UDP port. PXE is DHCP+tftp along with some specification of tftp paths, where the image is loaded in RAM, etc.

Really the overall netboot mechanism is nearly identical as it was 25 years ago, just some protocol tweaks and a new acronym.


>Doing this kind of thing is pretty new for servers. Even massive HPC clusters tend to boot from persistent disk.

Not really, no. Servers all over the IT world have been using PXE and tftpboot for decades; in fact, network booting was a very common way to boot a server even in the '80s. I know of quite a few rail/transportation companies whose track-side computers are booted over the network using these techniques, and there are about 250,000 of those things in Europe alone...

What's new is that another generation are discovering this technique.


What are the upsides/downsides of this? It's such a paradigm shift, I'm having a hard time thinking of any advantages.


It's based on the concept of immutable servers (http://martinfowler.com/bliki/ImmutableServer.html), which helps eliminate errors due to configuration drift. Containers also have lower overhead than VMs.


Awesome! My favorite part is this: http://coreos.com/docs/pxe/#state-only-installation

which would enable you to run CoreOS in memory only (loaded from PXE), but still store all of your containers on a local filesystem. It liberates me from having to install an OS on a cluster, but still lets me use persistent storage.


PXE support on some consumer boards is a mess - often I have to use iPXE [1] just to get them loading from TFTP reliably. Now I've played with UEFI PXE boot, and it seems to be even worse - instead of requesting an "x64 bootloader", the NIC seems to request a "UEFI bytecode bootloader" which I haven't been able to supply.

[1]: http://ipxe.org/


I'm really glad to see CoreOS taking this path, forged by the likes of VMware's ESXi and Joyent's SmartOS. It truly is the only way to run scalable infrastructure.


>I'm really glad to see CoreOS taking this path, forged by the likes of VMware's ESXi and Joyent's SmartOS. It truly is the only way to run scalable infrastructure.

I find comments like these amusing. Sysadmins have been using PXE to boot servers... for quite some time now. Long before Joyent existed, and long before VMware was something you'd seriously run a server under. (VMware came about around the time of the 2.1 version of the PXE standard, which was when I was first getting my feet wet; I didn't seriously start using PXE until the early oughts.)

Hell, /I/ built an NFS/PXE diskless cluster before Joyent existed. As part of that, I demo'd an 'initrd only' system like this CoreOS thing, only we were using FreeBSD. We ended up going with / on NFS; it was way easier to update.

That said, I'm not knocking CoreOS; I might even use this. Maintaining your own bootable initrd with a root filesystem is work. I currently use distro 'rescue images' (for CentOS, at least, you append 'rescue' to the installer's boot line, and it downloads a small initrd /)... but it's less than optimal.

I mean, I'm not shitting on Joyent, either; I think most of the value they bring is managing this shit on an ongoing basis, which is not a trivial amount of work. The whole idea behind companies like Joyent is to make it so you don't need a me screwing with your DHCP server, and there's value in that. I'm grouchy and charge a lot of money.


Good luck finding a network infrastructure and PXE server able to boot a few hundred machines simultaneously during a power event. Yes, in theory PXE boot sounds great. In reality, it's a pointless SPOF.

It also most likely makes it harder to reuse most of the trusted-boot infrastructure that already exists for Linux. So we can assume that, at least in the initial release, Mallory can race the real PXE server on any network that hasn't been partitioned with a crazily complex config (i.e. basically all of them).


We commonly reboot entire clusters at once (around 10,000 servers in larger clusters -- each running a full Linux OS) over PXE without a problem. We have a configuration management machine that creates an image, then we push that down to a small cluster of TFTP servers that serve it out. The strain on NFS (we keep parts of the OS in RAM, and load other parts on demand over NFS) after we kexec from the PXE kernel into the production kernel causes more problems than the initial TFTP traffic (but it usually works fine as well). Btw, after booting, we use PanFS (DirectFlow) or Lustre for computing stuff, not NFS.
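
For anyone who hasn't used kexec, the handoff is basically this (the paths and command line below are placeholders, not our actual setup):

    # hypothetical example: stage the production kernel, then jump into it without a firmware reboot
    kexec -l /images/vmlinuz-prod --initrd=/images/initrd-prod.img \
          --command-line="root=/dev/nfs nfsroot=10.0.0.1:/export/prod ip=dhcp ro"
    # boot the staged kernel immediately, skipping BIOS/PXE entirely
    kexec -e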

Although it's not what we use, here's a program that does a similar type of management (http://warewulf.lbl.gov/trac). If you take the time to combine Warewulf with something like Puppet or Chef, you'll have a nice system for managing hundreds of thousands of machines (I could easily see this scaling to over a million servers if you have the cash to build something like that).

If you're wondering about dynamic libraries in an environment like this, take a look at https://github.com/hpc/Spindle

And yes, I still get giddy when I type one command to reboot 10,000 servers.


Since you mention kexec and TFTP+NFS, are you currently using Perceus? Or is there another system out there with that combo?


We're using a modified Perceus tied into cfengine.


So, a 'PXE server' is nothing more than a DHCP server and some method of fetching files (historically, TFTP, but this is not a requirement).

ISC DHCPD is pretty bulletproof. Many large carriers/ISPs use it, so scaling this end should be pretty straightforward. You'd probably want to statically configure leases for all your servers though.
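
For the static leases, an ISC dhcpd fragment looks roughly like this (the addresses and MAC are made-up example values):

    # /etc/dhcp/dhcpd.conf fragment -- example values
    subnet 10.0.0.0 netmask 255.255.255.0 {
        # point clients at the TFTP server and boot file
        next-server 10.0.0.2;
        filename "pxelinux.0";
        host node01 {
            hardware ethernet 00:25:90:aa:bb:cc;
            fixed-address 10.0.0.101;
        }
    }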

Pretty much all PXE ROMs built into NICs only support TFTP. I would use TFTP to load iPXE, then use iPXE to load the operating system over something else (probably HTTP). Scaling HTTP is well understood (this is about the easiest HTTP scaling you can do, downloading a few static files).
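
Concretely, the script you hand to iPXE after the TFTP chainload can be as small as this (the host name and paths are made up):

    #!ipxe
    # hypothetical boot script: fetch kernel and initrd over HTTP, then boot
    dhcp
    kernel http://boot.example.com/images/vmlinuz
    initrd http://boot.example.com/images/initrd.img
    boot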

Nothing about this is hard to scale, nor does it require any special network setup.


iPXE is awesome.

My standard install is to embed a script into iPXE that chainloads a config file over TFTP (this works properly with non-ISC DHCPD servers, like OpenBSD's fork and dnsmasq), then loads various OS's and diagnostic tools over HTTP or NFS.
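
The embedded script itself can be tiny, something along these lines (the config file name is made up; ${next-server} comes from the DHCP reply):

    #!ipxe
    # hypothetical embedded script: get an address, then chainload the real config over TFTP
    dhcp
    chain tftp://${next-server}/boot.ipxe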

So much easier than slinging around optical media or hard disks for installation/diagnostic/booting diskless clients.


I'm unsure why PXE cannot be a reliable network service just like DHCP or DNS. It can be clustered and made as reliable as any other system. You can also load balance and distribute the PXE traffic just fine in the event of a full refresh of all nodes.


I don't get this either. PXE is a few special DHCP settings & a TFTP server & a very small boot file. The concept has been around since before 2001, yet it's treated like some kind of weird voodoo.

Using PXE in your environment means you have to feck around with dnsmasq, or manually configure DHCP, or use enterprise-level bloatware.


TFTP is a UDP-based file transfer protocol that uses stop-and-wait transmission with a simple retry loop; if you rely on that alone, you risk hitting congestion very quickly and having a hard time recovering from it.

The suggestions above to use iPXE and switch to TCP-based transfer methods such as HTTP make a lot of sense in that respect.


Use iPXE (http://ipxe.org/) to boot from an HTTP image server.


Only if your datacenter or server room is underpowered. A properly specced location can handle all servers at full load.


> In reality it's a pointless SPOF.

Like pretty much anything else, it's not a problem when implemented correctly.

Even if you have hundreds or thousands of machines, it's not that difficult to configure your network switches (assuming you don't buy your network gear at Office Depot) to distribute the DHCP and TFTP across multiple servers -- if you even need to, that is; DHCP and TFTP are two of the most lightweight networking protocols in use.


If you are OK with CentOS/RHEL, there is a scalable way to boot multiple machines simultaneously using... BitTorrent!

http://www.rocksclusters.org/rocks-doc/papers/two-pager/pape...

Rocks is used at production cluster installations with thousands of servers.

http://www.rocksclusters.org/rocks-register/index.php?sortby...


I wrote a WiFi-based LTSP setup for Linux some time ago: http://www.sarathlakshman.com/2010/03/14/wireless-ltsp/


Running CoreOS on Intel NUCs via PXE boot from an HP MicroServer: a poor man's blade chassis/datacenter?


So is this basically LFS with built-in PXE boot support and orientation towards servers?


Is this some 'lightweight' OS that you can use for Raspberry Pi?


from the title I thought somebody had discovered a way to run an operating system on just a thin sheet of metal


"Bare metal" is a common term for running directly on the hardware, such as a non-virtualized operating system or a program running without an operating system.

Or yeah, we could just downmod everybody.


wow tough crowd.



