Unikernels are cool, but half this official post is pointlessly throwing rocks at postgres, and the other half is hacking around to get this extremely common piece of software working with their own product. No thanks, I'll get my unikernel fix from someone else.
Plus, if you often have to LD_PRELOAD getuid, just add a bogus implementation to your libc, or expose it as a syscall, to reduce onboarding friction.
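For what it's worth, the shim being described is tiny. A hedged sketch (the file name and uid choice are illustrative):

    /* fake_uid.c -- build: gcc -shared -fPIC -o fake_uid.so fake_uid.c */
    #include <sys/types.h>

    /* Pretend to be root for apps that insist on calling getuid()/geteuid()
       even though the unikernel has no notion of users. */
    uid_t getuid(void)  { return 0; }
    uid_t geteuid(void) { return 0; }

Run the stubborn binary as LD_PRELOAD=./fake_uid.so ./app and it happily sees uid 0.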
If you think I was throwing rocks, that most definitely was not the intention. The whole point of writing the post was to point out all the abstractions that don't quite gel in unikernel land.
The LD_PRELOAD wasn't used for this particular port, but it does get used sometimes. We don't actually have real uid calls, but some applications require specific ones that we stub from time to time. Very, very few apps actually need this.
I wonder why unikernels haven't caught on. Years ago, when they started to be hyped, I imagined the technology would surpass the adoption of containers. In principle, they have the security and isolation of VMs, which containers lack, but with even less overhead than containers, and they can run on bare metal. It sounds like the best of both worlds.
The main issues years ago were developer tooling, packaging, and difficulty with debugging, so the UX was far from what Docker did for containers. But I was hoping that these issues would be addressed (have they been?), and that a de facto standard would emerge that would propel them into the mainstream. Instead, we have these disparate niche tools and services that seem to be competing, but there's no widespread industry adoption.
I'm sure that the evolution of the container ecosystem made unikernels less desirable, and the increased popularity of container orchestration engines like Kubernetes cemented this trend. But I wish that the unikernel ecosystem were more mature so that it's at least a topic of conversation in tech circles, and k8s wasn't such an obvious choice. Though I might be in the wrong circles and have outdated information, so I'm curious to know what HN thinks.
Because it's hard to find programmers who can do it.
The overwhelming majority of programming today happens in userspace. And even there, it's typical to be far removed from system interfaces. Popular language runtimes such as Python or Java or JavaScript create a layer of obfuscation on top of system interfaces.
Programmers who could realistically pull this off are either in the embedded field or in the OS / systems field. OS / systems is tiny compared to everything else, and even though embedded is larger, the problem is also the lack of people who can tie the expertise needed to deal with unikernels to the expertise needed to develop enterprise or user-facing applications.
Another aspect is the lack of support from large cloud vendors. In principle, you could create AMIs (AWS EC2 VM images) from scratch, but it will be hard to integrate them with the rest of the cloud infrastructure. Popular tools that create AMIs aren't designed to deal with unikernel applications. They often assume that a UNIX shell is available, or that cloud-init is available, etc. It's even worse with VHD. I don't know what the state of affairs is in GCP, but I wouldn't expect it to be any different.
Yet another aspect is the lack of instrumentation for debugging. A lot of common instrumentation assumes or relies on things like a UNIX shell or the presence of standard UNIX utilities. This is even more noticeable on the automation / CI side.
Containers are a lot easier because they rely on existing, however bloated and broken, infrastructure. They aren't better, and the way forward for containers promises no increase in quality... but it's the easy way "forward" (in the temporal sense), and that's why it's being taken.
Definitely agree with the top part; however, I should note that ops, the tool, exists precisely to create disk images and upload them to any cloud, any hypervisor.
In particular, both https://ops.city and https://nanos.org are Go unikernels running on GCP, and their deploys take just a few seconds to push out. AWS can be even faster because we skip the S3 upload part. We also have lots of people using Azure, which utilizes VHDX.
I feel like you're mixing layers. The alternative to VMs is Linux namespaces. They provide isolation. Docker, Kubernetes, LXC, systemd-nspawn and whatnot are just thin wrappers around the Linux API.
There's nothing that would prevent you from running workloads in VMs and managing them with Kubernetes. Just write the proper adapters. There are actually projects in the wild that do exactly that (KubeVirt, for instance).
As to why a VM is supposed to be lighter than a container, I don't really understand. I think it's the other way around. The industry does not really care about security, and losing performance for unclear benefits won't fly with the masses.
They're really not. Containers (and by that I mean their underlying technology) are not alternatives to VMs, in the same way that unikernels are not alternatives to containers (yet, anyway). They're technologies that can be used for similar purposes, but in practice, they all have their tradeoffs that should be considered before use.
For one, containers share the kernel of the host machine. The isolation provided by VMs is on a deeper level, which also makes them more secure. Unikernels are more like VMs in this sense.
My point is that the only drawback of unikernels in theory is the immature tooling and support around it. So I'm wondering if there is another reason that they haven't been as widely adopted as containers.
> As to why VM is supposed to be lighter than container, I don’t really understand.
You misunderstood. I said that unikernels, not VMs, have less overhead than containers. Containers require a minimal, but fully functioning, OS to work, while unikernels only require a cut-down kernel. Whether unikernels are more lightweight is debatable, since containers are merely another process running on the host OS with improved isolation, while unikernels depend on a hypervisor to work. But if we define "lightweight" to include boot time and performance, unikernels have certainly surpassed containers in this sense as well[1]. So the main benefit containers had over VMs is also an attribute of unikernels.
I think you've touched on it in regards to ecosystem maturity. Whenever I traipse through this topic, I find mostly research projects. I know there are startups in this space, but I think it'd be hard to show a $$ value proposition for them, so I can understand why most things look quite research-focused instead.
I think it's likely that instead of a generic unikernel ecosystem and companies built around that, we could instead see very specifically targeted unikernel applications as SaaS offerings. And as consumers of such a service we might not even know (or care) they are unikernel hosted at all, just that they're faster. E.g. my comment below about DB unikernels.
If, say, Snowflake were to run bare on the hypervisor for extra oomph... it wouldn't impact their customers beyond some rather transparent performance improvements. And such a thing would be done only if there was profit in it that would exceed the R&D costs -- put another way, if you're a company considering this, you could either buy more and bigger hardware, or invest in unikernels... the former is far less risk.
Which is why I think the action in this space would come from very specific performance-oriented niches: super-low-latency transaction processing for telecoms, finance / high-frequency trading, ad-tech RTB, games... rather than, say, large-scale analytic workloads.
Unikernels have caught on, in the form of processes. You have to run your unikernel in a hypervisor anyway, so why not give the hypervisor a bunch of features and then it's an operating system with a weird hardware-emulating interface. De-weird the interface, and then you have a normal operating system.
The big difference is that the hypervisor in this case is the public cloud, which is not just multi-tenant but massively multi-tenant. That is, VMware and AWS and everyone else did a really excellent job of abstracting the host away, to the extent that most developers don't even know what a server looks like anymore.
Unikernels take the same approach but abstract the guest away. The abstraction is to treat each vm as an individual application (since we're treating the cloud as the operating system) versus trying to micro-manage hundreds or thousands of virtual operating systems inside the hypervisor.
I've been messing around with a hobby OS that's adjacent to unikernels for a while now. Once I'm done enough, I plan to benchmark running my application+OS vs running my application as PID 1 in a mainstream OS. I imagine, if I'm lucky, performance will be similar, although chances are the mainstream OS will be significantly better. And development time for the application-as-PID-1 route would be less than a month, versus the three years I've been tooling around at this point (only really focused for a few months, though).
Most applications also lack the introspection to be really operable in a unikernel situation, so then you're looking at a unikernel with administrative extras vs a minimal mainstream OS with a small set of administrative tools, and that's an easy win for the mainstream OS.
Unikernels are a way to do the same thing you do with your "process abstraction", but better: using fewer resources, minimizing the number of failure points, improving deployment speed, and minimizing dependencies, which also allows for faster iteration when developing software.
The reason they didn't catch on is that the overwhelming majority of programmers aren't inventors and aren't knowledgeable enough about the system aspect of their environment -- they need tools, frameworks, bundles of documentation and support forums to get anywhere.
Now, all of the aids listed above can be created either by a dedicated group of enthusiasts or by large corporations investing in such a process. Large corporations have little interest in the software qualities mentioned above because the "good enough" containers are capable of solving their business problems. So... we are left with enthusiasts, who are few and whose work isn't being integrated into the "mainstream" development process.
Will these enthusiasts eventually gain enough momentum to make it really attractive for the whole industry to switch, or will it be another IPv4 vs IPv6 story -- time will tell.
In what way do a hypervisor and unikernels use fewer resources, reduce the number of failure points, improve deployment speed, or reduce dependencies compared to an operating system (no hypervisor) and processes?
If you are comparing a bare metal system, indeed you might not see a lot of benefits.
However, there is a basic assumption that these workloads are running in the cloud, which means they are already virtualized, and when you compare against an Ubuntu/Debian/etc. VM, that's where the benefits show up.
If there is no hypervisor involved, we would consider that bare metal regardless of how automated the process of provisioning the system is (e.g. packet.net, Hetzner).
However, if we use your example, then one is still managing the install process of the base system, updating it, patching it, securing it, managing application deployment on top, networking, etc.
At the end of the day though that is still a layer below.
and unikernels also have a layer below because they are always run on hypervisors. Nobody is talking about true bare metal here, and you don't want to write your applications on true bare metal most of the time anyway. That's reserved for the situations where you really need to squeeze 110% out of your hardware, which will never be upgraded. A lot of older console games were written on true bare metal for this reason - I remember that on the Nintendo DS you could set up a DMA controller to copy a command buffer from main memory into the graphics processor's command FIFO, and another one to transfer sound samples, and there was a whole set of registers to select which VRAM bank was allocated to which purpose and whether the main graphics processor would be allocated to the top or bottom screen with the other one getting the secondary graphics processor - but web app servers are not in this situation.
Let's make it more concrete and talk about Linux, instead of talking about hypervisor and OS in abstract.
So, Linux comes with a lot of legacy. Just take the whole bootloader, then the whole multi-stage boot with initramfs... you don't need any of that in a unikernel designed to run on a known hypervisor. QEMU can boot a Linux VM skipping the bootloader, but not skipping the initramfs part. But, really, you don't need any of that. Even worse, having to debug problems that happen before the pivot is a huge pain.
Second, Linux keeps adding more modules directly into the kernel (rather than making them dynamically loaded). Not so long ago the kernel adopted RAID modules, for example. Similarly, iirc, the bond, VLAN, and bridge interface modules are part of some / all (?) kernels today. In other words, the kernel is growing bigger, and most of the things added there are not relevant to you. It becomes more Microsoft-Word-like, where no feature works really well, but the sum total of all features beats every individual better-quality editor, especially if you don't really know what you are going to use it for.
Besides modules, Linux keeps adding configuration for its internal functionality: did you know you can configure the I/O scheduler to work in different modes? Do you even know what they are? I'm pretty sure the answer is "no" and "no", and, in all likelihood, for your application it would make no difference. Do you need multiple memory overcommit modes in your application? -- of course not. Do you need interfaces into userspace like udev, procfs, sysfs etc. in your application? -- of course not, but these are an integral part of Linux.
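To make that concrete, a hedged sketch of what those knobs look like from userspace; the paths are the standard procfs/sysfs ones, but "sda" is an assumption about your device:

    #include <stdio.h>

    /* Print one of the plain-file knobs Linux exposes to userspace. */
    static void show(const char *path) {
        char buf[256];
        FILE *f = fopen(path, "r");
        if (f && fgets(buf, sizeof buf, f))
            printf("%s -> %s", path, buf);
        if (f) fclose(f);
    }

    int main(void) {
        show("/sys/block/sda/queue/scheduler");  /* active scheduler shown in brackets */
        show("/proc/sys/vm/overcommit_memory");  /* 0=heuristic, 1=always, 2=never */
        return 0;
    }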
These and many other things not mentioned here are all potential points of failure. You get a disk with an incompletely deleted RAID, and while you may not even know that you had RAID drivers, depending on a bootloader configuration you've never looked at, suddenly, during the initramfs stage, the RAID module kicks in and starts a RAID rebuild... now imagine the fun of dealing with that.
---
Resources: well, the drivers Linux loads take space... both on disk and in memory. Most of the work Linux does to keep its pieces running is completely irrelevant to you. If you ever catch a glimpse of your Linux booting, you can probably see it mention something about rfkill... well, no matter if you haven't. I promise you, it's there. So, your Linux initialized some mechanism for... drumroll... Bluetooth? How cool is that? Right? Your server sitting in the datacenter still tries to use Bluetooth for who knows what reason. Until not so long ago, Linux had floppy drivers built into the kernel.
This stuff adds up. It takes memory. It makes it necessary for VMs to be created with some kind of input device. Did you know that your input device on QEMU is a... tablet? For some god-knows-why hacky reason it's a tablet. So, you also need tablet drivers. And, because you don't know what keyboard is going to be connected, you also need a bunch of keyboard layouts and so on.
Another thing... Linux is in constant flux. It's always between removing something old and adding something new. You always have at least two ways of doing something, often more. This uses more space and more time to boot. E.g. /etc/fstab is, generally, obsolete, but its support isn't going anywhere. So, you can specify your mounts as systemd units, or you can rely on systemd to parse /etc/fstab and generate those units for you. And, no, you cannot prevent systemd from trying to read /etc/fstab, and you cannot delete the code responsible for reading and transforming it.
---
Development speed: Linux is a big project, made up of many smaller ones. There are a lot of internal dependencies. Suppose Linux added a LUKS2 driver (that's disk encryption), but the dmcrypt project is falling behind schedule in providing an interface to this driver (this was actually the case in SLES). Well, all the cool kids can now use LUKS2, but you got a package deal, and you are stuck with LUKS1 because a package you probably don't even care about failed to make a release deadline.
In practice, today, if you use one of the most popular Linux distributions, you are many versions behind the latest stable kernel release, but you are also many versions behind the latest stable everything. This is because integration takes time, and the more there is to integrate the longer it takes. Just as an example: latest Ubuntu LTS uses kernel 5.15, but latest stable kernel is 6.3.8 as of the time of this writing. That's years of development.
But you're not going to implement all of this in your unikernel, are you? If you want to run your unikernel on RAID you're going to need a hypervisor with a RAID driver built in or loaded at runtime. If you want it encrypted you're going to ask the hypervisor to encrypt it. And how do you want to schedule I/O between different unikernels? And do you think the interfaces will remain stable for all time, or will there be transitional periods where two interfaces are supported at the same time? You see, all of this stuff is still needed.
This attempt to escape the fundamental requirements of operating systems by calling them something else reminds me of... well, nothing in particular, but a lot of projects started by wide-eyed visionaries. They say "we're going to solve the problems of X by making Y which isn't X" and then end up re-creating a really shitty version of X by ignoring all the lessons learned from making X. Consider NoSQL, blockchain, the inner platform effect, or Joel Spolsky's essay on rewrites.
I think, in connection with your other comment, the notion is that all of this is done for you via the cloud, thus you don't really need/want to do it yourself.
If you do wish to deal with this yourself, then yes, you need something, but that is not the goal here. The goal is to deploy something under the assumption that a provider (e.g. the cloud) is already doing this for you.
UX is definitely an issue. One interesting thing coming up is confidential computing (where basically the RAM and registers of the workload are encrypted so the hypervisor cannot see them). The hardware implementations are all based around VMs, so they require you to run a full kernel, even if what you're doing is basically a confidential container workload. We think there might be a niche for unikernels here.
I didn't say "from the platform", I very specifically said they're encrypted so the hypervisor or other VMs cannot see them. Believe it or not, that's how Intel TDX, AMD SEV and other systems do work, and these are real world implementations you can buy right now.
Indeed, it's defense in depth. The hypervisor shouldn't be compromised, and other VMs shouldn't be able to read your VM's data, but if they are or can for some reason, it's encrypted.
So my opinion, having worked in the space for years and being the author of this post, is that the vast majority of people (engineers included) underestimate the sheer amount of work it takes to build a kernel that can run lots of applications on various clouds. Linux has been in development for ~30 years, Unix for 50.
That is to say, it isn't necessarily a lack of use-cases - it is ensuring you have enough support built for said use-cases when some user wants to use it. I'll give an example. Years ago people would say something like "I see you have GCP support - that's awesome - what about Azure?" Back then I'd say something like, well, give us a month because we have to write new disk/network drivers, but yes, we can make that happen - a month is too long for that person, though, and so you immediately lose them as a user/potential customer and have to wait until they come back. (We have great Azure support now, fyi.) Now take that same example and apply it to every language, every cloud, etc. It's a ton of work.
Thanks for your perspective, and kudos on the article, impressive work!
So it is a matter of lack of support and general tooling around it. Anyone interested in the technology needs to write support for each individual tech stack and infrastructure to run it. The amount of complex work and the scarcity of people knowledgeable enough to do it is a huge hurdle.
Whereas Docker came around at a time when the low-level building blocks were already in place in Linux. Containers were already widely used in the industry, but each company had their own tooling around them. Docker made this technology much easier to use, which in turn helped define the standards that we use today.
So I'm still hopeful that something similar will happen eventually with unikernels. There's been a lot of work in improving the tooling around it, as your company has done, adding support for various tech stacks, optimizing hypervisors to run them efficiently, etc., so hopefully in a few years a tool that greatly improves the UX will be built, and an industry standard will emerge.
I think open source is a crucial part for this to happen. BTW, Nanos looks very cool!
To me it's more sensible to consider writing a database from scratch for the unikernel. Or at least the storage engine portion of it.
There's plenty of recent DB research about running up against the wall of what the Linux VM subsystem can provide in terms of memory management. LeanStore[1], Umbra[2], and research[3] since then show that to crank the most out of buffer pool management, we're getting closer and closer to the TLB itself. Fiddling with mmap & VM overcommit, pointer swizzling (or not), userfaultfd & custom page fault handling, even custom kernel drivers, etc.
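(For a taste of the mmap/overcommit fiddling, a hedged sketch of roughly the Umbra-style trick -- reserve an enormous virtual region up front and let pages materialize on first touch; sizes and error handling are illustrative:)

    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    #define POOL_BYTES (1ULL << 40)   /* reserve 1 TiB of virtual address space */

    int main(void) {
        /* MAP_NORESERVE: no physical memory or swap is committed up front;
           a page only becomes real when the buffer manager first touches it. */
        void *pool = mmap(NULL, POOL_BYTES, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (pool == MAP_FAILED) { perror("mmap"); return 1; }

        /* Eviction is explicit: tell the kernel to drop a cold 2 MiB chunk. */
        madvise(pool, 2 << 20, MADV_DONTNEED);
        return 0;
    }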
To really crank performance on in-memory and hybrid in-memory/disk systems -- why even bother with Linux then? Let's run directly on the hypervisor! On the whole, DBs already manage their own persistent storage, so they don't strictly need a filesystem (especially in the New Cloud World where pages often go up into S3 buckets etc); they manage their own memory, often their own user accounts, and often their own concurrency. They're really an OS within an OS in many respects.
Virtualization already handles abstracting things enough that drivers for network and disk etc aren't as big of a benefit from the OS side. Security and monitoring can be handled at the per-VM level. We're no longer held back by the requirement to have a pile of drivers for different hardware configurations. At least in broad strokes.
But I wouldn't start with Postgres as a base, that's for sure. If you're building enough of libc and a POSIX/Unix ABI that you can run stock programs, there's likely little benefit at all.
I doubt you'd get much win for analytical workloads, but for very high throughput transactional workloads ... what fun!
Doesn't Oracle (and, reaching waaaaay back in the memory banks, I think DB2) do something similar, in the sense that it is packaged up with its own VM/storage subsystems that bypass the native OS's facilities and are hyper-optimized for the DB use case?
All serious databases are doing their own low-level buffer pool management for relations&tuples, which bypass e.g. malloc, for a bunch of reasons. There are various techniques for this. The papers I link to in my comment refer to some recent ones.
My point is that, in the end, what a DB needs to do at this level is translate tuple or btree etc. node references into physical memory references: either by a fault that pulls them from disk, by a fault that maps a virtual address into the kernel as physical memory, or by a live reference, etc. And it feels to me that a lot of the logic there mirrors what already happens in the OS's VM subsystem. (For various reasons, the "automagic" facilities that do things like this in the OS -- file-backed mmap -- are a terrible way to implement a DB, BTW: https://db.cs.cmu.edu/mmap-cidr2022/)
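(To illustrate the pointer-swizzling piece: a hedged, simplified sketch of the idea from the LeanStore/Umbra line of work. The types and helper are made up; the real versions pin pages in a buffer pool:)

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct { char bytes[4096]; } Page;

    /* Illustration-only stub: the real version reads the page from disk
       and pins it in the buffer pool. */
    static Page *load_page(uint64_t pid) { (void)pid; return calloc(1, sizeof(Page)); }

    /* A "swip": if the low bit is set, the upper 63 bits are an on-disk
       page id; if clear, the whole value is an ordinary Page pointer
       (safe because heap pointers are at least 2-byte aligned). */
    typedef uint64_t swip_t;

    static Page *resolve(swip_t *s) {
        if (*s & 1) {                       /* cold reference           */
            Page *p = load_page(*s >> 1);   /* do the "fault" ourselves */
            *s = (uint64_t)p;               /* swizzle in place         */
        }
        return (Page *)*s;                  /* hot: plain pointer deref */
    }

    int main(void) {
        swip_t s = (42ULL << 1) | 1;   /* cold swip for page id 42     */
        return resolve(&s) ? 0 : 1;    /* first access loads + swizzles */
    }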
That, and of course at the persistent storage layer the data structures are tuned around physical device performance characteristics. Back in the day, to deal with the mechanics of heads & cylinders etc; these days, the interface and aspects of flash storage, write amplification, etc. Hence log-structured merge trees, b-epsilon-trees, etc. The persistent storage layer inside the DB looks very much like a filesystem. I mean, it kind of is one, but with a different concept of namespace (relations & tuples & versions etc instead of files and directories).
So, I dunno, I'd maybe rather drink straight from the spring rather than get a bottle of bubbly water at the restaurant. Or something something weird analogy.
This is my recollection as well. I was at a big Solaris shop when I cared enough about DBs to dig in this deep, but there were many, many discussions with the ops folks about whether this bypass was a good thing, and in the end the performance metrics were judged 'better', for values of better I don't accurately recall.
The database itself assumes the existence of a filesystem. The whole toolset from Oracle also includes a filesystem that you can install and use with the database, but the database itself cannot just work with a bare block device.
I have never used it but I've heard people talk about that as an essentially failed path. Admins apparently hated it, and the performance likely wasn't all that.
However, we are in a different era now. Virtualization and cloud infrastructure changes this whole situation quite a bit.
> On the whole, DBs already manage their own persistent storage
Nope. No, they don't. They use the filesystem. I think only MSSQL can be configured to use a bare block device, but this configuration is rarely used in practice.
Oracle, for example, implements its own filesystem, and it would use that if you configure it to do so, but the database itself expects there to be files, directories, etc.
You're missing the point, or misreading me. Yes DBs are using files in the filesystem, yes on the whole, but inside those files are datastructures (btrees, etc.) that are really self-managed storage.
Your filesystem uses tree data structures to map files and dirs to physical locations in block storage, and caches chunks of them, etc. A DB's storage layer does the same for relations, tuples, and indexes; including explicit optimization for various page sizes, perf characteristics of the underlying block device, etc.
Yes it's doing that, in turn, inside files, but that's quite different from how a "regular application" uses files. It's using quite little of the FS's value-add beyond it being a way to share tenancy with other things on the machine.
If e.g. DB used a file for every tuple ("row"), that would suck. If it relied completely on the OS's default sync and recovery facilities, that would also not be ideal.
Sorry, no, you are missing the point. Most popular databases cannot exist w/o filesystems. A filesystem isn't just files and directories; it's the guarantees you get on reads / writes / copy / delete / create, the cache behavior, the ownership, snapshots, deduplication, redundancy, compression, checksums, encryption... (obviously, not all filesystems do all of this, and some databases can already take on some of the functions of a filesystem).
Databases don't implement those features themselves, but will have to, if they decide to work with bare block devices.
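For concreteness, a hedged sketch of what "working with a bare block device" means at the syscall level; "/dev/sdb" is purely an assumption, and everything the filesystem used to guarantee now falls on the database:

    #define _GNU_SOURCE              /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        /* O_DIRECT bypasses the kernel page cache; all I/O must be
           aligned to the device's sector size. */
        int fd = open("/dev/sdb", O_RDWR | O_DIRECT);
        if (fd < 0) return 1;

        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return 1; }
        memset(buf, 0, 4096);

        /* From here on, block allocation, caching, torn writes,
           checksums and recovery are entirely the database's problem. */
        if (pwrite(fd, buf, 4096, 0) != 4096) { /* handle the failure */ }

        free(buf);
        close(fd);
        return 0;
    }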
The less there is between your application and the hardware, the better you can utilise it. There'd be basically no contention for resources; IO could go basically straight to the drive with less fsync-shenanigans, etc.
With our Linux-based unikernel you can make small adjustments to the application -- usually just a handful of lines of code -- and get big performance improvements, like 25% or so. We had a paper about this at Eurosys'23: https://dl.acm.org/doi/abs/10.1145/3552326.3587458
That's the problem with unikernels. There are papers about how great they are, but almost nobody is using them. If applications could easily get a 25% boost without much work, they would do it.
But nobody is using it. So I am not trusting the "benefits".
There are a ton of well known techniques in computer science that will make software faster, more reliable or better in other ways, that people don't use, yet really work.
I mean, if I had an easy-enough way to use them, I absolutely would.
Docker isn’t ideal, but it’s common, and the UX is decent enough. If a unikernel solution existed with near-enough quality of UX and deployment, I’d 100% switch.
> But nobody is using it.
The development community at large often chooses to use/not-use things (packaging solutions, tooling, engineering approaches, software architectures, etc) for a huge variety of reasons, not all of them based in any kind of logic or benefit. I suspect unikernels could become a viable solution for some things; they just need a "docker moment".
If you look at PostgreSQL configuration settings, you'll notice a chunk of them are dedicated to dealing with fsync. They even ship a dedicated tool, pg_test_fsync, to test how the different sync methods behave on your system.
fsync is a serious performance problem, especially in distributed storage systems. If you can get it under your control, you can definitely benefit from it.
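For readers who haven't hit this: the cost being tuned is the write-plus-flush pattern below. A hedged sketch (the file name is illustrative); databases amortize it by batching/grouping commits:

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Durability is write + flush: the write alone only reaches the page
       cache, and fsync forces it (and usually the drive cache) down to
       stable media. Databases pay this on every commit unless batched. */
    static int append_record(int fd, const void *rec, size_t len) {
        if (write(fd, rec, len) != (ssize_t)len) return -1;
        return fsync(fd);              /* the part everyone tunes around */
    }

    int main(void) {
        int fd = open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) return 1;
        int rc = append_record(fd, "commit\n", 7);
        close(fd);
        return rc ? 1 : 0;
    }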
I don't think there'll be much. Most modern relational databases rely on the underlying filesystem to manage the persistence aspect. Many cannot even work with a bare block device (I don't think PostgreSQL can). So, they need the OS to provide at least that.
However, in principle, if you could design a database to run as a unikernel, you could benefit from creating your own persistence layer. For instance, you might be able to completely, or to a large extent, avoid the problems created by fsync.
Another aspect you could aim for is becoming a real-time database, because you'd have control over memory allocation and thread context switching. This may not give any tangible benefits to the average database user, but in cases where being real-time is relevant (e.g. medical equipment), you'd certainly be able to expand your area of application.
So, Nanos can run quite a few databases today, and you should get the same sort of experience that running a database on any VM will produce (that is, everything running in the cloud).
You are absolutely correct, though, that most filesystems in use today were designed to utilize, or more appropriately put, deal with, various aspects of running on actual real hardware, as opposed to running on a VM. I think there is a ton of room for newer filesystems to emerge that are tuned for virtualized workloads instead of for hardware.
From my POV: security. Also, this particular implementation uses threads, so imo it automatically gets bonus points when dealing with the large-memory problems discussed on the linked mailing list thread.
However, from a much more, arguably, important POV - usability - if you look at the vast collection of docker-compose.yml files out there, a ton of them want postgres to run. Yes, you could run it on the side and run the rest as unikernels, but that breaks the UX of the compose functionality. It is such a small thing, but having it goes an insanely long way towards a better developer experience.
We now have compose-like support too, so you can take some app that spins up 10+ instances locally and spin up postgres inside alongside the others as well. So, usability/DX.
not sure how long it's going to take to get there but I think unikernels/library OS are the future. cloud computing is moving in the direction of lots of small tasks running in strong isolation and communicating over the network instead of via IPC and shared memory. there's a lot of surface area in traditional operating systems that is completely unused by most modern applications and just sits there consuming cycles and waiting to be attacked.
it’s the tooling that’s lacking now. Whoever gets it right could make a lot of money.
“Isomorphic to” in what sense? Surely you’re not claiming process-level isolation is as strong as a virtual machine. There are similar roles in both but the abstractions are pretty different.