I love it. We spent 30 years getting preemptive multitasking working, and it is such a giant fragile pain in the a that it is easier to just virtualize the whole damn computer rather than making different software play nice together (maybe "run well together" is more apropos).
We have now come full circle. People want DOS.
I have officially lived "long enough." Not too long, mind you, just long enough.
> it is easier to just virtualize the whole damn computer
IBM did this starting in the 1960s as a way of providing time-sharing. It became their VM/CMS OS product line (VM is the hypervisor, CMS a single-user guest OS, though any OS could be the guest, including VM itself).
CMS influenced the CP/M microcomputer operating system. So yes, it was not unlike DOS.
I've never thought that preemptive multitasking is "such a giant fragile pain in the a". I think it's pretty useful. What has been the problem with it?
We could probably do with partitioning the desktop though, but that's easily done with a bit more cgroups.
The biggest problem is that context switching overhead becomes the main "work" being done, rather than the actual work of the threads, for some core count n (which depends on workload).
You can already saturate an entire core just copying data from a NIC. Of course, enter BPF and all that jazz. Plus it doesn't play as nice with big.LITTLE multiprocessing (AMP - asymmetric multiprocessing).
edit: I should note I am not advocating getting rid of preemption entirely, because it is needed for at least one processor, but having a fixed quantum cause a context switch on application CPUs is highly disruptive, both from the operation itself and from blowing the cache all the time. Incidentally, that is one of the main reasons for green threads.
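To put a rough number on that disruption, here is a minimal C sketch (my own illustration, not from the thread) that uses getrusage(2) to count voluntary vs. involuntary context switches around a CPU-bound loop; the involuntary ones are exactly the quantum-expiry preemptions being discussed.

```c
#include <stdio.h>
#include <sys/resource.h>

static volatile unsigned long sink;

/* Busy work so the scheduler has plenty of chances to preempt us. */
static void spin(unsigned long iters) {
    unsigned long i;
    for (i = 0; i < iters; i++)
        sink += i;
}

int main(void) {
    struct rusage before, after;

    getrusage(RUSAGE_SELF, &before);
    spin(2000000000UL);  /* a few seconds of CPU-bound work */
    getrusage(RUSAGE_SELF, &after);

    /* ru_nivcsw: switches forced by the kernel (quantum expiry, higher
       priority task); ru_nvcsw: switches we asked for (blocking, yielding). */
    printf("voluntary switches:   %ld\n", after.ru_nvcsw  - before.ru_nvcsw);
    printf("involuntary switches: %ld\n", after.ru_nivcsw - before.ru_nivcsw);
    return 0;
}
```

Run it on an otherwise busy box and the involuntary count grows with load, even though the program never asks to be switched out.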
On modern systems, preemptive multitasking has a surprisingly high overhead. Much of your CPU time can be spent on context switching and the associated cache thrashing, which isn't productive. Furthermore, many kinds of macro-optimizations based on temporal locality stop working, because preemption can randomly inject badly chosen computation into your carefully crafted computational sequencing. Preemptive multitasking generally makes it more difficult to ensure robustness under load, because the default behavior of the system is that it will happily do the wrong thing at the wrong time under stress, since the OS has no idea what the right thing is, which can aggravate the situation. Obviously, none of this applies to systems that are typically under light-to-modest load.
For servers with hardware capable of very high bandwidth I/O, preemptive multitasking is a major drag on performance and scalability. Hence, "thread-per-core" software architectures are idiomatic for high-performance data infrastructure software these days. You don't just gain significant absolute performance; the behavior also degrades more gracefully under extreme load.
Like with all things, it depends on what you are doing. I've designed database engines both ways and Linux is not opinionated about such things.
So how do you schedule them now that they are VM threads contending for cores? This doesn't change that problem at all. The gains are in security (reducing complexity and attack surface) and packaging. However, it uses coarse isolation with a single-process model, so you would be compelled to isolate via many VM processes, which pushes the security/isolation management up to the container level and can incur a performance penalty.
You use the native cloud facilities available by spinning up a new instance per deploy. Whether you need a simple shared thread (the smallest instances available) or you want to tackle the 368-thread monsters available on something like GCloud, both are fine.
It's worth pointing out that if you run containers on public cloud you are inherently virtualized already - this actually removes layers instead of adding layers on.
Theoretically you are renting 100% of a virtual core (whatever fraction of a real core that is depends on the hardware you're scheduled on at that moment), but in practice you're sharing cycles at small core counts.
But anyway, if you spawn 8 instances each with 2 cores you're not getting any more compute than 16 cores on the same box. You could schedule them with cpuset to have 2 cores per process, or you could let the scheduler run. But there is no free lunch from the cloud.
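For illustration only (this is my sketch, not something from the comment above), the "2 cores per process" cpuset idea can also be expressed from inside a process with sched_setaffinity(2):

```c
#define _GNU_SOURCE   /* for CPU_SET / sched_setaffinity */
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);  /* allow core 0 */
    CPU_SET(1, &set);  /* allow core 1 */

    /* pid 0 means "the calling process"; a cgroup cpuset does the same
       thing from the outside, without touching the program. */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    puts("pinned to cores 0-1");
    return 0;
}
```

Either way, the total compute on the box is the same; the pinning only controls who contends with whom.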
I have no idea why you think running on N images reduces overhead.
Where do most of the container users deploy their containers? Would you say they are on the public cloud which is inherently virtualized to begin with?
A lot of people that have not deployed unikernels might not be aware of how they are actually deployed. When we deploy our websites and database and such to something like Google or AWS we upload a disk image. That disk image is then converted into a machine image. On AWS that would be an AMI. Then we boot that AMI and that AMI only contains your program. There is no underlying linux that you deploy and then you ssh into and install some k8s substitute.
So, yes, in comparison to launching a Linux instance on AWS or Google and then installing k8s or using nomad or something like that (which inherently has to duplicate networking, duplicate storage, duplicate security, and many other things), it reduces quite a lot. This doesn't even touch on the fact that many unikernels tend to run much faster for other reasons. We are just talking about not having to use things like underlay/overlay networks or storage proxies.
Again, the images that are deployed get managed by the cloud of choice - so your networking, storage, security, etc. is all managed by that cloud.
OSv has preemptive multi-tasking. Unikernels are more about having your application run in kernelspace, and recognizing that in distributed computing contexts, individual computers tend to be doing only one thing, and having them do other things reduces efficiency and increases complexity.
Docker is dead. (OK, to be fair, it's still in the process of dying).
As Docker doesn't solve isolation, you need to run Docker in VMs anyway, so people are actually starting to think about how to get rid of this unnecessary overhead. The foreseeable future is microVMs (the hype is just heating up, projects show up "everywhere"). And in the long run unikernels could become the "packaging format" for those microVMs (but they have to mature first, as ease of use and performance / stability are not there right now).
> but they have to mature first, as ease of use and performance / stability are not there right now
And this is what I have a hard time seeing. Unikernels had their hype moment several years ago, and nothing has happened in the meantime. If there is no big leap in tooling, lightweight virtualisation will remain a transparent implementation detail for the cloud vendors.
, which effectively makes it a Linux binary compatible unikernel (for more details about Linux ABI compatibility please read this doc). In particular OSv can run many managed language runtimes including JVM, Python 2 and 3, Node.JS, Ruby, Erlang, and applications built on top of those runtimes. It can also run applications written in languages compiling directly to native machine code like C, C++, Golang and Rust as well as native images produced by GraalVM and WebAssembly/Wasmer.
OSv can boot as fast as ~5 ms on Firecracker using as low as 15 MB of memory.
OSv can run on many hypervisors including QEMU/KVM, Firecracker, Xen, VMWare, VirtualBox and Hyperkit as well as open clouds like AWS EC2, GCE and OpenStack."
Solutions for coordinating work across multiple cores are many. Some are highly programmer-friendly and enable development of software that works exactly as if it were running on a single core. For example, the classic Unix process model is designed to keep each process in total isolation and relies on kernel code to maintain a separate virtual memory space per process. Unfortunately this increases the overhead at the OS level.
Software development challenges
[...]
Hardware has changed to the point where the assumptions originally made on small numbers of CPU cores are no longer valid.
Processes are extremely self-contained but have high overhead.
Threads impose additional coordination costs on both the programmer and the application infrastructure, and are notoriously difficult to debug.
Pure event-driven programming can result in codebases that are difficult to test and extend."
Effectively, event-driven programming puts the task of sorting out process scheduling into the hands of the developer, rather than the OS, which would otherwise have to guess at the best scheduling approach. Yes, there are also speed advantages from losing the context switch overhead, but fundamentally, I believe it is the programmer-controlled scheduling that makes event-driven apps faster, all other things being held equal.
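As a rough, hypothetical sketch of what "programmer-controlled scheduling" looks like in practice: a single-threaded epoll loop decides in application code which ready descriptor gets serviced next and for how long, instead of the kernel preempting threads at arbitrary points. (handle_ready and the choice of stdin are placeholders, not anything from the comment.)

```c
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_EVENTS 64

/* Placeholder for whatever per-descriptor work the application does;
   the point is that *we* pick the order and how long to run before
   returning to the loop. */
static void handle_ready(int fd) { (void)fd; }

int main(void) {
    int epfd = epoll_create1(0);
    if (epfd < 0) { perror("epoll_create1"); return 1; }

    /* Register stdin just so the example is self-contained. */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = STDIN_FILENO };
    epoll_ctl(epfd, EPOLL_CTL_ADD, STDIN_FILENO, &ev);

    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        if (n < 0) { perror("epoll_wait"); break; }
        for (int i = 0; i < n; i++)       /* application-chosen order */
            handle_ready(events[i].data.fd);
    }
    close(epfd);
    return 0;
}
```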
But I think that the Linux kernel is more battle-tested than this. Also, a compiled Linux kernel can weigh 6.7 MB or even less.
What I am thinking is that we can compile the Linux kernel, the drivers needed for the particular device, and the runtime and application, and deploy on a bare machine. That would be a monolithic approach too.
Further, I think that unikernel applications are more helpful if the application takes advantage of features of the underlying kernel and vice versa. I also think that the unikernel approach is not much help if the unikernel acts like a shim layer to run other applications. Briefly, I am saying that unikernel applications must be built from the unikernel at the bottom to the application at the top, all integrated.
So, how is this unikernel better than the Linux-based monolithic deployment that I mentioned above? What about security updates in this unikernel?
> What I am thinking is that we can compile the Linux kernel, the drivers needed for the particular device, and the runtime and application, and deploy on a bare machine. That would be a monolithic approach too.
Interesting. Would like to see some compatibility matrix against gVisor/runsc. I know they're not the same thing, but they attempt to solve the same problem (lightweight VMM-based isolation providing a Linux ABI).
Is there a summary list of supported syscalls somewhere?
However, it does directly deal with container security issues. If you absolutely must use containers, you probably should be using something like that, as containers inherently weaken security architecturally.
You've got it backwards. Containers strengthen security. Container mgmt runtimes executing as root cause security problems, like the setuid programs of old.
Fuck me dead, I can't stand you HN losers who can't manage to let someone else have an opinion, and yet can't defend their own. If you want to disagree, say why and I'll happily read your arguments and possibly tear them to pieces with words and not the coward's down arrow.
Yeh, I beg to differ. Containers have an absolutely horrible security record. Just in the past few weeks we've seen yet another container breakout, an "un-fixable" k8s 'issue', another cryptojacker exclusively attacking container infrastructure and more.
The huge problem with "containers" is that they break well-known security boundaries by sharing the same kernel. This is further exacerbated by software like k8s, which allows an attacker to extend their attack as soon as they compromise a single instance.
Compare the following situations: (A) two processes running in the same operating system; (B) two processes running inside containers in the same operating system. How exactly is situation B less safe than A?
You are instead comparing B with (C): two VMs, each running a process.
Whether k8s / docker or VMWare Workspace / custom Xen etc is worse is another question. In my limited experience they are all terrible. But that is distinct from containerisation (confinement at the syscall layer + tooling).
You are quite correct - I am comparing multiple programs running in different vms - which is the very very common situation almost every single company finds themselves in simply because there is too much software out there today.
Even the most basic case of a database talking to an app server typically runs in more than one VM, if for no other reason than that it is easier to manage and easier to isolate performance issues; however, there are obviously many, many other reasons as well - does it shard? does it replicate? Most websites aren't exposed to the world directly - they sit behind a load balancer because a single app can't take the entire load.
Most companies I know have workloads that don't just span one virtual machine - they span tens or hundreds or thousands, or in the case of the hyperscalers hundreds of thousands. Most of the container users out there live on virtual machines! I can't emphasize this point enough.
I'm not arguing for the hobbyist at home. I'm arguing for the companies that on the low end spend tens of thousands of dollars a month on cloud infrastructure. Just as a case in point - Lyft alone was spending something like $80M a year on cloud infrastructure.
One of the key points behind unikernels is that we don't live in the 1990s anymore - Zeus the database server doesn't live on the same server as Mars the webserver anymore. Multi-process/single-server architecture makes absolutely no sense in the 2020s.
It's almost 2021. Things have changed. Our operating systems must change too.
Interesting project. Just wondering, is there any ongoing effort to allow users to build a minimal Linux kernel that is just enough to support a specified application?
I did once spend a few boring hours on a plane on a genetic-algorithm-based optimiser that (tried to) minimise a Linux image (I believe I used Gentoo), eliminating packages first, followed by files, segments, and individual lines of code.
From what I remember, (a) you can probably get to an extremely low size, (b) it won't be worth it compared to similar efforts taking a different route and getting 90 % there, because (c) your test will never be sufficient for any real-world usage: I remember going through some of the codepaths that had been disabled, and it included handling of specific dates, any unicode character I had not used, handling of networking and disk access failures that I did not even know of, etc.
I suppose you could profile your app, watching the syscalls it makes over a full cycle, and then build a stripped kernel with nothing but those calls implemented. Any drivers not touched would also be removed. That would not eliminate the syscall overhead, but it would certainly minimize boot time and improve security somewhat by reducing the attack surface.
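A userspace cousin of that idea, sketched below on the assumption that libseccomp is available (link with -lseccomp): profile the app (e.g. with strace -c), then load a seccomp filter that allows only the syscalls it was seen to use. This doesn't shrink the kernel itself, but it cuts the attack surface in a similar spirit; the allow-list here is hypothetical.

```c
#include <seccomp.h>
#include <unistd.h>

int main(void) {
    /* Default action: kill the process on any syscall not allowed below. */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
    if (!ctx) return 1;

    /* Hypothetical allow-list gathered from profiling the application. */
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

    if (seccomp_load(ctx) != 0) { seccomp_release(ctx); return 1; }

    const char msg[] = "still allowed to write\n";
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);
    return 0;  /* exit_group is on the allow-list, so this exits cleanly */
}
```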
Unsurprisingly, they benchmarked against the latest that was available at the time. It's not some nefarious trick to make the benchmarks look good. We can see that the heyday of OSv development work was about 7 years ago:
Hopefully, OSv is a stable product that continues to work well, so heavy investment in development hasn't been needed on an ongoing basis, primarily just maintenance work. Or, it could be abandonware... but the README gives the impression that it is still being maintained.
I would still be hesitant to deploy such a deeply invasive solution, but the idea is interesting.
> We can see that the heyday of OSv development work was about 7 years ago
Yeah, the startup that developed this ended up pivoting to a different product instead, ScyllaDB.
I think they found that while OSv is cool technology, there wasn't enough market demand to make a successful business out of it.
BEA used to have a product called JRockit Virtual Edition (JRVE). It was a thin operating system designed to only run one process, the JVM. So conceptually quite similar to this, but closed source. Then Oracle bought BEA, and after a while (not straight away after the acquisition), Oracle killed it. Rumour has it that the Linux and Solaris teams were opposed to the idea of Oracle providing a third operating system with far fewer features than theirs, which is part of why it was killed. I don't think it was ever hugely popular with customers either.
(Disclaimer: Used to work for Oracle, but not directly on the products mentioned.)
Here's a post from one of the OSv developers on performance compared to Linux in 2020 [1]. It paints quite a different picture compared to those benchmarks from 2013.
>While ideally a unikernel like OSv could provide better performance than traditional kernels because of things like lower system call and context switch overhead, less locking, and other things, there are many things working against this ideal, and resulting in disappointing performance comparisons:
1. Most modern high-performance software has evolved on Linux, and evolved with its limitations in mind.
So for example, if Linux's context switches are slow, application developers start writing software which lowers the number of context switches - even to the point of just one thread per core. If Linux's system calls are slow, application developers start to batch many operations in one system call, starting with epoll() some 20 years ago, and culminating with io_uring recently introduced to Linux. With these applications, it is pointless to speed up system calls or context switches, because these take a tiny percentage of the runtime.
2. They say a chain is as weak as its weakest link.
This is even more true in many-core performance (due to Amdahl's law). Complex software uses many, many OS features. If OSv speeds up context switches and system calls and networking (say) by 10%, but then some other thing the software does is 2 times slower than in Linux, it is very possible that OSv's overall performance will be lower than Linux's. Unfortunately, this is exactly what we saw a few years ago when ScyllaDB (then "Cloudius Systems") was actively benchmarking and developing OSv: many benchmarks we tried were initially slower in OSv than in Linux. When we profiled what happened, we discovered that although many things in OSv were better than in Linux, one (or a few) specific things were significantly less efficient in OSv than in Linux. This could be some silly filesystem feature we never thought was very important but was very frequently used in this application, or it could be that OSv's scheduler wasn't as clever as Linux's at handling this specific use case. It could be that some specific algorithm was lock-free in Linux but uses locks in OSv, so it becomes increasingly worse on OSv the more CPUs you have. The main point is that if an application uses 100 different OS features - Linux had hundreds of developers optimizing each of these 100 features for years. The handful of OSv developers focused on specific features and made clever improvements to them - but the rest are probably less optimized than in Linux.
3. Many-core development is hot
When the OSv project started 7 years ago, it was already becoming clear that many-core machines were the future, but it wasn't as obvious as it is today. So OSv could get some early wins by developing some clever lock-reducing improvements to its networking stack and other places. But the Linux developers are not idiots, and spent the last 7 years improving Linux's scalability on many-core systems. And they went further than OSv ever got - they support NUMA configurations, multi-queue network cards, and a plethora of new ideas for improving scalability on many core systems. On modern many-core, multi-socket, multi-queue-network-card systems, there is a high chance that Linux will be faster than OSv.
4. Posix API is slow
This is related to the first issue (of software having evolved on Linux), but this time for more "traditional" software rather than state-of-the-art applications using the latest fads like io_uring. This traditional software is using the Posix API - filesystem, networking, memory handling, etc. - that was designed decades ago, and makes various problematic guarantees. Just as one example, the possibility to poll the same file descriptor from many threads requires a lock every time this file descriptor is used. This slows down both Linux and OSv, but doesn't give OSv any advantage over Linux, because both need to correctly support the same slow API. Or even the contrary - Linux and its hundreds of developers continue to come up with clever tricks for each of these details (e.g., use RCU instead of locks for file descriptors) while OSv's few (today, very few) developers only had time to optimize a few specific cases.
So it's no longer clear that if raw performance is your goal, OSv is the right direction. OSv can still be valuable for other reasons - smaller self-contained images, smaller code base, faster boot, etc. For raw performance, our company (ScyllaDB) went in a different direction: the Seastar library (http://nadav.harel.org.il/seastar/, https://github.com/scylladb/seastar) allows writing high-performance applications on regular Linux, by avoiding or minimizing all the features that make Linux slow (like context switches) or cause scalability problems on modern many-core machines. Initially Seastar ran on both Linux and OSv (with identical performance, because it avoided the heavy parts of both), but unfortunately today it uses too many new Linux features which don't work on OSv - so it no longer runs on OSv.
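To make the "batch many operations in one system call" point from item 1 of the quoted post concrete, here is a minimal liburing sketch (my own illustration, assuming liburing is installed and linking with -luring; not from the quoted post) that queues several reads and submits them all with a single submission:

```c
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>

#define BATCH 4

int main(void) {
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) != 0) return 1;

    int fd = open("/etc/hostname", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    static char bufs[BATCH][64];
    for (int i = 0; i < BATCH; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, bufs[i], sizeof(bufs[i]), 0);
    }

    /* One submission covers all queued reads, instead of BATCH syscalls. */
    io_uring_submit(&ring);

    for (int i = 0; i < BATCH; i++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("read %d bytes\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    return 0;
}
```

With batching like this, shaving a microsecond off each individual system call stops mattering much, which is the post's point about why a faster syscall path alone doesn't win benchmarks.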