> QEMU is often the subject of bugs affecting its reliability and security.
{{citation needed}}?
When I ran the numbers in 2019, there hadn't been guest exploitable vulnerabilities
that affected devices normally used for IaaS for 3 years. Pretty much every cloud outside the big three (AWS, GCE, Azure) runs on QEMU.
Here's a talk I gave about it that includes that analysis:
slides - https://kvm-forum.qemu.org/2019/kvmforum19-bloat.pdf
video - https://youtu.be/5TY7m1AneRY?si=Sj0DFpRav7PAzQ0Y
> When I ran the numbers in 2019, there hadn't been guest exploitable vulnerabilities that affected devices normally used for IaaS for 3 years.
So there existed known guest-exploitable vulnerabilities as recently as 8 years ago. Maybe that, combined with the fact that QEMU is not written in Rust, is what is causing Oxide to decide against QEMU.
I think it's fair to say that any sufficiently large codebase originally written in C or C++ has memory safety bugs. Yes, the Oxide RFD author may be phrasing this using weasel words; and memory safety bugs may not be exploitable at a given point in a codebase's history. But I don't think that makes Oxide's decision invalid.
That would be a damn good record though, wouldn't it? (I am fairly sure that more have been found since, but the point is that these are pretty rare). Firecracker, which is written in Rust, had one in 2019: https://www.cve.org/CVERecord?id=CVE-2019-18960
Also QEMU's fuzzing is very sophisticated. Most recent vulnerabilities were found that way rather than by security researchers, which I don't think is the case for "competitors".
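(For readers who haven't seen one: a coverage-guided fuzz target is just a tiny entry point that feeds mutated input into the code under test. QEMU's actual device fuzzer is C, built on qtest and libFuzzer; the sketch below only shows the general shape of such a harness, written in Rust with the libfuzzer-sys crate and a made-up `parse_header` function standing in for real device code, and assumes it runs under cargo-fuzz.)

```rust
// Illustrative only: QEMU's real device fuzzer is C code built on qtest and
// libFuzzer. This shows the general shape of a coverage-guided fuzz target in
// Rust, assuming a cargo-fuzz project with the libfuzzer-sys crate.
#![no_main]
use libfuzzer_sys::fuzz_target;

// Hypothetical parser standing in for the device/packet parsing code under
// test: a version byte followed by a big-endian payload length.
fn parse_header(data: &[u8]) -> Option<(u8, u16)> {
    let version = *data.first()?;
    let len = u16::from_be_bytes([*data.get(1)?, *data.get(2)?]);
    (usize::from(len) <= data.len().saturating_sub(3)).then_some((version, len))
}

fuzz_target!(|data: &[u8]| {
    // libFuzzer mutates `data`, guided by code coverage; any panic or
    // sanitizer finding is reported as a crash with a reproducing input.
    let _ = parse_header(data);
});
```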
You're not wrong, and that is very impressive. There's nothing like well-applied fuzzing to improve security.
But I still don't think that makes Oxide's decision or my comment necessarily invalid, if only because of an a priori decision to stick with Rust system-wide -- it raises the floor on software quality.
1. Oxide made an unproven statement ("QEMU is often the subject of bugs affecting its reliability and security.")
2. OP (bonzini) has given specific and valid arguments that that statement is wrong.
3. You're not answering those specific arguments, but defending Rust and bashing C++ in general without offering any proof.
4. bonzini again provides specific arguments that your generalization is not correct in that context: despite being written in Rust, Firecracker had a security issue.
5. You still insist without giving any solid argument. You just insist that Rust is superior. Not helpful in any discussion. Think about it.
You're right on (1) and (2) - Oxide used weasel words when explaining the decision. My point is that their poor explanation doesn't necessarily mean it was the wrong decision. A bad defense attorney doesn't imply the defendant committed the crime.
I'm not bashing C++ beyond saying "any sufficiently large codebase originally written in C or C++ has memory safety bugs". I did not say those bugs are exploitable, just that they're present.
I'm also not insisting Rust is superior, except to say that it raises the floor of software quality, because it nearly eliminates a class of memory safety bugs.
Do you disagree? Neither of those statements implies C++ sucks or that Rust is awesome. Just 2 important data points (among many others) to consider in whatever context you're writing code in.
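To make the "raises the floor" point a bit more concrete, here is a minimal sketch (a toy example of my own, not code from QEMU or Firecracker) of the class of bug in question: in safe Rust an out-of-bounds access is either rejected or turns into a deterministic panic, rather than silent memory corruption.

```rust
fn main() {
    let buf = [0u8; 4];
    let i = 7; // imagine this index is attacker-influenced

    // In C, `buf[i]` here would be undefined behavior and could silently read
    // or write adjacent memory. In safe Rust every slice access is bounds
    // checked, so the same mistake is either rejected explicitly...
    match buf.get(i) {
        Some(byte) => println!("buf[{i}] = {byte}"),
        None => println!("index {i} rejected: out of bounds"),
    }

    // ...or, with direct indexing, stops the program with a panic instead of
    // corrupting memory:
    // let _ = buf[i]; // would panic: index out of bounds
}
```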
I believe TypeScript and Rust are both strong examples of languages that do this (for different reasons and in different ways).
It's also possible for a language to raise the ceiling of software quality, and Zig is an excellent example.
I'm thinking of "floors" and "ceilings" as the outer bounds of what happens in real, everyday life within particular software ecosystems in terms of software quality. By "quality" I mean all of capabilities, performance, and absence of problems.
It takes a team of great engineers (and management willing to take a risk) to benefit from a raised ceiling. TigerBeetle[0] is an example of what happens when you pair a great team, great research, and a high-ceiling language.
> possible for a language to raise the ceiling of software quality
Cargo is widely recognized as low quality. The thesis fails within its own standard packaging. It's possible for a language to be used by _more people_, and thus raise the quality _in aggregate_ of produced software, but the language itself has no bearing on quality in any objective measure.
> to benefit from a raised ceiling
You're explicitly putting the cart before the horse here. The more reasonable assertion is that it takes good people to get good results regardless of the quality of the tool. Acolytes are uncomfortable saying this because it also destroys the negative case, which is that it would be impossible to write quality software in a previous-generation language.
> TigerBeetle[0] is an example
Of a protocol and a particular implementation of that protocol. It has client libraries in multiple languages. This has no bearing on this point.
Other than random weirdos who think allowing dependencies is a bad practice because you could hurt yourself, while extolling the virtues of undefined behavior - I've never heard much serious criticism of it.
Other software providing the same features produces better results for those users. Its dependency management is fundamentally broken and causes builds to be much slower than they could otherwise be. The lack of namespaces is a lesson that was well learned before the first line of Cargo was ever written.
I could go on.
> evidence of this "wide regard"
We are on the internet. If you doubt me you can easily falsify this yourself. Or you could discover something you've been ignorant of up until now. Try "rust cargo sucks" as a search motif.
> random weirdos
Which may or may not be true, but you believe it, and yet you use your time to comment to us. This is more of a criticism of yourself than of me; however, I do appreciate your attempt to be insulting and dismissive.
I'm not attempting to insult you; I didn't know you held such a hypocritical position. Sorry that pointing out how weird it is for someone working in a field so dependent on logic to hold such a self-contradictory position insults you. Maybe instead of weird I should use the words unusual and unexpected. My bad.
You're right, I'm being dismissive of weasely unbacked claims of "wide regard". It's very clear now that you can't back your claim and I can safely ignore your entire argument as unfounded. Thanks for confirming!
> The more reasonable assertion is that it takes good people to get good results regardless of the quality of the tool.
> Acolytes are uncomfortable saying this because it also destroys the negative case, which is, it would be impossible to write quality software in a previous generation language.
Not impossible, just a lot harder. It's as if you're thinking in equations that are true/false, while I'm thinking in statistical distributions.
Have you used Macintosh System 5? How about Windows 3.1? Those were considered quality systems at the time, but standards are up, way up since then.
Why are modern systems better? Is it because we have better developers today? -- I don't think so. It took a "real" programmer to write quality apps in Pascal for early Macintosh systems, or apps in C for Windows 3.1.
I think the difference is in the tooling that is available to us -- and modern programming languages (and libraries) are surely a very large part of that tooling.
If you disagree, I challenge you to find a seasoned modern desktop app developer who can write a high-quality app for MacOS or Windows that looks and functions great by modern standards and doesn't use any modern languages or directly invoke any non-vendor libraries built after the year 2000. It's possible[0]. They may be able to do it, but you must certainly concede that doing a great job requires a much better developer than the average modern desktop app developer to be able to work well under those kinds of constraints.
That's what I mean by "raising the floor" -- all software gets better when languages, libraries, and tooling improve.
I do think that only having one CVE in six years is a pretty decent record, especially since that vulnerability probably didn't grant arbitrary code execution in practice.
Rust is an important part of how Firecracker pulls this off, but it's not the only part. Another important part is that it's a much smaller codebase than QEMU, so there are fewer places for bugs to hide. (This, in turn, is possible in part because Firecracker deliberately doesn't implement any features that aren't necessary for its core use case of server-side workload isolation, whereas QEMU aims to be usable for anything that you might want to use a VM for.)
Why is it the only people who say this at all are people saying it sarcastically or quoting fictional strawmen (and can never seem to provide evidence of it being said in earnest)?
If they are being precise, then “reliability and security” means something different than “security and reliability”.
How many reliability bugs has QEMU experienced in this time?
The manpower to go on-site and deal with in-the-field problems could be crippling. You often pick the boring problems for this reason. High touch is super expensive. Just look at Ferrari.
>Pretty much every cloud outside the big three (AWS, GCE, Azure) runs on QEMU.
QEMU typically uses KVM for the hypervisor, so the vulnerabilities will be KVM anyway. The big three all use KVM now. Oxide decided to go with bhyve instead of KVM.
No, QEMU is a huge C program which can have its own vulnerabilities.
Usually QEMU runs heavily confined, but remote code execution in QEMU (remote = "from the guest") can be a first step towards exploiting a more serious local escalation via a kernel vulnerability. This second vulnerability can be in KVM or in any other part of the kernel.
AWS uses KVM in the kernel but they have a different, non-open source userspace stack for EC2; plus Firecracker which is open source but is only used for Lambda, and runs on EC2 bare metal instances.
Google also uses KVM with a variety of userspace stacks: a proprietary one (tied to a lot of internal Google infrastructure but overall a lot more similar to QEMU than Amazon's) for GCE, gVisor for AppEngine or whatever it is called these days, crosvm for ChromeOS, and QEMU for Android Emulator.
It could be that the transition isn't complete and is still tied to specific machine types, or there's something they've done to make it still report to the guest that it's Xen-based for compatibility reasons.
I think some older instance types are still on Xen, later types run KVM (code-named Nitro, perhaps?). I can't remember the exact type, but last year we ran into some weird issues related to a kernel regression that only affected some instances in our fleet; it turns out they were all the same type and apparently ran on Xen according to AWS support.
unless something has changed in the past year, fargate still runs each task in a single use ec2 vm with no further isolation around containers in a task.
QEMU can use a number of different hypervisors, KVM and Xen being the two most common ones. Additionally it can also emulate any architecture if one would want/need that.
You can of course assume that all of them heavily customize the underlying implementation for their own needs and for their own hardware. And then they have stuff like Firecracker, gVisor, etc. layered on top depending on the product line.
Instead of stating more or less irrelevant reasons, I'd prefer to read something like "I am (have been?) one of the core maintainers and know Illumos and Bhyve, so even if there were 'objectively' better choices, our familiarity with the OS and hypervisor trumps that". An "I like $A, always use $A and have experience using $A" is almost always a better argument than "$A is better than $B because $BLA", because the latter doesn't tell me anything about the depth of knowledge of using $A and $B or the knowledge of the subject of the decision - there is a reason half of Google's results are some kind of "comparison" spam.
But everyone at Oxide already knows that back story. At least if you list some other reasons, you can have a discussion about technical merits if you want to.
But that doesn't make sense if you have specialists for $A that also like to work with $A.
Why should I as a customer trust Illumos/Bhyve developers that are using Linux/KVM instead of "real" Linux/KVM developers? The only thing that such a decision would tell me is to not even think about using Illumos or Bhyve.
The difference between
"Buy our Illumos/Bhyve solution! Why? I have been an Illumos/Bhyve Maintainer!"
and
"Buy our Linux/KVM solution! Why? I have been an Illumos/Bhyve Maintainer!"
> Of course, but that is less of unique selling point.
Who cares about uniqueness? That's not a goal.
> If you are selling Bhyve you better say that whether it's true or not. So why should I, as a reader or employee or customer, trust them?
They are not selling Bhyve. This is an internal document. Their customers don't care about the implementation details. And if they do, then they will do their own evaluation.
As an employee you trust it because you know how the company hires and who wrote these RFDs.
As a reader, it's literally like any other thing on the internet.
But Bryan also ported KVM to Illumos. And Joyent used KVM and supported it there for years; I assume Bryan knows more about KVM than bhyve, as he seemed very hands-on in the implementation (there is a nice talk on YouTube). So the idea that he isn't familiar with KVM doesn't hold. On that basis, between KVM and bhyve on Illumos, KVM would suggest itself.
In the long term, if $A is actually better than $B, then it makes sense to start with $A even if you don't know $A. Because if you are trying to build a company that is hopefully making billions in revenue in the future, then the long term matters a great deal.
Now the question is whether you can objectively figure out if $A or $B is better, and how much time it takes to figure that out. Familiarity of the team is one consideration, but not the most important one.
Trying to be objective about this, instead of just saying 'I know $A' seems quite like a smart thing to do. And writing it down also seems smart.
In a few years you can look back and actually say, was our analysis correct, if no what did we misjudge. And then you can learn from that.
If you just go with familiarity, you are basically saying "our failure was predetermined, so we did nothing wrong", even when you clearly did go wrong.
For what it's worth, we at _Joyent_ were seriously investing in bhyve as our next generation of hypervisor for quite a while. We had been diverging from upstream KVM, and most especially upstream QEMU, for a long time, and bhyve was a better fit for us for a variety of reasons. We adopted a port that had begun at Pluribus, another company that was doing things with OpenSolaris and eventually illumos, and Bryan led us through that period as well.
Improvements and fixes to illumos bhyve are almost entirely done in upstream illumos-gate, rather than the Oxide downstream.
Upstreaming those changes into FreeBSD bhyve is a more complicated situation, given that illumos has diverged from upstream over the years due to differing opinions about certain interfaces.
Yes, my personal goal is to ensure that basically everything we do in the Oxide "stlouis" branch of illumos eventually goes upstream to illumos-gate where it filters down to everyone else!
> Trying to be objective about this... And writing it down also seems smart.
Mosdef.
IIRC, these RFDs are part of Oxide's commitment to FOSS and radical openness.
Whatever decision is ultimately made, for better or worse, having that written record allows the future team(s) to pick up the discussion where it previously left off.
Working on a team that didn't have sacred cows, an inscrutable backstory ("hmmm, I dunno why, that's just how it is. if it ain't broke, don't fix it."), and gatekeepers would be so great.
While it's fair to say this does describe why Illumos was chosen, the actual RFD title is not presented here; it is about the host OS + virtualization software choice.
Even if you think it's a foregone conclusion given the history of bcantrill and other founders of Oxide, there absolutely is value in putting the decision to paper and trying to provide a rationale, because then it can be challenged.
The company I co-founded does an RFD process as well and even if there is 99% chance that we're going to use the thing we've always used, if you're a serious person, the act of expressing it is useful and sometimes you even change your own mind thanks to the process.
> […] Speaking only for us (I work for Joyent), we have deployed hundreds of thousands of zones into production over the years -- and Joyent was running with FreeBSD jails before that […]
And I’ve seen some other primary sources (people who worked at Joyent) write that online too.
And Bryan Cantrill, and several other people, came from Sun Microsystems to Joyent. Though I've never seen it mentioned which order that happened in; was it people from Sun who joined Joyent, and then Joyent switched from FreeBSD to Illumos and created SmartOS? Or had Joyent already switched to Illumos before the people who came from Sun joined?
I would actually really enjoy a long documentary or talk from some people that worked at Joyent about the history of the company, how they were using FreeBSD and when they switched to Illumos and so on.
Joyent also merged with TextDrive, which is where the FreeBSD part came from. TextDrive was an early Rails host, and could even do it in a shared hosting environment, which is where I think a lot of the original user base came from (also TextPattern)
As I recall they were also the original host of Twitter, which was Rails back in the day.
KVM got more and more integrated with the rest of Linux as more virtualization features became general system features (e.g. posted interrupts). Also Google and Amazon are working more upstream and the pace of development increased a lot.
Keeping a KVM port up to date is a huge effort compared to bhyve, and they probably had learnt that in the years between the porting of KVM and the founding of Oxide.
Yeah, I came here to say that Bryan worked at Sun, so why do they even need to write this post (yes, I appreciate the technical reasons, just wanted to highlight the fact via a subtle dig :-))
Linux has a rich ecosystem, but the toolkit is haphazard and a little shaky. Sure, everyone uses it, because when we last evaluated our options (in like 2009) it was still the most robust solution. That may no longer be the case.
Given all of that, and taking into account building a product on top of it, and thus needing to support it and stand behind it, Linux wasn't the best choice. Looking ahead (in terms of decades) and not just shipping a product now, it was found that an alternate ecosystem existed to support that.
Culture of the community, design principles, maintainability are all things to consider beyond just "is it popular".
> Xen: Large and complicated (by dom0) codebase, discarded for KVM by AMZN
1. Xen Type-1 hypervisor is smaller than KVM/QEMU.
2. Xen "dom0" = Linux/FreeBSD/OpenSolaris. KVM/bhyve also need host OS.
3. AMZN KVM-subset: x86 cpu/mem virt, blk/net via Arm Nitro hardware.
4. bhyve is Type-2.
5. Xen has Type-2 (uXen).
6. Xen dom0/host can be disaggregated (Hyperlaunch), unlike KVM.
7. pKVM (Arm/Android) is smaller than KVM/Xen.
> The Service Management Facility (SMF) is responsible for the supervision of services under illumos.. a [Linux] robust infrastructure product would likely end up using few if any of the components provided by the systemd project, despite there now being something like a hundred of them. Instead, more traditional components would need to be revived, or thoroughly bespoke software would need to be developed, in order to avoid the technological and political issues with this increasingly dominant force in the Linux ecosystem.
Is this an argument for Illumos over Linux, or for translating SMF to Linux?
> Is this an argument for Illumos over Linux, or for translating SMF to Linux?
I'd certainly like that! I had spent some time working with Solaris a lifetime ago, and ran a good amount of SmartOS infrastructure slightly more recently. I really enjoyed working with SMF. I really do not enjoy working with the systemd sprawl.
I've been using Xen in production for at least 18 years, and although there has been some development, it is extremely hard to get actual documentation on how to do things with it.
There is no place documenting how to integrate Dom0less/Hyperlaunch into a distribution or how to build infrastructure with it; at best you will find a GitHub repo, with the last commit dated 4 years ago, with little to no information on what to do with the code.
> github repo, with the last commit dated 4 years ago
Some preparatory work shipped in Xen 4.19.
Aug 2024 v4 patch series [1] + Feb 2024 repo [2] has recent dev work.
> hard to get actual documentation
Hyperlaunch: this [3] repo looks promising, but it's probably easier to ask for help on xen-devel and/or trenchboot-devel [4]. Upstream acceptance is delayed by competing boot requirements for Arm, x86, RISC-V and Power.
Talking about "technological and political issues" without mentioning any, or without mentioning which components would need to be revived, sounds a lot like FUD unfortunately. Mixing and matching traditional and systemd components is super common, for example Fedora and RHEL use chrony instead of timesyncd, and NetworkManager instead of networkd.
> Talking about "technological and political issues" without mentioning any
I don't know why you think none were mentioned - to name one, they link a GitHub issue created against the systemd repository by a Googler complaining that systemd is inappropriately using Google's NTP servers, which at the time were not a public service, and kindly asking for systemd to stop using them.
This request was refused and the issue was closed and locked.
Behaviour like this from the systemd maintainers can only appear bizarre, childish, and unreasonable to any unprejudiced observer, putting their character and integrity into question and casting doubt on whether they should be trusted with the maintenance of software so integral to at least a reasonably large minority of modern Linux systems.
Using pool.ntp.org requires a vendor zone. systemd does not consider itself a vendor, it's the distros shipping systemd which are the vendor and should register and use their own vendor zone.
I don't care about systemd either way, but your own false representation of facts makes your last paragraph apply to your "argument".
The Oxide folks are rather vocal about their distaste for the Linux Foundation. FWIW I think they went with the right choice for them, considering they'd rather sign up for maintaining the entire thing themselves than saddle themselves with the baggage of a Linux fork or upstreaming.
But do they? Oxide targets the enterprise, and people there don't care that much about how the underlying OS works. It's been ten years since a RHEL release started using systemd and there has been no exodus to either Windows or Illumos.
I don't mean FUD in a disparaging sense, more like literal fear of the unknown causing people to be excessively cautious. I wouldn't have any problem with Oxide saying "we went for what we know best", there's no need to fake that so much more research went into a decision.
Exactly, then why would they be dragged into systemd-or-not-systemd discussion? If you want to use Linux, use either Debian or the CentOS hyperscaler spin (the one that Meta uses) and call it a day.
I am obviously biased as I am a KVM (and QEMU) developer myself, but I don't see any other plausible reason other than "we know the Illumos userspace best". Founder mode and all that.
As to their choice of hypervisor, to be honest KVM on Illumos was probably not a great idea to begin with, therefore they used bhyve.
FWIW, founder mode didn't exist five years ago when we were getting started! More seriously, though, this document (which I helped write) is an attempt specifically to avoid classic FUD tropes. It's not perfect, but it reflects certain aspects of my lived experience in trying to get pieces of the Linux ecosystem to work in production settings.
While it's true that I'm a dyed in the wool illumos person, being in the core team and so on, I have Linux desktops, and the occasional Linux system in lab environments. I have been supporting customers with all sorts of environments that I don't get to choose for most of my career, including Linux and Windows systems. At Joyent most of our customers were running hardware virtualised Linux and Windows guests, so it's not like I haven't had a fair amount of exposure. I've even spent several days getting SCO OpenServer to run under our KVM, for a customer, because I apparently make bad life choices!
As for not discussing the social and political stuff in any depth, I felt at the time (and still do today) that so much ink had been spilled by all manner of folks talking about LKML or systemd project behaviour over the last decade that it was probably a distraction to do anything other than mention it in passing. As I believe I said in the podcast we did about this RFD recently: I'm not sure if this decision would be right for anybody else or not, but I believe it was and is right for us. I'm not trying to sell you, or anybody else, on making the same calls. This is just how we made our decision.
Founder mode existed, it just didn't have a catchy name. And I absolutely believe that it was the right choice for your team, exactly for "founder mode" reasons.
In other words, I don't think that the social or technological reasons in the document were that strong, and that's fine. Rather, my external armchair impression is simply that OS and hypervisor were not something where you were willing to spend precious "risk points", and that's the right thing to do given that you had a lot more places that were an absolute jump in the dark.
I would agree with that. Given the history of the Oxide team, they chose what they viewed was the best technology for THEM, as maintainers. The rest is mostly justification of that.
That's just fine, as long as they're not choosing a clearly inferior long term option. The technically superior solution is not always the right solution for your organization given the priorities and capabilities of your team, and that's just fine! (I have no opinion on KVM vs bhyve, I don't know either deep enough to form one. I'm talking in general.)
Honestly, SMF is superior to systemd, and it's ironic that it came earlier (and its age shows in the fact that it uses XML as its configuration language... ick).
However, two things are an issue:
1) The CDDL license of SMF makes it difficult to use, or at least that’s what I was told when I asked someone why SMF wasn’t ported to Linux in 2009.
2) systemd is it now. It's too complicated to replace, and software has become hopelessly dependent on its existence, which is what I said was my largest worry about a monoculture, and I was routinely dismissed.
So, to answer your question, the argument must be: illumos over Linux.
SMF is OSS. The CDDL is an OSI approved licence. I'm not aware of any reason one couldn't readily ship user mode CDDL software in a Linux distribution; you don't even have the usual (often specious) arguments about linking and derivative works and so on in that case.
Maybe 15 years ago, not by a mile now. systemd surpassed SMF years ago and it's not even close now. No one in their right mind would pick SMF over systemd in 2024.
I regularly pick significantly less featured init systems over systemd whenever it is feasible, because systemd and its related components have caused some of the largest amounts of work for me over the past decade.
I don't really want to litigate the systemd vs. everything else argument, but as someone that has issues with systemd but is not particularly in love with sysvinit derivatives, I wouldn't mind SMF as an alternative.
That it's less opinionated about logging and networking, and doesn't ever force a reload of itself, are all reasonable reasons to prefer it.
You don't lose socket activation or supervision. SMF is designed to help work in the event of hardware failure too, which systemd definitely can't handle.
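For anyone unfamiliar with socket activation, here's a minimal consumer-side sketch. It uses systemd's LISTEN_FDS convention (the first passed descriptor is always fd 3) only because that protocol is easy to show in a few lines; SMF reaches a similar end through its inetd-compatible restarter. The point either way is that the service manager owns the listening socket, so the service can crash or be restarted without the socket going away.

```rust
// Minimal sketch of a socket-activated service using systemd's LISTEN_FDS
// convention (shown for brevity; SMF handles socket-style services through
// its inetd restarter). Unix-only; a real implementation would also check
// LISTEN_PID before adopting the descriptor.
use std::io::Write;
use std::net::TcpListener;
use std::os::unix::io::FromRawFd;

fn main() -> std::io::Result<()> {
    let listener = match std::env::var("LISTEN_FDS").as_deref() {
        // The service manager passed us exactly one pre-bound socket: adopt
        // it (passed descriptors always start at fd 3).
        Ok("1") => unsafe { TcpListener::from_raw_fd(3) },
        // Not socket-activated: bind our own listener for standalone runs.
        _ => TcpListener::bind("127.0.0.1:8080")?,
    };
    for stream in listener.incoming() {
        stream?.write_all(b"hello from a supervised service\n")?;
    }
    Ok(())
}
```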
> There is not a significant difference in functionality between the illumos and FreeBSD implementations, since pulling patches downstream has not been a significant burden. Conversely, the more advanced OS primitives in illumos have resulted in certain bugs being fixed only there, having been difficult to upstream to FreeBSD.
Curious about what bugs are being thought of there. Sounds like a very interesting situation to be in.
According to the founders and early engineers on their podcast - no, they tried to fairly evaluate all the OSes and were willing to go with other options.
Practically speaking, it's hard to do it completely objectively, and the in-house expertise probably colored the decision.
Tried to, sure, but when you evaluate other products strictly against the criteria under which you built your own version, you know what the conclusion will be. Never mind that you are carrying your blind spots with you. I would say that there was an attempt to evaluate other products, but not so much an attempt to be objective in that evaluation.
In general, being on your own private tech island is a tough thing to do, but many engineers would rather do that than swallow their pride.
"• Emerging VMMs (OpenBSD’s vmm, etc): Haven’t been proven in production"
It's a small operation, but https://openbsd.amsterdam/ has absolutely proven that OpenBSD's hypervisor is production-capable in terms of stability - but there are indeed other problems that rule against it at scale.
For those who are unfamiliar with OpenBSD: the primary caveat is that its hypervisor can so far only provide guests with a single CPU core.
Yes, to be clear this is not meant to be a criticism of software quality at OpenBSD! Though I don't necessarily always agree with the leadership style I have big respect for their engineering efforts and obviously as another relatively niche UNIX I feel a certain kinship! That part of the document was also written some years ago, much closer to 2018 when that service got started than now, so it's conceivable that we wouldn't have said the same thing today.
I will say, though, that single VCPU guests would not have met our immediate needs in the Oxide product!
> I will say, though, that single VCPU guests would not have met our immediate needs in the Oxide product!
Could Oxide not have helped push multi-vcpu guests out the door by sponsoring one of the main developers working on it, or contributing to development? From a secure design perspective, OpenBSD's vmd is a lot more appealing than bhyve is today.
I saw recently that AMD SEV (Secure Encrypted Virtualization) was added, which seems compelling for Oxide's AMD based platform. Has Oxide added support for that to their bhyve fork yet?
> Could Oxide not have helped push multi-vcpu guests out the door by sponsoring one of the main developers working on it, or contributing to development?
Being that vmd's values are aligned with OpenBSD's (security above all else), it is probably not a good fit for what Oxide is trying to achieve. Last I looked at vmd (circa 2019), it was doing essentially all device emulation in userspace. While it makes total sense to keep as much logic as possible out of ring-0 (again, emphasis on security), doing so comes with some substantial performance costs. Heavily used devices, such as the APIC, will incur pretty significant overhead if the emulation requires round trips out to userspace on top of the cost of VM exits.
> I saw recently that AMD SEV (Secure Encrypted Virtualization) was added, which seems compelling for Oxide's AMD based platform. Has Oxide added support for that to their bhyve fork yet?
SEV complicates things like the ability to live-migrate guests between systems.
Illumos makes sense as a host OS—it’s capable, they know it, they can make sure it works well on their hardware, and virtualization means users don’t need that much familiarity with it.
If I were Oxide, though, I’d be sprinting to seamless VMWare support. Broadcom has turned into a modern-day Oracle (but dumber??) and many customers will migrate in the next two years. Even if those legacy VMs aren’t “hyperscale”, there’s going to be lots of budget devoted to moving off VMWare.
Oracle is a $53 billion company, and never had a mass exodus, just less greenfield deployments.
Broadcom also isn't all that dumb, VMware was fat and lazy and customers were coddled for a very long time. They've made a bet that it's sticky. The competition isn't as weak as they thought, that's true, but it will take 5+ years to catch up, not 2 years, in general. Broadcom was betting on it taking 10 years: plenty of time to squeeze out margins. Customers have been trying and failing to eliminate the vTax since OpenStack. Red Hat and Microsoft are the main viable alternatives.
I don’t disagree much. Still, there’s a sudden weakening of the main incumbent in the on-prem virtualization market… and that is _all_ Oxide does. It will be interesting to see whether Oxide can convert some VMWare customers.
Oxide is a moonshot, it requires companies to fire a lot more suppliers than VMware. Which is easier said than done. A lot will depend on if Oxide can find their early niches: they're a long play (years) before they crack the mainstream or are truly competitive with, say, VCF on VxRail, for the average talent ops team. Long run it's clearly going to be better, but short run they have to pick their battles.
Maybe some, but the main difference between Oxide and the others is that they sell an integrated platform, so if a customer is just looking to replace the VM software, that seems like a harder sell.
Illumos and ZFS sound completely sensible for a company that runs on specific hardware. They mention the specific EPYC CPU their systems are running on, which suggests they're all ~identical.
Linux has a massive advantage where it comes to hardware support for all kinds of esoteric devices. If you don't need that, and you've got engineers that are capable of patching the OS to support your hardware, yep, have at it. Good call.
Certainly they already had experience with ZFS (as it is built into Illumos/Solaris), but as it was told to them by someone they trusted who ran a lot of Ceph: "Ceph is operated, not shipped [like ZFS]".
There's more care-and-feeding required for it, and they probably don't want that as they want to treat product in a more appliance/toaster-like fashion.
Oxide is shipping an on-prem 'cloud appliance'. From the customer's/user's perspective of calling an API asking for storage, it does not matter what the backend is—apple or orange—as long as "fruit" (i.e., a logical bag of a certain size to hold bits) is the result that they get back.
Yes, it could be NTFS behind the scenes, but this is still an apples to oranges comparison because the storage service Oxide created is Crucible[0], not ZFS. Crucible is more of an apples to apples comparison with Ceph.
It’s very possible to run a light/small layer on top of ZFS (either userspace daemon or via FUSE) to get you most of the way to scaling ZFS-backed object storage within or across data centers depending on what specific availability metrics you need.
Ceph is sadly not very good at what it does. The big clouds have internal versions of object store that are far better (no single point of failure, much better error recovery story, etc.). ZFS solves a different problem, though. ZFS is a full-featured filesystem. Like Ceph it is also vulnerable to single points of failure.
Single-monitor is a common way to run Ceph. On top of that, many cluster configurations cause the whole thing to slow to a crawl when a very small minority of nodes go down. Never mind packet loss, bad switches, and other sorts of weird failure mechanisms. Ceph in general is pretty bad at operating in degraded modes. ZFS and systems like Tectonic (FB) and Colossus (Google) do much better when things aren't going perfectly.
Do you know how many administrators CERN has for its Ceph clusters? Google operates Colossus at ~1000x that size with a team of 20-30 SREs (almost all of whom aren't spending their time doing operations).
> Our Configuring ceph section provides a trivial Ceph configuration file that provides for one monitor in the test cluster. A cluster will run fine with a single monitor; however, a single monitor is a single-point-of-failure. To ensure high availability in a production Ceph Storage Cluster, you should run Ceph with multiple monitors so that the failure of a single monitor WILL NOT bring down your entire cluster.
This is complete nonsense. No one running business critical installs of Ceph runs single-monitor.
You can also tell Ceph to use a single disk as your failure domain. No one does that either. Homelabbers maybe, but then why are you comparing such setups with Google?
We run Ceph with a failure domain of an entire rack. We can literally take down (scheduled or unscheduled) an entire rack of 40 servers, and continue to serve critical, latency sensitive applications, with no noticeable performance loss.
We have a Ceph footprint 5x larger than CERN run by a team of 4-5 people.
I wonder if CockroachDB abandoning the open source license[0] will have an impact on their choice to use it. It looks like the RFD was posted 1 day before the license switch[1], and the RFD has a section on licenses stating they intended to stick to the OSS build:
> To mitigate all this, we’re intending to stick with the OSS build, which includes no CCL code.
> Nested virtualisation [...] challenging to emulate the underlying interfaces with flawless fidelity [...] dreadful performance
It is so sad that we've ended up with designs where this is the case. There is no intrinsic reason why nested virtualization should be hard to implement or should perform poorly. Path dependence strikes again.
It doesn't perform poorly, in fact. It can be tuned to about 90% of non-nested virtualization, and for workloads where it can't, that's more than anything else a testament to how close virtualized performance is to bare metal.
Licensing aside, I think they have a couple of killer reasons to choose Illumos: it's a fine operating system, and it's much easier for them to land the fixes/features they need in the Illumos kernel than if they'd built on Linux.
Point 1.1 about QEMU seems even less relevant today, with QEMU adding support for the microvm machine type, greatly reducing the amount of exposed code. And as bonzini said in the thread, the recent vulnerability track record is not so bad.
Been running bhyve on FreeBSD (technically FreeNAS). Found PCIe pass-through of NVMe drives was fairly straightforward once the correct incantations were found, but network speed to the host has been fairly abysmal. On my admittedly aging Threadripper 1920X, I can only get ~2-3 Gbps peak from a Linux guest.
That's with virtio, the virtual intel "card" is even slower.
They went with Illumos though, so curious if the poor performance is a FreeBSD-specific thing.
I just spun up a VNET jail (so it should be essentially using the same network stack and networking isolation level as a bhyve guest would) and tested with iperf3 and without any tweaking or optimization and without even using jumbo frames I'm able to get 24+ Gbps with iperf3 (32k window size, tcp, single stream) between host/guest over the bridged and virtualized network interface. My test hardware is older than yours, it's a Xeon E5-1650 v3 and this is even with nested virtualization since the "host" is actually an ESXi guest running pf!
But I think you might be right about something because, playing with it some more, I'm seeing an asymmetry in network I/O speeds; when I use `iperf3 -R` from the VNET jail to make the host connect to the guest and send data instead of the other way around, I get very inconsistent results with bursts of 2 Gbps traffic and then entire seconds without any data transferred (regardless of buffer size). I'd need to do a packet capture to figure out what is happening but it doesn't look like the default configuration performs very well at all!
It's been a minute since I messed with bhyve on FreeBSD, but I'm pretty sure you have to switch out the networking stack to something like Netgraph if you intend to use fast networking.
Hmmm I'm not the OP, but I run my personal site on a kubernetes cluster hosted in bhyve VMs running Debian on a FreeBSD machine using netgraph for the networking. I just tested by launching iperf3 on the FreeBSD host and launching an alpine linux pod in the cluster, and I only got ~4Gbit/s. This is surprising to me since netgraph is supposed to be capable of much faster networking but I guess this is going through multiple additional layers that may have slowed it down (off the top of my head: kubernetes with flannel, iptables in the VM, bhyve, and pf on the FreeBSD host).
Thanks, but I am not using if_bridge. I am creating a netgraph bridge[0] which is connected to the host via a netgraph eiface. Then on the host, the packet passes to my real physical interface because I have gateway_enable and I have pf perform NAT[1]. It looks like that blog post connected the netgraph bridge directly to the external interface, so my guess is my slowdown is from either pf performing NAT or the packet forwarding from gateway_enable.
While we have a common ancestor in the original UNIX, so much of illumos is really more from our SVR4 heritage -- but then also so much of that has been substantially reworked since then anyway.
The section about Rust as a first class citizen seems to contain references to its potential use in Linux that are a few years out of date; with nothing more current than 2021.
> As of March 2021, work on a prototype for writing Linux drivers in Rust is happening in the linux-next tree.
Bryan Cantrill, ex-Sun dev, ex-Joyent CTO, now CTO of Oxide, is the reason they chose Illumos. Oxide is primarily an attempt to give Solaris (albeit Rustified) a second life, similar to Joyent before. The company even cites Sun co-founder Scott McNealy for its principles:
It's not a blog post, it's an RFD. We have a strong focus on writing as part of thinking and making decisions, and when we can, we like to publish our decision making documents in the spirit of open source. This is not a defence of our position so much as a record of the process through which we arrived at it. This is true of our other RFDs as well, which you can see on the site there.
> It also should not matter to their customers. They get exposed APIs and don't have to care about the implementation details.
Yes, the whole product is definitely designed that way intentionally. Customers get abstracted control of compute and storage resources through cloud style APIs. From their perspective it's a cloud appliance. It's only from our perspective as the people building it that it's a UNIX system.
So at no point did anyone even suspect that Illumos was under consideration because it's been corporate leadership's pet project for decades? That seems like a wild thing to omit from the "RFD" process. Or were some topics not open to the "RFD" process?
We are trying to build a business here. The goal is to sell racks and racks of computers to people, not build a menagerie of curiosities and fund personal projects. Everything we've written here is real, at least from our point of view. If we didn't think it would work, why would we throw our own business, and equity, and so on, away?
The reason I continue to invest myself, if nothing else, in illumos, is because I genuinely believe it represents a better aggregate trade off for production work than the available alternatives. This document is an attempt to distill why that is, not an attempt to cover up a personal preference. I do have a personal preference, and I'm not shy about it -- but that preference is based on tangible experiences over twenty years!
Furthermore, your team have already demonstrated that you can reevaluate things that you had strong opinions on, and come to a different conclusion. I'm thinking of the decision to exclusively run hardware-virtualized VMs rather than LX-branded zones, as we discussed on Oxide and Friends before.
I don't think working at Oxide would be for me, but I respect the team's values and process.
Folks are working on it! I believe it boots on some small systems and under QEMU, but it's still relatively early days. I'm excited for the port to eventually make it into the gate, though!
I don’t mean to downplay the importance for you personally but I do want to clarify that while it might be a non-starter for you, all of arm64 is so new that it’s hardly a non-starter for anyone considering putting it into (traditional) production.