I don't see it as a big deal. Rather, I see it as a huge amount of venture capital spent on some very bright people to build something no one really wants, or that is at best niche.
Also, it has little to do with the cloud; it is yet another hyperconverged infra.
Weirdly, it is attached to something very few people want: Solaris. This relates to the people behind it who still can't figure out why Linux won and Solaris didn't.
When you're deploying VMs, which is the use case here, the substrate OS becomes significantly less important. Those VMs will mostly just be Linux.
Yes, they are using illumos/Solaris to host this, but they don't sell on that; they sell on the functionality of this layer: allowing people to deploy to owned infra in a way that is similar to how they'd deploy to AWS or Azure. How much do you ever think about the system hosting your VM on those clouds? You think about your VMs and the API or web interface to deploy and configure them, but not the host OS. With Oxide racks the customers are not maintaining the illumos substrate (as long as Oxide is around).
You could be right about demand; there is risk in a venture like this. But presumably the team thought about this. I think folks who worked at Sun, Oracle, Joyent, and Samsung, and who built SmartOS, probably developed a decent sense of market demand, enough to make a convincing case to their funders.
I have a feeling they knew exactly from the start who their customers would be: people who have the budget to care about things like trust and observability in a complex system. But these would also be the kind of customers who require absolute secrecy, which is why you don't hear about them even though they might have bankrolled a sizable portion of the operation. Just like how the first Cray to officially be shipped was actually serial number 2...
> When you're deploying VMs, which is the use case here, the substrate OS becomes significantly less important. Those VMs will mostly just be Linux.
Now you need to know both the OS they chose and the OS you chose...
(No, I don't believe it'll be 100% hands-off for the host. This is an early stage product, with a lot of custom parts, their own distributed block storage, hypervisor, and so on.)
This is true for other hypervisors too. Enterprises are still paying hundreds of millions to VMware; who knows what's going on in there?
I wouldn't have picked OpenSolaris, but it's a lot better than other vendors that are either fully closed source or thin proprietary wrappers over Linux with spotty coverage, where you're not allowed to touch the underlying OS for fear of disrupting the managed product.
What's more important is that the team actually knows Illumos/Solaris inside out. You can work wonders with a less than ideal system. That said, Illumos is of high quality in my opinion.
Seems risky considering how small a developer pool actively works on illumos/Solaris. The code is most definitely well engineered and correct, but there are huge teams all around the world deploying on huge pools of Linux compute that have contributed back to Linux.
They had a bug in the database they are using that turned out to be a Go system library not behaving correctly specifically on illumos. They've got enough engineering power to deal with such a thing, but damn...
Linux grew up in the bedrooms of teenagers. It was risky in the 486 and Pentium era. The environment and business criticality of a $1-2M rack-size computer is quite different.
I had similar thoughts about VMware (large installations) back in the day. Weird proprietary OS to run other operating systems? Yet they turned out fine.
This appears to be a much better system than VMware, is free as in software, and it builds upon a free software operating system with lineage that predates Linux.
I say this in the most critical way possible, as someone who has built multiple Linux-based "cloud systems", and as a GNU/Linux distribution developer: I love it!
It was totally a risky choice for companies in the 1990s and early 2000s to put all their web stuff onto Linux on commodity hardware instead of proprietary Unix or Windows servers. Many did it when their website being up was totally mission critical. Lots did it on huge server farms. It paid off very quickly but it's erasing history to suggest that it didn't require huge amounts of guts, savvy and agility to even attempt it.
Indeed, for me GNU/Linux was always a cheap way to have UNIX at home, given that Windows NT's POSIX support was never that great.
The first time I actually saw GNU/Linux powering something in production was in 2003, when I joined CERN and they were replacing their use of Solaris; eventually, alongside Fermilab, they came up with Scientific Linux in 2004.
Later at Nokia, it took them until 2006 to consider Red Hat Linux a serious alternative to their HP-UX infrastructure.
Completely tangential, but this reminds me of an interview I had for my first job out of college in 1995. I mentioned to the interviewer that I had some Linux experience. "Ah, Linux" he said. "A cool little toy that's gonna take over the world".
In hindsight, of course, it was remarkably prescient. This from a guy at a company that was built entirely around SGI at the time.
This is a skewed view: the critical piece that made Linux "enterprise-ish" was the memory management system contributed by IBM, which became part of the SCO lawsuit.
Back in the day... Sun Micro was a GOAT and pushed the envelope on Unix computing 20-30 years ago. Solaris was stable and high-performing.
I don't run on-prem clusters or clouds but know a couple people who do and, at large enough scale, it is a constant "fuck-shit-stack on top of itself" (to quote Reggie Watts). There is almost always something wrong and some people upset about it.
The promise of a fully integrated system (compute HW, network HW, all firmware/drivers written by experts using Rust wherever possible) that pays attention to optimizing all your OpEx metrics is a big deal.
It may take Oxide a couple more years to really break into the market in a big way, but if they can stick it out, they will do very well.
It won't. In the same way that AWS customers aren't debugging the hypervisor, Dell customers aren't debugging the BIOS, and Samsung SSD customers aren't debugging the firmware. Products choose where to draw the line between customer-serviceable parts and those that require a support call. In this case, expect Oxide to fix it when something doesn't work right.
When Apple supports OS X for consumers, they don't exactly surface the fact that there's BSD semi-hidden in there somewhere.
That's because they own the whole stack, from CPU to GUI and support it as a unit. That's the benefit of having a product where a single owner builds and supports it as a whole.
My impression of Oxide is that that's the level of single source of truth they are bringing to enterprise in-house cloud. So, I strongly doubt the innards would ever become customer-facing (unless the customer specifically wants that, being open source after all).
Apple is a horrible example: with Apple, when you have a problem, you often end up with an unfixable issue that Apple won't even acknowledge. You definitely don't want to taint Oxide's reputation with that association.
As for why I think Helios will become customer facing: Oxide is a small startup. They have limited resources. Their computers are expensive enough to be very much business critical. You'll get some support by Oxide logging in remotely to customer systems and digging around, but pretty soon the customer will want to do that themselves to monitor/troubleshoot the problems as they happen.
Imagine you're observing a recurring but rare I/O slowdown that seems to trigger under certain conditions, and tell me a competent sysadmin wouldn't want to log in on all the related boxes (the client Helios and the >=3 server Helioses for the block store) and look at the logs and stats.
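To make that concrete, here's roughly the first thing I'd want to run myself. A sketch only: it assumes the stock DTrace io provider is exposed on Helios, which I haven't verified, and it's just the classic start-to-done latency histogram:

    # Histogram of block I/O latency: stamp each I/O at io:::start,
    # then aggregate the start-to-done time at io:::done.
    dtrace -n '
      io:::start { ts[arg0] = timestamp; }
      io:::done /ts[arg0]/ {
        @["block I/O latency (ns)"] = quantize(timestamp - ts[arg0]);
        ts[arg0] = 0;
      }'

If the spikes show up there, the next question is which box they correlate with (client or block-store servers), and that's exactly the kind of cross-machine digging a customer will want to do without filing a ticket first.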
> As for why I think Helios will become customer facing: Oxide is a small startup. They have limited resources.
Have you looked at the pedigree of many of the people behind the project? I don't say this because "these guys smart", but because these guys bent over backwards for their customers when they were Sun engineers. Bryan didn't write DTrace for nothing.
> Imagine you're observing a recurring but rare I/O slowdown that seems to trigger under certain conditions, and tell me a competent sysadmin wouldn't want to log in on all the related boxes (the client Helios and the >=3 server Helioses for the block store) and look at the logs and stats.
I think you're simultaneously over-estimating and under-estimating the people who will deploy this. There are a lot of companies that would want a "cloud in a box" and would happily plug the hardware in and submit a support ticket if they ever find an issue, because their system engineers don't have the time, desire, or competence (unfortunately common) to do anything more. The ones who are happy to start debugging stuff on their own would have absolutely wonderful tooling at their fingertips (DTrace) and wouldn't have any issue figuring out how to adapt to something other than Linux (hell, I've been running TrueNAS for the better part of a decade, and being on a *BSD has never bothered me).
Apple is a great example of the benefits of an integrated system where the hardware and software are designed together. There are tons of benefits to that.
What makes Apple evil (IMO, many people disagree) is how everything is secret and proprietary and welded shut. But that doesn't take away from the benefits of an integrated hardware/software ecosystem.
Oxide is open source so it doesn't suffer from the evil aspect but benefits from the goodness of engineered integration. Or so I hope.
In practice I don't think it's as good as in theory. I had an Apple MacBook Pro with an Apple monitor, and 50% of the time when unplugging the monitor the laptop screen would stay off. Plugging back in to the monitor wouldn't work at that point, so all I could do was hold the power button to force it off and reboot. That's with Apple controlling the entire stack: software, hardware, etc.
I think the real benefit is being able to move/deprecate/expand at will. For example, want an app that would require special hardware? You can just add it. Want to drop support for old drivers? Just stop selling them and then drop (deprecate) the software support in the next release.
I fully agree about the evilness, and it baffles me how few people do!
Android is potentially a better example. Compare Android to trying to get Linux working on <some random laptop>. You might get lucky and it works out of the box or you might find yourself in a 15 page "how to fix <finger print reader, ambient light sensor, etc>" wiki where you end up compiling a bunch of stuff with random patches.
AFAIK Android phones tend to have a lot more hardware than your average laptop, too (cell modem, GPS, multiple cameras, gyro, accelerometer, light sensors, fingerprint readers).
Apple is the survivor of the 16-bit home micro era of integrated systems. PC clones only happened because IBM failed to prevent Compaq's reverse engineering from taking over their creation; they even tried to get hold of it again afterwards via PS/2 and MCA.
As we see nowadays with tablets and laptops, most OEMs are quite keen on returning to those days, as otherwise there is hardly any money left in PC components.
Funny how your mentioning BSD got me thinking of the Sony PlayStation and Nintendo Switch, which are proprietary and not user serviceable. A Steam Deck, Fairphone, or Framework laptop is each less proprietary, more of a FOSS stack, and user serviceable. That servicing is something a user may or may not want to do themselves; at the very least they can pay someone and have them manage it.
Also, Apple is just the one who survived. Previously I'd have thought of SGI, DEC, Sun, HP, IBM, and Dell, some of whom survived and some of whom didn't.
Those three consumer products I mentioned each provide a platform for users and businesses to flourish and thrive. I expect a company doing something similar for cloud computing to want the same. But it will require some magick: momentum, money, trust. That kind of stuff, and loads of it. (With some big names behind it and a lot of FOSS they got me excited, but I don't matter.)
If you have a bug in how a Lambda function is run on AWS, do you find yourself looking for the bug in Firecracker? It is open source, so you technically could, but I just don't see many customers doing that. The same can be said about Knative on GCP.
Their choice in foundation OS (for lack of a better term) really should not matter to any customer.
OK, but then that is purely additive, right? Like, "have to find someone with illumos expertise to fix something that was never intended to be customer-facing" may not be easy, but it is still easier than the impossibility of doing the same thing on AWS / Azure / Google Cloud.
Right, who wants or benefits from open source firmware anyway.
Also, there are many situations where renting, for example a flat, makes a lot of sense, and there are many situations where the financials and/or the options enabled by owning something make a lot of sense. Right now, the kind of experience you get with AWS and co. can only be rented, not bought. Some people want to buy houses instead of renting them.
Well, you can buy your own hardware and set it up with OpenStack and use it as a private cloud. Companies like Canonical or Red Hat make a lot of money by providing software (mostly open source) to support exactly that use case.
> Well, you can buy your own hardware and set it up with OpenStack and use it as a private cloud. Companies like Canonical or Red Hat make a lot of money by providing software (mostly open source) to support exactly that use case.
Sure you can, but then who will diagnose and fix your hardware/OS interaction problems when you have parts from five vendors in the mix?
If you haven't lived through this, the answer is: nobody. Everyone points fingers at the other four and ignores your calls.
Back in the day you could buy a fully integrated system (from CPU to hardware to OS) from Sun or SGI or HP and you had a single company to answer all the calls, so it was much better. Today you can't really get this level of integration and support anymore.
(Actually, you probably can from IBM, which is why they're still around. But I have no experience in the IBM universe.)
This is why Oxide is so exciting to me. I hope I can be in a company that becomes a customer at some point.
>Sure you can, but then who will diagnose and fix your hardware/OS interaction problems when you have parts from five vendors in the mix?
Dell is a single vendor that will diagnose and fix all of your hardware issues.
With Oxide you're locked into what looks like a Solaris-derivative OS running on the metal, and you're only allowed to provision VMs, which is a huge disadvantage.
I run a fleet of over 30,000 nodes across three continents, and the majority of it is Flatcar Linux running on bare metal. We also have a decent amount of RHEL running for specific apps. We can pick and choose our bare metal OS, which is something you cannot do with Oxide. That's a tough pill to swallow.
> Dell is a single vendor that will diagnose and fix all of your hardware issues.
I've been a Dell customer at a previous company. I know for a fact that's not true.
I had a support ticket for a weird firmware bug open for two years, they could never figure it out. I left that job but for all I know the case is still open many years later.
Dell doesn't know how to fix things like that because they don't design and engineer the systems they sell. Dell is a reseller that puts together components from a bunch of vendors; it mostly works, but when it doesn't, there's nobody on staff who can fix it.
I've been a Dell customer for decades at this point, and I know for a fact it's true.
I've had support tickets open for all kinds of weird firmware, hardware, etc. bugs, and they've been well resolved, even if it meant Dell just replaced the part with something comparable (a NIC swap).
>Dell doesn't know how to fix things like that because they don't design and engineer the systems they sell.
Of course they do. That's like saying Oxide doesn't know how to fix stuff because they don't design the CPU, NVMe, DIMMs, etc. Oxide is still going to vendors for these things.
Ironically, Dell's total inability to resolve a pathological rash of uncorrectable memory errors is very much part of the origin story of Oxide: this issue was very important to my employer (who was a galactic Dell customer), and as the issue endured and Dell escalated internally, it became increasingly clear that there was in fact no one at Dell who could help us -- Dell did not understand how their own systems work.
At Oxide, we have been deliberate at every step, designing from first principles whenever possible. (We -- unlike essentially everyone else -- did not simply iterate from a reference design.)
To make this concrete with respect to the CPU in particular, we have done our own lowest-level platform enablement software[0] -- we have no BIOS. No one -- not the hyperscalers, not the ODMs and certainly not Dell -- has done this, and even AMD didn't think we could pull it off. Why did we do it this way? Because all along our lodestar was that problem that Dell was useless to us on -- that we wanted to understand these systems from first principles, because we have felt that that is essential to deliver the product that we ourselves wanted to buy.
There are plenty of valid criticisms of Oxide -- but that we don't understand our system simply isn't one of them.
As a side question, what's the name of your custom firmware that is the replacement of the AGESA bootloader? I tried searching on the oxide github page but couldn't find anything that seemed to fit that description.
(The AGESA bootloader -- or ABL -- is in the AMD PSP.) In terms of our replacement for AGESA: the PSP boots to our first instruction, which is the pico host bootloader, phbl[0]. phbl then loads the actual operating system[1], which performs platform enablement as part of booting. (This is pretty involved, but to give you a flavor, see, e.g. initialization of the DXIO engine.[2])
Thanks. Are the important Oxide branches of the illumos-gate repo (and any other cloned repos) defined anywhere? I definitely wouldn't have found that branch without you mentioning it here.
Interestingly enough, I also ran into something somewhat related with Dell that they were not able to resolve, so they ended up working in a replacement from another vendor.
Nonetheless, it is quite interesting what you've built, but as the end user I'm not quite convinced that it matters. Sure, you can claim it reduces attack vectors and such, but we'll still see Dells and IBMs in the most restricted and highest-security-postured sites in the world. Think DoD and such. coreboot/libreboot with a RoT will get me through compliance all the same.
The software management plane y'all built is the headlining feature IMHO, not so much what happens behind the scenes, which the vast majority of the time will not have a fatal, catastrophic upstream effect.
>There are plenty of valid criticisms of Oxide -- but that we don't understand our system simply isn't one of them.
That's not what I said. There's a line in the sand that you must cross when it comes to understanding the true nature of the componentry that you're using. At the end of the day, your AMD CPUs may be lying to you, to all of us, but we just don't know it yet.
> Off by a few orders of magnitude. Dell on-site SLA with pre-purchased spares was about 6 hours.
You're talking about replacement parts. Yes Dell is good about that.
The discussion above is asking them to diagnose and fix a problem with the interaction of various hardware components (all of which come from third parties).
But they _are_ writing the firmware that runs most of them and need to understand those devices at a deep level in order to do that, unlike Dell. Dell slaps together hardware and firmware from other vendors with some high level software of their own on top. They don't do the low level firmware and thus don't understand the low level intricacies of their own systems.
No, they're not, unless I'm mistaken. They're not writing the firmware that runs on the NVMe drives, nor the NICs (they're not even writing the drivers for some of the NICs), etc.
I'm not speaking hypothetically. If you hit a "zero-day" bug that Dell has never seen, it's going to take time. And somehow every large customer finds bugs that Dell certification didn't.
> And somehow every large customer finds bugs that Dell certification didn't.
It's a law of computer engineering.
During the Apollo 11 descent sequence, the Rendezvous Radar triggered a hardware bug[0] that hadn't been uncovered during simulation. They found it later, but until then, the solution was adding a "turn off Rendezvous Radar" checklist item.
[0] The Rendezvous Radar would stop the CPU, shuttle some data into areas where it could be read, and wake the CPU back up to process it. The bug caused it to spuriously do this dance just to tell it "no new data", which then caused other systems to overload.
It's ironic coming from a company whose CTO has harped about containers on bare metal for years. Maybe a large swath only needs to deploy VMs, but the future will most definitely involve bare metal for many use cases, and oddly Oxide doesn't support that currently.
See the pattern? Dell only cares about the big guys.
Setting aside the childish tone ...
> Dell is a single vendor that will diagnose and fix all of your hardware issues.
There are two anecdotes here disagreeing with you, and frankly that's enough to say that what you said above isn't true, at least not universally. I doubt Oxide is targeting big deployments like yours, but more like theirs. Whether they will succeed is another matter, but they do have a valid sales pitch and the expertise to pull it off.
So OpenBMC is fine (happy for them!), but having open firmware is much deeper and broader than that: yes, it's the service processor (in contrast to the BMC, which is a closed part on Dell machines) -- but it's also the root-of-trust and (especially) the host CPU itself. We at Oxide have open source software from the first instruction out of the AMD PSP; I elaborated more on our approach in my OSFC 2022 talk.[0]
Dell uses trusted platform modules (TPMs). The TPM is a separate chip from the BMC.
For a mostly open source solution, not only would you need open source BMC firmware, you would also need open source UEFI/BIOS/boot firmware like coreboot, LinuxBoot, oreboot, U-Boot, etc.
The fact that it's not on Linux is one of the great things about it. There is too much Linux in critical infrastructure already, and the monoculture just keeps on growing.
At least with Oxide there is a glimmer of hope for a better future in this regard.