
I don't see it as a big deal - rather, I see it as a huge amount of venture capital spent on some very bright people to build something no one really wants, or, at best, is niche.

Also, it has little to do with the cloud; it is yet another hyperconverged infra.

Weirdly, it is attached to something very few people want: Solaris. This relates to the people behind it who still can't figure out why Linux won and Solaris didn't.




When you're deploying VMs, which is the use case here, the substrate OS becomes significantly less important. Those VMs will mostly just be Linux.

Yes, they are using illumos/Solaris to host this, but they don't sell on that; they sell on the functionality of this layer — allowing people to deploy to owned infra in a way that is similar to how they'd deploy to AWS or Azure. How much do you ever think about the system hosting your VM on those clouds? You think about your VMs and the API or web interface to deploy and configure, but not the host OS. With Oxide racks the customers are not maintaining the illumos substrate (as long as Oxide is around).

You could be right about demand, there is risk in a venture like this. But presumably the team thought about this - I think folks who worked at Sun, Oracle, Joyent, and Samsung and made SmartOS probably developed a decent sense of market demand, enough to make a convincing case to their funders.


I have a feeling they knew exactly from the start who their customers would be: people who have the budget to care about things like trust and observability in a complex system. But these would also be the kind of customers who require absolute secrecy, and that's why you don't hear about them even though they might have bankrolled a sizable portion of the operation. Just like the first Cray to officially be shipped was actually serial number 2...


Can you imagine trying to investigate Bryan for a security clearance?

"Sir, if you have nothing to hide, why do you talk so fast, and pronounce words like a foreigner who learned English from a book?"

:)

We love you Bryan, never change.


Oh to be a fly on the wall for the first Oxide<>Oracle partnership discussion…


Bryan: "Steve, I think you should take this meeting with Larry..."


> When you're deploying VMs, which is the use case here, the substrate OS becomes significantly less important. Those VMs will mostly just be Linux.

Now you need to know both the OS they chose and the OS you chose...

(No, I don't believe it'll be 100% hands-off for the host. This is an early stage product, with a lot of custom parts, their own distributed block storage, hypervisor, and so on.)


This is true for other hypervisors too. Enterprises are still paying hundreds of millions to VMware; who knows what's going on in there?

I wouldn't have picked OpenSolaris, but it's a lot better than the alternatives from other vendors, which are either fully closed source or thin proprietary wrappers over Linux with spotty coverage, where you're not allowed to touch the underlying OS for risk of disrupting the managed product.


What's more important is that the team actually knows Illumos/Solaris inside out. You can work wonders with a less than ideal system. That said, Illumos is of high quality in my opinion.


Seems risky considering how small a developer pool actively works on illumos/Solaris. The code is most definitely well engineered and correct, but there are huge teams all around the world deploying on vast pools of Linux compute that have contributed back to Linux.


They had a bug in the database they are using that was due to a Go system library not behaving correctly specifically on illumos. They've got enough engineering power to deal with such a thing, but damn...


If everyone had this mindset we would be running our workloads on Microsoft Windows by now.

GNU/Linux was also "risky" at some point.


Linux grew up in the bedrooms of teenagers. It was risky in the era of 486s and Pentiums. The environment and business criticality of a $1-2M rack-sized computer is quite different.


I had similar thoughts about VMware (large installations) back in the day. Weird proprietary OS to run other operating systems? Yet they turned out fine.

This appears to be a much better system than VMware, it is free software, and it builds upon a free-software operating system whose lineage predates Linux.

I say this in the most critical way possible, as someone who has built multiple Linux-based "cloud systems", and as a GNU/Linux distribution developer: I love it!


It was totally a risky choice for companies in the 1990s and early 2000s to put all their web stuff onto Linux on commodity hardware instead of proprietary Unix or Windows servers. Many did it when their website being up was totally mission critical. Lots did it on huge server farms. It paid off very quickly but it's erasing history to suggest that it didn't require huge amounts of guts, savvy and agility to even attempt it.


Indeed, for me GNU/Linux was always a cheap way to have UNIX at home, given that Windows NT's POSIX support was never that great.

The first time I actually saw GNU/Linux powering something in production was in 2003, when I joined CERN and they were replacing their use of Solaris; eventually, alongside Fermilab, they came up with Scientific Linux in 2004.

Later, at Nokia, it took them until 2006 to consider Red Hat Linux a serious alternative to their HP-UX infrastructure.


Completely tangential, but this reminds me of an interview I had for my first job out of college in 1995. I mentioned to the interviewer that I had some Linux experience. "Ah, Linux" he said. "A cool little toy that's gonna take over the world".

In hindsight of course it was remarkably prescient. This from a guy at a company that was built entirely around SGI at the time.


This is a skewed view - the critical piece that made Linux "enterprise-ish" was the memory management system contributed by IBM, which became part of the SCO lawsuit.


I have no clue what OS runs my VMs on EC2.


Back in the day... Sun Microsystems was a GOAT and pushed the envelope on Unix computing 20-30 years ago. Solaris was stable and high-performing.

I don't run on-prem clusters or clouds but know a couple people who do and, at large enough scale, it is a constant "fuck-shit-stack on top of itself" (to quote Reggie Watts). There is almost always something wrong and some people upset about it.

The promise of a fully integrated system (compute HW, network HW, all firmware/drivers written by experts using Rust wherever possible) that pays attention to optimizing all your OpEx metrics is a big deal.

It may take Oxide a couple more years to really break into the market in a big way, but if they can stick it out, they will do very well.


I used to love Sun and Solaris. Then the dot-com bubble burst, and Linux ate its lunch. I haven't seen a new Solaris system deployed in over 20 years.


Just to be clear, Illumos (it hasn't been Solaris in a very long time) is an implementation detail. It's not customer facing.


> Just to be clear, Illumos (it hasn't been Solaris in a very long time) is an implementation detail. It's not customer facing.

Solaris is still Solaris, as of the latest release last month. OpenSolaris hasn't been OpenSolaris in a while and is Illumos, yes.


Yes, thanks. I didn't even realize my comment could be read that way, but I was speaking of Illumos only, Solaris is still Solaris :)


It'll become customer facing the moment something doesn't work right.


It won't. In the same way that AWS customers aren't debugging the hypervisor, or Dell customers aren't debugging the BIOS, or Samsung SSD customers aren't debugging the firmware. Products choose where to draw the line between customer-serviceable parts and those that require a support call. In this case, expect Oxide to fix it when something doesn't work right.


When Apple supports OSX for consumers, they don't exactly surface the fact that there's BSD semi-hidden in there somewhere.

That's because they own the whole stack, from CPU to GUI and support it as a unit. That's the benefit of having a product where a single owner builds and supports it as a whole.

My impression of Oxide is that that's the level of single source of truth they are bringing to enterprise in-house cloud. So, I strongly doubt the innards would ever become customer-facing (unless the customer specifically wants that, being open source after all).


Apple is a horrible example: with Apple, when you have a problem, you often end up with an unfixable issue that Apple won't even acknowledge. You definitely don't want to taint Oxide's reputation with that association.

As for why I think Helios will become customer facing: Oxide is a small startup. They have limited resources. Their computers are expensive enough to be very much business critical. You'll get some support by Oxide logging in remotely to customer systems and digging around, but pretty soon the customer will want to do that themselves to monitor/troubleshoot the problems as they happen.

Imagine you're observing a recurring but rare I/O slowdown that seems to trigger under certain conditions, and tell me a competent sysadmin wouldn't want to log in on all the related boxes (client Helios, >=3 server Helioses for the block store) and look at the logs & stats.


> As for why I think Helios will become customer facing: Oxide is a small startup. They have limited resources.

Have you looked at the pedigree of many of the people behind the project? I don't say this because "these guys smart", but because these guys bent over backwards for their customers when they were Sun engineers. Bryan didn't write dtrace for nothing.

> Imagine you're observing a recurring but rare I/O slowdown that seems to trigger under certain conditions, and tell me a competent sysadmin wouldn't want to log in on all the related boxes (client Helios, >=3 server Helioses for the block store) and look at the logs & stats.

I think you're simultaneously over-estimating and under-estimating the people who will deploy this. There are a lot of companies that would want a "cloud in a box", would happily plug the hardware in, and would submit a support ticket if they ever found an issue, because their system engineers either don't have the time, desire, or competence (unfortunately common) to do anything more. The ones who are happy to start debugging stuff on their own would have absolutely wonderful tooling at their fingertips (dtrace) and wouldn't have any issue figuring out how to adapt to something other than Linux (hell, I've been running TrueNAS for the better part of a decade and being on a *BSD has never bothered me).


> Apple is a horrible example,

Apple is a great example of the benefits of an integrated system where the hardware and software are designed together. There are tons of benefits to that.

What makes Apple evil (IMO, many people disagree) is how everything is secret and proprietary and welded shut. But that doesn't take away from the benefits of an integrated hardware/software ecosystem.

Oxide is open source so it doesn't suffer from the evil aspect but benefits from the goodness of engineered integration. Or so I hope.


In practice I don't think it's as good as in theory. I had an Apple MacBook Pro with an Apple monitor, and 50% of the time when unplugging the monitor the laptop screen would stay off. Plugging the monitor back in wouldn't work at that point, so all I could do was hold the power button to force it off and reboot. That's with Apple controlling the entire stack - software, hardware, etc.

I think the real benefit is being able to move/deprecate/expand at will. For example, want an app that would require special hardware? You can just add it. Want to drop support for old drivers? Just stop selling them and then drop (deprecate) the software support in the next release.

I fully agree about the evilness, and it baffles me how few people do!


Android is potentially a better example. Compare Android to trying to get Linux working on <some random laptop>. You might get lucky and it works out of the box, or you might find yourself in a 15-page "how to fix <fingerprint reader, ambient light sensor, etc>" wiki where you end up compiling a bunch of stuff with random patches.

AFAIK Android phones tend to have a lot more hardware than your average laptop, too (cell modem, GPS, multiple cameras, gyro, accelerometer, light sensors, fingerprint readers).


Apple is the survivor of the 16-bit home micro era of integration. PC clones only happened because IBM failed to prevent Compaq's reverse engineering from taking over their creation; they even tried to get hold of it again afterwards via PS/2 and MCA.

As we see nowadays on tablets and laptops, most OEMs are quite keen on returning to those days, as otherwise there is hardly any money left in PC components.


Exactly right, Apple is actually a poor example. Watch enough Louis Rossmann and you'll grasp just how bad some of their shit can be.


Funny how your mentioning BSD got me thinking of the Sony PlayStation and Nintendo Switch, which are proprietary and not user-serviceable. A Steam Deck, Fairphone, or Framework laptop is each less proprietary, more of a FOSS stack, and user-serviceable. That's something a user may or may not want to do themselves; at the very least they can pay someone and have them manage it.

Also, Apple is just the one that survived. Previously I'd have thought of SGI, DEC, Sun, HP, IBM, and Dell, some of which survived and some of which did not.

Those three consumer products I mentioned each provide a platform for a user and business space to flourish and thrive. I expect a company doing something similar for cloud computing to want the same. But it will require some magick: momentum, money, trust. That kind of stuff, and loads of it. (With some big names behind it and a lot of FOSS they got me excited, but I don't matter.)


Contrary to urban myths, the Nintendo Switch OS is a microkernel-based OS, not something based on BSD.


IBM mainframes: they survive in a specific category.


> When Apple supports OSX for consumers, they don't exactly surface the fact that there's BSD semi-hidden in there somewhere.

Or Linux running underneath all the Java-y Android stuff.


If you have a bug in how a Lambda function is run on AWS, do you find yourself looking for the bug in Firecracker? It is open source, so you technically could, but I just don't see many customers doing that. The same can be said about Knative on GCP.

Their choice of foundation OS (for lack of a better term) really should not matter to any customer.


I am unable to do so.

Now imagine a multi-million-dollar, mission-critical pile of computers running on premises, and your sysadmin being able to do so.

Oxide is closer to a rack of Supermicros than AWS.


OK, but then that is purely additive, right? Like, "have to find someone with Illumos expertise to fix something that was never intended to be customer-facing" may not be easy, but is still easier than the impossibility of doing the same thing on AWS / Azure / Google Cloud.


Right, who wants or benefits from open source firmware anyway.

Also, there are many situations where renting, for example a flat, makes a lot of sense. And there are many situations where the financials and/or the options enabled by owning something make a lot of sense. Right now, the kind of experience you get with AWS and co. can only be rented, not bought. Some people want to buy houses instead of renting them.


Well, you can buy your own hardware and set it up with OpenStack and use it as a private cloud. Companies like Canonical or Red Hat make a lot of money by providing software (mostly open source) to support exactly that use case.

And Canonical played with a cluster-in-a-box all the way back in 2013-2014: https://www.zdnet.com/article/canonicals-cloud-in-a-box-unde...

You could turn it into an OpenStack cloud in ~20 mins with an automated Juju OpenStack install.


> Well, you can buy your own hardware and set it up with OpenStack and use it as a private cloud. Companies like Canonical or Red Hat make a lot of money by providing software (mostly open source) to support exactly that use case.

Sure you can, but then who will diagnose and fix your hardware/OS interaction problems when you have parts from five vendors in the mix?

If you haven't lived through this, the answer is: nobody. Everyone points fingers at the other four and ignores your calls.

Back in the day you could buy a fully integrated system (from CPU to hardware to OS) from Sun or SGI or HP and you had a single company to answer all the calls, so it was much better. Today you can't really get this level of integration and support anymore.

(Actually, you probably can from IBM, which is why they're still around. But I have no experience in the IBM universe.)

This is why Oxide is so exciting to me. I hope I can be in a company that becomes a customer at some point.


>Sure you can, but then who will diagnose and fix your hardware/OS interaction problems when you have parts from five vendors in the mix?

Dell is a single vendor that will diagnose and fix all of your hardware issues.

With Oxide you're locked into what looks like a Solaris-derivative OS running on the metal, and you're only allowed to provision VMs, which is a huge disadvantage.

I run a fleet of over 30,000 nodes across three continents, and the majority is Flatcar Linux running on bare metal. I also have a decent amount of RHEL running for specific apps. We can pick and choose our bare-metal OS, which is something you cannot do with Oxide. That's a tough pill to swallow.


> Dell is a single vendor that will diagnose and fix all of your hardware issues.

I've been a Dell customer at a previous company. I know for a fact that's not true.

I had a support ticket for a weird firmware bug open for two years, they could never figure it out. I left that job but for all I know the case is still open many years later.

Dell doesn't know how to fix things like that because they don't design and engineer the systems they sell. Dell is a reseller that puts together components from a bunch of vendors; it mostly works, but when it doesn't, there's nobody on staff who can fix it.


I've been a Dell customer for decades at this point and I know for a fact it's true.

I've had support tickets open for all kinds of weird firmware, hardware, etc. bugs and they've been well resolved, even if it meant Dell just replaced the part with something comparable (NIC swap).

>Dell doesn't know how to fix things like that because they don't design and engineer the systems they sell.

Of course they do. That's like saying Oxide doesn't know how to fix stuff because they don't design the CPU, NVMe, DIMMs, etc. Oxide is still going to vendors for these things.


Ironically, Dell's total inability to resolve a pathological rash of uncorrectable memory errors is very much part of the origin story of Oxide: this issue was very important to my employer (who was a galactic Dell customer), and as the issue endured and Dell escalated internally, it became increasingly clear that there was in fact no one at Dell who could help us -- Dell did not understand how their own systems work.

At Oxide, we have been deliberate at every step, designing from first principles whenever possible. (We -- unlike essentially everyone else -- did not simply iterate from a reference design.)

To make this concrete with respect to the CPU in particular, we have done our own lowest-level platform enablement software[0] -- we have no BIOS. No one -- not the hyperscalers, not the ODMs, and certainly not Dell -- has done this, and even AMD didn't think we could pull it off. Why did we do it this way? Because all along our lodestar was that problem that Dell was useless to us on -- we wanted to understand these systems from first principles, because we have felt that that is essential to deliver the product that we ourselves wanted to buy.

There are plenty of valid criticisms of Oxide -- but that we don't understand our system simply isn't one of them.

[0] https://www.osfc.io/2022/talks/i-have-come-to-bury-the-bios-...


As a side question, what's the name of your custom firmware that replaces the AGESA bootloader? I tried searching the Oxide GitHub page but couldn't find anything that seemed to fit that description.


(The AGESA bootloader -- or ABL -- is in the AMD PSP.) In terms of our replacement for AGESA: the PSP boots to our first instruction, which is the pico host bootloader, phbl[0]. phbl then loads the actual operating system[1], which performs platform enablement as part of booting. (This is pretty involved, but to give you a flavor, see, e.g. initialization of the DXIO engine.[2])

[0] https://github.com/oxidecomputer/phbl

[1] https://github.com/oxidecomputer/illumos-gate/tree/stlouis

[2] https://github.com/oxidecomputer/illumos-gate/blob/stlouis/u...


Thanks. Are the important Oxide branches of the illumos-gate repo (and any other cloned repos) documented anywhere? I definitely wouldn't have found that branch without you mentioning it here.


Interestingly enough, I also ran into something somewhat related with Dell that they were not able to resolve, so they ended up working in a replacement from another vendor.

Nonetheless, it is quite interesting what you've built, but as the end user I'm not quite convinced that it matters. Sure, you can claim it reduces attack vectors and such, but we'll still see Dells and IBMs in the most restricted and highest-security-posture sites in the world. Think DoD and such. coreboot/libreboot with a RoT will get me through compliance all the same.

The software management plane y'all built is the headlining feature IMHO, not so much what happens behind the scenes, which the vast majority of the time will not have a fatal, catastrophic upstream effect.

>There are plenty of valid criticisms of Oxide -- but that we don't understand our system simply isn't one of them.

That's not what I said. There's a line in the sand that you must cross when it comes to understanding the true nature of the componentry that you're using. At the end of the day, your AMD CPUs may be lying to you, to all of us, but we just don't know it yet.


> Dell is a single vendor that will diagnose and fix all of your hardware issues.

And you'll be down for weeks or months while they do it.


>And you'll be down for weeks or months while they do it.

Off by a few orders of magnitude. Dell on-site SLA with pre-purchased spares was about 6 hours.

With Oxide, you'd be lucky to get same day service.


> Off by a few orders of magnitude. Dell on-site SLA with pre-purchased spares was about 6 hours.

You're talking about replacement parts. Yes Dell is good about that.

The discussion above is asking them to diagnose and fix a problem with the interaction of various hardware components (all of which come from third parties).


Oxide also has various hardware components from AMD, Intel, Samsung, etc. They are not manufacturing every component.


But they _are_ writing the firmware that runs most of them, and they need to understand those devices at a deep level in order to do that, unlike Dell. Dell slaps together hardware and firmware from other vendors with some high-level software of their own on top. They don't do the low-level firmware and thus don't understand the low-level intricacies of their own systems.


No, they're not, unless I'm mistaken. They're not writing the firmware that runs on the NVMe drives, nor the NICs (they're not even writing the drivers for some of the NICs), etc.

There's a line in the sand that you must cross when it comes to understanding the true nature of the componentry that you're using. At the end of the day, your AMD CPUs may be lying to you, to all of us, but we just don't know it yet.


I'm not speaking hypothetically. If you hit a "zero-day" bug that Dell has never seen it's going to take time. And somehow every large customer finds bugs that Dell certification didn't.


> And somehow every large customer finds bugs that Dell certification didn't.

It's a law of computer engineering.

In the Apollo 11 descent sequence the Rendezvous Radar experienced a hardware bug[0] not uncovered during simulation. They found it later, but until then, the solution was adding a "turn off Rendezvous Radar" checklist item.

[0] The Rendezvous Radar would stop the CPU, shuttle some data into areas where it could be read, and wake the CPU back up to process it. The bug caused it to spuriously do this dance just to report "no new data", which then caused other systems to overload.


>I'm not speaking hypothetically.

Neither am I.

>If you hit a "zero-day" bug that Dell has never seen it's going to take time.

If you hit a "zero-day" bug that Oxide has never seen it's going to take time.

>And somehow every large customer finds bugs that Dell certification didn't.

Yes, that happens. And I'm sure the exact same will happen with Oxide, so it's not a differentiator.


The vast majority of people only need to deploy VMs.


It's ironic coming from a company whose CTO has harped about containers on bare metal for years. Maybe a large swath only needs to deploy VMs, but the future will most definitely involve bare metal for many use cases, and oddly Oxide doesn't support that currently.


I run a battalion of 78,000 nodes and I disagree with you.


I used to run over 150,000 nodes and I agree with me.


See the pattern? Dell only cares about the big guys.

Setting aside the childish tone ...

> Dell is a single vendor that will diagnose and fix all of your hardware issues.

There are two anecdotes here disagreeing with you, and frankly that's enough to say that what you said above isn't true, at least not universally. I doubt Oxide is targeting big deployments like yours, but more like theirs. Whether they will succeed is another matter, but they do have a valid sales pitch and the expertise to pull it off.


> you had a single company to answer all the calls, so it was much better

Huh, then why did Sun, Oracle, and Veritas have to set up a shared tech support center in San Jose?

"Accelerated finger pointing", said a friend who had to do business with them.


>Right, who wants or benefits from open source firmware anyway.

Their competition has open source firmware as well:

https://www.dell.com/en-us/blog/enabling-open-embedded-syste...


So OpenBMC is fine (happy for them!), but having open firmware is much deeper and broader than that: yes, it's the service processor (in contrast to the BMC which is a closed part on Dell machines) -- but it's also the root-of-trust and (especially) the host CPU itself. We at Oxide have open source software from first instruction out of the AMD PSP; I elaborated more on our approach in my OSFC 2022 talk.[0]

[0] https://www.osfc.io/2022/talks/i-have-come-to-bury-the-bios-...


Dell now ships with OpenBMC iDRACs and such. How does what you mention differ from the RoT in Dells?

https://www.dell.com/en-us/blog/hardware-root-trust/


Dell uses trusted platform modules (TPMs). It's a separate chip from the BMC chipset.

For a mostly open source solution, not only would you need open source BMC firmware, you would also need open source UEFI/BIOS/boot firmware like coreboot, LinuxBoot, oreboot, U-Boot, etc.


The fact that it's not on Linux is one of the great things about it. There is too much Linux in critical infrastructure already and the monoculture just keeps on growing.

At least with Oxide there is a glimmer of hope for a better future in this regard.


> something no one really wants, or, at best, is niche

Could be! Seems too early to tell though, and it remains to be seen whether it pencils out. Which is the whole idea of starting a new venture, no?



