> There is no way to “fix” the problem and resume the task. This was a conscious...

bcantrill · 2024-04-27T20:38:12

Hubris is not an academic exercise: it runs at the heart of every element of the Oxide rack (compute sled, switch, power shelf controller) -- and its design is informed by delivered utility above all else. Indeed -- and as Cliff elaborated in the blog -- REPLY_FAULT was something that he thought initially perhaps too aggressive, but it was our own experience in building, deploying, and (it must be said!) debugging the system that gave him the confidence that it would make our systems more robust, not capriciously faulty.

For more details on the thinking here and what it looks like in practice, see (e.g.) [0] and [1].

[0] https://www.mattkeeter.com/blog/2024-03-25-packing/

[1] https://cliffle.com/blog/who-killed-the-network-switch/

vvanders · 2024-04-27T21:57:12

> that can tolerate no real-world chaos, and I'm not aware of any commercially viable realms which would either.

Watchdog timers will happily kill/restart your processes that don't poke them often enough. Even in my hobby exercises I've seen I2C buses hang often enough (and bring the whole system down!) when some protocol bit goes wrong that I think the design is actually quite inspired. As I understand it, this isn't about known error cases (which are handled) but about protocol mismatches and other things that shouldn't ever happen.

Many other comments touched on it, but it's a purpose-built OS. Much in the same way that I'm not going to build a UI in Erlang, Hubris seems well positioned for the space that it occupies.
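
To make the watchdog point concrete, here is a minimal sketch in Rust. It is a pure simulation with a tick counter, not a driver for any real watchdog peripheral; the `Watchdog` type, `run_with_hang`, and all parameters are hypothetical names chosen for illustration.

```rust
/// A simulated watchdog: counts ticks since the last "pet" and reports
/// expiry once the timeout is reached. Hypothetical type, for illustration.
#[derive(Debug)]
pub struct Watchdog {
    timeout_ticks: u32,
    elapsed: u32,
}

impl Watchdog {
    pub fn new(timeout_ticks: u32) -> Self {
        Watchdog { timeout_ticks, elapsed: 0 }
    }

    /// Called by a healthy task to signal that it is still making progress.
    pub fn pet(&mut self) {
        self.elapsed = 0;
    }

    /// Advance time by one tick; returns true if the watchdog has expired.
    pub fn tick(&mut self) -> bool {
        self.elapsed = self.elapsed.saturating_add(1);
        self.elapsed >= self.timeout_ticks
    }
}

/// Simulate a task that pets the watchdog every tick until `hang_at`, when it
/// hangs (think of a stuck bus transaction). When the watchdog expires, the
/// task is restarted in a clean state. Returns the number of restarts.
pub fn run_with_hang(total_ticks: u32, hang_at: u32, timeout_ticks: u32) -> u32 {
    let mut wd = Watchdog::new(timeout_ticks);
    let mut restarts = 0;
    let mut hung = false;
    for now in 0..total_ticks {
        if now == hang_at {
            hung = true;
        }
        if !hung {
            wd.pet();
        }
        if wd.tick() {
            // The watchdog does not diagnose the hang; it only restarts.
            restarts += 1;
            hung = false;
            wd.pet();
        }
    }
    restarts
}
```

The watchdog never tries to understand why the task stopped responding. It only restores a known-good starting state, which is the same trade-off being discussed here.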

crote · 2024-04-27T19:25:58

> But by what mechanism would that strategy be able to understand the fault that occurred, in order to try again better?

I think the general idea is to apply this to problems which are clearly the result of an invalid program state, and therefore not reasonably recoverable. They are either caused by bugs, an attack, or corrupted hardware. In all cases you shouldn't continue, because there's something seriously wrong with the caller. If the caller continues, it could only cause more damage.

It sounds a bit like Erlang/OTP's "let it crash" philosophy. Erlang is used in a lot of mission-critical hardware and is famous for its reliability, so it might not be such a huge dealbreaker in practice.
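
The distinction between a handled error and an invalid request could be sketched roughly as follows. This is not the Hubris API: `Server`, `Caller`, `Outcome`, `send`, and the two-byte message format are all hypothetical, and the "kernel" is reduced to a single function.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskState {
    Healthy,
    Faulted,
}

/// Hypothetical client task, tracked only by its state.
pub struct Caller {
    pub state: TaskState,
}

/// What the server decides to do with a message.
pub enum Outcome {
    /// A normal reply, which may carry an expected, handled error.
    Reply(Result<u32, &'static str>),
    /// The message could not have come from a correct client.
    FaultCaller,
}

/// Hypothetical server exposing four slots via the message `[1, index]`.
pub struct Server {
    pub slots: [Option<u32>; 4],
}

impl Server {
    pub fn handle(&self, msg: &[u8]) -> Outcome {
        match msg {
            // Well-formed request: opcode 1 plus an in-range index.
            [1, idx] if (*idx as usize) < self.slots.len() => {
                match self.slots[*idx as usize] {
                    Some(v) => Outcome::Reply(Ok(v)),
                    // A known error case: reported to the caller as usual.
                    None => Outcome::Reply(Err("slot empty")),
                }
            }
            // Wrong length, unknown opcode, or out-of-range index: a bug,
            // an attack, or corruption on the caller's side.
            _ => Outcome::FaultCaller,
        }
    }
}

/// Stand-in for the kernel: deliver a message, and fault the caller if the
/// server rejects it. A faulted caller gets no reply and cannot keep sending.
pub fn send(server: &Server, caller: &mut Caller, msg: &[u8]) -> Option<Result<u32, &'static str>> {
    if caller.state == TaskState::Faulted {
        return None;
    }
    match server.handle(msg) {
        Outcome::Reply(r) => Some(r),
        Outcome::FaultCaller => {
            caller.state = TaskState::Faulted;
            None
        }
    }
}
```

In this sketch an empty slot is an ordinary error the client is expected to handle, while a malformed message stops the client outright, since nothing it does afterwards can be trusted.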

sillywalk · 2024-04-27T21:56:30

> It sounds a bit like Erlang/OTP's "let it crash" philosophy.

Which was based partly on ideas from Tandem Computers' NonStop / Guardian. Hardware and software were fail-fast, i.e. they would work correctly or stop, so they couldn't corrupt data. If there was a problem, the whole processor / process would be stopped and a backup took over, which seems somewhat similar to the "supervisor" tasks in Hubris.

Quite different use cases, though: an embedded OS for microcontrollers vs. large OLTP applications. Both could be considered "mission critical", at least for the people who own or make money with them.
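
As a rough illustration of fail-fast plus a supervisor, here is a small Rust sketch. It does not model NonStop, Guardian, or the Hubris supervisor; `Counter`, `Supervisor`, and `feed` are hypothetical names, and "restart" here just means rebuilding the task from its initial state.

```rust
/// A fail-fast task: it either works correctly or stops with an error.
pub struct Counter {
    value: u32,
}

impl Counter {
    pub fn new() -> Self {
        Counter { value: 0 }
    }

    /// Accumulate non-negative inputs. A negative input violates the task's
    /// invariant, so it stops instead of continuing with suspect state.
    pub fn step(&mut self, input: i32) -> Result<u32, &'static str> {
        if input < 0 {
            return Err("invariant violated: negative input");
        }
        self.value += input as u32;
        Ok(self.value)
    }
}

/// A supervisor that owns the task and replaces it when it stops.
pub struct Supervisor {
    task: Counter,
    pub restarts: u32,
}

impl Supervisor {
    pub fn new() -> Self {
        Supervisor { task: Counter::new(), restarts: 0 }
    }

    /// Feed one input to the task. On failure, discard the old task
    /// (and its state) and start a fresh one.
    pub fn feed(&mut self, input: i32) -> Option<u32> {
        match self.task.step(input) {
            Ok(v) => Some(v),
            Err(_) => {
                self.task = Counter::new();
                self.restarts += 1;
                None
            }
        }
    }
}
```

The supervisor never inspects or repairs the failed task's state. It throws that state away, which is what keeps a stopped task from corrupting anything further.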

vvanders · 2024-04-27T22:05:28

From a "system engineering" (not to be confused with software engineering) perspective they seem quite similar. In my view, even something like a watchdog timer (which just about every CPU/core has these days) is just a hardware version of similar philosophies. This [1] is one of my favorite overviews of Erlang and what drives some of those design decisions. You can absolutely apply the same systematic thinking to other domains/places without having to bring OTP or even Erlang into the conversation.

[1] https://ferd.ca/the-zen-of-erlang.html

cryptoxchange · 2024-04-27T19:27:49

It’s a 2000-line Rust embedded-systems kernel that doesn’t support adding new tasks at runtime. It is written to go deep in the guts of the Oxide server racks.