
"This fits in with the kernel's policy of attempting to keep the system running as much as possible by default; if an accidental use-after-free bug occurs here for some reason, the kernel can probably heuristically mitigate its effects and keep the process working."

That's a pretty bad design decision. I'd much rather have a kernel panic than a kernel that continues to run with known-bad data structures. With that approach the bugs will never really shake out; silent failures like that are super dangerous, because the system is essentially running in an undefined state past that point.




IIUC the heuristic here is how much of the system to take down (i.e., back out of fast path [this case] < kill thread < kill process < kernel panic) and this seems correct (but could still be a symptom of more corruption).

Anyway, there has been discussion about this issue in general previously on the Linux kernel mailing list, and Linus said [1] that the correct procedure is to first introduce by-default reporting and an opt-in kill switch, then make the kill switch the default, and then remove the non-kill option. This is supposed to weed out bugs eventually without disrupting users too much, but I can imagine that it enables some exploits. There was an HN discussion about it too. [2]

[1] https://lkml.org/lkml/2017/11/21/356 [2] https://news.ycombinator.com/item?id=15754988


It's terrible from a correctness perspective, sure. But from a business perspective, that could mean $50 million of revenue not-lost to a flaw you (as an owner/operator) can't do much about. Sure, it could lead to an exploit like this, but I'd wager a cost-benefit analysis for almost any organization (except maybe CIA/NSA types) would support this design choice.

That's not to say it's a good design choice, but it's certainly a defensible one IMO. You can have the most secure OS in the world, but if no one wants to use it, all you've got is a very secure waste of hard drive space.


If a kernel panic can cost you $50 million, you have other problems. Really, in an organization where downtime due to a server rebooting would be that expensive, you'd hope they would be able to deal with that gracefully and would be deploying the rough equivalent of Chaos Monkey to ensure that their stuff is protected against such errors.

After all, a hard drive or a CPU could die just the same.


This is a principle born of experience. There have been a number of cases where a new fatal-error check killed a bunch of systems that had apparently been working fine for a long time. The usual preference is to WARN_ONCE() but not BUG_ON(). Otherwise, big corporate users, or a wide random assortment of desktop users, end up going back to an older kernel to make their systems work. The Linux kernel tries above all to never have regressions, even if the "regression" is just a new check catching a probably-very-bad bug that simply wasn't caught before. This happens often enough!

Think of this through the lens of natural selection. There are many, many subtle, tricky low-level bugs that can result in memory corruption - many drivers, many features, many optimizations, many "cleanups", a surprisingly high rate of code change. Many of them are caught when they cause problematic corruption. Others, just due to luck, very rarely cause any visible problem - for months or years. We do want to know about them. We do not want to take the whole system down over them; it may be one of those rare bugs that made it this far precisely because it's usually (surprisingly) not fatal on its own.
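For concreteness, the shape of that tradeoff in kernel code looks roughly like the sketch below. This is illustrative only - struct foo, its fields, and foo_recover() are made up, not from the patch under discussion - but WARN_ONCE() and BUG_ON() are the real macros: the former prints a one-time backtrace, taints the kernel, and returns the condition so the caller can fall back; the latter oopses the offending context on the spot.

    #include <linux/bug.h>
    #include <linux/kernel.h>
    #include <linux/types.h>

    /* Hypothetical state, a stand-in for whatever the real code tracks. */
    struct foo {
            bool valid;
            int refcount;
    };

    static void foo_recover(struct foo *f)
    {
            f->valid = true;        /* stand-in for a real fallback/repair path */
    }

    static void foo_update(struct foo *f)
    {
            /* Recoverable by policy: report once, taint the kernel, limp along. */
            if (WARN_ONCE(!f->valid, "foo %p in unexpected state\n", f)) {
                    foo_recover(f);
                    return;
            }

            /* Unrecoverable by policy: oops this context right here. */
            BUG_ON(f->refcount < 0);
    }

The natural-selection point above is exactly why the WARN_ONCE() branch dominates in practice: the bug still gets reported (and the kernel gets tainted, so bug reports are triaged accordingly) without turning a survivable glitch into an outage.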


Your implicit assumption is that the server will only reboot once. I think that as long as the Linux kernel is upfront about its security posture, and that posture is incompatible with your threat model, it's OK to go with OpenBSD instead.


The more frequent the reboots, the quicker the problem will get diagnosed!

Especially if the system is still stable enough to write a log entry to disk.


The problem is that downtime sucks, but if you don't take the time to make the system more correct, these buglets pile up and you end up with a huge technical debt problem. It's all about economics, really, and if you have the resources early on in a project, aiming for correctness is better than sweeping things under the rug.


It's extremely useful for developing drivers. I've developed drivers on kernels that are hardwired to panic at the slightest problem and it's a real nuisance.

Whether you should leave it on in production is debatable, but I like at least having the option of making exceptions raised in drivers nonfatal.


And the reason for that is that these drivers are linked into the monolith. If they were standalone processes you'd love the ability to do post-mortem debugging on them, or maybe even the luxury of being able to run them directly under the debugger.

Whenever I had to write device drivers I wrote them under QNX first and only then ported them to other OSes; that saved so much time. At least I knew the hardware interface would be up and running and the data structures would all work as intended. After that, all I had to do was glue it to whatever calling convention the various *NIX flavors had.

That trick served me well, for motion capture devices as well as for high speed serial cards and some more exotic devices.


This is fairly analogous to our approach too, just substituting QNX w/ seL4.


IMO it's actually a really hard call. Error isolation is hard, but we all want it. One of the somewhat less controversial versions is failing a single user request (in Node or your web framework) with a 5xx on an exception. You're essentially betting that the exception is a logic bug and not something that will affect other users. That's probably true - unless, you know, your database driver is responsible. We bet that's usually not the case, and we're mostly right.

Honestly I'm OK with trying to keep the system limping along, with one huge "but": you must, at the point you first detect an error, dump... everything you know. You can try to limp along because you're trying to be a good host, but debugging after that point is, as you said, not trivial.

If you separate the two concerns (post-mortem debugging & uptime) there's a happy medium to be found. Ideally kernel panics aren't the only source of observability. You can have a daemon running that files a bug report to your favorite error tracker (Sentry, etc.) and (attempts to) gracefully reboot the system. That would be pretty sweet.
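A minimal sketch of that split, assuming a userspace watchdog that polls the kernel's taint mask in /proc/sys/kernel/tainted (the reporting and reboot policy here are placeholders, not an existing tool):

    /* Hypothetical watchdog: notice that the kernel has newly tainted itself
     * (e.g. a WARN() fired), report it, and schedule an orderly reboot rather
     * than waiting for a panic. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static long read_taint(void)
    {
        FILE *f = fopen("/proc/sys/kernel/tainted", "r");
        long taint = -1;

        if (!f)
            return -1;
        if (fscanf(f, "%ld", &taint) != 1)
            taint = -1;
        fclose(f);
        return taint;
    }

    static void report_and_reboot(long taint)
    {
        /* Placeholder: post the taint mask plus recent dmesg to your error
         * tracker here, then reboot gracefully instead of falling over. */
        fprintf(stderr, "kernel tainted (mask=%#lx), scheduling reboot\n", taint);
        system("shutdown -r +5 'kernel reported an internal error'");
    }

    int main(void)
    {
        long baseline = read_taint();

        for (;;) {
            long now = read_taint();

            if (now > baseline) {   /* new taint bits since startup */
                report_and_reboot(now);
                break;
            }
            sleep(10);
        }
        return 0;
    }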


Erlang is the only environment that I know about that gets this right.


> I'd much rather have a kernel panic than a kernel that continues to run with known bad datastructures

I mean, that is literally what an Oops is.


I agree. Efforts to continue past unexpected states frequently just make the situation worse and end up causing more user pain than a deterministic restart. This kind of recovery also masks real bugs and leads to their persistence in the codebase. (Heuristic error recovery also makes debugging more difficult.) A fail-fast approach both preserves user invariants and provides an impetus to fix bugs fast.


Rebooting when this happens could be a kernel configuration option. That way users could choose.
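For what it's worth, knobs along these lines already exist: kernel.panic_on_oops and (on newer kernels) kernel.panic_on_warn. A minimal sketch of flipping them at runtime, assuming /proc/sys is mounted and you have the privileges to write there - equivalent to sysctl -w kernel.panic_on_oops=1 kernel.panic_on_warn=1:

    /* Turn oopses and WARN()s into full panics for the fail-fast crowd. */
    #include <stdio.h>

    static int write_knob(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");

        if (!f)
            return -1;
        fputs(val, f);
        fclose(f);
        return 0;
    }

    int main(void)
    {
        write_knob("/proc/sys/kernel/panic_on_oops", "1");
        write_knob("/proc/sys/kernel/panic_on_warn", "1");
        return 0;
    }

Both typically default to off, which is exactly the "keep limping" policy this thread is arguing about.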



