I don't think it is a surprise that Linux is about adding features in an organic way, instead of being well thought out like the *BSDs.
However, saying that no testing happens in the Linux kernel is dishonest, to say the least: there are automated test suites maintained by big corporations, like LTP [1] or autotest [2]; thousands of people run different versions of unstable/mainline kernels with different configurations; security researchers run fuzzers and report the issues they find; multiple open source projects run their test suites on current versions of the Linux kernel (which in the end also serves as a test of the kernel itself); and so on.
Linux is basically the kind of project that is big and impactful enough that it naturally gets testing for free from the community.
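For anyone who hasn't looked at what suites like LTP automate, here is a toy C sketch of the general shape of such a test: it exercises one documented syscall behaviour and reports pass/fail. This is not LTP's actual API or harness, just an illustration of the kind of check those suites run by the thousand.

    /* Toy sketch of an LTP-style syscall check; not LTP's real API,
       just the shape of the tests such suites automate. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        /* Documented behaviour: open() on a missing path without
           O_CREAT must fail and set errno to ENOENT. */
        errno = 0;
        int fd = open("/no/such/file", O_RDONLY);

        if (fd == -1 && errno == ENOENT) {
            printf("PASS: open() reports ENOENT for a missing path\n");
            return EXIT_SUCCESS;
        }

        if (fd != -1)
            close(fd);
        printf("FAIL: expected -1/ENOENT, got fd=%d errno=%d\n", fd, errno);
        return EXIT_FAILURE;
    }

The real suites wrap checks like this in a common harness, run them across architectures and configurations, and flag regressions automatically.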
This has always been my problem with the "many eyes make all bugs shallow" theory. It's a random walk. The most common features and configurations will get tested over and over and over, well beyond the point of usefulness. Less common features might not get tested at all. In fact it's worse than a random walk, because usage is even more concentrated in a few areas. Yes, the fuzzers etc. do find some bugs, but relative to the size and importance of the project itself, their contribution is smaller than it would be on most other kinds of projects.
There's just no substitute for rigorous unit and/or functional tests, constructed with an eye toward what can happen instead of just what commonly does happen in the most vanilla usage. Unfortunately, UNIX kernel development - not just Linux - has never been strong on that kind of testing. Part of the reason is the inherent difficulty of doing those sorts of tests on such a foundational component. Part of it is ignorance, which I mean as simply "not knowing" and not as an insult. Part of it is macho kernel-hacker culture. Whatever the reasons, it needs to change.
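To make "unit tests" concrete in a kernel context: this is roughly what an in-kernel unit test looks like when sketched with the KUnit framework's macros. The suite and case names here are invented for illustration, not taken from the tree.

    /* Minimal in-kernel unit test sketched with KUnit macros;
       the suite and case names are illustrative only. */
    #include <kunit/test.h>

    static void example_math_test(struct kunit *test)
    {
        /* Assert on kernel-internal behaviour directly, with no
           userspace, hardware, or timing in the loop. */
        KUNIT_EXPECT_EQ(test, 2 + 2, 4);
    }

    static struct kunit_case example_test_cases[] = {
        KUNIT_CASE(example_math_test),
        {}
    };

    static struct kunit_suite example_test_suite = {
        .name = "example",
        .test_cases = example_test_cases,
    };

    kunit_test_suite(example_test_suite);

The hard part isn't the mechanics, it's carving foundational code into pieces small enough to be tested this way - which is exactly the discipline that's been missing.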
I have been in a few situations where I could have found bugs, because I was exploring deep behavior, but the lack of legible documentation on the expected behavior made me back off. I have no clue what's expected or not; the man pages are just a joke and lkml is mostly unreadable.
> The most common features and configurations will get tested over and over and over well beyond the point of usefulness. Less common features might not get tested at all.
Not that I wouldn't want a full-featured Unix-like kernel with a comprehensive[1] test suite (and I don't know of any major OS that has this: not Windows, macOS or Linux), but I think this is fine.
Common setups work fine; uncommon configurations may have some problems, with possible workarounds. Most people will not run mainline kernels anyway, so a workaround is acceptable.
[1]: What I mean by comprehensive is basically having tests for every possible configuration, which is basically impossible anyway. Probably the closest thing we can get is a formally proven OS, but I don't think we will ever have a general-purpose OS that is formally proven.
> tests for every possible configuration, that is basically impossible anyway
Agreed. Expecting 100% across the entire kernel would be totally unreasonable. OTOH, coverage could be better on a lot of components considered individually.
Any component as complex as XFS is going to have tons of bugs. I don't mean that as an insult. I was a filesystem developer myself for many years, until quite recently. It's the nature of the beast. The problem is how many of those bugs remain latent. All the common-case bugs are likely to be found and fixed pretty quickly, but that only provides a false sense of security. Without good test coverage, those latent less-common-case bugs start to reappear every time anything changes - as seems to have been the case here. That actually slows the development of new features, so even the MOAR FEECHURS crowd end up getting burned. Good testing is worth it, and users alone don't do good testing.
I think filesystems in the kernel do have automated tests; I know xfstests [1] exists, at least. And they exist exactly because filesystem bugs are generally critical: a filesystem bug generally means that someone will lose data.
[1]: Despite what the name may suggest, xfstests is run on other filesystems too. Here is an example of xfstests ported to ZoL (ZFS on Linux): https://github.com/zfsonlinux/xfstests
Yes, xfstests exists. I've used it myself and it's actually pretty good for a suite that doesn't include error injection. But part of today's story is that xfstests wasn't being updated to cover the last several features that were added to XFS. The result is exactly the kind of brittleness that is characteristic of poorly tested software. Something else changed, and previously latent bugs started popping out of the rotten woodwork.
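To give a flavour of what a single xfstests-style case boils down to, here is a toy data-integrity check in plain C: write a known pattern, fsync, read it back, compare. The path and size are arbitrary, and a real case would also cover mount options, error injection, and crash consistency on a dedicated scratch device.

    /* Toy data-integrity check in the spirit of an xfstests case.
       The path and size are arbitrary, for illustration only. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define TESTFILE "/tmp/fs-smoke-test"
    #define LEN      (1 << 20)   /* 1 MiB */

    int main(void)
    {
        char *wbuf = malloc(LEN), *rbuf = malloc(LEN);
        if (!wbuf || !rbuf)
            return EXIT_FAILURE;
        memset(wbuf, 0xa5, LEN);   /* known pattern */

        int fd = open(TESTFILE, O_CREAT | O_RDWR | O_TRUNC, 0600);
        if (fd < 0 || pwrite(fd, wbuf, LEN, 0) != LEN || fsync(fd) != 0) {
            perror("write phase");
            return EXIT_FAILURE;
        }

        if (pread(fd, rbuf, LEN, 0) != LEN || memcmp(wbuf, rbuf, LEN) != 0) {
            fprintf(stderr, "FAIL: data read back does not match\n");
            return EXIT_FAILURE;
        }

        printf("PASS: pattern survived write/fsync/read\n");
        close(fd);
        unlink(TESTFILE);
        return EXIT_SUCCESS;
    }

The interesting cases are the ones this toy skips: what happens on ENOSPC, on a torn write, after a simulated power cut - exactly the less-common paths that don't get exercised for free by ordinary users.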
What’s crazy for me, after reading all this, is how wonderfully stable my Linux (kernel-level)[0] experience has been. I’ve never used any non-ext2/3/4 filesystems, granted, so I haven’t used this code, but I find it hard to believe that these findings are indicative of the code I have used on a relatively run-of-the-mill amd64 machine. So maybe if you’re like me, using a fairly standard distro with the official kernel on somewhat normal hardware, you would have the benefit of millions others testing the same code.
[0]: I have had more than my fair share of user land problems, but I have come to expect that on any platform.
Yeah I've also had good experiences with Linux reliability.
But that's because I intentionally stay on the "happy path" that's been tested by millions of others. I avoid changing any kernel settings and purposely choose bog-standard hardware (Dell).
When you're on the other side, you're not just maintaining the happy path. You're maintaining every path! And I'm sure it is unbelievably complex and frustrating to work with.
-----
Personally I would like software to move beyond "the happy path works" but that seems beyond the state of the art.
Over time you get trained not to do anything "weird" on your computer, because you know that, say, opening too many programs at once can cause a lockup. Or you don't want to aggressively move your mouse too much when doing other expensive operations. (This may be in user space or the kernel, but either way you're trained not to do it.)
There is another post that I can't find that is about "changing defaults". I used to be one of those people who tried to configure my system, but I've given up on that. The minute you have a custom configuration, you run into bugs, with both open source and commercial software.
The kernel has thousands of runtime and compile-time options, so I have no doubt that there are thousands upon thousands of bugs available for you to experience if you change them in a way that nobody else does. :)
> Or you don't want to aggressively move your mouse too much when doing other expensive operations. (This may be in user space or the kernel, but either way you're trained not to do it.)
Operant conditioning by software bugs is totally a thing, but for this particular example I was trained into exactly the opposite behaviour. I do move my mouse a lot during very resource-intensive computations, because that lets me gauge the load on my system (is there UI animation lag? is there cursor movement lag?), and in extreme cases, it can tell me when it's time to do a hard reboot. I've also learned through experience that screensavers, auto-locking, and even auto-poweroff of the screen can all turn what was a long computation into a forced reboot, so avoiding long inactivity periods is important.
This conditioning comes from me growing up with Windows, but I hear people brought up on Linux have their own reasons - apparently it used to be the case (maybe it still is?) that some computations relying on a PRNG would constantly deplete the OS's entropy pool, and so just moving your mouse around would make those computations go faster.
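For the record, the behaviour being described looks roughly like this from userspace: a read from /dev/random could stall until the kernel's entropy estimate recovered, and input events such as mouse movement feed that estimate. (Newer kernels have since reworked /dev/random so it rarely blocks once the system has booted.)

    /* Sketch of the old /dev/random behaviour: on older kernels this
       read could block until the entropy estimate recovered, which
       input events like mouse movement help along. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        unsigned char buf[64];
        int fd = open("/dev/random", O_RDONLY);
        if (fd < 0) {
            perror("open /dev/random");
            return 1;
        }

        /* May stall here on an idle, entropy-starved older system. */
        ssize_t n = read(fd, buf, sizeof(buf));
        printf("read %zd bytes of blocking-pool randomness\n", n);
        close(fd);
        return 0;
    }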
Your lens is probably too small; Paul does a great job of quantifying bug probability in more human terms in the presentation above.
I ran all OS dev and support in an environment with 10000 FreeBSD systems and 3000 Linux systems in high scale production. The ratio of kernel panics was similar (although Linux tended to exhibit additional less rigorous failure modes due to more usage of work queues and other reasons). You could expect at least a couple faults per day depending on the quality of hardware and how far off the beaten path your system usage is at this scale. The big benefit I found with FreeBSD is that I could understand and fix the bugs, and I was generally intimidated by doing so on Linux.
How do you define "thoroughly" anyway? Just like a comment in the original post says:
>>But anything less than 100% coverage guarantees that some part of the code is not tested...
> And anything less than 100% race coverage similarly guarantees a hole in your testing. As does anything less than 100% configuration-combination coverage. As does anything less than 100% input coverage. As does anything less than 100% hardware-configuration testing. As does ...
It is combinatorially impossible to "thoroughly" test something as large and complex as the Linux kernel. Other than, e.g., SQLite, I struggle to think of a single substantial piece of software that is truly thoroughly tested.
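As a rough back-of-the-envelope: even if the kernel had only 100 independent boolean config options (it has many thousands), exhaustive configuration coverage alone would mean 2^100 ≈ 1.3 × 10^30 distinct builds, before you even get to inputs, hardware, timing, and interleavings.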
[1]: https://github.com/linux-test-project/ltp
[2]: https://github.com/autotest/autotest