Nasty Lockup Issue Still Being Investigated for Linux 3.18 (phoronix.com)
133 points by harshreality on Nov 30, 2014 | 56 comments



The more mature a system is, the harder the remaining bugs will be. It's a simple bit of math: easy bugs can be found by just reasoning about the code; harder bugs can be found with some persistence and sometimes a debugger. Those are all found relatively early in the life of a codebase. But a really hard bug that shows up only under stress on a system that has been running stable for days, weeks or even months will have a cycle time roughly equal to how often you can make it happen on a single machine. Hard to reproduce -> hard to analyze -> therefore hard to fix.

Anything you can do to speed up the occurrence of such a bug should be reached for first, because increasing the frequency with which the bug appears means you'll be able to fix it sooner, and lets you be more certain that you really have fixed it once you think you have identified the problem.

You can waste months on nasty little ones like this. I love it though: digging in on a bug and not letting go until the sucker is nailed for good. Especially hardware-related bugs (interrupt handlers that are not 100% transparent, for instance) and subtle timing bugs can be super hard to fix. But that feeling when you finally find it is absolutely priceless; it's like finding the solution to a very complex puzzle. I wish it did not take such a toll on my nights though :)


"easy bugs can be found by just reasoning about the code"

I probably live in a dream world, but I would say software bugs of the 'causes a deadlock in an OS that is used in millions of devices' type should (maybe even must) be avoided by reasoning about the code. If you find that you cannot reason about the whole system at once, you either simplify it (Python's GIL is a nice extreme example; you never read about race conditions inside Python's core) or you improve your logic.

In my world, you should only need the debugger for hardware-related bugs, and even then it often is a combination of intuition and sheer perseverance that solves the issue. Especially if timing is involved, looking at the system can make the problem go away. But even there, having thought about the system will help a lot. For example, your reasoning might have said "between two X-es, we always get at least one Y, so…". You can turn that into a hypothesis that can be tested easily (in theory; practice can be quite different).

On the other hand, if you must use modern PC hardware, you get a lot of unavoidable complexity (multiple interrupt levels, drivers that you cannot all check for compliance with your model of how drivers work, etc.), so you may end up in a situation where reasoning about the code isn't really possible.


You're not just living in a dream world, you're using the phrase to excuse a snide strawman. The Linux kernel is 12-20 million lines of code depending on how you count it (and how far up the top end has gone recently). It is not possible for humans to know and comprehend the entire codebase, much less avoid all lockup bugs by reasoning about the code. In fact I'm pretty sure all extant methods of creating a correctness proof fail at that scale (even if you can ignore drivers). Much of the kernel requires specialized skills and experience the majority of kernel developers don't have.

Beyond that, the kernel developers value both correctness and performance. Most of these bugs involve multiprocessing, and a huge amount of effort has gone into getting rid of the global locks, but even the replacements aren't easy to reason about at that scale - I think you're missing the common thread in "deadlock" and "Global Interpreter Lock". Pretending that a world could exist where we can simply reason about something like the Linux kernel, and then using it to state that the devs can simplify or improve the logic, is taking a completely unsupported cheap shot, and the disclaimer about the real world doesn't change that. I'm not even sure what your motivation is for the equivalent of "if you pretend that friction & drag don't exist it's really easy to figure out aerodynamics, so plane designers need to do better" (hint: it's not), but it certainly isn't to inform, contribute, or facilitate the debate.


Divide and conquer. For every modular part in the kernel, define what it is supposed to do: are calls reentrant? What combinations of calls are allowed? What interrupts can it generate? Etc.

If you don't make rules as to how the parts are to behave, how are writers of device drivers going to know how to make a working driver? Tweak it until it doesn't hang the system in a few days?

And how are you going to fix that deadlock you find during debugging? You may know (simple example) that thread T1 holds lock L1 while trying to get L2, while thread T2 holds L2 while trying to get L1, but without guidance on how things are supposed to work, how do you even know whether to fix T1 or T2?
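
A minimal userspace sketch of exactly that situation (hypothetical code, purely for illustration; the names t1/t2/L1/L2 match the example above):

    /* Thread t1 takes L1 then L2; thread t2 takes L2 then L1.
       Run it a few times and it hangs, just like the example. */
    #include <pthread.h>
    #include <unistd.h>

    static pthread_mutex_t L1 = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t L2 = PTHREAD_MUTEX_INITIALIZER;

    static void *t1(void *arg) {
        pthread_mutex_lock(&L1);
        usleep(1000);                /* widen the race window */
        pthread_mutex_lock(&L2);     /* blocks forever if t2 holds L2 */
        pthread_mutex_unlock(&L2);
        pthread_mutex_unlock(&L1);
        return arg;
    }

    static void *t2(void *arg) {
        pthread_mutex_lock(&L2);
        usleep(1000);
        pthread_mutex_lock(&L1);     /* blocks forever if t1 holds L1 */
        pthread_mutex_unlock(&L1);
        pthread_mutex_unlock(&L2);
        return arg;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);       /* never returns once both block */
        pthread_join(b, NULL);
        return 0;
    }

With a rule like "locks are always taken in the order L1, then L2", it is immediately clear that t2 is the one that has to change.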

IMO, the more complex the system, the greater the need for simple rules that limit what the different parts can do.

Chances are there will still be lockup bugs, but that is not a reason not to try and avoid them.


Who says a kernel must have 15 million lines of code? I'm pretty sure there are kernels out there that have been verified correct. Of course the hardware could still be broken (can also be proven correct) or a cosmic ray could flip some bits...



Once a program overflows a single engineer's mental buffer, all bets are off. I think the size of that buffer is one of the things that separates good from great in our line of work. We engineer systems not to rely on a single person's understanding, and that's an important part of what we do, but it doesn't mean we don't lose something when a system grows beyond the grok of the humans building it.


"Relative to the complexity we can create, we are all statistically at the same point in understanding it." (paraphrasing Rich Hickey)

I don't think the size of the buffer is what distinguishes the good from great, because complexity is combinatorial in nature. It's the ability to mitigate complexity by designing systems composed of simple parts.


These sorts of discussions always remind me of the essay about how "nobody knows how to make a pencil":

http://www.econlib.org/library/Essays/rdPncl1.html


This is a nice philosophy, but with Linux's extensive suite of automated tests (zero), I'm more surprised that it ever works at all.

The other day I was "backporting" some input patches to an older kernel version. The patches all applied cleanly. The resulting source code compiled. The feature I wanted worked. But it still scares me too much to check it in.


> Linux's extensive suite of automated tests (zero),

What about LTP [1] and Autotest [2]?

«Linux Test Project is a joint project started by SGI, OSDL and Bull developed and maintained by IBM, Cisco, Fujitsu, SUSE, Red Hat, Oracle and others. The project goal is to deliver tests to the open source community that validate the reliability, robustness, and stability of Linux.

The LTP testsuite contains a collection of tools for testing the Linux kernel and related features.»

[1] http://linux-test-project.github.io/

[2] http://autotest.github.io/


These look pretty nice. I wish these were included with the main source tree.


There are also a few tests within the main source tree.

Locking and RCU have built-in "torture" tests (CONFIG_LOCK_TORTURE_TEST and CONFIG_RCU_TORTURE_TEST). Take also a look at the tools/testing/ directory in the kernel source tree.
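
Conceptually, a lock torture test boils down to something like this userspace sketch (very much simplified, and only an analogue of the in-kernel versions, which live in the source tree as locktorture and rcutorture): several threads hammer one lock while a shared counter checks that mutual exclusion actually holds.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int holders;              /* threads that think they own the lock */

    static void *torture(void *arg) {
        for (long i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);
            if (++holders != 1) {    /* impossible if the lock works */
                fprintf(stderr, "mutual exclusion violated\n");
                abort();
            }
            --holders;
            pthread_mutex_unlock(&lock);
        }
        return arg;
    }

    int main(void) {
        pthread_t t[8];
        for (int i = 0; i < 8; i++)
            pthread_create(&t[i], NULL, torture, NULL);
        for (int i = 0; i < 8; i++)
            pthread_join(t[i], NULL);
        puts("no violations observed");
        return 0;
    }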


Hey, I've had that bug! The whole computer freezes for up to a minute and then you get a dmesg stack trace:

    Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
    CPU: 3 PID: 17176 Comm: trinity-c95 Not tainted 3.17.0+ #87
    0000000000000000 00000000f3a61725 ffff880244606bf0 ffffffff9583e9fa
    ffffffff95c67918 ffff880244606c78 ffffffff9583bcc0 0000000000000010
    ffff880244606c88 ffff880244606c20 00000000f3a61725 0000000000000000
    Call Trace:
    <NMI> [<ffffffff9583e9fa>] dump_stack+0x4e/0x7a
    ...
It was easily reproducible by running two or more cores at 100% for a few minutes. I wanted to be a good citizen and report it, but I wasn't able to figure out where or how you're supposed to report kernel bugs. If it's the same bug as the Phoronix article mentions, then it's not a regression because I've had the problem in 3.13 too.
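
The load itself was nothing special; a trivial busy-loop spinner along these lines (a hypothetical sketch of that kind of load, not necessarily the exact program) will hold the cores at 100%:

    /* spin N busy-loop threads to peg N cores */
    #include <pthread.h>
    #include <stdlib.h>

    static void *spin(void *arg) {
        volatile unsigned long x = 0;
        for (;;)
            x++;                     /* burn CPU forever */
    }

    int main(int argc, char **argv) {
        int n = argc > 1 ? atoi(argv[1]) : 2;   /* default: two cores */
        pthread_t t;
        for (int i = 0; i < n - 1; i++)
            pthread_create(&t, NULL, spin, NULL);
        spin(NULL);                  /* main thread is the last spinner */
    }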



Also, any distribution worth its salt will accept bugs for any package it ships and (especially for something as important as the kernel) report the bugs upstream. This may be a slow process, but it does mean that bugs introduced by distributor patches are more likely to be caught.


Afaict, that site only wants bugs for the mainline kernel, not for kernels modded by Ubuntu.



There is a PPA for Linus's kernel; it's great for this stuff - it makes upstreamable bug reports really easy. (Also for discovering "hey, my proprietary wifi now has a free driver!")


If you had it with 3.13 too, it might not be the same bug. It could well be a bug in a driver.


I have an Xubuntu box with older hardware (running a Pentium D something-or-other) which frequently experiences something similar. I have no idea if it is the same issue as either of these; I always chalked it up to faulty hardware.


Hey, at least the watchdog worked.


Just spitballing here, but it sounds like it could be an issue with the scheduler.


So it's said to take days of stress testing for the crash to occur each time.

I've had to trace bugs under emulation that took substantial amounts of time to trigger. Thankfully in my case, I was able to serialize the entire system into snapshots. And so by just taking a snapshot every few minutes, the one right before a crash allowed me to be able to quickly reproduce the problem.

Does this crash event occur inside of VMs as well, and if so, could VM snapshots be used to accelerate the bug hunt here?


This is an approach I use for reproducing issues in userspace applications as well. I happen to maintain a fairly large Windows-based application at my workplace, and sometimes the steps to reproduce an issue take minutes to hours just to get to the point where you can try to replicate it. I personally use VMware Workstation at work, and since I'm on a quick SSD, taking and reverting multiple snapshots has little performance cost.


I've always thought it would be nice if the kernel exposed this ability to regular userland applications. On Unix systems, Ctrl+Z will suspend a terminal application, but not across reboots. Core files come very close, yet typically they are only produced in response to a crash and aren't really meant for resuming a process. Obviously there are lots of limitations (eg shared memory could cause issues, hardware state could change), but in a lot of cases it would probably work pretty well.

Not only is it great for triggering crashes, but if you then broke out into a debugger, you could toy with toggling register and memory contents and then single-step through the instructions to see what happens. It's really substantially easier to debug software under emulation because of features like these.


Sounds like you're looking for the checkpoint/restore functionality that has been in the works for a few years now. Check out http://en.wikipedia.org/wiki/CRIU for more details; it seems they have things mostly working now.


DragonFly BSD does this. They call it process checkpointing.

Processes under DragonFly may be "checkpointed" or suspended to disk at any time. They may later be resumed on the originating system, or another system by "thawing" them.

http://www.dragonflybsd.org/features/#index8h2


OpenBSD and FreeBSD call it suspend/resume I think:

http://openbsd.org/papers/zzz.pdf

http://wiki.freebsd.org/SuspendResume


Those links are talking about suspend/resume of the computer, which is a totally different thing.


You might enjoy rr ("record-replay"), which works kind of like gdb (and I believe is based on it): after you record a program execution, you can then run gdb multiple times over that stream of execution. rr-project.org.


Honest question: what guarantees that if the bug reproduced 10 minutes after a snapshot, it will reproduce again if you resume running nondeterministically from that snapshot?


That's always an issue, but in my testing with emulators, once you rule out external inputs (eg keyboard input), it tends to become predictable and reproducible. Using a VM should help dramatically in reducing randomness. You keep capturing states and narrowing the window until you can reproduce it in eg ten seconds, since too much can happen in the span of ten minutes. The trick is that you have to get a state from right before the problem actually begins, but you don't yet know where that is. So it's possible your state capture may be a bit too late, after the issue has already eg corrupted memory somewhere.

I know that there are certainly bugs that this kind of technique would never work on. I've hit a few bugs that could only be triggered on live hardware before. But I'm curious if they've tried this kind of approach for this bug yet or not.


Could you post more about your technique? Like, what emulator are you using (qemu)? How do you trigger the snapshots? Are you snapshotting memory or disk or both? How much disk space is consumed by all these snapshots? Do you discard snapshots? How do you know when the bug has been triggered?


Sure, it's my own software. I hook up saving and loading states to key presses, so eg F5 would save, F7 would load. And then F6/F8 would increment or decrement the save slot number.

You basically have to capture everything possible: all memory, the state of all the CPU registers, the state of all hardware registers, etc. Obviously disk would be a real challenge, where you'd have to keep a delta list of disk changes since program start, or simply not serialize that state. If you miss anything, you can have problems loading states correctly. However, there is quite a bit of tolerance between theory and practice, so if there is something that you really can't capture the state of for some reason (like a hardware write-only register when you weren't logging what was previously written to it), a lot of times you can get away with it anyway.

Because the system I am using is so old, snapshots are only 300KB each. Sometimes I dump them to disk, sometimes I just keep them in RAM. I know that a PC would be much more challenging, given how much more hardware is at play, and because VMs aren't quite the same as pure software emulation like qemu (though you could potentially use qemu for this too), but VM software does implement this snapshot system, so clearly it's possible.

You know when the bug is triggered through the visual output. And what's cool is that by saving periodic snapshots automatically to a ring buffer, you can code a special keypress to "rewind" the program. So it crashes, then you go back a bit, and save a disk snapshot there, and wait to see if the bug repeats. If it does, then you turn on trace logging and dump all of the CPU instructions between your save point and the crash. Then you go to the crash point, and slowly work your way back to try and find out where things went wrong.
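
The periodic-snapshot ring buffer is simple enough to sketch (hypothetical names; in the real thing each slot holds the serialized CPU, RAM and hardware-register state mentioned above):

    #include <string.h>

    #define SLOTS      64
    #define STATE_SIZE (300 * 1024)       /* ~300KB per snapshot */

    static unsigned char ring[SLOTS][STATE_SIZE];
    static int head;                      /* next slot to overwrite */

    /* called periodically, eg every few emulated seconds */
    void snapshot_save(const void *machine_state) {
        memcpy(ring[head], machine_state, STATE_SIZE);
        head = (head + 1) % SLOTS;
    }

    /* "rewind": restore the state from n snapshots ago */
    void snapshot_rewind(void *machine_state, int n) {
        int slot = ((head - 1 - n) % SLOTS + SLOTS) % SLOTS;
        memcpy(machine_state, ring[slot], STATE_SIZE);
    }

Bind snapshot_save to a timer and snapshot_rewind to a keypress and you get the "go back a bit after the crash" workflow more or less for free.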


Byuu is the author of the excellent SNES (Super Nintendo) emulator bsnes. I would guess he is referring to emulation errors in emulators, where snapshot support is common, and usually of good quality.


Well, nothing, but that's bug hunting: you try it and see. Whether or not this helps to recreate the problem on demand, you know more than you did before.


Reversible debugging might help more here than snapshots, since it allows you to record a failure and then debug it: http://rr-project.org/


Probably nothing, but it's worth a try.


Nothing, but it narrows down the conditions and variables involved, which is handier than nothing.


Is this an out-of-memory lockup or something else?

Linux has a history of out-of-memory lockups where disk caching has tied up most of memory. Linux uses free memory as disk cache; when there's a need for memory, clean cache pages are reused. Sometimes the I/O system has the cache locked at the moment there's a need for memory, resulting in an out-of-memory lockup. This has been fixed in the past, and broken again in the past.


So has anyone besides Dave Jones managed to reproduce the issue? The article doesn't say.

This doesn't look like front page news until it is confirmed to be a widespread problem.


I find it interesting that someone posted a patch to the KERNEL with a comment of "doesn't always work?"

It would seem prudent to invest a bit more time to understand why a particular thing doesn't work, for such a critical bit of code (I'm not implying I know that this is the cause of, or related to, these newfound bugs, but the article mentions it, so I'm just a little surprised).


> It would seem prudent to invest a bit more time to understand why a particular thing doesn't work, for such a critical bit of code (I'm not implying I know that this is the cause of, or related to, these newfound bugs, but the article mentions it, so I'm just a little surprised).

It would indeed. That's no reason not to post the patch though. Welcome to the imperfect world of software development.


Following the links, the full comment text is:

  /* On Xen the line below does not always work. Needs investigating! */
So this is, in theory, a lower-priority issue for the mainline kernel, especially since it apparently hasn't caused problems for ~8-9 years during which Xen and Linux have been thoroughly put to the test (e.g. Amazon AWS).

Could be a "tearing out your hair" desperate sort of thing where you cast around for any possible cause....


The title is a little off - it appears that this is a 3.17 issue as well, so it's not a 3.18 regression. What's interesting is that in attempting to track down exactly what is causing the kernel lockup, the team is likely to track down other, quite possibly unrelated, kernel issues as well.


I don't think this is new to 3.18. I have had a lockup issue in every version since 3.2

https://bugzilla.kernel.org/show_bug.cgi?id=29162


That looks like a reiserfs bug. Is there any reason to go with reiserfs these days instead of ext4, xfs, btrfs?


Versus ext4: online resize and better handling of large directories. Versus xfs: better at handling power loss/crashes (no "trailing zeroes" problem); reiserfsck is not 100% reliable, but it can sometimes save you, whereas if you reach the same situation with xfs you have a 0% chance of recovering your data. Versus btrfs: more mature (isn't btrfs still tagged as experimental?), and the same situation wrt fsck.

I use ZFS where I can, but for root filesystems on linux machines reiser still seems like the best tradeoff.


> Versus xfs: better at handling power loss/crashes (no "trailing zeroes problem")

I heard they fixed it in 2012. Actually there have been a lot of developments in XFS since 2011; I'm not really interested in filesystems myself, but you should have a look at it again if you are.


Ext4 seems to support online resizing when growing (although it needs to be offline to shrink). Didn't ext3 or ext4 also add directory hashing?


reiserfsck used to be a game of Russian roulette -- either it sort-of works and you can mount your filesystem, or it eats all your data.

I can't say I'd trust my data on it these days.


It's a gamble sure. But if you get into the same situation with XFS or btrfs there's no gamble, just guaranteed data loss.


XFS is not as bad as it once was, but I don't use either of those anyway.


No; after dealing with it I would recommend avoiding it like the plague. However, with several petabytes deployed at remote sites it's not a simple matter to just up and abandon it.


Confirming: we have also seen a lot of lockup issues since 3.12 (with both ext4 and btrfs).



