How I Found a 20-Year-Old Linux Kernel Bug

dom0 · on June 14, 2017

Linux ~4.7 or so fixed a bug in fadvise, specifically FADV_DONTNEED, that incorrectly rounded page boundaries to the effect of making some calls less effective. Found and fixed by a developer who wondered why his page cache was filling up, even though the backup software he used made use of DONTNEED :)

The bug was in from day 1.

carussell · on June 14, 2017

The thing I wonder about in instances like this is how many people ran into the problem and thought, "Huh, there must be some quirky rationale here, and it's just an idiosyncrasy I'll have to deal with", or, "Huh, that's definitely wrong... oh well".

aisofteng · on June 15, 2017

I can definitely say that I've been guilty of seeing a subtle but not deal breaking bug in third party open source software and then failing to file a report because there's just so much to do in a day that it's easy to forget. It's on all of us to be good open source citizens.

ethbro · on June 15, 2017

I think the counterpoint to this is being kind when people report bugs that aren't bugs.

If users encounter a weird result, report it, and have someone call them an idiot because they misunderstood a nuance of the system... they're probably not going to take the time to report next time around.

(And I get it, there's that common issue that people consistently misunderstand and you continually get reports about. But each one of those users might also be a user who finds a real bug next time.)

lomnakkus · on June 15, 2017

I think your parenthetical probably explains the dismissive attitude many upstreams have, but I would argue that if so many people are (consistently) misunderstanding your software then there's probably a UI/UX/education problem that needs fixin'...

(Not saying it's easy, but it's a sign...)

alkonaut · on June 15, 2017

Someone said, somewhere: User files bug that isn't a bug:

bad/rookie dev: omg dumb user

good dev: closes bug issue, files new issue to fix docs.

13of40 · on June 15, 2017

Years back, I spent some time doing QA for an operating system, and I noticed that there was a strong tendency in people to deny the possibility that the operating system had a bug when they noticed something wrong and instead put all their energy into fixing test automation, finding workarounds, etc. And a lot of these were people whose sole purpose at work was finding and reporting bugs in the OS.

dom0 · on June 14, 2017

Ah, I almost got it right. It discarded too much, thus reducing throughput - even more subtle.

https://github.com/torvalds/linux/commit/18aba41cbfbcd138e9f...
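To make the rounding question concrete, here is a minimal sketch, not the kernel's code. It assumes a 4096-byte page size, and both function names are made up for illustration. A byte range that does not start and end on page boundaries can be rounded outward (every page it touches) or inward (only pages it covers completely). Discarding the outward set throws away partially covered pages too, which matches the "discarded too much" description above.

```python
PAGE_SIZE = 4096  # assumed; a real system reports it via os.sysconf("SC_PAGE_SIZE")


def pages_touched(offset, length):
    """Round outward: every page index that overlaps the byte range at all."""
    if length <= 0:
        return range(0)
    first = offset // PAGE_SIZE
    last = (offset + length - 1) // PAGE_SIZE
    return range(first, last + 1)


def pages_fully_covered(offset, length):
    """Round inward: only page indices lying entirely inside the byte range."""
    first = (offset + PAGE_SIZE - 1) // PAGE_SIZE  # round the start up
    end = (offset + length) // PAGE_SIZE           # round the end down (exclusive)
    return range(first, max(first, end))


# A range starting 100 bytes into page 0 and spanning two pages' worth of bytes.
print(list(pages_touched(100, 8192)))        # [0, 1, 2]
print(list(pages_fully_covered(100, 8192)))  # [1]
```

In this example pages 0 and 2 are only partly inside the range, so a caller that asked to drop just those bytes would lose cached data it never mentioned if the outward set were discarded.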

wyc · on June 14, 2017

Great work!

Segfaults are horrible, but bugs like these are even more so. Crawling, sneaking, living in your walls. Stealing precious CPU cycles and memory from millions of machines at once--petabytes and petaflops when you add it all up.

Worse yet, they don't make a peep until you shine the holy light of benchmarks, code reviews, automated testing, or the (un)lucky corner case on them.

At least segfaults let you know something's definitely wrong. These bugs? Now that's insidious.

glangdale · on June 14, 2017

The 'rr' project gets a side mention, but if you're into debuggers, it's really worth a look - a very practical "record and replay" debugger that allows you to move around in time in a debug session. http://rr-project.org/

hashhar · on June 15, 2017

rr is indispensable to me, especially since I started hacking around on Firefox and other C++ codebases (lack of interactive debuggers).

alkonaut · on June 15, 2017

Why are there no interactive debuggers if you develop Firefox?

hultner · on June 14, 2017

It's always nice to see fixes of problems found with improved testing. It would be nice to see something like Haskell's QuickCheck rigorously applied to the majority of the kernel functions/interfaces.

galapago · on June 14, 2017

Is there a Haskell type describing Linux syscalls? If so, we could automatically derive Haskell code to generate an arbitrary list of syscalls.

(disclaimer: I'm one of the QuickFuzz [0] developers and I'm very interested in testing for this kind of bug)

[0]: http://quickfuzz.org/
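The idea of deriving random call sequences from a typed description can be sketched roughly as follows. This is only an illustration in Python, not QuickFuzz or QuickCheck: the table of calls, the argument kinds, and every name here are hypothetical, and the code only generates call descriptions without executing anything.

```python
import random

# Hypothetical, hand-written description of a few calls and their argument kinds.
SYSCALL_SPECS = {
    "open": ["path", "flags"],
    "read": ["fd", "size"],
    "close": ["fd"],
}

# One generator per argument kind, deliberately biased toward edge cases.
GENERATORS = {
    "path": lambda rng: rng.choice(["/tmp/a", "/dev/null", ""]),
    "flags": lambda rng: rng.randrange(0, 1 << 12),
    "fd": lambda rng: rng.randrange(-1, 16),
    "size": lambda rng: rng.choice([0, 1, 4096, 2**31 - 1]),
}


def arbitrary_calls(rng, n):
    """Return n random (name, args) pairs whose arguments match the spec table."""
    calls = []
    for _ in range(n):
        name = rng.choice(sorted(SYSCALL_SPECS))
        args = tuple(GENERATORS[kind](rng) for kind in SYSCALL_SPECS[name])
        calls.append((name, args))
    return calls


# A seeded generator makes a failing sequence reproducible.
for call in arbitrary_calls(random.Random(0), 5):
    print(call)
```

The point of the typed table is that argument generation follows mechanically from the description, so adding a call means adding one entry rather than writing a new generator by hand.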

pdimitar · on June 15, 2017

Sorry for going a bit off-topic, but can you point me at a good tutorial on exactly what must be done so I can use your software to test mine?

galapago · on June 20, 2017

If you already know Haskell, you can take a look at our article explaining how it works:

https://labdcc.fceia.unr.edu.ar/~amista/article.pdf

Feel free to contact us by email if you need anything.

aleksi · on June 14, 2017

Something like https://github.com/google/syzkaller?

mastax · on June 15, 2017

Wow, they found 458 bugs in the Linux kernel: https://github.com/google/syzkaller/blob/master/docs/found_b...

nathan_f77 · on June 15, 2017

Terrifying to think that tiny, subtle bugs like this could exist in pacemakers and planes. Especially if something like this only happens once every decade or so:

> I guess once in a while it would fail if your allocator happens to land one at the end of a page.

wolfgang42 · on June 15, 2017

My understanding is that pacemakers and planes (and similarly high-reliability systems) tend to statically assign everything and avoid allocators altogether, for precisely this reason. It's much easier to prove that you don't have memory problems if you simply assign every byte of RAM to a specific task and then make sure it's always used for that task and only that task.

wextsucks · on June 15, 2017

Title should have been "How I found a bug in some deprecated 802.11g-era API that is disabled by default"

I always wonder why the kernel does not just warn when this API is used or yank it. Any software stuck with this API can barely know about 802.11n and is probably wondering what is 802.11ac or 802.11ax. Only some old or broken device drivers require this API.

Linux distros should just stop enabling this API.

aneutron · on June 15, 2017

I actually use a 802.11g because work. So no thanks.

metalliqaz · on June 14, 2017

I have not read anything about this bug other than the very short description in the linked blog, but this seems like a bug that a static analysis tool would have flagged. I know such tools have been used on the kernel (e.g. Coverity). Very surprised it survived until now.

benmmurphy · on June 14, 2017

the syscall function takes a void* parameter then does a copy from user space into the kernel using the wrong target type/sizeof. i think it works because the incorrect type was a superset of the correct type. i don't think any static analyser could catch this.

proposed patch here:

https://bugzilla.kernel.org/attachment.cgi?id=256997&action=...
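A toy model of why copying with the wider type usually works and only occasionally fails. Everything here is an assumption made for illustration: the two structures are hypothetical stand-ins sized at 32 and 40 bytes (the sizes mentioned elsewhere in this thread), the "mapped memory" is a single 4096-byte buffer, and `toy_copy_from_user` only imitates the bounds check; it is not the kernel function.

```python
import ctypes


class NarrowReq(ctypes.Structure):
    """Hypothetical 32-byte request that the caller actually allocates."""
    _fields_ = [("name", ctypes.c_char * 16), ("data", ctypes.c_char * 16)]


class WideReq(ctypes.Structure):
    """Hypothetical 40-byte superset type that the handler mistakenly copies."""
    _fields_ = [("name", ctypes.c_char * 16), ("data", ctypes.c_char * 24)]


PAGE_SIZE = 4096  # assumed size of the only mapped page in this toy model


def toy_copy_from_user(mapped, user_addr, size):
    """Return the requested bytes, or None if the range leaves mapped memory.

    In the real kernel a failed copy is typically reported to the caller
    as EFAULT.
    """
    if user_addr < 0 or user_addr + size > len(mapped):
        return None
    return mapped[user_addr:user_addr + size]


mapped = bytes(PAGE_SIZE)
narrow, wide = ctypes.sizeof(NarrowReq), ctypes.sizeof(WideReq)

# Mid-page: the 8 extra bytes are readable, so the oversized copy goes unnoticed.
print(toy_copy_from_user(mapped, 100, wide) is not None)               # True
# At the very end of the page: the correct size fits, the oversized copy does not.
print(toy_copy_from_user(mapped, PAGE_SIZE - narrow, narrow) is not None)  # True
print(toy_copy_from_user(mapped, PAGE_SIZE - narrow, wide) is not None)    # False
```

In this model the failure needs the 32-byte structure to end within 8 bytes of the end of mapped memory, which is why it depends on where the allocator happens to place it.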

jjnoakes · on June 14, 2017

A static analyzer with sufficient inter-procedural analysis (i.e. across the user/kernel boundary) could certainly see the type of the allocated structure in userspace, and flag the kernel read of a larger type.

rullelito · on June 15, 2017

I've never done any kernel/low level stuff, so sorry for a probably stupid question. Will this bug only happen when there is less than 40 bytes but more than 31 bytes available? (i.e. less than 32 bytes will fail anyway)

teddyknox · on June 15, 2017

Love the modesty implicit in the short post.

aaronchall · on June 15, 2017

Brilliant post. Horrific page UX - half of the text was blocked by a large blue box. Work firewall may be responsible for something not loading, but I don't see why I should have to delete nodes from the dom to read the text.

dsego · on June 14, 2017

given enough eyeballs, all bugs are shallow

geofft · on June 14, 2017

Sure, but it's not at all clear eyeballs are the most efficient way to find bugs. They seem remarkably inefficient compared to computers, which have generally shown themselves to be good at monotonous mechanical work that requires good attention to detail and no creativity.

In particular it seems to me like this could have been fixed with a better, machine-readable description of the types/structures for each ioctl, plus a static analysis tool that makes sure that the kernel does a copy_from_user on exactly what the documented input types are and no more or less. There is already a halfhearted attempt to encode type information in ioctls (the _IOR, _IOW, etc. macros), so I think this is doable. I'm not sure how much work is required to trace copy_from/to_user statically, but it certainly seems like it would be far less work than 20 years of people using these syscalls.

As another example, I think "given enough eyeballs, all bugs are shallow" would be a poor reason to eschew writing tests for your code.
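The type information mentioned above is at least partly recoverable from the request number itself. The sketch below assumes the asm-generic bit layout (8 bits of number, 8 bits of type, 14 bits of size, 2 bits of direction); some architectures use different widths, and older ioctls that predate the macros carry no size at all. The helper names are made up, and the "checker" is only the comparison a static analysis tool would want to make, not an actual tool.

```python
# Assumed asm-generic layout: nr in bits 0-7, type in 8-15, size in 16-29, dir in 30-31.
IOC_NRSHIFT, IOC_TYPESHIFT, IOC_SIZESHIFT, IOC_DIRSHIFT = 0, 8, 16, 30
IOC_SIZEMASK = (1 << 14) - 1
IOC_WRITE, IOC_READ = 1, 2  # direction bits, from the user's point of view


def ioc(direction, type_char, nr, size):
    """Build a request number the way the _IOR/_IOW family does under this layout."""
    return (
        (direction << IOC_DIRSHIFT)
        | (size << IOC_SIZESHIFT)
        | (ord(type_char) << IOC_TYPESHIFT)
        | (nr << IOC_NRSHIFT)
    )


def ioc_size(request):
    """Extract the argument size encoded in a request number."""
    return (request >> IOC_SIZESHIFT) & IOC_SIZEMASK


def copy_matches_declared_size(request, copied_bytes):
    """True if a handler copies exactly as many bytes as the request declares."""
    return ioc_size(request) == copied_bytes


# Hypothetical request: user writes a 32-byte argument, type 'x', number 1.
req = ioc(IOC_WRITE, "x", 1, 32)
print(ioc_size(req))                        # 32
print(copy_matches_declared_size(req, 32))  # True
print(copy_matches_declared_size(req, 40))  # False
```

The hard part a real tool would face is finding the `copied_bytes` value statically for each handler; the comparison itself is trivial once both numbers are known.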

db48x · on June 15, 2017

The quoted soundbite is intended to be pithy, not 100% literal. It's ok if some of the eyeballs are in fact implemented with automated static analysis.

userbinator · on June 15, 2017

This bug was in code which very few applications would ever use, and its manifestation is quite specific, so it's no surprise that it was never discovered. I don't doubt that more of these "benign" bugs exist in other exotic pieces of code in the kernel; in light of this, perhaps the saying should be "given enough bugs, all eyeballs are shallow."

reeboo · on June 14, 2017

I agree. This is a great example of Linus' Law.

gghhvgx · on June 14, 2017

[flagged]

dom0 · on June 14, 2017

Since I have to squint my eyes to read your comment the principle you postulate seems to work.

WhiteSource1 · on June 15, 2017

And clearly no one noticed it in 20 years....

_e · on June 14, 2017

did this guy get a bug bounty?

__jal · on June 14, 2017

The consolation prize was an HN front-page link.

bonzini · on June 14, 2017

It's not a security-sensitive issue.

jwilk · on June 15, 2017

FWIW, not all bug bounty programs are about security.

squalor1a · on June 14, 2017

Is there any way you can get paid a bounty?