How I Found a 20-Year-Old Linux Kernel Bug (ocallahan.org)
274 points by cesarb on June 14, 2017 | 40 comments



Linux ~4.7 or so fixed a bug in fadvise, specifically FADV_DONTNEED, that incorrectly rounded page boundaries to the effect of making some calls less effective. Found and fixed by a developer who wondered why his page cache was filling up, even though the backup software he used made use of DONTNEED :)

The bug was in from day 1.


The thing I wonder about in instances like this is how many people ran into the problem and thought, "Huh, there must be some quirky rationale here, and it's just an idiosyncrasy I'll have to deal with", or, "Huh, that's definitely wrong... oh well".


I can definitely say that I've been guilty of seeing a subtle but not deal-breaking bug in third-party open source software and then failing to file a report, because there's just so much to do in a day that it's easy to forget. It's on all of us to be good open source citizens.


I think the counterpoint of this is being kind when people report bugs that aren't bugs.

If users encounter a weird result, report it, and have someone call them an idiot because they misunderstood a nuance of the system... they're probably not going to take the time to report next time around.

(And I get it, there's that common issue that people consistently misunderstand and you continually get reports about. But each one of those users might also be a user who finds a real bug next time.)


I think your parenthetical probably explains the dismissive attitude many upstreams have, but I would argue that if so many people are (consistently) misunderstanding your software then there's probably a UI/UX/education problem that needs fixin'...

(Not saying it's easy, but it's a sign...)


Someone said, somewhere: User files bug that isn't a bug:

bad/rookie dev: omg dumb user

good dev: closes bug issue, files new issue to fix docs.


Years back, I spent some time doing QA for an operating system, and I noticed that there was a strong tendency in people to deny the possibility that the operating system had a bug when they noticed something wrong and instead put all their energy into fixing test automation, finding workarounds, etc. And a lot of these were people whose sole purpose at work was finding and reporting bugs in the OS.


Ah, I almost got it right. It discarded too much, thus reducing throughput - even more subtle.

https://github.com/torvalds/linux/commit/18aba41cbfbcd138e9f...
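For readers who want the rounding issue spelled out, here is a rough sketch of the arithmetic, not the kernel's code. It assumes a 4096-byte page size, and the names `touched_pages` and `full_pages` are made up for illustration. Dropping every page the byte range touches discards partial pages at the edges ("too much"); dropping only the pages that lie entirely inside the range does not.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


def touched_pages(offset: int, length: int, page_size: int = PAGE_SIZE) -> range:
    """Indices of every page that overlaps [offset, offset + length) at all."""
    if length <= 0:
        return range(0)
    end = offset + length
    first = offset // page_size      # round the start down
    last = -(-end // page_size)      # round the end up (exclusive index)
    return range(first, last)


def full_pages(offset: int, length: int, page_size: int = PAGE_SIZE) -> range:
    """Indices of pages lying entirely inside [offset, offset + length)."""
    if length <= 0:
        return range(0)
    end = offset + length
    first = -(-offset // page_size)  # round the start up
    last = end // page_size          # round the end down (exclusive index)
    return range(first, max(first, last))
```

For an 8192-byte range starting at byte 100, `touched_pages` covers pages 0 through 2, while `full_pages` covers only page 1.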


Great work!

Segfaults are horrible, but bugs like these are even more so. Crawling, sneaking, living in your walls. Stealing precious CPU cycles and memory from millions of machines at once--petabytes and petaflops when you add it all up.

Worse yet, they don't make a peep until you shine the holy light of benchmarks, code reviews, automated testing, or the (un)lucky corner case on them.

At least segfaults let you know something's definitely wrong. These bugs? Now that's insidious.


The 'rr' project gets a side mention, but if you're into debuggers, it's really worth a look - a very practical "record and replay" debugger that allows you to move around in time in a debug session. http://rr-project.org/


rr is indispensable to me, especially since I started hacking around on Firefox and other C++ codebases (lack of interactive debuggers).


Why are there no interactive debuggers if you develop Firefox?


It's always nice to see fixes of problems found with improved testing. It would be nice to see something like Haskell's QuickCheck rigorously applied to the majority of the kernel functions/interfaces.


Is there a Haskell type of Linux syscalls? If it is the case, we can automatically derive Haskell code to generate an arbitrary list of syscalls.

(disclaimer: I'm one of the QuickFuzz [0] developers and I'm very interested in testing for these kinds of bugs)

[0]: http://quickfuzz.org/


Sorry for going a bit off-topic, but can you point me to a good tutorial on exactly what I need to do to use your software to test mine?


If you already know Haskell, you can take a look at our article explaining how it works:

https://labdcc.fceia.unr.edu.ar/~amista/article.pdf

Feel free to contact us by email if you need to.



Wow, they found 458 bugs in the Linux kernel: https://github.com/google/syzkaller/blob/master/docs/found_b...


Terrifying to think that tiny, subtle bugs like this could exist in pacemakers and planes. Especially if something like this only happens once every decade or so:

> I guess once in a while it would fail if your allocator happens to land one at the end of a page.


My understanding is that pacemakers and planes (and similarly high-reliability systems) tend to statically assign everything and avoid allocators altogether, for precisely this reason. It's much easier to prove that you don't have memory problems if you simply assign every byte of RAM to a specific task and then make sure it's always used for that task and only that task.


Title should have been "How I found a bug in some deprecated 802.11g-era API that is disabled by default"

I always wonder why the kernel does not just warn when this API is used, or yank it. Any software stuck with this API can barely know about 802.11n and is probably wondering what 802.11ac or 802.11ax is. Only some old or broken device drivers require this API.

Linux distros should just stop enabling this API.


I actually use a 802.11g because work. So no thanks.


I have not read anything about this bug other than the very short description in the linked blog; however, this seems like a bug that would have been flagged by a static analysis tool. I know they've been used on the kernel (e.g. Coverity). Very surprised it survived until now.


The syscall function takes a void* parameter, then does a copy from user space into the kernel using the wrong target type/sizeof. I think it works because the incorrect type was a superset of the correct type. I don't think any static analyser could catch this.

proposed patch here:

https://bugzilla.kernel.org/attachment.cgi?id=256997&action=...


A static analyzer with sufficient inter-procedural analysis (i.e. across the user/kernel boundary) could certainly see the type of the allocated structure in userspace, and flag the kernel read of a larger type.


I've never done any kernel/low level stuff, so sorry for a probably stupid question. Will this bug only happen when there is less than 40 bytes but more than 31 bytes available? (i.e. less than 32 bytes will fail anyway)
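One way to reason about that window is a toy model, sketched below. It is not kernel code: the 32- and 40-byte figures are taken from the question above, the function names are hypothetical, and "bytes left" means the number of mapped bytes between the start of the caller's buffer and the end of the mapping.

```python
CORRECT_SIZE = 32  # bytes the caller actually allocates (figure from the question)
COPIED_SIZE = 40   # bytes the kernel tries to copy (figure from the question)


def copy_faults(bytes_left: int, copy_size: int) -> bool:
    """Toy stand-in for a user-to-kernel copy: it fails if it runs off the mapping."""
    return copy_size > bytes_left


def outcome(bytes_left: int) -> str:
    """Classify what happens for a buffer with `bytes_left` mapped bytes after it."""
    if bytes_left < CORRECT_SIZE:
        # The buffer cannot even hold the correct type: the caller's mistake.
        return "caller bug"
    if copy_faults(bytes_left, COPIED_SIZE):
        # A correctly sized buffer still fails because the copy is too large.
        return "kernel bug hit"
    return "works by luck"
```

Under this model the oversized copy only bites when 32 to 39 mapped bytes remain; with 40 or more, the extra bytes are read without a fault and the call appears to work.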


Love the modesty implicit in the short post.


Brilliant post. Horrific page UX - half of the text was blocked by a large blue box. Work firewall may be responsible for something not loading, but I don't see why I should have to delete nodes from the DOM to read the text.


given enough eyeballs, all bugs are shallow


Sure, but it's not at all clear eyeballs are the most efficient way to find bugs. They seem remarkably inefficient compared to computers, which have generally shown themselves to be good at monotonous mechanical work that requires good attention to detail and no creativity.

In particular it seems to me like this could have been fixed with a better, machine-readable description of the types/structures for each ioctl, plus a static analysis tool that makes sure that the kernel does a copy_from_user on exactly what the documented input types are and no more or less. There is already a halfhearted attempt to encode type information in ioctls (the _IOR, _IOW, etc. macros), so I think this is doable. I'm not sure how much work is required to trace copy_from/to_user statically, but it certainly seems like it would be far less work than 20 years of people using these syscalls.

As another example, I think "given enough eyeballs, all bugs are shallow" would be a poor reason to eschew writing tests for your code.


The quoted soundbite is intended to be pithy, not 100% literal. It's ok if some of the eyeballs are in fact implemented with automated static analysis.


This bug was in code which very few applications would ever use, and its manifestation is quite specific, so it's no surprise that it was never discovered. I don't doubt that more of these "benign" bugs exist in other exotic pieces of code in the kernel; in light of this, perhaps the saying should be "given enough bugs, all eyeballs are shallow."


I agree. This is a great example of Linus' Law.


[flagged]


Since I have to squint my eyes to read your comment the principle you postulate seems to work.


And clearly no one noticed it in 20 years....


did this guy get a bug bounty?


The consolation prize was an HN front-page link.


It's not a security-sensitive issue.


FWIW, not all bug bounty programs are about security.


Is there any way you can get paid a bounty?



