Uncovering a 24-year-old bug in the Linux Kernel (2021) (skroutz.gr)
497 points by endorphine on Oct 15, 2022 | 77 comments



Okay so I got to the wrap-up at the end, about "why did nobody else find this", the author sets up some logical dominoes but doesn't knock them down. Allow me to try:

Earlier in the article, the author mentions that they recently upgraded some network hardware, and the problem seemed to become more frequent after that.

Packet loss or other network issues would force the stack to fall out of fast-path and update the counter, avoiding the bug.

Running over ssh would avoid the bug. The only time you'd run rsync not over ssh would be within your own network.

So it sounds like (this is my conjecture here) this would only appear to someone running rsync internally, over a high-performance network with no packet loss, and upgrading the switches might've finally gotten the network good enough to expose the bug?
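
To picture the wraparound part, here is a toy C sketch (not the kernel's code; all names are invented) of how a bookkeeping value that is only refreshed outside the fast path eventually breaks the 32-bit "is this older?" comparison once more than 2^31 bytes go by:

  /* Toy illustration, not kernel code: sequence numbers are 32-bit and
   * compared with a signed-difference trick, so a stale bookkeeping
   * value stops comparing sanely once it falls ~2^31 bytes behind. */
  #include <stdint.h>
  #include <stdio.h>

  /* "a is before b" for 32-bit wrap-around sequence numbers */
  static int seq_before(uint32_t a, uint32_t b)
  {
      return (int32_t)(a - b) < 0;
  }

  int main(void)
  {
      uint32_t last_updated = 1000;   /* only refreshed on the slow path */
      uint32_t live_seq     = 1000;   /* keeps advancing on the fast path */

      for (int i = 0; i < 5; i++) {
          live_seq += UINT32_C(1) << 30;   /* ~1 GiB of data per step */
          printf("gap=%d GiB  before(stale, live)=%d\n",
                 i + 1, seq_before(last_updated, live_seq));
      }
      /* Once the gap exceeds 2^31 the answer flips and the stale value
       * can look "newer" than the live sequence number. */
      return 0;
  }

A 2^31-byte gap is only ~2 GB, i.e. a few tens of seconds of sustained gigabit traffic, which fits the "good network, no packet loss" picture above.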


That sounds plausible. But also, most software (browsers, web service SDKs, RPC frameworks) treat TCP connections as fallible by setting read/write timeouts and aggressively reopening broken connections. So, I’m totally not surprised this issue went unnoticed for this many years.


One might expect HPN (high-performance networking) users to have hit this, but if they are storage I/O bound rather than CPU or network bound, then probably not.


As someone who thrives on tracking down rare but annoying bugs in a debugger, I love stories like this. It is not just bugs that cause real failures that can be headaches, but also bugs that just slow things down unexpectedly. They can sometimes go undetected for decades, like this one.

I wrote an article this past year that talks about silent bugs that slowly eat resources and collectively can be very expensive in terms of wasted time and energy: https://didgets.substack.com/p/finding-and-fixing-a-billion-...


> but also bugs that just slow things down unexpectedly. They can sometimes go undetected for decades, like this one.

Reminds me of the GTA Online quadratic time JSON parsing bug
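
For anyone who missed that one: the public write-up traced it (among other things) to repeatedly calling sscanf() over one huge JSON buffer, and in at least some C libraries each sscanf() call scans the whole remaining string. A minimal sketch of that pattern (not the game's actual code):

  /* Sketch of the quadratic pattern: parsing one big buffer with
   * repeated sscanf(). If the C library does a strlen() of the input
   * on every call, total work grows with (tokens x remaining length). */
  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
      size_t n = 1u << 16;              /* raise this to feel the blow-up */
      char *buf = malloc(2 * n + 1);
      if (!buf)
          return 1;
      for (size_t i = 0; i < n; i++) {  /* "1 1 1 ..." stand-in for JSON */
          buf[2 * i]     = '1';
          buf[2 * i + 1] = ' ';
      }
      buf[2 * n] = '\0';

      long sum = 0;
      int val, used;
      const char *p = buf;
      while (sscanf(p, "%d%n", &val, &used) == 1) {   /* O(n^2) overall */
          sum += val;
          p += used;
      }
      printf("parsed %ld tokens\n", sum);
      free(buf);
      return 0;
  }

IIRC the community patch that made the rounds simply cached the string length and deduplicated entries with a hash map, and load times dropped dramatically.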


Okay but where's the bug story? Did I miss the story?


I wrote the article right after I fixed a huge inefficiency problem in a function within my own project. I neglected to give the specifics in the article, but here they are since you asked.

My Didgets tool lets you create pivot tables against relational database tables, even very large ones. For the pivot values, you can choose to just count the occurrences of each value or, if it is a number type, you can add the values up. You can also add up the values in a separate number column. Here is a quick demo video: https://www.youtube.com/watch?v=2ScBd-71OLQ

When adding up numbers in a separate column, I had just a few lines of unnecessary code that ended up being called exponentially. For smaller tables it was barely noticeable, but for tables with 30 million+ rows it really bogged down.

A simple fix to the affected lines caused a certain test against a large table to go from over 10 minutes down to under 20 seconds. The effects of just a few lines of code when applied to a big enough data set can really impact performance. It is the old Einstein equation E=mc² in effect, which is discussed here: https://didgets.substack.com/p/musings-from-an-old-programme...
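
I can only guess how familiar this pattern is to others, so here is a hypothetical sketch (invented names, the general failure mode rather than the actual code): a few "harmless" lines inside the per-row loop redo work proportional to the rows already processed, which is invisible on small tables and brutal at 30M+ rows.

  /* Hypothetical sketch of the failure mode, not the actual code: the
   * "careless" sum redoes O(rows-so-far) work on every row, so cost is
   * O(n^2); accumulating once instead gives O(n). */
  #include <stdio.h>
  #include <stdlib.h>

  static double sum_careless(const double *col, size_t nrows)
  {
      double total = 0.0;
      for (size_t i = 0; i < nrows; i++) {
          total = 0.0;                  /* the few unnecessary lines: */
          for (size_t j = 0; j <= i; j++)
              total += col[j];          /* recompute from scratch each row */
      }
      return total;
  }

  static double sum_linear(const double *col, size_t nrows)
  {
      double total = 0.0;
      for (size_t i = 0; i < nrows; i++)
          total += col[i];              /* just keep a running total */
      return total;
  }

  int main(void)
  {
      size_t n = 50000;                 /* try 30 million on the fast one */
      double *col = malloc(n * sizeof *col);
      if (!col)
          return 1;
      for (size_t i = 0; i < n; i++)
          col[i] = 1.0;
      printf("careless: %.0f\n", sum_careless(col, n));
      printf("linear:   %.0f\n", sum_linear(col, n));
      free(col);
      return 0;
  }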


I guess there is a lost art of writing for optimal code/memory/execution time, especially as our resources increase.

I think the idea here is to quickly write code that's inefficient, and rewrite it to be efficient if the performance is required down the line. For companies where there are bigger fish to fry, e.g. customer acquisition, it's more useful to pump out more features (even at the expense of bugs) because that draws customers.

But in places where performance is important, you do see developers squeeze out more cycles/memory, e.g. kernel/OS development, database servers, video games. It's just that most developers aren't in those areas of specialty anymore.

Btw, have you heard of https://handmade.network/ and https://en.wikipedia.org/wiki/Demoscene ? Wondering what your thoughts are in those areas. There are probably more communities like the ones I mentioned, where developers are interested in writing the kind of code that you are talking about.


> As someone who thrives on tracking down rare but annoying bugs in a debugger,

As someone who is cursed to inevitably find some obscure bug the second I start using a piece of software, I'm happy I'm not the only one.

> I wrote an article this past year that talks about silent bugs that slowly eat resources and collectively can be very expensive in terms of wasted time and energy

"Using JS for backend is ecoterrorism" lmao


Even for the frontend, repurposing on the frontend stuff that used to be done on the backend can be ecoterrorism squared.

For instance, a small team of 40 people saw no issue with sending 4 MB of JSON English-to-Chinese string localisations to each website visitor for Angular to translate. With 1 million visitors a month in Hong Kong alone, that's 4 million MB plus a few seconds of mapping per user per month, completely pointless...


I love when you're using open source software and can find the bug yourself, even if it's deep down the stack.

Imagine if this bug were somewhere in closed source software. You'd have to reach out to the software's customer support team. Every time I reach out to customer support I expect to have an unpleasant experience. It is rarely otherwise.


And even if you did reach out to customer support, it would rarely ever get dev attention unless most people have the issue. Even in that case, it sometimes still gets a fat wontfix, like the famous OneDrive file corruption bug.


Raising this bug in Windows (how? Microsoft sells support, barely, but you can't talk to the IPv4 stack dev anyway) would get you laughed out of the chat room, because it can't possibly be the IP stack's fault.


IDK; I found bugs in Oracle database software in the early 2000s, contacted the (corporate) tech support, and got the bugs confirmed and fixed in subsequent releases.


Kinda why I'm not a fan of cloud, same black box problem.


OTOH if you find a bug, the cloud provider likely has more clout to demand a fix from the vendor, if the software is not open source. Possibly the bug affects the cloud provider's bottom line.

But if the bug is obscure and has little impact, bad luck!


Awesome breakdown - as someone who is fairly familiar with TCP theoretically but not with the details of the TCP implementation in the Linux kernel, this was just the right balance of detail. Great technical writing IMO!


This was a cool example of a class of bugs that are both hard to find with no active example, and hard to prevent in complex systems. The optimization added many years ago for performance skipped updating a value, and in a very small number of circumstances there was a use case that could not tolerate it going stale.

It is an interesting thought experiment to consider what kind of tool or automated detection could have found this. Some type of dependency linking between variables might have shed some light, but I'm not sure that would have really highlighted this kind of issue.

Great description of both the bug and the path to the solution!


Probably the only way to prevent this type of issue in an automated fashion is to change your perspective from proving that a bug exists to proving that it doesn't exist. That is, you define some properties that your program must satisfy to be considered correct. Then, when you make optimizations such as the bulk receiver fast path, you must prove (to the static analysis tool) that your optimizations do not break any of the required properties. You also need to properly specify the required properties in a way that makes them actually useful for what people want the code to do.
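
It's nowhere near a proof, but the flavor of "state the property, then check the optimization against it" can be shown even in plain C: the property is that the optimized path must agree with the obviously-correct slow path, which a verifier would then establish for all inputs. Everything below is an invented example, and the exhaustive loop only samples a 16-bit slice of the domain:

  /* Invented example: the "property" is that an optimized function must
   * agree with a simple reference implementation. A formal tool would
   * prove this for every input; the loop below merely spot-checks. */
  #include <assert.h>
  #include <stdint.h>
  #include <stdio.h>

  static uint32_t popcount_slow(uint32_t x)   /* obviously correct */
  {
      uint32_t n = 0;
      for (int i = 0; i < 32; i++)
          n += (x >> i) & 1u;
      return n;
  }

  static uint32_t popcount_fast(uint32_t x)   /* the "optimization" */
  {
      x = x - ((x >> 1) & 0x55555555u);
      x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
      x = (x + (x >> 4)) & 0x0F0F0F0Fu;
      return (x * 0x01010101u) >> 24;
  }

  int main(void)
  {
      for (uint32_t x = 0; x <= 0xFFFFu; x++)
          assert(popcount_fast(x) == popcount_slow(x));
      puts("property holds on the sampled domain");
      return 0;
  }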

All of this is incredibly difficult, and an open area of research. Probably the biggest example of this approach is the seL4 microkernel. To put the difficulty in perspective, I checked out some of the seL4 repositories and did a quick line count.

The repository for the microkernel itself [0] has 276,541 lines.

The test suite [1] has 26,397 lines.

The formal verification repo [2] has 1,583,410 lines, over 5 times as much as the source code.

That is not to say that formal verification takes 5x the work. You also have to write your source code in such a way that it is amenable to being formally verified, which makes it more difficult to write, and limits what you can reasonably do.

Having said that, this approach can be done in a less severe way. For instance, type systems are essentially a simple form of formal verification. There are entire classes of bugs that are simply impossible in a properly typed program, and more advanced type systems can eliminate a larger class of bugs. Although, to get the full benefit, you still need to go out of your way to encode some invariant into the type system (a small sketch of that idea follows the links below). You also find that mainstream languages that try to go in this direction always contain some sort of escape hatch to let the programmer assert a portion of code is correct without needing to convince the verifier.

[0] https://github.com/seL4/seL4

[1] https://github.com/seL4/sel4test

[2] https://github.com/seL4/l4v
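
As promised above, the type-system version of the same idea, sketched in C for lack of a better common denominator (all names invented): make the invariant part of the type by only handing out values through a checked constructor, so downstream code never has to re-validate. Languages with richer type systems let you do this without the escape hatches C leaves wide open.

  /* Invented illustration: a NonEmptyString can only be obtained via the
   * checked constructor, so first_char() never has to re-check for
   * NULL/empty - the invariant is established exactly once. */
  #include <stdio.h>

  typedef struct {
      const char *s;   /* invariant: s != NULL && *s != '\0' */
  } NonEmptyString;

  static int non_empty_string_make(const char *s, NonEmptyString *out)
  {
      if (s == NULL || *s == '\0')
          return 0;                      /* refuse to construct */
      out->s = s;
      return 1;
  }

  static char first_char(NonEmptyString str)
  {
      return str.s[0];                   /* safe by construction */
  }

  int main(void)
  {
      NonEmptyString name;
      if (non_empty_string_make("kernel", &name))
          printf("first char: %c\n", first_char(name));
      return 0;
  }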


> That is not to say that formal verification takes 5x the work. You also have to write your source code in such a way that it is amenable to being formally verified, which makes it more difficult to write, and limits what you can reasonably do.

You also have to hire significantly more skilled people. Put formal verification in the job requirements and the pool of candidates will shrink massively.

Explains why it is so rare really. "Spend 5-10x on developers to have some bugs not happen" is not a great sell.


It's a great question! Thinking back...

At the time this bug was introduced it would probably have been cost-prohibitive to create a test case. We were proud of 100 Mbit networks, had flaky NICs the vendors didn't help maintain much of the time (and which were often broken in hardware), the filesystem max file size was something like 2 TB, and most drives were in the handful of GBs. Conceiving of testing for something like this would have been expensive. And none of the big system vendors took Linux seriously then.

Though perhaps flooding zeros across a TCP socket could work, I really think that a kernel hacker would have found a lot of other hardware and driver issues before ever being able to trigger this.


> perhaps flooding zeros across a TCP socket could work

Unconstrained zero sending is too fast; you would tend to flood the connection, causing packet loss and breaking you out of the fast path long before the counter loops around. You would need to avoid saturating the network while still causing the recipient to fall behind.
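
Something like the following is the kind of pacing that would be needed. It's a sketch only: it assumes fd is an already-connected TCP socket, and whether any given rate actually keeps a particular receiver in the fast path long enough for the counter to wrap depends on the hardware, the kernel, and the receiver's behaviour.

  /* Sketch: push ~64 MB/s of zeros so the link never saturates (no loss,
   * fast path stays engaged) while still moving the multiple gigabytes
   * needed for a 32-bit counter to wrap. Minimal error handling. */
  #include <stdint.h>
  #include <string.h>
  #include <sys/types.h>
  #include <unistd.h>

  static int send_paced_zeros(int fd, uint64_t total_bytes)
  {
      char chunk[64 * 1024];
      memset(chunk, 0, sizeof chunk);

      for (uint64_t sent = 0; sent < total_bytes; ) {
          ssize_t n = write(fd, chunk, sizeof chunk);
          if (n <= 0)
              return -1;                 /* peer gone or error */
          sent += (uint64_t)n;
          usleep(1000);                  /* 64 KiB per ms ~= 64 MB/s cap */
      }
      return 0;
  }

Even then, as the parent says, you'd still need the receiver to fall behind.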


Love deep dive troubleshooting like this. I haven't heard of systemtap before; looks nice. When I had to troubleshoot a kernel bug [1] I used perf [2] probes which are also really nice for this kind of debugging.

[1] https://www.spinics.net/lists/xdp-newbies/msg01231.html

[2] https://www.brendangregg.com/perf.html


I remember when this was originally posted, but I voted it up again because I think it's such an excellent story, and excellent programming. We need more people and companies like this, who are willing to go beyond "oh it fails randomly sometimes" and track down the underlying issues.

Previous discussion: https://news.ycombinator.com/item?id=26102241 (497 points, 41 comments)


For the record, this is one of the top Greek employers. This is Greece's Amazon, essentially. The C-level team has been intact since day one and AFAIK is still writing (some) code.

It is not unheard of to have 4-day weeks and a developer-first mindset at that place.


> We need more people and companies like this, who are willing to go beyond "oh it fails randomly sometimes" and track down the underlying issues.

I absolutely disagree. Most capable engineers I know have this urge to go down rabbit holes and fix any issue, this is nothing special.

Everyone wants to be the hero that found a bug deep in the stack, make a glorious pull request, and be celebrated in the community.

I much more value people who have enough self-control to pick meaningful battles, and follow the right priorities.


I think this was well prioritized; they struggled with the issue at times, found a temporary workaround, but when that workaround stopped being effective and the bug hit them every day, they decided to track down the source. Then they reported it upstream, it was reproduced, someone patched it, and new, fixed kernels were rolled out.

That is a perfect example of how things work and should work. They contributed to the community. I think it was a great prioritization.

I'm certain there were lots of other people hitting this bug and killing processes or rebooting to get around it. The troubleshooting and reporting done here silently saved a lot of other people a lot of effort, now and in the future. I don't think they were after it to be heroes; they just shared their story, which I'm sure will encourage others to maybe do the same one day.


[Author here] Your comment resonates with me. Indeed, we had been working around the issue for a number of years before we decided to tackle it head-on and even then we set a deadline for debugging and had a fallback plan as well: if we didn't manage to figure something out in a couple of days, we would switch transports or ditch rsync altogether.

In the end I believe we struck a good balance between time spent and result achieved: we gathered enough information for someone more familiar with the code to identify and fix the root cause without the need for a reproducer. We could have spent more time trying to patch it ourselves (and to be honest I would probably have gone down that route 10 years ago), but it would be higher risk in terms of both time invested and patch quality.

Finally, I'm always encouraging our teams to contribute upstream whenever possible, for three reasons:

a) minimizing the delta vs upstream pays off the moment you upgrade without having to rebase a bunch of local patches

b) doing a good write-up and getting feedback on a fix/patch/PR from people who are more familiar with the code will help you understand the problem at hand better (and usually makes you a better engineer)

c) everyone gets to benefit from it, the same way we benefit from patches submitted by others


This opinion is a popular one these days (particularly since it complements the demands of business nicely by maximizing personal/company profit), but it is a big part of the reason why the majority of software these days is so unreliable and buggy. It results in hacks on top of hacks to paper over problems in the lower levels of the abstraction tower that is modern software, and it results in tons of "WTF" bugs that are just accepted and never fixed.


It's popular because these war stories you find in blog posts are pure survivorship bias.

If I let every fucking team member go on an exploratory bug hunt whenever they felt like it (hint: that would be always), we would never get anything done.

What if they don't find anything? Is this issue really worth 2 weeks of dev time? That's 15k down the drain for a senior engineer, if not more.


From a short-term business perspective, sure, it doesn't make financial sense.

As a user of software, though, I want someone to fix the bug. I want software that doesn't have bugs. So let me repeat my original statement: we need more people and companies like this, who are willing to spend engineer time fixing bugs, even upstream bugs in open source projects, instead of prioritizing shoving half-baked features out the door for next week's press release.


[Author here] Even from a short-term business perspective, it actually might make sense to fix things and contribute upstream. When you face a problem with something you built using FOSS, essentially you have three choices:

- Work around it, most likely creating technical debt inside your organization in the process

- Invest the time to fix it yourself

- Pay someone else to fix it for you (e.g. the original authors via a support contract)

None of these options is free, and which one is the most cost-effective depends largely on the complexity of the issue at hand, the skillset and availability of the people involved, and the criticality of the impacted system.


I wish, brother. I wish it was more like that...


When you need working software at the end of the sprint, there's often no time to carefully control for all edge cases.


At the company level, it is indeed more expensive to fix upstream rather than work around it, but on a macro scale it is much more beneficial.

In my opinion, fixing upstream whenever possible, even if it is not the best short-term solution, should be considered the price to pay for using OSS.


In my experience, the “oh it fails randomly sometimes” bugs are often in some random dull legacy infrastructure component where there is zero attention or celebration for fixing them, and so engineers tend to tolerate losing a bit of time once a week due to them for years rather than someone spending half a day to fix it for everyone.


GP’s comment is also odd because the article notes they took your approach. They documented the problem when they first noticed it happening infrequently and moved on to higher priorities. When it started happening every single day it became mission critical to investigate.


Eh, right, many bugs we have don't really matter.

Oh what is that you say, security vulnerabilities are also just bugs that get exploited? Oh well...


Exactly. I could fix any complex bug. I just choose not to.


This _is_ the meaningful stuff. Engineers might have the urge, but most don’t have the opportunity, because they need to focus on the currently fashionable framework.

A good rule of thumb regarding meaningful battles is to ignore everything promoted by companies like Google or Facebook - everything they do is either going to be abandoned in five years, or makes sense only in the context of solving problems nobody else has.


Seems like something an engineer might fix on their own time if they were feeling feisty about the matter. Something tells me that if it went on for 20 years, it was an edge case that only very rarely came up and was mostly a non-issue.


I suspect it was definitely an issue, it’s just that most companies like Google don’t care about reliability, only availability, and it might just not show up in their stats.


Could someone provide link(s) on how regular snapshots of databases can be taken like this? (Googling didn't help much, maybe I'm googling for the wrong keywords.) For me, backing up the database is a few-hour-long process. Restoring it for a developer again is a few hours process. I read about snapshots before but haven't realized they could be this effective.


Because it isn't a backup. They put the database into a quiescent state on disk, take a file system snapshot, let the dbms resume working, and send the snapshot data via rsync.

This requires the cooperation of the dbms software to get the on-disk data quiesced. Then your snapshot has to go fast enough that the dbms doesn't end up with too many spinning plates before you let it start writing normally.


Got it. Thank you!


For MariaDB:

0) make sure the database data volume is on LVM or ZFS

In a SQL prompt:

  1) BACKUP STAGE START; BACKUP STAGE BLOCK_COMMIT;
  2) \! the shell command to take the snapshot
  3) BACKUP STAGE END;
You can now mount your snapshot, copy it offsite, and delete it. The restore procedure is left as an exercise!


Very helpful. Thank you!


Can't most CoW filesystems like BTRFS or ZFS take a snapshot at a point in time instantly?



It’s the lack of clarity on how they manage access control for what should be regulated data that surprises me, more than the technology achievement.


Article says data is anonymized before dev use. Pretty standard practice.


I've frequently run into problems with Postgres streaming replication that look exactly like the issues encountered here. I was never able to find the source of the issue, so I'm very curious whether this fix will also fix the issues I encountered.


Please keep us updated.


> These snapshots are updated daily through a pipeline that involves taking an LVM snapshot of production data, anonymizing the dataset by stripping all personal data, and transferring it via rsync to the development database servers.

I don’t know what sort of data these people process, but most datasets about people are not anonymized by simply removing the PII.


Yes they are. Any information that can be used to identify a person by definition is PII.

Once all the PII is removed, by definition the dataset is anonymized.


This is obviously true, as you are stating an axiom. But what I think the grandparent is trying to say is that databases stripped of PII can often be deanonymized by looking at the other data that isn't obviously PII.

Take, for example, a database of all mobile phone positions over time. This can be "anonymized" by removing all connections from the phones to information on who owns the phones.

But it can still be trivially deanonymized by analyzing where the phones are at night and during office hours: not many people both work in the same building and sleep in the same house.
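
A hypothetical sketch of that analysis (toy data, invented names): bucket each device's sightings into night vs. office hours, take the most frequent cell in each bucket as "home" and "work", and the resulting pair is close to a fingerprint you can join against other data.

  /* Toy re-identification sketch on "anonymized" location records: no
   * names, just device ids, yet the (home, work) cell pair falls out. */
  #include <stdio.h>

  #define DEVICES 3
  #define CELLS   16

  struct sighting { int device; int hour; int cell; };

  static const struct sighting sightings[] = {
      {0, 2, 5}, {0, 3, 5}, {0, 14, 9}, {0, 15, 9},
      {1, 1, 7}, {1, 4, 7}, {1, 13, 9}, {1, 16, 9},
      {2, 2, 5}, {2, 23, 5}, {2, 11, 12}, {2, 15, 12},
  };

  static int most_frequent(const int *counts)
  {
      int best = 0;
      for (int c = 1; c < CELLS; c++)
          if (counts[c] > counts[best])
              best = c;
      return best;
  }

  int main(void)
  {
      for (int d = 0; d < DEVICES; d++) {
          int night[CELLS] = {0}, day[CELLS] = {0};
          for (size_t i = 0; i < sizeof sightings / sizeof sightings[0]; i++) {
              if (sightings[i].device != d)
                  continue;
              if (sightings[i].hour >= 22 || sightings[i].hour <= 6)
                  night[sightings[i].cell]++;      /* likely home */
              else
                  day[sightings[i].cell]++;        /* likely work */
          }
          /* The (home, work) pair is nearly unique per person, so joining
           * it against any address/workplace dataset re-identifies d. */
          printf("device %d: home=cell %d, work=cell %d\n",
                 d, most_frequent(night), most_frequent(day));
      }
      return 0;
  }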


This is a good case for formal verification.


I struggle because I want to upvote these comments, because that's the world I want to live in. But the opposite side of that coin is who is going to author the incredibly arcane specification of TCP against which any such implementation is formally verified?

Maybe TCP stacks are one of the few cases where that makes sense, but I'd suspect if it was "worth the cost" it would have already been done.


There are certain guarantees you want such a formal specification to give, like for example not getting permanently stuck in some state as with the present bug. You can formalize the proofs for those guarantees and have their correctness machine-checked. Something like TLA+/PlusCal is likely suitable for that.

A formal specification is less ambiguous than a prose specification. Formalizing the TCP specification will, if anything, expose aspects where the specification is unclear, or corner cases where the specification actually leads to unwanted behavior and doesn’t provide the desired guarantees.

So, while you can't prove that the formal specification matches the prose specification 100%, you can prove that it provides all the guarantees the original prose specification was aiming for (once you've formalized those desired guarantees), which is something you can't do for the prose specification.


It seems to me that we should design for (formal) verification. Much like we should design for testability. But am noob, so what do I know?

That said, my quick search shows some academic efforts to formally verify QUIC, both in whole and in parts.

I would hope that bespoke (boutique?) TCP replacements, like Homa (specifically for datacenters), are verified as part of the design process. From a quick scan, I gleaned that Homa, and other aspirants, are simulated, compared, and benchmarked against each other. Maybe that's sufficient.

https://homa-transport.atlassian.net/wiki/spaces/HOMA/overvi...


It's worth remembering that before Linux (and to some extent Unix) we only had proprietary operating systems from hardware companies - who probably wouldn't have had the resources to find and fix a bug like this, and often customers wouldn't have had access to source to fix it themselves.


> It's worth remembering that before Linux (and to some extent Unix) we only had proprietary operating systems from hardware companies - who probably wouldn't have had the resources to find and fix a bug like this, and often customers wouldn't have had access to source to fix it themselves

UC Berkeley was a hardware company? TIL.

Companies like Sun had thousands of engineers working on their operating systems, and they very much could and did find and fix obscure bugs, both on their own as well as based on customer bug reports. Some customers did have access to source code -- I know because I've seen that myself. And customers that didn't have access to source code could still do clever things to diagnose problems. In this case, for example, a packet trace should be enough to diagnose the nature of the bug, though one would indeed need source code to write a fix.

Of course it's much better for customers facing obscure bugs to have access to the source code.


I had never heard about systemtap, but it sounds magical, and extremely useful. I would actually need this in my day job, but for a regular but huge embedded C++ application.

I have no idea how to "hot-patch" a C++ application though, are there libraries for this?


How odd to see a write-up from the skroutz.gr blog on the front page of HN...


Also these!

Speeding Up Our Build Pipelines - https://news.ycombinator.com/item?id=20775297 - Aug 2019 (24 comments)

The infrastructure behind one of the most popular sites in Greece - https://news.ycombinator.com/item?id=9982361 - July 2015 (5 comments)

Working with the ELK stack - https://news.ycombinator.com/item?id=9008119 - Feb 2015 (35 comments)


They're one of the largest employers of web programmers in Greece, though, right?


Yeap, it's a bit strange, but the post was very well written, with a nice breakdown and easily understandable steps that can be followed by most software engineers.

There have been some sporadic posts from Skroutz in the past, but nothing that gained so much attention.

For those that don't know it, Skroutz is the biggest Greek online price aggregator/e-commerce market/price comparison site.


Which kernel version has this patch?


It says it was put into 5.10-rc1; however, I noticed weird network issues beginning in June or July. I am wondering if it was put into the kernel for 5.10 and then left "off by default" until this year.

5.15.32 was around the kernel where I noticed the issues start. If I'm just connected via ssh and streaming video from a LAN server, everything is great. If I go on youtube.com (or whatever), I'll get "network unreachable" on ping within a minute. I swapped NICs to make sure my NIC wasn't the issue; now YouTube doesn't cause this issue, but I tested rsync, oddly enough, and the NIC goes AWOL after a few gigabytes of transfer. I have to physically unplug and replug the NIC (or reboot if it was PCI).

I haven't had time to track down why, but it has stayed with newer kernels, too: 5.15.41 and 5.15.59 also have this issue. I compiled 5.15.72 last night but I haven't rebooted yet.


The fix (commit 18ded910b58) indeed went into 5.10-rc1, and was effective from day one (there is no provision for it to be turned off), so it's unlikely to have caused trouble when upgrading from 5.10 to 5.15. That said, there are a ton of changes in the networking stack between 5.10 and 5.15 - more than 10k commits for the stack and the drivers - any of which might introduce some breakage.


>I am wondering if it was put into the kernel for 5.10 and then left "off by default" until this year.

Any idea how we might check that?

>I haven't had time to track down why, but it has stayed with newer kernels, too: 5.15.41 and 5.15.59 also have this issue. I compiled 5.15.72 last night but I haven't rebooted yet.

I'm fairly certain we've seen it on `Linux 5.15.0-1017-aws x86_64`.


I could, I guess, try to load an old kernel. I tend to clean out /boot, and if the package manager deletes my kernel source all I have is a .config. I would lean toward .config diffs, but it is possible that my distribution's kernel maintainers made specific patches that made this networking stuff "optional", just as Spectre, Rowhammer, and other speculative-execution mitigations are "optional". I just know my torment with networking started about 5 months ago at the closest. I didn't upgrade kernels between February and May 31st of 2022, so if it happened in there, I wouldn't be able to track that down without some sort of distro-specific archive of kernel releases.


How is it possible for a TCP bug that leads to stuck connections to go unnoticed for 24 years?

It's because the fools responsible never rewrite their code, use a broken language, and don't even try to prove half of the broken garbage they write. Then, when it turns out to have been broken for decades, they chuckle and shove another finger into another crack, never understanding how they misuse computers.


This is not an unsafe language failure! The same logic, ported to Python, would exhibit the same error.


This is a bad language failure, however. The same logic, ported to Brainfuck, would exhibit the same stupidity.

The C language makes it unreasonably difficult to write anything, even before proving it to be correct.


"This setup has worked rather well for the better part of a decade and has managed to scale from 15 developers to 150"

LOL


Could you please stop creating accounts for every few comments you post? We ban accounts that do that. This is in the site guidelines: https://news.ycombinator.com/newsguidelines.html.

You needn't use your real name, of course, but for HN to be a community, users need some identity for other users to relate to. Otherwise we may as well have no usernames and no community, and that would be a different kind of forum. https://hn.algolia.com/?sort=byDate&dateRange=all&type=comme...

Also, could you please stop posting unsubstantive and/or snarky and/or flamebait comments? It's not what this site is for, and it destroys what it is for. If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.



