I think this was well prioritized: they struggled with the issue at times, found a temporary workaround, but when that workaround stopped being efficient and the bug hit them every day, they decided to track down the source. Then they reported it upstream, it was reproduced, someone patched it, and new, fixed kernels were rolled out.
That is a perfect example of how things work, and how they should work. They contributed to the community. I think it was great prioritization.
I'm certain there were lots of other people hitting this bug and killing processes or rebooting to get around it. The troubleshooting and reporting done here silently saved a lot of other people a lot of effort, now and in the future. I don't think they did it to be heroes; they just shared their story, which I'm sure will encourage others to maybe do the same one day.
[Author here] Your comment resonates with me. Indeed, we had been working around the issue for a number of years before we decided to tackle it head-on, and even then we set a deadline for debugging and had a fallback plan: if we didn't manage to figure something out in a couple of days, we would switch transports or ditch rsync altogether.
In the end I believe we struck a good balance between time spent and result achieved: we gathered enough information for someone more familiar with the code to identify and fix the root cause without needing a reproducer. We could have spent more time trying to patch it ourselves (and to be honest I would probably have gone down that route 10 years ago), but that would have been higher risk in terms of both time invested and patch quality.
Finally, I'm always encouraging our teams to contribute upstream whenever possible, for three reasons:
a) minimizing the delta vs upstream pays off the moment you upgrade without having to rebase a bunch of local patches
b) doing a good write-up and getting feedback on a fix/patch/PR from people who are more familiar with the code will help you understand the problem at hand better (and usually makes you a better engineer)
c) everyone gets to benefit from it, the same way we benefit from patches submitted by others
This opinion is a popular one these days (particularly since it complements the demands of business nicely by maximizing personal/company profit), but it is a big part of the reason why so much software these days is so unreliable and buggy. It results in hacks on top of hacks to paper over problems in the lower levels of the abstraction tower that is modern software, and in tons of "WTF" bugs that are just accepted and never fixed.
It's popular because these war stories you find in blog posts are pure survivorship bias.
If I let every fucking team member go on an exploratory bug hunt whenever they felt like it (hint: that would be always), we would never get anything done.
What if they don't find anything? Is this issue really worth 2 weeks of dev time? That's 15k down the drain for a senior engineer, if not more.
From a short-term business perspective, sure, it doesn't make financial sense.
As a user of software, though, I want someone to fix the bug. I want software that doesn't have bugs. So let me repeat my original statement: we need more of this, teams willing to spend engineer time fixing bugs, even upstream bugs in open source projects, instead of prioritizing shoving half-baked features out the door for next week's press release.
[Author here] Even from a short-term business perspective, it actually might make sense to fix things and contribute upstream. When you face a problem with something you built using FOSS, essentially you have three choices:
- Work around it, most likely creating technical debt inside your organization in the process
- Invest the time to fix it yourself
- Pay someone else to fix it for you (e.g. the original authors via a support contract)
None of these options is free, and which one is the most cost-effective depends largely on the complexity of the issue at hand, the skillset and availability of the people involved, and the criticality of the impacted system.
In my experience, the “oh, it fails randomly sometimes” bugs often live in some dull legacy infrastructure component where there is zero attention or celebration for fixing them, so engineers tend to tolerate losing a bit of time to them once a week for years, rather than someone spending half a day to fix it for everyone.
GP’s comment is also odd because the article notes they took your approach: they documented the problem when they first noticed it happening infrequently and moved on to higher priorities. When it started happening every single day, it became mission-critical to investigate.
This _is_ the meaningful stuff. Engineers might have the urge, but most don’t have the opportunity, because they need to focus on the currently fashionable framework.
A good rule of thumb regarding meaningful battles is to ignore everything promoted by companies like Google or Facebook - everything they do is either going to be abandoned in five years, or only makes sense in the context of solving problems nobody else has.
Seems like something an engineer might fix on their own time if they were feeling feisty about the matter. Something tells me that if it went on for 20 years, it was an edge case that only very rarely came up and was mostly a non-issue.
I suspect it was very much an issue; it's just that most companies, like Google, don't care about reliability, only availability, and it might just not show up in their stats.
I absolutely disagree. Most capable engineers I know have this urge to go down rabbit holes and fix any issue; this is nothing special.
Everyone wants to be the hero who finds a bug deep in the stack, makes a glorious pull request, and gets celebrated by the community.
I much more value people who have enough self-control to pick meaningful battles, and follow the right priorities.