Besides being a great technical write-up, this does an absolutely fantastic job of doing low-key recruitment for Skroutz. It shows some of the main tools and technologies that the company uses on a day-to-day basis, provides a window into the way that they approach problems, makes a compelling case that you'd be working with some talented engineers, and showcases a culture willing to engage with the open source community.
The hiring pitch isn't in your face, but there's a "We're hiring!" button in the banner, which fairly unobtrusively follows you down the page, and then ends with a "hey, if you're interested in working with us, reach out." Overall, it just feels really well done.
Great write-up. Think I'll get the kids, sorry, technicians, to walk through this. Actually, I think I'll learn just as much, but I have to keep a little bit aloof as MD!
Networks are tricky to run and networking is proper hard to do. TCP/UDP et al. are pretty bloody good at shuffling data from A to B. I find it quite amusing when 20 years is considered old for a bug.
The Millennium Bridge in London is a classic example of forgetting the basics - in this case resonance - and of being too clever for your own good. It's a rather cool design for a bridge - a sort of suspension bridge, but flatter and with some funky longitudinal stuff. I'm a Civ Eng grad. It looked too flat to me from day one.
When people walk across a bridge and it starts to sway, they start to fall into lock step, and then resonance, where each step reinforces the last, kicks in; more and more energy feeds the sway, shear and what-have-you forces. It gets worse and worse, and then failure. Tacoma Narrows is another classic example of resonance, but due to wind - that one informed later designs so they don't fly!
Civ Eng is way, way older than IT and we are still learning. 24 years is nothing for a bug. However, IT is capable of looking inward and monitoring itself (unit tests, ping etc) in a way that Civ Eng can't (OK we have strain gauges and a few other tools).
The real difference between physical stuff and IT is that the Milli bridge rather obviously came close to failure, visually and in a way our other senses could perceive - it shook. The fix was to put hydraulic dampers along its length.
In IT, we often try to fix things by using magic or papering over flaws with "just so" stories. Sometimes we get the tools out and do the job properly and these boys and girls did just that: the job properly.
> When people walk across a bridge and it starts to sway, they start to fall into lock step, and then resonance, where each step reinforces the last, kicks in; more and more energy feeds the sway, shear and what-have-you forces. It gets worse and worse, and then failure. Tacoma Narrows is another classic example of resonance, but due to wind - that one informed later designs so they don't fly!
This anecdote reminds me of a story from ancient Rome (I don't know if it's actual history or a myth).
Apparently, when Roman military engineers built a bridge, they were forced to stand beneath it while the rest of the cohort marched across it to test its strength.
Marching gives exactly this same resonance effect.
Your anecdote reminds me of this quote about DuPont's safety program.
"My company has had a safety program for 150 years. The program was instituted as a result of a French law requiring an explosives manufacturer to live on the premises with his family." - Crawford Greenewalt
I think it's common military training in a lot of places. I'm Italian; my dad did his service in the Engineering Corps (Genio), and he told me the same story. No lock step while crossing bridges.
I have, admittedly old and very vague, memories of people talking about rsync being "hard on networks" or "dealing poorly with congestion." I'd put good odds that this bug is why those statements existed.
This seems to be the opposite. You only see it when transferring titanic amounts of data over a pristine connection. If your network had congestion you wouldn't trigger this bug.
But this also explains a bit why rsync is "hard on networks". Most bulk data transfers end up with breaks in the data that give more breathing room to other protocols. Not rsync: it tries as hard as it can to keep the pipe full 100% of the time, making it hard for other TCP slow-starters to get a foothold.
The uTP protocol (which has an IETF version called LEDBAT) was specifically designed to be used for background transmission, and it should slow itself down if there are competing TCP flows.
One big fat TCP connection isn't so bad on networks. Especially the default behavior of slowly creeping up in speed until it loses a packet, then dropping back down.
As I understand it, a significant factor in getting this bug to happen is that you're sending tons of data, but in a way that's limited by the source.
Is there any truth to this? I find it hard to believe -- most of the time rsync is tunnelled over ssh, which seems well enough abstracted from any optimal traffic-generation mechanism that I would seriously doubt it's able to outcompete other programs for network resources in a meaningful way. Perhaps this observation evolved because a lot of networks have traffic-shaping rules for ssh? The unfortunate effects of traffic shaping for ssh + a low-bandwidth connection + rsyncs happening over ssh + an administrator logged in over ssh via the same low-bandwidth link could maybe produce this observed (but nonsensical?) correlation.
The bug requires transferring over 2GB of data without reading from the socket, so it's unlikely; also, a hang is the opposite of being hard on the network. ;) However, the uncommon characteristics of rsync traffic are probably why some congestion control algorithms may not deal well with rsync.
I just want to send kudos to them. I lost 2 years trying to write a reliable stream over UDP back in the days of Zoidcom and similar, maybe 2005. I don't know how to stress this enough but...it's basically an impossible challenge.
This writeup represents the depths that an engineer has to go to get real work done. I'm familiar with the integer wraparound comparison issue, and all of the other errata around TCP windowing. Thankfully countless people have done this work and we're able to enjoy the fruits of their labor today.
Not sure where I'm going with this, but I've been programming for 30 years, and to this day, I view kernel developers and the people who isolate these bugs as the very best among us.
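(Not the code involved in this particular bug, but for anyone curious what wraparound-safe comparison looks like, here's a tiny standalone sketch in the spirit of the before()/after() helpers used for TCP sequence numbers; the values are made up for illustration.)

```c
/* Sketch of wraparound-safe sequence comparison (illustration only,
 * not the code involved in this particular bug). */
#include <stdint.h>
#include <stdio.h>

/* "seq1 comes before seq2", even when the 32-bit counter has wrapped. */
static int seq_before(uint32_t seq1, uint32_t seq2)
{
    return (int32_t)(seq1 - seq2) < 0;
}

int main(void)
{
    uint32_t a = 0xfffffff0u;   /* just before the 32-bit counter wraps */
    uint32_t b = a + 0x100;     /* 256 bytes later, already wrapped around */

    printf("naive a < b    : %d\n", a < b);            /* 0 -- wrong */
    printf("seq_before(a,b): %d\n", seq_before(a, b)); /* 1 -- still right */
    return 0;
}
```

The signed cast of the 32-bit difference is what keeps the ordering correct across the wrap; a plain unsigned comparison silently gets it backwards right at the boundary.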
Great writeup, and also thoroughly answers the first question that popped into my mind: "how on earth could a bug in the Linux network stack that causes the whole data transfer to get stuck stay undiscovered for so long?"
"Most applications will care about network timeouts and will either fail or reconnect, making it appear as a “random network glitch” and leaving no trace to debug behind."
I have seen an ancient "drop packets with a zero-length TCP window" rule in iptables at my company. Funny to learn that a zero-length TCP window can be found in normal, non-malicious packets!
The number of firewall vendors who drop this kind of PDU by default is astounding.
I once spent a week troubleshooting a firewall at a customer's site that had a similar issue with zero-length TCP window PDUs.
The firewalls the customer used also didn't allow changing this behaviour.
Luckily they were able to work around it in their software, but still, this kind of thing should be configurable in a networking product.
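To make the point concrete, here's a toy sketch (my own, not the article's reproduction) that produces a zero window in completely benign traffic: write into a local TCP connection whose other end never reads. Once the receiver's socket buffer fills, it advertises a zero window, which you can watch with ss -tmi or tcpdump while the program sleeps. Error handling is omitted for brevity.

```c
/* Toy demo: a zero window produced by perfectly normal behaviour,
 * i.e. a peer that simply isn't reading from its socket right now. */
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* Listener on an ephemeral loopback port. */
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET };
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 1);
    socklen_t len = sizeof(addr);
    getsockname(lfd, (struct sockaddr *)&addr, &len);

    /* One side sends; the other side accepts but never reads. */
    int sender = socket(AF_INET, SOCK_STREAM, 0);
    connect(sender, (struct sockaddr *)&addr, sizeof(addr));
    int receiver = accept(lfd, NULL, NULL);
    (void)receiver;

    fcntl(sender, F_SETFL, O_NONBLOCK);
    char buf[65536];
    memset(buf, 'x', sizeof(buf));
    long total = 0;
    for (;;) {
        ssize_t n = write(sender, buf, sizeof(buf));
        if (n < 0) {
            /* Send and receive buffers are full; the receiver is now
             * advertising a zero window.  Inspect it with `ss -tmi`. */
            printf("stalled after %ld bytes; check ss/tcpdump now\n", total);
            sleep(60);
            return 0;
        }
        total += n;
    }
}
```

A firewall that unconditionally drops zero-window segments will also break well-behaved peers like this one that are simply slow to drain their buffers.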
It's like watching a murder mystery unfold. It feels really daunting to dive this deep into a bug from such vague symptoms. It's probably selection bias for what gets on the HN front page, but it feels like a large minority here can tackle something like this. I have trouble imagining having that much of a handle on Linux to feel comfortable hot-patching the kernel because I suspect something is wrong in the networking stack.
As somebody who implemented a small user-space TCP stack long ago, I always get uneasy when people tell me they just put events into some message queue and never consider all the edge cases that can happen when either the MQ or the consuming servers choke up. The problems are pretty much the same as with TCP flow control. It is easy to build software that only appears to be working well.
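A generic sketch of that point (not tied to any particular queueing product): a bounded queue where a fast producer blocks when the slow consumer falls behind. That blocking is the same backpressure TCP flow control gives you; swap the bounded buffer for an unbounded one and the "choke up" failure just moves into memory growth and, eventually, lost events.

```c
/* Minimal bounded producer/consumer queue with backpressure (generic sketch). */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define CAPACITY 8

static int items[CAPACITY];
static size_t head, tail, count;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

static void put(int item)
{
    pthread_mutex_lock(&lock);
    while (count == CAPACITY)          /* backpressure: wait instead of dropping */
        pthread_cond_wait(&not_full, &lock);
    items[tail] = item;
    tail = (tail + 1) % CAPACITY;
    count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

static int get(void)
{
    pthread_mutex_lock(&lock);
    while (count == 0)
        pthread_cond_wait(&not_empty, &lock);
    int item = items[head];
    head = (head + 1) % CAPACITY;
    count--;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&lock);
    return item;
}

static void *consumer(void *arg)
{
    (void)arg;
    for (;;) {
        printf("consumed %d\n", get());
        usleep(100 * 1000);            /* deliberately slow consumer */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, consumer, NULL);
    for (int i = 0; ; i++)
        put(i);                        /* producer stalls once the queue is full */
}
```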
I wonder if this is the cause of a nasty NFSv3 issue I was having years ago, where clients would periodically hang indefinitely, necessitating a system reboot. At the time, we were ingesting large videos on the client and transferring them to a shared volume via NFS.
I'd suspect a bug in the NFS implementation. That would hardly be unheard of.
NFS's failure mode of freezing up your system and requiring a full reboot to clear is purestrain NFS though. I never understood why the idea of an eventual soft failure (returning a socket error) was considered unacceptable in NFS land.
> I never understood why the idea of an eventual soft failure (returning a socket error) was considered unacceptable in NFS land.
Problems like this are usually the result of being unable to decide on an appropriate timeout, so no timeout is chosen. To get beyond that, I like to suggest rather long timeouts, like one day or one week, rather than forever. Very few people are going to say, after a read has been retrying for a whole day, that it should have tried longer.
Another issue is that POSIX file I/O doesn't have great error indicators, so it can be tricky to plumb things through in clearly correct ways.
NFS is notorious for breaking kernel and application assumptions about POSIX. Linux falls into this trap in various ways too, in an effort to simplify the common cases. Timeouts might be appropriate for read/open/etc. calls, but in a way the problems are worse on the write/close/etc. side.
Reading the close() manpage hints at some of those problems, but fundamentally POSIX synchronous file I/O isn't well suited to handling space and I/O errors which are deferred from the originating call. Consider write()s which are buffered by the kernel but can't be completed due to network or out-of-space conditions. A naive reading of write() would imply that errors should be returned immediately, so that the application knows the latest write/record update failed. Yet what really happens is that, for performance reasons, the data from those calls is allowed to be buffered, leading to a situation where an I/O call may return a failure as a result of something that failed at some point in the past. Given all the ways this can happen, the application cannot accurately determine what was actually written, if anything, since the last serialization event (which is itself another set of problems).
edit: this also gets into the ugly case of the state of the fd being unspecified (per POSIX) following close() failures. So per POSIX the correct response is to retry close(), while simultaneously assuring that open()s aren't happening anywhere. Linux simplifies this a bit by implying the fd is closed, but that has its own issues.
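As a concrete illustration of the deferred-error problem (the path below is made up, and this is just a sketch of the checking pattern, not a recovery strategy): on NFS a write() can appear to succeed while the real failure only surfaces at fsync() or close(), so both of those return values need checking too.

```c
/* Sketch: buffered writes can "succeed" while the real error is only
 * reported later, at fsync() or close().  Path is hypothetical. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/nfs/report.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    memset(buf, 0, sizeof(buf));

    /* May return success even if the server is unreachable or out of space:
     * the kernel has only buffered the data at this point. */
    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
        perror("write");

    /* A deferred write error, if any, is typically reported here... */
    if (fsync(fd) != 0)
        fprintf(stderr, "fsync failed: %s\n", strerror(errno));

    /* ...or here.  Per POSIX the fd's state after a failed close() is
     * unspecified; Linux considers it closed either way. */
    if (close(fd) != 0)
        fprintf(stderr, "close failed: %s\n", strerror(errno));

    return 0;
}
```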
I understand the reasoning, but at the same time I wonder if this isn't a case of perfect being the enemy of good. Since there is no case where a timeout/error-style exit can be guaranteed to never lose data, we instead lock the entire box up when an NFS server goes AWOL. That still loses the data, but it also brings down everything else.
Well, soft mounts should keep the entire machine from dying, unless you're running critical processes off the NFS mount. Reporting/debugging these cases can be fruitful.
OTOH, PXE/HTTPS+NFS root is a valid config, and there isn't really any way to avoid machine/client death when the NFS server goes offline for an extended period. Even without NFS, Linux has gotten better at dealing with full filesystems, but even that is still hit or miss.
Great write-up. It's common to add a retry or reconnect mechanism to connections even when there's no requirement for one. It's basically "restart your computer" to see if the issue disappears. So it actually hides bugs for decades :)
This was a great article!
I wonder why such problems don't show up on Windows... Is it because they have so many developers, or because Windows has to reboot at least every two weeks?
Since they never describe the context: Skroutz seems to be the dominant online price-comparison site in Greece. Which, I agree, would make the name make sense.
I had never heard the name before, and I felt the article lacked some context. Googling it, there seems to be very little content about them in English, which makes the nice blog post almost surprising. :)
Correct. Greeks (and a few other markets) use it to compare prices and buy stuff. That being said, they are in the process of expanding their business right now with new products/different offerings.
There are some growing pains there, but it's an interesting company overall, and they experiment with work/life balance (e.g. they did 4-day weeks at some stage; unsure if they went ahead with that or reverted to 5-day weeks).