And since we're adding random related things, lcamtuf's museum of broken packets is an old, but always interesting read: http://lcamtuf.coredump.cx/mobp/
What makes this story so great is how determined they were to find the real problem even though it was already "fixed" by changing the routes. Who knows how much anguish this person has spared other people by not giving up until the problem was truly solved. Doing it with such skill and ingenuity was the icing on the cake; I could read these kinds of stories all day.
What is the general lesson we should learn from this?
Postel's law aka the robustness principle [1] can easily lead to accumulating complexity when implementations adapt to the bugs in other implementations. How could protocol designers mitigate this problem beforehand?
The lesson is "be strict in what you accept, from the beginning".
For instance, a lesson learned by the Linux kernel developers: when adding a new system call to an operating system, if you have a flags parameter (and you should, which is another lesson they learned), fail with -EINVAL (or equivalent) if any unknown flag is set. Otherwise, buggy programs will depend on these flags being ignored, and you won't be able to use them later for new features.
But it has to be from the beginning; once the protocol is "in the field", you can't increase the strictness without pain.
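To make that concrete, here's a minimal userspace-style sketch of the pattern in C (the names and flag values are hypothetical, and this is not actual kernel code):

    #include <errno.h>

    /* Hypothetical flags for a hypothetical foo_syscall(). */
    #define FOO_FLAG_A      (1u << 0)
    #define FOO_FLAG_B      (1u << 1)
    #define FOO_VALID_FLAGS (FOO_FLAG_A | FOO_FLAG_B)

    int foo_syscall(unsigned int flags)
    {
        /* Reject any bit we don't know about yet, so a future version can
           safely assign it a meaning without breaking existing callers. */
        if (flags & ~FOO_VALID_FLAGS)
            return -EINVAL;

        /* ... handle the known flags ... */
        return 0;
    }

A caller that passes garbage in the reserved bits fails on day one, instead of silently working right up until those bits gain a meaning.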
> Otherwise, buggy programs will depend on these flags being ignored, and you won't be able to use them later for new features.
I don't understand why that's true. Just add the new flag and use it for your new feature. Aren't buggy programs that were already sending the flag responsible for their own bugs?
> Aren't buggy programs that were already sending the flag responsible for their own bugs?
When the kernel breaks userspace, it's a kernel bug. It's a philosophy for system robustness that the kernel has and other operating systems tend to adopt as well. As you get higher up the stack into 3rd party libraries and other programming tools the maintainers often take a more cavalier approach to maintaining compatibility.
Yes and no. If you push an update and sysadmins all over the world dutifully upgrade and immediately notice programs breaking, their first thought is not going to be "oh, those programs must have had latent bugs." No, they're going to blame you. Besides, having stuff work correctly is always more fun than assigning blame, and validating flags is an easy way to avoid these scenarios.
Hm, fair enough. To be clear, I wasn't arguing against validating flags. I was commenting about "This kernel function only gets 32 possible bitflags, but since they never validated their flags, they can no longer add any additional flags, ever, because it might break other programs which may or may not even exist."
That sort of mentality seems like it would push designers in the direction of poor design decisions. If a bitflag is the best design for a new feature, but they're prevented from using it out of a sense of "Let's not ever break anything ever," then the result may be a bad design that people are stuck with for the next 50+ years, which seems objectively worse.
But my reaction is based on theory and not backed by experience, so it's probably unfounded.
I'd say one of the lessons is that when even a supposedly well-standardised system sees hundreds of implementations (or more!), the accumulated bug baggage can still make it hacky, with per-platform code. For comparison, consider web browsers: although they are far better these days than they used to be, between just Chrome, Firefox, IE and Safari there's a bunch of quirks and platform-specifics, so I can imagine worldwide TCP deployments are "interesting".
Where possible, hacks should be applied only where necessary, e.g. only to the specific software versions affected, excluding fixed versions. Then hopefully in the long run the old buggy versions die out and the hack can be removed... but as the article says, over a network it's not always possible to identify when to apply a hack.
I have to confess I have quite limited knowledge of TCP/IP stack internals, e.g. the way stack extensions work, et cetera.
Does anyone know of any available online visual materials/tutorials? I'm particularly searching for tools capable of recording and replaying TCP/IP stack packets with visual representation, references to RFCs and specifications.
Richard Stevens' "TCP/IP Illustrated" books are the original works, Unix network programming at its finest. They're more advanced than Comer's book, and much more applicable. Get used, older editions if they're cheaper; you will be fine.
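Not visual, but if you also want to poke at captures programmatically while reading: a rough libpcap sketch (assuming libpcap is installed and "eth0" is a placeholder interface; compile with -lpcap). tcpdump and Wireshark show the same data with far better presentation, and tcpreplay can replay saved captures:

    #include <pcap.h>
    #include <stdio.h>

    /* Print a one-line summary for each captured packet. */
    static void handler(u_char *user, const struct pcap_pkthdr *h,
                        const u_char *bytes)
    {
        (void)user; (void)bytes;
        printf("captured %u bytes at %ld.%06ld\n",
               h->caplen, (long)h->ts.tv_sec, (long)h->ts.tv_usec);
    }

    int main(void)
    {
        char errbuf[PCAP_ERRBUF_SIZE];
        pcap_t *p = pcap_open_live("eth0", 65535, 1, 1000, errbuf);
        if (!p) { fprintf(stderr, "%s\n", errbuf); return 1; }

        /* Only capture TCP packets with SYN set, i.e. handshakes. */
        struct bpf_program prog;
        if (pcap_compile(p, &prog, "tcp[tcpflags] & tcp-syn != 0", 1,
                         PCAP_NETMASK_UNKNOWN) == 0)
            pcap_setfilter(p, &prog);

        pcap_loop(p, 10, handler, NULL);   /* process 10 matching packets */
        pcap_close(p);
        return 0;
    }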
Pretty much everything in technology is harder than it looks. The other day it took me a full day to configure Apache Solr for the first time, while my estimate was just one hour.
A SYN-ACK which does not get retransmitted, advertises a zero window and does not have options looks like some implementation of SYN cookies. A nice (and useful in some cases) hack, but a violation of the TCP spec, which is why it is disabled by default in most places that implement it.
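For anyone unfamiliar with the trick: a SYN cookie folds the would-be connection state into the initial sequence number of the SYN-ACK, so the server keeps no state until the final ACK echoes the cookie back. A rough sketch of the idea in C follows; the bit layout and hash are made up for illustration and don't match any real stack:

    #include <stdint.h>

    /* Placeholder mix; a real implementation uses a keyed cryptographic hash. */
    static uint32_t keyed_hash(uint32_t saddr, uint32_t daddr,
                               uint16_t sport, uint16_t dport, uint32_t secret)
    {
        uint32_t h = secret;
        h ^= saddr;                            h *= 2654435761u;
        h ^= daddr;                            h *= 2654435761u;
        h ^= ((uint32_t)sport << 16) | dport;  h *= 2654435761u;
        return h;
    }

    /* Encode a coarse time counter, an MSS table index and the hash of the
       4-tuple into the ISN sent in the SYN-ACK. */
    uint32_t make_syn_cookie(uint32_t saddr, uint32_t daddr,
                             uint16_t sport, uint16_t dport,
                             uint32_t count, uint32_t mss_index, uint32_t secret)
    {
        uint32_t hash = keyed_hash(saddr, daddr, sport, dport, secret);
        return (hash & 0xFFFFF800u) | ((count & 0xFFu) << 3) | (mss_index & 7u);
    }

    /* On the final ACK, recompute the hash and compare the high bits of
       (ack_seq - 1); a real check would also verify the counter is recent. */
    int check_syn_cookie(uint32_t cookie, uint32_t saddr, uint32_t daddr,
                         uint16_t sport, uint16_t dport, uint32_t secret)
    {
        uint32_t hash = keyed_hash(saddr, daddr, sport, dport, secret);
        return (cookie & 0xFFFFF800u) == (hash & 0xFFFFF800u);
    }

The trade-offs line up with the behaviour described above: with no stored state the server has nothing to retransmit the SYN-ACK from, and any option it didn't squeeze into the cookie is lost, which is why most stacks only fall back to cookies under SYN-flood pressure.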
The author uses customized settings for his TCP stack and then laments that some nodes on the internet which aren't under his control depend on more common settings. Honestly, I don't see why he could have expected any different outcome.
"Our TCP implementation has a variable base SYN retransmit timout, and in this case it was roughly 500ms. So most of the time the page load would fail with our TCP stack, but succeed with an off the shelf one that had a SYN retransmit timeout of 1 second."
If his timeout were 1 second, his connection with the other node would work. It was his own choice to insist on a "variable base SYN retransmit timeout" even with nodes which weren't tested with such a setup. He also says:
"The options are either to tell a customer that some traffic won't work (which is normally unacceptable, even if the root cause is undeniably on the other end), or to water down a useful feature a little bit to at least fail no more often than the "competition" does."
Meaning, the way they use that less common setting is "all or nothing": either all clients or none, with the assumption that this way he will be "competitive". Again, why is it then surprising to discover that "the whole world" is not perfect, once you implement your part on the assumption that it is?
Just because he knows what the "ideal case" is, he can't expect that everything he confronts will be ideal.
I don't think it was an issue with the timeout. The other side simply could not handle a retransmitted SYN. At all. The 500ms comes up only because that happened to be their RTT. If they used 1 second, this host would work, but another host using the same stack behind a 1 second delay would continue to fail.
It's not about the other host not having an ideal implementation, or about testing ideal cases. The other host had a bug and could not cope with some situations. There's nothing they could do about it, apart from tweaking the timeout to some bigger value.
Exactly. And note that we're not talking outlandish latencies here. A 2G connection, or a combination of a 3G connection + OS X + being on the wrong side of the globe would've done it. Or a single well placed packet loss (losing the SYNACK).
Often when you have these kinds of incompatibility issues, they are possible to fix with no cost at all. There's no harm in rearranging or re-aligning your TCP options for maximum compatibility. Or it's possible to notice from either the handshake or later connection behavior that the other end is dodgy, and conditionally disable whatever feature might cause trouble. But you can't possibly do that when you have no information about the other end. That's why I was particularly annoyed by this issue.
(And I certainly wasn't expecting everything to work perfectly, like "acqq" claims. That expectation would be quickly beaten out of anyone dealing with arbitrary TCP traffic.)
Could you map all such buggy devices on the internet and treat their known addresses specially? (Scan the whole public IPv4 range every month to find them?)
I worked for many years with locations on remote islands with high RTT. You wouldn't believe how much software from the biggest industry players failed to work under those circumstances.
My favorite case was when one company first claimed "OK, on the other side it's probably a device from the competition." It was their device on the other side too. They "debugged" the case for the better part of a year and it remained unsolved.
I'm absolutely not saying that the far side implementation was OK. I concur that TCP is damn hard. It's just that you can't expect "everybody else" to be perfect. You have to plan to handle the special cases or not be surprised that you can't handle them.
> You have to plan to handle the special cases or not be surprised that you can't handle them.
What do you propose then? You send valid traffic and get a completely broken response - I don't see any space left for handling special cases here. Like they said in the article, it's the first packet and you have no information about the other side yet. This is not something you can avoid or work around.
As soon as you think "I send a valid packet, everything must work", you're missing the point. Even the OP at the end worries about actual client communication not working. In the end, it's not about who's "theoretically" right, it's about whether you can make it work given the real-world limitations, which include implementations that were never tested against your "clever, better than the competition" dynamic timeout modification. It can be clever, but be clever some more and plan for the cases where it runs up against real-life limitations. And don't cry foul; it's you who moved into less tested territory, expecting to be better than the "competition".
It's the same reason I don't complain about the herd reactions here: they are probable, therefore expected.
Sigh... let me say this again clearly, because you keep repeating the same point. This case has nothing to do with timeout modification. It will happen with or without it. If you use a lower timeout you will run into the issue more often; if you use a higher timeout, less often. But you will run into the issue anyway, and there is no way to fix it.
So just tell us what your proposed solution / implied more tested territory is. You try to start a connection, send a SYN, and don't see a response for X (500ms, 1s, whatever, you choose). What do you propose to do that isn't a retransmit (breaks the other side) or choosing new seq numbers (breaks good behaviour on your side by starting two connections, and possibly trips flood detection if you do it too often)?
Yes, the buggy implementation is buggy in an idealistic sense. But it doesn't matter. The fact is, that bug was de facto not visible until the OP introduced the more "clever" (shorter timeout) code. If the node were really, utterly problematic it wouldn't be on the internet.
What to do in this case? Well, what do you want to do? Want it to work with everything the same way? Then you have to start with a timeout of 1s, like other TCP stacks do. That's what was tested, and then your behaviour toward these nodes wouldn't be different from the rest of the internet. If you don't want to do this for all of your connections (you'd be losing an advantage over the competitors), then be ready to keep a list of nodes where you observe the behaviour and assign a different starting timeout to them. Etc. There is always a solution; the wrong approach is "it can't be solved because it's against the RFC." The solution is making something work under the real-life limitations, not wishing for a world with no bad implementations (a state that's impossible to reach).
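A hypothetical sketch of that "list of nodes with a different starting timeout" idea in C (made-up names and addresses, not anyone's real stack):

    #include <stddef.h>
    #include <stdint.h>

    struct timeout_override {
        uint32_t addr;        /* peer IPv4 address (host byte order) */
        uint32_t syn_rto_ms;  /* SYN retransmit timeout to use for it */
    };

    /* Peers observed to choke on retransmitted SYNs; 1000ms matches the
       off-the-shelf stacks they were apparently tested against. */
    static const struct timeout_override overrides[] = {
        { 0x0A000001u, 1000 },   /* 10.0.0.1, placeholder entry */
    };

    uint32_t syn_rto_for_peer(uint32_t addr, uint32_t default_ms)
    {
        for (size_t i = 0; i < sizeof overrides / sizeof overrides[0]; i++)
            if (overrides[i].addr == addr)
                return overrides[i].syn_rto_ms;
        return default_ms;
    }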
BTW, I have the impression we don't understand each other because you have a "mathematical" approach (find one counterexample, the theorem is disproved, wave your hands in the air in helplessness). I'm an engineer. There's a real life out there. Everything you can imagine will have a real-life counterexample. Deal with it. Whatever you do, it's your decision: ignore it, adapt, whatever, it depends on what you want to do. Just don't stop at "the problem is that the others are wrong; I'm right and that's it." Unless you want to do this for show. But then allow me to claim it's just a show.
Imagine Google writing around 2000: "we won't index badly formed HTML pages, because they are against the standard, and our XML processor will never work on them." There wouldn't be a Google today.
> I concur that TCP is damn hard. It's just that you can't expect "everybody else" to be perfect.
There is no evidence that he expected everyone else to be perfect. This is simply a debugging story about a case where he found an unexpected problem with an interesting cause. You don't need to make it an opportunity for treating him like he was stupid for having a bug.
edit: and in fact, you say in your response, "You wouldn't believe how much software of the biggest industry players failed to work under that circumstances." In other words: "You wouldn't believe X! But if you're surprised by X, I will scoff and act like you're a clueless noob for not anticipating X."
The people who expect that "everything must work because of RFC xxx" are of course those who wouldn't believe it. It's addressed to them. Regarding "There is no evidence that he expected everyone else to be perfect": he wants to use his custom TCP stack, and he explains that for this case he can't do "heuristics" in order to keep the behaviour where "Our TCP implementation has a variable base SYN retransmit timeout, and in this case it was roughly 500ms. So most of the time the page load would fail with our TCP stack, but succeed with an off the shelf one that had a SYN retransmit timeout of 1 second."
So everything really revolves around the less common timeout values the whole time: the wish to use them everywhere, the discovery that they don't work everywhere, and the resulting feeling "from my point of view this is a particularly annoying bug." Which is exactly what I commented on from the start. True, those sentences don't dominate the whole text in the original, so a less attentive reader (or readers, in this case) can miss that.
There are two different approaches to writing the article:
1) I will show you that the other side is wrong (spending 95% of the time on that). It is annoying because our clever new code doesn't work with it whereas the old, less clever code does.
2) We developed clever new TCP code. We thought we were clever. But some nodes on the internet are less clever. I'll show you an example.
I claim that approach 1) implies that it's more important to you to be "right" than to have things actually work.
I really don't understand where you're coming from.
But the post did not "spend 95% of the time showing the other side is wrong", and it's really rich to say that at the same time as you're saying that it's easy for readers to miss the damning sentences when they're not "dominating the whole text". Most of the post was discussing the age-old problem of TCP implementations needing to be bug-compatible with each other. In addition it showed a new (to me) example of this phenomenon, and took a detour into explaining why that bug was easier to trigger with our stack than with some others.
It's like you've for some reason decided that I don't understand that there are broken implementations around, that it came as a complete surprise to me, and that the only thing that matters is that I'm right and the other guy is wrong. But the whole point of the post was exactly that it doesn't matter whether you're right or not. If you want a production quality TCP implementation you need to support what's out there, not what's in the spec.
The short term solution was a configuration change to force a 1s SYN retransmit timeout, even if that's a 99.9th percentile handshake RTT in this environment. In the longer term we need to evaluate the timeouts in the most widely deployed systems, so that we can offer a calibrated set of completely safe / reasonable / aggressive values.
For example one of the links in the post suggests that OS X has a 700ms SYN retransmit timeout. If that's true (haven't tested yet), and the same is true for iOS, then a 700ms timeout might be a pretty reasonable setting rather than an aggressive one. Because if a piece of kit fails for iOS traffic, it's not going to remain in service for long ;-)
One of those cases where you can't do much more than follow the herd.
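As an illustration of that calibrated-set idea, a hedged sketch in C; the values are placeholders drawn from the numbers mentioned in this thread, and the real ones would have to come from measurement:

    /* Hypothetical SYN retransmit timeout presets. */
    enum syn_rto_profile { SYN_RTO_SAFE, SYN_RTO_REASONABLE, SYN_RTO_AGGRESSIVE };

    unsigned int syn_rto_ms(enum syn_rto_profile p)
    {
        switch (p) {
        case SYN_RTO_SAFE:       return 1000; /* matches common off-the-shelf stacks */
        case SYN_RTO_REASONABLE: return 700;  /* reportedly OS X's SYN retransmit timeout */
        case SYN_RTO_AGGRESSIVE: return 500;  /* roughly the value from the article */
        }
        return 1000;
    }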
I'm glad to learn that your first solution matched exactly what I assumed it would be. Had the solution been present in the article, I certainly wouldn't have commented that it's just "look how others on the internet are wrong."
Thank you for sharing these new details. With them, I feel much better about having gotten involved in reading and discussing this. In my view, the approach to the solution as you've now stated it matches the concepts I've stood for in my comments here. I wish you luck in your endeavours.