"Our TCP implementation has a variable base SYN retransmit timout, and in this case it was roughly 500ms. So most of the time the page load would fail with our TCP stack, but succeed with an off the shelf one that had a SYN retransmit timeout of 1 second."
If his timeout were 1 second, his connection with the other node would work. It's his own choice to insist on a "variable base SYN retransmit timeout" even with nodes that were never tested with such a setup. He also says:
"The options are either to tell a customer that some traffic won't work (which is normally unacceptable, even if the root cause is undeniably on the other end), or to water down a useful feature a little bit to at least fail no more often than the "competition" does."
Meaning, the way they use that less common setting is "all or nothing": either all clients or none, with the assumption that this way he will be "competitive." Again, why is it then surprising to discover that "the whole world" is not perfect, once you implement your part on the assumption that it is?
Just because he knows what the "ideal case" is he can't expect that everything he confronts would be ideal.
I don't think it was an issue with the timeout. The other side simply could not handle a retransmitted SYN. At all. The 500ms comes up only because that happened to be their RTT. If they used 1 second, this host would work, but another host using the same stack but with a 1-second delay would continue to fail.
It's not about the other host not having an ideal implementation, or about testing only ideal cases. The other host had a bug that makes it unable to cope with some situations. There's nothing they could do about it, apart from tweaking the timeout to some bigger value.
Exactly. And note that we're not talking outlandish latencies here. A 2G connection, or a combination of a 3G connection + OS X + being on the wrong side of the globe, would've done it. Or a single well-placed packet loss (losing the SYNACK).
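A rough back-of-the-envelope way to see this (a hedged sketch; only the ~500ms and 1s figures come from the discussion, the rest of the numbers and the function name are made up): a duplicate SYN gets sent, and the buggy host breaks, whenever the SYN retransmit timer fires before the SYN-ACK arrives, or whenever the SYN-ACK is lost outright.

```python
# Illustrative check only: the buggy host breaks whenever it sees a
# retransmitted SYN, and the client retransmits whenever the SYN retransmit
# timer fires before the SYN-ACK arrives (or the SYN-ACK is lost).

def triggers_bug(handshake_rtt_ms, syn_rto_ms, synack_lost=False):
    """True if the client would retransmit its SYN before the handshake completes."""
    if synack_lost:
        return True  # the only recovery for a lost SYN-ACK is a duplicate SYN
    return handshake_rtt_ms > syn_rto_ms

print(triggers_bug(550, 500))    # RTT just above a ~500 ms timeout: duplicate SYN, bug fires
print(triggers_bug(550, 1000))   # same path with a 1 s timeout: handshake completes first
print(triggers_bug(1100, 1000))  # a ~1.1 s path fails even with the 1 s timeout
print(triggers_bug(300, 1000, synack_lost=True))  # a lost SYN-ACK also forces a duplicate SYN
```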
Often when you have these kinds of incompatibility issues, they are possible to fix at no cost at all. There's no harm in rearranging or realigning your TCP options for maximum compatibility. Or it's possible to notice, from either the handshake or later connection behavior, that the other end is dodgy, and conditionally disable whatever feature might cause trouble. But you can't possibly do that when you have no information about the other end. That's why I was particularly annoyed by this issue.
(And I certainly wasn't expecting everything to work perfectly, like "acqq" claims. That expectation would be quickly beaten out of anyone dealing with arbitrary TCP traffic.)
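For what it's worth, here is a minimal, purely hypothetical sketch of the "notice the other end is dodgy and conditionally disable a feature" idea mentioned a couple of paragraphs up. The SynAckInfo fields and the negotiate_options function are invented for illustration; they are not from any real stack. And as noted, none of this helps with the SYN retransmit case, because at that point no reply from the peer has been seen at all.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SynAckInfo:
    window_scale: Optional[int]
    sack_permitted: bool
    timestamp_echo_broken: bool

def negotiate_options(synack, wanted):
    """Drop optional TCP features the peer's SYN-ACK suggests it mishandles."""
    options = set(wanted)
    if synack.window_scale is None:
        options.discard("window_scale")   # peer ignored window scaling; don't rely on it
    if not synack.sack_permitted:
        options.discard("sack")           # peer never echoed SACK-permitted
    if synack.timestamp_echo_broken:
        options.discard("timestamps")     # peer echoes garbage timestamps
    return options

# Example: a peer that mangles timestamps but handles everything else.
peer = SynAckInfo(window_scale=7, sack_permitted=True, timestamp_echo_broken=True)
print(negotiate_options(peer, {"window_scale", "sack", "timestamps"}))
```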
Could you map all such buggy devices on the internet and treat their known addresses specially? (Scan the whole public IPv4 range every month to find them?)
I worked for many years with locations on remote islands with high RTTs. You wouldn't believe how much software from the biggest industry players failed to work under those circumstances.
My favorite case was when one company first claimed, "OK, the device on the other side is probably from the competition." It was their device on the other side too. They "debugged" the case for the better part of a year, and it remained unsolved.
I'm absolutely not saying that the far side's implementation was OK. I concur that TCP is damn hard. It's just that you can't expect that "everybody else" is perfect. You have to plan to handle the special cases or not be surprised that you can't handle them.
> You have to plan to handle the special cases or not be surprised that you can't handle them.
What do you propose then? You send valid traffic and get a completely broken response - I don't see any space left for handling special cases here. Like they said in the article, it's the first packet and you have no information about the other side yet. This is not something you can avoid or work around.
As soon as you think "I send a valid packet, everything must work" you're missing the point. Even OP at the end worries about the actual client communication actually not working. At the end, it's not who's "theoretically" right, it's "can you make it work given the real world limitations" that include the implementations not tested with your "clever better than the competition dynamic timeout modification." It can be clever, but be clever more and plan for the cases when it is against the real life limitations. And don't cry foul. It's you who move into less tested territories, expecting to be better than the "competition".
The same way I don't complain about the herd reactions here: they are probable, therefore expected.
Sigh... let me say this again clearly, because you keep repeating the same point. This case has nothing to do with the timeout modification. It will happen with or without it. If you use a lower timeout, you will run into the issue more often. If you use a higher timeout, less often. But you will run into the issue anyway, and there is no way to fix it.
So just tell us your proposed solution, or what the implied "more tested territory" is. You try to start a connection, send a SYN, and don't see a response for X (500ms, 1s, whatever, you choose). What do you propose to do that isn't a retransmit (which breaks the other side) or choosing new sequence numbers (which breaks good behaviour on your side by starting two connections, and possibly trips flood detection if you do it too often)?
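To make the dilemma concrete, here is a minimal sketch of the conventional client-side behaviour being described: when the timer fires, you re-send the same SYN (same initial sequence number) with exponential backoff. The send_syn and wait_for_synack callables and the parameter values are placeholders, not a real API; the point is that the only standard move is exactly the retransmit the buggy peer cannot handle.

```python
def connect_with_syn_retransmits(send_syn, wait_for_synack,
                                 initial_rto=1.0, max_retries=6):
    """Standard client behaviour: re-send the same SYN with exponential backoff."""
    rto = initial_rto
    for _ in range(max_retries + 1):
        send_syn()                       # same initial sequence number every time;
                                         # a new ISN would effectively be a new connection
        if wait_for_synack(timeout=rto):
            return True                  # handshake completed
        rto *= 2                         # back off; the buggy peer chokes on exactly this retransmit
    return False                         # give up and report a connection timeout

# Toy usage against a peer that never answers.
print(connect_with_syn_retransmits(lambda: None, lambda timeout: False,
                                   initial_rto=0.01, max_retries=2))   # -> False
```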
Yes, the buggy implementation is buggy in an idealistic sense. But it doesn't matter. The fact is, that bug was de facto invisible until the OP introduced the more "clever" (shorter-timeout) code. If the node were really, utterly problematic, it wouldn't be on the internet.
What to do in this case? Well, what do you want to do? Want it to work with everything just the same? Then start with a timeout of 1s, like other TCP stacks do. That's what was tested, and your behaviour toward these nodes wouldn't differ from the rest of the internet. If you don't want to do this for all of your connections (you'd be giving up an advantage over the competitors), then be ready to keep a list of nodes where you observe this behaviour and assign them a different starting timeout. And so on. There is always a solution; the wrong approach is "it can't be solved because it's against the RFC." The solution is making something work under real-life limitations, not expecting a world where there are no bad implementations (a state that's impossible to reach).
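One possible shape of the per-node list suggested above, as a hedged sketch: keep a table of peers observed to choke on early SYN retransmits and give only those a conservative initial timeout. The addresses, timeout values, and function names are invented for illustration.

```python
DEFAULT_SYN_RTO = 0.5        # the aggressive default, in seconds (illustrative)
CONSERVATIVE_SYN_RTO = 1.0   # what off-the-shelf stacks commonly use

# Peers previously observed to choke on early SYN retransmits
# (populated from past failures or operator configuration; addresses are examples).
known_fragile_peers = {"203.0.113.7", "198.51.100.42"}

def initial_syn_timeout(peer_addr):
    return CONSERVATIVE_SYN_RTO if peer_addr in known_fragile_peers else DEFAULT_SYN_RTO

def record_handshake_failure(peer_addr):
    """Remember a peer whose handshake only succeeded once early retransmits were avoided."""
    known_fragile_peers.add(peer_addr)

print(initial_syn_timeout("203.0.113.7"))   # 1.0
print(initial_syn_timeout("192.0.2.1"))     # 0.5
```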
BTW, I have the impression we don't understand each other because you take a "mathematical" approach (find one counterexample, the theorem is disproved, wave your hands in the air in helplessness). I'm an engineer. There's real life out there. Everything you can imagine will have a real-life counterexample. Deal with it. Whatever you do, it's your decision: ignore it, adapt, whatever; it depends on what you want to do. Just don't settle on "the problem is that the others are wrong, but I'm right and that's it." Unless you want to do this for show. But then allow me to claim it's just a show.
Imagine Google writing around 2000: "We won't index badly formed HTML pages, because they are against the standard and our XML processor would never work on them." There wouldn't be a Google today.
> I concur that TCP is damn hard. It's just that you can't expect that "everybody else" is perfect.
There is no evidence that he expected everyone else to be perfect. This is simply a debugging story about a case where he found an unexpected problem with an interesting cause. You don't need to make it an opportunity to treat him like he was stupid for hitting a bug.
edit: and in fact, you say in your response, "You wouldn't believe how much software from the biggest industry players failed to work under those circumstances." In other words: "You wouldn't believe X! But if you're surprised by X, I will scoff and act like you're a clueless noob for not anticipating X."
The people who expect that "everything must work because of the RFCxxx" are of course those who wouldn't believe. It's addressed to them. Regarding "There is no evidence that he expected everyone else to be perfect," he wants to use his custom TCP stack and he explains that for this case he can't do "heuristics" in order to keep the "Our TCP implementation" (which) "has a variable base SYN retransmit timout, and in this case it was roughly 500ms. So most of the time the page load would fail with our TCP stack, but succeed with an off the shelf one that had a SYN retransmit timeout of 1 second."
So everything really revolves around the less common timeout values: the wish to use them everywhere, the discovery that they don't work everywhere, and the resulting feeling that "from my point of view this is a particularly annoying bug." Which is exactly what I commented on from the start. True, those sentences don't dominate the original text, so a less attentive reader (or readers, in this case) can miss that.
There are two different approaches to writing the article:
1) I will show you that the other side is wrong (spending 95% of the time on that). It is annoying because our clever new code doesn't work with it, whereas the old, less clever code does.
2) We developed clever new TCP code. We thought we were clever. But some nodes on the internet are less clever. I'll show you an example.
I claim that approach 1) implies you believe it's more important that you're "right" than that things actually work.
I really don't understand where you're coming from.
But the post did not "spend 95% of the time showing the other side is wrong", and it's really rich to say that at the same time as you're saying that it's seasy for readers to miss the damning sentences when they're not "dominating the whole text". Most of the post was discussing the age old problem of TCP implementations needing to be bug-compatible with each other. In addition it showed a new (to me) example of this phenomenon, and did a detour into explaining why that bug was easier to trigger with our stack than some others.
It's like you've for some reason decided that I don't understand that there are broken implementations around, that it came as a complete surprise to me, and that the only thing that matters to me is that I'm right and the other guy is wrong. But the whole point of the post was exactly that it doesn't matter whether you're right or not. If you want a production-quality TCP implementation, you need to support what's out there, not what's in the spec.
The short-term solution was a configuration change to force a 1s SYN retransmit timeout, even if that's a 99.9th-percentile handshake RTT in this environment. In the longer term we need to evaluate the timeouts in the most widely deployed systems, so that we can offer a calibrated set of completely safe / reasonable / aggressive values.
For example one of the links in the post suggests that OS X has a 700ms SYN retransmit timeout. If that's true (haven't tested yet), and the same is true for iOS, then a 700ms timeout might be a pretty reasonable setting rather than an aggressive one. Because if a piece of kit fails for iOS traffic, it's not going to remain in service for long ;-)
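A hedged sketch of what such a calibrated set of presets might look like; the 1s figure and the unverified ~700ms OS X figure come from this thread, while the tier names and structure are made up for the example.

```python
SYN_RTO_PRESETS = {
    "safe": 1.0,        # matches the common off-the-shelf default, maximally bug-compatible
    "reasonable": 0.7,  # plausible if OS X / iOS really ship ~700 ms (unverified, per above)
    "aggressive": 0.5,  # the kind of value that exposed the broken host in the first place
}

def syn_retransmit_timeout(profile="safe"):
    return SYN_RTO_PRESETS[profile]

print(syn_retransmit_timeout("reasonable"))  # 0.7
```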
One of those cases where you can't do much more than follow the herd.
I'm glad to learn that your first solution matched exactly what I assumed it would be. Had the solution been present in the article, I certainly wouldn't have commented that it's just "look how others on the internet are wrong."
Thank you for sharing these new details. With them, I feel much better about having read and discussed this. In my view, the approach to a solution as you've now stated it matches the concepts I've stood for in my comments here. I wish you luck in your endeavours.