What is the general lesson we should learn from this?
Postel's law, a.k.a. the robustness principle [1], can easily lead to accumulating complexity when implementations adapt to the bugs in other implementations. How could protocol designers mitigate this problem beforehand?
The lesson is "be strict in what you accept, from the beginning".
For instance, here is a lesson the Linux kernel developers learned: when adding a new system call to an operating system, give it a flags parameter (that you should have one is another lesson they learned), and if any unknown flag is set, fail with -EINVAL (or the equivalent). Otherwise, buggy programs will depend on these flags being ignored, and you won't be able to use them later for new features.
But it has to be that way from the beginning; once the protocol is "in the field", you can't increase the strictness without pain.
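In code, the pattern is roughly this (just a sketch; the syscall and flag names are made up for illustration):

    #include <linux/errno.h>

    /* Hypothetical syscall: only two flags are defined today. */
    #define FROB_FLAG_A      (1u << 0)
    #define FROB_FLAG_B      (1u << 1)
    #define FROB_KNOWN_FLAGS (FROB_FLAG_A | FROB_FLAG_B)

    long sys_frobnicate(int fd, unsigned int flags)
    {
        /* Reject anything we don't understand instead of silently
         * ignoring it; that keeps the remaining bits usable for
         * future features. */
        if (flags & ~FROB_KNOWN_FLAGS)
            return -EINVAL;

        /* ... actual work ... */
        return 0;
    }

With that check in place, a later FROB_FLAG_C is safe to add: old kernels reject it cleanly with -EINVAL, so programs can probe for support instead of silently getting the old behaviour.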
> Otherwise, buggy programs will depend on these flags being ignored, and you won't be able to use them later for new features.
I don't understand why that's true. Just add the new flag and use it for your new feature. Aren't the buggy programs that were already sending the flag responsible for their own bugs?
> Aren't the buggy programs that were already sending the flag responsible for their own bugs?
When the kernel breaks userspace, it's a kernel bug. That's a philosophy of system robustness the kernel has, and one that other operating systems tend to adopt as well. As you get higher up the stack, into third-party libraries and other programming tools, maintainers often take a more cavalier approach to maintaining compatibility.
Yes and no. If you push an update and sysadmins all over the world dutifully upgrade and immediately notice programs breaking, their first thought is not going to be "oh, those programs must have had latent bugs." No, they're going to blame you. Besides, having stuff work correctly is always more fun than assigning blame, and validating flags is an easy way to avoid these scenarios.
Hm, fair enough. To be clear, I wasn't arguing against validating flags. I was commenting on the idea that "this kernel function only gets 32 possible bitflags, but since they never validated their flags, they can no longer add any additional flags, ever, because it might break other programs which may or may not even exist."
That sort of mentality seems like it would push designers toward poor design decisions. If a bitflag is the best design for a new feature, but they're prevented from using it out of a sense of "let's not ever break anything, ever", then the result may be a bad design that people are stuck with for the next 50+ years, which seems objectively worse.
But my reaction is based on theory and not backed by experience, so it's probably unfounded.
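For what it's worth, the usual escape hatch once the original flags word can't be trusted looks roughly like the sketch below. The names are hypothetical, loosely in the spirit of the extensible-struct style some newer Linux syscalls use, and the user-copy details are omitted:

    #include <linux/errno.h>
    #include <linux/types.h>

    /* Hypothetical "v2" of the call: arguments live in a struct with an
     * explicit size, so new fields can be appended later. */
    struct frob_args {
        __u64 flags;    /* validated strictly this time */
        __u64 new_knob; /* the feature that no longer fits in v1 */
    };

    #define FROB2_KNOWN_FLAGS 0x3ULL

    long sys_frobnicate2(int fd, struct frob_args *args, size_t size)
    {
        if (size < sizeof(struct frob_args))
            return -EINVAL;
        if (args->flags & ~FROB2_KNOWN_FLAGS)
            return -EINVAL;

        /* ... actual work using args->new_knob ... */
        return 0;
    }

It works, but it's exactly the kind of detour that strict validation from day one would have made unnecessary.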
I'd say one of the lessons is that when even a supposedly well-standardised system sees hundreds of implementations (or more!), the accumulated bug baggage can still make it hacky, with per-platform code. For comparison, consider web browsers: although they're far better these days than they used to be, between just Chrome, Firefox, IE and Safari there's a bunch of quirks and platform specifics, so I can imagine worldwide TCP deployments are "interesting".
Where possible, hacks should be applied only where necessary, e.g. only to the specific software versions affected, excluding fixed versions. Then, hopefully, the old buggy versions die out in the long run and the hack can be removed... but, as the article says, over a network it's not always possible to identify when to apply a hack.
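A minimal sketch of what "apply the hack only to affected versions" can look like, assuming you can even identify the peer (the implementation name and version cutoff here are entirely hypothetical):

    #include <stdbool.h>
    #include <string.h>

    /* Apply the workaround only for peers reporting a known-buggy
     * implementation and version, so the quirk can be deleted once
     * those versions die out. */
    static bool peer_needs_quirk(const char *impl, int major, int minor)
    {
        if (strcmp(impl, "frobd") != 0)
            return false;
        /* Say the bug was fixed in frobd 2.4; only older versions
         * need the hack. */
        return major < 2 || (major == 2 && minor < 4);
    }

The point is that the quirk is keyed on a concrete, removable condition rather than baked into the default behaviour for everyone.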
[1]: https://en.wikipedia.org/wiki/Robustness_principle