Then there is a discussion about forcefully killing sockets :)
* close(): socket will be lingering in background as usual
* shutdown(SHUT_RD): no network side effect, discards read buffer
* shutdown(SHUT_WR): equivalent to FIN
* close() with SO_LINGER set - if the timeout is non-zero, blocks until the write buffer is flushed; if the timeout is zero, immediately sends RST (see the sketch after this list)
* the trick with close() after TCP_REPAIR (https://lwn.net/Articles/495304/): immediately discards a socket with no network side effects.
* "ss --kill" command: forcefully close a socket from outside process, done with netlink SOCK_DESTROY command.
> * shutdown(SHUT_RD): no network side effect, discards read buffer
My understanding is that if the read buffer is not empty, or if you later receive any further data from the other end, this will result in an RST.
Wrt linger behaviour: "if timeout non-zero blocks until write buffer flushed" is only true of blocking sockets. For non-blocking sockets things get complicated and vary across platforms.
* shutdown(SHUT_RD): seems not to have _any_ side effects. You can totally still recv() on that socket. Kerrisk writes in 61.6.6: "However if the peer application subsequently writes data on its socket, then it is still possible to read that data on the local socket". Basically, SHUT_RD makes recv() return 0. That's all it does.
* SO_LINGER on O_NONBLOCK: shutdown() doesn't block. close() still blocks.
That is not discussed in POSIX at all, I believe, so basically, if you don't want to rely on platform specifics, SHUT_RD is vaguely defined and I wouldn't even rely on recv() returning zero in particular.
Ok so I decided to check it empirically. Behavior is indeed platform-dependent.
Linux: after shutdown(SHUT_RD) all blocked recv() calls unblock and return 0. But the other side can still send data and the recv() call will still read it! It is just that after shutdown when there is nothing to read a recv() call immediately returns 0 instead of blocking.
macOS (and BSD, I presume?): The read buffer is discarded and all subsequent recv() calls return 0. If the other side sends data it is discarded.
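Something along the lines of this sketch over loopback TCP reproduces it (error handling omitted, blocking sockets; the recv() values in the comments are just the per-platform behaviors described above):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        /* Set up a connected TCP pair over loopback. */
        int lst = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = { 0 };
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        addr.sin_port = 0;                      /* kernel picks a free port */
        bind(lst, (struct sockaddr *)&addr, sizeof(addr));
        listen(lst, 1);
        socklen_t len = sizeof(addr);
        getsockname(lst, (struct sockaddr *)&addr, &len);

        int cli = socket(AF_INET, SOCK_STREAM, 0);
        connect(cli, (struct sockaddr *)&addr, sizeof(addr));
        int srv = accept(lst, NULL, NULL);

        shutdown(cli, SHUT_RD);                 /* shut the read side down  */
        write(srv, "hello", 5);                 /* peer keeps sending       */
        sleep(1);                               /* let the data arrive      */

        /* Linux: returns 5 ("hello" is still readable after SHUT_RD).
         * macOS: returns 0 (the data was discarded).                       */
        char buf[16];
        ssize_t n = recv(cli, buf, sizeof(buf), 0);
        printf("recv() after shutdown(SHUT_RD) = %zd\n", n);

        close(cli); close(srv); close(lst);
        return 0;
    }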
Unfortunately I have no Windows machine around to try out.
Now, maybe someone can clarify: given such wildly different behavior, what is the intended use case for shutdown(SHUT_RD)?
It is likely a remnant of non-PF_INET families under SOCK_STREAM. There is no SHUT_RD in TCP by design.
Sockets have historically not been a very standardized landscape. The best bet here is to just stick with that “Disables further receive operations” POSIX definition and follow it to the letter by not recv()'ing anything anymore.
Also, just to clarify: shutdown() does not release the socket; the fd is still valid and you need to call close() on it to eventually release all the resources related to the socket.
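In code, the usual orderly-release pattern looks roughly like this (a sketch with a hypothetical helper name, error handling omitted): shut down the write side, keep reading until the peer closes too, then close() to actually free everything:

    #include <sys/socket.h>
    #include <unistd.h>

    void orderly_release(int fd)
    {
        char buf[4096];

        shutdown(fd, SHUT_WR);          /* our FIN: "no more data from me"  */

        /* The fd is still fully alive here; drain whatever the peer still
         * sends until recv() returns 0, i.e. we have seen its FIN.         */
        while (recv(fd, buf, sizeof(buf), 0) > 0)
            ;

        close(fd);                      /* only now are the resources freed */
    }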
I am constantly amazed at how a tiny piece of code, the Linux (or BSD) TCP stack, can be a source of mysteries and adventures for decades, even for kernel experts and industry leaders like CF. The thing has around 11 states and 20 transitions, around 4000 LOC.
Compare this with some of the multi-million LOC, distributed monsters we all know and love.
This is an example of "the settings problem". When a system has even a moderate number of behaviors controlled by their respective settings the number of possible interactions rises exponentially. Combine that with different possible user behaviors and you get a combinatorial explosion of possibilities with some weird results that are non-obvious even to experts.
That's a good point, plus it's not just (states x transitions), it's also a whole bunch of hidden state: the other guy's state as well as all the packets in flight.
Yes, sure, but the protocol and one implementation of it are pretty tightly coupled. As a measure of algorithmic complexity (à la Chaitin?), the kernel code should at least give us the order of magnitude: about 4K lines of code.
Compared to, say, Google's codebase, which is 2 billion lines: a tad more complex.
If you are going to force a minimum drain rate, please make sure you use a large enough monitoring period. With the patch in "The curious case of slow downloads", once 60 seconds have passed it starts checking download speed as often as every second, which is really aggressive. If you have a slow connection that's not super-stable, you're still going to get kicked, even if you're well over the minimum drain rate on average. An average of some kind over 15-20 seconds would be a lot more appropriate here.
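For illustration, a rough sketch of that kind of windowed check (all names and thresholds here are made up, not from the Cloudflare patch): feed it one sample per second and only kill the connection when the average over the whole window is below the minimum drain rate:

    #include <stdbool.h>
    #include <stdint.h>

    #define WINDOW_SECONDS        20      /* average over the last 20 seconds */
    #define MIN_BYTES_PER_SECOND  1000    /* hypothetical minimum drain rate  */

    static uint64_t samples[WINDOW_SECONDS];   /* bytes drained, per second   */
    static int      next   = 0;                /* ring-buffer write position  */
    static int      filled = 0;                /* seconds observed so far     */

    /* Call once per second with the bytes the peer drained in that second.
     * Returns true only when the average over the whole window is too low.   */
    bool should_kill(uint64_t drained_last_second)
    {
        samples[next] = drained_last_second;
        next = (next + 1) % WINDOW_SECONDS;
        if (filled < WINDOW_SECONDS) {
            filled++;
            return false;                      /* not enough history yet */
        }
        uint64_t total = 0;
        for (int i = 0; i < WINDOW_SECONDS; i++)
            total += samples[i];
        return total < (uint64_t)MIN_BYTES_PER_SECOND * WINDOW_SECONDS;
    }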
Very true. Just today, I had to use an old 3G phone tethered to my laptop for a data connection and found my ping times to be on the order of 5-10 seconds, with data getting sent/received in bursts at much lower latencies at sporadic intervals in between. It wasn't fun trying to get work done on such a connection.
What's really weird is that TCP is an endpoint protocol and should have been easier to upgrade/replace/change (relative to, say, the IP protocol).
But why haven't we moved to something better?
Say, why doesn't Apple use a better-suited protocol between Apple devices and Apple servers? Why doesn't Google use a better protocol between Google devices and Google servers (oh wait, they do: QUIC... which is something other than TCP, running over UDP/IP).
More people should be doing this, yes? why not?
It is as if the layered architecture of the network isn't being taken advantage of by engineers.
As people build 4G/5G networks that are IP-based, shouldn't we insist they be built as purely IP networks that don't peek into the layers above or make assumptions about them, thereby enabling more experimentation with flow-control and reliable-transmission protocols?
Because of NAT. It is most commonly used as port address translation (one-to-many NAT), so it only works with TCP and UDP.
That is not only a home router issue; mobile networks sometimes use carrier-grade NAT (NAT444).
Any new IP protocol has a problem with that, so nobody wants to implement something that is going to be broken for most customers.
It's not reimplementing if it is a new protocol with different capabilities. There's no point in having a layered architecture if we cannot evolve the layers, especially the layers designated as endpoint layers.
Thought this article was interesting, as I've been working with a piece of hardware lately that doesn't close SMTP connections after sending mail. Took me a while to figure that out: it would always send email properly the first time, but then wouldn't be able to again until after a reboot. Turns out it doesn't close the TCP sockets created while sending mail unless you jump through some hoops. Such is the world of embedded industrial devices, unfortunately.
This is indeed very problematic. I worked on an ONVIF test suite 3 years ago; in some failing test cases the TCP socket could never die in time, which failed the certification as a whole because all the following unit test cases could not continue. All those immediately-kill-TCP-socket or socket-port-reuse tricks could not help, at least not reliably.