This can also be used to implement connection acceleration with no kernel-side hackery involved: have userspace maintain a pool of half-open connections by spraying SYNs at a target and recording the SYN-ACKs, then construct a regular socket via TCP_REPAIR when a connection is actually required. That omits one round trip from a typical TCP connection setup, which may be substantial in a variety of scenarios.
The technique sounds messy, but it actually involves very little work when the target is Linux: a single half-open SYN is good for 63 seconds with the default sysctl settings, which seem to be in use at almost every Internet service you might want to reach (including e.g. Google).
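For reference, the restore side is roughly the standard TCP_REPAIR dance. This is a minimal sketch, not my actual code: the constants are copied from linux/tcp.h, CAP_NET_ADMIN is required, the helper name and its arguments are made up, and real code would also restore MSS/window scale via TCP_REPAIR_OPTIONS.

    import socket, struct

    TCP_REPAIR, TCP_REPAIR_QUEUE, TCP_QUEUE_SEQ = 19, 20, 21
    TCP_RECV_QUEUE, TCP_SEND_QUEUE = 1, 2

    def adopt_half_open(local_port, peer, snd_nxt, rcv_nxt):
        # snd_nxt/rcv_nxt come from the SYN we sprayed and the SYN-ACK we recorded
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        # Enter repair mode: nothing below puts a packet on the wire
        s.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR, 1)
        s.bind(("0.0.0.0", local_port))
        # Seed both queues' sequence numbers before connect()
        s.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR_QUEUE, TCP_SEND_QUEUE)
        s.setsockopt(socket.IPPROTO_TCP, TCP_QUEUE_SEQ, struct.pack("I", snd_nxt))
        s.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR_QUEUE, TCP_RECV_QUEUE)
        s.setsockopt(socket.IPPROTO_TCP, TCP_QUEUE_SEQ, struct.pack("I", rcv_nxt))
        # In repair mode connect() just installs the 4-tuple and marks the
        # socket ESTABLISHED, without sending a SYN
        s.connect(peer)
        # Leave repair mode; the kernel carries on as if the handshake completed
        s.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR, 0)
        return s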
I was playing with this during an interview last year and intended to write it up, but never got around to it. The technique seems to work as intended: I made a little prototype reverse proxy for it in Python, using a temporary listening socket with a drop-all SO_ATTACH_FILTER to allocate a port number and to keep Linux on the initiator side from responding with a RST to ACKs for a half-open connection it knows nothing about.
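The port-reservation trick looks roughly like this (again a sketch rather than the actual prototype; SO_ATTACH_FILTER takes a classic-BPF sock_fprog, so ctypes is used to hand the kernel a single "return 0" instruction that drops everything):

    import ctypes, socket

    SO_ATTACH_FILTER = 26
    BPF_RET_K = 0x06          # BPF_RET | BPF_K

    class SockFilter(ctypes.Structure):
        _fields_ = [("code", ctypes.c_ushort), ("jt", ctypes.c_ubyte),
                    ("jf", ctypes.c_ubyte), ("k", ctypes.c_uint)]

    class SockFprog(ctypes.Structure):
        _fields_ = [("len", ctypes.c_ushort),
                    ("filter", ctypes.POINTER(SockFilter))]

    def reserve_port():
        # "return 0" accepts zero bytes of every segment, i.e. drops it, so the
        # local stack never RSTs ACKs arriving for our half-open connections
        insns = (SockFilter * 1)(SockFilter(BPF_RET_K, 0, 0, 0))
        prog = SockFprog(1, insns)
        ls = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        ls.setsockopt(socket.SOL_SOCKET, SO_ATTACH_FILTER,
                      ctypes.string_at(ctypes.addressof(prog), ctypes.sizeof(prog)))
        ls.bind(("0.0.0.0", 0))   # let the kernel pick the port we'll reuse later
        ls.listen(1)
        return ls, ls.getsockname()[1]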
Ideally it's only necessary to transmit about one frame per minute for each half-open connection, but you're right: I found one site, India's rail operator IRCTC, that blocked this traffic very quickly anyway.
The original problem was load balancing, where the protocol is often not known and backends might be reached over a WAN. The method has some neat advantages, such as the target application not being woken up early from accept() (etc.) by the preconnection, and the initiator being able to check a backend's health before choosing it as the target of a forwarded connection, but mostly I just thought it was an interesting abuse of an obscure kernel feature.
One of the biggest interests & excitements I feel over QUIC & HTTP/3 is the potential for something really different & drastically better in this realm. Right out of the box, QUIC is "connectionless", using cryptography to establish the session. I feel like there's so much more possibility for a data center to move around who is serving a QUIC connection. I have a lot of research to do, but ideally that connection can get routed stream by stream, & individual servers can do some kind of Direct Server Return (DSR) for individual streams. But I'm probably pie in the sky with these overflowing hopes.
Edit: oh here's a Meta talk on their QUIC CDN doing DSR[1].
The original "live migration of virtual machines"[2] paper blew me away & reset my expectations for computing & the connectivity, way back in 2005. They live migrated a Quake 3 server. :)
Co-author of multiple QUIC libraries here: even though QUIC uses a connectionless protocol (UDP) and allows client IP addresses to change during the lifetime of a connection, a QUIC connection is actually extremely stateful - a lot more so than TCP. There's lots of state for each stream (including potentially very fragmented send and receive buffers) and for the overall connection. You could potentially serialize that, but it would be even more work than for TCP.
Normal applications would typically just make sure the client can reconnect as fast as possible (and QUIC can do that in 0 to 1 RTT), and then have suitable application-level semantics that limit any availability issues when a reconnect happens (e.g. for large downloads you can restart with a ranged request; for persistent connections the server can tell the client with a GOAWAY that it might shut down, so the client can reconnect early and avoid the availability gap).
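The download case really is just "ask again from where you left off" with a Range header. A minimal sketch (url and path are placeholders, and the server has to support ranges for the 206 path to kick in):

    import os, urllib.request

    def resume_download(url, path):
        offset = os.path.getsize(path) if os.path.exists(path) else 0
        req = urllib.request.Request(url, headers={"Range": f"bytes={offset}-"})
        with urllib.request.urlopen(req) as resp:
            # 206 means the server honoured the range; 200 means start over
            mode = "ab" if resp.status == 206 else "wb"
            with open(path, mode) as out:
                while chunk := resp.read(1 << 16):
                    out.write(chunk)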
My understanding is that in [1], the frontend proxy is still a single QUIC peer that holds all the state of the actual connection - otherwise they couldn't do connection-level flow control or overall congestion control. That layer just instructs another server about the packet layouts to send; it doesn't make the other layer handle QUIC transmission completely on its own.
Multihoming is one of the key features of most protocols invented after TCP (SCTP, QUIC, MPTCP), and for good reason: it is so useful in many scenarios.
But given the place where we ended up, maybe host addresses make more sense than interface addresses (ignoring the effect that would have on routing table aggregation).
I guess it all depends on your constraints. If you can use a message bus or high-level abstractions like Kafka, go for it. But for extremely low latency or very constrained environments, if you need to switch billions of packets per second, it's hard to beat the simplicity and efficiency of ARP+IP...
I use it a lot (without criu or libsoccr) for high-availability shenanigans, to avoid the reconnection delay. Machine A is 'main' and has established a connection to distant-machine 1. A crashes (the hardware stops), and machine B takes over its MAC, IP and all of its established TCP connections. Nothing can be seen from the distant-machine side (maybe a gratuitous ARP slips out... to no avail). And yes, for <1 millisecond takeover this is necessary, and I thank the criu project with all my engineering heart for not going the 'just put a module in there and be done with it' route but actually making sockets checkpointable and restartable, saving me from the pains of a userland network stack.
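The per-socket half of that is essentially what libsoccr wraps: put the socket into repair mode, read out its identity and sequence numbers, then thaw it and carry on. A minimal sketch, with constants from linux/tcp.h and CAP_NET_ADMIN required; real code also copies the unacked send queue, the unread receive queue, and window/option state:

    import socket

    TCP_REPAIR, TCP_REPAIR_QUEUE, TCP_QUEUE_SEQ = 19, 20, 21
    TCP_RECV_QUEUE, TCP_SEND_QUEUE = 1, 2

    def dump_tcp_state(s):
        s.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR, 1)   # freeze: nothing is sent
        state = {"local": s.getsockname(), "peer": s.getpeername(),
                 "mss": s.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)}
        s.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR_QUEUE, TCP_SEND_QUEUE)
        state["snd_seq"] = s.getsockopt(socket.IPPROTO_TCP, TCP_QUEUE_SEQ) & 0xffffffff
        s.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR_QUEUE, TCP_RECV_QUEUE)
        state["rcv_seq"] = s.getsockopt(socket.IPPROTO_TCP, TCP_QUEUE_SEQ) & 0xffffffff
        s.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR, 0)   # thaw; the connection lives on
        return state

(Restoring on machine B is the mirror image: repair mode, bind, seed the queue sequence numbers, connect, leave repair mode. And a socket closed while still in repair mode goes away silently, without a FIN or RST.)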
The actual details are far funnier and more interesting (we could talk about checkpointing not being an atomic operation from the kernel's point of view, or how you need to do some magic with "plug" qdiscs - and since qdiscs are applicable on egress only, you'll end up looking into IFBs; I love Linux, it is so versatile and full of amazing little features). Don't forget to hot-update conntrack too...
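For the curious, the qdisc part of that magic looks roughly like the classic IFB recipe: mirror ingress onto an ifb device and plug that device's egress while you take the snapshot. A sketch driving ip/tc via subprocess; device names are placeholders, and the plug keywords ("block", "release_indefinite") are what I remember iproute2's q_plug accepting, so verify against your iproute2 version:

    import subprocess

    def sh(cmd):
        subprocess.run(cmd.split(), check=True)

    def freeze_ingress(dev="eth0", ifb="ifb0"):
        # qdiscs shape egress only, so redirect ingress onto an IFB device
        # and plug *its* egress instead
        sh(f"ip link add {ifb} type ifb")
        sh(f"ip link set {ifb} up")
        sh(f"tc qdisc add dev {dev} handle ffff: ingress")
        sh(f"tc filter add dev {dev} parent ffff: protocol all "
           f"u32 match u32 0 0 action mirred egress redirect dev {ifb}")
        sh(f"tc qdisc add dev {ifb} root plug limit 65536")
        sh(f"tc qdisc change dev {ifb} root plug block")       # hold packets here

    def thaw_ingress(ifb="ifb0"):
        sh(f"tc qdisc change dev {ifb} root plug release_indefinite")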
And since libsoccr is GPL you might need to do this yourself, and you'll want to do it anyway, because it's interesting and you'll learn so many things.
My only gripe is that the checkpoint is still a bit slow; maybe if I keep annoying Jens Axboe on Twitter it'll soon be an io_uring chain <3.
Every microsecond counts, and HFT people rent datacenter space as close as possible to the physical exchanges... When the speed of light is your main concern, maybe the Linux kernel is not your friend anymore (although DPDK can help here). I'm happy people keep pushing the kernel so hard while still keeping it generic and composable, so we can all profit from this huge, amazing work.
Oftentimes it's mission-critical systems where latency is key, for many reasons:
- you have a human in the loop who has to make a split-second decision, and every millisecond counts
- you have a very short time to perform 'looped' operations - where the result of one measurement must be taken into account to effect the next measurement (adaptive optics, some radar systems, some mechanical control loops) and you can't wait.
You'd think 'oh, but you've got more than 1 ms for that'. Well, not always, since one must take into account the time to detect the failure and the time to switch over other parts of the system (which sometimes must be done in sequence with the connection takeover).
I'd say 'forget TCP' there, but we don't always get to decide the comm layer...
If RDS could fail over in 1 ms I would be extremely happy. In practice, a few minutes of downtime for things like DB upgrades is usually acceptable in the business I'm in; however, it's enough time to cause quite a lot of alerting noise.
If the window of unavailability was instead 1ms, there would be dramatically less noise, potentially none.
You capture and stream the state of the connection as often and fast as you can. Ideally the tcp state would be streamed in the same operation as checkpointing, still in io_uring.
CRIU is a really impressive tool, which can do some really cool things. It's pretty difficult to use, given the nature of what it's doing and all the moving parts to orchestrate. I added experimental support to Docker for migrating containers, but for lots of reasons it never made it as a full fledged feature.
Since the ELF interpreter and glibc initialize differently depending on CPU flags, migrating between arbitrary hosts probably can't be expected to always work.
(But in practice, we mostly rely on the fact that if you are in Google Cloud and specify a minimum CPU version, Google masks CPUID in that VM to exactly match that CPU, even if the underlying hardware is newer. They do this so live-migration of the whole VM works, but it also helps for process migration. I'm not sure where this is specifically documented but see e.g. https://stackoverflow.com/a/44507857 .)
That's probably why Hyper-V has an option to restrict guests to a common CPU instruction set, so they always see the same flags and migration stays possible.
Presumably packets which continue to arrive at the old host must get forwarded to the new host, and returning packets must be spoofed. Or does this also involve some upstream network reconfiguration?
Live migrating a process from one host to another sounds fun on paper, until you run into all the caveats. And it gets even more fun when you try to do it on a cloud provider, where your process state is coupled to the instance on which it runs (such as security group or instance metadata state).
> It's crazy fun on the cloud provider side too. Live migrating entire instances between two hypervisors isn't exactly the simplest thing to do.
Migrating VMs actually works pretty decently; the biggest latency is in telling the network equipment where the traffic should now go.
The attached hardware is the issue, as you said. If all you talk to is paravirtualized devices it generally works; you just need to make sure the target CPU supports the required features (we had to temporarily downgrade the exposed CPU to migrate between Intel and AMD hypervisors). But good luck if you have a GPU attached...
Moving to "migrate the process" (which is essentially what is required to migrate containers) adds a whole level of mess to deal with.
Any application that has internal state that can't be easily checkpointed/restored but has to be moved to a different node. For example, an ffmpeg process that is running a long transcode when you want to take advantage of cheap spot instances.
Aside from all the ones mentioned, one idea I’ve considered is what if the machine containing data you’re retrieving is elsewhere and there’s a lot of it. Rather than having the inbound machine be responsible for proxying all that data (and all the round trips to the user space proxy process), what if you just migrate ownership of the TCP connection to the other machine instead to do Direct Server Return (DSR)? Then you can cut a huge amount of overhead out.
In practice of course that’s easier said than done (eg what if there’s a proxy in between that had some data buffered in application buffers), but it’s a neat idea I want to revisit some day. And it’s totally possible that using something like QUIC is a better model than relying on migrating TCP kernel stack.
So, this is making a TCP Connection serializable in an ad hoc fashion.
> It is natural to want those connections to follow the container to its new host, preferably without the remote end even noticing that something has changed, but the Linux networking stack was not written with this kind of move in mind.
Some languages provide serialization of most anything by default, such as Lisp. Now, even in Lisp there are objects which don't make sense to serialize, including a TCP connection; however, the components thereof can be collected and sent across the wire or wherever else in a standardized way. The C language, in comparison, offers a few serialization routines for non-structured types, and that's about all.
So, my point is the ability to take running state, serialize it, and reinstate it elsewhere is only impressive to those who have misused computers for so long that they don't understand this was something basic in 1970 at the latest.
But this isn't state of the process image that needs to be serialised, it's state of the connection between two hosts and some kernel configuration on those hosts. Programming language doesn't play into it at all. Languages "such as Lisp" will have the exact same problem, for the same reason. Collecting all of the "components" of the connection and sending them to a different host won't make the other host start sending packets to the new recipient, or replay the in-flight packets (which is state on intermediate routers, different computers than the connected ones entirely), or fix the ARP tables on the neighbouring hosts. None of that is available, and certainly isn't writeable, to the host doing the serialising.
To play some silly semantics games, this isn't so much about _serialising_ a connection as it is about _deserialising_ the connection and having it work afterwards. That act has literally nothing to do with programming language.
> But this isn't state of the process image that needs to be serialised, it's state of the connection between two hosts and some kernel configuration on those hosts. Programming language doesn't play into it at all.
It is because UNIX is written in the C language that there are even multiple flat address spaces instead of segments or a single address space systemwide. The fact that the kernel exists at all is also due to this. It has everything to do with the implementation language.
> Languages "such as Lisp" will have the exact same problem, for the same reason.
Under UNIX, yes.
> Collecting all of the "components" of the connection and sending them to a different host won't make the other host start sending packets to the new recipient, or replay the in-flight packets (which is state on intermediate routers, different computers than the connected ones entirely), or fix the ARP tables on the neighbouring hosts. None of that is available, and certainly isn't writeable, to the host doing the serialising.
It may very well require some specialized machinery, but not nearly so much as one may think to be necessary.
> To play some silly semantics games, this isn't so much about _serialising_ a connection as it is about _deserialising_ the connection and having it work afterwards.
That's implicit. I needn't write of deserializing when writing of serializing, as one is worthless without the other, at least in most cases.
> That act has literally nothing to do with programming language.
Look at what Lisp and Smalltalk systems could do before UNIX existed and tell me that again.
> It is because UNIX is written in the C language that there are even multiple flat address spaces instead of segments or a single address space systemwide.
That is flat out wrong. C supports multi-programming in a system that has one address space (that includes the kernel too). Programs just have to be compiled relocatable.
You know, like what happens with shared libraries: which are written in C, and get loaded at different addresses in the same space, yet access their own functions and variables just fine.
Multics used segments and Lisp Machines had a single address space. UNIX breaks down quickly without multiple fake single address spaces for each program.
> Programs just have to be compiled relocatable.
Yes, and with unrestricted memory access, one program can crash the entire system.
> You know, like what happens with shared libraries: which are written in C, and get loaded at different addresses in the same space, yet access their own functions and variables just fine.
That is except when one piece manipulates global state in a way with which another piece can't cope, and at best the whole thing crashes. Dynamic linking in UNIX is so bad some believe it can't work, and instead use static linking exclusively.
> UNIX breaks down quickly without multiple fake single address spaces for each program.
So do MS-DOS, Mac OS < 9, and others: any non-MMU OS.
> Yes, and with unrestricted memory access, one program can crash the entire system.
That's true in any system with no MMU that runs machine-language native executables written in assembly language or using unsafe compiled languages.
Historically, there existed partition-based memory management whereby even in a single physical address space, programs are isolated from stomping over each other.
> when one piece manipulates global state in a way with which another piece can't cope
This problem is the same with both static and dynamic linking.
And lisp too!
> UNIX breaks down quickly without multiple fake single address spaces for each program.
Citation needed. I don't think my programs very commonly try to go completely outside their address space. The closest thing I see is null pointer crashes, which are still not very common, and those would work the same way in a shared address space.
Edit: Yes, fork doesn't work the same. That's a very narrow use case on the vast majority of machines.
But this isn't about address spaces. They're moving connections between hardware hosts. It sounds like you've got your own drum to beat, but this isn't about that.