> The connectx() functionality is valuable, and should be added to Linux one way or another. It's not trivial to get all its use cases right. Hopefully, this blog post shows how to achieve this in the best way given the operating system API constraints.
I think this write-up is valuable since it shows a deficiency that impacts real-world industry usage. This kind of public explanation of the problem and workaround is perfect ammunition to get support behind such an initiative. Also, the fact that Darwin already supports this is another point in their favour.
A few times during this read I thought "that was a bad decision on their part" but upon completing the article I changed my mind. Their needs are definitely complex but they aren't unreasonable. It seems like having kernel support for their use cases is appropriate.
A few things that weren’t mentioned in detail or which I skimmed over without noticing:
- The problem with bind before connect is that the OS must assume you might call listen() after bind() instead of connect(), and listen() requires the src ip/port to be unique. (Linux's IP_BIND_ADDRESS_NO_PORT option works around this; see the sketch after this list.)
- Running out of ports may affect short-lived connections too: by default when you (the client) close a connection, it goes into a ‘TIME_WAIT’ state that locks up the {src,dst}_{ip,port} quadruple for (by default) 2 minutes, to protect new connections with that quadruple from receiving packets that were meant for the previous connection.
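Here's a minimal sketch of the bind-before-connect workaround mentioned above, assuming Linux >= 4.2 for IP_BIND_ADDRESS_NO_PORT (the helper name is mine, not from the article):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Bind to a specific source IP but defer port selection to connect(),
     * so only the full 4-tuple, not the src ip/port pair, must be unique. */
    int connect_from(const char *src_ip, const struct sockaddr_in *dst) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) return -1;

        int one = 1;
        setsockopt(fd, IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, &one, sizeof(one));

        struct sockaddr_in src = { .sin_family = AF_INET, .sin_port = 0 };
        inet_pton(AF_INET, src_ip, &src.sin_addr);
        if (bind(fd, (struct sockaddr *)&src, sizeof(src)) < 0 ||
            connect(fd, (const struct sockaddr *)dst, sizeof(*dst)) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }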
I think one thing that weakly surprised me about the article was that Cloudflare seem to be mostly using the standard BSD sockets API and, presumably, the kernel interface to the network cards. I would have expected their main products (i.e. being a CDN) to use more specialised network cards and userspace networking (e.g. the final function they give for UDP does a ton of syscalls, and for TCP it still does several) with a non-BSD API. The main advantage would be having more control over their system and the ability to avoid a lot of unnecessary overhead from syscalls or APIs, but there could be other advantages like NIC TLS (though this can also be accessed through the kernel as kernel TLS).
I’m sure Cloudflare have reasons for it though. Maybe the hardware is more expensive than underutilisation, or hard to source, or buggy, or userspace networking just doesn’t improve things much. Many of those things can make servers more efficient, but that might not be necessary if many servers are needed at lots of edge locations, with each individual server not needing to be efficient. Or maybe those things are mostly just saving microseconds when the latency over the wire is milliseconds.
> I think one thing that weakly surprised me about the article was that Cloudflare seem to be mostly using the standard BSD sockets API and, presumably, the kernel interface to the network cards. I would have expected their main products (i.e. being a CDN) to use more specialised network cards and userspace networking (e.g. the final function they give for UDP does a ton of syscalls, and for TCP it still does several) with a non-BSD API.
You might be interested in another article from them: "Why we use the Linux kernel's TCP stack" [1]
> Running out of ports may affect short-lived connections too: by default when you (the client) close a connection, it goes into a ‘TIME_WAIT’ state that locks up the {src,dst}_{ip,port} quadruple for (by default) 2 minutes, to protect new connections with that quadruple from receiving packets that were meant for the previous connection.
This bit us hard at a previous job. A very important PHP script would open DB connections (no pooling), and when its load hit the magic threshold we started running out of ports. But requests would succeed as TIME_WAIT expired, so it just looked like some requests would hang for no reason.
It was tough for us to figure out. The final solution was obvious (pooling) but that took some doing because the script was quite old and it took a bit of an overhaul for that to work. Don’t remember exactly why. Tech debt is fun.
But we learned a new thing we had to be careful of, and a new failure mode we never expected.
Back in the HTTP/0.9 era TIME_WAIT was a big problem when running a busy [reverse] proxy. I remember using ndd to "tune" tcp_time_wait_interval on Solaris for this, as well as the anon (ephemeral) port range.
"The solution is more quadruplets" is for me one those things that's blindingly obvious once pointed out.
See also the other Cloudflare link, "Why we use the Linux kernel's TCP stack", in a sibling comment - about half-way through they reveal they use both kernel bypass (DoS can max out iptables at ~1Mpps) and Solarflare NICs (offload options not stated), as well as Intel NICs with netmap bypass.
GNUNet CADET (admittedly a very experimental and non-production system) uses a 512 bit port number. A port can be a string, like "https" or "my-secret-port" which is hashed with SHA-512 to produce the port number. I like this idea very much and dream of it becoming a reality on the web.
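A minimal sketch of that idea, using OpenSSL's SHA-512 for the hash (the service name is made up; link with -lcrypto):

    #include <openssl/sha.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char *service = "my-secret-port";
        unsigned char port[SHA512_DIGEST_LENGTH];   /* 64 bytes = 512 bits */

        /* The "port number" is just the SHA-512 of the service string. */
        SHA512((const unsigned char *)service, strlen(service), port);

        for (int i = 0; i < SHA512_DIGEST_LENGTH; i++)
            printf("%02x", port[i]);
        printf("\n");
        return 0;
    }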
They use SHA-2-512 everywhere else already, presumably for post-quantum resistance. I guess it gives them a smaller cipher footprint? They use the same data type in their codebase for hashes? 512 bits is sufficiently future proof? Future use cases that haven't been thought of yet, that require a secure hash? 64 bytes isn't that much overhead?
But I'm really just guessing, I'm not sure what their reasons are.
I agree in principle - SHA-2-256 could be a reasonable hash function whilst maintaining the properties of a secure hash function - or less than that if collision resistance isn't needed.
On the contrary, 64 bytes is a huge overhead. The standard maximum transfer unit (MTU) is 1500 bytes. So 64 bytes is 4.2% of the packet (at best!). Add the fact that you need a source and destination address, that's 128 bytes, or 8.5% of the MTU. Or, if you will, 8.5% (or more) of internet traffic, just in port numbers.
Maybe I'm misunderstanding the suggestion, but it would be a terrible idea to have 512 bit port numbers in TCP/UDP packets.
Post-quantum resilience feels like overkill for port numbers since they’re not secret (if they were, you wouldn’t be able to connect). 32-bit or even 64-bit would be more than enough in terms of extending our current scheme. If you wanted to use a collision-resistant hash of an arbitrary service string, 128 bits would also be enough, and even 64 bits might be too (again - these aren’t secret, so you’re only trying to avoid collisions and to resist port enumeration attacks).
Post-quantum resilience absolutely is overkill even if they are secret, since they're not the type of secret you'd care about that for, imo. But I suspect "we have a 512-bit port space" is intended to make them secrets, i.e. impossible for an attacker to brute-force, at which point you run into birthday attacks, so you need the square of the space you'd otherwise want. Even still, 512-bit is too high imo; 192-bit should be fine.
I like the idea, but if we can extend or change the protocols why use ports at all? Use IPv6 but assign the last two or more octets to individual services routed on the hosts.
A fantastic article that deals with an issue of real, practical importance. I was surprised, however, to hear no mention of file descriptor limits, and I'm curious as to why that's not relevant. I think the article could be at least marginally improved by adding a note about this topic, particularly around the time it discusses the 28k available ephemeral ports available.
We set it to max and move on. Also, we monitor to see if there is an application doing something wrong. There never was an issue with a single process eating too many descriptors, and during normal operation it doesn't happen.
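"Set it to max" is a one-liner per process, for what it's worth; a minimal sketch (the helper name is mine):

    #include <sys/resource.h>

    /* Raise the soft RLIMIT_NOFILE (open file descriptors) up to the
     * hard limit for this process. */
    int raise_nofile_limit(void) {
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) < 0) return -1;
        rl.rlim_cur = rl.rlim_max;
        return setrlimit(RLIMIT_NOFILE, &rl);
    }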
Maybe just because there are other good articles about that [1] and/or because you can hit the ephemeral port limit without even using that many ports from a single process. They gave examples of ssh (the client) and dig hitting the ephemeral port limit; no reason those programs would use a significant number of file descriptors.
This article is nice and it shows clearly what kind of problems servers for TCP-based infrastructure on the internet have to face.
The issue I have with it: It's showing how the problem can be solved for systems that you are in control of, but not for systems (clients) that you aren't in control of.
The reason why servers on the web are so easily DDoS-able is because of long-lived connection defaults among most higher-level protocol implementations... and because ISPs all over the globe intentionally throttle initial connections, delaying the initial SYN/SYN-ACK by up to 30-60 seconds, when their "flatrates" run out on mobile networks.
Server administrators then face the decision to either drop the mobile world completely or accept a situation where a couple of malicious actors can take down their infrastructure with methods as simple as a slowloris or SYN flood attack.
This also leaves aside the problem that ISPs allow (relayed) amplification attacks because they don't think it's their responsibility to track where traffic comes from and where it's supposed to go; I would disagree on that.
If I had influence like Cloudflare in the IETF/IANA/Internet Society, I'd try to push initiatives like these:
- Disallow ISPs from spreading network traffic that they can easily detect as amplification attacks. If UDP origin in the request packet != UDP target in the response packet, simply drop it.
- Force ISPs to block SYN floods. This isn't hard to do on their end, but it gets harder to do inside your own little infrastructure. If an ASN doesn't block SYN floods regularly, punish them with blocked traffic.
- Force lower network protocol timeouts. If a socket's network timeout is 60 seconds, it's total overkill for this small planet and only serves to allow this shitty behaviour of ISPs and nothing more. Even in the world of 56k modems, this doesn't serve a purpose anymore.
- (like Google) Try to push UDP-based networking stacks that (with above exception) try to solve this connection timeout problem by allowing to dynamically multiplex / multi-pipeline traffic through the framing protocol of the same UDP socket.
Nice article and the UDP section shows some heavy socket wizardry.
As they require all the clients to cooperate in the use of the SO_REUSEADDR pseudomutex, I wonder if some more explicit process-shared resources would have been better.
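A minimal sketch of one such explicit resource, assuming cooperating processes agree on a name: a process-shared pthread mutex in POSIX shared memory (the names are mine; robust init-once handling omitted; link with -pthread):

    #include <fcntl.h>
    #include <pthread.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Map a mutex that multiple processes can take around local
     * address/port selection, instead of abusing SO_REUSEADDR. */
    pthread_mutex_t *port_lock_init(void) {
        int fd = shm_open("/port-select-lock", O_CREAT | O_RDWR, 0600);
        if (fd < 0 || ftruncate(fd, sizeof(pthread_mutex_t)) < 0) return NULL;

        pthread_mutex_t *m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
        if (m == MAP_FAILED) return NULL;

        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(m, &attr);  /* real code must init exactly once */
        return m;
    }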
28,000 ephemeral ports is enough for the c10k problem, but 100k+ seems a stretch. At what point is increasing the number of ports from 28k to a higher number the right answer? Reuse as described here sounds like a useful optimization, but at some point (or due to a pathological workload) even that will be exhausted, I'd think. What to do then?
> At what point is increasing the number of ports from 28k to a higher number the right answer?
You can increase it to 65535, although you'll get funny looks from some people. Beyond that, you'd need TCP and UDP extensions, which seems unlikely to be deployed. Usually it's OK, you rarely need to connect more than 65,000 times to a single host; but if you do, you can try to convince them to open more listening ports, or use more listening IPs or more connecting IPs --- one way to encourage IPv6.
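Concretely, the Linux knob is net.ipv4.ip_local_port_range; this sketch widens it to the practical maximum (same effect as setting the sysctl; needs root):

    #include <stdio.h>

    int main(void) {
        /* 65535 is the hard ceiling: the port field is 16 bits. */
        FILE *f = fopen("/proc/sys/net/ipv4/ip_local_port_range", "w");
        if (!f) { perror("fopen"); return 1; }
        fprintf(f, "1024\t65535\n");
        return fclose(f) == 0 ? 0 : 1;
    }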
Really, the socket API should be updated to a single syscall that binds and connects (or listens) and lets you express what you actually need: whether it's to have a whole port reserved, or just a currently unique 4-tuple with no local information selected, or you've got one piece of the local information, or maybe you want a connection that will hash to the same CPU you're currently on. If you control all of the outgoing connections, you can do this work in userspace, but it's challenging; if you're running multiple processes that share the 4-tuple address space, it's a lot harder (PS, try to partition the space so processes don't compete). Turning two (or more) syscalls into a single one is useful anyway, especially if you've got Meltdown mitigations turned on.
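For reference, Darwin's connectx(2) is roughly that single syscall; a sketch from memory (treat the exact field names and flags as assumptions and check the man page before relying on them):

    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Express source and destination in one call; the kernel can then
     * pick a port that only needs a unique 4-tuple. */
    int dial(const struct sockaddr_in *src, const struct sockaddr_in *dst) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) return -1;

        sa_endpoints_t eps = {
            .sae_srcaddr    = (const struct sockaddr *)src,  /* may be NULL */
            .sae_srcaddrlen = src ? (socklen_t)sizeof(*src) : 0,
            .sae_dstaddr    = (const struct sockaddr *)dst,
            .sae_dstaddrlen = sizeof(*dst),
        };
        if (connectx(fd, &eps, SAE_ASSOCID_ANY, 0, NULL, 0, NULL, NULL) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }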
Since you worked at WhatsApp, you'd have seen it first hand how Rick Reed et al pushed the boundaries of FreeBSD network stack and Erlang BEAM... 2M+ (TCP) long-lived connections per server, though presumably to different (CGNAT?) destinations back in 2012: https://blog.whatsapp.com/1-million-is-so-2011/
Acknowledging that WhatsApp may have never hit the limits Cloudflare has, did FreeBSD make it easier than it would have been on Linux [0]?
Is it common for network load balancers to instead bypass the kernel (xdp, netmap, snabbswitch etc) and take over connection management entirely (perhaps, along with a hand-rolled TCP/IP flow manager)?
That's 2M+ inbound connections [1]. Inbound connections are relatively easy, because client IPs are pretty diverse, even when there's CGNAT. This is way easier than a proxy load like Cloudflare's. I briefly ran some HAProxy servers that helped finalize the migration to Facebook; that was 1:1, each inbound connection (that managed to send a valid enough hello packet within a minute) got an outbound connection, and that was a lot harder. I was able to do a pretty good amount of tuning for that specific case by mucking about in the (FreeBSD) kernel. IMHO, the Linux kernel isn't as easy to muck about with, but then I was never eased into it; it's also possible it wouldn't have needed as much mucking about, and FreeBSD 13 has some interesting-looking improvements which may have addressed the bottlenecks I was seeing, but I no longer have access to hyperscale optimization problems. :) TL;DR: to get HAProxy + FreeBSD 12 to do the max 1:1 TCP proxying, use only as many CPUs as RSS-able NIC RX queues (ideally don't have 2x14-core servers when you can only use 16 CPUs, but getting new hardware wasn't an option), run one HAProxy process per CPU (pinned), use RSS on inbound sockets and select the outgoing IP + port so that outgoing connections land on the same CPU (search for my patches on the HAProxy list), then tweak the kernel a bit. I think my patches are floating around on FreeBSD Phabricator or a mailing list somewhere, but they're not great; I did some tweaking to the TCB hashing and IIRC a little bit of mucking around with port selection/verification, but it's all a bit hazy. The bottleneck was definitely kernel locks around inserting new outbound connections.
WhatsApp pre-Facebook didn't use network load balancers; DNS round robin was good enough for us. We tried our hosting provider's load balancers for part of our service, but the uptime of the load balancers was worse than the uptime of the two servers behind it, so we didn't consider it for anything else; connection limits were too low for chat anyway.
Facebook uses proxygen for network load balancers, but I don't recall hearing anything about kernel bypass. They're mostly http(s), and they do request inspection and some responses from the edge, so IMHO it doesn't fit the usual case for hyperoptimizing the user/kernel interface: it's more appropriate to focus on that interface when you're CPU-limited but your application is doing very little.
If you had a traditional layer 2 load balancer (aka the good ones that let you do direct server return), I'd imagine it looks a lot closer to a kernel-table-based NAT than to HAProxy; that way most (if not all) of the traffic is handled by the kernel rather than ever going to the application. I wanted to build something fun like that for the proxy, but I also wanted something that I could set up and tune without feeling guilty about leaving a mess for someone when I left. I really, really wanted to build a kernel-bypassed userspace TCP stack for proxying, in Erlang, where I could hot-load changes, because it's a giant pain to drain hundreds of thousands of connections so you can update the kernel without disrupting users; way more fun to hot-load your TCP stack. Alas.
[1] I think Rick posted somewhere when we accidentally ran past 3 million connections on a chatd (nope, it was only 2.8M[2]), but the connection numbers went down a lot as we did better client-to-server encryption and authentication, and as the chat protocol got more complex, so we ran out of CPU because of our server logic. Facebook-hosted WhatsApp server nodes are way smaller machines (or were, when I left; single socket, fewer cores, less RAM), so the server count jumped quite a bit.
Edit: Sorry, this comment is full of run-on sentences and is mostly just a braindump of anything related to the prompt. Happy to take questions, either here (if at least tangentially related to the thread topic of lots of outbound connections) or by email, and maybe I'll make better sentences with follow-up questions :)
IP aliasing: that's the solution I always knew to use. I think it was added in Linux 2.x.
Edit: I guess it's been deprecated in recent kernels, and it was actually added in Linux 1.2, in 1995. Apparently the IP networking stack now just understands multiple addresses per interface.
Defeating TIME_WAIT is a bad idea: the source ephemeral port is how routers, firewalls, NATs, and other hosts identify a stream. Reusing it too soon can confuse all of them.
Just one more reason to redesign the tcp/ip stack. Can you imagine what our ABIs will look like if we're still hacking in kludges for another 40+ years? Opening a connection "the right way" will look like casting a spell with voodoo. Ooh, maybe we'll get cooler titles then, like Master of the Dark Network Arts.
Is this really a tcp/ip stack issue, or is this a BSD-sockets issue?
Since Cloudflare showed off how it is fixed with another API (aka: connectx) and/or ordering of bind/connect/etc. etc., it sounds more like a BSD-socket issue to me.
------
In fact, the closest thing to a "protocol error" to me is in the UDP section as far as I'm concerned?
The source of the problem is the source port, which is a random number that only signifies which application running on a source host requested one particular connection. And it has to be a random number, because if it isn't, it's a security issue. And re-use of ports is only necessary because we're using this finite range of numbers to signify a unique connection.
The deeper down the rabbit hole you go, the more you realize that little of modern connections are by design, but rather are built on a series of kludges used to fix problems and use-cases that the original design didn't cover.
A modern network stack could bring a lot of benefits. Here's a few things I would like in a new stack:
- Service discovery. The reason nothing on the internet can use any port other than "443" is that we failed to build in the ability for TCP to address services rather than applications. TCP/443 refers to the application layer of "HTTP", but also the transport-and-application layer of "TLS". And one IP can only ever host a single application on any given port. If you want to run multiple applications, you have to run one application which speaks the application protocol, deciphers what host you want from TLS, deciphers what host you want again from HTTP, and then opens up another connection to get you to the actual service you wanted. This is convoluted and unreliable - and decrypts the payload far from the actual application that should be looking at this data! Instead, the protocol's destination should be a combination of identifiers, and the stack should route the connection to the appropriate application. For example, instead of "host node01/port 443", you could open a connection to "host:node01/protocol:tls/protocol:http/service:public-web-pool-hackernews" (not that verbose, though; a hypothetical sketch follows this list). All the application needs to do is open a listening socket with the same connection string, and the operating system will look up the application matching that connection string, unbundle any protocol layers it can, and connect a new file descriptor to the waiting application. You could even have multiple applications open listening sockets on the same connection string and the OS could round-robin connections to them automatically. And routers could pass connections to services distributed on pools of hosts without the need for an "application-specific load-balancing reverse-proxy". We could run all kinds of layers of protocols and services, with no limitation of tying up a single integer "443" with all these different concepts. This would also allow us to expand the number of real internet protocols, instead of just stacking crap on top of HTTPS because we can't route any other traffic, since everyone's hard-coded themselves into this one port number.
- TLS becomes a part of the ABI. Any network application that wants to use TLS could open_tls_connection() and get a file descriptor. Just like applications shouldn't have to implement their own TCP/UDP packet parsers, they shouldn't have to bundle their own TLS implementations. This simplifies application code and increases speed by letting the kernel unbundle the payload from the encrypted session.
- With the above two changes, add authentication and authorization layers to the stack. Now any network service can have encryption, privacy, authentication, and authorization, without needing every application to explicitly build in support for a particular method. This would allow you to write some arbitrary web app that requires login, and add an implicit trust of authorizations from host "github.com" with realm "public". If you get a connection request that's signed properly, you know the user is an authorized github.com user (and a payload with their metadata). Imagine authenticating every single network call by default. This would make it dead-simple to secure all network endpoints that often get left exposed on the internet with no authentication.
- Distributed Tracing tags should be added to every layer and hop of a connection, the way an IP TTL is passed. This way you could know exactly where a connection is failing anywhere along its path. You could also pass tags forward along new connections to explicitly carry forward a trace, such as a database connection carrying a tag from a previous HTTP connection. No need for "DT infrastructure", every OS's stack would just come with tracing bundled and you could ls /proc/net/* to find out what was related to what.
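To make the service-discovery item above concrete, a purely hypothetical sketch; none of these functions exist, the names just illustrate the shape such an API could take:

    /* Hypothetical, invented prototypes -- purely illustrative. */
    int net_connect(const char *connection_string);  /* connected fd */
    int net_listen(const char *connection_string);   /* listening fd */

    void example(void) {
        /* The stack routes by identifiers, unbundles layers it knows
         * (e.g. TLS), and hands the application a plain descriptor. */
        int fd = net_connect("host:node01/protocol:tls/protocol:http"
                             "/service:public-web-pool-hackernews");

        /* Several processes could claim the same string; the OS would
         * round-robin incoming connections among them. */
        int lfd = net_listen("service:public-web-pool-hackernews");
        (void)fd; (void)lfd;
    }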
The other major reason everything is deployed on 443/tcp is that middleboxes ruthlessly block everything that isn't 443/tcp, and your stack design won't change that. There's huge incentive to fit everything (or, at least, everything TCP) into TLS on port 443 --- and it's mostly fine, because 443/tcp has better mux/demux features than plain-old-TCP anyways.
People are already doing work on moving TLS into the kernel. There are two big challenges with universalizing the idea. The first is that there's a lot of policy code that accompanies TLS, and having kernel state for all of it doesn't make sense. The other is that TLS changes a lot more frequently than TCP, and some of those changes are urgent, and you don't want to have to update a kernel and reboot your machine when all you really need is a service update and restart.
I believe my method could work around middleboxes if you make an onion. Each layer of the onion could be encrypted separately. The packet going over the internet would just expose "host:node01". When it got to the host, the stack would attempt to unravel the next layer, find that it's TLS, and use a cert matching "node01" to decrypt the payload. Then it would find the next layer, see that it's HTTP, and so on. You could forward many layers over many networks, with each node/service only decoding what it has a certificate (or authorization?) for, and only the final layer getting the full data payload.
Encrypted layers already exist in the current stack: MACsec for Ethernet, WPA et al. for wireless (and 3G/4G since day one), and so on for each option in the network/interface layer; IPsec for the network layer; TLS for the transport layer; and after that you're in the application layer, which can choose its own encryption.
The friction in getting the lower layers encrypted hasn't been a lack of options in the current stack; it's been convincing everyone to throw out their middlebox setups to allow the encrypted protocols universally. Well, that and encryption is more expensive (in both performance and implementation cost), so the less of it you can get away with, the easier it is to convince people to adopt. If it were easy to get people to change the middleboxes, we wouldn't be in this situation in the first place.
This is one of the main reasons why everything is over HTTPS these days: it was the lowest layer universally deployed net new protocol that was encrypted from day 1 and didn't have to worry about ossification as much. It's far easier to just use that as the new base transport layer and ignore the problem, doing so with new encrypted protocols over UDP when the functionality you're trying to replace in your protocol is at the TCP level.
Service discovery is already a function of DNS, ranging from decades-old bare bones (a SRV record saying which port the HTTP server is on) to more catalog-style infrastructure like DNS-SD. Things like LDAP clients actually obey service discovery already. Things like HTTPS servers/browsers have resisted it because of the issues it would open in the current (name-only-based) security infrastructure, not the network infrastructure. It's not like the security infrastructure side even needs to be redone from the ground up; it's just that the juice from changing the existing infrastructure hasn't been deemed worth the squeeze by anyone yet.
TLS being deeply embedded has led to more problems than it avoids. Things like QUIC and HTTP/3 are the result of even just TCP sockets being OS-managed: getting anything new to happen when the old way means 12 OS implementations updated independently of the program is never going to be popular with programs. That said, Linux, FreeBSD, and Windows all support kTLS APIs already, in the case of the latter for about two decades. It can be useful in some situations but it's not a silver bullet.
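A minimal sketch of the Linux kTLS flavour of this, assuming TLS 1.2 AES-GCM-128 secrets were already negotiated by a userspace handshake (error handling trimmed):

    #include <linux/tls.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    #ifndef SOL_TLS
    #define SOL_TLS 282
    #endif

    /* Attach the kernel's TLS ULP, then hand it the negotiated record
     * state; after this, plain write()s on the socket are encrypted
     * by the kernel (and sendfile() works on a TLS connection). */
    int enable_ktls_tx(int sock,
                       const struct tls12_crypto_info_aes_gcm_128 *ci) {
        if (setsockopt(sock, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
            return -1;
        return setsockopt(sock, SOL_TLS, TLS_TX, ci, sizeof(*ci));
    }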
Those layers already exist, actually an option at each layer: IPsec, DTLS, TLS, HTTPS. The lack of bottom up integration and universal adoption isn't from lack of having another option, fighting that it's easy to control the auth the way you want in the HTTPS layer is probably the main restriction.
Tracing tags are considered too problematic for privacy in modern protocols. E.g. QUIC offers an optional spin bit. Putting tracing tags about the full path of every packet certainly wouldn't fly, as convenient as it would make things. For debug packets only that's what ICMP already is.
There are a lot of cool things that could be done with a redesigned stack, mostly around preventing ossification of the lower layers, but it's not like it's ever been golden from day 1 or will ever be golden "if we just get it right next time"™. If the journey thus far has taught us anything it's that the internet protocol stack should design itself to avoid being everything included and instead try to be as transparent a transport it can so end applications can choose their own way of doing things without the assumptions of the transport getting in the way.
It’s true that TCP/IP doesn’t really follow the OSI 7-layer model, but I think the solution is to get rid of the old model that is not fit for purpose, rather than to rediscover the failures that led TCP/IP to succeed over the standards that were designed in tandem with the model. (Apart from the layers, the main things to survive from that effort to today are X.509 and LDAP.)
The only reason the current way is not "voodoo" to you is because you've seen it a thousand times before. The current way is already 40 years of hacking in kludges.
> Can you imagine what our ABIs will look like if we're still hacking in kludges for another 40+ years?
Stable, reliable and backward compatible?
Extensions are a better way to grow widely used APIs. They put a burden on the implementor, but don't force thousands of consumers on the eternal treadmill of obsolescence.
This article is written as though Linux isn't an open source OS. Basically every big player rolls their own kernel to get features they want. This use case here is pretty exotic and for 99% of people using Linux it's perfectly fine for their needs.
Many of the big players try to upstream as much as possible, whether their goal is laziness (not having to carry as many patches) or altruism. Ideally, that starts with a good description of the problem (like this blog post), with patches to follow.
Before making any change to kernel you need a strong use case description and explanation why it's not possible with current methods. Often, building this case is harder/more time consuming than writing the code itself.
This is kind of a soft-RFC model. You beat the drum until there is enough understanding and momentum to get the change in.