Linux TCP will have lockless listener processing 3.5M SYNs per sec (lwn.net)
334 points by csasscsscs on Oct 31, 2015 | 48 comments



This lock becomes a bottleneck when you try to scale the Linux TCP stack over a large number of cores (8+). This patch could be a huge improvement.

Because of the issues this lock causes, I had to implement a user-space, lockless TCP/IP stack for my Master's thesis last year:

https://github.com/raphaelj/rusty

https://github.com/RaphaelJ/master-thesis/raw/master/thesis....


Does the load overwhelm a single CPU? Does it not work to make the networking stack have affinity to a single CPU?


A 10G network can saturate a modern CPU if you're taking an interrupt per packet (adaptive coalescing disabled); on 40G, 56G, and 100G InfiniBand networks it's trivial :)

Multi-queue is absolutely mandatory for a modern networking stack, so this is really fantastic work.
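
(If you want to check this on your own boxes: NIC queue counts are usually inspected and changed with ethtool; eth0 here is a placeholder device name.)

    $ ethtool -l eth0                   # show current/maximum channel (queue) counts
    $ sudo ethtool -L eth0 combined 8   # spread RX/TX processing across 8 queues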


It's tenable that a stack under significant load would demand multithreading, but at this point in affinity-processing technology it is unnecessary to develop that capability. Doubling the queue length might lighten the load, but in my experience it may be better to pursue non-intrusive commits that free up buffer resources.


Libuinet is a userspace port of the FreeBSD stack that can be run lockless.


> 20 files changed, 307 insertions(+), 746 deletions(-)

Two orders of magnitude faster and half the code size? That's nice.


> This effort started two years ago [..].

And they are well-thought-out changes.


They probably worked on other things over the last two years as well. I wonder how many hours were actually spent modifying those 1000 lines of code. Probably a staggeringly high number; certainly not all LOC are equal.


What you said. I love seeing changesets like that.


I hate large change sets. They are impossible to code review.

I would rather have death by one million patches than that.


Sometimes the problem isn't one piece of code that can be solved by adding an if-statement somewhere; sometimes it's the whole architecture of the module, and the only way to solve that is to rip the whole thing out and replace it with something new. Reviewing such a "change" (it's really more of a new feature) requires you to understand it completely, not just within the context of an insertion between two other lines of code. Being too scared to do such a replacement can easily lead to stagnation and fire-extinguishing that never solves the root cause of a problem.


It's 17 commits, each of which does one thing.


I always find less is more with programming.


Until you simplify your sort function into insertion sort (which, admittedly, is really fast before container size hits triple digits).


Even quicksort gets faster if you check for runs, i.e. already-sorted parts. More code, but faster.
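
This trade-off is exactly why hybrid sorts exist: quicksort for large ranges, insertion sort once a partition gets small. A minimal sketch (not from the article; the cutoff of 16 is a common but arbitrary choice):

    #include <stddef.h>

    /* Insertion sort: quadratic, but very fast on small ranges. */
    static void insertion_sort(int *a, size_t n)
    {
        for (size_t i = 1; i < n; i++) {
            int key = a[i];
            size_t j = i;
            while (j > 0 && a[j - 1] > key) {
                a[j] = a[j - 1];
                j--;
            }
            a[j] = key;
        }
    }

    /* Quicksort over [lo, hi) that hands small partitions to insertion
     * sort; call as hybrid_sort(arr, 0, n). */
    static void hybrid_sort(int *a, size_t lo, size_t hi)
    {
        while (hi - lo > 16) {
            int pivot = a[lo + (hi - lo) / 2];
            size_t i = lo, j = hi - 1;
            for (;;) {              /* Hoare partition */
                while (a[i] < pivot) i++;
                while (a[j] > pivot) j--;
                if (i >= j) break;
                int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
                i++; j--;
            }
            hybrid_sort(a, lo, j + 1);  /* recurse into the left half */
            lo = j + 1;                 /* iterate on the right half */
        }
        insertion_sort(a + lo, hi - lo);
    }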


Can anyone comment on whether it would be a good or bad idea to try to implement this in production ASAP?

I'm a bit too far removed from the Linux kernel to be comfortable auditing that myself, but I run some high-volume/low-latency exchange clusters, and the limiting bottleneck on requests/box has always been SYN/ACK negotiation.

I solved that for now by having hundreds of smaller servers, which distributes the network load quite nicely but doesn't come close to maximal CPU utilization. Realistically, since moving from PHP to Haskell we're seeing about 1k req/s/box, but without the network slowing things down we're looking at an order-of-magnitude increase on the current hardware.

Just FYI, I've already handled the obvious: intelligent caching, nginx split upstreams, TIME_WAIT and reuse adjustments (1s), et al. Qualitatively, we're looking for an assured <= 100ms TTFB at the 99th percentile, in a way that lets us make the most of our hardware via green threads.


Consider the UDP-based QUIC protocol that Chrome now supports.

From http://blog.chromium.org/2015/04/a-quic-update-on-googles-ex...:

"For latency-sensitive services like web search, the largest gains come from zero-round-trip connection establishment. The standard way to do secure web browsing involves communicating over TCP + TLS, which requires 2 to 3 round trips with a server to establish a secure connection before the browser can request the actual web page. QUIC is designed so that if a client has talked to a given server before, it can can start sending data without any round trips, which makes web pages load faster. The data shows that 75% percent of connections can take advantage of QUIC’s zero-round-trip feature. Even on a well-optimized site like Google Search, where connections are often pre-established, we still see a 3% improvement in mean page load time with QUIC."

From https://www.chromium.org/quic:

"Key features of QUIC over existing TCP+TLS+SPDY include

* Dramatically reduced connection establishment time

* Improved congestion control

* Multiplexing without head of line blocking

* Forward error correction

* Connection migration"

https://en.wikipedia.org/wiki/QUIC


Since it sounds like your workload is nicely distributed, why not try it on a few nodes and compare?


I was actually planning on testing a 33% deployment this evening. I guess what I'm really wondering is, since this is a low-level networking adjustment that modifies locking behavior, are there 'gotchas' I should be aware of beforehand?

I'll test and report back regardless; I'm only afraid of the scenario where the test goes well and the entire network goes down a week later (e.g., the wonderfully-fun time we previously had chasing down rogue epoll queues and zombie processes that only appeared after a threshold of sustained load).


As with any new code, it may have bugs. The kernel could deadlock or crash due to insufficient locking.

You may want to start small and report any bugs that you find to help improve the code.


Did you get anything working?


I think the latest number is looking more like 6M SYN packets per second (http://marc.info/?l=linux-netdev&m=144431864418007&w=2)!

This is also awesome because it fixes the issue with SO_REUSEPORT where the 3WHS gets screwed up by the per listener queues. I think that we still need some extra help from another patch (http://marc.info/?l=linux-netdev&m=144331389206170&w=2) to allow SO_REUSEPORT applications to drain connections, but the original patch plus the SO_REUSEPORT_LISTEN_OFF patch, or whatever Eric and Tolga end up doing, should be enough to allow applications to finally do graceful shutdowns of listen sockets. This is very relevant to my interests because it means that HAProxy will be able to do a natively supported hitless reload on Linux for the first time ever.
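
(For anyone who hasn't used it: SO_REUSEPORT lets each worker process bind its own listener to the same address/port and have the kernel spread incoming connections across them. A minimal sketch; the port number and backlog are arbitrary.)

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Each worker calls this independently; all listeners share port 8080. */
    int make_reuseport_listener(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        int one = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(fd, 1024) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }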

This stuff is brilliant.


I'd like to see them fix the other scalability problems: hard coded maximum numbers of interfaces, socket API scaling issues, etc. You should be able to open as many sockets as you have RAM and push as much data as you have CPU to do so. A huge machine should be able to do fifty million TCP connections.


Are you sending the same data to 50 million people? I don't see you doing much computing for 50 million clients, even on great hardware.


This might be the sort of thing that is hard to imagine using before it's available, but it gets used eventually anyway...


I could see this being relatively obvious to use for a caching proxy like Varnish or maybe Squid. Static HTML doesn't require much processing, and this seems workable given the right hardware/software stack.


There are many use cases for having lots of connections open that aren't doing much: chat apps, IoT, pub/sub busses, etc.


How does this compare to DragonflyBSD which has a completely lockless network stack?

https://www.dragonflybsd.org/~aggelos/netmp-paper.pdf


And Sepherosa Ziehau has been improving the network stack even more over the last few years. I cannot find a good overview, but the commits usually include performance metrics. There's an interview in BSD Now episode 93: http://www.bsdnow.tv/episodes/2015_06_10-stacked_in_our_favo...



Just yesterday I was explaining why SYN cookies on Linux aren't fast: you can't send more than, say, 300k pps of SYN cookies for a single socket.

This patch seems to fix that. I wonder if it has been merged yet.


It's in linux-next, so this will show up in 4.4:

https://git.kernel.org/cgit/linux/kernel/git/next/linux-next...

[EDIT: s/4.5/4.4/]


This will make DoS attacks harder.


Right. This doesn't mean connections open at that rate. It just means that systems are more resistant to SYN flooding attacks.

Originally, upon receipt of a SYN, hosts began to open a connection and allocated the necessary state-table and buffer resources. By sending bogus SYN requests, an attacker could tie up those resources until the connection-opening timer expired and released them. Defending against SYN flooding means reducing the resources committed upon receipt of a SYN; connection resources are only committed on the next packet of the handshake. Most SYN flooding attacks use a phony source IP address, so the SYN-ACK reply goes to the phony address and the handshake never completes.
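
The fully stateless version of this defense is the SYN cookie: the server encodes everything it needs into the initial sequence number of its SYN-ACK instead of allocating a request structure, and reconstructs state only when a valid ACK returns. A toy sketch of the idea (the hash and bit layout here are illustrative, not Linux's actual syncookie scheme):

    #include <stdint.h>

    /* Illustrative mixing function; the real kernel uses cryptographic
     * hashes keyed on a per-boot secret. */
    static uint32_t cookie_hash(uint32_t saddr, uint32_t daddr,
                                uint16_t sport, uint16_t dport,
                                uint32_t count, uint32_t secret)
    {
        uint32_t h = saddr ^ daddr ^ ((uint32_t)sport << 16 | dport);
        h ^= count * 2654435761u;   /* mix in a coarse time counter */
        h ^= secret;
        h ^= h >> 16;
        return h;
    }

    /* The low 3 bits carry an index into a small table of common MSS
     * values, so the server can recover the client's MSS from the ACK
     * that comes back -- no per-request state stored in the meantime. */
    static uint32_t make_syn_cookie(uint32_t saddr, uint32_t daddr,
                                    uint16_t sport, uint16_t dport,
                                    uint32_t minute_count, uint32_t mss_index,
                                    uint32_t secret)
    {
        uint32_t h = cookie_hash(saddr, daddr, sport, dport,
                                 minute_count, secret);
        return (h & ~7u) | (mss_index & 7u);
    }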

There's now a TCP Fast Open proposal.[1] It allows sending data in the SYN packet, but that means committing resources on receipt of the SYN. That breaks this approach to DDoS mitigation, so TFO needs a different mechanism for dealing with phony SYNs. Is the new code prepared for that?

[1] https://tools.ietf.org/html/rfc7413


Perhaps not, but the TCP Fast Open RFC has some simple proposed mitigations: put a limit on the total number of PendingFastOpenRequests, and consider disabling TFO while the server is under a SYN attack. Both can be done without a major loss of availability; you simply revert to pre-TFO speeds/latencies.
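
For what it's worth, Linux's server-side API (since 3.7) exposes that first limit directly: the value passed with the TCP_FASTOPEN socket option caps the queue of pending Fast Open requests. A minimal sketch (the qlen of 16 is an arbitrary choice):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Bound the state an attacker can pin with data-bearing SYNs:
     * at most qlen TFO requests may be pending on this listener. */
    int enable_tfo(int listen_fd)
    {
        int qlen = 16;
        return setsockopt(listen_fd, IPPROTO_TCP, TCP_FASTOPEN,
                          &qlen, sizeof(qlen));
    }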


Can anyone explain why this took two years? Not trying to say they're slow, I'm just unfamiliar with how something like this goes from inception to patch. I'd like to try contributing myself someday.


These are complex parts of the kernel, and since they impact virtually every user, getting anything wrong is bad. It takes a while to merge such changes, and even just to get them right enough to feel comfortable submitting them.

I've worked in an area close to this (TCP performance tuning); my patches never made it into the kernel. Someone else later came along, took a slightly different approach, wrote much cleaner code, and got the changes in.

Small patches are rather easy to get applied; more intrusive ones (and changing locks is highly intrusive) are much harder. Not impossible, though, and the reward (for the community) is high; the internal feeling of succeeding is also great :-)


In addition to that, getting things right sometimes takes multiple attempts, at least for me.

I've seen that in the work I attempted on the Linux kernel TCP stack, and also in prior and later work-related projects. Large sweeping changes may require familiarization with a large body of code and with the impact of any single change on overall system behavior. The direction may be well-intentioned and good, but the details can prove a pitfall if a slightly wrong turn is taken. At that point you need to backtrack, and it takes a while to conclude that backtracking is needed and to find a different approach that will work.

Sometimes it takes time to realize that a direction is incorrect and a backtrack is needed, and then some more time to clear your head and find the right approach.


A compare-and-contrast of your code vs. theirs would be awesome to see: an analysis of the differences between the two approaches and why theirs got pulled upstream while yours didn't. Perhaps a retrospective on what you would have changed as well.


There are a few small patches of mine that did get in:

    8a3c3a9 [TCP]: Check num sacks in SACK fast path
    6f74651 [TCP]: Seperate DSACK from SACK fast path
    fda03fb [TCP]: Advance fast path pointer for first block only
    db3ccda [TCP]: Fix sorting of SACK blocks.
One can find the accepted patches from Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> in the git history. I'm trying to find my old patches; look for commits from 2006-2007, which are mostly relevant to this topic.

The big issue was that the retransmit list was over 10,000 items long, and the SACK/DSACK/retransmit loops needed to traverse this list from the start for each new packet that arrived. It made transmission performance fall off when doing 100 Mbps at 200 ms latency (a high bandwidth-delay product means a very large queue of in-flight packets).
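
Back of the envelope, to see where a queue that long comes from:

    100 Mbit/s ≈ 12.5 MB/s
    12.5 MB/s × 0.2 s RTT = 2.5 MB in flight
    2.5 MB / 1500-byte segments ≈ 1,700 segments in the window

and smaller segments plus retransmission and SACK bookkeeping can push the list being walked well past that, into the 10,000+ range.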

Edit: I found a set of patches, but doing a proper analysis would take more time than I'd prefer to invest over new things. Maybe I'll do it when I want a distraction from my current projects :-)


Assuming 3.5G instructions per second, that's approximately 1000 instructions per SYN, which includes parts of the NIC's driver and lower-level processing (e.g. interrupts) too. It would be interesting to see a breakdown of how many instructions are spent in which parts of the stack.


This is an interesting point, and it makes me wonder: is 3.5 GIPS a realistic measure of today's commodity CPUs? That figure looks more like a CPU clock frequency, and most instructions take more than one cycle to complete; on the other hand, pipelining and other optimizations may push the average cycles-per-instruction down to, or even below, 1. Can anyone more informed than me shed some light on this?


Depends on workload.

Compare PMC data for instructions retired vs. unhalted processor cycles.
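
On Linux, perf reports that ratio directly as "insn per cycle" (./your_workload is a placeholder for whatever you're measuring):

    $ perf stat -e cycles,instructions ./your_workload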


Cue in https://xkcd.com/619/ and your favorite desktop bug.


I'm guessing google etc. have had similar or better improvements in place in their internal stacks for years. Good to see progress not locked inside one company's datacenters!


The patch author works for Google.


Amazing


Wow!



