This lock starts to be a bottleneck when you want to scale the Linux TCP stack over a large number of cores (8+). This patch could be a huge improvement.
A 10G network can saturate a modern CPU if you're taking an interrupt per packet (with adaptive coalescing disabled); on 40G, 56G, and 100G InfiniBand networks it's trivial :)
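For a sense of scale, here is a back-of-the-envelope calculation of the worst-case packet rates involved. It assumes minimum-size Ethernet frames; the numbers are illustrative, not measured:

```python
# Line-rate packet counts for various link speeds, assuming minimum-size
# (64-byte) Ethernet frames plus preamble (8 B) and inter-frame gap (12 B).

def max_packets_per_second(link_bps, frame_bytes=64, overhead_bytes=20):
    """Worst-case packets/second a link can deliver."""
    bits_per_frame = (frame_bytes + overhead_bytes) * 8
    return link_bps / bits_per_frame

for gbps in (10, 40, 100):
    pps = max_packets_per_second(gbps * 10**9)
    print(f"{gbps}G: {pps / 1e6:.2f} Mpps worst case")
```

At ~14.9 Mpps for 10G alone, one interrupt per packet is far beyond what a single core can service, which is why coalescing and multi-queue NICs matter.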
Multi-queue is absolutely mandatory for a modern networking stack, so this is really fantastic work.
It's tenable that a stack under significant load would demand multithreading, but given the current state of CPU affinity handling it's unnecessary to develop that capability. Doubling the queue length might lighten the load, but in my experience it may be better to pursue non-intrusive commits that free up buffer resources.
They probably worked on other things over the last 2 years as well. I wonder how many hours were actually spent on modifying 1000 lines of code. Probably a staggeringly high number; certainly not all LOC are equal.
Sometimes the problem isn't one piece of code that can be solved by adding an if-statement somewhere; sometimes it's the whole architecture of the module, and the only way to solve that is to rip the whole thing out and replace it with something new. Reviewing such a "change" (it's more of a new feature, actually) does require you to understand it completely, and not just within the context of an insertion between two other lines of code. Being too scared to do such a replacement can easily lead to stagnation and fire-extinguishing that never solves the root cause of a problem.
Can anyone comment on whether it would be a good or bad idea to try to implement this in production ASAP?
I'm a bit too far removed from the Linux kernel to be comfortable auditing that myself, but I run some high-volume/low-latency exchange clusters, and the limiting bottleneck on requests/box has always been SYN/ACK negotiation.
I currently work around that by having hundreds of smaller servers, which distributes the network load quite nicely but doesn't come close to maximal CPU utilization. Realistically, since moving from PHP to Haskell we're seeing about 1k req/s/box, but without the network slowing things down we're looking at an order-of-magnitude increase on the current hardware.
Just FYI, I've already handled the obvious: intelligent caching, nginx split upstreams, TIME_WAIT and reuse adjustments (1s), etc. Concretely, we're looking for an assurance of <= 100ms TTFB at the 99th percentile, in a way that allows us to use the most of our hardware via green threads.
"For latency-sensitive services like web search, the largest gains come from zero-round-trip connection establishment. The standard way to do secure web browsing involves communicating over TCP + TLS, which requires 2 to 3 round trips with a server to establish a secure connection before the browser can request the actual web page. QUIC is designed so that if a client has talked to a given server before, it can start sending data without any round trips, which makes web pages load faster. The data shows that 75% of connections can take advantage of QUIC’s zero-round-trip feature. Even on a well-optimized site like Google Search, where connections are often pre-established, we still see a 3% improvement in mean page load time with QUIC."
I was actually planning on testing a 33% deployment this evening. I guess what I'm really wondering is: since this is a low-level networking adjustment that modifies locking behavior, are there 'gotchas' I should be aware of beforehand?
I'll test and report back regardless; only afraid of the scenario where the test goes well, and the entire network goes down a week later (e.g., a wonderfully-fun time we had previously chasing down rogue epoll queues and zombie processes that only occurred after a threshold of sustained load).
This is also awesome because it fixes the issue with SO_REUSEPORT where the 3WHS gets screwed up by the per listener queues. I think that we still need some extra help from another patch (http://marc.info/?l=linux-netdev&m=144331389206170&w=2) to allow SO_REUSEPORT applications to drain connections, but the original patch plus the SO_REUSEPORT_LISTEN_OFF patch, or whatever Eric and Tolga end up doing, should be enough to allow applications to finally do graceful shutdowns of listen sockets. This is very relevant to my interests because it means that HAProxy will be able to do a natively supported hitless reload on Linux for the first time ever.
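For readers unfamiliar with SO_REUSEPORT, here is a minimal Python sketch of the mechanism under discussion (an illustration of the socket option, not HAProxy's actual reload code): two independent listening sockets share one port, each with its own accept queue.

```python
# SO_REUSEPORT (Linux 3.9+): several sockets, same address/port, each with
# its own listen queue. The kernel hashes incoming connections across them.
import socket

def reuseport_listener(host="127.0.0.1", port=0):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind((host, port))
    s.listen(128)
    return s

old = reuseport_listener()                  # the "old" process's listener
port = old.getsockname()[1]
new = reuseport_listener(port=port)         # the "new" process binds the same port
# Hitless reload in outline: once the new listener is accepting, the old one
# closes, and fresh SYNs keep landing on the surviving queue (modulo the
# in-flight 3WHS races that the patches discussed above address).
old.close()
new.close()
```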
I'd like to see them fix the other scalability problems: hard coded maximum numbers of interfaces, socket API scaling issues, etc. You should be able to open as many sockets as you have RAM and push as much data as you have CPU to do so. A huge machine should be able to do fifty million TCP connections.
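As a rough sanity check on that fifty-million figure, assume on the order of 3 KB of kernel memory per idle established socket (an assumed round number for struct sock plus bookkeeping, not a measured value):

```python
# Back-of-the-envelope memory cost of 50M mostly-idle TCP connections.
# per_socket_bytes is an assumption, not a measurement.
per_socket_bytes = 3 * 1024
conns = 50_000_000
gib = conns * per_socket_bytes / 2**30
print(f"~{gib:.0f} GiB just for socket state")
```

That lands in the low hundreds of GiB, which is why "as many sockets as you have RAM" is the right framing: a huge machine has the memory, so the limits should be elsewhere.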
I could see this being relatively straightforward to exploit for a static-content proxy like Varnish or maybe Squid. Static HTML doesn't require much processing, and this seems workable given the right hardware/software stack.
And Sepherosa Ziehau has been improving the network stack even more for the last few years. I cannot find a good overview but the commits usually include performance metrics. There's an interview in bsdnow episode 93 http://www.bsdnow.tv/episodes/2015_06_10-stacked_in_our_favo...
Just yesterday I was explaining why syncookies on Linux aren't fast: you can't send more than, say, 300k pps of SYN cookies for a single socket.
This patch seems to fix that. I wonder if it has been merged yet.
Right. This doesn't mean connections open at that rate. It just means that systems are more resistant to SYN flooding attacks.
Originally, upon receipt of a SYN, hosts began to open a connection and allocated the necessary state table and buffer resources. By sending bogus SYN requests, an attacker could tie up many buffer resources until the timer for connection opening timed out and released the resources. Defense against SYN flooding involves reducing the resources committed upon receipt of a SYN. Only on the next packet of the handshake are the connection resources committed. Most SYN flooding attacks use a phony source IP address, so the reply SYN with ACK goes to the phony IP address and the handshake never completes.
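The stateless defense described above (SYN cookies) can be sketched in a few lines: encode just enough connection state into the server's initial sequence number so that nothing is allocated until a valid ACK comes back. This is an illustrative model only; the kernel's actual scheme additionally packs an MSS index into the ISN and uses keyed hashing.

```python
# Toy SYN-cookie scheme: the server keeps no per-connection state on SYN.
# The cookie is a keyed hash of the 4-tuple plus a coarse time counter; a
# later ACK is valid only if it echoes cookie+1.
import hashlib
import struct

SECRET = b"per-boot-secret"  # placeholder; real kernels use a random key

def syn_cookie(saddr, daddr, sport, dport, counter):
    msg = struct.pack("!4s4sHHI", saddr, daddr, sport, dport, counter) + SECRET
    return int.from_bytes(hashlib.sha256(msg).digest()[:4], "big")

def check_ack(saddr, daddr, sport, dport, ack_seq, counter):
    # Accept cookies minted under the current or previous counter value,
    # so a slow handshake still completes.
    for c in (counter, counter - 1):
        if ack_seq == (syn_cookie(saddr, daddr, sport, dport, c) + 1) % 2**32:
            return True
    return False
```

A spoofed SYN costs the server only the hash and the reply packet; state is committed only when `check_ack` succeeds.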
There's now a TCP Fast Open proposal.[1] This allows sending data in the SYN packet. But that means committing resources on receipt of the SYN, which breaks this approach to DDoS mitigation and requires a different mechanism for detecting phony SYNs. Is the new code prepared for that?
Perhaps not, but the TCP Fast Open draft has some simple proposed mitigations: put a limit on the total number of PendingFastOpenRequests, and consider disabling TFO if your server is currently under a SYN attack. Both of these can be accomplished without a major loss of availability; you simply revert to the pre-TFO speeds/latencies.
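The draft's mitigation amounts to a simple admission gate; here is a sketch of the idea with made-up names (the draft itself only describes the policy, not an API):

```python
# Cap the number of pending fast-open requests; past the cap, fall back to
# an ordinary three-way handshake with no data accepted from the SYN.
class FastOpenGate:
    def __init__(self, max_pending=1024):
        self.max_pending = max_pending
        self.pending = 0

    def admit_syn_with_data(self):
        """True if the SYN's payload may be processed immediately."""
        if self.pending >= self.max_pending:
            return False        # degrade to regular 3WHS: pre-TFO latency
        self.pending += 1
        return True

    def handshake_done(self):
        self.pending -= 1

gate = FastOpenGate(max_pending=2)
print([gate.admit_syn_with_data() for _ in range(3)])  # [True, True, False]
```

Under a SYN flood the gate saturates and the server simply behaves as it did before TFO existed, so availability degrades gracefully rather than collapsing.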
Can anyone explain why this took two years? Not trying to say they're slow, I'm just unfamiliar with how something like this goes from inception to patch. I'd like to try contributing myself someday.
These are complex parts of the kernel, and since they impact virtually every user, getting anything wrong is bad. It takes a while to merge such changes, and even just to get them right enough to feel comfortable submitting them.
I've worked in an area close to that (TCP performance tuning); my patches never made it into the kernel. Someone else later came along, took a slightly different approach, wrote much cleaner code, and got the changes in.
Small patches are rather easy to get applied; more intrusive ones (and changing locks is highly intrusive) are much harder. Not impossible, though, and the reward (for the community) is high; the internal feeling of succeeding is also great :-)
An addition to that is that getting things right sometimes takes multiple attempts, at least for me.
I've seen that in the work I attempted on the Linux kernel TCP stack, and also in prior and later work-related projects. Large sweeping changes may require familiarizing yourself with a large body of code and with the impact of any single change on overall system behavior. The work may be well-intentioned and headed in a good direction, but the details may prove a pitfall if a slightly bad turn is taken. At that point you need to backtrack, and it takes a while to conclude that backtracking is needed and to find a different approach that will work.
Sometimes it takes time to realize that a direction is incorrect and a backtrack is needed, and then some more time to clear your head and find the right approach.
A compare and contrast of your code vs their code would be awesome to see. E.g., an analysis of the difference between the two approaches and why his managed to get pulled upstream and yours didn't. Perhaps, a retrospective on what you would have changed as well.
There are a few small patches of mine that did get in:
8a3c3a9 [TCP]: Check num sacks in SACK fast path
6f74651 [TCP]: Seperate DSACK from SACK fast path
fda03fb [TCP]: Advance fast path pointer for first block only
db3ccda [TCP]: Fix sorting of SACK blocks.
One can find the accepted patches from Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> in the git history log. I'm trying to find my old patches. Look for commits from 2006-2007 which are mostly relevant to this topic.
The big issue was that the retransmit list was over 10,000 items long, and the SACK/DSACK/retransmit loops needed to traverse this list from the start for each new packet that arrived. It made transmission performance collapse when doing 100 Mbps at 200 ms latency (a high bandwidth-delay product means a very large queue of in-flight packets).
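For context, the bandwidth-delay product for those numbers works out as follows (the per-segment figure assumes full-MSS segments; with smaller segments the retransmit queue holds correspondingly more entries):

```python
# Bandwidth-delay product: bytes in flight at a given link speed and RTT.
def bdp_bytes(bandwidth_bps, rtt_s):
    return bandwidth_bps * rtt_s / 8

bdp = bdp_bytes(100e6, 0.200)         # 100 Mbps at 200 ms RTT -> 2.5 MB
print(f"{bdp / 1e6:.1f} MB in flight, ~{bdp / 1460:.0f} full-MSS segments")
```

Walking a list of that size from the head on every incoming ACK makes per-packet cost grow with the window, which is exactly the O(n) behavior being described.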
Edit: I found a set of patches but to do a proper analysis will take me just too much time that I prefer to invest into new things. Maybe I'll do that when I will want a distraction from my current projects :-)
Assuming 3.5G instructions per second, that's approximately 1000 instructions per SYN, which includes parts of the NIC's driver and lower-level processing (e.g. interrupts) too. It would be interesting to see a breakdown of how many instructions are spent in which parts of the stack.
This is an interesting point, which makes me wonder: is 3.5 GIPS a realistic measure of today's commodity CPUs? That figure seems to describe CPU clock frequency instead, and most instructions take more than one cycle to complete; on the other hand, instruction pipelining and other optimizations may lower the average cycles-per-instruction closer to (or even below) 1. Can anyone more informed than me shed some light on this?
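To make the clock-vs-throughput relationship concrete (illustrative numbers, not measurements of any particular CPU): sustained instruction throughput is clock rate times IPC, so 3.5 GIPS on a ~3.5 GHz core corresponds to IPC ≈ 1. Modern superscalar cores can retire several instructions per cycle on ideal code, but cache misses and branches often drag real-world IPC below 1 on pointer-chasing kernel paths.

```python
# Instruction throughput = clock frequency x instructions per cycle (IPC).
clock_hz = 3.5e9
for ipc in (0.5, 1.0, 2.0):
    gips = clock_hz * ipc / 1e9
    print(f"IPC {ipc}: {gips:.2f} G instructions/s")
```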
I'm guessing google etc. have had similar or better improvements in place in their internal stacks for years. Good to see progress not locked inside one company's datacenters!
Because of the issues that this lock leads to, I had to implement a user-space, lock-less TCP/IP stack for my Master's thesis last year:
https://github.com/raphaelj/rusty
https://github.com/RaphaelJ/master-thesis/raw/master/thesis....