Improving Linux networking performance (lwn.net)
235 points by sciurus on Jan 22, 2015 | 48 comments



I love reading articles like this: "So, for example, a cache miss on Jesper's 3GHz processor takes about 32ns to resolve. It thus only takes two misses to wipe out the entire time budget for processing a packet."

Then I go back to adding another layer of abstraction to my bloated Java code and die a little inside.
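
For scale, here is the back-of-the-envelope arithmetic behind that budget, as a sketch assuming 10GbE at minimum frame size (64-byte frame plus 8-byte preamble and 12-byte inter-frame gap on the wire); the 32ns figure is the one from the article:

    /* Per-packet time budget at 10 Gb/s line rate with minimum-size frames. */
    #include <stdio.h>

    int main(void)
    {
        const double link_bps   = 10e9;          /* 10 Gb/s line rate        */
        const double wire_bytes = 64 + 8 + 12;   /* min frame + preamble+IFG */
        const double ns_per_pkt = wire_bytes * 8 / link_bps * 1e9;
        const double miss_ns    = 32.0;          /* cache miss, per the article */

        printf("packet rate:  %.2f Mpps\n", link_bps / (wire_bytes * 8) / 1e6);
        printf("time budget:  %.1f ns per packet\n", ns_per_pkt);
        printf("cache misses: %.1f misses eat the whole budget\n",
               ns_per_pkt / miss_ns);
        return 0;
    }

That works out to roughly 14.88 Mpps, a 67.2 ns budget per packet, and just over two 32 ns misses to consume it, which is the article's point.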


Actually, guys like him are making our lives easier so we don't have to worry about a lot of this stuff. Abstractions are a necessity; don't feel bad. There is no way a small team using Java/Scala could accomplish what they usually do, in the time frame they do it, without those abstractions. Computers are there to ease our lives, after all.

p.s. not undermining the need to optimize and all


For completeness: a memory access that misses the last-level cache generally incurs 60-110 ns of latency on recent DDR3-based x86 hardware, see e.g. http://www.sisoftware.co.uk/?d=qa&f=ben_mem_latency&l=en&a=

I don't know exactly what the 32ns measurement is from; it sounds similar to the "in-page" figures on the above page.


yeah, it's easy to cry when you see a real artist working on something, and you have to go back to your etch-a-sketch.


Real art may inspire the soul, but road signs keep you from dying.


Indeed, there have been a lot of "bypass the kernel" campaigns in the last couple of months. Robert Graham's 2013 Shmoocon talk is a great introduction to the whys and hows of this movement.

https://www.youtube.com/watch?v=D09jdbS6oSI

Facebook had a job posting that showed up on HN for a position to help speed up Linux's networking stack. While I doubt these improvements will surpass the kernel bypassing model, I'm glad a developer has decided to tackle this head on and help overall efficiency.


Look no further than Solarflare's OpenOnload[1], Mellanox's VMA[2], Myricom's DBL[3], Intel's DPDK[4], etc. to see that kernel-bypass tech is used in 10G+ networking all the time.

It is not just faster, it is also lower latency. Many of the features used in low-latency environments, like busy polling, have been slowly making their way upstream[5] (a minimal sketch of that knob follows the links below). I think Linux will get pretty close to the userspace kernel-bypass stuff, and this is coming from a guy who works in a low-latency environment for $REAL_JOB.

[1] http://www.openonload.org/

[2] http://www.mellanox.com/page/software_vma?mtag=vma

[3] https://www.myricom.com/support/downloads/dbl.html

[4] http://www.intel.com/content/www/us/en/intelligent-systems/i...

[5] https://lwn.net/Articles/540281/
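
As a concrete illustration of the busy-polling knob referenced in [5], here is a minimal sketch; it assumes a 3.11+ kernel built with CONFIG_NET_RX_BUSY_POLL and a NIC driver that supports busy polling (the same behaviour can also be enabled globally via the net.core.busy_read / net.core.busy_poll sysctls):

    /* Opt a single UDP socket into busy polling: blocking reads spin on the
     * NIC for up to the given number of microseconds before the task sleeps,
     * trading CPU for latency.
     */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    #ifndef SO_BUSY_POLL
    #define SO_BUSY_POLL 46            /* value from asm-generic/socket.h */
    #endif

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        int busy_usec = 50;            /* spin up to 50us per receive */
        if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                       &busy_usec, sizeof(busy_usec)) < 0) {
            perror("setsockopt(SO_BUSY_POLL)");
            return 1;
        }

        /* ... bind() and recv() as usual from here on ... */
        printf("busy polling enabled: %d us\n", busy_usec);
        return 0;
    }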


Another great example is snabbswitch, which combines a lot of really great technologies to provide superlative performance:

https://github.com/SnabbCo/snabbswitch/wiki

Hypervisor, Kernel, LuaJIT, Userspace. Amazing!


There is a lot more than "bypass the kernel" here. You need special DMA hardware that will preload CPU cache lines and stuff like that. Maybe even specialized CPU cache that can be pinned.


It seems to be pretty uncommon for Linux setups to have, e.g., TOE enabled, but in my humble opinion it is an easy performance win on 10G networks.


TOE isn't supported under Linux and it was only a win if your TCP stack was slow (and by "your" I mean Windows). TSO is enabled by default in Linux and is indeed an easy win.
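
A quick way to check that on a given box is `ethtool -k <iface>`; the sketch below does the same query from C via the classic ETHTOOL_GTSO ioctl (the interface name "eth0" is just an example):

    /* Query whether TSO is currently enabled on an interface. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <linux/ethtool.h>
    #include <linux/sockios.h>

    int main(void)
    {
        struct ethtool_value eval = { .cmd = ETHTOOL_GTSO };
        struct ifreq ifr;

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
        ifr.ifr_data = (void *)&eval;

        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0 || ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
            perror("ETHTOOL_GTSO");
            return 1;
        }
        printf("TSO on %s: %s\n", ifr.ifr_name, eval.data ? "on" : "off");
        return 0;
    }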




Too bad improvements in network technology haven't found their way to the consumer level. It's probably related to stagnant broadband speeds: last-mile bandwidth improvements slowed to a crawl many years ago, and concurrently device connectivity actually moved to a slower networking tech, Wi-Fi. Now people are happy to use Wi-Fi for home desktop computers, since their last-mile connection is so slow anyway.

It's been 10 years since motherboard-integrated 1G became commonplace in regular PCs; the same for 10G is nowhere in sight...


10 gig still isn't even commonly available on server motherboards, because of power / space / cost. There also aren't many copper 10 gig top-of-rack switches, just the Cisco Nexus 3064T and Arista 7050T come to mind. Juniper doesn't even have one.

It's easier for a lot of places to use twinax with 10 gig SFP+ switches rather than going copper 10 gig. That is definitely not going to trickle down to the consumer level.

It will probably be another 1-2 years before 10 gig is ubiquitous at the server level, and another 2-3 years after that before it is commonly on consumer equipment. Or maybe it never will be, and things will go in another direction.


> There also aren't many copper 10 gig top-of-rack switches, just the Cisco Nexus 3064T and Arista 7050T come to mind. Juniper doesn't even have one.

I might be missing your definition of ToR. The Juniper QFX5100 series has the 48T, which does 48x 10GBASE-T plus 6x QSFP; the 5100-96S does 96x SFP+ and 8x QSFP. There are plenty of other cheap merchant-silicon platforms that look similar. Personally I'm happy with DAC on SFP+ ports.


Oh yeah, I always forget about the QFX series. Seems like that would do the job.

List prices are utter nonsense for switches, but the QFX does come in above the other two I mentioned. Perhaps because of its fibre-channely nature that no one (I hope) cares about.

QFX5100-48T: $24,000; Nexus 3064-T: $13,000; 7050T-52: $20,000

Still, any of these would work if you get the right deal. I could see an advantage for 10g copper if I had mixed racks where not all of my hosts needed 10g on the server side, but that's a big premium to pay over 1g TOR if you aren't using lots of 10 gig ports.

For me, I just use copper on 1 gig racks and DAC on 10 gig racks. So far, so good.


Last-mile speeds for end users aren't that important; most networking interaction they require goes to the internet anyway. Internet providers in the USA trying to push 5Mbit to be classified as "broadband" should tell you enough... I'm used to full-speed 100Mbit internet (actually 160, but my router only has a 100Mbit connection), but for average usage the difference between 50 and 100Mbit is rarely noticeable - unless you download big files.

Once 4k or 8k streaming becomes more common, sure, then you'd like to have full gigabit, where the max bitrate would be 340Mbit - but for even decent full-HD streaming, nobody needs more than 50Mbit (in real throughput, that is).

Higher speed things are more important on servers that have to handle more and more connections, and the end of this horizontal expansion is not in sight. Everybody starts using the internet for more and more things, and this all has to end up on servers that have to handle connections from millions of clients.


10 gigabit over Cat6A cable is also power hungry and very tricky: It's approaching the limits of what you can do over that sort of cable. Plus the frequencies involved make the adapter design a lot more troublesome. I suppose we'll start seeing cheap 10 gig knockoff nics eventually, but it's not even everywhere in the datacenter, never mind the home.


Yep. Fiber is the way to go when you need 10Gbit. I was looking into doing my place with Cat6 or 6a. Not only is 6a incredibly thick (thus more difficult to run and probably to terminate), but you need special tools and knowledge just to make sure the wiring is correct and there are no RF issues going on. It becomes way more about RF crosstalk and science, and less about simple wire connectivity.


Maybe SR/LR is the right choice for home. In the DC/rack some sort of DAC is the way to go. DAC will save you like 5 watts per port over optics.


Here you are: motherboard-integrated dual 10G from Supermicro [1], around £400. They should become commonplace this year, I think; it is just starting.

[1] http://www.supermicro.com/products/motherboard/Xeon/C600/X9S...


Yeah, they exist in a select few workstation-class Xeon boards. Intel announced those (server-targeted) chips in 2012; they take time to show up even in this high-end niche. As there haven't been any announcements about mass-market parts, it would be a small miracle if they showed up this year.


So, a local ISP has recently released 10Gbit fiber to home (for a tiny subset of the City).

Would any consumers be able to use this, or would it only make sense with a commercial switch (probably as a shared connection for an apartment complex or similar)?

What would I have to do at home to take advantage of it?


Yea, with the speed of SSDs increasing so quickly, the rate at which a desktop computer can transfer and store a file has greatly exceeded what gigabit can provide. At the SME level, 10G is still too expensive to justify at this point for regular workstations.


Am I the only one who thinks we need to start with a re-evaluation of BSD sockets first? I know Apple tried and gave up, but it just seems like so many of the building-block pieces we use every day could really use a major polish or some good competition.


This article is mostly about the lower parts of the network stack, like QoS and talking to the NIC; the user API is kind of orthogonal but equally important. There have been several research projects on improved networking APIs; my favorite is IX, which gets line-rate performance while retaining kernel/user protection. https://www.usenix.org/conference/osdi14/technical-sessions/...


This reminds me of MegaPipe. Basically it creates a pipe between kernel and user space. It uses batching too https://www.usenix.org/conference/osdi12/technical-sessions/...


It bugs me that the source isn't available for stuff like this. Makes it tough to objectively evaluate things.


Possible contrast & compare: the presentations on OpenBSD's network stack at http://openbsd.com/papers/


The first relevant-seeming presentation is from 2009[1] and doesn't really get into the pitfalls of low-latency switching/handling like this article does.

I'm definitely interested to see how other operating systems handle this, though. In particular: Windows (is networking in user-mode?) and Solaris-likes.

[1] http://quigon.bsws.de/papers/2009/eurobsdcon-faster_packets/


Why are we still using 1500 byte packets at 100G again? Seems like there won't be any tricks left to make 1000G work. Does that count as technical debt?


Pretty much everything has supported 9k jumbos for over a decade. The internet's mostly 1.5k MTU, but you normally aren't doing multi-gigabit streams over public connectivity. The other argument is TSO: your kernel's probably writing a 64K "packet" to the NIC driver, and when segmentation etc. is handled by the hardware, why do you care about the MSS? On the network device side the SerDes are the issue, and we're already running parallel lanes there; 40 is 4x 10G lanes and 100 is 4x 25G lanes. Why not 10x 100 in a couple of years?


> When segmentation etc is handled by the hardware why do you care about the MSS?

Because of Ethernet's mandatory minimum inter-packet spacing.


OK... So looking at the IFG as 96 bits or 12 bytes of "overhead", that's 0.8% or 0.13% for 1.5k and 9k frames. Why do I care about 0.67% of throughput? And pretty much all silicon from the last decade does line rate at 1k anyway. Or if it's latency, a hypothetical higher-clocked lane would be something like 1ns instead of ~3ns per frame? That's the difference between 2 clock cycles and 6 cycles of latency.

So what is your shorter/faster IFG buying in practice?
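
A sketch of that arithmetic, treating the 12-byte IFG as the only fixed per-frame cost (preamble and FCS would add a little more, but don't change the conclusion):

    /* Share of wire time spent in the inter-frame gap for a given frame size. */
    #include <stdio.h>

    static double gap_overhead_pct(double frame_bytes, double gap_bytes)
    {
        return 100.0 * gap_bytes / (frame_bytes + gap_bytes);
    }

    int main(void)
    {
        const double ifg = 12.0;     /* 96-bit minimum inter-frame gap */
        printf("1500-byte frames: %.2f%% of wire time in IFG\n",
               gap_overhead_pct(1500, ifg));
        printf("9000-byte frames: %.2f%% of wire time in IFG\n",
               gap_overhead_pct(9000, ifg));
        return 0;
    }

That prints roughly 0.79% and 0.13%, i.e. the fraction-of-a-percent difference being discussed.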


I've seen an audio bus that had to shorten the gap to have enough bandwidth, but that was with 100Mbit.


Interesting article. Reminds me of my time in grad school :-)

Looks like Jesper's recommended way is to improve, or find a way to bypass, the memory management (slab allocation) subsystem.

There should be a way to tack a more network-optimized memory management layer or allocator onto the regular one.

Could turn out to be a good research project.
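
To make that concrete, here is a toy user-space sketch of the idea: a recycling pool that hands out fixed-size packet buffers and refills in batches, so the hot path never touches the general-purpose allocator. All names are made up for illustration; this is not the kernel's slab/skb API.

    #include <stdlib.h>

    #define PKT_BUF_SIZE 2048
    #define POOL_BATCH   64

    struct pkt_pool {
        void *bufs[POOL_BATCH];
        int   nr_free;
    };

    /* Refill in one batch, so the general allocator is hit rarely, not per packet. */
    static int pool_refill(struct pkt_pool *p)
    {
        while (p->nr_free < POOL_BATCH) {
            void *buf = malloc(PKT_BUF_SIZE);
            if (!buf)
                return p->nr_free ? 0 : -1;
            p->bufs[p->nr_free++] = buf;
        }
        return 0;
    }

    static void *pkt_alloc(struct pkt_pool *p)
    {
        if (p->nr_free == 0 && pool_refill(p) < 0)
            return NULL;
        return p->bufs[--p->nr_free];
    }

    /* Freed buffers are recycled instead of returned, keeping the fast path cheap. */
    static void pkt_free(struct pkt_pool *p, void *buf)
    {
        if (p->nr_free < POOL_BATCH)
            p->bufs[p->nr_free++] = buf;
        else
            free(buf);
    }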


The first thing they should do is what Facebook is doing: turn to FreeBSD for ideas in their attempt to make Linux's networking as good as FreeBSD's: http://www.theregister.co.uk/2014/08/07/facebook_wants_linux...


IIRC netmap was a cool concept born on FreeBSD but also available for Linux.

http://info.iet.unipi.it/~luigi/netmap/


So...we're all upvoting this hoping someone else understands networking at this level, right? And that maybe they'll do something awesome with it.


I probably couldn't improve the actual code myself, but I do understand what is being talked about conceptually. I enjoyed the article and learned a bit from it, thus I upvoted.


I understood this article and it is relevant to my interests since 25G NICs are coming this year.


source? I thought the next step was 40G?


http://25gethernet.org/

40G has been out for a few years but it's fairly expensive since it uses four lanes. 25G will be the best option if you need something faster than 10G IMO.


To expand on that: on a switch chip today, like the common Trident 2, you have 10 and 40 gig interfaces. The 40 gig ones are just four lanes bonded together (10 gig being a single lane). These 25 gig products run each lane at 25G instead of 10, so you get a 25G port at the same density you used to have 10, 50G at double the density of 40, and 100Gbit/s at the old 40 gig density.

I think this is largely being driven by the server folk, who want to connect at 25G instead of 10.


And to expand on that slightly, it is not just about density, but also latency. Since 40G is clocked the same as 10G (as it is 4x 10G lanes), upgrading from 10G infrastructure to 40G will not improve overall latency, whereas 25G is clocked faster and will have lower overall latency. [Overall meaning the first packet won't arrive relatively sooner, but the second one will.]
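
The raw per-lane numbers behind that point, i.e. how long it takes to clock a single 1500-byte frame out at a given lane rate (this ignores striping across bonded lanes and any switch-internal latency):

    #include <stdio.h>

    int main(void)
    {
        const double frame_bits   = 1500.0 * 8;
        const double lane_rates[] = { 10e9, 25e9 };   /* bits/s per lane */

        for (unsigned i = 0; i < sizeof(lane_rates) / sizeof(lane_rates[0]); i++)
            printf("%2.0fG lane: %4.0f ns to serialize a 1500-byte frame\n",
                   lane_rates[i] / 1e9, frame_bits / lane_rates[i] * 1e9);
        return 0;
    }

That is 1200 ns on a 10G lane versus 480 ns on a 25G lane.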

There may be latency improvements inherent from design improvements within newer packet processing / switching ASICs themselves, just like Intel's tick/tock. For example, NASDAQ offers a 10G handoff to their 40G infrastructure which is 5-9us faster because of topology and equipment upgrades [1].

The nice thing about 40G is that it is compatible with 10G so you can selectively upgrade your switches to 40G but keep your 10G NICs, using these neato QSFP+->SFP+ breakout cables [2].

[1] http://www.nasdaqtrader.com/content/Productsservices/trading...

[2] https://www.google.com/search?q=qsfp+breakout&tbm=isch


There was an attempt to do that with 2.5G, but it has been relegated to backplanes and was never formally standardised.


The Atom server boards released last year actually have 4x 2.5G lanes. As far as I've seen, everyone just uses 4x 1G SerDes on them instead of the hybrid 10.




