Capturing Millions of Packets per Second in Linux without Third-Party Libraries (kukuruku.co)
249 points by andreygrehov on Oct 24, 2016 | 45 comments



In case you're wondering about the different Linux kernel-bypass mechanisms, here's the relevant slide from a recent talk: https://lh3.googleusercontent.com/TO1UdUicn1wuF4jIAhskikO6ML...

Sorry, I haven't found the actual slides yet; that's why it's a photo taken by someone who attended the talk.


This must be the 10th blog post on the same topic to land on HN, and they all walk through the same steps and all use the same hardware (ixgbe), which is, by the way, a hard prerequisite for many of these strategies to be effective.

In any case, stop reinventing the wheel; just use a purpose-made library:

http://dpdk.org/


I am a Snabb hacker and I see things differently. Ethernet I/O is fundamentally a simple problem, DPDK is taking the industry in the wrong direction, and application developers should fight back.

Ethernet I/O is simple at heart. You have an array of pointer+length packets that you want to send, an array of pointer+length buffers where you want to receive, and some configuration like "hash across these 10 rings" or "pick a ring based on VLAN-ID." This should not be more work than, say, a JSON parser. (However, if you aren't vigilant you could easily make it as complex as a C++ parser.)
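To make that concrete, here is a minimal C sketch of what such an interface could look like. This is purely illustrative (not Snabb's actual API; all of the names are made up):

    #include <stddef.h>
    #include <stdint.h>

    /* A packet (or an empty receive buffer) is just pointer + length. */
    struct pkt {
        uint8_t *data;
        uint16_t len;
    };

    struct nic;  /* opaque per-device state owned by the driver */

    /* Hand the NIC an array of descriptors; it returns how many it
     * actually transmitted or filled in. */
    size_t nic_transmit(struct nic *n, const struct pkt *tx, size_t count);
    size_t nic_receive(struct nic *n, struct pkt *rx, size_t count);

    /* Configuration stays equally small, e.g. "hash across N rings"
     * or "pick a ring based on VLAN ID". */
    int nic_configure_rss(struct nic *n, unsigned num_rings);

The real work is the thousand or so lines behind those functions that map the arrays onto a specific NIC's descriptor rings and registers.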

DPDK has created a direct vector for hardware vendors to ship code into applications. Hardware vendors have specific interests: they want to differentiate themselves with complicated features, they want to get their product out the door quickly even if that means throwing bodies at a complicated implementation, and they want to optimize for the narrow cases that will look good on their marketing literature. They are happy for their complicated proprietary interfaces to propagate throughout the software ecosystem. They also focus their support on their big customers via account teams and aren't really bothered about independent developers or people on non-mainstream platforms.

Case in point: We want to run Snabb on Mellanox NICs. If we adopt the vendor ecosystem then we are buying into four (!) large software ecosystems: Linux kernel (mlx5 driver), Mellanox OFED (control plane), DPDK (data plane built on OFED+kernel), and Mellanox firmware tools (mostly non-open-source, strangely licensed, distributed as binaries that only work on a few distros). In practice it will be our problem to make sure these all play nice together and that will be a challenge e.g. in a container environment where we don't have control over which kernel is used. We also have to accept the engineering trade-offs that the vendor engineering team has made which in this case seems to include special optimizations to game benchmarks [1].

I say forget that for a joke.

Instead we have done a bunch more work up front to first successfully lobby the vendor to release their driver API [2] and then to write a stand-alone driver of our own [3] that does not depend on anything else (kernel, ofed, dpdk, etc). This is around 1 KLOC of Lua code when all is said and done.

I would love to hear from other people who want to join the ranks of self-sufficient application developers. Honestly our ConnectX driver has been a lot of work but it should be much easier for the next guy/gal to build on our experience. If you needed a JSON parser you would not look for a 100 KLOC implementation full of weird vendor extensions, so why do that for an ethernet driver?

[1] http://dpdk.org/ml/archives/dev/2016-September/046705.html
[2] http://www.mellanox.com/related-docs/user_manuals/Ethernet_A...
[3] https://github.com/snabbco/snabb/blob/mellanox/src/apps/mell...


> DPDK is taking the industry in the wrong direction, and application developers should fight back.

DPDK is doing the exact same work you did: making hardware vendors release their driver APIs and abstracting them away so that application developers can stay independent of them.

You "successfully lobbyied" for one API to be released. Now do that for any number of hardware, NICs versions, and in the end you will have to release a generic API, which is effectively a new DPDK.

Completely independent applications will only go so far. You are left with vendor lock-in and a very high upfront cost if you ever need to evolve your hardware.


I understand your perspective. If you are satisfied with using a vendor-provided software stack to interface with hardware then you are well catered for by DPDK and do not have to care what is under the hood.

I feel that the hardware-software interface is fundamental and that vendors should not control the software. I see an analogy to CPUs. I am really happy that CPU vendors document their instruction sets and support independent compiler developers. I would be disappointed if they started keeping their instruction sets confidential, available only under NDA, and told everybody to just use their LLVM backend without understanding it.


That is effectively the case. See, for example, DDIO with Intel, which can only be enabled for specific devices with full cooperation between Intel and that particular vendor.

You cannot compete with a DDIO-enabled device, which of course all Intel devices are.

See also the Intel multi-buffer crypto library, which was specialized and tuned for Intel CPUs. No one else could write code at this level of optimization, because we do not have the internal design documentation and simulators that Intel works with.

So yeah, you are talking to sophisticated hardware that will have firmware blobs and undocumented features. If you rely only on general instruction sets you will only get so far. When we are talking about nanoseconds of latency and these levels of bandwidth, those details are what set the various stacks apart.

The push for smart NICs will increasingly blur the line between the software and hardware layers. We can either pool our efforts so we don't each rewrite an abstraction layer on top of them, or do so separately for each vendor-specific API (OFED is but one example; there will be others).


I will respectfully disagree :).

We have taken Intel's reference code (https://github.com/lukego/intel-ipsec/blob/master/code/avx2/...) for high-speed AES-GCM encryption and used DynASM (https://luajit.org/dynasm.html) to refactor it as a much smaller program (https://github.com/snabbco/snabb/blob/master/src/lib/ipsec/a...). I see this as highly worthwhile: we are working on making the software ecosystem simpler and tighter just because we are hackers, while Intel are working primarily on selling CPUs and whatever is best for their bottom line.

I disagree with this characterization of DDIO, but I don't think the Hacker News comment section is the best venue for such low-level discussions. Hope to chat with you about it in some more suitable forum some time :) that would be fun.


FWIW, I would be quite interested in your view on DDIO - anything you can link?


Intel DDIO FAQ: http://www.intel.com/content/dam/www/public/us/en/documents/...

My understanding is that DDIO is an internal feature of the processor and works transparently with all PCI devices. Basically Intel extended the processor "uncore" to serve PCIe DMA requests via the L3 cache rather than directly to memory.


I think you're confusing DDIO with DCA. DDIO is Intel's mechanism of allocating L3 cache ways to DMA, and it works for any vendor's card. DCA is an older mechanism where per-TLP steering hint flags influence whether or not a DMA write ends up in the CPU cache. DCA is highly targeted, and much more effective in realistic workloads because you can be smart and cache just the descriptor and packet-header DMA writes (e.g., metadata). With DDIO, you end up caching everything, and with a limited number of cache ways you often end up effectively caching nothing, because later DMAs push earlier ones out of the cache before the host can use the data.

At a previous employer, we figured out the DCA steering hints and implemented them in our NIC. Thankfully enough of our PCIe implementation was programmable to allow us to do this.


Don't you have a financial interest in snabb succeeding and dpdk failing?


No. The way to make the most money is to align yourself with vendors. Mine is a technical interest in making networking software simpler.


You're writing your driver in Lua? Am I missing something?


LuaJIT is a very good tracing JIT compiler and DynASM Lua mode lets us embed assembly in Lua where necessary.


Depending on what you are doing, with latencies (or throughput) in that range, sticking a black-box library in there right away might not always be the best idea. Doing what the author did is also a way to learn how things work. Eventually the library might be the answer, but if I had to do what they did, I would do it by hand first as well.


I doubt there's a single real use case out there for which DPDK isn't fast enough but custom hardware isn't warranted.


Plus it supports even lower-level drivers for a bunch of cards (some VM-virtualised, such as the Intel em), as well as AF_PACKET, oh, and pcap.


And OP reimplemented the subset he actually needed in half a KLOC.


Because sometimes you do not want to replace one abstraction with another if the very point is to remove a layer.


Can you elaborate on what you mean by the Intel NICs/drivers being a hard prerequisite here?


Lots of the low-latency options require driver and hardware cooperation: busy polling, BQL, essentially all of the ethtool options, even IRQ affinity.

Intel has been a driving force behind many kernel networking improvements, but they naturally don't care about other manufacturers, so they implement a little bit of kernel infrastructure and put the rest into their drivers.
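For what it's worth, some of these knobs are plain socket options. Busy polling, for instance, can be requested per socket on 3.11+ kernels; here's a sketch (whether it actually does anything depends on the driver implementing the busy-poll hooks, which ixgbe does):

    #include <stdio.h>
    #include <sys/socket.h>

    /* Ask the kernel to busy-poll the device's receive queue for up to
     * 50 microseconds before sleeping, instead of waiting for an interrupt.
     * Requires CAP_NET_ADMIN and kernel headers that define SO_BUSY_POLL. */
    static int enable_busy_poll(int fd)
    {
        int usec = 50;
        if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usec, sizeof(usec)) < 0) {
            perror("setsockopt(SO_BUSY_POLL)");
            return -1;
        }
        return 0;
    }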


Understood, agreed. Thanks for the clarification.


This. The author does not even address why they are not using DPDK. Are they using AMD server CPUs?


Thanks for resurrecting my article! :) It was originally written for a Russian site and I don't have the spare time to translate it.


There's still not a lot of oomph left over to do anything with the traffic...or is that not the point of the exercise? You're not going to be comparing it, or writing it to disk at these levels.

This traffic is a bit above the levels I've dealt with, but I've seen Cloud Datacenter levels of traffic that, as far as I know, you can't practically log/monitor/IPS/SIEM...or am I misinformed?


> you can't practically log/monitor/IPS/SIEM...or am I misinformed?

It depends on the hardware you're using (specifically the router), but using NetFlow / sFlow / IPFIX [0] you can get pretty high visibility even on high-bandwidth networks. This only gets you "metadata" and not a full packet capture, but for monitoring and the like, the metadata can be far more useful.

I'm not entirely sure what level of traffic you're talking about, but I know it's possible, with the right hardware, to use NetFlow on 100GbE links without having to sample (i.e. recording flows for every packet, not 1 in every n packets).

[0]: Good sFlow vs. NetFlow breakdown: http://networkengineering.stackexchange.com/a/1335


You probably could with more cores. For example, the newest Xeon E5-2697A has 16 cores and can be made to run at 3+GHz continuously.


That's great, but how do you then get that many packets to disk so that you can do something with them?

Presumably you need flash drives, and probably an append-only filesystem?


Not necessarily. One can capture 10G to disk using "only" a RAID-0 with 8-10 mechanical disks: it does the job both in bandwidth and space, and you can use regular filesystems such as XFS.

40G is a little bit more difficult: you need a huge RAID (simple, direct scaling: 32-40 disks) with mechanical disks to achieve the necessary bandwidth, and if you want to use SSDs you will need a lot of them too in order to have enough space to save any meaningful amount of traffic.
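A rough sanity check on both of those sizing claims, assuming ~150 MB/s of sustained sequential writes per mechanical disk (a typical figure for a 7200 rpm drive):

    #include <stdio.h>

    int main(void)
    {
        const double disk_write = 150e6;   /* sustained sequential write, B/s */
        const double rates_gbps[] = { 10.0, 40.0 };

        for (int i = 0; i < 2; i++) {
            double bytes_per_sec = rates_gbps[i] * 1e9 / 8.0;
            printf("%2.0fG capture: %.2f GB/s -> ~%.0f disks in RAID-0\n",
                   rates_gbps[i], bytes_per_sec / 1e9,
                   bytes_per_sec / disk_write);
        }
        return 0;  /* prints ~8 disks for 10G and ~33 for 40G */
    }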

I remember seeing papers on on-the-fly compression for network traffic, but IIRC the results were not very impressive and the performance cost was noticeable.


> but how do you then get that many packets to disk

It may be possible to do disk I/O at that high a rate, e.g. with PCIe storage or a dedicated appliance for dumping the entire stream, but you would run out of storage pretty fast.

For example, a quick back-of-the-envelope calculation, where you dump the packet stream from 4x10 Gbps cards at the minimal 84-byte on-the-wire size (on Ethernet), shows that you would exhaust the storage in approx. 4.5 minutes :)


40 Gigabits per second is roughly 4 Gigabytes per second.

4 Gigabytes per second times 86400 seconds per day is 345,600 Gigabytes per day.

Roughly: 345 Terabytes per day.

Large, but not stupidly so.


40 Gbps would actually be exactly 5 Gigabytes per second (divided by 8).


While I don't know the exact overhead of 10GigE, there is likely still some overhead.

At the lower speeds, things like 8b/10b encoding and Reed-Solomon ECC added enough overhead that dividing by 10 was more accurate than dividing by 8.


> While I don't know the exact overhead of 10GigE, there is likely still some overhead.

On 10GigE pipes, at the max Ethernet MTU (1500 bytes), approx. 94% of the bandwidth is available for user data (accounting for things like the inter-frame gap, CRC checksums, etc.). With jumbo frames that number goes up to about 99%.
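To show where those percentages come from (a back-of-the-envelope sketch; the second column assumes the user data is TCP over IPv4 with no header options):

    #include <stdio.h>

    int main(void)
    {
        /* Fixed per-frame wire overhead on Ethernet: preamble+SFD 8,
         * MAC header 14, FCS 4, inter-frame gap 12 = 38 bytes. */
        const double wire_overhead = 38.0;
        const double ip_tcp_hdrs   = 40.0;  /* IPv4 + TCP, no options */

        const double mtu[] = { 1500.0, 9000.0 };
        for (int i = 0; i < 2; i++) {
            double wire = mtu[i] + wire_overhead;
            printf("MTU %4.0f: L2 payload %.1f%%, TCP payload %.1f%%\n",
                   mtu[i], 100.0 * mtu[i] / wire,
                   100.0 * (mtu[i] - ip_tcp_hdrs) / wire);
        }
        return 0;  /* ~97.5% / ~94.9% at 1500, ~99.6% / ~99.1% at 9000 */
    }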


Okay, so call it 10% overhead (actually 8%) if we're taking a WAG (wild *ss guess).

That would mean that I would need to divide by roughly nine (8.8 or so).

Sorry, I can't do divide-by-nine quickly in my head. I can do divide by 10 though, and my error is roughly 10%.


> 40 Gigabits per second is roughly 4 Gigabytes per second.

It should be 5 GB/s, right? :)


> where you dump packet stream from 4x10gbps cards with minimal 84b size (on ethernet), show that you would exhaust the storage in approx. 4.5 minutes

I don't understand your calculation here, or the size of the storage you're considering. I did this:

3.5 million packets per second * 84 bytes per packet * 4 interfaces == 101.606 terabytes/day.

Or, at the raw interface rate:

10 Gb/s * 4 interfaces == 432 terabytes/day.


> or the size of the storage

Whoopsie, I was considering 1.2 TB of storage...

> 3.5 million packets per second

It is actually 14.88 Mpps (the theoretical maximum for minimum-size frames at 10 Gbps).
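Putting those numbers together (a quick sketch using the on-the-wire size of a minimum frame, 84 bytes, and the 1.2 TB figure above):

    #include <stdio.h>

    int main(void)
    {
        const double pps_per_port = 14.88e6;  /* min-size frames at 10 Gbps */
        const double wire_bytes   = 84.0;     /* 64-byte frame + preamble + IFG */
        const double ports        = 4.0;
        const double storage      = 1.2e12;   /* 1.2 TB */

        double bytes_per_sec = pps_per_port * wire_bytes * ports;
        printf("%.2f GB/s -> storage full in %.1f minutes\n",
               bytes_per_sec / 1e9, storage / bytes_per_sec / 60.0);
        return 0;  /* ~5 GB/s, full in roughly four minutes */
    }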


If anyone is interested, I wrote a lock-free, C/C++-free implementation of an AF_PACKET socket abstraction in Go:

https://github.com/nathanjsweet/zsocket

I haven't implemented fan-out at all, but if anybody is interested in adding it, I'd happily apply their pull request.
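For context, fan-out here refers to PACKET_FANOUT, which lets the kernel load-balance one capture across several AF_PACKET sockets. At the C level it is a single setsockopt; here's a sketch of the kernel interface such a library has to wrap (the group number and hash mode are arbitrary choices):

    #include <stdio.h>
    #include <sys/socket.h>
    #include <arpa/inet.h>
    #include <linux/if_packet.h>
    #include <linux/if_ether.h>

    /* Open one AF_PACKET socket and join fan-out group 42; the kernel
     * hashes flows across every socket that joins the same group. */
    static int open_fanout_socket(void)
    {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0) { perror("socket"); return -1; }

        int group = 42;
        int arg = group | (PACKET_FANOUT_HASH << 16);
        if (setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &arg, sizeof(arg)) < 0) {
            perror("PACKET_FANOUT");
            return -1;
        }
        return fd;
    }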


What's the point of capturing and storing 40 Gb/s of network traffic?


Same as it is on slower speeds.

Analytics, debugging, security monitoring, ...

Of course you're going to try to avoid storing all traffic, but to decide what's interesting it has at least to be captured first. And on big sites, 40 Gb/s is still only an already random-sampled or pre-filtered subset of all traffic.


analysis, IDS, etc.

The inability to analyze traffic at this rate is a serious problem. How do you study it to see how protocols can be improved? A lab environment cannot compare to real-world traffic. How do you detect attacks (not DoS!) if they're hidden in a link operating at this capacity?


Even if you can capture the traffic at wire speed, the CPU doesn't have the power to analyse the stream. I thought that traffic analysers had to be built with FPGAs/ASICs because of that.


My manager did his thesis on this: Endace NICs split the traffic up and send it to a cluster of IDS servers, which lets you actually do line-rate analysis. No need for an FPGA/ASIC.


When monitoring a network and faced with choices of where to tap it, a tap that captures a wider view can be advantageous. For instance, capturing at a WAN link can provide a better view of attackers.



