BPF, XDP, Packet Filters and UDP

ignoramous · on Oct 21, 2020

So... this post casually outlines how one could go about build a Global Network Load Balancer at Google-scale. Amazing!

A few naïve questions:

> You can make any protocol work with a custom proxy. Take DNS: your edge servers listen for UDP packets, slap PROXY headers on them, relay the packets to worker servers, unwrap them, and deliver them to containers.

Curious: Wouldn't SOCKS5 here be a like-for-like replacement for PROXY? Why would one choose one over the other?

> WireGuard doesn't have link-layer headers, and XDP wants it to

Is the gist here that WireGuard doesn't because it is Layer 3? And that XDP sits one layer below it?

> Jason Donenfeld even wrote a patch, but the XDP developers were not enthused, just about the concept of XDP running on WireGuard at all

Could someone please explain this? Is it that XDP here didn't want to add a support to delegate routing onto WireGuard?

> It's a little hard to articulate how weird it is writing eBPF code. You're in a little wrestling match with the verifier

Would NetMap or Intel's dpdk instead make for an non-enterprising choice here? Don't they have a similar profile in terms of throughput? I guess, one has to use a userspace TCP/IP stack like gVisor's NetStack or LwIP to go with NetMap/dpdk?

> Those configurations are fed into distributed service discovery; our servers listen on changes and, when they occur, they update a routing map

How is this system implemented? Curious because uptime, availability, durability, and latency must be of prime importance for such a service. Is there a blog about this detailing the challenges inherent here? Or, does it use consul/etcd or some such out-of-the-box solution?

> a simple map of addresses to actions and next-hops; the Linux bpf(2) system call lets you update these maps on the fly.

Clarification: does this mean the maps are already in a format the bpf/2 command understands, or is something else going on here?

Thanks.

tptacek · on Oct 22, 2020

So, someone here can correct me about this if I'm wrong, but my impression is that DPDK's architecture sort of assumes single-purpose packet-processing servers --- it's commonly deployed with polling-mode drivers, right? So apart from the fact that DPDK would have required us to run another proxy daemon in userland, which was the thing we were trying to avoid doing, it also might have been tricky to get it to behave nicely on the same machines as our Rust edge proxies.

The nice thing about XDP is, you use llvm to make a .o, you load it with iproute2, and you forget about it; your code isn't a process running on the system, but rather a part of the Linux networking stack. It's as if the kernel just shipped with a "route Fly's UDP" feature.

ignoramous · on Oct 22, 2020

Thanks.

I was going through Cloudflare's Magic Transit docs and they seem to simply use Direct Server Return (Facebook does so too for their L4 load balancer [0] instead of dealing with XDP in the return path).

Any reason fly.io doesn't but relays the packets again out via the edge?

[0] https://engineering.fb.com/open-source/open-sourcing-katran-...

tptacek · on Oct 22, 2020

It's possible that we could; my concern is our forwarding takes over IP addressing, and can forward cross-region, and that theoretically direct server return could break RPF. DSR is, of course, much easier than the flow-based response forwarding we're doing now.

haivri · on Oct 22, 2020

Interesting idea about socks5 - my guess is it's less useful here because it only relays destination address, whereas proxy protocol has both, source and destination.

otoburb · on Oct 21, 2020

>>Linux kernel developers quickly come to the same conclusion the DTrace people came to 15 years ago: if you're going to have a compiler and a kernel-resident VM, you might as well use it for everything. So, the seccomp system call filter gets eBPF. Kprobes get eBPF. Kernel tracepoints gets eBPF. Userland tracing gets eBPF. If it's in the Linux kernel and it's going to be programmable (even if it shouldn't be), it's going to be programmed with eBPF soon.

Feels like Oprah Winfrey's September 13th, 2004 show: "YOU get a car! YOU get a car! And YOU get a car! Everybody gets a car!"[1]

[1] https://www.youtube.com/watch?v=pviYWzu0dzk

keithalewis · on Oct 21, 2020

This article is remarkably well written. The first paragraph lays out why you would want to read it, or not. It then presents a well documented history of the problems it is solving to illustrate the whys and wherefores of the product. Well done! Thanks.

Aaronstotle · on Oct 21, 2020

I really enjoyed this post, as someone who doesn't possess much programming prowess, I am fascinated with eBPF/kernel sub-systems and I am always eager to learn more. I might have to take the author's advice and build an emulator soon.

ignoramous · on Oct 21, 2020

See also https://docs.cilium.io/en/v1.8/bpf/#bpf-guide

bogomipz · on Oct 21, 2020

The post states:

>"You can make any protocol work with a custom proxy. Take DNS: your edge servers listen for UDP packets, slap PROXY headers on them, relay the packets to worker servers, unwrap them, and deliver them to containers. You can intercept all of UDP with AF_PACKET sockets, and write the last hop packet that way too to fake addresses out. And at first, that's how I implemented this for Fly."

This is really interesting. I looked at the linked blog post and was hoping there was more implementation details. Does your Fly pi-hole use HAProxy and the PROXY headers then? Is the config for that available anywhere i could see?

tptacek · on Oct 21, 2020

No, the Pi-hole example uses the XDP UDP scheme this blog post talks about: DNS packets arrive on edge servers, XDP intercepts them before they reach the IP stack, puts a proxy header on the message (we don't use HAProxy's proxy protocol, to conserve space), and relays it out WireGuard; TC BPF attached to the WireGuard interface on the other end (the worker server) strips off the header, fixes the addresses accordingly, and relays to the tap interface for the right worker.

The first cut of this feature I built, without BPF, used NFQueue (diverting packets based on iptables rules to userspace), did a sockets-based proxy from edge to worker, and used a simple raw socket to fix the addresses and write the packet to its destination. NFQueue was annoying to work with, I looked at BPF filters instead, and ultimately wound up just doing the whole thing in BPF.

You don't need to know anything about this to use UDP on Fly.io; you can just add UDP ports the same way you'd add TCP ports (the `fly.toml` in the Pi-hole blog post shows an example).

sysbot · on Oct 22, 2020

XDP UDP mapping to firecracker vms via WireGuard is really interesting! I have a question a bit before UDP is landed on the NIC, assuming the NICs on the edge servers is connected to multiple transit providers for incoming and outgoing traffic. This mean from the VM perspective, you can have incoming/outgoing tap/tun inside the VMs able to receive packets from difference transits or outbound, did you do anything with this aspect? and if so do you also deal with ECMP inbound in such that you can have the same virtual IP receiving UDP on multiple edge servers?

bogomipz · on Oct 21, 2020

I see. Thanks for the clarification. I need to read up more on XDP Schemas and headers. Might you or anyone else have any resources you found helpful?

tptacek · on Oct 21, 2020

There's not much to know! "XDP" is really just the Linux term of art for "BPF running directly off the network driver". Your BPF program --- ordinarily, just a C program you compiled with clang --- is given a struct with pointers to the beginning and end of a packet, and your program can return OK, DROP, or REDIRECT, in addition to modifying the packet.

The XDP project itself has a pretty excellent tutorial:

https://github.com/xdp-project/xdp-tutorial

austinpena · on Oct 21, 2020

I’ve had nothing but good experiences using Fly

DSingularity · on Oct 21, 2020

“ If you're just looking to play around with this stuff, by the way, I can give you a Dockerfile that will get you a janky build environment, which is how I did my BPF development before I started using perf, which I couldn't get working under macOS Docker”

Anyone find this?

tptacek · on Oct 21, 2020

Haven't published it! If nobody else has a good one, I'll post mine; the only reason I haven't is that it's janky as fuck (it installs extra stuff, and I pull a lot of my own kernel BPF header deps in).

bogomipz · on Oct 21, 2020

Yes please do publish it. It would make a great addition to the article. Great post by the way.

DSingularity · on Oct 21, 2020

I haven’t seen one! It would be nice to have one. Btw very nice write up.

ncmncm · on Oct 22, 2020

You don't have to write your eBPF codes in C: You can write them in C++! Or Rust! Or Fortran? About anything that can be translated to LLVM IR. Zig! Nim! Zim!

So, that is the way we will get to run C++ code in the Linux kernel. And, soon enough, in the BSDs.

It is hard to get a sense of how janky all this is, or how amazing it is that all this Rube Goldberg gimcrackery can be made to work the wonders it is seen to do all day, every day. It's not just a dancing bear, it's a bear on the Bolshoi stage!

(Donenfeld had better get his act together and get wireguard fitting better with how eBPF wants things to be, because that is where the world is headed.)

If your program isn't spending most of its time inside the kernel running code sorta JITted from eBPF, you're just not serious about performance.

Unless, of course, you have gone full-on kernel bypass, and the kernel never gets your packets at all. Then you can just run straight-up, optimized native machine code translated directly from C++, or Rust, or even, with masochism enough, C!

tptacek · on Oct 23, 2020

The actual execution model of what BPF is doing closely (and deliberately) matches C, and you care about every detail of what's happening (not least because you have a limited number of instructions to spend in a BPF program). There's no meaningful safety win in writing BPF in Rust (if there is, we have bigger problems). You can't use the standard library, or any of the standard library's data structures. In fact, I don't even think you can call functions right now without giving up tail calls. So I don't see the advantage. But you do you!

As for Donenfeld: you couldn't be more wrong. Jason wrote a patch to fix the WireGuard/XDP breakage, and the XDP team rejected it, saying that they didn't feel XDP made sense for WireGuard.

Their position is also easy to understand: the point of XDP is to intercept packets before they're copied into socket kernel buffers, and you can't meaningfully do that with a virtual network that runs off UDP sockets to begin with. I disagree with them about this being dispositive --- consistency of interface is much more important to me than "performance surprises" --- but, whatever, at least acknowledge the debate rather than sniping.

ibotty · on Oct 22, 2020

> Donenfeld had better get his act together and get wireguard fitting better with how eBPF wants things to be, because that is where the world is headed.

Can you clarify (and possibly tone down) your comment? The impression I get from the article is that he is doing what he can, even proposing patches.

dochtman · on Oct 21, 2020

So presumably this will also open up avenues for doing QUIC and thus HTTP/3 on Fly?

mrkurt · on Oct 21, 2020

Yep! We have a "Firecracker that accepts QUIC" running with this.

People usually want HTTP + TLS handled for them, though. So when we ship QUIC + HTTP3 as a first class feature, we'll terminate QUIC and give people whatever their app process can accept.

eptcyka · on Oct 21, 2020

Unrelated, but could you please expand on how firecracker fits within your stack?

tptacek · on Oct 21, 2020

You could describe our job as "taking Dockerfiles from customers and running them globally"; the way we actually "run" Docker images is to convert them to root filesystems and run them inside Firecracker. Firecracker is the core of our stack.

k__ · on Oct 22, 2020

I find it rather curious that the cloud-native crowd tries to sell us containers, but cloud providers themselves use VMs.

Like, if not even AWS, Cloudflare, and fly.io use containers, how can K8s be native in any way?

I mean, that makes even Lambda, which runs on Firecracker VMs, more native than K8s.

mrkurt · on Oct 22, 2020

Most organizations don't have the multi tenant problem we do, and end up just using Docker when they do containers.

But I also think it's fair to call "Firecracker VMs" containers. Most of what those people are talking about is application packaging and deployment, not necessarily what actually runs.

For what it's worth, I am also cynical about "cloud native".

michaeldwan · on Oct 22, 2020

Fwiw we run as many services as we can on our own platform. Mission critical systems like our registry, api, redis servers, and much more are all running as fly apps in firecracker.

riyakhanna1983 · on Oct 21, 2020

Why not just run Docker containers natively?

mrkurt · on Oct 22, 2020

We have a whole post about workload isolation that answers this: https://fly.io/blog/sandboxing-and-workload-isolation/

The tldr is: Docker containers don't offer enough isolation for multi tenant systems.

They're also very slow to boot, compared to a Firecracker VM.

riyakhanna1983 · on Oct 22, 2020

But does FC not incur high I/O overhead at runtime?

mrkurt · on Oct 22, 2020

The flippant answer is "it doesn't really matter, safety usually wins over performance".

But also, we run Firecrackers on our own physical servers and performance is quite good (even network + disk performance).

dochtman · on Oct 21, 2020

Any insight into what QUIC/H3 stack you'll be using for the proxy?

jeromegn · on Oct 21, 2020

To be determined. We're hoping to contribute and use what's going to come out of hyper's h3 efforts (we use Rust for our reverse-proxy). There's not much there yet though: https://github.com/hyperium/h3

We're not in a huge hurry to support QUIC / H3 given its current adoption. However, our users' apps will be able to support it once UDP is fully launched, if they want to.

apitman · on Oct 21, 2020

Are you using a custom reverse proxy? For a recent project I started with Caddy but ended up needing some functionality it didn't have, and didn't need most of what it did have. I'm currently using a custom proxy layer, but I'm concerned I might end up having to implement more than I want to (I know I'll at least need gzip). Curious what your experience at fly has been with this.

mrkurt · on Oct 21, 2020

We are! It's Rust + Hyper. It is a _lot_ of work, but that's because we're trying to proxy arbitrary client traffic to arbitrary backends AND give them geo load balancing.

Writing proxies is fun. Highly recommended.

apitman · on Oct 22, 2020

Cool, thanks!

I was actually just playing with Hyper for a few hours last night. Are you guys using async/await yet? Any suggestions for learning materials for async rust other than the standard stuff?

ignoramous · on Oct 22, 2020

another stupid question, but can't help it: golang seems like a popular choice among network developers. Any reason that made fly.io choose Rust over golang for the proxy?

mrkurt · on Oct 22, 2020

Because of JavaScript. Really!

We settled on Rust back when we were building a JS runtime into our proxy. It's a great language for hosting v8. When we realized our customers just wanted to run any ol' code, and not be stuck in a v8 isolate, we extracted the proxy bits and kept them around.

I think you could build our proxy just fine in Go. One really nice thing about Rust, though, is the Hyper HTTP library. It's _much_ more flexible than Go's built in HTTP package, which turns out to be really useful for our particular product.

mholt · on Oct 22, 2020

What functionality did you need that Caddy didn't have?

apitman · on Oct 22, 2020

Hey Matt! I'm referring to the ability to change the Caddy config from an API that is itself proxied through Caddy. Here's the issue which you very helpful in[0].

Ultimately I realized that most of what I needed from Caddy was really just certmagic, which has worked flawlessly since I got it set up. Plus I need the final product to compile into a single binary. Since my custom reverse proxy only took a few lines of code, I haven't worried too much about it. But there are a few features which I'll have to integrate eventually.

If I end up seeing myself headed down the path of making a full-fledged reverse proxy, I'll reconsider trying to implement my project as a Caddy plugin.

[0]: https://github.com/caddyserver/caddy/issues/3754

edf13 · on Oct 21, 2020

Could this be opened up to other (none http) based protocols and also over UDP?

Isamu · on Oct 21, 2020

You can filter pretty much any packet. The downside with using TCP is you are looking at individual packets, which may be out of order, that sort of thing.

tptacek · on Oct 21, 2020

For CDN purposes, you assume that something on each end of the TCP connection --- something outside of BPF --- is going to be running a full TCP. In our case, that something is Linux's TCP/IP stack running in a Firecracker VM (we could load XDP programs into our VMs, but we don't).

You can do a lot with TCP, and be tolerant to out-of-order delivery and drops, just by shuttling the individual packets. So we can in fact "cut through" TCP sessions directly to Firecracker, avoiding our proxies. We don't, though: our "tcp" handlers actually route through our Rust proxies, both because that's what they've always done, and because in most cases there isn't much of a win to bypassing the proxies, which have a lot more load balancing and resiliency logic than the BPF-based UDP data path does.

tptacek · on Oct 21, 2020

Should work for any UDP. We can do the same thing for non-UDP protocols, I guess, too.

bogomipz · on Oct 21, 2020

I was curious about was what is the fly.io container orchestrator that runs this edge architecture and were there any challenges implementing this on that? Cheers.

mrkurt · on Oct 21, 2020

We use Nomad + custom Firecracker plugin and custom device plugins. Nomad is very nice, much smaller in scope than k8s, and saved us a lot of effort.

We don't use any of Nomad's networking stuff, really, so all the BPF work was relatively easy to bolt in. All our Firecrackers get tun/tap pairs we manage outside of Nomad, it's reasonably simple to make them do what we want.

I think we'll ultimately move off for custom orchestration but we want to wait as long as possible before we do that.

bonfire · on Oct 25, 2020

Really liked this blog post. Thank you. I've always found eBPF to be "magic"

iammarco11 · on Oct 28, 2020

is there any forward proxy I can use which supports HTTP/3 so I can monitor the traffic