So... this post casually outlines how one could go about building a Global Network Load Balancer at Google scale. Amazing!
A few naïve questions:
> You can make any protocol work with a custom proxy. Take DNS: your edge servers listen for UDP packets, slap PROXY headers on them, relay the packets to worker servers, unwrap them, and deliver them to containers.
Curious: Wouldn't SOCKS5 here be a like-for-like replacement for PROXY? Why would one choose one over the other?
> WireGuard doesn't have link-layer headers, and XDP wants it to
Is the gist here that WireGuard doesn't have them because it's a Layer 3 interface (it hands the kernel bare IP packets), and that XDP sits one layer below, expecting a link-layer frame?
> Jason Donenfeld even wrote a patch, but the XDP developers were not enthused, just about the concept of XDP running on WireGuard at all
Could someone please explain this? Is it that the XDP developers didn't want to add support for running XDP on WireGuard interfaces at all?
> It's a little hard to articulate how weird it is writing eBPF code. You're in a little wrestling match with the verifier
Would netmap or Intel's DPDK instead make for a non-enterprising choice here? Don't they have a similar profile in terms of throughput? I guess one would have to pair netmap/DPDK with a userspace TCP/IP stack like gVisor's Netstack or lwIP?
> Those configurations are fed into distributed service discovery; our servers listen on changes and, when they occur, they update a routing map
How is this system implemented? Curious because uptime, availability, durability, and latency must be of prime importance for such a service. Is there a blog post detailing the challenges inherent here? Or does it use consul/etcd or some such out-of-the-box solution?
> a simple map of addresses to actions and next-hops; the Linux bpf(2) system call lets you update these maps on the fly.
Clarification: does this mean the maps are already in a format the bpf(2) system call understands, or is something else going on here?
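My (possibly wrong) mental model is that the routing map is an ordinary BPF map that a userland daemon updates through bpf(2), something like this sketch (the map path, key, and value layout here are invented for illustration, not Fly's actual schema):

    /* Hypothetical sketch (libbpf): update a pinned "address -> next-hop"
     * BPF hash map from userland. */
    #include <stdint.h>
    #include <stdio.h>
    #include <arpa/inet.h>
    #include <bpf/bpf.h>

    struct next_hop { uint32_t gw_addr; uint8_t action; };  /* made-up value layout */

    int main(void)
    {
        /* The XDP program would have pinned its map here at load time. */
        int map_fd = bpf_obj_get("/sys/fs/bpf/hypothetical_routing_map");
        if (map_fd < 0) { perror("bpf_obj_get"); return 1; }

        uint32_t dst = inet_addr("203.0.113.10");              /* key: anycast addr */
        struct next_hop nh = { inet_addr("10.0.0.7"), 0 };     /* forward to this worker */

        /* bpf(BPF_MAP_UPDATE_ELEM, ...) under the hood; the running XDP
         * program sees the new entry immediately. */
        return bpf_map_update_elem(map_fd, &dst, &nh, BPF_ANY);
    }

Is that roughly what's going on?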
So, someone here can correct me about this if I'm wrong, but my impression is that DPDK's architecture sort of assumes single-purpose packet-processing servers --- it's commonly deployed with polling-mode drivers, right? So apart from the fact that DPDK would have required us to run another proxy daemon in userland, which was the thing we were trying to avoid doing, it also might have been tricky to get it to behave nicely on the same machines as our Rust edge proxies.
The nice thing about XDP is, you use LLVM to make a .o, you load it with iproute2, and you forget about it; your code isn't a process running on the system, but rather a part of the Linux networking stack. It's as if the kernel just shipped with a "route Fly's UDP" feature.
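To make the shape of that concrete, here's a toy XDP program (deliberately simplified; not our actual forwarding code) with the build/load steps in the comments:

    /* Toy XDP program, just to show the shape of the workflow:
     *
     *   Build: clang -O2 -g -target bpf -c xdp_toy.c -o xdp_toy.o
     *   Load:  ip link set dev eth0 xdp obj xdp_toy.o sec xdp
     *
     * After that there's no daemon to babysit; the code runs inside the
     * kernel's receive path for eth0. */
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("xdp")
    int xdp_toy(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        /* XDP hands us a raw frame, so step one is the Ethernet header --
         * the link-layer header a WireGuard interface doesn't have, which
         * is the mismatch mentioned upthread. The bounds checks are what
         * keep the verifier happy. */
        struct ethhdr *eth = data;
        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS;

        struct iphdr *ip = (void *)(eth + 1);
        if ((void *)(ip + 1) > data_end)
            return XDP_PASS;

        /* A real program would look up ip->daddr in a BPF map of
         * address -> action/next-hop and XDP_TX or XDP_REDIRECT it. */
        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";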
I was going through Cloudflare's Magic Transit docs, and they seem to simply use Direct Server Return (Facebook does the same for their L4 load balancer [0]) instead of dealing with XDP in the return path.
Any reason fly.io doesn't, and instead relays the packets back out via the edge?
It's possible that we could; my concern is that our forwarding takes over IP addressing and can forward cross-region, and that direct server return could theoretically break RPF. DSR is, of course, much easier than the flow-based response forwarding we're doing now.
Interesting idea about SOCKS5 - my guess is it's less useful here because it only relays the destination address, whereas the PROXY protocol carries both source and destination.
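For UDP you'd be on the binary PROXY v2 header anyway, which (per the haproxy spec) carries the full 4-tuple. Roughly, for the IPv4/UDP case (a sketch of the wire layout, not Fly's code):

    /* PROXY protocol v2 header, IPv4/UDP case -- the point being that it
     * carries the original source *and* destination, which SOCKS5 doesn't. */
    #include <stdint.h>

    struct proxy_hdr_v2 {
        uint8_t  sig[12];   /* \r\n\r\n\0\r\nQUIT\n -- fixed signature */
        uint8_t  ver_cmd;   /* 0x21: version 2, PROXY command */
        uint8_t  fam;       /* 0x12: AF_INET + DGRAM (UDP over IPv4) */
        uint16_t len;       /* bytes that follow, network order (12 here) */
    };

    struct proxy_addr_ipv4 {
        uint32_t src_addr;  /* original client address */
        uint32_t dst_addr;  /* original destination (the anycast IP) */
        uint16_t src_port;
        uint16_t dst_port;
    };

So the worker can pop that header off and still know exactly who sent the datagram and which address it was aimed at.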