It talks a lot about performance, but the actual cloud load balancers, such as AWS ELB or Azure Load Balancer, are implemented in the software-defined network (SDN) and accelerated in hardware.
For example, in Azure only the first two packets in a VM->LB->VM flow will traverse the LB. Subsequent packets are direct from VM-to-VM and are rewritten in the host NICs to merely appear to go via the LB address. This enables staggering throughput that no “software in a VM” can possibly hope to match.
Personally, I wish people would stop with the unnecessary middleboxes. It’s 2023 and there are cloud VMs now that can put out 200 Gbps! Anything in the path of a few dozen such VMs will melt into slag.
This is especially important for Kubernetes, and microservices in general, which are already very chatty and, in surprisingly common configurations, sit behind reverse proxies stacked five deep.
> only the first two packets in a VM->LB->VM flow will traverse the LB. Subsequent packets are direct from VM-to-VM and are rewritten in the host NICs to merely appear to go via the LB address
Generally you can do this if the network is software defined. The boundaries between networks aren’t actually real, which means you can load balance by simply deciding whether the packets should be allowed to route, then letting them route directly for the rest of the flow.
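A minimal sketch of the idea, with every name and structure invented here (no real SDN API is being shown): the first packet of a flow hits the slow path, where "LB" logic picks a backend and installs a flow-table entry; every subsequent packet matches that entry and never touches the LB again.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical flow table: maps a connection key to a chosen backend. */
struct flow_key   { uint32_t src_ip, dst_ip; uint16_t src_port, dst_port; };
struct flow_entry { struct flow_key key; uint32_t backend_ip; int in_use; };

#define TABLE_SIZE 1024
static struct flow_entry table[TABLE_SIZE];
static const uint32_t backends[] = { 0x0a000002, 0x0a000003 }; /* 10.0.0.2, 10.0.0.3 */

static struct flow_entry *lookup(const struct flow_key *k) {
    for (int i = 0; i < TABLE_SIZE; i++)
        if (table[i].in_use && !memcmp(&table[i].key, k, sizeof *k))
            return &table[i];
    return NULL;
}

/* Returns the backend IP this packet should be rewritten toward. */
uint32_t handle_packet(const struct flow_key *k) {
    struct flow_entry *e = lookup(k);
    if (e)                       /* fast path: flow already offloaded */
        return e->backend_ip;

    /* Slow path: load-balancer logic runs once per flow, picks a
     * backend, and installs state so later packets skip this step. */
    uint32_t backend = backends[(k->src_ip ^ k->src_port) % 2];
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (!table[i].in_use) {
            table[i] = (struct flow_entry){ *k, backend, 1 };
            break;
        }
    }
    return backend;
}
```

In a real cloud SDN the fast path lives in the host's vSwitch, FPGA, or NIC rather than a C loop, but the setup-once-then-offload shape is the same.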
+1, a more detailed explanation or links out to reading material would be great.
I assume there are some limitations if you actually skip the LB? How do the host NICs rewrite the LB address? Does this imply there is hardware support for this kind of bypass routing?
I’m not convinced it offers better COGS (cost of goods sold), actually. You can do line rate on a CPU with DPDK these days, and even relatively beefy CPUs are probably cheaper than specialized hardware like a Xilinx card.
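For context, the DPDK fast path being alluded to is a poll-mode burst loop. A bare-bones sketch (single port, one queue pair, error handling and header rewriting elided; all sizes are just plausible defaults):

```c
#include <stdlib.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST 32

int main(int argc, char **argv) {
    if (rte_eal_init(argc, argv) < 0)
        rte_exit(EXIT_FAILURE, "EAL init failed\n");

    /* One mempool backing all RX descriptors. */
    struct rte_mempool *pool = rte_pktmbuf_pool_create("pool", 8191, 256, 0,
            RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());

    /* Assume port 0 with one RX and one TX queue. */
    struct rte_eth_conf conf = {0};
    rte_eth_dev_configure(0, 1, 1, &conf);
    rte_eth_rx_queue_setup(0, 0, 1024, rte_eth_dev_socket_id(0), NULL, pool);
    rte_eth_tx_queue_setup(0, 0, 1024, rte_eth_dev_socket_id(0), NULL);
    rte_eth_dev_start(0);

    for (;;) {
        struct rte_mbuf *bufs[BURST];
        uint16_t n = rte_eth_rx_burst(0, 0, bufs, BURST);
        /* ... pick a backend / rewrite headers here ... */
        uint16_t sent = rte_eth_tx_burst(0, 0, bufs, n);
        for (uint16_t i = sent; i < n; i++)   /* drop what the NIC didn't take */
            rte_pktmbuf_free(bufs[i]);
    }
}
```

The catch, as the replies below note, is that every core spent polling like this is a core you can't sell or use for the workload.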
I spent a lot of time thinking about this, and the conclusion I reached is that while the CPU may be cheaper, it can generate revenue where the FPGA cannot. So even though the TCO may seem upside down, the opportunity cost of using a few cores makes up for the additional cost.
GCP, for example, has the potential for ~$1k of revenue per core over a system's lifespan. A SmartNIC is probably ~$1.5k, so saving two cores puts you in the black, and it has other security advantages as well.
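Spelled out with those (rough, assumed) numbers: offloading frees 2 cores × ~$1k ≈ $2k of sellable capacity against a ~$1.5k SmartNIC, i.e. roughly $500 net per machine over its lifespan, before counting the isolation benefits.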
Not sure if you mean this in a purely "making the accounting look good" sense, but I'm not following: usually you get a fixed $$ budget, not a CPU-count budget. Therefore buying all CPUs will actually make workload cores cheaper due to higher volume discounts.
Just curious, regarding "only the first two packets in a VM->LB->VM flow will traverse the LB. Subsequent packets are direct from VM-to-VM and are rewritten in the host NICs to merely appear to go via the LB address":
how is it possible to change the load balancer IP (VIP) to the VM IP mid-session? Are you talking about DSR (Direct Server Return) here?
Cloud networking is basically Magic(tm). The packet headers are a mere formality to keep legacy operating systems happy.
In typical data centres the "network" is really just a handful of Cisco boxes. In the cloud, the network extends to the FPGAs or ASICs in the servers themselves, including the hypervisors.
When a packet leaves a VM, the hypervisor host rewrites it, typically in hardware, and then when the remote hypervisor receives it, the packet is rewritten back to what the destination VM accepts.
This allows thousands of overlapping 10.0.0.0/24 subnets, and "tricks" like direct VM-to-VM traffic that appears to go via a load balancer.
The actual load balancer VMs just "set up" the flow, while instructing the hosts to take over the direct traffic in their stead.
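A toy illustration of the kind of per-packet rewrite the host does, here just swapping the VIP for the chosen backend address in an IPv4 header and patching the checksum incrementally (RFC 1624 style). This is only a sketch of the concept; real hosts do this in the vSwitch/FPGA/NIC, not in C on the main CPU:

```c
#include <stdint.h>
#include <netinet/ip.h>   /* struct iphdr */

/* Incremental IPv4 checksum update for a changed 32-bit field (RFC 1624):
 * HC' = ~(~HC + ~m + m'). Works on values as stored in the packet. */
static void csum_replace4(uint16_t *sum, uint32_t from, uint32_t to) {
    uint32_t s = (uint16_t)~*sum;
    s += (uint16_t)~(from >> 16) + (uint16_t)~(from & 0xffff);
    s += (to >> 16) + (to & 0xffff);
    while (s >> 16)                       /* fold carries back in */
        s = (s & 0xffff) + (s >> 16);
    *sum = (uint16_t)~s;
}

/* On egress toward the backend: destination VIP -> chosen backend IP.
 * The mirror-image rewrite on the return path makes replies appear to
 * come from the VIP, which is why neither VM can tell the LB was bypassed. */
void rewrite_dst(struct iphdr *ip, uint32_t vip, uint32_t backend_ip) {
    if (ip->daddr == vip) {
        csum_replace4(&ip->check, ip->daddr, backend_ip);
        ip->daddr = backend_ip;
    }
}
```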
Ok, got it, something along the lines of OpenFlow. Is there any documentation, or are there links, on this being used by AWS/Azure/GCP? I would like to read more on this.
Don't have time to look, but if you check GitLab's (the company's) infrastructure issue tracker (it's open), they have some details on how GCP cloud networking works, with quotes from GCP support staff.
I believe they've seen high amounts of out-of-order packets, and there are some detailed write-ups on why that happens with GCP's SDN implementation.
The details for the bare-metal benchmarks are sparse. I would have expected an eBPF solution to outperform the "aging" IPVS by a significant margin. Moreover, the peak performance of IPVS is far better (115 vs 57 reqs/s). It would be interesting to know whether that is an outlier. A benchmark with a workload that increases over time would give a more precise comparison of the two solutions.
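For context on what is being benchmarked: the eBPF data path is typically an XDP program along these lines. This is a bare skeleton with invented names, doing only backend selection (a real load balancer would rewrite addresses/MACs and return XDP_TX):

```c
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define NUM_BACKENDS 2

SEC("xdp")
int xdp_lb(struct xdp_md *ctx) {
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    /* Bounds checks are mandatory: the verifier rejects the program
     * without them. */
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    /* Pick a backend by hashing the source address. */
    __u32 backend = ip->saddr % NUM_BACKENDS;
    (void)backend;   /* rewrite + XDP_TX would go here */
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```

Since this runs at the driver level, before the kernel stack, the expectation that it should beat IPVS (which sits behind netfilter) is reasonable, which is why the benchmark numbers deserve a closer look.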
Is this a blind load balancer, similar to the iptables statistic module, or are there health checks? If so, are they active or passive? Asking because I saw a comparison to HAProxy.
Does anyone else do Segment Routing in kube? This particularly caught my eye. I wonder how much other software and setup users need in order to take advantage of this in Loxilb. It's such a different paradigm, specifying much more of the route packets take!
This looks interesting, especially the support for GTP/SCTP, although it seems quite new; the first commit on GitHub is from last year. I wonder if anyone has used this in production?