Quic throws away roughly 40 years of performance optimizations that operating systems and network card vendors have done for TCP. For example (based on the server side)
- sendfile() cannot be done with QUIC, since the QUIC stack runs in userspace. That means that data must be read into kernel memory, copied to the webserver's memory, then copied back into the kernel, then sent down to the NIC. Worse, if crypto is not offloaded, userspace also needs to encrypt the data.
- LSO/LRO are (mostly) not implemented in hardware for QUIC, meaning that the NIC is sent 1500b packets, rather than being sent a 64K packet that it segments down to 1500b.
- The crypto is designed to prevent MiTM attacks, which also makes doing NIC crypto offload a lot harder. I'm not currently aware of any mainstream (eg, not an FPGA by a startup) that can do inline TLS offload for QUIC.
There is work ongoing by a lot of folks to make this better. But at least for now, on the server side, Quic is roughly an order of magnitude less efficient than TCP.
I did some experiments last year for a talk I gave which approximated loosing the optimizations above. https://people.freebsd.org/~gallatin/talks/euro2022.pdf
For a video CDN type workload with static content, we'd go from being about to serve ~400Gb/s per single-core AMD "rome" based EPYC (with plenty of CPU idle) to less than 100Gb/s per server with the CPU maxed out.
Workloads where the content is not static and has to be touched already in userspace, things won't be so comparatively bad.
- The crypto is designed to prevent MiTM attacks, which also makes doing NIC crypto offload a lot harder.
Huh? Surely what you're doing in the accelerated path is just AES encryption/ decryption with a parameterised key which can't be much different from TLS?
- sendfile() cannot be done with QUIC, since the QUIC stack runs in userspace. That means that data must be read into kernel memory, copied to the webserver's memory, then copied back into the kernel, then sent down to the NIC. Worse, if crypto is not offloaded, userspace also needs to encrypt the data.
- LSO/LRO are (mostly) not implemented in hardware for QUIC, meaning that the NIC is sent 1500b packets, rather than being sent a 64K packet that it segments down to 1500b.
- The crypto is designed to prevent MiTM attacks, which also makes doing NIC crypto offload a lot harder. I'm not currently aware of any mainstream (eg, not an FPGA by a startup) that can do inline TLS offload for QUIC.
There is work ongoing by a lot of folks to make this better. But at least for now, on the server side, Quic is roughly an order of magnitude less efficient than TCP.
I did some experiments last year for a talk I gave which approximated loosing the optimizations above. https://people.freebsd.org/~gallatin/talks/euro2022.pdf For a video CDN type workload with static content, we'd go from being about to serve ~400Gb/s per single-core AMD "rome" based EPYC (with plenty of CPU idle) to less than 100Gb/s per server with the CPU maxed out.
Workloads where the content is not static and has to be touched already in userspace, things won't be so comparatively bad.