Hacker News

Author of the OpenSSL QUIC stack here. Great writeup.

TBQH, I'm actually really pleased with these performance figures - we haven't had time yet to do this kind of profiling or make any optimisations. So what we're seeing here is the performance prior to any kind of concerted measurement or optimisation effort on our part. In that context I'm actually very pleasantly surprised at how close things are to existing, more mature implementations in some of these benchmarks. Of course there's now plenty of tuning and optimisation work to be done to close this gap.




I'm curious if you've architected it in such a way that it lends itself to optimization in the future? I'd love to hear more about how these sorts of things are planned, especially in large C projects.


As much as possible, yes.

With something like QUIC "optimisation" breaks down into two areas: performance tuning in terms of algorithms, and tuning for throughput or latency in terms of how the protocol is used.

The first part is actually not the major issue, at least in our design everything is pretty efficient and designed to avoid unnecessary copying. Most of the optimisation I'm talking about above is not about things like CPU usage but things like tuning loss detection, congestion control and how to schedule different types of data into different packets. In other words, a question of tuning to make more optimal decisions in terms of how to use the network, as opposed to reducing the execution time of some algorithm. These aren't QUIC specific issues but largely intrinsic to the process of developing a transport protocol implementation.

It is true that QUIC is intrinsically less efficient than say, TCP+TLS in terms of CPU load. There are various reasons for this, but one is that QUIC performs encryption per packet, whereas TLS performs encryption per TLS record, where one record can be larger than one packet (which is limited by the MTU). I believe there's some discussion ongoing on possible ways to improve on this.
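For a rough sense of the per-packet vs per-record difference, here's a back-of-envelope sketch (the packet and record sizes below are illustrative assumptions, not what any particular stack uses):

```python
# Back-of-envelope comparison: AEAD seals needed to move 1 MiB when
# encrypting per QUIC packet vs per TLS record. All sizes are
# illustrative assumptions.
AEAD_TAG = 16             # bytes of authentication tag per seal
QUIC_PAYLOAD = 1200       # assumed usable payload per QUIC packet (MTU-bound)
TLS_RECORD = 16 * 1024    # maximum TLS record plaintext size

def seals_and_tag_bytes(total, chunk):
    """AEAD invocations and total tag overhead for `total` payload bytes."""
    ops = -(-total // chunk)   # ceiling division
    return ops, ops * AEAD_TAG

total = 1 << 20  # 1 MiB
quic_ops, quic_tags = seals_and_tag_bytes(total, QUIC_PAYLOAD)
tls_ops, tls_tags = seals_and_tag_bytes(total, TLS_RECORD)
print(quic_ops, tls_ops)  # 874 vs 64 encryption calls for the same data
```

It's the per-call setup cost (nonce construction, header protection, syscall boundaries) that multiplies, not just the tag bytes.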

There are also features which can be added to enhance performance, like UDP GSO, or extensions like the currently in development ACK frequency proposal.
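For the curious, the UDP GSO knob mentioned above is a Linux socket option; a minimal sketch of setting it (the option number and segment size here are assumptions about a Linux >= 4.18 system, not OpenSSL's code):

```python
import socket

# UDP_SEGMENT is Linux-specific and may be missing from the socket
# module; 103 is its value in the kernel headers.
UDP_SEGMENT = getattr(socket, "UDP_SEGMENT", 103)

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# Ask the kernel to split each large send() buffer into 1200-byte
# datagrams itself, so one syscall can cover dozens of packets.
s.setsockopt(socket.IPPROTO_UDP, UDP_SEGMENT, 1200)
seg = s.getsockopt(socket.IPPROTO_UDP, UDP_SEGMENT)
s.close()
```

The segment size can also be supplied per-sendmsg() as a cmsg, which is what lets a stack vary it with the path MTU.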


Actually the benchmarks just measure the first part (CPU efficiency), since it's a localhost benchmark. The gap will most likely be due to missing GSO, if that's not implemented. It's such a huge difference, and pretty much the only thing that can prevent QUIC from being totally inefficient.


Thank you for the details!


Thank you kindly for your work. These protocols are critically important, and the more high-quality, open implementations exist, the more likely they are to remain free and inclusive.

Also, hat tip for such competitive performance on an untuned implementation.


Are there good reasons to use HTTP3/QUIC that aren't based on performance?


I suppose that depends on your definitions of "good" and what counts as being "based on performance". For instance, QUIC and HTTP/3 support better reliability via things like FEC and connection migration. You can resume a session on a different network (think going from Wi-Fi to cellular or similar) instead of recreating the session, and FEC can make the delivery of messages more reliable. At the same time, you could argue both of these ultimately just impact performance, depending on how you choose to measure them.

Something more agreeably not performance-based is that the security is better, e.g. more of the conversation is enclosed in encryption at the protocol layer. Whether that's a good reason depends on who you ask, though.


We need to distinguish between performance (throughput over a congested/lossy connection) and efficiency (CPU and memory usage). QUIC can achieve higher performance, but will always be less efficient. The linked benchmark actually just measures efficiency, since it's about sending data over loopback on the same host.


What makes QUIC less efficient in CPU and memory usage?


QUIC throws away roughly 40 years of performance optimizations that operating systems and network card vendors have done for TCP. For example (on the server side):

- sendfile() cannot be done with QUIC, since the QUIC stack runs in userspace. That means that data must be read into kernel memory, copied to the webserver's memory, then copied back into the kernel, then sent down to the NIC. Worse, if crypto is not offloaded, userspace also needs to encrypt the data.

- LSO/LRO are (mostly) not implemented in hardware for QUIC, meaning that the NIC is sent 1500b packets, rather than being sent a 64K packet that it segments down to 1500b.

- The crypto is designed to prevent MiTM attacks, which also makes doing NIC crypto offload a lot harder. I'm not currently aware of any mainstream NIC (eg, not an FPGA by a startup) that can do inline crypto offload for QUIC.
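To make the sendfile() point concrete, here's a minimal sketch of the path TCP servers get for free (an AF_UNIX socketpair stands in for a real TCP connection; this is an illustration, not any CDN's actual setup):

```python
import os
import socket
import tempfile

# Serve file bytes without copying them through userspace buffers:
# sendfile() moves the data kernel-to-kernel. A userspace QUIC stack
# can't use this path, since it has to read, encrypt and packetize
# the data itself first.
data = b"x" * 16384
with tempfile.TemporaryFile() as f:
    f.write(data)
    f.flush()
    src, dst = socket.socketpair()   # stands in for a TCP connection
    sent = os.sendfile(src.fileno(), f.fileno(), 0, len(data))
    src.close()                      # EOF for the receiver
    chunks = []
    while True:
        chunk = dst.recv(65536)
        if not chunk:
            break
        chunks.append(chunk)
    dst.close()
    received = b"".join(chunks)
```

With kTLS the kernel can even encrypt on that path (or hand it to the NIC), which is exactly the machinery QUIC forgoes.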

There is work ongoing by a lot of folks to make this better. But at least for now, on the server side, QUIC is roughly an order of magnitude less efficient than TCP.

I did some experiments last year for a talk I gave which approximated losing the optimizations above. https://people.freebsd.org/~gallatin/talks/euro2022.pdf For a video CDN type workload with static content, we'd go from being able to serve ~400Gb/s per single-socket AMD "Rome" based EPYC (with plenty of CPU idle) to less than 100Gb/s per server with the CPU maxed out.

For workloads where the content is not static and already has to be touched in userspace, the comparison won't be as bad.


> The crypto is designed to prevent MiTM attacks, which also makes doing NIC crypto offload a lot harder.

Huh? Surely what you're doing in the accelerated path is just AES encryption/ decryption with a parameterised key which can't be much different from TLS?


Among others: having to hand 1200-1500 byte packets individually to the kernel, which will route and filter (iptables, nftables, eBPF) each of them individually, instead of acting on much bigger data chunks as it can for TCP. With GSO it gets a bit better, but it's still far off from what can be done for TCP.

Then there's the userspace work of assembling and encrypting all these tiny packets individually, and of looking up the right data structures (connections, streams).

And there are challenges in load-balancing multiple QUIC connections or streams across CPU cores. If only one core dequeues UDP datagrams for all connections on an endpoint, those will be bottlenecked by that core, whereas for TCP the kernel and drivers can already spread the work across multiple receive queues and threads. And while one can run multiple sockets and threads with port reuse, that poses other challenges if a packet for a certain connection gets routed to the wrong thread due to connection migration. There are solutions for that too, e.g. in the form of sophisticated eBPF programs, but they require a lot of work and are hard to apply for regular users who just want to use QUIC as a library.
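The port-reuse setup described above looks roughly like this (a sketch; real servers pin one such socket per worker thread, and may attach an eBPF program via SO_ATTACH_REUSEPORT_EBPF to control the steering):

```python
import socket

def reuseport_udp_socket(port):
    """One UDP socket per worker thread, all bound to the same port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Must be set on every socket before bind() for the group to form.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    return s

worker_a = reuseport_udp_socket(0)       # kernel picks a free port
port = worker_a.getsockname()[1]
worker_b = reuseport_udp_socket(port)    # same port, second receive queue
# The kernel now hashes incoming 4-tuples across the two sockets --
# which is exactly what breaks under connection migration: the hash
# changes when the client's address does.
```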


TCP has at least one unfixable security exploit: literally anybody on the network can reset your connection. Availability is 1/3 of security, remember.
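What the reset looks like from the victim's side can be shown locally (here SO_LINGER with timeout 0 makes close() emit a RST; an off-path attacker achieves the same by forging that segment with a guessed sequence number):

```python
import socket
import struct

# Build a loopback TCP connection, then tear one end down with a RST.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
cli = socket.socket()
cli.connect(srv.getsockname())
conn, _ = srv.accept()

# SO_LINGER with l_onoff=1, l_linger=0: close() sends RST, not FIN.
conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
conn.close()

try:
    cli.recv(1)
    reset = False
except ConnectionResetError:
    reset = True          # the peer's connection is simply gone
cli.close()
srv.close()
```

QUIC closes this hole by authenticating everything after the handshake, so an off-path attacker can't manufacture a valid close.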


How awesome is that? Thank you for all your hard work. It's thanks to people such as yourself that the whole world keeps on working.

Obligatory:

https://xkcd.com/2347/



