- syscall interfaces are a mess, the primitive APIs are too slow for regular-sized packets (~1500 bytes), and the overhead is too high. GSO helps, but it's a horrible API, and it's been buggy even lately due to complexity and poor code standards.
- the syscall costs got even higher with spectre mitigations, and this story likely isn't over. We need a replacement for the BSD sockets / POSIX APIs; they're terrible this decade. Yes, uring is fancy, but there's a tutorial level API middle ground possible that should be safe and 10x less overhead without resorting to uring level complexity.
- system udp buffers are far too small by default - they're much, much smaller than their tcp siblings, essentially no one but experts has been using them, and experts just retune stuff.
- udp stack optimizations are possible (such as possible route lookup reuse without connect(2)), gso demonstrates this, though as noted above gso is highly fallible, quite expensive itself, and the design is wholly unnecessarily intricate for what we need, particularly as we want to do this safely from unprivileged userspace.
- several optimizations currently available only work at low/mid-scale, such as connect binding to (potentially) avoid route lookups / GSO only being applicable on a socket without high peer-competition (competing peers result in short offload chains due to single-peer constraints, eroding the overhead wins).
Despite all this, you can implement GSO and get substantial performance improvements; we (tailscale) have on Linux. There will be a need at some point for platforms to increase platform-side buffer sizes for lower end systems, high load/concurrency, BDP and so on, but buffers and congestion control are a highly complex and sometimes quite sensitive topic - nonetheless, when you have many applications doing this (presumed future state), there will be a need.
> Yes, uring is fancy, but there’s a tutorial level API middle ground possible that should be safe and 10x less overhead without resorting to uring level complexity.
I don't think io_uring is as complex as its reputation suggests. I don't think we need a substantially simpler low-level API; I think we need more high-level APIs built on top of io_uring. (That will also help with portability: we need APIs that can be most efficiently implemented atop io_uring but that work on non-Linux systems.)
> I don't think io_uring is as complex as its reputation suggests.
uring is extremely problematic to integrate into many common application / language runtimes and it has been demonstrably difficult to integrate into linux safely and correctly as well, with a continual stream of bugs, security and policy control issues.
in principle a shared memory queue is a reasonable basis for improving the IO cost between applications and IO stacks such as the network or filesystem stacks, but this isn't easy to do well, cf. uring bugs and binder bugs.
One, uring is not extremely problematic to integrate, as it can be chained into a conventional event loop if you want to, or can even be fit into a conventionally blocking design to get localized syscall benefits. That is, you do not need to convert to a fully uring event loop design, even if that would be superior - and it can usually be kept entirely within a (slightly modified) event loop abstraction. The reason it has not yet been implemented is just priority - most stuff isn't bottlenecked on IOPS.
Two, yes, you could have a middle ground. I assume the syscall overhead you call out is the need to send UDP packets one at a time through sendmsg/sendto, rather than doing one big write for several packets' worth of data as on TCP. An API that allowed you to provide a chain of messages, like sendmsg takes an iovec for data, is possible. But it's also possible to do this already as a tiny blocking wrapper around io_uring, saving you new syscalls.
There's still the problem of sending to multiple destinations: OK, sendmmsg() can send multiple datagrams, but for a given socket. When you have small windows (thank you, cubic), you'll just send a few datagrams this way and won't save much.
> There's still the problem of sending to multiple destinations: OK sendmmsg() can send multiple datagrams, but for a given socket.
Hmm? sendmsg takes the destination address in the `struct msghdr` structure, and sendmmsg takes an array of those structures.
At the same time, the discussion of efficiency is about UDP vs. TCP. TCP writes are per socket, to the connected peer, and so UDP has the upper hand here. The concerns were about how TCP allows giving a large buffer to the kernel in a single write that then gets sliced into smaller packets automatically, vs. having to slice it in userspace and call send more often, which sendmmsg solves.
(You can of course do single-syscall or even zero-syscall "send to many" with io_uring for any socket type, but that's a different discussion.)
> > There's still the problem of sending to multiple destinations: OK sendmmsg() can send multiple datagrams, but for a given socket.
> Hmm? sendmsg takes the destination address in the `struct msghdr` structure, and sendmmsg takes an array of those structures.
But that's still pointless on a connected socket. And if you're not using connected sockets, you're performing destination lookups for each and every datagram you're trying to send. It also means you're running with small buffers by default (the 212kB default buffers per socket are shared with all your destinations, no longer per destination). Thus normally you want to use connected socket when dealing with UDP in environments having performance requirements.
At one point, if I remember correctly, it didn't actually work: it still just sent one message at a time and returned the length of the first piece of the iovec. Hopefully it got fixed.
I think you need to look at a common use case and consider how many syscalls you'd like it to take and how many CPU cycles would be reasonable.
Let's take downloading a 1MB jpeg image over QUIC and rendering it on the screen.
I would hope that can be done in about 100k CPU cycles and 20 syscalls, considering that all the jpeg decoding and rendering is going to be hardware accelerated. The decryption is also hardware accelerated.
Unfortunately, no network API allows that right now. The CPU needs to do a substantial amount of processing for every individual packet, in both userspace and kernel space, for receiving the packet and sending the ACK, and there is no 'bulk decrypt' non-blocking API.
Even the data path is troublesome - there should be a way for the data to go straight from the network card to the GPU, with the CPU not even touching it, but we're far from that.
1. A 1 MB file is at the very least 64 individually encrypted TLS records (16k max size) sent in sequence, possibly more. So decryption 64 times is the maximum amount of bulk work you can do - this is done to allow streaming verification and decryption in parallel with the download, whereas one big block would have you wait for the very last byte before any processing could start.
2. TLS is still userspace and decryption does not involve the kernel, and thus no syscalls. The benefits of kernel TLS largely focus on servers sending files straight from disk, bypassing userspace for the entire data processing path. This is not really relevant receive-side for something you are actively decoding.
3. JPEG is, to my knowledge, rarely hardware offloaded on desktop, so no syscalls there.
Now, the number of actual syscalls ends up being dictated by the speed of the sender and the tunable receive buffer size. The slower the sender, the more kernel roundtrips you end up with, which allows you to amortize the processing over a longer period so everything is ready when the last packet is. For a fast enough sender with big enough receive buffers, this could be a single kernel roundtrip.
JPEG is not a particularly great example. However, most video streams are partially hardware decoded. Usually you still need to decode part of the stream, namely entropy coding and metadata, first on the CPU.
I find this surprising, given that my initial response to reading the iouring design was:
1. This is pretty clean and straightforward.
2. This is obviously what we need to decouple a bunch of things without the previous downsides.
What has made it so hard to integrate it into common language runtimes? Do you have examples of where there's been an irreconcilable "impedance mismatch"?
in the most general form: you need a fairly "loose" memory model to integrate the "best" (performance wise) parts, and the "best" (ease of use/forward looking safety) way to integrate requires C library linkage. This is troublesome in most GC languages, and many managed runtimes. There's also the issue that uring being non-portable means that the things it suggests you must do (such as say pinning a buffer pool and making APIs like read not immediate caller allocates) requires a substantially separate API for this platform than for others, or at least substantial reworks over all the existing POSIX modeled APIs - thus back to what I said originally, we need a replacement for POSIX & BSD here, broadly applied.
I can see how a zero-copy API would be hard to implement in some languages, but you could still implement something on top of io_uring with POSIX buffer-copy semantics, while using batching to decrease syscall overhead.
Zero-copy APIs will necessarily be tricky to implement and use, especially on memory safe languages.
I think most GC languages support native/pinned memory (at least Java and C# do) to support talking to the kernel or native libraries.
The APIs are even quite nice.
Java's off-heap memory and memory segment API is quite dreadful and on the slower side. C# otoh gives you easy and cheap object pinning, malloc/free and stack-allocated buffers.
Rust's async model can support io-uring fine, it just has to be a different API based on ownership instead of references. (That's the conclusion of my posts you link to.)
> with a continual stream of bugs, security and policy control issues
This has not been true for a long time. There was an early design mistake that made it quite prone to these, but that mistake has been fixed. Unfortunately, the reputational damage will stick around for a while.
This conversation would be a good one to point them to to show that their policy is not just harmless point-proving, but in fact does cause harm.
For context, to the best of my knowledge the current approach of the Linux CNA is, in keeping with long-standing Linux security policy of "every single fix might be a security fix", to assign CVEs regardless of whether something has any security impact or not.
This is completely false. The CVE website defines these very clearly:
> The mission of the CVE® Program is to identify, define, and catalog publicly disclosed cybersecurity vulnerabilities [emphasis mine].
In fact, CVE stands for "Common Vulnerabilities and Exposures", again showing that CVE == security issue.
It's of course true that just because your code has an unpatched CVE doesn't automatically mean that your system is vulnerable - other mitigations can be in place to protect it.
That's the modern definition, which is rewriting history. Let's look at the actual, original definition:
> The CVE list aspires to describe and name all publicly known facts about computer systems that could allow somebody to violate a reasonable security policy for that system
There's also a decision from the editorial board on this, which said:
> Discussions on the Editorial Board mailing list and during the CVE Review meetings indicate that there is no definition for a "vulnerability" that is acceptable to the entire community. At least two different definitions of vulnerability have arisen and been discussed. There appears to be a universally accepted, historically grounded, "core" definition which deals primarily with specific flaws that directly allow some compromise of the system (a "universal" definition). A broader definition includes problems that don't directly allow compromise, but could be an important component of a successful attack, and are a violation of some security policies (a "contingent" definition).
> In accordance with the original stated requirements for the CVE, the CVE should remain independent of multiple perspectives. Since the definition of "vulnerability" varies so widely depending on context and policy, the CVE should avoid imposing an overly restrictive perspective on the vulnerability definition itself.
Under this definition, any kernel bug that could lead to user-space software acting differently is a CVE. Similarly, all memory management bugs in the kernel justify a CVE, as they could be used as part of an exploit.
Those two links say that CVEs can be one of two categories: universal vulnerabilities or exposures. But the examples of exposures are not, in any way, "any bug in the kernel". They give specific examples of things which are known to make a system more vulnerable to attack, even if not everyone would agree that they are a problem.
So yes, any CVE is supposed to be a security problem, and it has always been so. Maybe not for your specific system or for your specific security posture, but for someone's.
Extending this to any bugfix is a serious misunderstanding of what an "exposure" means, and it is a serious difference from other CNAs. Linux CNA-assigned CVEs just can't be taken as seriously as normal CNAs.
Nowadays the vast majority of CVEs have nothing to do with security; they're just Curriculum Vitae Enhancers, i.e. a student finding that "with my discovery, if A, B, C and D were granted, I could possibly gain some privileges", despite A/B/C/D being mutually exclusive. It's an everyday job for security people to sort out that garbage. So what the kernel does is not worse at all.
That’s definitely not the understanding that literally anyone outside the Linux team has for what a CVE is, including the people who came up with them and run the database. Overloading a well-established mechanism of communicating security issues to just be a registry of Linux bugs is an abuse of an important shared resource. Sure “anything could be a security issue” but in practice, most bugs aren’t, and putting meaningless bugs into the international security issue database is just a waste of everyone’s time and energy to make a very stupid point.
Then check out these definitions, from 2000, defined by the CVE editorial board:
> The CVE list aspires to describe and name all publicly known facts about computer systems that could allow somebody to violate a reasonable security policy for that system
As well as:
> Discussions on the Editorial Board mailing list and during the CVE Review meetings indicate that there is no definition for a "vulnerability" that is acceptable to the entire community. At least two different definitions of vulnerability have arisen and been discussed. There appears to be a universally accepted, historically grounded, "core" definition which deals primarily with specific flaws that directly allow some compromise of the system (a "universal" definition). A broader definition includes problems that don't directly allow compromise, but could be an important component of a successful attack, and are a violation of some security policies (a "contingent" definition).
> In accordance with the original stated requirements for the CVE, the CVE should remain independent of multiple perspectives. Since the definition of "vulnerability" varies so widely depending on context and policy, the CVE should avoid imposing an overly restrictive perspective on the vulnerability definition itself.
Under this definition, any kernel bug that could lead to user-space software acting differently is a CVE. Similarly, all memory management bugs in the kernel justify a CVE, as they could be used as part of an exploit.
> important component of a successful attack, and are a violation of some security policies
If the kernel returned random values from gettime, that'd lead to tls certificate validation not being reliable anymore. As a result, any bug in gettime is certainly worthy of a CVE.
If the kernel shuffled filenames so they'd be returned backwards, apparmor and selinux profiles would break. As a result, that'd be worthy of a CVE.
If the kernel has a memory corruption, use after free, use of uninitialized memory or refcounting issue, that's obviously a violation of security best practices and can be used as component in an exploit chain.
Can you now see how almost every kernel bug can and most certainly will be turned into a security issue at some point?
> All of these are talking about security issues, not "acting differently".
Because no system has ever been taken down by code that behaved differently from what it was expected to do? Right? Like http desync attacks, sql escape bypasses, ... . Absolutely no security issue is going to be caused by a very minor and by itself very secure difference in behavior.
As detailed in my sibling reply, by definition that includes any bug in gettime (as that'd affect tls certificate validation), any bug in a filesystem (as that'd affect loading of selinux/apparmor profiles), any bug in eBPF (as that'd affect network filtering), etc.
Additionally, any security bug in the kernel itself, so any use after free, any refcounting bug, any use of uninitialized memory.
Can you now see why pretty much every kernel bug fulfills that definition?
See the context I added to that comment; this is not about security issues, it's about the Linux CNA's absurd approach to CVE assignment for things that aren't CVEs.
I don't agree that it's absurd. I would say it reflects a proper understanding of their situation.
You've doubtless heard Tony Hoare's "There are two ways to write code: write code so simple there are obviously no bugs in it, or write code so complex that there are no obvious bugs in it.". Linux is definitely in the latter category, it's now such a sprawling system that determining whether a bug "really" has security implications is no long a reasonable task compared to just fixing the bug.
The other reason is that Linux is so widely used that almost no assumption made to simplify that above task is definitely correct.
I like CVEs, I think the Linux approach to CVEs is stupid, but also it was never meaningful to compare CVE counts. But I guess it's hard to make people stop doing that, and that's the reason Linux does the thing it does, out of spite.
As I understand it, they adopted this policy because the other policy was also causing harm.
They are right, by the way. When CVEs were used for things like Heartbleed they made sense - you could point to Heartbleed's CVE number and query various information systems about vulnerable systems. When every single possible security fix gets one, AND automated systems are checking that you've patched every single one or else you fail the audit (even ones completely irrelevant to the system, like an RCE on an embedded device with no internet access), the system is not doing anything useful - it's deleting value from the world and must be repaired or destroyed.
Well, the CVE system itself is only about assigning identifiers, and assigning identifiers unnecessarily couldn't possibly hurt anyone who isn't misusing the system, unless they're running out of identifiers.
this is a bit of a distraction; sure, the leaks and some of the deadlocks are fairly uninteresting, but the toctou bugs, overflows, uid races/confusion and so on are real issues that shouldn't be dismissed as if they don't exist.
FWIW, the biggest problem I've seen with efficiently using io_uring for networking is that none of the popular TLS libraries have a buffer ownership model that really is suitable for asynchronous network IO.
What you'd want is the ability to control the buffer for the "raw network side", so that asynchronous network IO can be performed without having to copy between a raw network buffer and buffers owned by the TLS library.
It also would really help if TLS libraries supported processing multiple TLS records in a batched fashion. Doing roundtrips between app <-> tls library <-> userspace network buffer <-> kernel <-> HW for every 16kB isn't exactly efficient.
async/await io_uring wrappers for languages such as Swift [1] and Rust [2] [3] can improve usability considerably. I'm not super familiar with the Rust wrappers, but I've been using IORingSwift for socket, file and serial I/O for some time now.
Hi, Tailscale person! If you want a fairly straightforward improvement you could make: Tailscale, by default, uses SOCK_RAW, and having any raw socket listening at all hurts receive performance systemwide.
It shouldn’t be particularly hard to port over the optimization that prevents this problem for SOCK_PACKET. I’ll get to it eventually (might be quite a while), but I only care about this because of Tailscale, and I don’t have a ton of bandwidth.
Historically there have been too many constraints on the Linux syscall interface:
- Performance
- Stability
- Convenience
- Security
This differs from eg. Windows because on Windows the stable interface to the OS is in user-space, not tied to the syscall boundary. This has resulted in unfortunate compromises in the design of various pieces of OS functionality.
Thankfully things like futex and io-uring have dropped the "convenience" constraint from the syscall itself and moved it into user-space. Convenience is still important, but it doesn't need to be a constraint at the lowest level, and shouldn't compromise the other ideals.
> Seems to me that the real problem is the 1500 byte MTU that hasn't increased in practice in over 40 years.
As per a sibling comment, 1500 is just for Ethernet (the default, jumbo frames being able to go to (at least) 9000). But the Internet is more than just Ethernet.
If you're on DSL, then RFC 2516 states that PPPoE's MTU is 1492 (and you probably want an MSS of 1452). The PPP, L2TP, and ATM AAL5 standards all have 16-bit length fields allowing for packets up to 64k in length. GPON ONT MTU is 2000. The default MTU for LTE is 1428. If you're on an HPC cluster, there's a good chance you're using Infiniband, which goes to 4096.
What size do you suggest everyone on the planet go to? Who exactly is going to get everyone to switch to the new value?
> What's the alternative? Making no improvements at all, forever?
No, sadly. The alternative is what the entire tech world has been doing for the past 15 years: shove "improvements" inside whatever crap we already have because nobody wants to replace the crap.
If IPv6 were made today, it would be tunneled inside an HTTP connection. All the new apps would adopt it, the legacy apps would be abandoned or have shims made, and the whole thing would be inefficient and buggy, but adopted. Since poking my head outside of the tech world and into the wider world, it turns out this is how most of the world works.
>If IPv6 were made today, it would be tunneled inside an HTTP connection. All the new apps would adopt it, the legacy apps would be abandoned or have shims made, and the whole thing would be inefficient and buggy, but adopted. Since poking my head outside of the tech world and into the wider world, it turns out this is how most of the world works.
What you're suggesting here wouldn't work: wrapping all the addressing information inside HTTP, which itself relies on IP for delivery, does not work. It would be the equivalent of sealing all the addressing information for a letter you'd like to send inside the envelope.
Providers would just do Carrier-grade NAT (as they do today) or another wonky solution with a tunnel into different networks as needed. IPv6 is still useful in different circumstances, particularly creating larger private networks. They could basically reimplement WireGuard, with the VPN software doubling as IPv6 router and interface provider. I'm not saying this is a great idea, but it is definitely what someone today would have done (with HTTP as the transport method) if IPv6 didn't exist.
The internet is mostly ethernet these days (ISP core/edge), last mile connections like DSL and cable already handle a smaller MTU so should be fine with a bigger one.
> Yes, uring is fancy, but there’s a tutorial level API middle ground possible that should be safe and 10x less overhead without resorting to uring level complexity.
And the kernel has no business providing this middle-layer API. Why should it? Let people grab whatever they need from the ecosystem. Networking should be like Vulkan: it should have a high-performance, flexible API at the systems level, with "easy to use" being a non-goal --- and higher-level facilities on top.
The kernel provides networking because it doesn't trust userspace to do it. If you provided a low level networking API you'd have to verify everything a client sends is not malicious or pretending to be from another process. And for the same reason, it'd only work for transmission, not receiving.
That, and nobody was able to get performant microkernels working at the time, so we ended up with everything in the monolithic kernel.
If you do trust the client processes then it could be better to just have them read/write IP packets though.
Also, it is really easy to do the normal IO "syscall wrappers" on top of io_uring instead, even easily exposing a very simple async/await variant of them that splits out the "block on completion (after which, just like normal IO, the data buffer has been copied into kernel space)" from the rest of the normal IO syscall, which allows pipelining & coalescing of requests.
"GSO gains performance by enabling upper layer applications to process a smaller number of large packets (e.g. MTU size of 64KB), instead of processing higher numbers of small packets (e.g. MTU size of 1500B), thus reducing per-packet overhead."
Generally today an Ethernet frame, which is the basic atomic unit of information over the wire, is limited to 1500 bytes (the MTU, or Maximum Transmission Unit).
If you want to send more - the IP layer allows for 64k bytes per IP packet - you need to split the IP packet into multiple (64k / 1500 plus some header overhead) frames. This is called segmentation.
Before GSO the kernel would do that splitting early, which takes buffering and CPU time to assemble the frame headers. GSO defers it as far down the stack as possible, ideally all the way to the ethernet hardware, which does essentially the same thing only hardware accelerated and without taking up a CPU core.
What you're describing is for TCP. On TCP you can perform a write(64kB) and see the stack slice it into 1460-byte segments. On UDP, if you write(64kB) you'll get a single 64kB packet composed of 45 fragments. Needless to say, it suffices that any one of them is lost in a buffer somewhere for the whole packet never to be received and all of them to be retransmitted by the application layer.
GSO on UDP allows the application to send a large chunk of data, indicating the MTU to be applied, and lets the kernel pass it down the stack as-is, until the lowest layer that can split it (network stack, driver or hardware). In this case they will make packets, not fragments. On the wire there will really be independent datagrams with different IP IDs. In this case, if any of them is lost, the other ones are still received and the application can focus on retransmitting only the missing one(s). In terms of route lookups, it's as efficient as fragmentation (since there's a single lookup) but it will ensure that what is sent over the wire is usable all along the chain, at a much lower cost than it would be to make the application send all of them individually.
Likely Generic Segmentation Offload (if memory serves), which is a generalization of TCP segmentation offload.
Basically (hyper simple), the kernel can lump stuff together when working with the network interface, which cuts down on ultra slow hardware interactions.
Of these the hardest one to deal with is route lookup caching and reuse w/o connect(2). Obviously the UDP connected TCB can cache that, but if you don't want a "connected" socket fd... then there's nowhere else to cache it except ancillary data, so ancillary data it would have to be. But getting return-to-sender ancillary data on every read (so as to be able to copy it to any sends back to the same peer) adds overhead, so that's not good.
A system call to get that ancillary data adds overhead that can be amortized by having the application cache it, so that's probably the right design, and if it could be combined with sending (so a new flavor of sendto(2)) that would be even better, and it all has to be uring-friendly.
The default UDP buffers of 212kB are indeed a big problem for every client at the moment. You can optimize your server as much as you want; all your clients will experience losses if they pause for half a millisecond to redraw a tab or update an image, just because the UDP buffers can only store so few packets. That's among the things that must urgently change if we want UDP to start working well on end-user devices.
Say what you want but I bet we'll see lots of eBPF modules being loaded in the future for the very reason you're describing. An ebpf quic module? Why not!
And that scares me, because there's not a single tool that has this on its radar for malware detection/prevention.
we can consider ebpf "a solution" when there's even a remote chance you'll be able to do it from an unentitled ios app. somewhat hyperbole, but the point is, this problem is a problem for userspace client applications, and bpf isn't a particularly "good" solution for servers either, it's high cost of authorship for a problem that is easily solvable with a better API to the network stack.
In the early days of QUIC, many people pointed out that the UDP stack has had far far less optimization put into it than the TCP stack. Sure enough, some of the issues identified here arise because the UDP stack isn't doing things that it could do but that nobody has been motivated to make it do, such as UDP generic receive offload. Papers like this are very likely to lead to optimizations both obvious and subtle.
What is UDP offload going to do? UDP barely does anything but queue and copy.
Linux scheduling from packet-received to thread-has-control is not real-time, and if the CPUs are busy, it may be rather slow. That's probably part of the bottleneck.
The embarrassing thing is that QUIC, even in Google's own benchmarks, only improved performance by about 10%. The added complexity probably isn't worth the trouble. However, it gave Google control of more of the stack, which may have been the real motivation.
Last I looked (several months ago), Linux's UDP stack did not seem well tuned in its memory management accounting.
For background, the mental model of what receiving network data looks like in userspace is almost completely backwards compared to how general-purpose kernel network receive actually works. User code thinks it allocates a buffer (per-socket or perhaps a fancier io_uring scheme), then receives packets into that buffer, then processes them.
The kernel is the other way around. The kernel allocates buffers and feeds pointers to those buffers to the NIC. The NIC receives packets and DMAs them into the buffers, then tells the kernel. But the NIC and the kernel have absolutely no concept of which socket those buffers belong to until after they are DMAed into the buffers. So the kernel cannot possibly map received packets to the actual recipient's memory. So instead, after identifying who owns a received packet, the kernel retroactively charges the recipient for the memory. This happens on a per-packet basis, it involves per-socket and cgroup accounting, and there is no support for having a socket "pre-allocate" this memory in advance of receiving a packet. So the accounting is gnarly, involves atomic operations, and seems quite unlikely to win any speed awards. On a very cursory inspection, the TCP code seemed better tuned, and it possibly also won by generally handling more bytes per operation.
Keep in mind that the kernel can't copy data to application memory synchronously -- the application memory might be paged out when a packet shows up. So instead the whole charging dance above happens immediately when a packet is received, and the data is copied later on.
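You can see the per-socket charging limit in action on loopback: once the recipient's receive-buffer charge is exhausted, further datagrams are dropped before the application ever runs, while the sender's syscalls keep succeeding. A hedged sketch (the buffer size and packet counts are arbitrary; the clamping-to-minimum behavior is Linux-specific):

```python
import socket

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# Ask for a 1-byte receive buffer; Linux clamps this up to its minimum (~2 KB),
# which is still only enough charge for a handful of small datagrams.
rx.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1)
rx.bind(("127.0.0.1", 0))

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.connect(rx.getsockname())

sent = 200
for _ in range(sent):
    tx.send(b"x" * 100)  # send() still succeeds; the drops happen receiver-side

rx.setblocking(False)
received = 0
try:
    while True:
        rx.recv(2048)
        received += 1
except BlockingIOError:
    pass

print(f"sent={sent} received={received}")
tx.close()
rx.close()
```

The gap between `sent` and `received` is exactly the accounting decision described above being made per packet, with no backpressure to the sender.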
For quite a long time, I've thought it would be nifty if there was a NIC that kept received data in its own RAM and then allowed it to be efficiently DMAed to application memory when the application was ready for it. In essence, a lot of the accounting and memory management logic could move out of the kernel into the NIC. I'm not aware of anyone doing this.
> For quite a long time, I've thought it would be nifty if there was a NIC that kept received data in its own RAM and then allowed it to be efficiently DMAed to application memory when the application was ready for it.
I wonder if we could do a more advanced version of receive-packet steering that sufficiently identifies packets as definitely for a given process and DMAs them directly to that process's pre-provided buffers for later notification? In particular, can we offload enough information to a smart NIC that it can identify where something should be DMAed to?
Most advanced NICs support flow steering, which makes the NIC write to different buffers depending on the target port.
In practice though, you only have a limited number of these buffers, and it causes complications if multiple processes need to consume the same multicast.
Multicast may well be shitcanned to an expensive slow path, given that multicast is rarely used for high bandwidth scenarios, especially when multiple processes need to receive the same packet.
With multiple processes listening for the data? I think that's a market niche.
In terms of billions of devices, multicast is mostly used for zero-config service discovery. I am not saying there isn't a market for high-bandwidth multicast, I am stating that for the vast majority of software deployments, multicast performance is not an issue. For whatever deployments it is an issue, they can specialize. And, as the sibling comment mentions, people who need breakneck speeds have already proven that they can create a market for themselves.
I don’t think the result would be compatible with the socket or io_uring API, but maybe io_uring could be extended a bit. Basically the kernel would opportunistically program a “flow director” or similar rule to send packets to special rx queue, and that queue would point to (pinned) application memory. Getting this to be compatible with iptables/nftables would be a mess or maybe entirely impossible.
I’ve never seen the accelerated steering stuff work well in practice, sadly. The code is messy, the diagnostics are basically nonexistent, and it’s not clear to me that many drivers support it well.
Of course you're going to get horrible latency because of speed-of-light limitations, so the definition of "work" may be weak, but data should be able to be transmitted.
GPUDirect relies on the PeerDirect extensions for RDMA and is thus an extension to the RDMA verbs, not a separate and independent thing that works without RDMA.
You can read/write GPU buffers with gpudev in DPDK, yes. It also uses some of the infrastructure that powers GPUDirect (namely the page pinning and address translation). Because you can use that addressable memory in DPDK buffer steering, you can have the NIC DMA to/from the GPU and then have a GPU kernel coordinate with your DPDK application. This will be pretty fast on a good lossless datacentre network but probably pretty awful over the Internet. In the DC, though, it will naturally be beaten by real GPUDirect on RDMA, as you don't need the DPDK coordinator and all tx/rx can be driven by the GPU kernel instead.
This isn't GPUDirect though, that is an actual product.
This is GPUDirect. GPUDirect is the technology that enables any third-party device to talk to a GPU (like a NIC).
> but probably pretty awful over the Internet. In the DC though it will be beaten by real GPUDirect on RDMA naturally
It's being used in many places successfully over the internet. RDMA is fine, but completely breaks the abstraction of services. In many places you do not want to know who is sending, or what address you are sending to or receiving from.
Why don't we eliminate the initial step of an app reserving a buffer, keep each packet in its own buffer, and once the socket it belongs to is identified hand a pointer and ownership of that buffer back to the app? If buffers can be of fixed (max) size, you could still allow the NIC to fill a bunch of them in one go.
Presuming that this is a server that has One (public) Job, couldn't you:
1. dedicate a NIC to the application;
2. and have the userland app open a packet socket against the NIC, to drink from its firehose through MMIO against the kernel's own NIC DMA buffer;
...all without involving the kernel TCP/IP (or in this case, UDP/IP) stack, and any of the accounting logic squirreled away in there?
(You can also throw in a BPF filter here, to drop everything except UDP packets with the expected ip:port — but if you're already doing more packet validation at the app level, you may as well just take the whole firehose of packets and validate them for being targeted at the app at the same time that they're validated for their L7 structure.)
I think DPDK does something like this. The NIC is programmed to aim the packets in question at a specific hardware receive queue, and that queue is entirely owned by a userspace program.
A lot of high end NICs support moderately complex receive queue selection rules.
I mean, under the scheme I outlined, the kernel is still going to do that by default. It's not that the NIC's driver is overridden or anything; the kernel would still be reading the receive buffer from this NIC and triggering per-packet handling — and thus triggering default kernel response-handling where applicable (and so responding to e.g. ARP and ICMP messages correctly.)
The only thing that's different here, is that there are no active TCP or UDP listening sockets bound to the NIC — so when the kernel is scanning the receive buffer to decide what to do with packets, and it sees a TCP or UDP packet, it's going to look at its connection-state table for that protocol+interface, realize it's empty, and drop the packet for lack of consumer, rather than doing any further logic to it. (It'll bump the "dropped packets" counter, I suppose, but that's it.)
But, since there is a packet socket open against the NIC, then before it does anything with the packet, it's going to copy every packet it receives into that packet socket's (userspace-shared) receive-buffer mmap region.
- 64 packets per syscall, which is enough data to amortize the syscall overhead - a single packet is not.
- UDP offload optionally lets you defer checksum computation, often offloading it to hardware.
- UDP offload lets you skip/reuse route lookups for subsequent packets in a bundle.
What UDP offload is no good for though, is large scale servers - the current APIs only work when the incoming packet chains neatly organize into batches per peer socket. If you have many thousands of active sockets you’ll stop having full bundles and the overhead starts sneaking back in. As I said in another thread, we really need a replacement for the BSD APIs here, they just don’t scale for modern hardware constraints and software needs - much too expensive per packet.
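Even without GSO/GRO, some of the per-packet overhead can be clawed back through portable APIs: connect(2) the UDP socket once, so the kernel can cache the destination (and potentially the route), and drain multiple datagrams per wakeup instead of taking one blocking recv per packet. A rough sketch of the idea (Linux's recvmmsg(2)/sendmmsg(2) batch much harder; this is the lowest-common-denominator version, and the 64-packet count is arbitrary):

```python
import socket

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
rx.settimeout(0.5)

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.connect(rx.getsockname())  # pin the destination: send() needs no per-call address

for i in range(64):
    tx.send(b"packet-%d" % i)

# Drain everything queued in one wakeup instead of one blocking recv per packet.
batch = []
try:
    while True:
        batch.append(rx.recv(2048))
except socket.timeout:
    pass

print(len(batch))
tx.close()
rx.close()
```

This only amortizes wakeups, not syscalls — each `recv` is still a kernel crossing — which is why the batched syscalls (recvmmsg, GSO bundles) matter so much at high packet rates.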
In my head the main benefit of QUIC was always multipath, aka the ability to switch interfaces on demand without losing the connection. There's MPTCP but who knows how viable it is.
I always thought the main benefit of QUIC was to encrypt the important part of the transport header, so endpoints control their own destiny, not some middle device.
If I had a dollar for every firewall vendor who thought dropping TCP retransmissions or TCP Reset was a good idea...
It requires explicit backend support, and Apple supports it for many of their services, but I've never seen another public API that does. Anyone have any examples?
Last I looked into this (many years ago), ELB/GLBs didn't support it on AWS/GCP respectively. That prevented us from further considering implementing it at the time (mobile app -> AWS-hosted EC2 instances behind an ELB).
Not sure if that's changed, but at the time it wasn't worth having to consider rolling our own LBs.
To answer your original question, no, I haven't (knowingly) seen it on any public APIs.
Among other things, GRO (receive offloading) means you can get more data off of the network card in fewer operations.
Linux has receive packet steering, which can help with getting packets from the network card to the right CPU and the right userspace thread without moving from one CPU's cache to another.
You mean Receive Flow Steering, and RFS can only control RPS, so to do it in hardware you actually mean Accelerated RFS (which requires a pretty fancy NIC these days).
Even ignoring the hardware requirement, unfortunately it's not that simple. I find results vary wildly whether you should put process and softirq on the same CPU core (sharing L1 and L2) or just on the same CPU socket (sharing L3 but don't constantly blow out L1/L2).
Eric Dumazet said years ago at a Netdev.conf that L1 cache sizes have really not kept up with reality. That matches my experience.
QUIC doing so much in userspace adds another class of application with a so-far-uncommon design pattern.
I don't think it's possible to say whether any QUIC application benefits from RFS or not.
Handling ACK packets in kernelspace would be one thing - helping, for example, RTT estimation. With a userspace stack, ACKs are handled in the application and are subject to the scheduler, suffering a lot on a loaded system.
There are no ACKs inherent in the UDP protocol, so "UDP offload" is not where the savings are.
There are ACKs in the QUIC protocol and they are carried by UDP datagrams which need to make their way up to user land to be processed, and this is the crux of the issue.
What is needed is for QUIC offload to be invented/supported by HW so that most of the high-frequency/tiny-packet processing happens there, just as it does today for TCP offload. TCP large-send and large-receive offload is what is responsible for all the CPU savings as the application deals in 64KB or larger send/receives and the segmentation and receive coalescing all happen in hardware before an interrupt is even generated to involve the kernel, let alone userland.
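The "deal in 64 KB at a time" point is visible even at the socket API level: with TCP the application hands the kernel one large buffer and segmentation happens below it (in the kernel, or in the NIC with TSO/LRO), whereas a QUIC implementation must itself produce and send every ≤1500-byte datagram. A small loopback sketch of the TCP side of that contrast (sizes arbitrary):

```python
import socket
import threading

listener = socket.create_server(("127.0.0.1", 0))
port = listener.getsockname()[1]

def serve():
    conn, _ = listener.accept()
    conn.sendall(b"x" * 65536)  # one 64 KB write; segmentation happens below us
    conn.close()

t = threading.Thread(target=serve)
t.start()

client = socket.create_connection(("127.0.0.1", port))
total = 0
while True:
    chunk = client.recv(65536)  # the stream arrives in whatever chunks the stack chose
    if not chunk:
        break
    total += len(chunk)

t.join()
client.close()
listener.close()
print(total)
```

Neither endpoint ever sees or handles an individual segment; a userspace QUIC stack gets no such luxury, which is what hardware QUIC offload would have to restore.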
Bulk throughput isn't on par with TLS mainly because NICs with dedicated hardware for QUIC offload aren't commercially available (yet). Latency is undoubtedly better - the 1-RTT QUIC handshake substantially reduces time-to-first-byte compared to TLS.
I think one of the original drivers was the ability to quickly tweak parameters, after Linux rejected what I think was userspace adjustment of window sizing to be more aggressive than the default.
The Linux maintainers didn't want to be responsible for congestion collapse, but UDP lets you spray packets from userspace, so Google went with that.
The solution isn't more UDP offload optimizations: there are no expensive semantics in UDP itself, only the sheer quantity and frequency of datagrams to be processed in the context of the QUIC protocol that uses UDP as a transport. QUIC's state machine needs to see every UDP datagram carrying QUIC protocol messages in order to move forward. Just as was done for TCP offload more than twenty years ago, portions of QUIC state need to move into, and be maintained in, hardware to prevent the host from having to see so many high-frequency tiny state-update messages.
Unless I’m missing something here, pretty much any Intel NIC released in the past decade should support TCP offload. I imagine the same is true for Broadcom and other vendors as well, but I don’t have something handy to check.
> Which end-user network cards that I can buy can do TCP offloading?
Intel's I210 controllers support offloading:
> Other performance-enhancing features include IPv4 and IPv6 checksum offload, TCP/UDP checksum offload, extended Tx descriptors for more offload capabilities, up to 256 KB TCP segmentation (TSO v2), header splitting, 40 KB packet buffer size, and 9.5 KB Jumbo Frame support.
Practically every on-board network adapter I've had for over a decade has had TCP offload support. Even the network adapter on my cheap $300 Walmart laptop has hardware TCP offload support.
The whole reason QUIC even exists in user space is because its developers were trying to hack a quick speed-up to HTTP rather than actually do the work to improve the underlying networking fundamentals. In this case the practicalities seem to have caught them out.
If you want to build a better TCP, do it. But hacking one in on top of UDP was a cheat that didn’t pay off. Well, assuming performance was even the actual goal.
It already exists, it's called SCTP. It doesn't work over the Internet because there's too much crufty hardware in the middle that will drop it instead of routing it. Also, Microsoft refused to implement it in Windows and also banned raw sockets so it's impossible to get support for it on that platform without custom drivers that practically nobody will install.
I don't know how familiar the developers of QUIC were with SCTP in particular but they were definitely aware of the problems that prevented a better TCP from existing. The only practical solution is to build something on top of UDP, but if even that option proves unworkable, then the only other possibility left is to fragment the Internet.
I like (some aspects of) SCTP too but it's not a solution to this problem.
If you've followed Dave Taht's bufferbloat stuff, the reason he lost faith in TCP is because middle devices have access to the TCP header and can interfere with it.
If SCTP got popular, then middle devices would ruin SCTP in the same way.
QUIC is the bufferbloat preferred solution because the header is encrypted. It's not possible for a middle device to interfere with QUIC. Endpoints, and only endpoints, control their own traffic.
They couldn't have built it on anything but UDP because the world is now filled with poorly designed firewall/NAT middleboxes which will not route things other than TCP, UDP and optimistically ICMP.
Counterpoint: it is paying off, just taking a while. This paper wasn't "QUIC is bad", it was "OSes need more optimization for QUIC to be as fast as HTTPS".
I think this is slightly wrong. The goal was to be faster without requiring OS/middleware support. Optimizing the OSes that need high performance is much easier, since that's a much smaller set of OSes (basically just Linux/macOS/Windows).
Yeah they probably wanted a protocol that would actually work on the wild internet with real firewalls and routers and whatnot. The only option if you want that is building on top of UDP or TCP and you obviously can't use TCP.
Your first point is correct - papers ideally lead to innovation and tangible software improvements.
I think a kernel implementation of QUIC is the next logical step. A context switch to decrypt a packet header and send control traffic is just dumb. That's the kernel's job.
Userspace network stacks have never been a good idea. QUIC is no different.
(edit: Xin Long already has started a kernel implementation, see elsewhere on this page)
Even HTTP/2 seems to have been rushed[1]. Chrome has removed support for server push. Maybe more thought should be put into these protocols instead of just rebranding whatever Google is trying to impose on us.
HTTP2 was a prototype that was designed by people who either assumed that mobile internet would get better much quicker than it did, or who didn't understand what packet loss did to throughput.
I suspect part of the problem is that some of the rush is that people at major companies will get a promotion if they do "high impact" work out in the open.
HTTP/2 "solves head of line blocking" which is doesn't. It exchanged an HTTP SSL blocking issues with TCP on the real internet issue. This was predicted at the time.
The other issue is that instead of keeping it a simple protocol, the temptation to add complexity to aid a specific use case becomes too much. (It's human nature; I don't blame them.)
H/2 doesn't solve blocking at the TCP level, but it solved another kind of blocking at the protocol level by having multiplexing.
H/1 pipelining was unusable, so H/1 had to wait for a response before sending the next request, which added a ton of latency, and made server-side processing serial and latency-sensitive. The solution to this was to open a dozen separate H/1 connections, but that multiplied setup cost, and made congestion control worse across many connections.
> it solved another kind of blocking on the protocol level
Indeed! and it works well on low latency, low packet loss networks. On high packet loss networks, it performs worse than HTTP1.1. Moreover it gets increasingly worse the larger the page the request is serving.
We pointed this out at the time, but were told that we didn't understand the web.
> H/1 pipelining was unusable,
Yup, but think how easy it would be to create HTTP/1.2 with a better spec for pipelining. (But then why not make changes to other bits as well? Soon we get HTTP2!) Of course pipelining only really works in a low packet loss network, because you get head of line blocking.
> open a dozen separate H/1 connections, but that multiplied setup cost
Indeed, that SSL upgrade is a pain in the arse. But connections are cheap to keep open. So with persistent connections and pooling it's possible to really nail down the latency.
Personally, I think the biggest problem with HTTP is that it's a file access protocol, a state interchange protocol and an authentication system all at once. I would tentatively suggest that we adopt websockets to do state (with some extra features like optional schema sharing {yes, I know that's a bit of an anathema}), make http4 a proper file sharing protocol, and have a third system for authentication token generation, sharing and validation.
However the real world says that'll never work. So connection pooling over TCP with quick start TLS would be my way forward.
"HTTP is being used as a file access, state interchange and authentication transport system"
Ideally we would split them out into a dedicated file access, generic state pipe (ie websockets) and some sort of well documented, easy to understand, implement and secure authentication mechanism (how hard can that be!?)
But to your point: HTTP was always meant to be stateless. You issue a GET request to find an object at a URI. That object was envisaged to be a file (at least in HTTP 1.0 days). Only with the rise of CGI-bin in the mid-90s did that meaningfully change.
However I'm willing to bet that most of the traffic over HTTP is still files. Hence the assertion.
It's okay to make mistakes; that's how you learn and improve. Being conservative has drawbacks of its own. I'd argue we need more parties involved earlier in the process rather than just more time.
It's a weird balancing act. On the other hand, waiting for everyone to agree on everything means that the spec will take a decade or two for everyone to come together, and then all the additional time for everyone to actively support it.
AJAX is a decent example. Microsoft's Outlook Web Access team implemented XMLHTTP as an ActiveX thing for IE 5, and soon the rest of the vendors adopted it as a standard thing as XMLHttpRequest objects.
In fact, I suspect the list of things that exist in browsers because one vendor thought it was a good idea and everyone hopped on board is far, far longer than those designed by committee. Often times, the initially released version is not exactly the same that everyone standardized on, but they all get to build on the real-world consequences of it.
I happen to like the TC39 process https://tc39.es/process-document/ which requires two live implementations with use in the wild for something to get into the final stage and become an official part of the specification. It is obviously harder for something like a network stack than a JavaScript engine to get real world use and feedback, but it has helped to keep a lot of the crazier vendor specific features at bay.
It's okay to make mistakes, but it's not okay to ignore the broad consensus that HTTP2 was TERRIBLY designed and then admit it 10 years later as if it was unknowable. We knew it was bad.
This is a weak argument that simply caters to the ongoing HN hivemind opinion. While Google made the initial proposal, many other parties did participate in getting quic standardized. The industry at large was in favor.
IETF QUIC ended up substantially different from gQUIC. People who say Google somehow single-handedly pushed things through probably haven’t read anything along the standardization process, but of course everyone has to have an opinion about all things Google.
> we identify the root cause to be high receiver-side processing overhead
I find this to be the issue when it comes to Google, and I bet it was known beforehand: pushing processing to the user. For example, the AV1 video codec was deployed when no consumer had HW decoding capabilities. It saved them on space at the expense of increased CPU usage for the end-user.
I don't know what the motive was there; it would still show that they are carbon-neutral while billions of devices are busy processing the data.
> the AV1 video codec was deployed when no consumer had HW decoding capabilities
This was a bug. An improved software decoder was deployed for Android and for buggy reasons the YouTube app used it instead of a hardware accelerated implementation. It was fixed.
Having worked on a similar space (compression formats for app downloads) I can assure you that all factors are accounted for with decisions like this, we were profiling device thermals for different compression formats. Setting aside bugs, the teams behind things like this are taking wide-reaching views of the ecosystem when making these decisions, and at scale, client concerns almost always outweigh server concerns.
Well, I will say that if your servers are hit billions of times per day, offloading processing to the client when safe to do so starts to make sense financially. Google does not have to pay for your CPU or storage usage, etc.
Also, if said overhead is not too much, it's not that bad of a thing.
This is indeed an issue but it's widespread and everyone does it, including Google. Things like servers no longer generating actual dynamic HTML, replaced with servers simply serving pure data like JSON and expecting the client to render it into the DOM. It's not just Google that doesn't care, but the majority of web developers also don't care.
There's clearly advantages to writing a web app as an SPA, otherwise web devs wouldn't do it. The idea that web devs "don't care" (about what exactly?) really doesn't make any sense.
Moving interactions to JSON in many cases is just a better experience. If you click a Like button on Facebook, which is the better outcome: To see a little animation where the button updates, or for the page to reload with a flash of white, throw away the comment you were part-way through writing, and then scroll you back to the top of the page?
There's a reason XMLHttpRequest took the world by storm. More than that, jQuery is still used on more than 80% of websites due in large part to its legacy of making this process easier and cross-browser.
> To see a little animation where the button updates, or for the page to reload with a flash of white, throw away the comment you were part-way through writing, and then scroll you back to the top of the page
I don't understand how web devs understand the concept of loading and manipulating JSON to dynamically modify the page's HTML, but they don't understand the concept of loading and manipulating HTML to dynamically modify the page's HTML.
It's the same thing, except now you don't have to do a conversion from JSON->HTML.
There's no rule anywhere saying receiving HTML on the client should do a full page reload and throw up the current running javascript.
> XMLHttpRequest
This could've easily been HTMLHttpRequest and it would've been the same API, but probably better. Unfortunately, during that time period Microsoft was obsessed with XML. Like... obsessed obsessed.
Rendering JSON into HTML has nothing to do with XMLHttpRequest.
Funny that you mention jQuery. When jQuery was hugely popular, people used it to make XMLHttpRequests that returned HTML which you then set as the innerHTML of some element. Of course being jQuery, people used the shorthand of `$("selector").html(...)` instead.
In the heyday of jQuery the JSON.parse API didn't exist.
It's likely the authors used an existing conference template to fit in their paper's contents. Upon sending it to the conference, the editors can easily fit the contents in their prescribed format, and the authors know how many characters they can fit in the page limit.
arXiv typically contains pre-prints of papers. These may not have been peer-reviewed, and the contents may not reflect the actual "published" paper that was accepted (and/or corrected after peer review) to a conference or journal.
arXiv applies a watermark to the submitted PDF such that different versions are distinguishable on download.
>The results show that QUIC and HTTP/2 exhibit similar performance when the network bandwidth is relatively low (below ∼600 Mbps)
>Next, we investigate more realistic scenarios by conducting the same file download experiments on major browsers: Chrome, Edge, Firefox, and Opera. We observe that the performance gap is even larger than that in the cURL and quic_client experiments: on Chrome, QUIC begins to fall behind when the bandwidth exceeds ∼500 Mbps.
Okay, well, this isn't going to be a problem over the general Internet, it's more of a problem in local networks.
For people that have high-speed connections, how often are you getting >500Mbps from a single source?
Well, I have other issues with QUIC: when I access Facebook with QUIC, the site often loads the first page but then kind of hangs, forcing me to refresh the site, which is annoying. I didn’t know it was a problem with QUIC until I turned it off. Since then, FB & Co. load at the same speed, but don’t show this annoying behavior anymore!
As someone frequently at 150ms+ latency for a lot of websites (and semi-frequently 300ms+ for non-geo-distributed websites), in practice with the latency QUIC is easily the best for throughput, HTTP/1.1 with a decent number of parallel connections is a not-that-distant second, and in a remote third is HTTP/2 due to head-of-line-blocking issues if/when a packet goes missing.
Currently chewing my way laboriously through RFC9000. Definitely concerned by how complex it is. The high level ideas of QUIC seem fairly straight forward, but the spec feels full of edge cases you must account for. Maybe there's no other way, but it makes me uncomfortable.
I don't mind too much as long as they never try to take HTTP/1.1 from me.
I think keeping HTTP/1.1 is almost as important as not dropping IPv4 (there are good reasons for it not being possible to tag everything; it's harder to block a country than a user). For similar reasons we should keep old protocols.
I think it's just your little corner of the woods that isn't adopting it. Over here the trend is very clearly to move away from IPv4, except for legacy reasons.
Save for France/Germany (~75%) and USA/Mexico/Brazil (~50%), the rest of the world is not really adopting it... Even in Europe, Spain has only ~10% and Poland ~17% penetration. But yeah... let's be dismissive with "your little corner"...
The important milestone is when it's safe to turn IPv4 off. And that's not going to happen as long as any country hasn't fully adopted it, and I don't think that's ever going to happen. For better or worse NAT handles outgoing connections and SNI routing handles incoming connections for most use cases. Self-hosting is the most broken but IMO that's better handled with tunneling anyway so you don't expose your home IP.
DS-Lite (aka CGNAT): now we don't need to give customers a proper IP address anymore. It should be banned, as it limits IPv6 adoption; it's getting more and more use "for the customers' own good" and is annoying as hell to work around.
The graph indicates that only 50% of the samples have IPv6. Also consider:
> The graph shows the percentage of users that access Google over IPv6.
About 20% of the world population lives in regions where Google is outright blocked, so the above users with confirmed IPv6 reachability reflects no more than 40% of the world population.
I'd expect that IPv6 deployment will have a long tail end. Countries lower on resources but with relatively modern infrastructure are the ones who will delay the longest in upgrading to IPv6.
Adoption is not even 50%, and the line goes up fairly linearly, so ~95% will be around 2040 or so?
And if you click on the map view you will see "little corner of the woods" is ... the entire continent of Africa, huge countries like China and Indonesia.
Why did adoption slow down after a sudden rise? I guess some countries switched to IPv6 and since then progress has been slow? It's hard to infer from the graph, but my guess would be India? They have a very nice adoption rate.
Sadly here in Canada I don't think any ISP even supports IPv6 in any shape or form except for mobile. Videotron has been talking about it for a decade (and they have a completely outdated infrastructure now, only DOCSIS and a very bad implementation of it too), and Bell has fiber but does not provide any info on that either.
There's simply not enough demand. ISPs can solve their IP problems with NAT. Web services can solve theirs with SNI routing. The only people who really need IPv6 are self hosters.
Ah that's cool! It sucks that they are basically non existent in Quebec, at least for residential internet. But I think they are pushing for a bigger foothold here
Maybe moving the entire application to the browser/cloud wasn't the best idea for a large number of use cases?
Video streaming, sure, but we're already able to stream 4K video over a 25Mbit line. With modern internet connections being 200Mbit to 1Gbit, I don't see that we need the bandwidth in private homes. Maybe for video conferencing in large companies, but that also doesn't need to be 4K.
The underlying internet protocols are old, so there's no harm in assessing whether they've outlived their usefulness. However, we should also consider whether web applications and "always connected" are truly the best solution for our day to day application needs.
> With modern internet connections being 200Mbit to 1Gbit, I don't see that we need the bandwidth in private homes
Private connections tend to be asymmetrical. In some cases, e.g. old DOCSIS versions, that used to be due to technical necessity.
Private connections tend to be unstable, the bandwidth fluctuates quite a bit. Depending on country, the actually guaranteed bandwidth is somewhere between half of what's on the sticker, to nothing at all.
Private connections are usually used by families, with multiple people using it at the same time. In recent years, you might have 3+ family members in a video call at the same time.
So if you're paying for a 1000/50 line (as is common with DOCSIS deployments), what you're actually getting is usually a 400/20 line that sometimes achieves more. And those 20Mbps upload are now split between multiple people.
At the same time, you're absolutely right – Gigabit is enough for most people. Download speeds are enough for quite a while. We should instead be increasing upload speeds and deploying FTTH and IPv6 everywhere to reduce the latency.
This is a great post. I often forget that home Internet connections are frequently shared between many people.
This bit:
> IPv6 everywhere to reduce the latency
I am not an expert on IPv4 vs IPv6. Teach me: How will migrating to IPv6 reduce latency? As I understand, a lot of home Internet connections are always effectively IPv6 via CarrierNAT. (Am I wrong? Or not relevant to your point?)
> Google has measured to most customers about 20ms less latency on IPv6 than on IPv4, according to their IPv6 report.
I've run that comparison across four ISPs and never seen any significant difference in latency... not once in the decades I've had "dual stack" service.
I imagine that Google is getting confounded by folks with godawful middle/"security"ware that is too stupid to know how to handle IPv6 traffic and just passes it through.
Here, I would say "need" is a strong term. Surely, you are correct at the most basic level, but if the bandwidth exists, then some streaming platforms will use it. Deeper question: Is there any practical use case for Internet connections above 1Gbit? I struggle to think of any. Yes, I can understand that people may wish to reduce latency, but I don't think home users need any more bandwidth at this point. I am astonished when I read about 10Gbit home Internet access in Switzerland, Japan, and Korea.
Zero trolling: Can you help me to better understand your last sentence?
> However, we should also consider whether web applications and "always connected" are truly the best solution for our day-to-day application needs.
I cannot tell if this is written with sarcasm. Let me ask more directly: Do you think it is a good design for our modern apps to always be connected or not? Honestly, I don't have a strong opinion on the matter, but I am interested to hear your opinion.
Generally speaking I think we should aim for offline first, always. Obvious things like Teams or Slack require an internet connection to work, but assuming a working internet connection shouldn't even be a requirement for a web browser.
I think it is bad design to expect a working internet connection, because in many places you can't expect bandwidth to be cheap, or the connection to be stable. That's not to say that something like Google Docs shouldn't be a thing (others seem to like it, but everyone in my company thinks it's awful); there's certainly value in the real-time collaboration features, but it should be able to function without an internet connection.
Last week someone was complaining about the S3 (sleep) feature on laptops, and one thing that came to my mind is that despite these being portable, we somehow expect them to be always connected to the internet. That just seems like a somewhat broken mindset to me.
Note that in deeper sleep states you typically see more aggressive limiting of what interrupts can take you out of the sleep state. Turning off network card interrupts is common.
I don't have access to the article, but they're saying the issue is due to client side ack processing. I suspect they're testing at bandwidths far beyond what's normal for consumer applications.
Gigabit fiber internet is quite cheap and increasingly available (I'm not from the US). I don't just use the internet over a 4/5g connection. This definitely affects more people than you think.
But that is not what I was replying to. I was replying to the claim that this affects regular 4g/5g cell phone speeds. The data is clear that it does not.
The problem is that the biggest win by far with QUIC is merging encryption and session negotiation into a single packet, and the kernel teams have been adamant about not wanting to maintain encryption libraries in kernel. So, QUIC or any other protocol like it being in kernel is basically a non-starter.
The flexibility and ease of changing a userspace protocol IMO far outweighs anything else. If the performance problem described in this article (which I don't have access to) is in userspace QUIC code, it can be fixed and deployed very quickly. If similar performance issue were to be found in TCP, expect to wait multiple years.
No, but it depends on how QUIC works, how Ethernet hardware works, and how much you actually want to offload to the NIC. For example, QUIC has TLS encryption built-in, so anything that's encrypted can't be offloaded. And I don't think most people want to hand all their TLS keys to their NIC[0].
At the very least you probably would have to assign QUIC its own transport, rather than using UDP as "we have raw sockets at home". Problem is, only TCP and UDP reliably traverse the Internet[1]. Everything in the middle is sniffing traffic, messing with options, etc. In fact, Google rejected an alternate transport protocol called SCTP (which does all the stream multiplexing over a single connection that QUIC does) specifically because, among other things, SCTP's a transport protocol and middleboxes choke on it.
[0] I am aware that "SSL accelerators" used to do exactly this, but in modern times we have perfectly good crypto accelerators right in our CPU cores.
[1] ICMP sometimes traverses the internet, it's how ping works, but a lot of firewalls blackhole ICMP. Or at least they did before IPv6 made it practically mandatory to forward ICMP packets.
SCTP had already solved the problem that QUIC proposes to solve.
Google of all companies has the influence to properly implement and accommodate other L4 protocols. QUIC seems like doubling down on a hack and breaks the elegance of OSI model.
The OSI model? We're in the world where TCP/IP won. OSI is a hilariously inelegant model that doesn't map to actual network protocols in practice. To wit: where exactly is the "presentation layer" or "session layer" in modern networking standards?
IP didn't originally have layering. It was added early on, so they could separate out the parts of the protocol for routing packets (IP) and the parts for assembling data streams (TCP). Then they could permit alternate protocols besides TCP. That's very roughly OSI L3 and L4, so people assumed layering was ideologically adopted across the Internet stack, rather than something that's used pragmatically.
Speaking of pragmatism, not everyone wants to throw out all their old networking equipment just to get routers that won't mangle unknown transports. Some particularly paranoid network admin greybeards remember, say, the "ping of death", and would much rather have routers that deliberately filter out anything other than well-formed TCP and UDP streams. Google is not going to get them to change their minds; hell, IPv6 barely got those people to turn on ICMP again.
To make matters worse, Windows does not ship SCTP support. If you want to send or receive SCTP packets you either use raw sockets and run as admin (yikes), or you ship a custom network driver to enable unprivileged SCTP. The latter is less of a nightmare but you still have to watch out for conflicts; I presume you can only have one kind of SCTP driver installed at a time. E.g. if Google's SCTP driver is installed and you then switch to Firefox, which only works with Mozilla's SCTP driver, you'll have weird conflicts. Seems like a rather invasive modification to the system to make.
The alternative is to tunnel SCTP over another transport protocol that can be sent by normal user software, with no privileged operations or system modification required. i.e. UDP. Except, this is 2010, we actually care about encryption now. TLS is built for streams, and tunneling TLS inside of multiple SCTP substreams would be a pain in the ass. So we bundle that in with our SCTP-in-UDP protocol and, OOPS, it turns out that's what QUIC is.
I suppose they could have used DTLS in between SCTP and UDP. Then you'd have extra layers, and layers are elegant.
As others in the thread have summarized it, the paper says the issue is ack offload. That has nothing to do with whether the stack is in kernel space or user space. Indeed there's some concern about this inevitable scenario because the kernel is so slow-moving: updates take much longer to propagate to the applications needing them, whereas user-space stacks can update as the endpoint applications need them to.
I don't have access to the paper but based on the abstract and a quick scan of the presentation, I can confirm that I have seen results like this in Caddy, which enables HTTP/3 out of the box.
HTTP/3 implementations vary widely at the moment, and will likely take another decade to optimize to homogeneity. But even then, QUIC requires a lot of state management that TCP doesn't have to worry about (even in the kernel). There's a ton of processing involved with every UDP packet, and small MTUs, still ingrained into many middleboxes and even end-user machines these days, don't make it any better.
So, yeah, as I felt about QUIC ... oh, about 6 years ago or so... HTTP/2 is actually really quite good enough for most use cases. The far reaches of the world and those without fast connections will benefit, but the majority of global transmissions will likely be best served with HTTP/2.
Intuitively, I consider each HTTP major version an increased order of magnitude in complexity. From 1 to 2 the main complexities are binary (that's debatable, since it's technically simpler from an encoding standpoint), compression, and streams; then with HTTP/3 there's _so, so much_ it does to make it work. It _can_ be faster -- that's proven -- but only when networks are slow.
TCP congestion control is its own worst enemy, but when networks aren't congested (and with the right algorithm)... guess what. It's fast! And the in-order packet transmissions (head-of-line blocking) makes endpoint code so much simpler and faster. It's no wonder TCP is faster these days when networks are fast.
I think servers should offer HTTP/3 but clients should be choosy when to use it, for the sake of their own experience/performance.
I see a few million daily page views: Memory usage has been down, latency has been down, network accounting (bandwidth) is about the same. Revenue (ads) is up.
> It _can_ be faster -- that's proven -- but only when networks are slow.
It can be faster in a situation that doesn't exist.
It sounds charitable to say something like "when networks are slow" -- but because everyone has had a slow network experience, they are going to think that QUIC would help them out, but real world slow network problems don't look like the ones that QUIC solves.
In the real world, QUIC wastes memory and money and increases latency on the average case. Maybe some Google engineers can come up with a clever heuristic involving TCP options or the RTT information to "switch on QUIC selectively" but honestly I wish they wouldn't bother, simply because I don't want to waste my time benchmarking another half-baked google fart.
The thing is, very few people who use "your website" are on slow, congested networks. The number of people who visit google on a slow, congested network (airport wifi, phones at conferences, etc) is way greater than that. This is a protocol to solve a google problem, not a general problem or even a general solution.
It's not. Think about what you search for on your mobile, while out or traveling, and what you search for on desktop/wifi. They are vastly different. Your traffic is not representative of the majority of searches.
I'm sure the majority of searches are for "google" and "facebook" and you're right in a way: I'm not interested in those users.
I'm only interested in searches that advertisers are interested in, but this is also where Google gets their revenue from, so we are aligned with which users we want to prioritise, so I do not understand who you possibly think QUIC is for if not Google's ad business?
Google sells that traffic to me: I'm buying ads from the Google Search Page and from YouTube and from ads Google puts anywhere else, so it is Google's traffic.
If google can't serve them an ad, they don't care about them, so QUIC isn't for them, so in order for QUIC to make business-sense to google it has to help them serve more ads.
The theory is if QUIC helped anyone out, Google (being huge) would be able to sell more ads to me if I use QUIC as well. In reality, they are able to sell me more ads when I turn off QUIC, which means this theory cannot be true.
It's strange to read this when you see articles like this[0] and see Lighthouse ranking better with it switched on. Nothing beats real world stats though. Could this be down to server/client implementations of HTTP2 or would you say it's a fundamental implication of the design of the protocol?
Trying to make my sites load faster led me to experiment with QUIC and ultimately I didn't trust it enough to leave it on with the increase of complexity.
UDP is problematic because you can't recv() more than one packet at a time, so you hit syscall limits that you could just ignore with a TCP-based protocol. There's a trick that wastes lots of cores, but it's bad for battery-powered devices. There's also io_uring and AF_XDP, which look promising, but they aren't supported as widely as Chrome is.
HTTP2 I can't explain. HTTP2 should be better. I suspect it's probably an implementation bug because I can replicate lab performance and HTTP2 looks good to me in controlled tests.
I can always try turning it back on again in 6 months...
I don't know which site you're describing, but as a user living very far from major datacenters on an unstable and frequently slow network, HTTP[23] have been the best thing since sliced bread. How it looks in practice is getting hundreds of megabits/s when network is under low load (in the middle of the night, etc), but down to hundreds or even dozens of kilobytes/s in the evening. Sometimes with high packet loss. Always having pings of 100ms or higher (right now it's 120ms to the nearest google.com datacenter, 110ms to bing.com, 100ms to facebook.com, and 250ms to HN).
Back when HTTP2 was introduced and started becoming popular, I spent a few weekend hours on writing a short script to do a blind test of HTTP1.1 vs HTTP2 on some major sites where both were supported. H2 won every time, hands down. It was like comparing a 96kbit/s MP3 to a FLAC.
> UDP is problematic because you can't recv() more than one packet at a time
> but as a user living very far from major datacenters on an unstable and frequently slow network, HTTP[23] have been the best thing since sliced bread. How it looks in practice is getting hundreds of megabits/s when network is under low load ...
The surprise is that some users aren't able to access your site if you enable HTTP[23], and those that can will have worse latency on average.
There's a trick that I use to download large "files" -- I use content-range and multiple parallel fetch() calls. Because of the fast (lower latency) start of HTTP1.1 this outperforms H2 when you can get away with it. People who don't want to use JavaScript can use https://github.com/brunoborges/pcurl or something like it.
Cool. I don't think it helps H3 because the other end will just block anyway (because it's Linux only and the other side probably isn't Linux), and it seems to be a little slower than recvmsg() when there's only a small amount of traffic so I'm pretty sure it's not going to help me, but I'll keep poking around at it...
> It's strange to read this when you see articles like this[0] and see Lighthouse ranking better with it switched on.
I mean, Lighthouse is maintained by Google (IIRC), and I can believe they are going to give their own protocol bonus points.
> Could this be down to server/client implementations of HTTP2 or would you say it's a fundamental implication of the design of the protocol?
For stable internet connections, you'll see http2 beat http3 around 95% of the time. It's the 95th+ percentile that really benefits from http3 on a stable connection.
If you have unstable connections, then http3 will win, hands down.
This is already how TLS offload is implemented for NICs that support it. The handshake isn't offloaded, only the data path. So essentially, the application performs the handshake, then it calls setsockopt to convert the TCP socket to a kTLS socket, then it passes the shared key, IV, etc. to the kTLS socket, and the OS's network stack passes those parameters to the NIC. From there, the NIC only handles the bulk encryption/decryption and record encapsulation/decapsulation. This approach keeps the drivers' offload implementations simple, while still allowing the application/OS to manage the session state.
Sure, similar mechanisms are available, but for TCP, ack offloading and TLS encryption/decryption offloading are distinct features. With QUIC there's no separation, which changes the threat model. Of course the root architectural problem is that this kind of stuff is part of the NIC instead of an "encryption accelerator" that can be asked to operate with a key ID on a RAM region; then the kernel only needs to give the keys to the SE (and potentially that's where they even originate, instead of ever living anywhere else).
Kernels enable the CPU's IOMMU, which limits the memory areas the NIC can access to only the memory it needs to access.
This is also why it should be safe to attach pcie over thunderbolt devices.
Although I think for Intel CPUs the IOMMU needed to be disabled for years because their iGPU driver could not work with it. I hope things have improved with the Xe GPUs.
I'd say Http1.1 is good enough for most people, especially with persistent connections. Http2 is an exponential leap in complexity, and burdensome/error-prone for clients to implement.
For us, what QUIC solves is that mobile users that move around in the subway and so on are not getting these huge latency spikes. Which was one of our biggest complaints.
Something that nobody seems to be talking about here is the congestion control algorithm, which is the problem here. Cubic doesn't like losses. At all. In the kernel, pacing is implemented to minimise losses, allowing Cubic to work acceptably for TCP, but if the network is slightly lossy, performance is terrible anyway. QUIC strongly recommends implementing pacing, but it's harder to do accurately in userland, where you have to cross a whole chain, than at the queue level in the kernel.
Most QUIC implementations use different variations around the protocol to make it behave significantly better, such as preserving the last metrics when facing a loss so that, in case it was only a reorder, they can be restored, etc. The article should have compared different server-side implementations, with different settings. We're used to seeing a ratio of 1:20 in some transatlantic tests.
And testing a BBR-enabled QUIC implementation shows tremendous gains compared to TCP with Cubic. Ratios of 1:10 are not uncommon with moderate latency (100ms) and losses (1-3%).
At least what QUIC is making clear is that if TCP has worked so poorly for a very long time (remember that the reason for QUIC was that it was impossible to fix TCP everywhere), it's in large part due to congestion control algorithms. Since those were implemented in the kernel by people carefully reading academic papers that never consider reality but only in-lab measurements, such algorithms behave pretty poorly in front of the real internet, where jitter, reordering, losses, duplicates etc. are normal. QUIC allowed many developers to put their fingers in the algos, adjust some thresholds and mechanisms, and we're seeing stuff improve fast (it could have improved faster if OpenSSL hadn't decided to play against QUIC a few years ago by cowardly refusing to implement the API everyone needed, forcing people to rely on locally-built SSL libs to use QUIC). I'm pretty sure that within 2-3 years we'll see some of the QUIC improvements ported to TCP, just because QUIC is a great playground to experiment with these algos that for 4 decades had been the reserved territory of just a few people who denied the net as it is and worked for the net as they dreamed it.
I think one of the reasons Google chose UDP is that it's already a popular protocol, on which you can build reliable packets, while also having the base UDP unreliability on the side.
From my perspective, which is a web developer's, having QUIC allowed the web standards to easily piggyback on top of it for the WebTransport API, which is way better than the current HTTP stack and WebRTC, which is a complete mess.
Basically giving a TCP and UDP implementation for the web.
Knowing this, I feel like it makes more sense to me why Google chose to do it this way, which some people seem to be criticizing.
> I think one of the reasons Google chose UDP is that it's already a popular protocol...
If you want your packets to reliably travel fairly unmolested between you and an effectively-randomly-chosen-peer on The Greater Internet, you have two transport protocol choices: TCP/IP or UDP/IP.
If you don't want the connection-management & etc that TCP/IP does for you, then you have exactly one choice.
> ...which some people seem to be criticizing.
People are criticizing the fact that on LAN link speeds (and fast (for the US) home internet speeds) QUIC is no better than (and sometimes worse than) previous HTTP transport protocols, despite the large amount of effort put into it.
It also seems that some folks are suggesting that Google could have put that time and effort into improving Linux's packet-handling code and (presumably) getting that into both Android and mainline Linux.
I wonder if the trick might be to repurpose technology from server hardware: partition the physical NIC into virtual PCI-e devices with distinct addresses, and map to user-space processes instead of virtual machines.
So in essence, each browser tab or even each listening UDP socket could have a distinct IPv6 address dedicated to it, with packets delivered into a ring buffer in user-mode. This is so similar to what goes on with hypervisors now that existing hardware designs might even be able to handle it already.
I've often pondered if it was possible to assign every application/tab/domain/origin a different IPv6 address to exchange data with, to make tracking people just a tad harder, but also to simplify per-process firewall rules. With the bare minimum, a /64, you could easily host billions of addresses per device without running out.
I think there may be a limit to how many IP addresses NICs (and maybe drivers) can track at once, though.
What I don't really get is why QUIC had to be invented when multi-stream protocols like SCTP already exist. SCTP brings the reliability of TCP with the multi-stream system that makes QUIC good for websites. Piping TLS over it is a bit of a pain (you don't want a separate handshake per stream), but surely there could be techniques to make it less painful (leveraging 0-RTT? Using session resumptions with tickets from the first connected stream?).
First and foremost, you can't use SCTP on the Internet, so the whole idea is dead on arrival. The Internet only really works for TCP and UDP over IP - anything else, you have a loooooong tail of networks which will drop the traffic.
Secondly, the whole point of QUIC is to merge the TLS and transport handshakes into a single packet, to reduce RTT. This would mean you need to modify SCTP anyway to allow for this use case, so even what small support exists for SCTP in the wild would need to be upgraded.
Thirdly, there is no reason to think that SCTP is better handled than UDP at the kernel's IP stack level. All of the problems of memory optimizations are likely to be much worse for SCTP than for UDP, as it's used far, far less.
I don't see why you can't use SCTP over the internet. HTTP2 has fallbacks for broken or generally shitty middleboxes, I don't see why the weird corporate networks should hold back the rest of the world.
TLS already does 0-RTT so you don't need QUIC for that.
The problem with UDP is that many optimisations are simply not possible. The "TCP but with blackjack and hookers" approach QUIC took makes it very difficult to accelerate.
SCTP is Fine™ on Linux but it's basically unimplemented on Windows. Acceleration beyond what these protocols can do right now requires either specific kernel/hardware QUIC parsing or kernel mode SCTP on Windows.
Getting Microsoft to actually implement SCTP would be a lot cleaner than to hack yet another protocol on top of UDP out of fear of the mighty shitty middleboxes.
WebRTC decided they liked SCTP, so... they run it over UDP (well, over DTLS over UDP). And while HTTP/2 might fail over to HTTP/1.1, what would an SCTP session fall back to?
The problem is not that Windows doesn't have in-kernel support for SCTP (there are several user-space libraries already available, you wouldn't even need to convince MS to do anything). The blocking issue is that many, many routers on the Internet, especially but not exclusively around all corporate networks, will drop any packet that is neither TCP or UDP over IP.
And if you think UDP is not optimized, I'd bet you'll find that the SCTP situation is far, far worse.
And regarding 0-RTT, that only works for resumed connections, and it is still actually 1 RTT, because the TCP connection must be established first. New connections over TCP need 2-3 round trips before data flows (1 for TCP, plus 1 for TLS 1.3 or 2 for TLS 1.2). With QUIC, new connections need only 1 round trip, and you can have true 0-RTT traffic, sending the (encrypted) HTTP request data in the very first packet you send to a host [that you communicated with previously].
How is userspace SCTP possible on Windows? Microsoft doesn't implement it in WinSock and, back in the XP SP2 days, Microsoft disabled/hobbled raw sockets and has never allowed them since. Absent a kernel-mode driver, or Microsoft changing their stance (either on SCTP or raw sockets), you cannot send pure SCTP from a modern Windows box using only non-privileged application code.
Per these Microsoft docs [0], it seems that it should still be possible to open a raw socket on Windows 11, as long as you don't try to send TCP or UDP traffic through it (and have the right permissions, presumably).
Of course, to open a raw socket you need privileged access, just like you do on Linux, because a raw socket allows you to see and respond to traffic from any other application (or even system traffic). But in principle you should be able to make a Service that handles SCTP traffic for you, and a non-privileged application could send its traffic to this service and receive data back.
I did find some user-space library that is purported to support SCTP on Windows [1], but it may be quite old and not supported. Not sure if there is any real interest in something like this.
Interesting. I think the service approach would now be viable since it can be paired with UNIX socket support, which was added a couple of years ago (otherwise COM or RPC would be necessary, making clients more complicated and Windows-specific). But yeah, the lack of interest is the bigger problem now.
SCTP works fine on the internet, as long as your egress comes from a public IP and you don't perform NAT. So with IPv6 it's a non-issue, unless you sit behind middleboxes.
Residential - not many. Corporate, on the other hand, is a different story, which is why happy eyeballs for the transport would still be needed for a gradual rollout anyway.
I doubt there is, because it's just not a very popular thing to even try. Even WebRTC, which uses SCTP for non-streaming data channels, uses it over DTLS over UDP.
But as far as I can tell it's fast _enough_, just not as fast as it could be.
Mainly they seem to test situations related to bandwidth/latency which aren't very realistic for the majority of users (because most users don't have super fast, high-bandwidth internet).
This doesn't mean QUIC can't be faster or that we shouldn't look into reducing overhead, just that it's likely not as big a deal as it might initially look.
QUIC is a standard problem across the many clients who choose Zscaler and similar content inspection tools. You can block it at the policy level, but you also need to have it disabled at the browser level. That setting sometimes magically turns back on and leads to a flurry of tickets for 'slow internet', 'Google search not working', etcetera.
Hmm, interesting. We also have policies imposed by the Regulator™ that lead to us inspecting all web traffic. All web traffic goes through a proxy that's configured in the web browser. No proxy, no internet.
Out of curiosity: What's your use case to use ZScaler for this inspection instead?
Does QUIC do better with packet loss compared to TCP? TCP perceives packet loss as network congestion and so throughput over high bandwidth+high packet loss links suffers.
To force a lucrative cycle of hardware upgrades, you need software to do the opposite.
True story: Back in the early aughties, Intel was hosting regular seminars for dealers and integrators selling either Intel-made PCs or white-box ones. I attended one of those, and the Intel rep openly claimed that Intel had challenged Microsoft to produce software which could bring a GHz CPU to its knees.
They are already expected standards, so when you create optimizations you're building on functions that need to be supported additionally on top of them. This leads to incompatibility and sometimes worse performance, as is being experienced here with QUIC.
A good solution is to create a newer protocol when the limits of an existing protocol are exceeded. No one thought of needing HTTPS long ago, and now we have 443 for HTTP security. If we need something to be faster and it has already reached an arbitrary limit for the sake of backward compatibility, it would be better to introduce a new protocol.
I dislike the idea that we're turning into another Reddit where we are pointing fingers at people for updoots. If you dislike my opinion please present one equal to where that can be challenged.
> A good solution is to create a newer protocol when the limits of an existing protocol are exceeded.
It’s not clear to me how this is different to what’s happening. Is your objection that they did it on top of UDP instead of inventing a new transport layer?
No, actually what I meant was that QUIC, being a protocol on UDP, was intended to take advantage of the speed of UDP to do things faster than some TCP protocols did. While the merit is there, the optimizations done on TCP itself have drastically improved the performance of TCP-based protocols. UDP is still exceptional, but it is like using a crowbar to open a bottle. Not exactly the tool intended for the purpose.
Creating a new transport protocol for use on the whole Internet is a massive undertaking, not only in purely technical terms, but much more difficult, in political terms. Getting all of the world's sysadmins to allow your new protocol is a massive massive undertaking.
And if you have the new protocol available today, with excellent implementations for Linux, Windows, BSD, MacOS, Apple iOS, and for F5, Cisco, etc routers done, it will still take an absolute minimum of 5-10 years until it starts becoming available on the wider Internet, and that is if people are desperate to adopt it. And the vast majority of the world will not use it for the next 20 years.
The time for updating hardware to allow and use new protocols is going to be a massive hurdle to anything like this. And the advantage to doing so over just using UDP would have to be monumental to justify such an effort.
The reality is that there will simply not be a new transport protocol used on the wide internet in our lifetimes. Trying to get one to happen is a pipe dream. Any attempts at replacing TCP will just use UDP.
While you're absolutely correct, I think it is interesting to note that your argument could also have applied to the HTTP protocol itself, given how widely HTTP is used.
However, in reality, the people/forces pushing for HTTP2 and QUIC are the same one(s?) who have a defacto monopoly on browsers.
So, yes, it's a political issue, and they just implemented their changes on a layer (or even... "app") that they had the most control over.
On a purely "moral" perspective, political expediency probably shouldn't be the reason why something is done, but of course that's what actually happens in the real world...
There are numerous non-HTTP protocols used successfully on the Internet, as long as they run over TCP or UDP. Policing content running on TCP port 443 to enforce that it is HTTP/1.1 over TLS is actually extremely rare, outside some very demanding corporate networks. If you wanted to send your own new "HTTP/7" traffic today, with some new encapsulation over TLS on port 443, and you controlled both the servers and the clients for this, I think you would actually encounter minimal issues.
The problems with SCTP, or any new transport-layer protocol (or any even lower layer protocol), run much deeper than deploying a new protocol on any higher layer.
QUICv2 is not really a new standard. It explicitly exists merely to intentionally rearrange some fields to prevent standard hardcoding/ossification and exercise the version negotiation logic of implementations. It says so right in the abstract:
“Its purpose is to combat various ossification vectors and exercise the version negotiation framework.”
You posted your opinion without any kind of accompanying argument, and it was also quite unclear what you meant. Whining about being a target and being downvoted is not really going to help your case.
I initially understood your first post as: "Let's not try to make the internet faster"
With this reply, you are clarifying your initial post that was very unclear.
Now I understand it as:
"Let's not try to make existing protocols faster, let's make new protocols instead"
More that if a protocol has met its limit and you are at a dead end, it is better to build a new one from the ground up. Making the internet faster is great, but you eventually hit a wall. You need to be creative and come up with better solutions.
In fact, our modern network infrastructure rests on designs intended for limited network performance. Our networks are fiber and 5G, roughly 170,000 times faster and wider than those at the initial inception of the internet.
It's wasted energy when they aren't used at their full capacity.
I think that GoogleHTTP has real-world uses for bad connectivity or in datacenters where they can fine-tune their data throughput (and buy crazy good NICs), but it seems that to use it for replacing TCP (which seems to be confirmed as very good when receiver and sender aren't controlled by the same party) the world needs a hardware overhaul or something.
Maybe the problem is that we designed around a limited-bandwidth network at the initial inception of the internet and have been building on that design for 50 years. We need to change the paradigm to think in terms of our wide-bandwidth networks.
- syscall interfaces are a mess: the primitive APIs are too slow for regular-sized packets (~1500 bytes), and the per-call overhead is too high. GSO helps, but it’s a horrible API, and it has been buggy even recently due to complexity and poor code standards.
- syscall costs got even higher with the Spectre mitigations, and that story likely isn’t over. We need a replacement for the BSD sockets / POSIX APIs; they’re terrible this decade. Yes, io_uring is fancy, but there’s a tutorial-level API middle ground possible that should be safe and have 10x less overhead, without resorting to io_uring-level complexity.
- system UDP buffers are far too small by default, much smaller than their TCP siblings; essentially no one but experts has been touching them, and experts just retune things themselves.
- UDP stack optimizations are possible (such as route-lookup reuse without connect(2)); GSO demonstrates this, though as noted above GSO is highly fallible, quite expensive itself, and wholly, unnecessarily intricate in design for what we need, particularly since we want to do this safely from unprivileged userspace.
- several optimizations currently available only work at low/mid scale, such as connect(2) binding to (potentially) avoid route lookups, or GSO only being applicable on a socket without heavy peer competition (competing peers result in short offload chains due to the single-peer constraint, eroding the overhead wins).
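To make the buffer-size point concrete, here is a small Python sketch (Linux semantics assumed): ask the kernel for a larger UDP receive buffer and read back what you actually got. On Linux the kernel doubles the requested value for bookkeeping overhead and clamps it to net.core.rmem_max, which defaults to only a few hundred KB on many distros, so an unprivileged request for megabytes is silently capped.

```python
import socket

# Inspect and try to grow a UDP socket's receive buffer.
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

default_rcvbuf = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)

# Ask for 4 MiB; without privileges (SO_RCVBUFFORCE) the kernel caps
# the result at net.core.rmem_max, so the value read back is often far
# smaller than requested.
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024)
effective_rcvbuf = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)

print("default:", default_rcvbuf, "after request:", effective_rcvbuf)
s.close()
```

Comparing `effective_rcvbuf` against the 4 MiB you asked for is a quick way to see whether an admin would need to raise net.core.rmem_max for your workload.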
Despite all this, you can implement GSO and get substantial performance improvements; we (Tailscale) have on Linux. At some point platforms will need to increase platform-side buffer sizes for lower-end systems, high load/concurrency, BDP, and so on. Buffers and congestion control are a highly complex and sometimes quite sensitive topic, but when you have many applications doing this (the presumed future state), the need will be there.
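For readers curious what the GSO path looks like from userspace, here is a minimal Python sketch of Linux’s UDP_SEGMENT socket option (value 103 in `<linux/udp.h>`, kernel >= 4.18). This is an illustration of the mechanism only, not Tailscale’s actual implementation (which is in Go), and it omits the fallback logic real code needs.

```python
import socket

UDP_SEGMENT = 103  # SOL_UDP option from <linux/udp.h>; Linux >= 4.18 only


def send_with_gso(payload: bytes, dest, seg_size: int = 1200) -> int:
    """Hand the kernel one oversized buffer in a single send(); the UDP
    stack (or the NIC, with hardware offload) splits it into seg_size-byte
    datagrams. One syscall instead of len(payload)/seg_size syscalls."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.setsockopt(socket.IPPROTO_UDP, UDP_SEGMENT, seg_size)
        # connect() pins the route once, so per-send route lookups are
        # avoided -- the same connect(2) trick mentioned above.
        s.connect(dest)
        return s.send(payload)
    finally:
        s.close()
```

On the wire the receiver just sees ordinary independent datagrams; GSO is invisible to the peer, which is why it can be deployed unilaterally. On kernels without UDP_SEGMENT the setsockopt fails with an OSError (ENOPROTOOPT), so production code must fall back to plain per-datagram sends.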