> Drop packets when demuxing. From y to o, it's best to drop packets at o rather than use backpressure from o to y. Otherwise, a single full output queue could starve the others by stopping y. (Example: imagine if a full queue to a 1 Mbps wifi station could stop traffic to a 100 Mbps wifi station. Some Linux wifi drivers suffer from this.)
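To make the shape of that rule concrete, here is a minimal sketch (the class name and queue length are made up for illustration) of a demuxer y that drops at a full per-output queue o instead of blocking the shared stage:

    from collections import deque

    class Demux:
        """Fan a shared input out to per-output queues, dropping at a full
        output instead of exerting backpressure on the shared stage."""

        def __init__(self, n_outputs, qlen=64):
            self.queues = [deque() for _ in range(n_outputs)]
            self.qlen = qlen
            self.drops = [0] * n_outputs

        def push(self, out_idx, packet):
            q = self.queues[out_idx]
            if len(q) >= self.qlen:
                # The slow output pays the price alone; the demuxer never
                # stalls, so traffic for the fast outputs keeps flowing.
                self.drops[out_idx] += 1
                return False
            q.append(packet)
            return True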
Aaah, memories. Combined with point 11, pause frames. I was debugging a weird issue with a gbit switch about 15 years ago.
Port A is a server sending to port B and C. C is only capable of 100mbits. I could send from A to B at 950mbits, and A to C at 50, all good. As soon as I didn't artificially throttle the rate to C at A, it would eventually hit 100mbits for A -> C, which resulted in the rate from A to B also dropping to 100mbits, so a total output of 200 at A. After a lot of trying and poking I saw these mysterious pause frames in Wireshark, which I'd glossed over before because who'd wanna look at anything below IP... Once I looked them up and disabled pause frames on all the machines, I got the expected result of 900 to B and 100 to C. And once I figured that out, it was trivial to formulate a google query that resulted in exactly this problem and the solution to it, which I had failed at before.
So ever since then disabling pause frames is one of the first things I do when networking is acting weird.
Bonus: Back then, when I told an older colleague about my findings, he basically confirmed "pause frames are evil" with another story: In the late 90s, another department started having a problem where entire network segments sometimes became completely unreachable. And the machines in that segment couldn't even communicate with each other. Randomly power-cycling switches and replugging machines solved the problem. After quite some time they tracked it down: some folks in said department had gotten shiny new laptops, and whenever those entered standby, the NIC "didn't get the message". Its buffer would eventually fill up (as there was no OS running to handle any packets), and from then on the network segment would get spammed into oblivion with pause frames.
The modern AQMs are based on "Time in Queue". Both pie and codel work brilliantly with pause frames, so long as BQL is also in the driver. fq+aqm (be it codel, pie or cobalt) works even better.
Adding AQM and FQ to wifi was way, way harder (apenwarr drove the group at google that did some of it), but there is now full support for fq_codel in the mt76, ath9k, ath10k, iwl, and one new realtek chipset in the linux kernel. https://lwn.net/Articles/705884/
Funnily enough there’s another excellent article by apenwarr that touches on this a bit. Ethernet is designed to go as fast as possible/line rate and assumes that everything is the same speed. When you disabled pause frames, it likely prevented the switch from interfering and punting the issue to the TCP stack of the server/client.
And the online book, freely available and primarily on applying fq_codel to everything (and also sch_cake) is here: https://bufferbloat-and-beyond.net/
In the last decade we've managed to eliminate fifos from most of linux, from most 3rd party firmwares (notably openwrt with sqm), and from ios and OSX. The only major things left unfixed, unfortunately, are home routers and edges.
That was an amazing way of explaining a complex topic. Glad to know there are people much smarter than me keeping the internet alive. Thanks for putting so much work into the demonstration.
Why do we know these things? Why are they true? Does anyone have resources for someone who doesn't know that much about networking justifying these statements?
This article was me trying to write down my internal notes on the topic at the time, after having dived very deeply into problems with the set of interlocking network queues inside embedded wifi devices we were working on.
I was probably not the first to realize that this set of rules probably applies to all kinds of networks, not just ones in embedded device firmware, but I hadn’t really seen it written down before.
But I’m surprised the article ever made it to the HN front page, even 5 years later, since it doesn’t even attempt to address the how/why/how do you know sorts of questions. It’s mainly a placeholder just so I don’t forget. (Those rules for muxes and demuxes and backpressure are really confusing, but I believe them to be strictly correct.)
The rule about bottlenecks (there is only ever one) I borrowed from the TCP BBR paper. The rule about queues on a path always being empty except for exactly one that is always full, I think I borrowed from a talk by Stuart Cheshire that I can never seem to find when I look.
Meanwhile, apenwarr themselves has written a bit about bandwidth vs. latency, featuring bufferbloat and fq_codel: https://apenwarr.ca/log/20180808
> Why do we know these things? Why are they true?
Some of the deduced insights can be traced to queueing theory: https://kottke.org/19/01/its-time-for-some-queueing-theory As always, the hard work is in figuring out the right balance given cause and effect (which are time-consuming, if not hard, to deduce in the first place).
I found it to be quite educational to get a DDOS carried out on a public ip of mine to see how a firewall device fared. Let's just say, it was quite interesting seeing various networking queues fill up and different daemons trying to cope.
There is certainly an element of fine tuning and knowledge needed across the multiple daemons on a firewall device, including the network stack: not only does each daemon need the resources to function, it also has to play its part in the device's function as a whole. When a daemon falls over, it can create a new temporary attack vector, in much the same way that a rebooting device shouldn't be online until everything is loaded and running; but I don't even see that in some switches.
You might find some willing individuals on the dark web who can provide such DDOS services, if you wanted to test some things out.
This is a good collection of recommendations. At a new job, I often end up making changes to internal queues that (it turns out) conform to these recommendations, and get better performance and reliability out of it.
Though I didn't learn it by engineering networks like the author -- I took the ideas from lean product development and applied them also to software.
> Corollary: limit queue length to a statistically large burst (eg. 99th percentile).
This is somewhat underspecified -- at what time frame are we calling it a burst? How much money/memory/resources are we willing to spend to maintain the queue?
But, critically, are we willing to make a queue so long that we sacrifice throughput during a sustained overload? Sometimes it makes sense to handle only a 20% burst or even less, because the shorter queues lead to nicer behaviour when the burst draws out into sustained overload.
Round-trip times provide an upper bound on what should be considered a burst, since one RTT is enough time to provide feedback to throttle a burst. Of course, different traffic flows through a bottleneck can have drastically different RTTs, so this isn't too helpful.
In practice, CoDel works well for almost all current network technologies with its target parameter at the default of 5ms of allowable queueing delay. Given the structure of today's networks, an individual packet is unlikely to pass through more than a handful of congested bottlenecks, and a small multiple of 5ms of added delay is a tolerable worst-case comparable to the speed of light delays of long-distance connections.
It's definitely somewhat unsatisfying to not have formal derivations of optimal parameters. But reasonable defaults that have been tested in the real world are still a huge improvement over the old way of having network devices that don't even attempt to handle congestion intelligently.
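For intuition, here is a toy sketch of the time-in-queue idea: timestamp on enqueue, and start dropping at the head once the sojourn time has stayed above the target for a while. This is not the real CoDel control law (which spaces drops out on an increasing schedule); the constants are just the commonly cited defaults.

    import time
    from collections import deque

    TARGET = 0.005    # 5 ms of allowable standing queueing delay
    INTERVAL = 0.100  # tolerate bursts shorter than ~100 ms

    queue = deque()            # holds (enqueue_time, packet)
    above_target_since = None  # when sojourn time first exceeded TARGET

    def enqueue(packet):
        queue.append((time.monotonic(), packet))

    def dequeue():
        global above_target_since
        while queue:
            enq_time, packet = queue.popleft()
            sojourn = time.monotonic() - enq_time
            if sojourn < TARGET:
                above_target_since = None   # queue drained below target
                return packet
            if above_target_since is None:
                above_target_since = time.monotonic()
            if time.monotonic() - above_target_since < INTERVAL:
                return packet               # short burst: let it through
            continue                        # persistent standing queue: drop and try next
        return None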
A new one that has cropped up recently in terms of shortening queues is rigorous application of the TCP_NOTSENT_LOWAT option, everywhere, but especially in containers.
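For reference, a rough sketch of what applying it looks like on Linux. The 16 KB value is just an illustrative choice, not a recommendation; the same knob exists system-wide as the net.ipv4.tcp_notsent_lowat sysctl.

    import socket

    # TCP_NOTSENT_LOWAT limits how much not-yet-sent data the kernel will
    # buffer for a socket, keeping the sender-side queue short and pushing
    # backpressure up to the application sooner.
    TCP_NOTSENT_LOWAT = getattr(socket, "TCP_NOTSENT_LOWAT", 25)  # 25 on Linux

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT, 16 * 1024)
    # system-wide equivalent: sysctl -w net.ipv4.tcp_notsent_lowat=16384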
It's really nothing special, just some general principles that are sensible to me:
- Production levelling/burst smoothing early in the process, so other steps don't have to bother with variability.
- WIP constraints/limited queue sizes both to reveal problems and improve latency.
- Kanban/backpressure to stop the problems at the source, and help troubleshooting.
- Using Little's law as a guideline in tradeoffs between batch size, latency, and size of system (see the worked example below).
- Deliberately shed load when it cannot be served in time, rather than vainly holding on to it for dear life because "surely we cannot outright reject work?!"
- Dropping at the head of queues to improve latency and serve the fresher requests sooner.
There's probably a lot more, and TFA covers some of it too.
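As a concrete (entirely made-up) example of the Little's law point above: L = λW ties together average occupancy, arrival rate, and time in system, so fixing any two bounds the third.

    # Little's law: L = lambda * W
    # (average items in system = arrival rate * average time in system)
    arrival_rate = 200.0      # requests per second (hypothetical)
    target_latency = 0.050    # we want requests to spend <= 50 ms in the system

    max_in_system = arrival_rate * target_latency
    print(max_in_system)      # 10.0 -> cap WIP around 10 to hold the latency target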
Excessively large buffers confuse TCP windowing. On network devices like switches and routers, buffers are only actually needed for two things: internal switching/routing latency (which is usually constant and very small) and serialization onto output interfaces (which varies by input/output port speed mismatch and/or number of inputs multiplexed to one output). Handling network "burstiness" should be left up to the endpoint TCP stack, or to applications in the case of UDP. Ethernet flow control should be disabled unless you find it actually helps (unlikely).
On servers, buffers have more work to do since the CPU may be busy with many other tasks. It is rare that I have had to adjust the default buffer sizes, but keep in mind that server buffers that are too large can increase jitter as well as confuse TCP windowing (depending on when ACKs are sent). Some applications may prefer that packets be dropped instead of queued for excessively long times.
The most important thing is to ensure that the selective ACK and window scaling TCP options are enabled. They are enabled by default in modern operating systems, but you might be surprised how often they are disabled by clueless sysadmins. A common cause of this is/was buggy TCP offloading drivers, where the apparent "solution" was to disable ALL tcp options instead of just TCP offload. Window scaling in particular is essential with modern port speeds.
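If you want to sanity-check a Linux box, a small sketch reading the standard sysctls (net.ipv4.tcp_sack and net.ipv4.tcp_window_scaling; a value of 1 means enabled):

    def read_sysctl(name):
        # e.g. net.ipv4.tcp_sack -> /proc/sys/net/ipv4/tcp_sack
        path = "/proc/sys/" + name.replace(".", "/")
        with open(path) as f:
            return f.read().strip()

    for knob in ("net.ipv4.tcp_sack", "net.ipv4.tcp_window_scaling"):
        value = read_sysctl(knob)
        status = "ok" if value == "1" else "DISABLED, fix this"
        print(f"{knob} = {value} ({status})")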
How should these lessons be understood in the context of queue oriented models of computation like Lee & Park's Dataflow Process networks, Kahn process network semantics etc?
It seems like these models mainly formalise (and aid understanding of) backpressure based mechanisms for message rate control, but have little to say about dropping packets (as this would break the model).
I wonder if there are stochastic models of computation that could help to formalize packet drop? (These probably exist -- and now I'm motivated to go and look for them).
It occurs to me that most of these queue size tradeoffs would be eliminated if the queue operated in a LIFO manner (a stack) instead of FIFO. That way, a burst can naturally get absorbed and re-emitted, but steady state high load doesn't result in increased latency, just packet loss. Can someone with more knowledge of networking enlighten me on why this is a terrible idea?
Let’s say the buffer is nearly full, and the egress rate matches the ingress rate almost exactly. Wouldn’t that mean that the bottom of the stack (the oldest packet) never gets transmitted, while the top of the stack is churned constantly? If such a situation persists, the oldest packets might not get transmitted for hours, which means they’d be lost for all practical purposes, while still taking up valuable buffer space.
However, and this probably is in the same direction as your idea, the author agrees that when deciding what to drop, dropping the oldest packets is better. I’ll quote item 9 in full because I found this a bit surprising (the sentence in parentheses is especially interesting):
> 9. Tail drop is worst drop. There are several variants of AQM (active queue management) with different tradeoffs, but almost all are better than dropping the newest packet when a queue is full. Even the opposite ("head drop") is better in many cases. (Later TCP ACKs encompass all the information from previous ACKs, so if you have to drop one, it might as well be the oldest one.) CoDel is a more refined AQM. Most AQMs are the same speed or only slightly slower than tail drop.
Apologies for not making this clear, I was definitely assuming a LIFO buffer would drop the oldest message when full, since dropping the newest message has the issue you describe. In that case, for a prolonged period of ingress slightly over egress, the buffer will contain the N most recently received messages that haven't been transmitted, which seems optimal, but _also_ will have no queueing-induced latency for new messages. If there's a gap in the incoming data stream, a LIFO buffer would emit the most-recent (and most likely to be useful) data first. Most importantly, there's now no performance drawback for making the buffer bigger, so tuning it is way less dependent on your expected operating environment.
One other optimization that would probably make sense in a practical implementation is a "max age" limit on things popped from the stack, so that an old message can't sit around for several seconds if the ingress rate happens to exactly match egress. This limit can be fairly long though (~1s is fine), since it's just trying to approximate when a higher-level protocol would already have completed a full roundtrip and requested a retransmit.
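A sketch of the buffer being described here (drop-oldest LIFO with a max-age check on pop); the names and defaults are made up for illustration:

    import time
    from collections import deque

    class LifoBuffer:
        """LIFO buffer that drops the oldest entry when full and discards
        entries older than max_age seconds when popping."""

        def __init__(self, maxlen=256, max_age=1.0):
            self.buf = deque()          # stack: newest on the right
            self.maxlen = maxlen
            self.max_age = max_age

        def push(self, item):
            if len(self.buf) >= self.maxlen:
                self.buf.popleft()      # drop the oldest, not the newest
            self.buf.append((time.monotonic(), item))

        def pop(self):
            now = time.monotonic()
            while self.buf:
                ts, item = self.buf.pop()   # newest first
                if now - ts <= self.max_age:
                    return item
                # too stale: a higher-level retransmit has likely fired already
            return None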
Head drop, so long as it preserves a "round" so a malignant sender cannot force all other traffic out of the queue, is great. This is what fq-codel, fq-pie, and cake do.
Can you clarify what you mean by a "round" here, please? Do you mean a complete round-robin pass of clients (or whatever segmentation method is used) in the fair queue?
The "sparse" and "rounds" concepts also works really well as a per station wifi scheduler, where it is now the default for a plethora of chipsets in linux. Some plots of which I will forever be proud here:
Close. A major innovation in the fq-codel derived algorithms over former forms of FQ like DRR and SFQ is what we call the sparse flow optimization. Packets from flows that have an arrival rate lower than the departure rate of one quantum's worth of packets from all other flows (a "round") observe no queueing, whereas in drr or sfq, new flows always go to the back of the fq'd flows. (Still a huge win over fifo or pure aqm.)
This among other things made codel's head drop aqm safe and stable enough to deploy.
The step that moves an empty queue from the list of new queues to the end of the list of old queues before it is removed is crucial to prevent starvation. Otherwise, the queue could reappear (the next time a packet arrives for it) before the list of old queues is visited; this can go on indefinitely, even with a small number of active flows, if the flow providing packets to the queue in question transmits at just the right rate. This is prevented by first moving the queue to the end of the list of old queues, forcing the scheduler to service all old queues before the empty queue is removed and thus preventing starvation.

The resulting migration of queues between the different states is summarised in the state diagram shown in Figure 1. Note that both the new and old queue states can additionally have arrival and dequeue events that do not change the state; these are omitted in the figure.
+-----------------+                +------------------+
|                 |     Empty      |                  |
|      Empty      |<---------------+        Old       +----+
|                 |                |                  |    |
+-------+---------+                +------------------+    |
        |                            ^             ^       |Credits
        |Arrival                     |             |       |Exhausted
        v                            |             |       |
+-----------------+                  |             |       |
|                 |     Empty or     |             |       |
|       New       +------------------+             +-------+
|                 |  Credits Exhausted
+-----------------+
Figure 1: Partial State Diagram for Queues between Different States
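A compressed sketch of that bookkeeping, heavily simplified: packets are assumed to be pre-hashed into flow objects (each with a .packets deque and a .credits byte deficit), and the per-flow CoDel step is omitted. It only shows the new/old list handling and the anti-starvation move described above.

    from collections import deque

    QUANTUM = 1514  # bytes of credit granted per round (roughly one MTU)

    def dequeue(new_queues, old_queues):
        """Rough sketch of the fq_codel-style scheduling loop over two deques
        of flow objects: sparse/new flows are serviced ahead of old flows."""
        while True:
            if new_queues:
                flow, came_from_new = new_queues[0], True
            elif old_queues:
                flow, came_from_new = old_queues[0], False
            else:
                return None

            if flow.credits <= 0:
                # Out of credits: refill and move to the end of the old list.
                flow.credits += QUANTUM
                (new_queues if came_from_new else old_queues).popleft()
                old_queues.append(flow)
                continue

            if not flow.packets:
                if came_from_new:
                    # Crucial anti-starvation step: an empty new queue goes to
                    # the end of the old list instead of being deleted outright.
                    new_queues.popleft()
                    old_queues.append(flow)
                else:
                    old_queues.popleft()   # an empty old queue is removed
                continue

            pkt = flow.packets.popleft()
            flow.credits -= len(pkt)
            return pkt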
> ...queue size tradeoffs would be eliminated if the queue operated in a LIFO manner (a stack) instead of FIFO.
Kind of. Using 'adaptive lifo' with a variant of CoDel is something Facebook explained they do to address tail latency:
Most services process queues in FIFO (first-in first-out) order. During periods of high queuing, however, the first-in request has often been sitting around for so long that the user may have aborted the action that generated the request. Processing the first-in request first expends resources on a request that is less likely to benefit a user than a request that has just arrived. Our services process requests using adaptive LIFO. During normal operating conditions, requests are processed in FIFO order, but when a queue is starting to form, the server switches to LIFO mode. Adaptive LIFO and CoDel play nicely together... CoDel sets short timeouts, preventing long queues from building up, and adaptive LIFO places new requests at the front of the queue, maximizing the chance that they will meet the deadline set by CoDel.
From a networking point of view, (tcp) bufferbloat across hops is a more complicated problem. Ref this exchange between u/jorangreef and others: https://news.ycombinator.com/item?id=10546651
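A toy sketch of that adaptive-LIFO policy (the threshold is a made-up placeholder for "a queue is starting to form"):

    from collections import deque

    LIFO_THRESHOLD = 100   # hypothetical backlog size that flips the policy

    def next_request(queue: deque):
        """FIFO under normal load, LIFO once a backlog builds up, so the
        freshest requests still have a chance of meeting their deadline."""
        if not queue:
            return None
        if len(queue) > LIFO_THRESHOLD:
            return queue.pop()       # newest first (LIFO)
        return queue.popleft()       # oldest first (FIFO)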
Switching to LIFO adaptively is a cool idea, thanks for sharing that Facebook is using that at scale, I didn't know that.
Eliminating bufferbloat was the original line of thinking that led me to LIFO queues, since the latency properties of a drop-oldest LIFO queue are (I think) related only to the burstiness of the traffic flow, not its volume, so a steady flow can never cause any intermediate node to hold a persistently large queue of messages that will ever be transmitted.
From the perspective of an application transmitting over the network, a steady flow through a saturated link results in packet loss but not latency, while a bursty flow results in small burst-sized latencies to random packets, and some reordering as a result. Because the _downstream_ traffic from one saturated link is typically not very bursty when it hits the next saturated link, negative effects won't accumulate in the same way that latency does in bufferbloat.
I guess so, but if it's not exactly "one out, one in", that would constantly reorder packets. The more I think about it, head drop, like the article says, is really the best trivial solution.
Yep, but IP doesn't provide an ordering guarantee at the link level anyway, and higher-level protocols have sequence numbers so they can reassemble things correctly. If those higher-level mechanisms aren't up for the task of handling lots of small reorderings, that would obviously be an issue.
This isn't a great idea at the packet level. You get really bad results by intentionally reordering packets in a tcp connection. It's not going to be good for any flow that expects packets to arrive in order for the most part (voip calls etc).
Plus accumulating old packets isn't great. FIFO preserves order, but the typical behavior is to drop incoming packets when the buffer is full (or to make an absurdly sized buffer), when it's usually better to drop older packets than fresh packets.
Can you elaborate more on why applications handle packet reordering badly? Is it just that applications are written to assume in-order packets are the "fast path" and need to use less optimized code when things come in out-of-order, or is there some other heuristics that assume packet reordering is rare that would misfire here?
TCP is not going to do great, because when you receive packets out of order there's a feedback loop that leads to the source retransmitting the presumed-missing, but possibly just delayed, packets. Adding more duplicate data to your congested queue makes things worse / keeps them bad. It's much better to drop some packets from the flow than to intentionally misorder them.

This property does make it harder to load-balance tcp over multiple connections; you get better results if you hash flows, so one tcp connection always goes over the same physical connection and should remain ordered... Round-robin packet sending might be nice, since you wouldn't end up with uneven utilization of individual physical connections in an aggregated connection, but the tcp stack behavior would be much worse, so the extra bandwidth wouldn't be useful in the real world.

Actually handling tcp packets out of order isn't nice either, but if the options are drop a packet or delay an earlier packet to send after a later one, we're going to have to handle some packets out of order. This is what ECN (explicit congestion notification) was supposed to help with, by marking packets as congested before the congestion got bad enough to drop packets; of course, that was hard to get deployed.
Things like voip or gaming are going to have trouble with out of order packets too. If you already did something to workaround the missing packet, if it eventually arrives late, you may not have any use for it. If I already played silence (or ??) to get through a missed sample, I can't go back and put in the late sample. Etc. Late at that point is not better than never.
There are certainly ways to make protocols where late is better than never; you could have a bulk file transfer protocol that sent all data once before resending or something, but that's not common.
I think TCP's logic is more subtle than "if anything comes in out of order, retransmit", in particular, there's a "Retransmission Timeout" that converges to something like ~2x the average roundtrip latency. I assume that custom UDP-based protocols used by VoIP and such will have similar provisions, where they don't declare a packet "lost" until a significant period of time has elapsed. My understanding is that the reason these mechanisms don't handle multiple connections well is because the two connections might have dramatically different typical latencies, not because the packets may be mildly misordered on the other end.
I'm not sure if more modern "smart" protocols that themselves explicitly try to measure and model the underlying network buffers would be confused by LIFO though, especially since the variance in round trip latency will be higher with a LIFO queue.
Looking at voip specifically, you've got what's called a jitter buffer, which lets you collect late packets, either out of order or delayed, and still use them, but delay is bad for user experience, so you tend to make that buffer as small as you can. You don't retransmit voip packets, because if they're not fresh, they're not useful. LIFO doesn't make sense for voip, because either you run a very long jitter buffer and nobody wants to talk with you like that, or the delayed packets can't be used.
The tcp retransmit timer is used when you send a packet (or packets) and don't get any acknowledgements. But if you send many packets and the peer misses one (because it's delayed/out of order), it will send an ack for the last in-order sequence (and hopefully a selective ack too, it's 2022). If you receive enough acks of the same packet, that triggers fast retransmit; by rfc2001 and updates, three duplicate acks is the threshold to retransmit, without waiting for the retransmit timer.

LIFO would significantly harm the network in case of bursty traffic: if my flow sends 5 packets back to back (which is common with tcp segmentation offloading) and they get queued, they'll get sent to the peer in the exact wrong order, and that peer will send the same ack for the first four packets before sending a new ack on the fifth. My side will get those first four acks and retransmit the first packet of the burst, then get the fifth ack and maybe release a new burst. That's an extra full packet, plus it's typical to only send every other ack normally, so that's extra acks on the return side. Dropping a packet in the flow would still result in extra acks, but the retransmitted data packet wouldn't be a duplicate through that bottleneck, because the first one was dropped before the bottleneck.
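To make the dup-ack arithmetic concrete, a toy sender-side sketch; the three-duplicate-ack threshold is the one from rfc2001 mentioned above, and everything else (names, the example burst) is made up:

    DUP_ACK_THRESHOLD = 3   # three duplicate acks trigger fast retransmit

    def process_acks(acks):
        """Toy sender-side sketch: reordered packets at the receiver produce a
        run of duplicate acks, and three of them force a (possibly spurious)
        retransmit."""
        last_ack = None
        dup_count = 0
        retransmits = []
        for ack in acks:
            if ack == last_ack:
                dup_count += 1
                if dup_count == DUP_ACK_THRESHOLD:
                    retransmits.append(ack)   # resend the segment starting at `ack`
            else:
                last_ack, dup_count = ack, 0
        return retransmits

    # A LIFO bottleneck reversing a 5-packet burst: the receiver acks
    # 1, 1, 1, 1, 6 -> three dup acks of 1 -> spurious retransmit of segment 1.
    print(process_acks([1, 1, 1, 1, 6]))   # [1]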
Then high load will result in some things staying in a stack for arbitrarily long. For generally the same effect without the unfairness, reducing the size of a queue would be equivalent.
To add to this, there is some inherent latency when queuing packets as well so it’s best to avoid it as much as possible if latency is a concern. Low latency/cut-through switches have really shallow buffers on purpose and will quickly drop packets. To avoid queuing/drops you need to make sure you have ample bandwidth to absorb any microbursts. For example, if you have a 1Gb NIC that is constantly bursting at 1.2Gbps you should upgrade it to 10Gb.
I haven't had time to read through the whole text but lots of it rings true with my understanding. I wrote an article that, if you like this one, you will probably also gain from: https://dev.to/rkeene/why-is-there-packet-loss-1p8a
Why wasn’t the speed of the WAN circuit questioned? To paraphrase Google, OC-12 is so slow I forgot how to count that low. If your link is fast enough, you don’t need to worry about QoS, the switch buffer size, or have to buy an expensive WAN traffic shaping device. :)
The link speed of the circuit doesn't indicate the bandwidth of a channel (MPLS or PVC/SVC), and there is no way to know the available bandwidth since it's a variable (with upper and lower bounds), so you always have to worry about this (which is why there's packet loss).
You realize you can get point-to-point Ethernet circuits with a fixed speed/fiber path, right? I have 100G between my data centers and QoS/buffering is something I don’t need to worry about on my network. Some of these paths are dark fiber and can be easily upgraded to 400G when the time comes.
I do, of course, know about the possibility of changing the characteristics of a channel -- most likely this question is not meant to be inflammatory, but it's such a basic thing that I cannot read it any other way.
For what it's worth, at the time you could not get 100 Gbps links, but it's unrelated to the fundamental queuing problem and how they are solved in packet switched networks (by dropping packets in the best-effort systems such as IP/Ethernet and by managing credits/counters in guaranteed bandwidth systems like Fibre Channel/ATM).
You will still have packet loss if you try to send packets at 600 Gbps in the case of a channel whose upper and lower bounds of capacity are 400 Gbps. There are no infinite-capacity channels.
Since you likely have more than 400 systems connected at 1Gbps in each data center, you have oversubscribed that 400 Gbps link, and if each node does 1Gbps you will have packet loss. The value of each packet is probably not equal, and so it may make sense to do some kind of QoS for those scenarios (or not, I certainly don't have enough information to answer). This is a problem you can have at any time. If you are in data center operations, you'll do capacity planning to try to mitigate this, but it's still a problem (until there are no oversubscribed paths, which is wasteful and bad engineering in most circumstances).
The point of the essay wasn't to describe the scale of a data center, it was to talk about packet loss in a network with queues using a system that was designed around this, and what that means for users of the network.