Bottleneck Bandwidth and RTT (ozlabs.org)
246 points by 0123456 on Sept 17, 2016 | 62 comments



This is useful mostly for long-lived connections. As Google moves away from the many-short-connections HTTP model to the persistent connections of HTTP/2, connections live long enough that both bandwidth and delay can be discovered.

This is much better than trying to estimate bandwidth from packet loss.


Any idea how to define "long-lived"?

Excluding the HTTP/2 situations: obviously if you're fetching a single small resource (an image or something) that takes <1s, that's short, but where's the line? Is something >1m long-lived?


10 seconds in this code.

    static u32 bbr_min_rtt_win_sec = 10; /* min RTT filter window (in sec) */ 
The code also uses a moving average over 10 round trips, so that's roughly what the filter needs to get a stable estimate. A point in the paper [1] is that this method is said to work well for maintained TCP connections with idle periods, which in practice means HTTP/2.
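
As a rough illustration of what that 10-second window does, here is a minimal sketch of a sliding-window min-RTT filter in plain C. This is not the kernel's minmax code; the constant and field names are just stand-ins for the idea behind bbr_min_rtt_win_sec.

    /* Minimal sketch of a windowed min-RTT filter; not the kernel's
     * minmax implementation, just the idea behind bbr_min_rtt_win_sec. */
    #include <stdio.h>
    #include <stdint.h>

    #define MIN_RTT_WIN_MS (10 * 1000)   /* 10 s window, as in the patch */

    struct min_rtt_filter {
        uint32_t min_rtt_ms;    /* current minimum RTT estimate */
        uint64_t stamp_ms;      /* when that minimum was observed */
    };

    /* Take a new RTT sample: adopt it if it is lower than the stored
     * minimum, or if the stored minimum has aged out of the window. */
    static void min_rtt_update(struct min_rtt_filter *f, uint64_t now_ms,
                               uint32_t rtt_ms)
    {
        if (rtt_ms <= f->min_rtt_ms || now_ms - f->stamp_ms > MIN_RTT_WIN_MS) {
            f->min_rtt_ms = rtt_ms;
            f->stamp_ms = now_ms;
        }
    }

    int main(void)
    {
        struct min_rtt_filter f = { .min_rtt_ms = UINT32_MAX, .stamp_ms = 0 };
        uint32_t samples[] = { 42, 40, 55, 41, 80, 39, 120 };
        uint64_t now = 0;

        for (unsigned i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
            now += 2000;    /* pretend a sample arrives every 2 s */
            min_rtt_update(&f, now, samples[i]);
            printf("t=%2llus sample=%3ums min_rtt=%ums\n",
                   (unsigned long long)(now / 1000),
                   (unsigned)samples[i], (unsigned)f.min_rtt_ms);
        }
        return 0;
    }

The bandwidth side is a windowed max filter over recent round trips rather than a min, but the aging idea is the same.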

It would be interesting to see test data on this for large numbers of real connections. How much do bandwidth and delay vary in practice across ISP links, cable headends, and cellular links?

[1] http://caia.swin.edu.au/cv/dahayes/content/networking2011-cd...


The averaging period says more about how fast they expect min_rtt to change than how long it takes to calculate it. If the rtt is fairly stable, that measurement could be accurate much sooner than 10s.

Super short lived sessions can really only go faster with tricks like increasing the initcwnd. Anything longer than that, I'd expect bbr to work well.


HTTP < 2 doesn't bring up and tear down TCP connections for every asset; keep-alive is the default in 1.1 and supported in 1.0.


> This is much better than trying to estimate bandwidth from packet loss.

This definitely seems like an improvement; however, is it possible that this change could result in one or more additional attack vectors?

In addition, what about the additional resources needed to pull this off; how many fewer persistent connections could be maintained by a single server with the same specs?


There shouldn't be much overhead per stream. The estimators are just a few integers each, which is tiny compared to the TCP socket buffer itself. There could be added cpu overhead from implementing pacing if it isn't done carefully, but my understanding is the Linux pacing is done "right" so that it actually has very little or even negative overhead compared to traditional burst transmissions. (Also, it's generally safe to assume that Google cares more about server scalability than almost anyone else, so they wouldn't use this if it weren't pretty good on that front.)
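
To put a rough number on "a few integers", a hypothetical per-connection state block for this kind of estimator might look like the sketch below; the field names are made up for illustration, not the struct in the actual patch.

    /* Illustrative per-connection estimator state; field names are
     * hypothetical, not taken from the patch. */
    #include <stdio.h>
    #include <stdint.h>

    struct bw_rtt_state {
        uint64_t max_bw_bps;        /* windowed max of measured delivery rate */
        uint32_t min_rtt_us;        /* windowed min of measured round-trip time */
        uint64_t min_rtt_stamp_us;  /* when min_rtt_us was last refreshed */
        uint32_t pacing_gain_pct;   /* current pacing gain, in percent */
        uint8_t  cycle_idx;         /* position in the gain-cycling schedule */
    };

    int main(void)
    {
        /* A few dozen bytes per socket, versus socket buffers that are
         * commonly hundreds of kilobytes. */
        printf("estimator state: %zu bytes per connection\n",
               sizeof(struct bw_rtt_state));
        return 0;
    }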


For those that don't know, someone revealed recently that Animats is John Nagle, who designed https://en.wikipedia.org/wiki/Nagle%27s_algorithm ... so this is perhaps one of the most informed possible comments.


It's on his profile comment here on HN. :)


John Nagle revealed recently that Animats is John Nagle.


This is the same Van Jacobson who was instrumental in working through TCP congestion collapse growing pains over 30 years ago https://en.wikipedia.org/wiki/Network_congestion#Congestive_...


Intra-network, TCP slow start is often turned off to minimize latency, especially on links where you have to respond very quickly to initial data or where traffic is bursty.

Google's BBR seems to use the same exponential probing that slow start does, so I wonder how it will perform when you are staying in-network, don't often have to worry about packet loss or congestion, and want the link to start off at full throttle.

Once BBR enters its steady state it intentionally cycles faster and slower, but this seems like it is creating additional latency when you don't want it. Think of a traffic burst that happens just as the link decides to cycle slower.

It also seems like the protocol intentionally runs slower than possible so as not to create buffer pressure on the receiving side, if I'm understanding this quick description properly: "then cruising at the estimated bandwidth to utilize the pipe without creating excess queue".

This line just scares me: "Occasionally, on an as-needed basis, it sends significantly slower to probe for RTT (PROBE_RTT mode)."

Google is going to make patches that work for them, but that doesn't always mean it will work for everybody else. This seems very closely tailored to Google's traffic issues and to serving HTTP over persistent connections, rather than a general-purpose feature; think of games, intra-network low-latency applications, etc.


>>Google is going to make patches that work for them, but that doesn't always mean it will work for everybody else.

That makes sense from a purely commercial standpoint, but I was particularly intrigued when I noticed Van Jacobson[1] listed as one of the authors. For those who have read TCP/IP Illustrated (Vols 1 & 2), the four fundamental algorithms (slow start, congestion avoidance, fast retransmit & fast recovery) were evidently designed by Van Jacobson as described in RFC2001.[2]

If listing Van Jacobson as one of the patch authors was meant to lend credibility to the patch, then it certainly worked on me as an initial appeal to authority. Particularly interested to see how BBR will perform over wireless networks.

[1] http://internethalloffame.org/blog/2012/05/25/van-jacobson-d...

[2] https://tools.ietf.org/html/rfc2001


Of course I've read the TCP Illustrated series, and of course I know who Van Jacobson is. I went to UC Berkeley.


My apologies if my reply came across as a rebuttal to your own but I just found it hard to contain my excitement! I referred to Van Jacobson for others who may not know of his significant contributions and who were unaware that he signed off on this particular patch.


Not at all. I didn't actually notice the Van Jacobson attribution until the second time I read the link.


Slow start is kind of orthogonal, as you say, so let's forget about that.

The real meat of this work is trying to address the fact that, for whatever reason, buffers are being provisioned well in excess of anything that might represent the fractional end-to-end bandwidth-delay product.

So this leaves traditional TCP filling up these huge buffers until it gets a loss. A decent RED policy would take care of this.

The key observation is that if you look at the end-to-end latency, there is an inflection point once you start excess buffering. This is the point we are trying to find, where the send rate actually matches a fair share without just dumping packets up to be queued. If you plot latency vs. issue rate you can see it very clearly: latency is flat until you start queueing, and then it goes up because you're waiting in line.

It makes a lot of sense.

Particularly if everyone plays along. A really interesting question is what happens if this congestion control is a minority player in a mix of other adaptation machines.

At the bottom end you're under the right set point, but that's a lot better than taking a classic exponential backoff on the eventual loss.
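
The inflection point is also easy to reproduce with a toy model: a single bottleneck at a fixed rate with a fixed-size buffer, fed at increasing offered rates. All the numbers below are made up; the shape is the point.

    /* Toy model of latency vs. offered rate at a single bottleneck.
     * Below the bottleneck rate the queue stays empty; above it a
     * standing queue fills the buffer and adds queueing delay. */
    #include <stdio.h>

    int main(void)
    {
        const double bottleneck_mbps = 100.0;  /* assumed bottleneck bandwidth */
        const double base_rtt_ms = 20.0;       /* assumed propagation delay */
        const double buffer_mbits = 8.0;       /* assumed bottleneck buffer */

        for (double offered = 20.0; offered <= 200.0; offered += 20.0) {
            double queue_mbits = offered <= bottleneck_mbps ? 0.0 : buffer_mbits;
            double queue_delay_ms = queue_mbits / bottleneck_mbps * 1000.0;
            printf("offered %6.1f Mbps -> latency %6.1f ms%s\n",
                   offered, base_rtt_ms + queue_delay_ms,
                   offered > bottleneck_mbps ? "  (plus loss once the buffer is full)" : "");
        }
        return 0;
    }

Latency sits at the 20 ms base RTT until the offered rate passes the bottleneck, then jumps by the buffer's drain time; that jump is the knee a loss-based sender only finds after the buffer is already full.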


I can definitely see its use over longish connections, and I can definitely see disabling the patch on a local network, where these shouldn't be issues (because you have control of the program endpoints).

I wonder how it works for things like delivering gaming or real-time-ish feeds, where dropping packets might not be such a bad thing (as in, the applications can be written to handle it).

Looks interesting though and definitely better for throughput over random networks.


Correct me if I'm wrong, but doesn't bufferbloat kill latency, which hurts gaming/real time more than normal web browsing? So wouldn't this be a major improvement for those type of applications?


It does, and the usual solution is to not load your link heavily while gaming. If it's someone else's link that's loaded, then you're screwed.

However, most games are deeply inflexible about how much bandwidth they use, and additionally tend to use UDP. Running out of bandwidth doesn't degrade the game; it makes it break entirely.

As a result of both of these, with a few exceptions, this work won't be directly useful to games.


Not directly. But if more people use bbr for their TCP sessions, there will be less latency caused by sessions alongside the gaming session.


Not always. It depends on how you are processing incoming packets and whether you can do it in bulk or elide some processing when you have more to work with.


"staying in network and don't often have to worry about packet loss or congestion."

I don't think you would find many Googlers making this equivalence. If your network is highly utilized then it is characterized by both loss and congestion. The trick is optimizing throughput on a congested network without suffering from collapse.

Whenever you see anyone talking about "traffic engineering" or "software-defined networking" you should just mentally substitute "packet loss". The only thing that an SDN management plane can do is instruct nodes on the network to drop. Then you can imagine that a highly utilized SDN-managed network has high loss.

Edit: In Google's "B4" paper they show that their network operates at packet loss levels approaching 10%. See Figure 14 in http://cseweb.ucsd.edu/~vahdat/papers/b4-sigcomm13.pdf


In my work, I don't deal with Internet traffic too much, so internal packet loss is mostly a sign of poor provisioning or technology (if it is the network's fault) or poor programming (if it is the endpoint's fault). Both of those are fixable. It is very rare to have true congestion and sometimes impossible (if it is a dedicated subnet with only a couple hosts).

For us latency is more important than throughput, so our needs don't match Google's. There are a lot of cases like this.


Some level of packet loss is not a sign of a problem; this is a common confusion, which has led to the bufferbloat issues and to many poor-quality behaviours by apps within datacenters.

It's such a point of confusion that cloud providers don't want to even share the network error rates with customers since they feel they will misinterpret them.


If your network is provisioned such that no congestion is possible, I submit that it might not present very interesting challenges in the domain of flow control :-)


I'm just saying that for some things, latency is as important as throughput in flow control. We have to make sure we're not packet-storming the network down, but we also want data delivered as soon as possible.

TCP is used in a lot of situations now from these dedicated intranets to wireless and from serving HTTP to low-latency games. I'm just hoping TCP isn't being pushed too far in one direction because a large institution lives on one side of the spectrum.


At my non-SDN ISP we do not over-provision any of the core connections in the network. We plan and implement link upgrades proactively to make sure congestion is never an issue in the network. I have closely followed Google's SDN papers, and they have much lower costs in comparison. We actually use packet loss as an indicator that there is a hardware issue.


You have to consider that there are other streams on the same network paths, so simply sending faster doesn't actually mean faster transfers.

It's almost identical to exiting a building during a fire: everyone can rush, or everyone can exit in an orderly way. The bottleneck is the door, and if everyone tries to get through at once, no one will.

Also, what's interesting about not going off packet loss is that blips in the network are less likely to cause TCP connections to become synchronized and all back off and re-probe at the same time.


> It also seems like the protocol intentionally runs slower than possible so as not to create buffer pressure on the receiving side, if I'm understanding this quick description properly: "then cruising at the estimated bandwidth to utilize the pipe without creating excess queue".

> This line just scares me: "Occasionally, on an as-needed basis, it sends significantly slower to probe for RTT (PROBE_RTT mode)."

I haven't read the proposal, but I think the reason for this is that they're comparing RTT during load with idle RTT to determine packet queuing, but the idle RTT may change over time.

Depending on how accurate you want to be, it could be as simple as this: after some amount of time, or some count of full data packets sent from the socket buffer in response to acks moving the window, leave a small gap before the next packet and then resume sending. If that packet is acked faster than the rest, the idle RTT is shorter than the under-load RTT, which means you should slow down in general (to optimize latency). If the RTT is the same for the after-gap packet, then the loaded RTT is close to idle, and you can keep going at the current rate.

(I probably wouldn't implement it exactly as I described it. With TCP timestamps we have fairly continuous RTT measurements, so some sort of last-N-packets min/max/average/stddev to drive the congestion window from all measurements, plus a mechanism to add a small gap for the PROBE_RTT, would make more sense: any low-RTT response should inform the system, not just one that comes in response to a probe.)
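
A minimal sketch of the comparison described above, with arbitrary units and an arbitrary margin so jitter doesn't trigger slowdowns (this is my reading of the idea, not the patch's PROBE_RTT logic):

    /* Compare the smoothed RTT measured while sending at full rate with
     * an RTT sampled right after a deliberate sending gap. If the
     * post-gap sample is clearly lower, the extra latency was our own
     * queue and we should slow down. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct rtt_probe {
        uint32_t loaded_rtt_us;  /* smoothed RTT under load */
        uint32_t probe_rtt_us;   /* RTT of the packet sent after the gap */
    };

    /* Arbitrary 25% margin: only report self-queueing if the loaded RTT
     * exceeds the post-gap RTT by more than a quarter. */
    static bool should_slow_down(const struct rtt_probe *p)
    {
        return (uint64_t)p->probe_rtt_us * 5 < (uint64_t)p->loaded_rtt_us * 4;
    }

    int main(void)
    {
        struct rtt_probe healthy = { .loaded_rtt_us = 21000, .probe_rtt_us = 20000 };
        struct rtt_probe bloated = { .loaded_rtt_us = 90000, .probe_rtt_us = 21000 };

        printf("healthy link: slow down? %s\n", should_slow_down(&healthy) ? "yes" : "no");
        printf("bloated link: slow down? %s\n", should_slow_down(&bloated) ? "yes" : "no");
        return 0;
    }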


I would just never want the situation where I have open advertised window space and have never dropped a packet, but the other side throttles me to prevent an anticipated packet drop, especially if I'm advertising a very large receive buffer. Just send me the packet.


Even if the round-trip delay has grown to several seconds? (I've seen pcaps of this kind of thing; something in the path had ridiculous buffers.)

I took a look at how the throttling is done in their patches, and I think it's probably too much (it drops the congestion window to the minimum for ~200 ms every 10 seconds); I would just drop it by a packet or two each interval.


Right. I worked on congestion avoidance for a UDP-based protocol, and it's really difficult to get this right because you have to coexist with lots of other algorithms on the same network. If you are "fighting" a loss-based TCP stream, it's virtually impossible for you to get your fair share of bandwidth without getting packet loss.

An algorithm for efficient local networking, where everything is under your control, is very different from something that runs over the Internet. I'm not even sure that this style of congestion avoidance is the best approach for a tightly controlled local network.

Edit: and keep in mind that packet loss is correlated with queues filling up. So as long as there are lots of loss-based algorithms in the wild, it's difficult for someone to come in with a better solution that coexists with those other flavours (at least if you're not allowed to touch the "network" itself).


We are also working on congestion avoidance for a UDP-based protocol - our goal is to share large volumes of data over the Internet, with high throughput on high-latency links and support for NAT traversal. So no UDT. We're using a variant of LEDBAT (uTP/micro-torrent). What are your negative experiences based on?


It's not really a negative experience; it's just that it can be difficult. In general I think it's useful to think of the bandwidth of a TCP stream as a function of loss and latency (and connection time, at least initially; less of a factor for long-lived active connections). The different variants of TCP have slightly different curves. You want your UDP algorithm to play nice with these, otherwise you may use more than your fair share or you'll get hammered by competing TCP streams. Apologies if this is kind of basic :)


   If you are "fighting" a loss based algorithm TCP stream it's virtually impossible for you to get your fair share of bandwidth without getting packet loss.
While I agree with the general sentiment, a UDP algorithm matched with a block-based FEC can actually get far more than its "fair share" simply by ignoring the packet loss. It's bad enough that if your routers aren't deprioritizing UDP traffic, you can generally consume really close to 100% of the available bandwidth, particularly if you pace your packets and pick an algorithm that can handle somewhere in the ballpark of 10-20% packet loss before backing down.
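
For illustration only, that behaviour boils down to something like the sketch below; the 15% threshold and the back-off step are arbitrary numbers, not anyone's actual algorithm.

    /* Aggressive rate rule sketched above: hold the rate while the FEC
     * can repair the loss, and only back off beyond that point. */
    #include <stdio.h>

    static double next_rate_mbps(double rate_mbps, double loss_fraction)
    {
        const double fec_repair_limit = 0.15;  /* assumed FEC repair capacity */

        if (loss_fraction <= fec_repair_limit)
            return rate_mbps;                  /* ignore the loss entirely */
        return rate_mbps * 0.7;                /* back off only past the limit */
    }

    int main(void)
    {
        printf("at 5%% loss:  %.1f Mbps\n", next_rate_mbps(100.0, 0.05));
        printf("at 25%% loss: %.1f Mbps\n", next_rate_mbps(100.0, 0.25));
        return 0;
    }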


Sure, but then you're a bad network citizen... that was the point I was trying to make. In your example you're getting the (high) packet loss and error-correcting for it, but right at the "fair share" point there is loss.


I suspect that Google uses DC-TCP and/or TIMELY inside their datacenters and BBR for WAN and Internet traffic.


Many TCP optimization algorithms report their performance improvements using CUBIC as their baseline. Will be very interesting to see how TCP optimization vendors adapt to the new Bottleneck Bandwidth & RTT patch.

From an industry viewpoint, I wonder how this will perform over traditionally higher-latency and higher-loss wireless networks.

As an aside, I love how small the patch is, weighing in at 875 LOC including comments.


This is very exciting, and I can't wait to see some performance data from it. Bufferbloat is a huge problem so it's awesome to see work being done in this area. It's really cool also that it can improve things just by patching servers!

How does this interact with sending traffic through routers using algorithms like fq_codel to reduce bufferbloat? Is it better to have just one or the other, or do they work well together?


CoDel is meant to solve bufferbloat at a single network node, not along the entire path.

TCP BBR, as a consequence of more accurate congestion avoidance, seems to (hopefully) reduce bufferbloat along the entire path.

Both of these algorithms should work together well since they do not really compete. (assuming that I understand...)


The linked article doesn't provide enough information to understand how they're using acks to probe bottleneck bandwidth, but they have prior work on this. If it's similar to Van Jacobson's pathchar, I would have thought there might be "interesting" interactions with NICs that do TCP TSO and GRO (using the NIC to do TCP segmentation at the sender and packet aggregation at the receiver), as these will mess with timing and, in particular, with how many acks get generated. Still, the authors are very well known and respected in the networking community, so I expect they will have some way to handle this.


It doesn't need to be as complicated as pathchar, because of a lucky fact: if you measure the end-to-end bandwidth, you are always measuring the bottleneck bandwidth :) So you can just count how many packets arrive at the other end between two time points and call that the bottleneck bandwidth.
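
A minimal sketch of that measurement, with illustrative names and units: divide the bytes delivered between two acks by the time between them.

    /* Delivery-rate measurement sketch: whatever the path looks like,
     * data can only arrive as fast as the bottleneck lets it through. */
    #include <stdio.h>
    #include <stdint.h>

    struct rate_sample {
        uint64_t prior_delivered;  /* bytes delivered when the packet was sent */
        uint64_t prior_time_us;    /* timestamp of that earlier ack */
        uint64_t delivered;        /* bytes delivered when this ack arrived */
        uint64_t time_us;          /* timestamp of this ack */
    };

    static double delivery_rate_mbps(const struct rate_sample *rs)
    {
        uint64_t bytes = rs->delivered - rs->prior_delivered;
        uint64_t interval_us = rs->time_us - rs->prior_time_us;

        if (interval_us == 0)
            return 0.0;
        return (double)bytes * 8.0 / (double)interval_us;  /* bits/us == Mbit/s */
    }

    int main(void)
    {
        /* 1.5 MB acked over 120 ms -> ~100 Mbit/s. */
        struct rate_sample rs = {
            .prior_delivered = 0, .prior_time_us = 0,
            .delivered = 1500000, .time_us = 120000,
        };
        printf("measured delivery rate: %.1f Mbit/s\n", delivery_rate_mbps(&rs));
        return 0;
    }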


You mean LRO instead of GRO, right?


The key point here is the deployment model: only the sender changes; no network or receiver-side change. This means that if people see even a 5% improvement they are going to deploy it. Also, the key architect seems to be Neal Cardwell, with Van Jacobson simply used as a signoff/vetting authority. Eric Dumazet has long done a lot of work on TCP pacing (look at his previous patches).

Basically all this means is that we have a form of TCP pacing that can work (but it suffers from the classic prisoner's dilemma).


I work at Google in Bandwidth-SRE; watched this as it developed. Van's name is not on this patch as an appeal to authority --- he contributed meaningfully to the design. The analysis to support unloaded vs. loaded probes (to measure time-separated variables) was entirely his.


This is really interesting work. I wish I was smart enough to do this kind of stuff.


There is nothing magic about networking! If you are interested, start reading things and playing around. The issue, like most complex subjects, is that most people aren't interested enough in the details to get to the point where they are confident enough to do something innovative and present it to the world.


I've done a little bit of research on the team that implemented this, and they are mostly PhDs. I was never that strong of an academic outside of my CS classes. I just didn't care about anything but CS, and spent more time programming than I probably should have in my undergraduate years. Years later I suffer the consequences of that by having a lower GPA than most grad students would.


GPA is just a number, and a snapshot taken at a certain (early) time in your life and development at that. The PhD's most likely have at least 10 more years of experience than you, and their achievement - the research you read - is again only a snapshot of their life and level at that point in time.

The impostor syndrome (which you seem to exhibit signals of) is mainly caused by only seeing the higher steps, and forgetting the steps you've already taken.


Isn't that pretty much all you need to be good at for a PhD in CS? Well that and writing papers.


Do you have any pointers for a good place to start reading?


Comer's book "Computer Networks And Internets" is a decent introduction. If you are on a budget get one edition out-of-date; it's a college textbook so the previous edition is usually about 90% cheaper than the current edition.

Once you feel comfortable with the basics, start going through RFCs for what you are interested in. In most cases the RFCs describe not just the protocol, but the reasoning behind it as well.

For me though, I need to do more than read; I need to be "hands on." One example:

When I was in college, I wanted to learn the IRC protocol better, and I noticed it was text based, so (after reading through a couple of RFCs) I connected to an IRC server with telnet in one window and the spec in another window. About 6 hours later I was finally able to connect and send messages. Those 6 hours were both less boring and more educational than 6 more hours of reading would have been.

The net result on my grades was either slightly negative or a wash; I probably missed 2 or 3 classes during those 6 hours, but I got nearly double the next-highest score on my networking midterm 3 semesters later.


I once did some research on networking and related stuff - it was not my primary topic, but I remember a few things.

During my reading, I found one of the best (as in readable) books was Doug Comer's "Internetworking With TCP/IP vol. 1" - an excellent theoretical reference. [1] However, skip the other volumes from Doug Comer (I think there are 3 volumes).

For writing practical applications, Richard Stevens' "Unix Network Programming" [2] is usually recommended, I didn't find it an easy read though. Perhaps others can pitch in.

For both the books suggested, getting a used old copy for cheap is a good idea because the core information was already there even in the first editions.

Finally, read up on PlanetLab [3]. It's a fascinating project - a small-scale internet built on top of a subset of nodes contributed by universities and research organizations across the globe - that people can contribute to, and if you actually manage to get onto the developers' list and make a contribution, you can quite honestly claim to have pushed the state of the art forward.

And lastly, be prepared to spend a good amount of time - I don't think it will be a fast or easy process. For whatever reason, I have found that the community around this work to be a little small, especially in comparison to how much it permeates pretty much everyone's life.

[1] https://www.amazon.com/Internetworking-TCP-Vol-1-Principles-...

[2] https://www.amazon.com/Unix-Network-Programming-Sockets-Netw...

[3] http://svn.planet-lab.org/


My favourite feature of this algorithm is that it ought to fix the problems of very fast TCP over very long distances when there is nonzero packet loss. I think :) Traditionally, recovering from packet loss can take several RTTs, and if round trips are long, the connection might never get up to full speed. In theory, I think this will fix that.


Very interesting, but I suspect that, like many of these, it works best if you are competing only with connections using the same approach, which is probably why Google is talking about using this inside their backbone. It estimates the queues in the network and tries to keep them empty. TCP ramps up until packets start dropping, so it spoils things for the BBR connection. Perhaps combined with CoDel to drop TCP packets early, the two could play nicely together.

Hmm, reading the code it says it does play well with TCP, but "requires the fq ("Fair Queue") pacing packet scheduler." In fact, later it says it MUST be used with fq. Hmm.

BTW the code is very readable and well commented.


I think the need for fq is just because that's how Linux does pacing, and pacing is essential for the accuracy of measurements used by bbr.
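
For what it's worth, the system-wide congestion control default is the net.ipv4.tcp_congestion_control sysctl and the fq qdisc is attached with tc, but an application can also pick the module per socket. A minimal sketch, assuming a kernel with the bbr module available and headers that expose TCP_CONGESTION (error handling trimmed):

    /* Select a congestion control module per socket on Linux; the socket
     * keeps the system default if the requested module isn't available. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        const char cc[] = "bbr";
        char buf[16] = { 0 };
        socklen_t len = sizeof(buf);

        if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, cc, strlen(cc)) < 0)
            perror("TCP_CONGESTION");   /* e.g. kernel built without bbr */

        if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, buf, &len) == 0)
            printf("congestion control in use: %s\n", buf);
        return 0;
    }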


Has anybody managed to find a preprint of the ACM queue paper floating around?


The name is quite misleading for those not familiar with previous work.

How's this different from TCP Vegas and FAST TCP which also use delay to infer the bottleneck bandwidth?


bbr does not infer bandwidth using delay; it just measures the amount of data acked between two points in time (i.e. it directly measures the bandwidth). It also directly measures the RTT. The big insight in bbr is that these two values are all you need (and you convert them into a "pace" and a window size); unlike e.g. Vegas, you don't slow down just because latency increases, you slow down when the measured bandwidth decreases. This makes it far more resilient, especially to other competing TCP flows.
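
Concretely, the conversion is just arithmetic once you have the two measurements; the gains and units below are illustrative, not the constants from the patch.

    /* Turn the two measured quantities into sender behaviour: pacing
     * rate is a gain times the measured bandwidth, and the congestion
     * window is a gain times the bandwidth-delay product. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        double bw_bytes_per_us = 12.5;   /* measured bandwidth: 100 Mbit/s */
        uint32_t min_rtt_us = 20000;     /* measured minimum RTT: 20 ms */
        double pacing_gain = 1.0;        /* >1 while probing up, <1 to drain */
        double cwnd_gain = 2.0;          /* headroom for delayed/stretched acks */

        double bdp_bytes = bw_bytes_per_us * min_rtt_us;
        double pacing_rate_mb_per_s = pacing_gain * bw_bytes_per_us;
        double cwnd_bytes = cwnd_gain * bdp_bytes;

        printf("BDP: %.0f bytes\n", bdp_bytes);
        printf("pacing rate: %.1f MB/s, cwnd: %.0f bytes\n",
               pacing_rate_mb_per_s, cwnd_bytes);
        return 0;
    }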


Is anyone able to speak to the compute and memory overhead this requires, in comparison with the loss-based algorithm? I ask on behalf of firewalls and layer 4 routers everywhere.

Can this really just be patched in, with no changes to specialized hardware?


It's an endpoint-side algorithm. This goes on whatever terminates the TCP connection; servers, PCs, etc.

Firewalls shouldn't be affected, and switches won't be. On behalf of everyone forced to use L4 switches, though, can you please reduce the buffer sizes? :P



