
I'm still not entirely sure why the problems inherent in single-stream (request/response) connections couldn't be solved simply by removing the artificial RFC-recommended limit on parallel connections to a server. As the author notes, providers have been escaping this limitation for years by simply adding hostname aliases, yet he has nothing negative to say about that workaround.

Modern HTTP servers are highly concurrent; allowing 100 connections per client doesn't seem like a problem nowadays. And doing so would solve 99% of the browser performance problem without introducing a significantly more complicated multiplexing protocol.




Because running multiple TCP connections in parallel plays havoc with TCP congestion control and also interacts poorly with TCP slow start. Every TCP connection begins slow start from scratch with a small congestion window, so fetching many moderately-sized or large resources (think images) will cost you many round trips you didn't need to spend.
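A rough sketch of that cost (assuming an initial window of 10 segments and ~1460-byte segments, which are illustrative numbers, not measurements):

    # Round trips needed to deliver a resource under TCP slow start,
    # assuming IW = 10 segments, ~1460-byte segments, and no losses.
    def slow_start_round_trips(resource_bytes, iw=10, mss=1460):
        segments_left = -(-resource_bytes // mss)  # ceiling division
        cwnd = iw
        rtts = 0
        while segments_left > 0:
            segments_left -= cwnd
            cwnd *= 2          # window doubles each round trip during slow start
            rtts += 1
        return rtts

    # A 100 KB image on a fresh connection vs. on an already-warm window:
    print(slow_start_round_trips(100_000))          # fresh connection: 3 RTTs
    print(slow_start_round_trips(100_000, iw=80))   # warmed-up window: 1 RTT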


Additionally, a TCP connection is essentially an operating system resource; you need to set aside a port and space for a send and a receive buffer. It might seem fine for a client to open hundreds of connections, but imagine being a server with thousands of clients all opening hundreds of connections to you. You very quickly run out of resources and either have to close connections or reject new ones.
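Back-of-the-envelope, assuming (hypothetically) 16 KB send plus 16 KB receive buffers per connection:

    # Kernel memory tied up in socket buffers alone; buffer sizes are assumed.
    clients = 10_000
    conns_per_client = 100
    buffer_bytes_per_conn = 16 * 1024 * 2  # 16 KB send + 16 KB receive

    total_gib = clients * conns_per_client * buffer_bytes_per_conn / 2**30
    print(f"{total_gib:.0f} GiB just for socket buffers")   # ~31 GiB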


That hasn't been a practical problem in many years. Most servers have gigabytes of RAM and a 64-bit kernel nowadays.


Linux starts to act weird around 200,000 concurrent connections in my experience, even with aggressive sysctl tuning. You end up with weird edge cases like netstat literally taking 15 minutes of CPU time (in kernel) before it dumps the list of connections to stdout.

Not sure about FreeBSD or any other OSes.


Not saying netstat isn't slow, but:

    # time sh -c 'netstat -tn | wc -l'
    486206

    real    0m13.538s
    user    0m1.698s
    sys     0m10.380s

It still works with a whole lot of connections. (In fairness, only about 130k were connected.)


The problem is that it doesn't scale linearly. There's some O(n^4) algorithm being used in netstat or some kernel syscalls or something. Once you go over 250k, things get _really_ weird.


Try ss -nt instead; it uses netlink sockets instead of /proc and generally scales much better.


Thanks for the tip!


If you start out making 6 TCP connections, then in a perfect network their congestion windows all expand in parallel -- 6 times faster than a single HTTP/2 connection. You're likely to see several HTTP/2 connections made as well; there'll at least be a second one in case the first SYN is lost, just as with HTTP/1.

The key benefit of a single TCP connection is keeping it open and warm, and that can be done with keep-alive as well.


Moreover, it's bad from a QoS standpoint. Opening many TCP streams in parallel is not fair to other users sharing the routers between you and the server. You'd get more than your fair share of bandwidth.


Only if everyone else isn't also opening the same number of streams in parallel. If everyone's utilization is increased by the same factor, the balance should remain the same.


The goodput decreases for everyone then, since each flow requires its own three-way handshake and tear-down. This is particularly bad if you're starting up a lot of small flows, where the control-plane overhead becomes a non-negligible fraction of the total data sent.
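Roughly, with simplified packet counts (ignoring ACKs for data, and assuming 3 setup plus 4 tear-down packets per connection):

    # Fraction of packets that are pure control traffic when a small resource
    # is fetched over its own TCP connection. Illustrative only.
    def control_overhead(resource_bytes, mss=1460, control_packets=7):
        data_packets = -(-resource_bytes // mss)  # ceiling division
        return control_packets / (control_packets + data_packets)

    print(f"{control_overhead(4_000):.0%}")    # ~4 KB icon: ~70% control packets
    print(f"{control_overhead(500_000):.0%}")  # ~500 KB image: ~2% control packets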

I think ideally, we'd create a TCP variant where the local host maintains a per-destination receive window for all flows to that destination, so flows running in parallel, or started in rapid succession, won't each have to start from a small window and slowly grow it. Moreover, this way congestion control applies to all packet flows for a (source, destination) pair, instead of to individual flows.

HTTP/2 and HTTP pipelining take a crack at this by running multiple application-level flows (i.e. HTTP requests) through the same receiving window (i.e. the same TCP socket), but they're not the only application-level protocols that could stand to benefit.
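A minimal sketch of that idea, purely hypothetical and not any real kernel interface: congestion state keyed by destination, so new flows inherit what earlier flows to the same host have already learned.

    class PerDestinationCongestionState:
        def __init__(self, initial_window=10):
            self.initial_window = initial_window
            self.cwnd_by_dest = {}  # destination address -> shared window

        def window_for(self, dest):
            # a brand-new flow to a known destination starts from the shared window
            return self.cwnd_by_dest.setdefault(dest, self.initial_window)

        def on_ack(self, dest):
            # all flows to `dest` benefit from this growth
            self.cwnd_by_dest[dest] = self.window_for(dest) * 2

        def on_loss(self, dest):
            # ...and all flows to `dest` back off together
            self.cwnd_by_dest[dest] = max(1, self.window_for(dest) // 2)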


Doesn't that depend entirely on the queuing and session mapping algorithm? If you profile for bandwidth-per-source-ip, shouldn't that apply exactly the same restriction regardless of how many connections are started? (with multiple-connections people losing a bit because connection start/stop takes bandwidth they could use for data instead)


Yes and no. Yes in theory, using bandwidth-per-source-IP caps like you suggest could be used to solve this. No in practice, because the prevalence of NATs puts many users behind the same IP address, meaning that unfair users can still hog the upstream bandwidth from everyone else in the same NAT.

What we really want is something more like bandwidth-per-end-host caps.
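A sketch of why per-source-IP policing falls short behind NAT, using a simple token bucket keyed by source address (the rates and addresses are made up for illustration):

    import time

    class TokenBucket:
        def __init__(self, rate_bytes_per_s, burst_bytes):
            self.rate, self.capacity = rate_bytes_per_s, burst_bytes
            self.tokens, self.last = burst_bytes, time.monotonic()

        def allow(self, packet_bytes):
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= packet_bytes:
                self.tokens -= packet_bytes
                return True
            return False

    buckets = {}  # keyed by source IP: every host behind one NAT shares a bucket
    def police(src_ip, packet_bytes, rate=1_000_000, burst=64_000):
        bucket = buckets.setdefault(src_ip, TokenBucket(rate, burst))
        return bucket.allow(packet_bytes)

    # Polite and aggressive users behind 203.0.113.7 drain the same bucket:
    print(police("203.0.113.7", 1500))
    print(police("203.0.113.7", 1500))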


I don't buy the congestion-control argument. Transit routers have millions of connections going through them at any given point. What's an extra 10x in concurrency to them?


It's a fair question, but the h2 approach is doing the right thing on balance. The issue isn't really computational or memory requirements.

The startup phase of a TCP stream is essentially not governed by congestion-control feedback, because there hasn't been enough (or maybe any) feedback yet. It is initially controlled by a constant (IW) and then slowly feels its way before dynamically finding the right rate. IW generally ranges from 2 to 10 segments. Whether this is too much or too little for any individual circumstance is somewhat immaterial - it's generally going to be wrong, just because that's the essence of the probing phase: you start with a guess and go from there.

Each stream has a beginning, a middle, and an end. One large stream has 1 beginning, 1 (large) middle, and 1 end, but N small streams have N beginnings, N (smaller) middles, and N ends. The amount of data in the beginning is not a function of the stream size (other than being capped by it); it is instead governed by the latency and bandwidth of the network. So more streams means more data gets carried in the startup phase. If the beginning is known to be a poorly performing stage (and for TCP it is), then creating more of them and having them cover more of the data is a bad strategy.

In practice, IW is too small for the median stream - but there is a wide distribution of "right sizes," so it's ridiculously hard to get right. Maybe IW=10 and the right size is 30 segments; that's one stream 3x too small - it isn't 20x or 50x too small - so when you open 50 parallel TCP connections you are effectively sending at IW * 50. And that does indeed cause congestion and packet loss. And it's not the kind of "I dropped 1 packet from a run of 25, please use fast-retransmit or SACK to fix it for me" packet loss we like to see; it's more of the "that was a train wreck, I need slow timers on the order of hundreds of milliseconds to try again for me" packet loss that brings tears to my eyes. One of the reasons for this goes back to the N-beginnings problem - if you lose a SYN or your last data packet, the recovery process is inherently much slower, and N streams have N times more SYNs and "last" packets than 1 stream does. Oh, and 50 isn't an exaggeration. HTTP routinely wants to generate a burst of 100 simultaneous requests these days (which is why header compression when doing multiplexing is critical - but that's another post).
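The arithmetic behind that "IW * 50" point, with IW = 10 and 1460-byte segments as assumed example values:

    iw_segments = 10
    mss = 1460
    parallel_connections = 50

    single_conn_burst = iw_segments * mss
    parallel_burst = parallel_connections * iw_segments * mss

    print(single_conn_burst)   # ~14.6 KB hits the bottleneck link at once
    print(parallel_burst)      # ~730 KB hits it at once, before any feedback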

So the busier larger flow both induces less loss and is more responsive when it experiences loss. That's a win.

And after all that you still have the priority problem. 50 uncoordinated streams, all entering the network at slightly different times with slightly different amounts of data, will be extremely chaotic with respect to which data gets transferred first. And "first" matters a lot to the web - things like JS/CSS/fonts all block you from using the page, but images might not. And even within those images some are more important than others (some might not even turn out to be on the screen at first - oh wait, you just scrolled, now I need to change those priorities). Providing a coordination mechanism for that is one of the real pieces of untapped potential hiding in h2's approach of muxing everything together.
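A toy sketch of that coordination idea (not h2's actual priority tree, just the general shape): with everything mux'd onto one connection, the sender gets to pick whose data goes out next.

    import heapq

    class MuxScheduler:
        def __init__(self):
            self.queue = []  # (priority, sequence, stream_id); lower value = more urgent
            self.seq = 0

        def enqueue(self, stream_id, priority):
            heapq.heappush(self.queue, (priority, self.seq, stream_id))
            self.seq += 1

        def next_stream(self):
            return heapq.heappop(self.queue)[2] if self.queue else None

    sched = MuxScheduler()
    sched.enqueue("hero.jpg", priority=3)
    sched.enqueue("app.css", priority=0)   # render-blocking: send first
    sched.enqueue("app.js", priority=1)
    print(sched.next_stream())  # app.css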

There is a downside: if you have a single non-induced loss (i.e., due to some other data source) and it impacts the early part of the single TCP connection, then it impacts all the other virtual streams because of TCP's in-order delivery. If they were split into N TCP connections, only one of them would be impacted. This is a much-discussed property in the networking community, and I've seen it in the wild - but nobody has demonstrated that it is a significant operational problem.

The h2 arrangement is the right thing to do within a straightforward TLS / HTTP1-compatible-semantics / TCP stack. Making further improvements will involve breaking out of that traditional TCP box a bit (QUIC is an example; Minion is also related, even Mosh is related), and that is appropriately separated out as next-stage work that wasn't part of h2. It's considerably more experimental.


The first answer is that parallelism without priority (which is what parallel h1 is) can lead to some really horrible outcomes. Critical pieces of the page get totally shoved out of the way while bulky but less important parts get the bandwidth they need. That's why h2 and SPDY are both mux'd and prioritized.

Also, you can definitely over-shard with h1. As said downthread, that can cause congestion problems and indeed even packet loss. For a little while Pinterest had gigantic packet-loss problems that were due to over-sharding of images.

The really annoying thing is that the "right amount of sharding" depends on the available bandwidth, the size of the resources being sent, and the latency between client and server. Those things aren't really knowable on a generic per-origin basis when setting your links up - so the SPDY/h2 approach works better in practice.
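A rough illustration of why that is, using a bandwidth-delay-product estimate with made-up example paths:

    # How many shards (each sending one IW's worth in its first RTT) would it
    # take to fill the pipe? Depends entirely on bandwidth and RTT.
    def shards_to_fill_pipe(bandwidth_bps, rtt_s, iw_segments=10, mss=1460):
        bdp_bytes = bandwidth_bps / 8 * rtt_s
        first_rtt_bytes_per_conn = iw_segments * mss
        return max(1, round(bdp_bytes / first_rtt_bytes_per_conn))

    print(shards_to_fill_pipe(100_000_000, 0.080))  # fast 80 ms path: ~68 shards
    print(shards_to_fill_pipe(2_000_000, 0.080))    # slow 80 ms path: ~1 shard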

If I had a criticism here, it's that implementing priority correctly is a lot trickier than just having a bunch of independent connections. We will probably see some bad implementations in the early days, until folks internalize how important it is.


Will Chan of Chromium wrote a good post explaining the congestion issues with opening many TCP connections - https://insouciant.org/tech/network-congestion-and-web-brows...


His test showed what happens on a slow network: too much parallelism leads to retransmits and congestion. But it doesn't show ill effects on a fast network with plenty of bandwidth.

The question is how to get it right, and the problem is that you can't get it right without knowing in advance how much bandwidth is available to you. Limiting concurrency limits congestion on slow networks, but it caps you unnecessarily on fast ones. The same is true for SPDY/HTTP2; a single stream will never give you the same concurrency as multiple streams.


Retransmits and congestion are not bad by nature - they're a symptom that the network is being heavily used.

Edit: never mind, I misread the test; his tools clearly show goodput (good, desirable throughput) going down under congestion. Lowering initcwnd (as Chrome 29 did) eliminates this on slower connections, improving user experience. I would like to see page load time, though, as a proxy for time to screen. It's intriguing that the 6s total page load time did not seem to change.


I wanted to support the big picture idea of your comment about packet loss not being inherently bad. Too many strategies are based on the principle that every packet is precious rather than total system goodput. Indeed TCP congestion control is really premised on loss happening - it keeps inching up the sending rate until a loss is induced and then it backs off a bit.
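A minimal AIMD sketch of that "inch up until loss, then back off" behaviour (illustrative only, not a faithful TCP implementation):

    def aimd_step(cwnd, loss_detected, increase=1, decrease_factor=0.5):
        if loss_detected:
            return max(1, cwnd * decrease_factor)  # multiplicative decrease
        return cwnd + increase                     # additive increase per RTT

    cwnd = 10
    for rtt in range(20):
        cwnd = aimd_step(cwnd, loss_detected=(rtt == 12))
    print(cwnd)  # grew, halved once at the loss, then resumed probing upward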

OTOH, TCP really performs poorly in the face of significant levels of loss. So high levels of loss specifically in HTTP really are a bad sign, at least as currently constructed.

Also worth being concerned with: losses that occur late in the path waste a lot of resources getting to that point, resources that could instead be used by other streams sharing only part of the path. (E.g., if a stream from NYC to LAX experiences losses in SFO, it is wasting bandwidth that could be used by someone else's PHI-to-Denver stream.) A packet-switched network has to be sensitive to total system goodput, not just that of one stream.


Yes, you're right, and that's why parallelism is restricted: so that slow connections can also be part of the Internet.



