TCP incast: What is it? How can it affect Erlang applications?

sounds · on Dec 28, 2012

Computer Science includes research into networking. TCP incast is a way of describing network traffic statistical properties, particularly in data centers [1]

A typical CS student's view of the network is that "everything is so random that it averages out." The exact opposite is actually the case most often, to the chagrin of your local network admin.

Network traffic tends to act like everyone knew you were going to youtube right now, and everybody jumped on the network all at once. The statistical term is "self-similarity" and the Hurst Parameter (H) measures how badly a network's traffic is _not_ averaging out, but bursting as if everyone knew you were going to youtube.

It might help to mention another place we see self-similarity: fractals.

This article just breaks down the situation in a cluster. Again, self-similar traffic patterns mean everyone tries to talk at once, and their TCP stacks all back off randomly, so the total bandwidth of the network is poor.

Unfortunately, the blog post says, "What’s the remedy? We don’t have a good remedy for this yet."

Sure we do. Please google some of the relevant terms for great articles on network traffic analysis and optimization. For example "self similar network traffic" and "hurst parameter." Even the CMU site linked from the blog post has a great writeup under the section "SOLUTIONS" :) [2]

Additionally, as a comment on the blog points out, larger buffers on routers can be a really _bad_ thing! Buffer bloat tends to hide core issues with larger latencies, but not solve them.

[1] http://en.wikipedia.org/wiki/Long-tail_traffic

[2] http://www.pdl.cmu.edu/Incast/

xtacy · on Dec 29, 2012

That's right -- bufferbloat introduces more problems than "solving" incast.

One of the main reasons for incast is the synchronised bursts+backoffs causing senders to timeout. As the CMU paper pointed out, the 200ms min-RTO is too conservative for senders to recover from timeouts. Reducing it can go a long way in mitigating the incast effect.

There's a project called "R2D2" at Stanford University that proposed adding a tiny shim layer that rapidly retransmits lost TCP segments. It was done in a transparent manner to hide packet losses from TCP thus preventing TCP from experiencing a timeout. You can read more about it here: http://sedcl.stanford.edu/files/r2d2.pdf. EDIT: The highlight (relevant to the blog post) is that it is a loadable kernel module requiring no changes to TCP stack!

Disclaimer: I work in the same group, and I am familiar with R2D2.

vrv · on Dec 29, 2012

The degenerate/extreme/unrealistic case of Incast is you have switch buffer capacity to store N segments, you talk to M servers that each return one segment, and M >> N. Although RTO is calculated based on RTT and RTTVAR, in the extreme case you can get clumps (and waves) of retransmissions (depending on the properties of the network) such that even eliminating the minRTO altogether may not solve the problem at some scale. In simulation we experimented with adding an adaptive staggering to the exponential backoff algorithm and found that it helped at high server counts [1], but it was only simulation so I'd take that approach with a grain of salt.

The R2D2 work is pretty neat: different than a lot of other approaches I've seen. I'm excited to see how the FPGA implemention works!

Some comments: 1) I think any significant change to the control algorithm requires careful analysis: the variance in throughput with the multi-client experiment looks interesting, though I don't know whether that is steady state. From the graphs, R2D2 suffers more with larger filesizes whereas TCP actually improves. 2) Real datacenters can have very different traffic patterns that can break some of the assumptions about bandwidth uniformity and latency, though it's harder for academics to tackle that. 3) If you are going down the path of TCP offload, you presumably can avoid the overhead of CPU interrupts/timer programming when reducing the RTO into microseconds :). I'd be interested in seeing how R2D2's algorithms/constants work when you're able to reduce the 3ms timer to microseconds in hardware!

Also, if some kernel programmer wants to fix my once-working patch to support microsecond-granularity TCP retransmissions [2], I personally know a bunch of people who would be happy :)

[1] http://vijay.vasu.org/static/papers/sigcomm147-vasudevan.pdf

[2] https://github.com/vrv/linux-microsecondrto

jstclair · on Dec 29, 2012

What are your thoughts about flat networks (mentioned by someone in the comments on the original article)?

scottlf · on Dec 29, 2012

Good luck trying to convince your virtual server hosting service provider to provide R2D2 or similar.

Dylan16807 · on Dec 28, 2012

Yikes, that switch is only buffering a third of a millisecond's worth of packets. Easy to see why the connections would collapse.

jauer · on Dec 29, 2012

Yeah. Cisco 3750s are notorious for having small buffers. They are more appropriately used for connecting desktops in a office.

Cisco 4948s are better positioned for top of rack and are more comparable to the Juniper EX4200s that were shown to be better in the Tolly Report shown in the post.

Spidler · on Dec 29, 2012

In many cases this is a good thing (tm). The large buffers tend to "hide" retransmits and dropped packets with a buffelength of latency, which will instead increase the burstiness pattern in many cases.

ambrop7 · on Dec 28, 2012

> ... head-of-line blocking ... What’s the remedy?

I don't know anything about Erlang or the software they're dealing with, but I'm surprised that "program your software to avoid unnecessary head-of-line blocking" didn't make the list.

ricardobeat · on Dec 28, 2012

It's in the post. Cmd + F "backpressure/feedback mechanism built in to Erlang"

ambrop7 · on Dec 28, 2012

Yeah, I get that it may be non-trivial to avoid head-of-line blocking in that context, but I doubt that it's impossible. If it is impossible, maybe Erlang just isn't the best tool for the job.

From what I've read, I think you could create a separate process dedicated to sending stuff to B, and have other processes deal with other clients. It's probably not possible, but it would also be nice if you could detect these hung processes and start more workers to deal with other clients, each dealing with potentially more then one client. (yes, I may be totally wrong)

darwinGod · on Dec 29, 2012

Nice article, with good related references-an excellent read-Didn't know about Tcp incast before- good way to start a Saturday morning!

zurn · on Dec 29, 2012

Sounds like they're using wrong kind of switches and/or haven't configured ethernet flow control correctly?

zobzu · on Dec 29, 2012

The question might be, can we use iproute to change TCP's RTO in Linux yet?

mh- · on Dec 30, 2012

yes.

http://www.kernel.org/doc/Documentation/networking/ip-sysctl...

look at rto_min, rto_max, rto_initial.

you can do something like:

  ip route replace dev eth0 rto_min 10ms

zobzu · on Dec 31, 2012

Thanks!