Nice article which matches my experience when it comes to optimizing for performance: Linux defaults are never good defaults and you don't need webscale or anything before you get bitten by them.
To give a few examples: on many distributions you get an open-file limit of 1024, a tiny shared-memory default (shmall), and Nagle's algorithm enabled by default.
Another thing that we noticed at work (shameless plug for getstream.io) when it comes to tail latency for APIs / HTTP services:
- TLS over TCP is annoyingly slow (too many round trips before the first byte)
- Having edge nodes / POPs close to end users greatly improves tail latency (and reduces latency-related errors). This works incredibly well for simple relays (the "weak" link has lower latency)
- QUIC is awesome
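To make the Nagle point concrete: disabling it is a per-socket setsockopt. This is just a generic sketch (the helper name is mine, not from the article or the comment above):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable Nagle's algorithm on a connected TCP socket so small writes
 * go out immediately instead of being coalesced while waiting for ACKs. */
static int disable_nagle(int fd)
{
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}
```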
That fails to note that `FD_SETSIZE` only applies if you're statically allocating your sets and rely on the libc defaults. If you do dynamic allocation (or, on some libcs, if you define the macro before including headers), you can select() a million FDs just fine.
It's still a bad idea for performance reasons (though `poll` can actually be worse on dense sets), but it's not actually the open-file limit that's the problem.
I don't think it fails to address that in a way that matters. The fact is that the default is still 1024. If you have bumped the size at build time or you dynamically allocate the sets, then you are more than welcome to bump the soft limit to the hard limit at the start of your program. (And set it back to the default before you exec anything else.)
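For reference, a rough sketch of the "bump the soft limit to the hard limit at startup" idea using the standard getrlimit/setrlimit calls (the helper name is made up, error handling trimmed):

```c
#include <sys/resource.h>

/* Raise the open-file soft limit to the hard limit at program start.
 * The caller should keep `saved` around and restore it before exec'ing
 * anything that might still rely on select() with the default FD_SETSIZE. */
static int raise_nofile_limit(struct rlimit *saved)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    *saved = rl;               /* remember the old limits for later restore */
    rl.rlim_cur = rl.rlim_max; /* soft limit := hard limit */
    return setrlimit(RLIMIT_NOFILE, &rl);
}
```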
It would've been more awesome if it supported BBR for congestion control. QUIC's gains in practice can be wiped out just by the implementation not having BBR, so sometimes QUIC can be even slower than HTTP/2 over TCP (if TCP is properly configured).
In my experience QUIC is worse than TCP in a heterogeneous environment when optimizing for throughput. The added CPU usage from user-space packet processing is a big factor for battery-powered devices, and congestion control lives a life of its own, which often means it doesn't get its fair share of bandwidth in the presence of TCP traffic.
I think QUIC can be awesome, and I hope it will be. But I wouldn't say we're there yet. Low-level, kernel-adjacent things take time. Networks are extremely heterogeneous and weird. Maybe in 5-10 years.
In the short-medium term, I think we could get much more bang for the buck if there was an easy way to improve the defaults on Linux and/or its distros.
That honestly just makes me think: Is there a distro where this isn't the case? Where the defaults are set up for performance in a modern server context, with the expectation that the system will be admin'd by someone technical who knows the tradeoffs? Heck, the decisions + tradeoffs can all be documented in docs.
Is there a reason I'm missing why this wouldn't be worth jumping on?
Reupping my assertion that kernel network protocol stacks are painfully obsolete in a world of containers, namespaces, and virtual machines. Default parameters are trying to balance competing interests, none of which will be relevant to your use case. Userspace protocol stacks are more aligned with the end-to-end principle. QUIC is a convenient compromise that moves most of the complexity up into the application while still benefitting from the kernel UDP stack with relatively fewer knobs.
On prod servers I see a bunch of frontend stalls and L2 instruction ("code") misses from the kernel TCP stack; having each process statically embed its own network stack may make that worse (though using a dynamic shared QUIC library in userspace across multiple processes partially addresses that, with other tradeoffs).
Of course, depending on the use case, the benefit from first-order network behavior improvements is almost certainly more important than the second-order cache pollution effects of replicated/separate network stacks.
When using a userspace stack you can (and should!) optimize your program during and after linking to put hot code together on the same or nearby cache lines and pages. You cannot do this, or anything approximating it, between an application and the Linux kernel: when Linux is built, the linker doesn't know which parts of its sprawling network stack are hot or cold.
I always enjoy reading posts about optimization like this one.
Optimizing a running service is often underrated. Many engineers focus on scaling horizontally or vertically, adding more instances or using more powerful machines to solve problems. But there’s a third option: service optimization, which offers significant benefits.
Whether it's tuning TCP configurations or profiling to identify CPU and memory bottlenecks, optimizing the service itself can lead to better performance and cost savings. It’s a smart approach that shouldn’t be overlooked.
> The congestion window size is calculated with the formula cwnd * 2^wscale . In the output above, notice that wscale equals 7 , and cwnd equals 10. This comes out to 10 x 2^7 according to the formula, equal to 1280 bytes.
That's just not true.
If the author is reading this: the cwnd unit (in Linux) is packets, not bytes. I'm not sure where they got the formula from (that's the receive window scaling formula and has nothing to do with cwnd), but IIRC slow start defaults to 10 packets. With their MSS being 32K (a loopback connection), the actual cwnd size in bytes is at least 10 * 32K.
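For anyone who wants to check this on their own box: on Linux you can read the sender's cwnd (in packets) and the MSS per socket via TCP_INFO. A quick sketch, with the helper name being mine:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>

/* Print the congestion window (in packets, not bytes) and the MSS of a
 * connected TCP socket. Linux-specific (struct tcp_info / TCP_INFO). */
static void print_cwnd(int fd)
{
    struct tcp_info ti;
    socklen_t len = sizeof(ti);
    if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
        printf("cwnd=%u packets, mss=%u bytes (~%u bytes in flight)\n",
               ti.tcpi_snd_cwnd, ti.tcpi_snd_mss,
               ti.tcpi_snd_cwnd * ti.tcpi_snd_mss);
}
```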
Thanks for taking the trouble to comment on my blog post.
You are correct that the cwnd unit is indeed packets.
And the MSS of a loopback connection is, as you say, 32K. But here we are concerned not with connecting to other processes on the same server, but with routing across the WAN to the other side of the globe. And unfortunately, unless we are able to enable jumbo frames, we are stuck with the historical relic of 1500-byte packets. This means the initial congestion window is only about 15K (10 packets × ~1500 bytes).
Thanks for correcting my error about cwnd units. I'll fix the post later today.
No problem! Agreed that slow-start restart from idle is a little dumb, especially with connections that go idle all the time. IMO whatever timer it uses should be in minutes, not round trips.
Thankfully, newer congestion control algorithms like BBR (which you can enable, though Linux currently ships an older version) do not suffer from this and do not drop back to slow start.
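If it helps anyone: besides the system-wide net.ipv4.tcp_congestion_control sysctl, a single socket can be switched to BBR with the TCP_CONGESTION option, assuming the tcp_bbr module is available and allowed on the machine. A rough sketch (helper name is made up):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/* Request BBR for this socket. Fails if the bbr module isn't loaded, or if
 * an unprivileged process isn't allowed to select it (see
 * net.ipv4.tcp_allowed_congestion_control). */
static int use_bbr(int fd)
{
    const char algo[] = "bbr";
    return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, algo, strlen(algo));
}
```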
Wrote my Computer Networking final last Friday. Some parts of TCP felt dry to study, but it's incredible to see how those same concepts are being used for optimization.