Golang disables Nagle's Algorithm by default (withinboredom.info)
806 points by withinboredom on Dec 29, 2022 | 371 comments



If you trace this all the way back, it's been in the Go networking stack since the beginning, with the simple commit message of "preliminary network - just Dial for now" [0] by Russ Cox himself. You can see the exact line in the 2008 repository here [1].

As an aside, it was interesting to chase the history of this line of code: it started alongside a public SetNoDelay function, then became a direct system call, then went back to an abstract call. Along the way it was also broken out into a platform-specific library, then back into a general library, and gone over with a pass from gofmt, all over a "short" 14 years.

0 - https://github.com/golang/go/commit/e8a02230f215efb075cccd41...

1 - https://github.com/golang/go/blob/e8a02230f215efb075cccd4146...


That code was in turn a loose port of the dial function from Plan 9 from User Space, where I added TCP_NODELAY to new connections by default in 2004 [1], with the unhelpful commit message "various tweaks". If I had known this code would eventually be of interest to so many people maybe I would have written a better commit message!

I do remember why, though. At the time, I was working on a variety of RPC-based systems that ran over TCP, and I couldn't understand why they were so incredibly slow. The answer turned out to be TCP_NODELAY not being set. As John Nagle points out [2], the issue is really a bad interaction between delayed acks and Nagle's algorithm, but the only option on the FreeBSD system I was using was TCP_NODELAY, so that was the answer. In another system I built around that time I ran an RPC protocol over ssh, and I had to patch ssh to set TCP_NODELAY, because at the time ssh only set it for sessions with ptys [3]. TCP_NODELAY being off is a terrible default for trying to do anything with more than one round trip.

When I wrote the Go implementation of net.Dial, which I expected to be used for RPC-based systems, it seemed like a no-brainer to set TCP_NODELAY by default. I have a vague memory of discussing it with Dave Presotto (our local networking expert, my officemate at the time, and the listed reviewer of that commit) which is why we ended up with SetNoDelay as an override from the very beginning. If it had been up to me, I probably would have left SetNoDelay out entirely.

As others have pointed out at length elsewhere in these comments, it's a completely reasonable default.

I will just add that it makes no sense at all that git-lfs (lf = large file!) should be sending large files 50 bytes at a time. That's a huge number of system calls that could be avoided by doing larger writes. And then the larger writes would work better for the TCP stack anyway.

And to answer the question in the article:

> Much (all?) of Kubernetes is written Go, and how has this default affected that?

I'm quite confident that this default has greatly improved the default server latency in all the various kinds of servers Kubernetes has. It was the right choice for Go, and it still is.

[1] https://github.com/9fans/plan9port/commit/d51419bf4397cf13d0...

[2] https://news.ycombinator.com/item?id=34180239

[3] http://publications.csail.mit.edu/lcs/pubs/pdf/MIT-LCS-TM-65...


> I will just add that it makes no sense at all that git-lfs (lf = large file!) should be sending large files 50 bytes at a time. That's a huge number of system calls that could be avoided by doing larger writes. And then the larger writes would work better for the TCP stack anyway.

FWIW, at least one git-lfs contributor agrees with you: https://github.com/git-lfs/git-lfs/issues/5242#issuecomment-...

> I think the first thing we should probably look at here is whether Git LFS (and the underlying Go libraries) are optimizing TCP socket writes or not. We should be avoiding making too many small writes where we can instead make a single larger one, and avoiding the "write-write-read" pattern if it appears anywhere in our code, so we don't have reads waiting on the final write in a sequence of writes. Regardless of the setting of TCP_NODELAY, any such changes should be a net benefit.

My 2ct: this type of low-hanging-fruit optimization is often found even in widely used software, so it shouldn't really be a surprise. It's always frustrating when you're the first to find those, though.


As one on the 'supports this decision' side, thanks for taking time from your day to give us the history.

It would be really nice if such context existed elsewhere other than a rather ephemeral forum. It would be awesome to somehow have annotations around certain decisions in a centralized place, though I have no idea how to do that cleanly.


For this kind of decision, why not simply keep notes as comments in the code? These can easily be added later, even 14+ years after the code was written. Then, when someone dives into the codebase to figure out why something was done this or that way, the answer is right there. No need to dive into (and scavenge, sometimes) VCS history.


To "alter" a commit message after it has already been widely disseminated, branch from the offending commit, make a new commit with a message that contains the relevant info, switch back to mainline, and then merge that branch.


That would be an awesome start.

It would also be really nice to have a 'book' of sorts of this type of lore. Though admittedly, it would probably be hard to remember what to even include without stories like this.


The Arc42 documentation template [1] has one of its 12 sections dedicated to "important, expensive or critical design decisions". It makes a pretty good structure for big-picture documentation in a "book" next to the code.

[1] https://news.ycombinator.com/item?id=32353500


just write a note instead



use git notes for attaching information to important commits, after the fact, without altering their SHA


How do you share notes with other users?


Notes are just objects stored in the git repo. They are distributed along with all the other objects.


You had me worried there for a bit, but notes are not distributed by default.

For the curious, you can push/fetch

  refs/notes/*
...to share notes.



Thanks for the explanation, Russ!

As a maintainer of Caddy, I was wondering if you have an opinion on whether it makes sense to have TCP_NODELAY on for a general purpose HTTP server. Do you think it makes sense for us to change the default in Caddy?

Also, would there be appetite for making it easier to change the mode in an http.Server? It feels like you need to reach too deep to change that when using APIs at a higher level than TCP (although I may have missed some obvious way to set it more easily). For HTTP clients it can obviously be changed easily in the dialer, where we have access to the connection early on.
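In case it helps frame the question, the workaround I'd reach for today is roughly the sketch below (hypothetical code, not what Caddy actually does): wrap the listener handed to http.Server so each accepted connection re-enables Nagle before the HTTP layer ever sees it.

  package main

  import (
      "log"
      "net"
      "net/http"
  )

  // nagleListener wraps a net.Listener and re-enables Nagle's algorithm on
  // every accepted TCP connection (Go's default is SetNoDelay(true)).
  type nagleListener struct{ net.Listener }

  func (l nagleListener) Accept() (net.Conn, error) {
      c, err := l.Listener.Accept()
      if err != nil {
          return nil, err
      }
      if tc, ok := c.(*net.TCPConn); ok {
          tc.SetNoDelay(false) // turn Nagle's algorithm back on
      }
      return c, nil
  }

  func main() {
      ln, err := net.Listen("tcp", ":8080")
      if err != nil {
          log.Fatal(err)
      }
      log.Fatal((&http.Server{}).Serve(nagleListener{ln}))
  }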


Caddy is likely to serve RPCs, right? In an RPC context I doubt it ever really makes sense, as latency is typically more important than throughput.


Thanks for the insight and history brief, Russ!


Thanks for the history!


In my opinion, it's correct for it to be disabled by default.

I think Nagle's algorithm does more harm than good if you're unaware of it. I've seen people writing C# applications and wondering why stuff is taking 200ms. Some people don't even realise it's Nagle's algorithm (edit: interacting with Delayed ACKs) and think it's network issues or a performance problem they've introduced.

I'd imagine most Go software is deployed in datacentres where the network is high quality and it doesn't really matter too much. Fast data transfer is probably preferred. I think Nagle's algorithm should be an optimisation you can optionally enable (which you can) to more efficiently use the network at the expense of latency. Being more "raw" seems like the sensible default to me.


The basic problem, as I've written before[1][2], is that, after I put in Nagle's algorithm, Berkeley put in delayed ACKs. Delayed ACKs delay sending an empty ACK packet for a short, fixed period based on human typing speed, maybe 100ms. This was a hack Berkeley put in to handle large numbers of dumb terminals going in to time-sharing computers using terminal to Ethernet concentrators. Without delayed ACKs, each keystroke sent a datagram with one payload byte, and got a datagram back with no payload, just an ACK, followed shortly thereafter by a datagram with one echoed character. So they got a 30% load reduction for their TELNET application.

Both of those algorithms should never be on at the same time. But they usually are.

Linux has a socket option, TCP_QUICKACK, to turn off delayed ACKs. But it's very strange. The documentation is kind of vague, but apparently you have to re-enable it regularly.[3]

Sigh.

[1] https://news.ycombinator.com/item?id=10608356

[2] https://developers.slashdot.org/comments.pl?cid=14515105&sid...

[3] https://stackoverflow.com/questions/46587168/when-during-the...


Gotta love HN. The man himself shows up to explain.


Imagine being on a math forum discussing Fermat’s theorem and the guy shows up.

This is such a cool aspect of CS being a young field: influential people are still alive!



Andrew Wiles showing up would probably be the next best thing.


Russ also showed up. https://news.ycombinator.com/item?id=34181846

Readers might also enjoy his writeup on how google code search worked. https://swtch.com/~rsc/regexp/regexp4.html Just discovered it.


Fermat would probably get banned for too much trolling around the proof of his last theorem...


Can I show up for the future?


For those like me who didn't know, GP designed Nagle's Algorithm in 1984 working at Ford Aerospace.


I thought you mistyped your comment and wanted to reply to rsc ... then I clicked on animats' profile. Yeah, HN is becoming a treasure trove for CS.


Yeah, it's pretty cool. I'm gonna start saving these moments every time they happen. Last time I witnessed something like this was:

https://news.ycombinator.com/item?id=24455758


> The documentation is kind of vague, but apparently you have to re-enable it regularly.[3]

This is correct. And in the end it means more or less that setting the socket option is more of a way of sending an explicit ACK from userspace than a real setting.

It's not great for common use-cases, because making userspace care about ACKs will obviously degrade efficiency (more syscalls).

However it can make sense for some use-cases. E.g. I saw the s2n TLS library using QUICKACK to avoid the TLS handshake being stuck [1]. Maybe also worthwhile to be set in some specific RPC scenarios where the server might not immediately send a response on receiving the request, and where the client could send additional frames (e.g. gRPC client side streaming, or in pipelined HTTP requests if the server would really process those in parallel and not just let them sit in socket buffers).

[1] https://github.com/aws/s2n-tls/blob/46c47a71e637cabc312ce843...
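For reference, re-arming it from Go looks roughly like the sketch below (Linux-only; assumes golang.org/x/sys/unix; setQuickAck is a hypothetical helper you would call around each read that needs an immediate ACK):

  import (
      "net"

      "golang.org/x/sys/unix"
  )

  // setQuickAck sets TCP_QUICKACK on the connection. It is not sticky: the
  // kernel may clear it again, so it has to be re-armed before each read
  // where an immediate ACK matters.
  func setQuickAck(c *net.TCPConn) error {
      raw, err := c.SyscallConn()
      if err != nil {
          return err
      }
      var sockErr error
      if err := raw.Control(func(fd uintptr) {
          sockErr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_QUICKACK, 1)
      }); err != nil {
          return err
      }
      return sockErr
  }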


Any kernel engineer reading that can explain why TCP_QUICKACK isn't enabled by default? Maybe it's time to turn it on by default, if it was just a workaround for old terminals.


Enabling it will lead to more ACK packets being sent, which leads to lower efficiency of TCP (the stack spends time in processing ACK packets) and lower link utilization (these packets also need space somewhere).

My thought is that the behavior is probably correct by default, since a receiver without knowledge of the application protocol is not able to know whether follow-up data will arrive immediately, and is therefore not able to decide whether it should send an ACK or wait for more data. It could wait for a signal from userspace to send that ACK - which is exactly what QUICKACK is doing - but that comes with the drawback of now needing an extra syscall per read.

On the sender side the problem seems solvable more efficiently. If one aggregates data in the application and just sends everything at once using an explicit flush signal (either using CORKing APIs or enabling TCP_NODELAY), no extra syscall is required while minimal latency can be maintained.

However, I think it's a good question whether the delayed ACK periods are still the best choices for the modern internet, or whether much smaller delays (e.g. 5ms, or something along the lines of a fraction of the RTT) could be helpful.


Thanks for this reply. What I find especially annoying is that the TCP client and server start with a synchronization round-trip which is supposed to be used to negotiate options, and that isn't happening here! Why can't the client and the server agree on a sensible set of options (no delayed ACK if the peer is using Nagle's algorithm)??


Is this referring to Nagle on the server, and delayed ACK on the client?


TCP_QUICKACK is mostly used to send initial data along with the first ACK upon establishing a connection, or to make sure to merge the FIN with the last segment.


How is it possible that delayed ACKs and Nagle's algorithm are both defaults, anywhere? Isn't this a matter of choosing one or the other?


Did the move from line oriented input to character input also occur around then?

I remember as a student, vi was installed and we all went from using ed to vi.

There was much gnashing and wailing from the admins of the VAX.


By 1984, input would have been largely character-oriented if desired -- you already had desktop PCs with joystick and mouse too. The problem was the original party-line Ethernet with large numbers of telnet clients or some other [nonstop, nonburst] byte-oriented protocol or serial hardware concentrator, which was a universal situation at educational institutions of the mid-to-late eighties. The Berkeley hack referred to above likely boosted the number of clients you could run on one Ethernet sub with acceptable responsiveness.


From the bottom of the article:

> Most people turn to TCP_NODELAY because of the “200ms” latency you might incur on a connection. Fun fact, this doesn’t come from Nagle’s algorithm, but from Delayed ACKs or Corking. Yet people turn off Nagle’s algorithm … :sigh:


Yeah, but the interaction between Nagle's Algorithm and Delayed ACKs is what causes the 200ms.

Servers tend to enable Nagle's algorithm by default. Clients tend to enable Delayed ACKs by default, and then you get this horrible interaction, all because they're trying to be more efficient but end up stalling each other.

I think Go's behavior is the right default because you can't control every server. But if Nagle's were off by default on servers, then we wouldn't need to disable Delayed ACKs on clients.


Part of OP's point is that 'most clients' do not have an ideal congestionless/lossless network between them and, well, anything.


Why does a congestionless network matter here? Nagle's algorithm aggregates writes together in order to fill up a packet. But you can just do that yourself, and then you're not surprised. I find it very rare that anyone is accidentally sending partially-filled packets; they have some data and they want it to be sent now, and are instead surprised by the fact that it doesn't get sent now because their data doesn't happen to be too large to fit in a single packet. Nobody is reading a file a byte at a time and then passing that 1 byte buffer to Write on a socket. (Except... git-lfs I guess?)

Nagle's algorithm is super weird as it's saying "I'm sure the programmer did this wrong, here, let me fix it." Then the 99.99% of the time when you're not doing it wrong, the latency it introduces is too high for anything realtime. Kind of a weird tradeoff, but I'm sure it made sense to quickly fix broken telnet clients at the time.


> Nagle's algorithm aggregates writes together in order to fill up a packet.

Not quite an accurate description of Nagle's algorithm. It only aggregates writes together if you already have in-flight data. The second you get back an ACK, the next packet will be sent regardless of how full it is. Equally, your first write to the socket will always be sent without delay.

The case where you want to send many tiny packets with minimal latency doesn't really make sense for TCP, because eventually the packet overhead and traffic control algorithms will end up throttling your throughput and latency. Nagle only impacts cases where you're using TCP in an almost pathological manner, and elegantly handles that behaviour to minimise overheads, and the associated throughput and latency costs.

If there's a use case where latency is your absolute top priority, then you should be using UDP, and not TCP. TCP will always nobble your latency, because it insists on ordered data delivery and will delay just-received packets if they arrive ahead of preceding packets. Only UDP gives you the ability to opt out of that behaviour, ensures that data is sent and received as quickly as your network allows, and lets your application decide for itself how to handle missing data.


It makes perfect sense if you consider the right abstraction. TCP connections are streams. There are no packets on that abstraction level. You’re not supposed to care about packets. You’re not supposed to know how large a packet even is.

The default is an efficient stream of bytes that has some trade-off to latency. If you care about latency, then you can set a flag.


There is no perfect abstraction. Speed matters. A stream where data is delivered ASAP is better than a stream where the data gets delayed... maybe... because the OS decides you didn't write enough data.

The default actually violates the abstraction more because now you care how large a packet is, because somehow writing a smaller amount of data causes your latency to spike for some mysterious reason.


> A stream where data is delivered ASAP is better than a stream where the data gets delayed

That depends on your situation, because as you say no abstraction is perfect. Having a stream delivered "faster" isn't helpful if it means your overhead makes up 50% of your traffic, which is exactly what Nagle avoids.

Nagle's algorithm is also pretty smart: it's only going to delay your next packet until it's either full, or the far end has acknowledged your preceding packet. If you've got a crap ton of data to send, and you're dumping it straight into the TCP buffer, then Nagle won't delay anything, because there's enough data to fill packets. Nagle only kicks in if you're doing many frequent tiny writes to a TCP connection, which is rarely a valid thing to do if you care about latency and throughput, so Nagle's algorithm assuming the dev has made a mistake is reasonable.

If you really care about stream latency, then UDP is your friend. Then you can completely dispense with all the traffic control processes in TCP and have stuff sent exactly when you want it sent.


Oftentimes when people want to send five structs, they just call send five times. I find delayed ACKs a lot more weird compared to Nagle.


In those cases it would be better to call writev() which was designed to coalesce multiple buffers into one write call.

How it sends the data is however up to the implementation, and whether it delays the last send if the TCP buffer isn't entirely full I'm not sure - but it doesn't make sense to do so, so I would guess not.

https://linux.die.net/man/2/writev
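In Go, the rough equivalent is net.Buffers, which performs a vectored write (writev on platforms that support it). A minimal sketch, with sendRecords and its arguments being hypothetical names standing in for the five structs:

  // sendRecords coalesces several small buffers into a vectored write
  // instead of issuing one write syscall per buffer.
  func sendRecords(conn net.Conn, header, payload, trailer []byte) error {
      bufs := net.Buffers{header, payload, trailer}
      _, err := bufs.WriteTo(conn)
      return err
  }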


Nagle's algorithm matters because the abstraction that TCP works on, and which was inherited by the BSD socket interface, is that of emulating a full-duplex serial port.

Compare with OSI stack, where packetization is explicit at all layers and thus it wouldn't have such an issue in the first place.


Yeah it seems crazy to have that kind of hack in the entire network stack and on by default just because some interactive remote terminal clients didn't handle that behavior themselves.


Most clients that OP deals with, anyway. If your code runs exclusively in a data center, like the kind I suspect Google has, then the situation is probably reversed.


Consider the rise of mobile devices. Devices that don't have a good internet connection are probably everywhere now.

It's no longer like 10 years ago, when you either had good internet or no internet at all. The number of devices on a shitty network has grown a lot compared to the past.


Almost every application I've written atop a TCP socket batches up writes into a buffer and then flushes out the buffer. I'd be curious to see how often this doesn't happen.


Are you replying to the correct person? I don't think I ever mentioned how you should write a program. I only said that assuming users have a good internet connection is a naive idea nowadays. (GTA 5 is the worst example in my opinion: lose a few UDP packets and your whole game exits to the main menu. How the f**k did the devs assume UDP packets are never lost?)


What I mean to say is that whether or not your mobile device has bad internet shouldn't matter. Most applications are buffering their reads and writes. This makes TCP_NODELAY a non-issue.

Most importantly buffering doesn't spend a whole bunch of CPU time context switching into the kernel. Even if you are taking advantage of Nagle's, every call to write is a syscall, which calls into the kernel to perform the write. On a mobile device this would tank your battery. This is the main reason writes are buffered in applications.


This is basically the first thing I check when diagnosing performance issues with network apps. Most probably are buffering now, but surprisingly many don't. MySQL's client library, for example, didn't for years (it's probably been fixed for a decade or more at this point).


If you run all of your code in one datacenter, and it never talks to the outside world, sure. That is a fairly rare usage pattern for production systems at Google, though.

Just like anyone else, we have packet drops and congestion within our backbone. We like to tell ourselves that the above is less frequent in our network than the wider internet, but it still exists.


If your DC-DC links are regularly as noisy as shitty apartment WiFi routers competing for air time on a narrow band, fix your DC links.


Clients having delayed acks has a very good reason: those ACKs cost data, and clients tend to have much higher download bandwidth than upload bandwidth. Really, clients should probably be delaying acks and nagling packets, while servers should probably be doing neither.


Clients should not be Nagling unless the connection is emitting tiny writes at high frequency. But that's a very odd thing to do, and in most/all cases there's some reasonable buffering occurring higher up in the stack that Nagle's algorithm will only add overhead to. Making things worse are TCP-within-TCP things like HTTP/2.

Nagle's algorithm works great for things like telnet but should not be applied as a default to general purpose networking.


Why would Nagle's algorithm add delay to "reasonable buffering up the stack"? Assuming that buffering results in writes to the TCP stack greater than the packet size, Nagle's algorithm won't add any delay.

The only place where Nagle's algorithm adds delay is when you're doing many tiny writes to a socket, which is exactly the situation you believe Nagle's should be applied to.


The size of an ACK is minuscule (40 bytes) compared to any reasonable packet size (usually around 1400 bytes).

In most client situations where you have high down bandwidth but limited up, that suggests the vast majority of data is heading towards the client, and the client isn't sending much outbound. In which case your client may end up delaying every ACK to the maximum timeout, simply because it doesn't often send reply data in response to a server response.

HTTP is a clear example of this. The client issues a request to the server, and the server replies. The client accepts the reply, but never sends any further data to the server. In this case, delaying the client ACK is just a waste of time.


"Be conservative in what you send and liberal in what you accept"

I would cite Postel's Law: Nagle's is the "conservative send" side. An ACK is a signal of acceptance, and should be issued more liberally (even though it's also sent, I guess).


> I've seen people writing C# applications and wondering why stuff is taking 200ms

I observe that in the most recent generation of its HTTP client (SocketsHttpHandler), .NET also sets NoDelay by default.

https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...


TIL - thanks!


Agreed. The post should be titled 'Go enables TCP_NODELAY by default', and a body may or may not even be needed. It's even documented: https://pkg.go.dev/net#TCPConn.SetNoDelay

To know why would be interesting, I guess. But you should be buffering writes anyways in most cases. And if you refuse to do that, just turn it back off on the socket. This is on the code author.


> I'd imagine most Go software is deployed in datacentres where the network is high quality

The problem is that those datacenters are plugged into the Internet, where the network is not always high quality. TFA mentions the Caddy webserver - this is "datacenter" software designed to talk to diverse clients all over the internet. The stdlib should not tamper with the OS defaults unless the OS defaults are pathological.


That doesn't make much sense. There are all sorts of socket and file descriptor parameters with defaults that are situational; NDELAY is one of them, as is buffer size, nonblockingness, address reuse, &c. Maybe disabling Nagle is a bad default, maybe it isn't, but the appeal to "OS defaults" is a red herring.


In my opinion, the Principle of Least Surprise applies here.

Go is defaulting to surprising (unexpected) behavior.


I think also “least surprise” depends on your background. In Go, files also don’t buffer by default, contrary to many languages including C. If you call Write() 100 times, you run exactly 100 syscalls. Intermediate Go programmers learn this, and that they must explicitly manage buffering (e.g. via bufio).

I don’t think it’s wrong that sockets follow the same design. It gives me less surprise.


that Write() doesn't call fsync() though, does it?

so there's no buffering going on in the application, but the bytes almost certainly don't hit the disk before Write() returns

they've just been staged into an OS buffer, with the OS promising to write them out to the disk at a later time (probably, maybe...? hopefully!)

which is exactly the same as a regular TCP socket (with Nagle enabled, i.e. the default, non-Go way)


For userland programming, what matters is the syscall level, as that is expensive (and also the API you have for the kernel). Whether the kernel then does internal buffering is irrelevant and uncontrollable beyond any other syscalls which may or may not be implemented (maybe you're running on a custom kernel that doesn't buffer disk writes?).

One write == one syscall, easy. If you want buffering, you add it.


> For userland programming, what matters is the syscall level, as that is expensive

which is why pretty much every programming language buffers file output by default

even C

(other than Go, obviously)

> Whether the kernel then does internal buffering is irrelevant

everyone that's attempted to write reliable software that cares about what ends up on disk, or the other side of the socket will disagree


I think my C is getting rusty, but "write" operates on a file descriptor, doesn't it? It's unbuffered. The buffered versions are things like printf and puts.


That's POSIX; C's equivalent is a FILE, which generally is buffered.


I thought Linux does in kernel buffering with `write`


Only if you know that Nagle's algorithm exists and is used everywhere. For anyone with networking experience it's unexpected, but I still remember learning about Nagle's algorithm when trying to fix latency on a game server I was hosting as a teen. That was surprising behavior to me at the time.


Articles with titles like "How I spent 3 weeks discovering that Nagle's Algorithm exists" are a HN staple. Turning off Nagle follows the principle of least surprise. This article is the first time anyone has ever written about being surprised by Nagle's Algorithm being off.


Have you ever had to chase down strange latency issues? Arguably, this behavior is the least surprising for Go's typical deployment environment.


Twice I have run into this behavior, having known and forgotten it. Chatty non-HTTP protocols with a few small messages doing auth or whatever before the bulk data flow. Pissed me off and surprised me. Now I make sure my defaults for any framework I am using are no-delay, and I make sure to plug my computing device into Ethernet whenever possible.


In my experience, you usually don't want to be Nagling in code that lives in a datacenter. Go's default is likely set up around that idea.


You guys had this convo before, in 2015, on what the interaction of the two settings should be doing:

https://news.ycombinator.com/item?id=34180239



Love this historical find. How did you find this relevant conversation? Was this something with HN search?


You haven't committed to memory every hn comment from Ptacek over the last decade?


Also, for small packets, disabling consolidation means adding LOTS of packet overhead. You're not sending 1 million * 50 bytes of data, you're sending 1 million * (50 bytes of data + about 80 bytes of TCP+ethernet header).

Disabling Nagle makes sense for tiny request/replys (like RPC calls) but it's counterproductive for bulk transfers.

I'm not the only one who doesn't like the thought of a standard library quietly changing standard system behaviour ... so now I have to know the standard routines and their behaviour AND I have to know which platforms/libraries silently reverse things :(


Bulk transfer applications should just use larger buffers


Wouldn't that still end up with suboptimal network messages unless those large buffers are an exact multiple of the MTU on the network?


Hm, I'm not 100% sure about this. If your first buffer is big enough, your next write should be issued before the OS has managed to write it all.


This isn't a defect, which makes the whole comment kind of strange. I blame the post title, which should be "Golang disables Nagle's Tinygram Algorithm By Default"; then we could just debate Nagle vs. Delayed ACK, which would be 100x more interesting than subthreads like this.


Certainly you'd agree that this is a bug in git lfs though, correct? And users doing "git push" with their 500MB files shouldn't have to think about tinygrams or delayed ack?

It's reasonable to think about what other programs might have been affected by this default choice (I'm sure I used one myself two weeks ago—a Dropbox API client with inexplicably awful throughput) and what a better API design that could have avoided this problem might look like


Maybe golang should default to panicking if the application repeatedly calls send() with tiny amounts of data :)


I don't know enough about git-lfs to say. Things that need buffering should deliberately buffer, I guess?


Ok, I've replaced the title with that. Thanks!

though I kind of liked "This adventure starts with git-lfs" (the old use-first-sentence-as-title trick) which was the replacement before this


I think it's a false dichotomy. Delayed ACK and Nagle's algorithm each improve the network in different ways, but Nagle's specifically allows applications to be written without knowledge of the underlying network socket's characteristics.

But there's another way, a third path not taken: Nagle's algorithm plus a syscall (such as fsync()) to immediately clear the buffer.

I believe virtually all web applications - and RPC frameworks - would benefit from this over setting TCP_NODELAY.

It would also be more elegant than TCP_CORK, which has a tremendous pitfall: failing to uncork can result in never sending the last packet. And it's easy to implement by adding a syscall at the end of each request and response. Applications almost always know when they're done writing to a stream.


Why isn't this a defect? It brought OP's transfer speed over Ethernet to 2.5MB/s.


Because it's a tradeoff. The author touches on this in the last sentence:

> Here’s the thing though, would you rather your user wait 200ms, or 40s to download a few megabytes on an otherwise gigabit connection?

Though I'd phrase it as "would you rather add 200ms of latency to every request, or take 40s to download a few megabytes when you're on an extremely unreliable wifi network and the application isn't doing any buffering?"

In the use cases that Go was designed for, it probably makes sense to set the default to do poorly in the latter case in order to get the latency win. And if that's not the case for a given application, it can set the option to the other value.


It's an option, with a default. Arguably (I mean, I'd argue it, other reasonable people would disagree), Go's default is the right one for most circumstances. That's not a "defect"; it's a design decision people disagree with.


you clearly (in this post and your others) did not read OP and the other comments on this thread where it's documented that it WAS NOT a design decision. why use it as an argument when it's written it was NOT by design.

the same with LFS -> this post clearly shows detriment to LFS usage, and probably many other tools written in golang.

'most circumstances': prove it, or don't use it.


If there is a defect, it's in git-lfs. Picking a reasonable default is not a defect.


It being reasonable is what's in dispute.


Not really, not on this thread. The debate is valid (though maybe not in this hyperbolic framing), but this is subthread where I'm responding to someone who "picked their jaw up off the floor" at this "defect" of a very obvious default in the Go standard library that has been there I think since its inception, as if no network software in the history of software had ever deliberately disabled Nagle, rather than that being literally standard socket programming advice for decades.

(Again, being standard advice doesn't make it not debatable!)


I think part of the reason for the response is that people tend to just use libraries and assume they will work without reading the documentation or the code and when that strategy backfires they are surprised.

At another level: this is also caused by the fact that most users of said libraries would not be able to write those libraries in the first place, and so are not qualified to read/understand the code.


I mean, the behavior we're talking about here is in fact documented; they don't have to read the code. Every mainstream language in the world (that supports socket programming) has a setting to enable or disable Nagle, so it's not like it's hard to know where to look.


Likely the first time when they realize something is up is when it doesn't work to their expectations. I can see why though: the Go eco-system, and many others besides treats including dependencies as a black box operation, and with auto completion you can include a library and start using it without ever really understanding it, its design trade-offs, default settings and so on. They might show up briefly by name during some dependencies installation process but all it takes is one level of indirection to hide the presence of some library fairly effectively.

Just like someone who installs a refrigerator likely has no idea how a heatpump works, they just need a box that is cold and as long as it is cold they're happy. Cue them surprised when the box starts working in unpredictable ways when the environment temperature changes outside of the design parameters.


Back up a step: why would anyone who's never read documentation assume something like Nagle's algorithm is in effect? I call send(), I expect data to be sent.


Indeed. But the devil is in the details and many networking protocols have layer upon layer of fixes to ensure that things normally speaking go smoothly. Depart from the beaten path and you are most likely going to find some of your assumptions challenged.

One of the more frequent occurrences is the silent fragmentation and re-assembly of packets and/or the attempts to transmit packets that exceed the MTU. These are all but guaranteed to lead to surprising outcomes and much headscratching.

A name like send_but_make_sure_you_read_the_documentation() would have probably been more appropriate but it's a bit unwieldy, and in the default case it is precisely the silent activation of various algorithms to fix common problems that allows you to get away with calling it 'send()' in the first place.


Probably because it’s the default in most other scenarios you’d call send (other languages etc).

So having a rare inverted default is bad for intuition.


I took the implication here to be that the kinds of people who don't read documentation don't know what Nagle is or have any expectations about it to begin with.


I don't at all understand this comment. You don't own this subthread? I don't even really recognize the boundaries of subthreads from the larger thread, at least not in the way you're suggesting? The article is about surprising consequences of this decision. This being a "good default" is very much a subject of contention in this discussion.

> (Again, being standard advice doesn't make it not debatable!)

This seems to accept my premise that it's what's in dispute?


I don't claim to own the thread, but since you've jumped in to respond on behalf of the other person I responded to, I'm going to tell you again that what you want to talk about here isn't what I'm here to talk about. There are plenty of other subthreads here talking about whether disabling Nagle by default is a good thing or not; maybe join one of them.


I'm not responding on anyone's behalf. I think your attitude here is really weird. You are in fact asserting that you are the arbiter of what can be discussed in this subthread. If you don't want to discuss what I'm discussing - just don't respond? Telling me to go away is so strangely aggressive, I'm baffled.

I'm not going to respond any further because this seems very unproductive.


Isn't that just how threads (and debates in general) work?


I even tried emailing tptacek to try and be part of the change he said he wanted to see. Crickets.

HN folk can be a bit hypersensitive and / or opaque at times. Text medium is not always ideal as it provides no signals for tone, and our brains backfill this information in a biased manner.


You emailed me less than an hour ago (I found out about it here, just now) and then tried to dunk on me for not replying. I think we can save ourselves some time and disengage.


Respond to my skywriting on the subject of error-handling, you coward! That plane was expensive and my opinions are important!


If I'd seen the skywriting before you complained, I would have! I like getting email--- err, skywriting messages! But I only check the, uh, sky a couple times a day!


That's reasonable. My bad.


Delayed ACK seems like the better default to me: whether it's telnet or web servers, network programming is almost always request/response. Delaying the ACK so that part of the response is ready seems like the correct choice. In today's network programming, how often is tinygram really an issue?

In this case I would consider the bug to be git lfs. Even if Nagle's was enabled I would still consider it a bug, because of the needless syscall overhead of doing 50 byte writes.


Actually if you're sending a file or something, do you really need Nagle's algorithm? It seems like the real mistake might be not using a large enough buffer for writing to the socket, but I could be speaking out my ass.

There's actually a lot of prevailing wisdom that suggests disabling Nagle's algorithm is (often) a good idea. While the problem with latency is caused by delayed ACKs, the sender can't do anything about that, because it's the receiver side that controls this.

Not saying that it's necessarily good that the standard library defaults to this... But this post paints the decision in an oddly uncharitable light. That said, I can't find the original thread where this was discussed, if there ever was one, so I have no idea why they chose to do this, and perhaps it shouldn't be this way by default.


It's often a good idea when the application has its own buffering, as is common in many languages and web frameworks which implement some sort of 'reader' interface which can alternate symbols of "chunks" and "flushes" or only emit entire payloads (a single chunk). With scatter-gather support for IO, it's generally OK for the application to produce small chunks followed by a flush. Those application layer frameworks want Nagle's algorithm turned off at the TCP layer to avoid double-buffering and incurring extra latency.

Go however is disabling Nagle's by default as opposed to letting it be a framework level decision.


This is a great point. Why is Git LFS uploading a large file in 50 byte chunks?


Ideally, large files would upload in MTU-sized packets, which Nagle's algorithm will often give you; otherwise you may have a small amount of additional overhead at the boundary, where the larger chunk may not be divisible into MTU-sized packets.

Edit: I mostly work in embedded (systems that don't run git-lfs), perhaps my view isn't sensible here.


Dividing data into MTU-sized packets is the job of the TCP stack - or even the driver or NIC in the case of offloads. Userspace software shouldn't deal with MTUs and should always use buffer sizes that make sense for the application - e.g. 64kB or even more. Otherwise the stack wouldn't be very efficient, with every tiny piece of data causing a syscall and independent processing by the networking stack.


Right; it sounds to me like the real bug is that git-lfs isn't buffering writes to the network driver. Correct me if I'm wrong but if git-lfs was buffering its writes (or using sendfile) then Nagle's algorithm wouldn't matter.


It matters less often - it can still matter at the end of each write buffer. Larger write-buffers remove a lot of chances for this to happen.

If the application can buffer the entire file or use sendfile, probably best to disable Nagle's algorithm so the last packet goes out immediately. Nginx does this.

Another option is turning off Nagle's algorithm at the end of each transfer, and on at the start of the next, but this causes extra syscalls.


I do not know Go. But what if there are so many high level abstractions in the Go language that it operates on streams directly?


The standard convention is to slap bufio.Reader/bufio.Writer on streams to make them more performant.

Though how LFS ends up with ~50 byte chunks is probably something very, very, dumb in the LFS code itself. Better to fix that mistake than to paper over it.


bufio is for adding buffering regardless of source/dest. Better in this case is ReaderFrom (which will also be used transparently by io.Copy) to let the socket control the buffering and apply even more optimizations. For something like git-lfs I could expect sendfile to provide a huge improvement, depending on the underlying storage.
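A minimal sketch of that approach (upload, conn and path are hypothetical names, not git-lfs code):

  // upload streams the whole file via io.Copy. *net.TCPConn implements
  // io.ReaderFrom, so the socket drives the transfer (using sendfile where
  // the platform supports it) instead of many tiny userspace writes.
  func upload(conn net.Conn, path string) error {
      f, err := os.Open(path)
      if err != nil {
          return err
      }
      defer f.Close()
      _, err = io.Copy(conn, f)
      return err
  }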


The footnote has a brief note about delayed ACKs but it's not like the creator of the socket can control whether the remote is delaying ACKs or not. If ACKs are delayed from the remote, you're eating the bad Nagle's latency.

The TCP_NODELAY behavior is settable and documented here [1]. It might be better to more prominently display this behavior, but it is there. Not sure what's up with the hyperbolic title or what's so interesting about this article. Bulk file transfers are far from the most common use of a socket and most such implementations use application-level buffering.

[1]: https://pkg.go.dev/net#TCPConn.SetNoDelay


The title is hyperbolic because a real person got frustrated and wrote about it, the article is interesting because a real person got frustrated at something many of us can imagine encountering but not so many successfully dig into and understand.

“Mad at slow, discovers why slow” is a timeless tale right up there with “weird noise at night, discovers it was a fan all along”, I think it’s just human nature to appreciate it.


> There's actually a lot of prevailing wisdom that suggests disabling Nagle's algorithm is (often) a good idea.

Because even in mediocre networks it is a good idea.

Don’t write a small amount of data if you want (or in this case even need) to send a large amount of data!


Some prior discussion about why turn on TCP_NODELAY: https://jvns.ca/blog/2015/11/21/why-you-should-understand-a-...

John Nagle's comments about it: https://news.ycombinator.com/item?id=10608356


IMO the real problem is that the socket API is insufficient, and the Nagle algorithm is a kludge around that.

When sending data, there are multiple logical choices:

1. This is part of a stream of data but more is coming soon (once it gets computed, once there is buffer space, or simply once the sender loops again).

2. This is the end of a logical part of the stream, and no more is coming right now.

3. This is latency-sensitive.

For case 1, there is no point in sending a partially full segment. Nagle may send a partial segment, which is silly. For case 2, Nagle is probably reasonable, but may be too conservative. For case 3, Nagle is wrong.

But the socket API is what it is, no one seems to want to fix this, and we’re stuck with a lousy situation.


I'm pretty convinced that every foundational OS abstraction that we use today, most of which were invented in the 70's or 80's, is wrong for modern computing environments. It just sucks less for some people than for other people.

I do think Golang's choice of defaulting to TCP_NODELAY is probably right - they expect you to have some understanding that you should probably send large packets if you want to send a lot of stuff, and you likely do not want packets being Nagled if you have 20 bytes you want to send now. TCP_QUICKACK also seems wrong in a world with data caps - the unnecessary ACKs are going to add up.

Issuing a SEND syscall every 50 bytes is also horrendously CPU-inefficient, and certainly should be expected to trigger pathological cases.

At this point, the OS is basically expected to guess what you actually want to do from how you incant around their bad abstractions, so it's not surprising that sending megabytes of data 50 bytes at a time would trigger some weird slowdowns.


> Issuing a SEND syscall every 50 bytes is also horrendously CPU-inefficient

This is the real crime here. The fact that it maxed out at 2.5MB/s might be quite literally due to a CPU limit.

If you are streaming a large amount of data, you should use a user space buffer anyway, especially if you have small chunks. In Golang, buffers are standard practice and a one-liner to add.


*pedantry warning*

In practice, buffers are more than a one-liner, as you probably want to deal with flushing them at some out-of-band moment (+1 line) as well as handle the error from that (+3 lines).
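For concreteness, a rough sketch of the one-liner plus those extra lines (sendBuffered, conn and chunks are hypothetical names; assumes the standard bufio package):

  // sendBuffered accumulates small chunks in a 64 KiB userspace buffer so
  // they go out in large writes, then flushes and handles the flush error.
  func sendBuffered(conn net.Conn, chunks [][]byte) error {
      w := bufio.NewWriterSize(conn, 64<<10)
      for _, chunk := range chunks {
          if _, err := w.Write(chunk); err != nil {
              return err
          }
      }
      return w.Flush() // the out-of-band flush, with its own error to handle
  }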


> Issuing a SEND syscall every 50 bytes is also horrendously CPU-inefficient

io_uring is supposed to help with that


This seems like it should be very simple to fix without having to do much to the API. Just implement a flush() function for TCP sockets that tells the stack to kick the current buffer out to the wire immediately. It seems so obvious that I think I must be missing something. Why didn't this appear in the 80s?


It’s not portable but Linux has a TCP_CORK socket option that does this.
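From Go, that looks roughly like the sketch below (Linux-only; assumes golang.org/x/sys/unix; setCork is a hypothetical helper - cork before a burst of small writes, uncork afterwards to flush the tail):

  import (
      "net"

      "golang.org/x/sys/unix"
  )

  // setCork toggles TCP_CORK on the connection. While corked, the kernel
  // holds back partial segments; uncorking flushes whatever is pending.
  func setCork(c *net.TCPConn, on bool) error {
      raw, err := c.SyscallConn()
      if err != nil {
          return err
      }
      v := 0
      if on {
          v = 1
      }
      var sockErr error
      if err := raw.Control(func(fd uintptr) {
          sockErr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_CORK, v)
      }); err != nil {
          return err
      }
      return sockErr
  }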


Here's how to emulate TCP_CORK using TCP_NODELAY, from [0]:

- unset the TCP_NODELAY flag on the socket

- Call send() zero or more times to add your outgoing data into the Nagle-queue

- set the TCP_NODELAY flag on the socket

- call send() with the number-of-bytes argument set to zero, to force an immediate send of the Nagle-queued data

[0] https://stackoverflow.com/a/22118709


Wow, that's not awkward at all.


It's a downside of the "everything is a file" mindset. As all abstractions are, it's leaky.

Nagle's algorithm is elegant because it allows poorly written applications to saturate a PHY.

Disabling it requires the application layer to implement its own buffer.

If I had a time machine and access to the early *nixes, I'd extend Nagle's algorithm and the kernel to treat fsync() as a signal to flush immediately.


> But the socket API is what it is, no one seems to want to fix this, and we’re stuck with a lousy situation.

Linux/FreeBSD/... have had the TCP corking API for what, 20 years?


IMO MSG_MORE is a substantially better interface. Sadly it seems to be rarely used.


My colleague added MSG_MORE support throughout libnbd[1]. It proved quite an elegant way to solve a common problem: You want to assemble a message in whatever protocol you're using, but it's probably being assembled across many functions (or in the case of libnbd, states in a complicated state machine), and using expanding buffers or whatever is a pain. So instead we let the kernel assemble it, or allow the kernel to make the decision to group the data or send it. The down side is multiple socket calls, but combining it with io_uring is a possibility to avoid this.

[1] https://gitlab.com/search?search=MSG_MORE&nav_source=navbar&...


Oh that is truly elegant, I didn't know about that.

Basically you set the MSG_MORE flag when you call `send` if you know you will have more data to send very soon, so the kernel is free to wait to form an optimally-sized packet instead of sending many small packets every time you run that syscall.


Latency can be affected by both CPU load and network congestion, so it's possible that Nagle's algorithm can help in Case 3. It's really trial and error to see what works best in practice.


The article is way too opinionated about "golang is doing it wrong" for a decision that has neither a right nor a wrong answer.

Nagle can make sense for some applications, but also has drawbacks for others - as countless articles about the interaction with delayed ACKs and 40ms pauses (which are pretty huge in the days of the modern internet) describe.

If one uses application-side buffering and syscalls which transmit all available data at once, enabling NODELAY seems like a valid choice. And that pattern is the one used by Go's http libraries, all TLS libraries (you want to encrypt a 16kB record anyway), and probably most other applications using TCP. It's rare to see anything doing direct syscalls with tiny payloads.

The main question should be why LFS has this behavior - which also isn’t great from an efficiency standpoint. But that question is best discussed in a bug report, and not a blog post of this format.


I prefer reliability over latency, always. The world won’t fall apart in 200ms, let alone 40ms. If you’re doing something where latency does matter (like stocks) then you probably shouldn’t be using TCP, honestly (avoid the handshake!)

When it comes to code, readability and maintainability are more important. If your code is reading chunks of a file and then sending them out as packets, you won't know the MTU or changes to the MTU along the path. Send your chunk and let Nagle optimize it.

Further, principle of least surprise always applies. The OS default is for Nagle to be enabled. For a language to choose a different default (without providing a reason), and one that actively is harmful in poor network conditions at that, was truly surprising.


TCP is always reliable, the choice of this algorithm will never impact this - it will only impact performance (bandwidth/latency) and efficiency.

Enabling nagle by default will lead to elevated latencies with some protocols that don't require the peer to send a response (and thereby a piggybacked ACK) after each packet. Even a "modern" TLS1.3 0RTT handshake might fall into that category. This is a performance degradation.

The scenario that is described in the blog post where too many small packets due to nothing aggregating them causing elevated packet loss is a different performance degradation, and nothing else.

Both of those can be fixed - the former only by enabling TCP_NODELAY (since the client won't have control over servers), the second by either keeping TCP_NODELAY disabled *or* by aggregating data in userspace (e.g. using a BufferedWriter - which a lot of TLS stacks might integrate by default).

> The world won’t fall apart in 200ms, let alone 40ms.

You might be underestimating the latency sensitivity of the modern internet. Websites are using CDNs to get to a typical latency in the 20ms range. If this suddenly increases to 40ms, the internet experience of a lot of people might get twice as bad as it is at the moment. 200ms might directly push the average latency into what is currently the P99.9 percentile.

And it would get even worse for intra-datacenter use cases, where the average is in the 1ms range - and where accumulated latencies would still end up being noticeable to users (the latency of any RPC call is the accumulated latency of upstream calls).

> If your code is reading chunks of a file then sending it to a packet, you won’t know the MTU or changes to the MTU along the path

Sure - you don't have to. As mentioned, you would just read into an intermediate application buffer of a reasonable size (definitely bigger than 16kB or 10 MTUs) and let the OS deal with it. A loop along `n = read(socket, buffer); write(socket, buffer[0..n])` will not run into the described issue if the buffer is reasonably sized and will be a lot more CPU efficient than doing tiny syscalls and expecting all aggregation to happen in TCP send buffers.


Much of the world is doing OK with TCP and TLS, but with session resumption and long-lived connections. Many links will be marked bad in 200 ms and retries or new links issued. Imagine you are doing 20k / second / CPU. That is four thousand backed-up calls for no reason, just randomness.


> I prefer reliability over latency, always.

I imagine all the engineers who serve millions/billions of requests per second disagree with adding 200ms to each request, especially since their datacenter networks are reliable.

> Send your chunk and let Nagle optimize it.

Or you could buffer yourself and save dozens/hundreds of expensive syscalls. If adding buffering makes your code unreadable, your code has bigger maintainability problems.


I’ve done quite a bit of testing on my shitty network (plus a test bench using Docker and plumba) in the last 24 hours — I’m not finished so take the rest of this with a grain of salt. There will be a blog post about this in the near future… once I finish the analysis.

Random connection resets are much more likely when disabling Nagle’s algorithm. As in 2-4x more likely, especially with larger payloads. Most devs just see “latency bad” without considering the other benefits of Nagle: you won’t send a packet until you receive an ACK or the packet is full. On poor networks, you always see terrible latency (even with Nagle’s disabled, 200-500ms is the norm) and with Nagle’s the throughput is a bit higher than without, even with proper buffering on the application side.


> And that pattern is the one that is used by GOs http libraries

I don't think that is correct. In https://news.ycombinator.com/item?id=34213383, I notice that Go's HTTP/2 library would write the HEADERS frame, the DATA frame, and the terminal HEADERS frame in 3 different syscalls. In a sample application using the Go's HTTP/2 library, a gRPC response without Nagle's algorithm would transmit 497 bytes over 6 packets, while a gRPC response with Nagle's algorithm would transmit 275 bytes over 2 packets.

With a starting point where both Nagle's algorithm and delayed ack are enabled, I guess this is the order of preference:

1. delayed ack disabled, applications do the right thing by buffering accordingly - ideal performance, but it is difficult to disable delayed ack, and it may require a lot of work to fix the applications.

2a. Nagle's algorithm disabled, applications do the right thing by buffering accordingly - almost ideal performance (may perform worse than #1 over bad connections), but it may require a lot of work to fix the applications.

2b. delayed ack disabled, real world applications - almost ideal performance (may have higher syscall overhead than #1), but it is difficult to disable delayed ack.

3. Nagle's algorithm disabled, real world application - not ideal as some applications can suffer from high packet overhead, e.g. git-lfs, and this is where we are at with Go.

4. baseline - far from ideal as many applications can suffer from high latency due to bad interaction between Nagle's algorithm and delayed ack.

I would say Go has made the right trade-off, albeit with a slight hint of "we know better than you". Going forward, it is probably cheaper for the Linux kernel to come up with a better API to disable delayed ack (i.e. to achieve #2b) than to get the affected applications to do the right thing by buffering accordingly (i.e. to achieve #1 or #2a). We will see how soon https://github.com/git-lfs/git-lfs/issues/5242 can be resolved.

In the meantime, #2b can actually be achieved with an "SRE approach" by patching the kernel to remove delayed ack and patching the Go library to remove the `setNoDelay` call. Something for OP to try?


I just learnt about "ip route change ROUTE quickack 1" from https://news.ycombinator.com/item?id=10662061, so we don't even need to patch the kernel. This makes 2b a really attractive option.


I'm using Go's default HTTP client to make a few requests per second. I set a context timeout of a few seconds for each request. There are random 16 minute intervals where I only get the error `context deadline exceeded`.

From what I found, Go's default client uses HTTP/2 by default. When a TCP connection stops working, it relies on the OS to decide when to time out the connection. Over HTTP/1.1, it closes the connection itself [1] on timeout and makes a new connection.

In Linux, I guess the timeout for a TCP connection depends on `tcp_retries2` which defaults to 15 and corresponds to a time of ~15m40s [2].

This can be simulated by making a client and some requests and then blocking traffic with an `iptables` rule [3]. My solution for now is to use a client that only uses HTTP/1.1.

[1] https://github.com/golang/go/issues/36026#issuecomment-56902...

[2] https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/

[3] https://github.com/golang/go/issues/30702
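
For reference, a minimal sketch of the HTTP/1.1-only workaround mentioned above: net/http documents that a non-nil (even empty) TLSNextProto map disables automatic HTTP/2 on a transport. The timeout value here is arbitrary.

    import (
        "crypto/tls"
        "net/http"
        "time"
    )

    // http1OnlyClient never upgrades to HTTP/2, so a dead TCP connection is
    // closed on timeout instead of lingering for ~15 minutes.
    func http1OnlyClient() *http.Client {
        tr := &http.Transport{
            // A non-nil (even empty) TLSNextProto map disables HTTP/2.
            TLSNextProto: make(map[string]func(authority string, c *tls.Conn) http.RoundTripper),
        }
        return &http.Client{Transport: tr, Timeout: 10 * time.Second}
    }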


You can configure the HTTP/2 client to use a timeout + heartbeat.

https://go.googlesource.com/net/+/master/http2/transport.go
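
Concretely, golang.org/x/net/http2 exposes ReadIdleTimeout and PingTimeout; a rough sketch (the values are arbitrary, check the linked transport.go for the exact semantics in your version):

    import (
        "net/http"
        "time"

        "golang.org/x/net/http2"
    )

    func newPingingClient() (*http.Client, error) {
        tr := &http.Transport{}
        h2, err := http2.ConfigureTransports(tr)
        if err != nil {
            return nil, err
        }
        h2.ReadIdleTimeout = 15 * time.Second // send a PING if no frames arrive for 15s
        h2.PingTimeout = 5 * time.Second      // close the conn if the PING isn't answered in 5s
        return &http.Client{Transport: tr}, nil
    }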


That's a big file. Mind pointing to a specific line number?


https://go.googlesource.com/net/+/master/http2/transport.go#...

Looks like it got cut off when I originally pasted it.


That sounds like there is pooling going on without invalidating the pooled connection when a timeout happens. I've actually seen a lot of libraries in other languages do a similar thing (in my experience some of the Elixir libraries don't have good pool invalidation for HTTP connections). Having a default invalidation policy that handles all situations is a bit difficult, but I think a default policy that invalidates on any timeout is much better than a default policy that never invalidates on a timeout, as long as invalidation just means evicting the connection from the pool and not tearing down other channels on the HTTP/2 connection. For example, you could have a timeout on an HTTP/2 connection that affects only an individual channel while data is still flowing through the other channels.


Wow. Can you easily change the tcp connection timeout?


You can. It’s trivial once you know it’s possible. Not sure why it’s not set by default. https://go.googlesource.com/net/+/master/http2/transport.go


To be clear, this is for http/2, not tcp. You can very easily set read and write deadlines on tcp conns, but you can’t detect if a peer has disappeared without data. You can set keepalive but it’s not reliable and varies wildly between OSs.

You need a heartbeat or ping message together with an advancing deadline to detect dead peers reliably.
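
A minimal sketch of that pattern over a raw net.Conn, assuming a made-up "PING\n" framing and that the peer sends something back; real code would use the protocol's own ping/pong messages:

    import (
        "net"
        "time"
    )

    // watch pings the peer every interval and requires *some* inbound data
    // within 2*interval; otherwise the advancing read deadline fires and the
    // peer is presumed dead.
    func watch(conn net.Conn, interval time.Duration) error {
        done := make(chan struct{})
        defer close(done)
        go func() {
            t := time.NewTicker(interval)
            defer t.Stop()
            for {
                select {
                case <-done:
                    return
                case <-t.C:
                    if _, err := conn.Write([]byte("PING\n")); err != nil {
                        return // write failed; the reader will notice via its deadline
                    }
                }
            }
        }()
        buf := make([]byte, 4096)
        for {
            // Advance the deadline on every successful read.
            if err := conn.SetReadDeadline(time.Now().Add(2 * interval)); err != nil {
                return err
            }
            if _, err := conn.Read(buf); err != nil {
                return err // includes i/o timeout: peer presumed dead
            }
        }
    }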


HTTP/2 supports a heartbeat in the protocol using PING frames. But I guess a lot of clients probably don’t support it or use it by default.


As a counter-argument, I've run into serious issues that were caused by TCP delay being enabled by default, so I ended up disabling it. I actually think having it disabled by default is the right choice, assuming you have the control to re-enable it if you need to.

Also, in my opinion, if you want to buffer your writes, then buffer them in the application layer. Don't rely on the kernel to do it for you.


The kernel has to buffer everything you send in a sliding window, to retry missed acks. Userspace buffering only reduces syscalls.

A lot of people with strong preferences about segment boundaries and timing are arguing with TCP and probably shouldn’t be using it.


> Userspace buffering only reduces syscalls.

"only". The kernel also buffers disk writes, but god help you if you're writing files to disk byte by byte.


I talked a bit about that in the post. Use your own buffers if possible, but there are times you can't do that reliably (proxies come to mind) where you'd have to basically implement an application-specific Nagle's algorithm. If you find yourself writing something similar, it's probably better to let the kernel do it and keep your code simpler to reason about.


If you are writing a serious proxy you should be working at either a much lower level (eg splice) or a much higher level (ReadFrom, Copy). If you’re messing around with TCPConn parameters and manual buffer sizes you’ve already lost.


Not just network proxies, but possibly proxying/transforming device I/O (like USB-over-Ethernet).


Goalposts are receding, but this is exactly the higher level I mentioned. Use io.Copy, and if you need any kind of transforms implement them as Readers.
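
A quick sketch of that shape: io.Copy handles the buffering (and can use ReadFrom/WriteTo, hence sendfile/splice, when no transform is in the way), and any transform is just an io.Reader wrapper. upperReader below is a made-up example transform.

    import "io"

    // upperReader is an illustrative transform: it upper-cases ASCII letters
    // as bytes pass through.
    type upperReader struct{ r io.Reader }

    func (u upperReader) Read(p []byte) (int, error) {
        n, err := u.r.Read(p)
        for i := 0; i < n; i++ {
            if p[i] >= 'a' && p[i] <= 'z' {
                p[i] -= 'a' - 'A'
            }
        }
        return n, err
    }

    func proxy(dst io.Writer, src io.Reader) error {
        _, err := io.Copy(dst, upperReader{r: src})
        return err
    }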


I haven't thought about this hard, but would a proxy not serve its clients best by being as transparent as possible, meaning it forwards packets whenever it receives them from either side? I think this would imply setting no_delay on all proxies by default. If either side of the connection has a delay, then the delay will be honored because the proxy will receive packets later than it would otherwise.


IFF you are LAN->LAN or even DC->DC, NoDelay is usually better nowadays. If you are having to retransmit at that level you have far larger problems somewhere else.

If you're buffering at the abstracted transport level, Same.


Because you're supposed to have buffering at a different layer.


Networking is the place where I notice how tall modern stacks are getting the most.

Debugging networking issues inside of Kubernetes feels like searching for a needle in a haystack. There are so, so many layers of proxies, sidecars, ingresses, hostnames, internal DNS resolvers, TLS re/encryption points, and protocols that tracking down issues can feel almost impossible.

Even figuring out issues with local WiFi can be incredibly difficult. There are so many failure modes and many of them are opaque or very difficult to diagnose. The author here resorted to WireShark to figure out that 50% of their packets were re-transmissions.

I wonder how many of these things are just inherent complexity that comes with different computers talking to each other and how many are just side effects of the way that networking/the internet developed over time.


Kubernetes has no inherent or required proxies or sidecars or ingresses, or TLS re-encryption points.

Those are added by “application architects”, or “security architects” and existed long before Kubernetes, for the same debatable reasons: they read about it in a book or article and thought it was a neat idea to solve a problem. Unfortunately, they may not understand the tradeoffs deeply, and may have created more problems than were solved.


There's been a highly annoying kubectl port-forward heisenbug open for several years which smells an awful lot like one of these dark Go network layer corners. You get a good connection established and some data flows, but at some random point it decides to drop. It's not annoying enough for any wizards to fix. I immediately thought of this bug when Nagle in Go came up here.

https://github.com/kubernetes/kubernetes/issues/74551


Wireshark exists since forever.


Wireshark doesn't tell you anything about what's wrong with your code. It just tells you "yup, the code is doing something wrong!"

Figuring that out in Kubernetes ... yeah, good luck with that.


And that or tcpdump should be the first thing you grab to diagnose a network issue.


Tcpdump to dump, yes, but wireshark is better to visualize.


Go was explicitly designed for writing servers. This means two things are normally true:

- latency matters, for delivering a response to a client

- the network is probably a relatively good datacenter network (high bandwidth, low packet loss/retransmission)

Between these things, I think the default is reasonable, even if not what most would choose. As long as it’s documented.

The fact that other languages have other defaults, or the fact that people use Go for all sorts of other things like system software, doesn’t invalidate the decision made by the designers.


> the network is probably a relatively good datacenter network (high bandwidth, low packet loss/retransmission)

The first lesson I learned about Distributed Systems Engineering is the network is never reliable. A system (or language) designed with the assumption the network is reliable will tank.

But I also don’t agree that Go was written with that assumption. Google has plenty of experience in distributed systems, and their networks are just as fundamentally unreliable as any.


“Relatively” may have needed some emphasis here, but in general, networking done by mostly the same boxes operated by the same people, in the same climate controlled building, are going to be far more reliable than home networks, ISPs running across countries, regional phone networks, etc.

Obviously nothing is perfect, but applications deploying in data centres should probably make the trade offs that give better performance on “perfect” networks, at the cost of poorer performance on bad networks. Those deploying on mobile devices or in home networks may better suit the opposite trade offs.


> The first lesson I learned about Distributed Systems Engineering is the network is never reliable

Yep, and it's a good rule. It's the one Google applies across datacenters.

... but within a datacenter (i.e. where most Go servers are speaking to each other, and speaking to the world-accessible endpoint routers, which are not written in Go), the fabric is assumed to be very clean. If the fabric is not clean, that's a hardware problem that SRE or HwOps needs to address; it's not generally something addressed by individual servers.

(In other words, were the kind of unreliability the article author describes here on their router to occur inside a Google datacenter, it might be detected by the instrumentation on the service made of Go servers, but the solution would be "If it's SRE-supported, SRE either redistributes load or files a ticket to have someone in the datacenter track down the offending faulty switch and smash it with a hammer.")


Relatively reliable. Not "shitty". If you've got a datacenter network that can be described as "shitty", fix your network rather than blaming Go.


This is an embarrassing response. The second lesson you should’ve learned as a systems engineer, long before any distributed stuff, is “turn off Nagle’s algorithm.” (The first being “it’s always DNS”.)

When the network is unreliable larger TCP packets ain’t gonna fix it.


Usually you have control over one of them only. If you run the whole network, sure, fix that instead. But if you don't, sending fewer larger packets can actually improve the situation even if it doesn't fix it.


Fewer packets yes, but I've been on several networks where sending large packets ends up with bad reordering and dropping behavior.


But it will at least let it get out of slow-start.


It's strange you're getting hammered for this. Everyone in 6.824 would probably agree with you. https://pdos.csail.mit.edu/6.824/

Let's weigh the engineering tradeoffs. If someone is using Go for high-performance networking, does the gain from enabling NDELAY by default outweigh the pain caused by end users?

Defaults matter; doubly so for a popular language like Go.


I have worked on networked projects ranging from modern datacenters to ca. 2005 consumer-grade ADSL in Ohio to cellular networks in rural South Asia.

There are situations where you want Nagle's algorithm on; when you have stable connections but noisy transmission, streams of data with no ability to buffer, and no application-level latency requirements. There are not many such situations. It is not any of these, and it's certainly not within any datacenter.


Nagle's algorithm also really screws with distributed systems - you are going to be sending quite a few packets with time bounds, and you REALLY don't want them getting Nagled.

In fact, Nagle's algorithm is a big part of why a lot of programmers writing distributed systems think that datacenter networks are unreliable.


I don't think this is correct. 6.824 emphasizes reliability over latency. They mention it in several places: https://pdos.csail.mit.edu/6.824/labs/guidance.html

> It should be noted that tweaking timeouts rarely fixes bugs, and that doing so should be a last resort. We frequently see students willing to keep making arbitrary tweaks to their code (especially timeouts) rather than following a careful debugging process. Doing this is a great way to obscure underlying bugs by masking them instead of fixing them; they will often still show up in rare cases, even if they appear fixed in the common case.

> In particular, in Raft, there are wide ranges of timeouts that will let your code work. While you CAN pick bad timeout values, it won't take much time to find timeouts that are functional.

Their unit tests are quite worthwhile to read, if only to absorb how many ways latency assumptions can bite you.

It's true that in the normal case, it's good to have low latency. But correctly engineered distributed systems won't reorganize themselves due to a ~200ms delay.

To put it another way, if a random 200ms fluctuation causes service disruptions, your system probably wasn't going to work very well to begin with. Blaming it on Nagle's algorithm is a punt.


In my decades of experience in telco, capital markets, and core banking, unexplained latency spikes of hundreds of ms are usually analyzed to death as they can have ripple effects. I’ve had 36-hour severity 1 incidents with multiple VPs taking notes on 24/7 conference calls when a distributed system starts showing latency spikes in the 400ms range.

No, the system isn’t going haywire, but 200-400ms is concerning inside a datacenter for core apps.

But let’s forget IT apps, let’s talk about the network. In a network 200ms is catastrophic.

Presumably you know BGP is the very popular distributed system that converges Internet routes?

Inside a datacenter, Bidirectional Forwarding Detection (BFD) is used to drop BGP convergence times to sub-second if you’re using it as an IGP. BFD is also useful with other protocols, but anyway. It has heartbeats of 100-300ms. If there’s a fluctuation of the network lasting 3x that interval, it will drop the link and trigger a round of convergence. This is essential in core networks or telco 4G/5G transport networks.

Of course, flapping can be the consequence of setting too low an interval. Tradeoffs.

Back to the original point, I’ve contributed to the code of equity and bond trading apps, telco apps, core banking systems. And cloud/Kubernetes systems. All RPC distributed systems. Every. Single. One. That performed well… For 30 years! Has enabled TCP_NODELAY. Except when serving up large amounts of streaming data. And the reason fundamentally is that most of the time you have less control over client settings (delayed TCP acks), so it’s easier to control the server.


That is all well and good in an academic setting. Many distributed systems in the real world like having time bounds under 200 ms for certain things like Paxos consensus within a datacenter. It turns out that latency, at some level, is equivalent to reliability, and 200 milliseconds is almost always well beyond that level.


I’m not sure what else to say than “this isn’t true.” 6.824’s labs have been paxos-based for at least the better part of a decade, and at no point did they emphasize latency as a key factor in reliability of distributed systems. If anything, it’s the opposite.

Dismissing rtm as “academic” seems like a bad bet. He’s rarely mistaken. If something were so fundamental to real-world performance, it certainly wouldn’t be missing from his course.


I'll be sure to tell my former colleagues (who build distributed storage systems at Google) that they are wrong about network latency being an important factor in the reliability of their distributed systems because an MIT course said so.

I'm not insinuating that your professor doesn't know the whole picture - I'm sure he does research in the area, which would mean that he is very familiar with the properties of datacenter networks, and he likely does research into how to make distributed systems fast. I'm suggesting that he may not be telling it to you because it would complicate his course beyond the point where it is useful for your learning.


Tell you what. If you ask your colleague “Do you feel that a 100ms delay will cause our distributed storage system to become less reliable?” and they answer yes, I’ll venmo you $200. If you increase it to 200ms and they say yes, I’ll venmo you $100. No strings attached, and I’ll take you at your word. But you have to actually ask them, and the phrasing should be as close as possible.

If we were talking >1s delays, I might agree. But from what I know about distributed systems, it seems $200-unlikely that a Googler whose primary role is distributed systems would claim such a thing.

The other possibility is that we’re talking past each other, so maybe framing it as a bet will highlight any diffs.

Note that the emphasis here is “reliability,” not performance. That’s why it’s worth it to me to learn a $200 lesson if I’m mistaken. I would certainly agree as a former gamedev that a 100ms delay degrades performance.


Raft can easily be tuned to handle any amount of latency. The paper even discusses this. The issue is “how long are you willing to be in a leaderless state”, and for some applications it’s very tight. If your application needs a leader to make a distributed decision, but one is currently unavailable, the application might not know how to handle that, or it might block until a leader becomes available, causing throughput issues.

However, you shouldn’t be using TCP for latency-sensitive applications IMHO. Firstly, TCP requires a handshake on any new connection. This takes time. Secondly, if 3 packets are sent and the first one is lost, you won’t get those last packets for a couple hundred ms anyway (default retransmission times). So you’re better off using something like UDP. So, if you need the properties of TCP, you aren’t doing latency-sensitive anything.


TCP within datacenters is tuned to have shorter retransmit times, and tends to use long-lived connections.

See this RFC for more info on retransmit times: https://www.rfc-editor.org/rfc/rfc6298


I'll give you a few examples, and maybe I'll run a casual poll next time we get a beer. No venmo.

I will point out that leader election generally has very long timeouts (seconds), but a common theme here is that you do lots of things that are not leader election but have short timeouts which can cause systems to reconfigure because the system generally wants to run in a useful configuration, not just a safe configuration.

In a modern datacenter, 100 milliseconds is ample time to determine whether a server is "available" and retry on a different server - servers and network switches can get slow, and when they get THAT slow, something is clearly wrong, so it is better to drain them and move that data somewhere else. When the control plane hears about these timeout failures from clients, it dutifully assumes that a server is offline and drains it.

Usually, this works well: the machine-to-machine latency within a datacenter is way less than 100 microseconds, and if you include the OS stack under heavy load, it might get all the way to 1 millisecond. Something almost always is wrong if a very simple server can't respond within 10-100 milliseconds. This results in 10-100 millisecond response times meaning "not available" at the lower layers of the stack. As I mentioned before, enough reports of "unavailable" result in a machine being drained, and a critical number of these results in an outage.

Attack of the killer microseconds is a good paper that addresses the issue here (albeit obliquely): https://dl.acm.org/doi/10.1145/3015146

Here are a couple of examples:

* There is a very important 10 ms timeout in Colossus (distributed filesystem) to determine the liveness of a disk server - I have seen one instance where enough of a cell broke this timeout due to a software change, and made the entire cell go read-only. In another instance, a small test cell went down due to this timeout under one experiment.

* Another cell went down under load due to a different 10 ms liveness timeout thanks to the misapplication of Nagle's algorithm (although not to networking) - I forget if it was a test cell or something customer-facing.

* Bigtable (NoSQL database) has a similar timeout under 100 ms (but greater than 10 ms) for its tablet servers. I'm sure Spanner (NewSQL database) has the same.


At 200 ms you start to assume the other end is dead and retry. You don’t get sub second consumer response times with random 200 ms delays on data center to data center calls.


The default last_server_contact timeout for Consul is 200ms. Can I have $200?


Maybe. The sticking point for me is that I’ve implemented enough distributed system protocols to know that even if a server occasionally drops out, the overall reliability of the service isn’t affected. I would be very curious to hear from someone in the field if they feel differently.

It’s easy to assume that a server dropout = less reliable network. But even if a leader election were happening every minute, it seems unlikely to drastically affect any ops in flight.

But sure, if they agree I’ll venmo you $200 too.


We operate a large Consul cluster (Consul is great, but we abuse the hell out of it). Frequent leader elections have been responsible for outages. Don't worry about the $200, I'm just fucking with you, but I don't think you're on very firm ground with this line of argument. It's fun to watch though, so I do hope you keep going with it. :)


Hmm. Thank you for the datapoint. It’s why I scaled the bet down to $100 for 200ms.

I think it’s worth uncovering whether a 100ms delay could result in an outage. If I were on call, it’d be hard to sleep knowing that was true.

The root claim is of course that disabling NDELAY can result in an outage. It still seems $200-unlikely that this could be true. Certainly it might cause performance problems, but the claim was reliability. Outages would put it firmly in the “unreliable” section of the Venn diagram.

My claim about 1min leader reelections is admittedly more suspicious. It’s surprising the reelections caused outages. But I suppose if there were a lot of long-running operations that needed a total order, frequent reelections would hose that.


In fairness, I don't know if we kept the default. I'm responding to two independent things at this point: first, there are definitely systems where 200ms delays have rippling impacts, and second, leader elections aren't always benign.

(Consul would, I'm sure, converge eventually regardless of the election frequency, but that doesn't mean everything that relies on Consul will tolerate those delays).

I don't have much of a take here, beyond that I don't think you can extrapolate as much from what's on the 6.824 pages as you might have done here. Certainly, in a system where 200ms is the difference between "healthy" and "not healthy" status on a peer relationship, I'd think you'd want Nagle disabled. But I haven't thought carefully about this, or looked that closely at the typical packet flow between Consul nodes. I could be wrong about all of this; more reason not to give me any money.

Later

Per the comment upthread, I haven't even bothered to check which parts of this packet flow are even TCP to begin with.


I've never directly used Consul's internals, but I'm guessing it uses Stubby, which is built on top of TCP.


It does Serf over UDP, but I get fuzzy on the integration of Serf and Consul.


Raft and the Consul RPC API use TCP, Serf uses both TCP and UDP.

While the Consul RPC API may have grown options to use gRPC (I forget now), Raft uses length-prefixed msgpack PDUs.


Whoops, I thought this was a Google product, given the discussion. Stubby is basically GRPC internal to Google.


it seems to me like systems like these are the exception rather than the rule. you can always turn off nagle's algorithm if you have something really latency-sensitive, but it should not be off by default.

200 ms is not the end of the world in most cases, it's far better than relying on everything doing its own buffering correctly and suffering a massive performance penalty when something inevitably doesn't.


I have to disagree: 200 ms is usually most of your latency budget in my experience. 200 ms delays randomly kill your p99 numbers and harm the customers. Most internet traffic is in the data center, not to the edge. And I assume Fastly, Akamai, and Cloudflare are all aware of how to tune for slow last miles.


It was originally designed as a C/C++ replacement, not necessarily for servers. If I remember right the first major application it was used for was log processing (displacing Google’s in-house language Sawzall) rather than servers.


Everything developed at Google is intended for transforming protobufs. And how are you going to get some protobufs in the first place? /s


Next step: build a language on top of protobuf.


Golang has burned me more than once with bizarre design decisions that break things in a user hostile way.

The last one we ran into was a change in Go 1.15 where servers that presented a TLS certificate with the hostname encoded into the CN field instead of the more appropriate SAN field always fail validation.

The behavior could be disabled however that functionality was removed in 1.18 with no way to opt back into the old behavior. I understand why SAN is the right way to do it but in this case I didn’t control the server.

Developers at Google probably never have to deal with 3rd parties with shitty infrastructure but a lot of us do.

Here’s a bug in rke that’s related https://github.com/rancher/rke2/issues/775


The x509 package has unfortunately burned me several times, this one included. It is so anal about non-fatal errors that Google themselves forked it (and asn1) to improve usability.

https://github.com/google/certificate-transparency-go


Sorry for the late response, but thank you so much for showing me this.


It also doesn’t play well with split tunnel VPN’s on macOS that are configured for particular DNS suffixes. If you have a VPN that is only active for connections in a particular domain, git-lfs (and I think any go software, by default) will try to use your non-VPN connection for connections that should be on the VPN.

I don’t know why it is, exactly… but I think it’s related to Golang intentionally avoiding using the system libc and implementing its own low-level TCP/IP functions, leading to it not using the system configuration which tells it which interface to use for which connections.

Edit: now that I think about it, I think the issue is with DNS… macOS can be configured such that some subdomains (like on a VPN) are resolved with different DNS servers than others, which helps isolate things so that you only use your VPN’s DNS server for connections that actually need it. Go’s DNS resolution ignores this configuration system and just uses the same server for all DNS resolution, hence the issue.


Go’s choice to default to its own TCP/IP implementation has bitten me personally to the level of requiring a machine restart.

The Go IPv6 DNS resolution on MacOS can cause all DNS requests on the system to begin to fail until a restart.

https://github.com/golang/go/issues/52839


Not to understate the impact of the bug, but this is not the default for Go. It is used if CGo is disabled, as the issue you linked to describes.


That is the default for Go if cross-compiled, however, and most software compiled on a CI server is cross-compiled to Darwin.

Fortunately Go 1.20 fixes this, using the system resolver even without CGo on Darwin platforms [1].

[1]: https://go-review.googlesource.com/c/go/+/446178


The OS network stack is crashing and this is Go's fault? Is Go holding the network stack wrong?


To be fair, "getaddrinfo is _the_ path" is a shitty situation.

- It's a synchronous interface. Things like getaddrinfo_a are barely better. It has forced people to do stuff like https://c-ares.org/ for ages, which has suffered from "is not _the_ path" issues for as long

- It's a less featured interface than, for example, https://wiki.freedesktop.org/www/Software/systemd/writing-re...


This explains why several of my Go programs needed the occasional restart because of terribly slow transfers over mobile networks.

These weird decisions that go against the norm are exactly why I hate writing Go. There are hidden footguns everywhere, and the only way to prevent them is to role-play as a Google backend dev in a hurry.


>This explains why several of my Go programs needed the occasional restart because of terribly slow transfers over mobile networks.

It doesn't explain that. Why would this cause you to need to restart your applications? At most it will just decrease performance of that transfer.


From my experience most TCP using projects existing for a longer time disable Nagle's Algorithm sooner or later, we did so at Proxmox VE in 2013:

https://git.proxmox.com/?p=pve-manager.git;a=commitdiff;h=fd...

Most of the time it just makes things worse nowadays, so yes, having it disabled by default makes IMO sense.


Meanwhile almost every project I work on is latency sensitive, and I’ve lost track of how many times the fix to bad performance was “disable Nagle’s algorithm”.

Honestly the correct solution here is probably “there is no default value, the user must explicitly specify on or off”. Some things just warrant a coder to explicitly think about it upfront.


It’s delayed ack on the client side which adds that slowdown. The spec allows the client to wait up to 500 ms to send it.


Delayed ACKs send an ACK every other packet, so a lone packet may have to wait up to 200ms for its ACK. If you have enough data for two packets then you won’t even notice a delay (probably most data these days, unless you have jumbo frames all the way to the client).

If you control the client, you can turn on quick ACKs and still use Nagle’s algorithm to batch packets.
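
On Linux, "turn on quick ACKs" means setting TCP_QUICKACK; a hedged sketch via golang.org/x/sys/unix (Linux-only, and note the kernel can fall back to delayed ACKs again later, so latency-sensitive code tends to re-apply this around reads):

    import (
        "net"

        "golang.org/x/sys/unix"
    )

    func setQuickAck(c *net.TCPConn) error {
        raw, err := c.SyscallConn()
        if err != nil {
            return err
        }
        var optErr error
        ctlErr := raw.Control(func(fd uintptr) {
            // 1 = ACK immediately instead of delaying up to ~200ms.
            optErr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_QUICKACK, 1)
        })
        if ctlErr != nil {
            return ctlErr
        }
        return optErr
    }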


The problem does not seem to be that TCP_NODELAY is on, but that the packets sent carry only 50 bytes of payload. If you send a large file, then I would expect that you invoke send() with page-sized buffers. This should give the TCP stack enough opportunity to fill the packets with a reasonable amount of payload, even in the absence of Nagle's algorithm. Or am I missing something?


Even if the application is making 50 byte sends, why aren't these getting coalesced once the socket's buffer is full? I understand that Nagle's algorithm will send the first couple packets "eagerly", but I would have expected that once the transmit window is full they start getting coalesced, since they are being buffered anyways.

Disabling Nagle's algorithm should be trading network usage for latency. But it shouldn't reduce throughput.


> Even if the application is making 50 byte sends why aren't these getting coalesced once the socket's buffer is full?

Because maybe the 50 bytes are latency sensitive and need to be at the recipient as soon as possible?

> I understand that Nagle's algorithm will send the first couple packets "eagerly" […] Disabling Nagle's algorithm should be trading network usage for latency

No, Nagle's algorithm will delay outgoing TCP packets in the hope that more data will be provided to the TCP connection, that can be shoved into the delayed packet.

The issue here is not Go's default setting of TCP_NODELAY. There is a use case for TCP_NODELAY. Just like there is a use case for disabling TCP_NODELAY, i.e., Nagle's algorithm (see RFC 896). So any discussion about the default behavior appears to be pointless.

Instead, I believe the application or an underlying library is to blame, because I don’t see why applications performing a bulk transfer of data using “small” (a few bytes) writes is anything but a bad design. Not writing large (e.g., page-sized) chunks of data into the file descriptor of the socket, especially when you know that multiple more of these chunks are to come, just kills performance on multiple levels.

If I understand the situation the blog post describes correctly, then git-lfs is sending a large (50 MiB?) file in 50 bytes chunks. I suspect this is because git-lfs (or something between git-lfs and the Linux socket, e.g., a library) issues writes to the socket with 50 bytes of data from the file.


> Because maybe the 50 bytes are latency sensitive and need to be at the recipient as soon as possible?

The difference in latency between a 50 byte and 1500 byte packet is miniscule. If you have the data available in the socket buffer I don't see why you wouldn't want to send it in a single packet.

The latency benefit of TCP_NODELAY should be that it isn't waiting for user space to write more data, not that it is sending short packets.


Modern programming does buffering at the class level rather than the system-call level. Even if Nagle solves the problem of sending lots of tiny packets, it doesn't solve the problem of making many inefficient system calls. Plus, the best buffer size and flush policy can only be determined by application logic. If I want smart lights to pulse in sync with music heard by a microphone, delaying to optimize network bandwidth makes no sense. So providing a raw interface with well-defined behavior by default and taking care of things like buffering in wrapper classes is the right thing to do.


> the best buffer size and flush policy can only be determined by application logic

That's not really true. The best result can be obtained by the OS, especially if you can use splice instead of explicit buffers. Or sendfile. There's way too much logic in this to expect each app to deal with this, or even things it doesn't really know about like current IO pressure, or the buffering and caching for a given attached device.

Then there are things you just can't know about. You know about your MTU, for example, but won't be monitoring the changes for the given connection. The kernel already knows how to scale the buffers appropriately, so it can do the flushes in a better way than the app (if you're after throughput, not latency).


> The kernel already knows how to scale the buffers appropriately, so it can do the flushes in a better way than the app (if you're after throughput, not latency)

Well, how can the OS know if I'm after throughput or latency? It would be very wrong to simply assume that all or even most apps would prioritize throughput; at modern network speeds throughput often is sufficient and user experience is dominated by latency (both on consumer and server side), so as the parent post says, this policy can only be determined by application logic, since OS doesn't know about what this particular app needs with respect to throughput vs latency tradeoffs.


> how can the OS know if I'm after throughput or latency

Because you tell it by enabling / disabling buffering (Nagle).

And most apps do prefer throughput. Those that don't really know that they prefer latency.

> since OS doesn't know about what this particular app needs with respect to throughput vs latency tradeoffs.

I think you're mixing up determining what you want (app choice) with how to achieve that best (OS information). I was responding to the parent talking about flushing and buffer sizes specifically.


I kind of wonder if these applications are forced to do their own buffering because they have disabled Nagle's algorithm?

The old adage about people who attempt to avoid TCP ending up reinventing TCP and re-learning the lessons from the 70s...


You missed the part about many inefficient system calls. You want buffering to happen before the thing that has a relatively high per-call overhead.


If you want smart lights to pulse in sync with your microphone you shouldn’t be using TCP in the first place, here UDP is a lot more suitable.

TCP reconstructs the order, meaning a glitch in a single packet will propagate as delay for the following packets, and in the worst case accumulate into a big congestion.


I talked a bit about that in the post. When you know the network is reliable, it’s a non-issue. When you need to send a few small packets, disable Nagles. When you need to send a bunch of tiny packets across an unknown network (aka the internet) use Nagles.


Those who want more fundamental background on the matter can check this excellent seminal paper by Van Jacobson and Michael Karels [1].

In one of the Computerphile podcasts on the history of Internet congestion, it's claimed to be the most influential paper about the Internet, and apparently it has more than 9000 citations as of today [2].

Some trivia: based on this research work, Van, together with Steven McCanne, also created BPF, the Berkeley Packet Filter, while at Berkeley. It was later adopted by the Linux community and extended into eBPF, and the rest is history [3].

[1]Congestion Avoidance and Control:

https://ee.lbl.gov/papers/congavoid.pdf

[2]Internet Congestion Collapse - Computerphile:

https://youtu.be/edUN8OabWCQ

[3]Berkeley Packet Filter:

https://en.m.wikipedia.org/wiki/Berkeley_Packet_Filter


First URL has an extra `&l` on it, that 404s. Thanks for the links!


You just have to be very careful with the algorithms in that paper; they had some serious problems (apart from their basic inability to deal with faster links). I like this old but fairly damning analysis from an early Linux TCP developer:

https://ftp.gwdg.de/pub/linux/tux/net/ip-routing/README.rto


I ran into a similar phantom-traffic problem from Go ignoring the Linux default for TCP keepalives and sending them every 15 seconds, very wasteful for mobile devices. While I quite like the rest of Go, I don't see why they have to be so opinionated and ignore the OS in their network defaults.

My PR fixing that in Caddy: https://github.com/caddyserver/caddy/pull/4865
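
For context, the Go-side knob is net.Dialer.KeepAlive; per the net docs, zero means Go's default (currently 15 seconds) and a negative value disables Go's probes entirely. A hedged sketch of overriding it on an HTTP transport (this shows the general mechanism, not what the Caddy PR does):

    import (
        "net"
        "net/http"
        "time"
    )

    func newTransport() *http.Transport {
        d := &net.Dialer{
            Timeout:   30 * time.Second,
            KeepAlive: -1, // disable Go's 15s keepalive probes; defer to the OS settings
        }
        return &http.Transport{DialContext: d.DialContext}
    }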


To be fair, the Linux default of 2h does not work in most enterprise or cloud environments. One frequently encounters load balancers, firewalls, and other proxies that drop connections after around 5-15 minutes. 15 seconds sounds very aggressive though.


The default of 2h is not just a Linux default; it's straight up from the RFC.

https://www.rfc-editor.org/rfc/rfc9293.html#name-tcp-keep-al...

> Keep-alive packets MUST only be sent when no sent data is outstanding, and no data or acknowledgment packets have been received for the connection within an interval (MUST-26). This interval MUST be configurable (MUST-27) and MUST default to no less than two hours (MUST-28).


Thanks for that PR!! We greatly appreciate it.


What has this to do with the Go language? Runtime defaults don't always work for every possible situation, particularly when the runtime provides much more over a kernel interface. Investigate performance issues and if some default doesn't work for you, you can always change it.


Principle of least surprise. Nagle’s is disabled in Go, except in Windows. The OS default is to have it enabled. I thought this was probably some weird accidental configuration in git-lfs. Then it turned into “aha, this is the source of all my problems on my shitty wifi”


It reminded me of the time when Rust ignored SIGPIPE (obviously a good choice for servers) but did it universally. That's of course also violating the principle of least surprise when interrupting a pipe suddenly causes Rust to spew some exceptions.


Sshfs sets the nodelay tcp flag to off by default precisely because it's designed to transfer files and not interactive traffic, that is single keystrokes in a terminal.

This thread from 2006 could be interesting. It's about the different performances of scp and sftp https://openssh-unix-dev.mindrot.narkive.com/proARDEN/sftp-p...

Meta: the negative in nodelay makes it hard to follow some comments sometimes because of double negatives. The general best practice is to refrain from using negatives in names. This might have been TCP_GROUP_PACKETS?


OP didn't link to the issue, so here it is:

https://github.com/caddyserver/caddy/issues/5276

also, OP didn't mention that it's extremely easy to configure this, with Go itself:

https://godocs.io/net#TCPConn.SetNoDelay
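
For completeness, the opt-out itself is tiny; something along these lines (dialWithNagle is just an illustrative wrapper):

    import (
        "log"
        "net"
    )

    func dialWithNagle(addr string) (*net.TCPConn, error) {
        conn, err := net.Dial("tcp", addr)
        if err != nil {
            return nil, err
        }
        tcp := conn.(*net.TCPConn)
        // false = clear TCP_NODELAY, i.e. turn Nagle's algorithm back on.
        if err := tcp.SetNoDelay(false); err != nil {
            log.Printf("SetNoDelay: %v", err)
        }
        return tcp, nil
    }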


> OP didn't mention that it's extremely easy to configure this

Maybe not explicitly, but it was definitely mentioned:

> From there, I went into the git-lfs codebase. I didn’t see any calls to setNoDelay


That's not the same function...


In the article the author talks of WiFi interference.

Try using MAC filtering. In previous experiments it drastically improved throughput.

I know the mac address can be spoofed, provides no security and can be a pain to set up when everything is WiFi enabled, but it really helps.

All those other WiFi gadgets that belong to your neighbours are continuously trying to log in, and being rejected, all the time!


While you are at it, probably downgrade those ARP broadcasts to unicasts. Your home Wi-Fi router probably already knows all the IP address MAC address mapping; so no need for devices to send those stupid ARP broadcasts to everything.


Ironically, I imagine one of the side effects of remote work will be that choices like this don't happen as much... because it's much less likely that all your in-house language devs will do all of their performance testing on your corporate WiFi, and at least some will use congested home networks and catch this sooner, or never write it at all.

There's no such thing as a perfect language for all situations - but given that Go was not designed to run solely on low-latency clusters, one wishes it had been further tested in other environments.


This is a bit of a hyperbolic title and post, but it does seem like a real issue that the Golang devs should address. Letting the socket do its thing seems like the right way to go, although I'm not an expert in networking.

Any ideas from the devs or other networking experts here in HN?


I suspect the current behaviour will have to stay as it is because the universe of stuff that could break as a result of changing it is completely unknowable


So it’s not hyperbolic, and actually describes things as they are?


Calling it "evil" is hyperbolic.


At this point, this problem has caused me dozens if not hundreds of hours of waiting on slow transfers. It is evil. Maybe I should disable all my neighbor’s routers or convince the local authorities to open more 5Ghz spectrum… but it is what it is.


I dunno.. I am not a networking expert by any stretch, but it does seem consistent with Golang's philosophy that devs should have a deep understanding of the various levels of the stack they're working in.

Though TFA does make a fair point that in reality this doesn't happen, and slow software abounds as a result.


Disabling Nagle by default is definitely the right decision. Git LFS does the wrong thing by sending out a file in 50 byte chunks. It should be sending MTU-sized chunks.


EDIT: I originally linked to the wrong review. It's been there since the initial commit of networking: https://github.com/golang/go/commit/e8a02230f215efb075cccd41...


This is actually the review for adding back the ability to turn NODELAY on and off, it was actually in the networking code from the start https://github.com/golang/go/blob/e8a02230/src/lib/net/net.g...


Thanks! I noticed that right after I posted it. Unfortunately my non-procrastination setting kicked in and I couldn't delete it before anyone saw it.


Lol “back by popular demand”

At least it wasn’t for my initial thoughts when seeing PRs around that code “to speed up unit tests”. I’d love to see the discussions though.


Is this a problem in Go itself? Isn’t this something git-lfs should be changing in lfs only?

It seems reasonable to prefer a short delay by default, but when you are sending multi-megabyte files (lfs’s entire use case) it seems like it would be better to make the connection more reliable (e.g. nobody cares about 200ms extra delay).


git-lfs authors agree and point out regular git also disables Nagle.

https://github.com/git-lfs/git-lfs/issues/5242


> Once that was fixed, I saw 600MB per second on my internal network and outside throughput was about the same as wired.

Is the author talking about megabits or really megabytes? 112MB/s is the fastest real speed you will get on a gigabit network. I feel like the author meant to write Mbit instead of MB/s everywhere?


Good find. Yeah 800Mbits.



It's been in the code base from the start: https://github.com/golang/go/blob/e8a02230/src/lib/net/net.g...


Relatedly there was a previous HN post and discussion about Delayed ACKs and TCP_NODELAY where John Nagle himself chimed in:

https://news.ycombinator.com/item?id=10608356


Thanks for this.

I've been troubleshooting a nasty issue with RTSP streams and while I'm fairly confident golang is not responsible, this has highlighted a potential root cause for the behaviour we've been seeing (out of order packets, delayed acks).


I have an email from Nagle himself, c. 1997 telling me that it was probably a bad idea.

And I've disabled it in every server I've written since.


You can just ask him here; he's the 12th busiest user on HN ('Animats, the name of his ragdoll physics engine).


He's even here on an adjacent thread!


I'm not following.

Let's say the socket is set to TCP_NODELAY, and the transfer starts at 50 KiB/s. After a couple seconds, shouldn't the application have easily outpaced the network, and buffered enough data in the kernel such that the socket's send buffer is full, and subsequent packets are able to be full? What causes the small packets to persist?


This is the question I had from the start and I'm surprised that I had to scroll this far down.

Nagle's algorithm is about what do to when the send buffer isn't full. It is supposed to improve network efficiency in exchange for some latency. Why is it affecting throughput?

Is Linux remembering the size of the send calls in the out buffer and for some reason insisting on sending packets of those sizes still? I can't imagine why it would do that. If anything it sounds like a kernel bug to me.

For large transfers it still likely makes sense to always send full packets (until the end) like TCP_CORK but it seems that it should be unnecessary in most cases.


Because of this post I looked up how I disable Nagle's algorithm on Windows. I've now done it (according to the instructions at least). Let's see how it goes. I'm in central Europe on gigabit ethernet and fiber, with more than 50% of my traffic going over IPv6 and most European sites under 10ms away.


> not to mention nearly 50% of every packet was literally packet headers

I was just looking at a similar issue with grpc-go, where it would somehow send a HEADERS frame, a DATA frame, and a terminal HEADERS frame in 3 different packets. The grpc server is a golang binary (lightstep collector), which definitely disables Nagle's algorithm as shown by strace output, and the flag can't be flipped back via the LD_PRELOAD trick (e.g. with a flipped version of https://github.com/sschroe/libnodelay) as the binary is statically linked.

I can't reproduce this with a dummy grpc-go server, where all 3 frames would be sent in the same packet. So I can't blame Nagle's algorithm, but I am still not sure why the lightstep collector behaves differently.


Found the root cause from https://github.com/grpc/grpc-go/commit/383b1143 (original issue: https://github.com/grpc/grpc-go/issues/75):

    // Note that ServeHTTP uses Go's HTTP/2 server implementation which is
    // totally separate from grpc-go's HTTP/2 server. Performance and
    // features may vary between the two paths.
The lightstep collector serves both gRPC and HTTP traffic on the same port, using the ServeHTTP method from the comment above. Unfortunately, Go's HTTP/2 server doesn't have the improvements mentioned in https://grpc.io/blog/grpc-go-perf-improvements/#reducing-flu.... The frequent flushes mean it can suffer from high latency with Nagle enabled, or from high packet overhead with Nagle disabled.

tl;dr: blame bradfitz instead :)


One specific thing I wonder about is how this setting effects Docker, specifically when pushing/pulling images around.

In both the GitHub Docker and Moby organizations, "SetNoDelay" doesn't return any results. I wonder if performance could be improved by making connections with `connection.SetNoDelay(false)`.


I have a hypothesis here. Go is a language closely curated by Google, and the primary use of Go in Google is to write concurrent Protobuf microservices, which is exactly the case of exchanging lots of small packets on very reliable networks.


Nagle's algorithm is designed to stop packlets.

If you're not sending a lot of packlets you shouldn't be using Nagle's algorithm. It's on by default in systems because without it interactive shells get weird, and there are few things more annoying to sysadmins than weird terminal behavior, especially when shit is hitting the fan.


But it seems that it shouldn't be limiting packets to 50 bytes (which is apparently the size of buffers used by the application in send/write). Once the send buffer is full the Kernel should be sending full packets.


What's a packlet?


I don't know Golang, but what does the function in git-lfs that writes to the socket look like? Is it writing in 50-byte chunks? Why?

Because I guess even with TCP_NODELAY, if I submit reasonably huge chunks of data (e.g. 4K, 64K...) to the socket, they will get split into reasonably-sized packets.


The code in question seems to be this portion of SendMessageWithData in ssh/protocol.go [1]:

  buf := make([]byte, 32768)
  for {
    n, err := data.Read(buf)
    if n > 0 {
      err := conn.pl.WritePacket(buf[0:n])
      if err != nil {
        return err
      }
    }
    if err != nil {
      break
    }
  }
The write packet size seems to be determined by how much data the reader returns at a time. That could backfire if the reader were e.g. something like line at a time (no idea if something like that exists in Golang), but that does not seem to be the case here.

[1] https://github.com/git-lfs/git-lfs/blob/d3716c9024083a45771c...



SACKs are the second most important/useful TCP extension after window scaling. SACKs have had basically universal support for more than a decade (like, 95% of the traffic on the public internet negotiated SACKs in 2012). Anyone writing a new production TCP stack without SACKs is basically committing malpractice.


I've learned the hard way to avoid git-lfs at all costs.

Main issue is that git-lfs is NOT "it just works".

The migration process if you mistakenly in/excluded a file is quite painful and bug prone.

I'd rather just exclude big blobs from git if possible.


Side-note: I wonder why the author has decided to include overflow:hidden in an effort to hide the page scroll bar.


Must be the theme. It’s a shitty theme but I can’t be bothered to get a better one atm. It’s pretty far down the todo list.


When using TCP_NODELAY, do you need to ensure your writes are a multiple of the maximum segment size? For example, if the MSS is 1400 and you are doing writes of 1500 bytes, does this mean you will be sending packets of size 1400 and 100?


What if there are jumbo frames all the way to the client? You are throwing away a lot of bandwidth. What if there is VXLAN, like in k8s? You’ll be sending two packets, one tiny and one full. Use Nagle and send what you have when you have it. Let the TCP stack do its job. Work on optimization when it is actually impactful to do so. Sending a packet is cheaper than reading a DB.


The big reason for no-delay is the really bad interaction between Nagle's algorithm and delayed ACK for request-response protocols like the start of a TLS connection. It's possible for the second handshake packet the client/server sends to be delayed significantly because one of the parties has delayed ACK enabled.

Ideally, the application could just signal to the OS that the data needs to be flushed at certain points. TCP_NODELAY almost lets you do this, but the problem is it applies to all write()s, including ones that don't need to be flushed. For example, if you are an HTTP server sending a 250MB response then only the last write needs to be 'flushed'. Linux has some non-POSIX options that give you more control, like TCP_CORK (via setsockopt), which lets you signal these boundaries explicitly, or MSG_MORE, which is a bit more convenient to use.
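
A rough, Linux-only sketch of the TCP_CORK approach from Go; the standard library doesn't expose the option, so this goes through SyscallConn and golang.org/x/sys/unix, and withCork is just an illustrative helper, not how any particular project does it:

    import (
        "net"

        "golang.org/x/sys/unix"
    )

    // withCork corks the connection, runs send (which may issue many small
    // writes), then uncorks, which flushes any remaining partial segment.
    func withCork(c *net.TCPConn, send func() error) error {
        raw, err := c.SyscallConn()
        if err != nil {
            return err
        }
        setCork := func(on int) error {
            var optErr error
            if err := raw.Control(func(fd uintptr) {
                optErr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_CORK, on)
            }); err != nil {
                return err
            }
            return optErr
        }
        if err := setCork(1); err != nil {
            return err
        }
        defer setCork(0) // uncork: flush whatever is left, even a partial segment
        return send()
    }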


Please add links to the GitHub issues in the blog


this has been known forever, very inflammatory article imo.


interesting article


> I would absolutely love to discover the original code review for this and why this was chosen as a default. If the PRs from 2011 are any indication, it was probably to get unit tests to pass faster. If you know why this is the default, I’d love to hear about it!

Please hold while I pick my fallen jaw up off the floor.

The parents of the Internet work at Google. How could this defect make it to production and live for 12+ years in the wild? I guess nothing fixes itself, but this shatters the myth of Google(r) superiority. It turns out people are universally entities comprised of sloppy, error-prone wetware.

At the very least there should be a comment in caps and in the documentation describing why this default was chosen and in what circumstances it's ill-advised. I'm not claiming to be remarkably exceptional and even I bundle such information on the first pass when writing the initial code (my rule: to ensure a good future, any unusual or non-standard defaults deserve at least a minimal explanation) (Full-Disclosure: I was rejected after round 1 of Google code screens 3 times, though have been hired to other FAANG/like companies).

Yeesh.

p.s. Be sure to brace yourself before reading https://news.ycombinator.com/item?id=34179426#34180015


> It turns out people are universally entities comprised of sloppy, error-prone wetware.

The line from Agent K in 'Men In Black' comes to mind here.

More jobs than not, I left with at least one 3+ month old PR of changes for stability I was 'not allowed to merge because we didn't have the bandwidth to regression (or do cross-ecosystem-update-on-lib)'. Yes I made sure to explain to my colleagues why I did them and why I was mentioning them before I left.

Most eventually got applied.

> (I've been rejected after round 1 of Google code screens 3 times, though have been hired to other FAANG-like companies). Sheesh.

I've found that the companies that hire based on quality-of-bullshitting sometimes pay more, but are far less satisfying than companies that hire on quality-of-language-lawyering (i.e. you understand the caveats of a given solution rather than sugar coating them).


> Please hold while I pick my fallen jaw up off the floor.

> p.s. Be sure to brace yourself before reading https://news.ycombinator.com/item?id=34179426#34180015

Both of these snide comments assume that the speculative explanations are correct, which they very well may not be.


Google's interview level is set so they don't need to fire too many bad people; it's not about being superior (err on the side of caution when hiring).

This might change now in this downturn, but when I was working at Google in 2008, we were the only tech company where nobody was fired because of the recession (there were offices closed, and people had the option to relocate, although not everybody took that option).

If you compare it with Facebook, they just fired a lot of people.

In short: you probably just didn't have luck, you should try again when you can.


Google designs for Google. In their world everyone uses a latest gen MacBook with maxed out RAM on gigabit fiber.


The default is glinux, most of the company are using chromebooks.


First half yes, second half no. Everyone quickly finds out that Chromebooks can't hack it spec-wise, even for simple Chrome Remote Desktop.


As a software engineer at Google, I can say that all of my work is done on a Chromebook remoted into a gLinux desktop.

Macbooks are not allowed unless you get explicit exceptions for specific business reasons (QA iOS apps, iOS dev work, etc).


Did that happen in 2022? I'm a Xoogler as of spring 2022 and when I left everyone on my team used a macbook, several of them new macbooks, and I know several people got exceptions to get more powerful macbooks during WFH.


The latest chromebooks are actually really great*. Many of my team members who were on mac are switching back to ChromeOS for convenience.

* Great if you have a remote linux workstation to do the heavy compilation and test runs


The "On gigabit fiber" part is true, though.


Most engineers have work desktops which run GLinux and they also have macbooks.


I said the company, not engineers. And macbooks are used as chromebooks; I haven't used anything outside of Chrome/term. The dev environment is glinux. osx/m1 is not supported without getting exceptions and isn't worth the trouble.


Google has more end users on slow networks and old devices than almost anyone. Throttle your browser with the browser tools and see what loads quicker, google.com or a website of your choice. Once you've loaded google.com, do a search.


Does it matter from the server point of view?


How can you call it a defect when it might have been a deliberate decision? Your whole post sounds like you're upset Google didn't hire you lmao


The entire post is embarrassing and makes me think that Google made the correct decision. Also, it seems that people that want to change the default behaviour can simply use the TCPConn.SetNoDelay function.


Decisions deserve documentation (because a footgun warning is preferable to spontaneous unintended penetration).


It is documented. https://pkg.go.dev/net#TCPConn.SetNoDelay

> The default is true (no delay), meaning that data is sent as soon as possible after a Write.
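For completeness, a minimal sketch (the address is a placeholder) of going the other way: passing false to SetNoDelay re-enables Nagle's algorithm for that one connection.

    package main

    import (
        "log"
        "net"
    )

    func main() {
        conn, err := net.Dial("tcp", "example.com:80") // placeholder address
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        // Go sets TCP_NODELAY by default; SetNoDelay(false) turns Nagle's
        // algorithm back on for this connection only.
        if tcp, ok := conn.(*net.TCPConn); ok {
            if err := tcp.SetNoDelay(false); err != nil {
                log.Fatal(err)
            }
        }
    }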


Huh, really? There is a public API to change the behavior; that's about it. There would be a million pages of documentation by now if every decision needed documentation.


As the amount of confusion and back-and-forth in this story thread proves, such topics deserve more attention rather than being lumped in alongside less consequential matters. Ideally the goal is to spread the knowledge and expertise to as many humans as possible by making it accessible.


The only thing this story thread proves is that young folks aren’t being taught basic networking or distributed systems functionality and history.

TCP options and disabling Nagle’s algorithm was a topic you learn when introduced to RPCs, maybe in 3rd or 4th year, at least in the 90s.


It’s not a defect, and it’s not unusual to enable TCP_NODELAY.

As a default, it’s a design decision. It’s documented in the Golang Net library.

I remember learning all of this stuff in 1997 in my first Java job and witnessing same shock and horror at TCP_NODELAY being disabled (!) by default when most server developers had to enable it to get any reasonable latency for their RPC type apps, because most clients had delayed TCP ACKs on by default. Which should never be used with Nagle’s algorithm!

This Internet folklore gets relearned by every new generation. Golang’s default has decades of experience in building server software behind the decision to enable it. As many other threads here have explained, including Nagle himself.


> The parents of the Internet work at Google. How could this defect make it to production and live for 12+ years in the wild?

Google is a big company; the “parents of the internet”, insofar as they work at Google, probably work nowhere near this, in terms of scope of work.


Would be naive to think corporate incentives are not influencing code and protocols:

> HTTP/3 was standardized 6 months ago and Google has been using it widely for years, but it's not supported by Go.

> WebTransport originally included a P2P/ICE component, but no longer does.

> HTTP/3 doesn't even have an option to work without certificate authorities.


> HTTP/3 doesn't even have an option to work without certificate authorities.

Unencrypted HTTP is dead for any serious purpose. Any remaining use is legacy, like code written in Basic.

With Letsencrypt on one hand, and single-binary utilities to run your own local CA on the other hand, this should pose no problem.


> this should pose no problem.

It poses a stack of problems a foot high.

Some random examples:

Docker, Kubernetes, etc... use HTTP by default. Not HTTPS or HTTP/3. Unencrypted HTTP 1.1! This is because containers are snapshots and can't contain certificates. Injecting certificates is a pain in the butt, because there is no standardised mechanism for it.

Okay! You inserted a certificate! For... what name? Is it the "site host name", or the "server name"? Either one you pick will be wrong for something. Many web apps expect to see a host header on the backend that matches the frontend, and will poop themselves if you give them a per-machine (or per-container) certificate. I've seen cloud load balancers that have the opposite problem and expect valid per-machine certificates!

If you pick per-machine certificates, then by definition you have to man-in-the-middle, which breaks a handful of apps that require (and enforce!) end-to-end cryptography.

Okay, fine, you have Let's Encrypt issuing per-site certificates, automatically, via your public endpoint. Nothing could be easier! Right up until someone in secops says that you also need to make the non-production sites have "private endpoints". Now you need two distinct mechanisms for certificate issuance, one internal-only and one public. Double the fun.

It just goes on and on: You'll also likely have to deal with CDNs, API gateways, Lambda/Functions, S3 / blob accounts, legacy virtual machines, management endpoints, infrastructure consoles, and so on. Some of these have integrated issuance/renewal capability, some don't. Some break because of your DNS CAA records. Some don't. Some send notifications before expiry, some don't. And so forth...

As a random example, I recently had to deal with a GIS product that shall not be named that requires an HTTPS REST API to set or change its certificates. Yes. You heard me. HTTPS. To set a valid certificate, you first have to automate against an HTTPS endpoint with an invalid certificate, restart the service, do a multi-minute wait in a retry loop, and then continue the automation. Failure to handle any one of the dozen failure scenarios and corner cases will lead to a dead service that won't start at all. Fun stuff.

Automated certificate issuance for complex architectures is definitely not a solved problem in general.


What are the downsides of using http/1.1 or unencrypted http2 from a docker container?

I’m imagining an application server in a docker container talking to a load balancer, in the same data center. I can see some advantages to http2 (head of line blocking, header compression and multiplexing probably bring some performance benefits). But why do you want http3?


Most http/2 implementations enforce valid certificates, just like http/3.

gRPC requires http/2.

Some software like the aforementioned accursed GIS product refuse to work over unencrypted HTTP. They even ignore the load balancer headers like X-Forwarded-Proto just to be extra irritating.


From my experience with developing gRPC-based microservices, I don't remember certificates being such a big deal.

Mount a filesystem subtree with them inside a container; problem basically solved.


This isn't even wrong; however, you've confused access to certificates with their issuance, validity, and rotation for a given runtime, which is the OP's point: it's very complicated.

There are utilities like Let's Encrypt and Kubernetes Cert Manager that make this somewhat easier, if their defaults work for you. But the devil is in the details.


The real crime is using HTTP for internal network communication.


Instead of?


A custom Enterprise PKI made up of hand-rolled shell scripts poking proprietary public cloud key vaults with OpenSSL, the bastion of quality and robustness.


That's not much worse than the abomination you described in your initial comment.


I see what you did there.


> this shatters the myth of Google(r) superiority. It turns out people are universally entities comprised of sloppy, error-prone wetware.

Golang was created with the specific goal of sidestepping what had become a bureaucratic C++ "readability" process within Google, so yes. Goodhart's law in action.


The problem with C++ is not getting readability, but footguns! footguns everywhere! Plus the compile time.


That’s not at all true. Go has readability as well.


It has it now. For a long, long time, readability for Golang at Google was "Read some of the other Go code out there. Try to make it look like that."

(I don't have enough historical knowledge to comment on the notion that Go was invented to sidestep the need to get more team members readability in C++ though).


Googlers' network environment would be extremely good, so it's not weird.


I think one of the most insightful things I've learned in life is that books, movies, articles, etc. have warped my perception of the "elites." When you split hairs, there is certainly a difference in skill/knowledge _but_ at the end of the day, everyone will make mistakes. (error-prone wetware, haha)

I totally get it though. I mean, as a recent example, look at FTX. I knew SBF and was close to working for Alameda (didn't want to go to Hong Kong tho). Over the years I thought that I was an idiot for missing out and that everyone there was a genius. Turns out they weren't and not only that _everyone_ got taken for a ride. VCs throwing money, celebrities signing to say anything, politicians shaking hands, etc.

Funny, I did see a leaked text when Elon was trying to buy Twitter, SBF was trying to be part of it and someone didn't actually think he had the money, so maybe someone saw the BS.

All that aside tho, yea, this is something I forget and "re-learn" all the time. A bit concerning if you think about it too much! I wonder if that's the same for other fields of work. I mean, if there was an attack on a power grid, how many people in the US would even know _how_ to fix it? Are the systems legacy? I've seen some code bases where one file could be deleted and it would take tons of hours to even figure out what went wrong, lol.


There's nothing elite about being a programmer at any of the big tech companies. It's software engineering and design. It's the same everywhere, just different problem domains.

I've worked with some of the highest ranking people in multiple large tech companies. The truth is there is no "elite". CTOs of the biggest companies in the world are just like you and me.


>There's nothing elite about being a programmer at any of the big tech companies. It's software engineering and design. It's the same everywhere

I just can't agree with this. I have worked with tons of companies and generally, the "sweet-spot" is new mid-sized firms. There is a considerable difference in quality, on almost every metric when working with a bad firm. I've worked with a Fortune 10 company and it was one of the worst applications of "software and design" I've ever seen.

1000 layers of bureaucracy and relatively bad salaries. I'm not looking to speak ill of anyone but we shouldn't pretend you can hire an army of top notch SDEs for bottom of the barrel pay.

The result is a mess.

>I've worked with some of the highest ranking people in multiple large tech companies. The truth is there is no "elite". CTOs of the biggest companies in the world are just like you and me.

I can certainly agree with this in a sense. Everyone makes mistakes. Nobody is "genius" like you see in movies. However, there is a difference in skill and experience (save nepotism or pure luck). If you want to say we all have the same potential, I 100% agree. As it stands though, if you took the "average" developer and I mean truly the _average_, not skewed by personal experience, the average FAANG dev is going to be "better."

I mean, look at how many programmers can't fizzbuzz.


TLDR: Golang uses TCP_NODELAY by default on sockets. Seems wild. I guess it's time to disable TCP_NODELAY in Linux to fix bad software.


Yeah, let's just remove TCP_NODELAY and fuck all latency-sensitive applications.


Actual latency sensitive apps can always use SOCK_RAW and implement their own TCP. In fact, for serious low latency you need to bypass the entire kernel stack too, like DPDK.


“I know, instead of using TCP_NODELAY we can just roll our own ad-hoc reimplementation of TCP!”


For serious though. TCP has a handshake to make a connection. UDP doesn’t and if you’re going for low-latency, you probably don’t want retransmission anyway, since if you miss the packet it’s too late to do anything about it. If you need the guarantees from TCP, you probably aren’t actually solving a low-latency problem.


Low latency SQL transactions? Not microseconds, but a ms or two is doable unless you have transient connections or Nagle + delayed acks.


My goodness. It (git-lfs, which triggered this investigation) essentially insists on sending each write as a tiny individual packet (resulting in umpteen thousands of them) instead of using TCP's built-in batching mechanism (Nagle's algorithm).


I believe it just emits at least one packet on each 'write' system call. As long as your 'write' invocations are larger blocks, I'd expect you'd see very little difference with TCP_NODELAY enabled or disabled. I've always assumed you want to limit system calls, so I'd always taken it as better practice to encode to a buffer and invoke 'write' on larger blocks. So this feels like a combination of issues.

Regardless, overriding a socket parameter like this should be well documented by Golang if that's the desired intent.
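For illustration, a minimal sketch of the buffering approach mentioned above (the address and payload are placeholders): wrapping the connection in a bufio.Writer turns thousands of small application-level writes into a handful of large write() calls, whatever the TCP_NODELAY setting.

    package main

    import (
        "bufio"
        "fmt"
        "log"
        "net"
    )

    func main() {
        conn, err := net.Dial("tcp", "example.com:9000") // placeholder address
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        // 64 KiB buffer: the tiny writes below are coalesced in userspace,
        // so the kernel sees a few large writes instead of thousands of
        // small ones.
        w := bufio.NewWriterSize(conn, 64*1024)
        for i := 0; i < 10000; i++ {
            fmt.Fprintf(w, "record %d\n", i)
        }
        if err := w.Flush(); err != nil { // push out the final partial buffer
            log.Fatal(err)
        }
    }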


If you want to buffer, you can still buffer. There’s no advantage letting the OS do it, and decades of documented disadvantages.


Whether this is the right or wrong thing depends 100% on what you’re trying to do. For many applications you want to send your message immediately because your next message depends on the response.


Very rarely this is the case. From the application’s perspective yes. From a packet perspective… no. The interface is going to send packets and they’ll end up in a few buffers after going through some wires. If something goes wrong along the way, they’ll be retransmitted. But the packets don’t care about the response, except an acknowledgment the packets were received. If you send 4000 byte messages when the MTU is 9000, you’re wasting perfectly good capacity. If you had Nagle’s turned on, you’d send one 8040 byte packet. With Nagle’s you don’t have to worry about the MTU, you write your data to the kernel and the rest is magically handled for you.


They're really in a bubble at Google.


Nice find. It would also help to suggest a workaround. Perhaps "overloading" the function? Not a Golang expert here. But providing a solution (other than waiting for upstream) would be beneficial for others.


There is a public, documented API to turn Nagle back on.

(Please don’t.)


Can you elaborate? Your suggestion not to turn it back on would result in the OP having to suffer slow upload speeds despite having available bandwidth an order of magnitude larger. How is that a good outcome?


The correct solution is to leave Nagle off but do larger writes. This will improve performance on all networks, not only noisy ones, and with no overhead.

Go provides ReaderFrom for the general case of letting writers control the level of buffering, and this also provides massive benefits beyond better TCP flow control (e.g. splice and sendfile are used where applicable).
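As a hedged sketch of that (the file name and address are placeholders): handing the transfer to io.Copy lets the connection's ReadFrom take over, which on Linux can use sendfile/splice for an *os.File source instead of shuttling every byte through userspace.

    package main

    import (
        "io"
        "log"
        "net"
        "os"
    )

    func main() {
        f, err := os.Open("bigfile.bin") // placeholder file
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        conn, err := net.Dial("tcp", "example.com:9000") // placeholder address
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        // io.Copy detects that *net.TCPConn implements io.ReaderFrom and
        // calls conn.ReadFrom(f), so the data is sent in large chunks with
        // few syscalls (using sendfile under the hood where possible).
        if _, err := io.Copy(conn, f); err != nil {
            log.Fatal(err)
        }
    }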


It is the correct default, and anyone who states otherwise has not spent a sufficient number of hours debugging obscure network latency issues, especially when they interact with any kind of complex software stack on top of them.



