Golang disables Nagle's Algorithm by default (withinboredom.info)
806 points by withinboredom on Dec 29, 2022 | 371 comments



If you trace this all the way back, it's been in the Go networking stack since the beginning, with the simple commit message of "preliminary network - just Dial for now" [0] by Russ Cox himself. You can see the exact line in the 2008 repository here [1].

As an aside, it was interesting to chase the history of this line of code: it started alongside a public SetNoDelay function, then became a direct system call, then went back to an abstract call. Along the way it was also broken out into a platform-specific library, then back into a general library, and gone over with a pass from gofmt, all over a "short" 14 years.

0 - https://github.com/golang/go/commit/e8a02230f215efb075cccd41...

1 - https://github.com/golang/go/blob/e8a02230f215efb075cccd4146...


That code was in turn a loose port of the dial function from Plan 9 from User Space, where I added TCP_NODELAY to new connections by default in 2004 [1], with the unhelpful commit message "various tweaks". If I had known this code would eventually be of interest to so many people maybe I would have written a better commit message!

I do remember why, though. At the time, I was working on a variety of RPC-based systems that ran over TCP, and I couldn't understand why they were so incredibly slow. The answer turned out to be TCP_NODELAY not being set. As John Nagle points out [2], the issue is really a bad interaction between delayed acks and Nagle's algorithm, but the only option on the FreeBSD system I was using was TCP_NODELAY, so that was the answer. In another system I built around that time I ran an RPC protocol over ssh, and I had to patch ssh to set TCP_NODELAY, because at the time ssh only set it for sessions with ptys [3]. TCP_NODELAY being off is a terrible default for trying to do anything with more than one round trip.

When I wrote the Go implementation of net.Dial, which I expected to be used for RPC-based systems, it seemed like a no-brainer to set TCP_NODELAY by default. I have a vague memory of discussing it with Dave Presotto (our local networking expert, my officemate at the time, and the listed reviewer of that commit) which is why we ended up with SetNoDelay as an override from the very beginning. If it had been up to me, I probably would have left SetNoDelay out entirely.

As others have pointed out at length elsewhere in these comments, it's a completely reasonable default.

I will just add that it makes no sense at all that git-lfs (lf = large file!) should be sending large files 50 bytes at a time. That's a huge number of system calls that could be avoided by doing larger writes. And then the larger writes would work better for the TCP stack anyway.

And to answer the question in the article:

> Much (all?) of Kubernetes is written Go, and how has this default affected that?

I'm quite confident that this default has greatly improved the default server latency in all the various kinds of servers Kubernetes has. It was the right choice for Go, and it still is.

[1] https://github.com/9fans/plan9port/commit/d51419bf4397cf13d0...

[2] https://news.ycombinator.com/item?id=34180239

[3] http://publications.csail.mit.edu/lcs/pubs/pdf/MIT-LCS-TM-65...


> I will just add that it makes no sense at all that git-lfs (lf = large file!) should be sending large files 50 bytes at a time. That's a huge number of system calls that could be avoided by doing larger writes. And then the larger writes would work better for the TCP stack anyway.

FWIW, at least one git-lfs contributor agrees with you: https://github.com/git-lfs/git-lfs/issues/5242#issuecomment-...

> I think the first thing we should probably look at here is whether Git LFS (and the underlying Go libraries) are optimizing TCP socket writes or not. We should be avoiding making too many small writes where we can instead make a single larger one, and avoiding the "write-write-read" pattern if it appears anywhere in our code, so we don't have reads waiting on the final write in a sequence of writes. Regardless of the setting of TCP_NODELAY, any such changes should be a net benefit.

My 2ct: this type of low-hanging-fruit optimization is often found even in widely used software, so it shouldn't really be a surprise. It's always frustrating when you're the first to find those, though.


As one on the 'supports this decision' side, thanks for taking time from your day to give us the history.

It would be really nice if such context existed elsewhere other than a rather ephemeral forum. It would be awesome to somehow have annotations around certain decisions in a centralized place, though I have no idea how to do that cleanly.


For this kind of decision, why not simply keep notes as comments in the code? These can easily be added later, even 14+ years after the code was written. Then, when someone dives into the codebase to figure out why something was done this or that way, the answer is right there. No need to dive into (and scavenge, sometimes) VCS history.


To "alter" a commit message after it has already been widely disseminated, branch from the offending commit, make a new commit with a message that contains the relevant info, switch back to mainline, and then merge that branch.


That would be an awesome start.

It would also be really nice to have a 'book' of sorts of this type of lore. Though admittedly, it would probably be hard to remember what to even include without stories like this.


The Arc42 documentation template [1] has one of its 12 sections dedicated to "important, expensive or critical design decisions". It makes a pretty good structure for big-picture documentation in a "book" next to the code.

[1] https://news.ycombinator.com/item?id=32353500


just write a note instead



use git notes for attaching information to important commits, after the fact, without altering their SHA


How do you share notes with other users?


Notes are just objects stored in the git repo. They are distributed along with all the other objects.


You had me worried there for a bit, but notes are not distributed by default.

For the curious, you can push/fetch

  refs/notes/*
...to share notes.



Thanks for the explanation, Russ!

As a maintainer of Caddy, I was wondering if you have an opinion on whether it makes sense to have TCP_NODELAY on for a general purpose HTTP server. Do you think it makes sense for us to change the default in Caddy?

Also, would there be appetite for making it easier to change the mode in an http.Server? It feels like you need to reach too deep to change that when using APIs at a higher level than TCP (although I may have missed some obvious way to set it more easily). For HTTP clients it can obviously be changed easily in the dialer, where we have access to the connection early on.
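In case it helps frame the question, the workaround I'd reach for today is roughly the sketch below (hypothetical code, not what Caddy actually does): wrap the listener handed to http.Server so each accepted connection re-enables Nagle before the HTTP layer ever sees it.

  package main

  import (
      "log"
      "net"
      "net/http"
  )

  // nagleListener wraps a net.Listener and re-enables Nagle's algorithm on
  // every accepted TCP connection (Go's default is SetNoDelay(true)).
  type nagleListener struct{ net.Listener }

  func (l nagleListener) Accept() (net.Conn, error) {
      c, err := l.Listener.Accept()
      if err != nil {
          return nil, err
      }
      if tc, ok := c.(*net.TCPConn); ok {
          tc.SetNoDelay(false) // turn Nagle's algorithm back on
      }
      return c, nil
  }

  func main() {
      ln, err := net.Listen("tcp", ":8080")
      if err != nil {
          log.Fatal(err)
      }
      log.Fatal((&http.Server{}).Serve(nagleListener{ln}))
  }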


Caddy is likely to serve RPCs, right? In an RPC context I doubt it ever really makes sense, as latency is typically more important than throughput.


Thanks for the insight and history brief, Russ!


Thanks for the history!


In my opinion, it's correct for it to be disabled by default.

I think Nagle's algorithm does more harm than good if you're unaware of it. I've seen people writing C# applications and wondering why stuff is taking 200ms. Some people don't even realise it's Nagle's algorithm (edit: interacting with Delayed ACKs) and think it's network issues or a performance problem they've introduced.

I'd imagine most Go software is deployed in datacentres where the network is high quality and it doesn't really matter too much. Fast data transfer is probably preferred. I think Nagle's algorithm should be an optimisation you can optionally enable (which you can) to more efficiently use the network at the expense of latency. Being more "raw" seems like the sensible default to me.


The basic problem, as I've written before[1][2], is that, after I put in Nagle's algorithm, Berkeley put in delayed ACKs. Delayed ACKs delay sending an empty ACK packet for a short, fixed period based on human typing speed, maybe 100ms. This was a hack Berkeley put in to handle large numbers of dumb terminals going in to time-sharing computers using terminal to Ethernet concentrators. Without delayed ACKs, each keystroke sent a datagram with one payload byte, and got a datagram back with no payload, just an ACK, followed shortly thereafter by a datagram with one echoed character. So they got a 30% load reduction for their TELNET application.

Both of those algorithms should never be on at the same time. But they usually are.

Linux has a socket option, TCP_QUICKACK, to turn off delayed ACKs. But it's very strange. The documentation is kind of vague, but apparently you have to re-enable it regularly.[3]

Sigh.

[1] https://news.ycombinator.com/item?id=10608356

[2] https://developers.slashdot.org/comments.pl?cid=14515105&sid...

[3] https://stackoverflow.com/questions/46587168/when-during-the...


Gotta love HN. The man himself shows up to explain.


Imagine being on a math forum discussing Fermat’s theorem and the guy shows up.

This is such a cool aspect of CS being a young field: influential people are still alive!



Andrew Wiles showing up would probably be the next best thing.


Russ also showed up. https://news.ycombinator.com/item?id=34181846

Readers might also enjoy his writeup on how google code search worked. https://swtch.com/~rsc/regexp/regexp4.html Just discovered it.


Fermat would probably get banned for too much trolling around the proof of his last theorem...


Can I show up for the future?


For those like me who didn't know, GP designed Nagle's Algorithm in 1984 working at Ford Aerospace.


I thought you mistyped your comment and wanted to reply to rsc ... then I clicked on animats' profile. Yeah, HN is becoming a treasure trove for CS.


Yeah, it's pretty cool. I'm gonna start saving these moments every time they happen. Last time I witnessed something like this was:

https://news.ycombinator.com/item?id=24455758


> The documentation is kind of vague, but apparently you have to re-enable it regularly.[3]

This is correct. And in the end it means more or less that setting the socket option is more of a way of sending an explicit ACK from userspace than a real setting.

It's not great for common use-cases, because making userspace care about ACKs will obviously degrade efficiency (more syscalls).

However it can make sense for some use-cases. E.g. I saw the s2n TLS library using QUICKACK to avoid the TLS handshake being stuck [1]. Maybe also worthwhile to be set in some specific RPC scenarios where the server might not immediately send a response on receiving the request, and where the client could send additional frames (e.g. gRPC client side streaming, or in pipelined HTTP requests if the server would really process those in parallel and not just let them sit in socket buffers).

[1] https://github.com/aws/s2n-tls/blob/46c47a71e637cabc312ce843...
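For reference, re-arming it from Go looks roughly like the sketch below (Linux-only; assumes golang.org/x/sys/unix; setQuickAck is a hypothetical helper you would call around each read that needs an immediate ACK):

  import (
      "net"

      "golang.org/x/sys/unix"
  )

  // setQuickAck sets TCP_QUICKACK on the connection. It is not sticky: the
  // kernel may clear it again, so it has to be re-armed before each read
  // where an immediate ACK matters.
  func setQuickAck(c *net.TCPConn) error {
      raw, err := c.SyscallConn()
      if err != nil {
          return err
      }
      var sockErr error
      if err := raw.Control(func(fd uintptr) {
          sockErr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_QUICKACK, 1)
      }); err != nil {
          return err
      }
      return sockErr
  }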


Any kernel engineer reading that can explain why TCP_QUICKACK isn't enabled by default? Maybe it's time to turn it on by default, if it was just a workaround for old terminals.


Enabling it will lead to more ACK packets being sent, which leads to lower efficiency of TCP (the stack spends time in processing ACK packets) and lower link utilization (these packets also need space somewhere).

My thought is that the behavior is probably correct by default, since a receiver without knowledge of the application protocol is not able to know whether follow-up data will arrive immediately, and is therefore not able to decide whether it should send an ACK or wait for more data. It could wait for a signal from userspace to send that ACK - which is exactly what QUICKACK is doing - but that comes with the drawback of now needing an extra syscall per read.

On the sender side the problem seems solvable more efficiently. If one aggregates data in the application and just sends everything at once using an explicit flush signal (either using CORKing APIs or enabling TCP_NODELAY), no extra syscall is required while minimal latency can be maintained.

However, I think it's a good question whether the delayed ACK periods are still the best choices for the modern internet, or whether much smaller delays (e.g. 5ms, or something along the lines of a fraction of the RTT) could be helpful.


Thanks for this reply. What I find especially annoying is that the TCP client and server start with a synchronization round-trip which is supposed to be used to negotiate options, and that isn't happening here! Why can't the client and the server agree on a sensible set of options (no delayed ACK if the peer is using Nagle's algorithm)??


Is this referring to Nagle on the server, and delayed ACK on the client?


TCP_QUICKACK is mostly used to send initial data along with the first ACK upon establishing a connection, or to make sure to merge the FIN with the last segment.


How is it possible that delayed ACKs and Nagle's algorithm are both defaults, anywhere? Isn't this a matter of choosing one or the other?


Did the move from line oriented input to character input also occur around then?

I remember as a student, vi was installed and we all went from using ed to vi.

There was much gnashing and wailing from the admins of the VAX.


By 1984, input would have been largely character-oriented if desired -- you already had desktop PCs with joystick and mouse too. The problem was the original party-line Ethernet with large numbers of telnet clients or some other [nonstop, nonburst] byte-oriented protocol or serial hardware concentrator, which was a universal situation at educational institutions of the mid-to-late eighties. The Berkeley hack referred to above likely boosted the number of clients you could run on one Ethernet sub with acceptable responsiveness.


From the bottom of the article:

> Most people turn to TCP_NODELAY because of the “200ms” latency you might incur on a connection. Fun fact, this doesn’t come from Nagle’s algorithm, but from Delayed ACKs or Corking. Yet people turn off Nagle’s algorithm … :sigh:


Yeah, but the interaction between Nagle's Algorithm and Delayed ACKs is what causes the 200ms.

Servers tend to enable Nagle's algorithm by default. Clients tend to enable Delayed ACKs by default, and then you get this horrible interaction, all because they're trying to be more efficient but end up stalling each other.

I think Go's behavior is the right default because you can't control every server. But if Nagle's were off by default on servers, then we wouldn't need to disable Delayed ACKs on clients.


Part of OP's point is that 'most clients' do not have an ideal congestionless/lossless network between them and, well, anything.


Why does a congestionless network matter here? Nagle's algorithm aggregates writes together in order to fill up a packet. But you can just do that yourself, and then you're not surprised. I find it very rare that anyone is accidentally sending partially-filled packets; they have some data and they want it to be sent now, and are instead surprised by the fact that it doesn't get sent now because their data doesn't happen to be too large to fit in a single packet. Nobody is reading a file a byte at a time and then passing that 1 byte buffer to Write on a socket. (Except... git-lfs I guess?)

Nagle's algorithm is super weird as it's saying "I'm sure the programmer did this wrong, here, let me fix it." Then the 99.99% of the time when you're not doing it wrong, the latency it introduces is too high for anything realtime. Kind of a weird tradeoff, but I'm sure it made sense to quickly fix broken telnet clients at the time.


> Nagle's algorithm aggregates writes together in order to fill up a packet.

Not quite an accurate description of Nagle's algorithm. It only aggregates writes together if you already have in-flight data. The second you get back an ACK, the next packet will be sent regardless of how full it is. Equally, your first write to the socket will always be sent without delay.

The case where you want to send many tiny packets with minimal latency doesn't really make sense for TCP, because eventually the packet overhead and traffic control algorithms will end up throttling your throughput and latency. Nagle only impacts cases where you're using TCP in an almost pathological manner, and elegantly handles that behaviour to minimise overheads, and the associated throughput and latency costs.

If there's a use case where latency is your absolute top priority, then you should be using UDP, and not TCP. TCP will always nobble your latency, because it insists on ordered data delivery and will delay just-received packets if they arrive ahead of preceding packets. Only UDP gives you the ability to opt out of that behaviour, ensures that data is sent and received as quickly as your network allows, and lets your application decide for itself how to handle missing data.


It makes perfect sense if you consider the right abstraction. TCP connections are streams. There are no packets on that abstraction level. You’re not supposed to care about packets. You’re not supposed to know how large a packet even is.

The default is an efficient stream of bytes that has some trade-off to latency. If you care about latency, then you can set a flag.


There is no perfect abstraction. Speed matters. A stream where data is delivered ASAP is better than a stream where the data gets delayed... maybe... because the OS decides you didn't write enough data.

The default actually violates the abstraction more because now you care how large a packet is, because somehow writing a smaller amount of data causes your latency to spike for some mysterious reason.


> A stream where data is delivered ASAP is better than a stream where the data gets delayed

That depends on your situation, because as you say no abstraction is perfect. Having a stream delivered "faster" isn't helpful if it means your overhead makes up 50% of your traffic, which is exactly what Nagle avoids.

Nagle's algorithm is also pretty smart: it's only going to delay your next packet until it's either full, or the far end has acknowledged your preceding packet. If you've got a crap ton of data to send, and you're dumping it straight into the TCP buffer, then Nagle won't delay anything, because there's enough data to fill packets. Nagle only kicks in if you're doing many frequent tiny writes to a TCP connection, which is rarely a valid thing to do if you care about latency and throughput, so Nagle's algorithm assuming the dev has made a mistake is reasonable.

If you really care about stream latency, then UDP is your friend. Then you can completely dispense with all the traffic control processes in TCP and have stuff sent exactly when you want it sent.


Oftentimes when people want to send five structs, they just call send five times. I find delayed ACKs a lot more weird compared to Nagle.


In those cases it would be better to call writev() which was designed to coalesce multiple buffers into one write call.

How it sends the data is however up to the implementation, and whether it delays the last send if the TCP buffer isn't entirely full I'm not sure - but it doesn't make sense to do so, so I would guess not.

https://linux.die.net/man/2/writev
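In Go, the rough equivalent is net.Buffers, which performs a vectored write (writev on platforms that support it). A minimal sketch, with sendRecords and its arguments being hypothetical names standing in for the five structs:

  // sendRecords coalesces several small buffers into a vectored write
  // instead of issuing one write syscall per buffer.
  func sendRecords(conn net.Conn, header, payload, trailer []byte) error {
      bufs := net.Buffers{header, payload, trailer}
      _, err := bufs.WriteTo(conn)
      return err
  }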


Nagle's algorithm matters because the abstraction that TCP works on, and which was inherited by the BSD socket interface, is that of emulating a full-duplex serial port.

Compare with OSI stack, where packetization is explicit at all layers and thus it wouldn't have such an issue in the first place.


Yeah it seems crazy to have that kind of hack in the entire network stack and on by default just because some interactive remote terminal clients didn't handle that behavior themselves.


Most clients that OP deals with, anyway. If your code runs exclusively in a data center, like the kind I suspect Google has, then the situation is probably reversed.


Consider the rise of mobile devices. Devices that don't have a good internet connection are probably everywhere now.

It's no longer like 10 years ago, when you either had good internet or no internet at all. The number of devices on a shitty network has grown a lot compared to the past.


Almost every application I've written atop a TCP socket batches up writes into a buffer and then flushes out the buffer. I'd be curious to see how often this doesn't happen.


Are you replying to the correct person? I don't think I ever mentioned how you should write a program. I only said that assuming users have a good internet connection is a naive idea nowadays. (GTA 5 is the worst example in my opinion: lose a few UDP packets and your whole game exits to the main menu. How the f**k did the devs assume UDP packets are never lost?)


What I mean to say is that whether or not your mobile device has bad internet shouldn't matter. Most applications are buffering their reads and writes. This makes TCP_NODELAY a non-issue.

Most importantly buffering doesn't spend a whole bunch of CPU time context switching into the kernel. Even if you are taking advantage of Nagle's, every call to write is a syscall, which calls into the kernel to perform the write. On a mobile device this would tank your battery. This is the main reason writes are buffered in applications.


This is basically the first thing I check when diagnosing performance issues with network apps. Most probably are buffering now, but surprisingly many don't. MySQL's client library, for example, didn't for years (it's probably been fixed for a decade or more at this point).


If you run all of your code in one datacenter, and it never talks to the outside world, sure. That is a fairly rare usage pattern for production systems at Google, though.

Just like anyone else, we have packet drops and congestion within our backbone. We like to tell ourselves that the above is less frequent in our network than the wider internet, but it still exists.


If your DC-DC links are regularly as noisy as shitty apartment WiFi routers competing for air time on a narrow band, fix your DC links.


Clients having delayed acks has a very good reason: those ACKs cost data, and clients tend to have much higher download bandwidth than upload bandwidth. Really, clients should probably be delaying acks and nagling packets, while servers should probably be doing neither.


Clients should not be Nagling unless the connection is emitting tiny writes at high frequency. But that's a very odd thing to do, and in most/all cases there's some reasonable buffering occurring higher up in the stack that Nagle's algorithm will only add overhead to. Making things worse are TCP-within-TCP things like HTTP/2.

Nagle's algorithm works great for things like telnet but should not be applied as a default to general purpose networking.


Why would Nagle's algorithm add delay to "reasonable buffering up the stack"? Assuming that buffering results in writes to the TCP stack greater than the packet size, Nagle's algorithm won't add any delay.

The only place where Nagle's algorithm adds delay is when you're doing many tiny writes to a socket, which is exactly the situation you believe Nagle's should be applied to.


The size of an ACK is minuscule (40 bytes) compared to any reasonable packet size (usually around 1400 bytes).

In most client situations where you have high down bandwidth but limited up, that suggests the vast majority of data is heading towards the client, and the client isn't sending much outbound. In which case your client may end up delaying every ACK to the maximum timeout, simply because it doesn't often send reply data in response to a server response.

HTTP is a clear example of this. The client issues a request to the server, and the server replies. The client accepts the reply, but never sends any further data to the server. In this case, delaying the client ACK is just a waste of time.


"Be conservative in what you send and liberal in what you accept"

I would cite Postel's Law: Nagle's is the "conservative send" side. An ACK is a signal of acceptance, and should be issued more liberally (even though it's also sent, I guess).


> I've seen people writing C# applications and wondering why stuff is taking 200ms

I observe that in the most recent generation of its HTTP client (SocketsHttpHandler), .NET also sets NoDelay by default.

https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...


TIL - thanks!


Agreed. The post should be titled 'Go enables TCP_NODELAY by default', and a body may or may not even be needed. It's even documented: https://pkg.go.dev/net#TCPConn.SetNoDelay

To know why would be interesting, I guess. But you should be buffering writes anyways in most cases. And if you refuse to do that, just turn it back off on the socket. This is on the code author.


> I'd imagine most Go software is deployed in datacentres where the network is high quality

The problem is that those datacenters are plugged into the Internet, where the network is not always high quality. TFA mentions the Caddy webserver - this is "datacenter" software designed to talk to diverse clients all over the internet. The stdlib should not tamper with the OS defaults unless the OS defaults are pathological.


That doesn't make much sense. There are all sorts of socket and file descriptor parameters with defaults that are situational; NDELAY is one of them, as is buffer size, nonblockingness, address reuse, &c. Maybe disabling Nagle is a bad default, maybe it isn't, but the appeal to "OS defaults" is a red herring.


In my opinion, the Principle of Least Surprise applies here.

Go is defaulting to surprising (unexpected) behavior.


I think also “least surprise” depends on your background. In Go, files also don’t buffer by default, contrary to many languages including C. If you call Write() 100 times, you run exactly 100 syscalls. Intermediate Go programmers learn this, and that they must explicitly manage buffering (e.g. via bufio).

I don’t think it’s wrong that sockets follow the same design. It gives me less surprise.


that Write() doesn't call fsync() though, does it?

so there's no buffering going on in the application, but the bytes almost certainly don't hit the disk before Write() returns

they've just been staged into an OS buffer, with the OS promising to write them out to the disk at a later time (probably, maybe...? hopefully!)

which is exactly the same as a regular TCP socket (with Nagle enabled, i.e. the default, non-Go way)


For userland programming, what matters is the syscall level, as that is expensive (and also the API you have for the kernel). Whether the kernel then does internal buffering is irrelevant and uncontrollable beyond any other syscalls which may or may not be implemented (maybe you're running on a custom kernel that doesn't buffer disk writes?).

One write == one syscall, easy. If you want buffering, you add it.


> For userland programming, what matters is the syscall level, as that is expensive

which is why pretty much every programming language buffers file output by default

even C

(other than Go, obviously)

> Whether the kernel then does internal buffering is irrelevant

everyone that's attempted to write reliable software that cares about what ends up on disk, or the other side of the socket will disagree


I think my C is getting rusty, but "write" operates on a file descriptor, doesn't it? It's unbuffered. The buffered versions are things like printf and puts.


That's POSIX; C's equivalent is a FILE, which generally is buffered.


I thought Linux does in kernel buffering with `write`


Only if you know that Nagle's algorithm exists and is used everywhere. For anyone with networking experience it's unexpected, but I still remember learning about Nagle's algorithm when trying to fix latency on a game server I was hosting as a teen. That was surprising behavior to me at the time.


Articles with titles like "How I spent 3 weeks discovering that Nagle's Algorithm exists" are a HN staple. Turning off Nagle follows the principle of least surprise. This article is the first time anyone has ever written about being surprised by Nagle's Algorithm being off.


Have you ever had to chase down strange latency issues? Arguably, this behavior is the least surprising for Go's typical deployment environment.


Twice I have run into this behavior, having known and forgotten it. Chatty non-HTTP protocols with a few small messages doing auth or whatever before the bulk data flow. Pissed me off and surprised me. Now I make sure my defaults for any framework I am using are no-delay, and I make sure to plug my computing device into Ethernet whenever possible.


In my experience, you usually don't want to be Nagling in code that lives in a datacenter. Go's default is likely set up around that idea.


You guys had this convo before, in 2015, on what the interaction of the two settings should be doing:

https://news.ycombinator.com/item?id=34180239



Love this historical find. How did you find this relevant conversation? Was this something with HN search?


You haven't committed to memory every hn comment from Ptacek over the last decade?


Also, for small packets, disabling consolidation means adding LOTS of packet overhead. You're not sending 1 million * 50 bytes of data, you're sending 1 million * (50 bytes of data + about 80 bytes of TCP+ethernet header).

Disabling Nagle makes sense for tiny request/replys (like RPC calls) but it's counterproductive for bulk transfers.

I'm not the only one who doesn't like the thought of a standard library quietly changing standard system behaviour ... so now I have to know the standard routines and their behaviour AND I have to know which platforms/libraries silently reverse things :(


Bulk transfer applications should just use larger buffers


Wouldn't that still end up with suboptimal network messages unless those large buffers are an exact multiple of the MTU on the network?


Hm, I'm not 100% sure about this. If your first buffer is big enough, your next write should be issued before the OS has managed to write it all.


This isn't a defect, which makes the whole comment kind of strange. I blame the post title, which should be "Golang disables Nagle's Tinygram Algorithm By Default"; then we could just debate Nagle vs. Delayed ACK, which would be 100x more interesting than subthreads like this.


Certainly you'd agree that this is a bug in git lfs though, correct? And users doing "git push" with their 500MB files shouldn't have to think about tinygrams or delayed ack?

It's reasonable to think about what other programs might have been affected by this default choice (I'm sure I used one myself two weeks ago—a Dropbox API client with inexplicably awful throughput) and what a better API design that could have avoided this problem might look like


Maybe golang should default to panicking if the application repeatedly calls send() with tiny amounts of data :)


I don't know enough about git-lfs to say. Things that need buffering should deliberately buffer, I guess?


Ok, I've replaced the title with that. Thanks!

though I kind of liked "This adventure starts with git-lfs" (the old use-first-sentence-as-title trick) which was the replacement before this


I think it's a false dichotomy. Delayed ACK and Nagle's algorithm each improve the network in different ways, but Nagle's specifically allows applications to be written without knowledge of the underlying network socket's characteristics.

But there's another way, a third path not taken: Nagle's algorithm plus a syscall (such as fsync()) to immediately clear the buffer.

I believe virtually all web applications - and RPC frameworks - would benefit from this over setting TCP_NODELAY.

It would also be more elegant than TCP_CORK, which has a tremendous pitfall: failing to uncork can result in never sending the last packet. And it's easy to implement by adding a syscall at the end of each request and response. Applications almost always know when they're done writing to a stream.


Why isn't this a defect? It brought OP's transfer speed over Ethernet to 2.5MB/s.


Because it's a tradeoff. The author touches on this in the last sentence:

> Here’s the thing though, would you rather your user wait 200ms, or 40s to download a few megabytes on an otherwise gigabit connection?

Though I'd phrase it as "would you rather add 200ms of latency to every request, or take 40s to download a few megabytes when you're on an extremely unreliable wifi network and the application isn't doing any buffering?"

In the use cases that Go was designed for, it probably makes sense to set the default to do poorly in the latter case in order to get the latency win. And if that's not the case for a given application, it can set the option to the other value.


It's an option, with a default. Arguably (I mean, I'd argue it, other reasonable people would disagree), Go's default is the right one for most circumstances. That's not a "defect"; it's a design decision people disagree with.


you clearly (in this post and your others) did not read OP and the other comments on this thread where it's documented that it WAS NOT a design decision. why use it as an argument when it's written it was NOT by design.

the same with LFS -> this post clearly shows detriment to LFS usage, and probably many other tools written in golang.

'most circumstances': prove it, or don't use it.


If there is a defect, it's in git-lfs. Picking a reasonable default is not a defect.


It being reasonable is what's in dispute.


Not really, not on this thread. The debate is valid (though maybe not in this hyperbolic framing), but this is subthread where I'm responding to someone who "picked their jaw up off the floor" at this "defect" of a very obvious default in the Go standard library that has been there I think since its inception, as if no network software in the history of software had ever deliberately disabled Nagle, rather than that being literally standard socket programming advice for decades.

(Again, being standard advice doesn't make it not debatable!)


I think part of the reason for the response is that people tend to just use libraries and assume they will work without reading the documentation or the code and when that strategy backfires they are surprised.

At another level: this is also caused by the fact that most users of said libraries would not be able to write those libraries in the first place, and so are not qualified to read/understand the code.


I mean, the behavior we're talking about here is in fact documented; they don't have to read the code. Every mainstream language in the world (that supports socket programming) has a setting to enable or disable Nagle, so it's not like it's hard to know where to look.


Likely the first time when they realize something is up is when it doesn't work to their expectations. I can see why though: the Go eco-system, and many others besides treats including dependencies as a black box operation, and with auto completion you can include a library and start using it without ever really understanding it, its design trade-offs, default settings and so on. They might show up briefly by name during some dependencies installation process but all it takes is one level of indirection to hide the presence of some library fairly effectively.

Just like someone who installs a refrigerator likely has no idea how a heatpump works, they just need a box that is cold and as long as it is cold they're happy. Cue them surprised when the box starts working in unpredictable ways when the environment temperature changes outside of the design parameters.


Back up a step: why would anyone who's never read documentation assume something like Nagle's algorithm is in effect? I call send(), I expect data to be sent.


Indeed. But the devil is in the details and many networking protocols have layer upon layer of fixes to ensure that things normally speaking go smoothly. Depart from the beaten path and you are most likely going to find some of your assumptions challenged.

One of the more frequent occurrences is the silent fragmentation and re-assembly of packets and/or the attempts to transmit packets that exceed the MTU. These are all but guaranteed to lead to surprising outcomes and much headscratching.

A name like send_but_make_sure_you_read_the_documentation() would have probably been more appropriate but it's a bit unwieldy, and in the default case it is precisely the silent activation of various algorithms to fix common problems that allows you to get away with calling it 'send()' in the first place.


Probably because it’s the default in most other scenarios you’d call send (other languages etc).

So having a rare inverted default is bad for intuition.


I took the implication here to be that the kinds of people who don't read documentation don't know what Nagle is or have any expectations about it to begin with.


I don't at all understand this comment. You don't own this subthread? I don't even really recognize the boundaries of subthreads from the larger thread, at least not in the way you're suggesting? The article is about surprising consequences of this decision. This being a "good default" is very much a subject of contention in this discussion.

> (Again, being standard advice doesn't make it not debatable!)

This seems to accept my premise that it's what's in dispute?


I don't claim to own the thread, but since you've jumped in to respond on behalf of the other person I responded to, I'm going to tell you again that what you want to talk about here isn't what I'm here to talk about. There are plenty of other subthreads here talking about whether disabling Nagle by default is a good thing or not; maybe join one of them.


I'm not responding on anyone's behalf. I think your attitude here is really weird. You are in fact asserting that you are the arbiter of what can be discussed in this subthread. If you don't want to discuss what I'm discussing - just don't respond? Telling me to go away is so strangely aggressive, I'm baffled.

I'm not going to respond any further because this seems very unproductive.


Isn't that just how threads (and debates in general) work?


I even tried emailing tptacek to try and be part of the change he said he wanted to see. Crickets.

HN folk can be a bit hypersensitive and / or opaque at times. Text medium is not always ideal as it provides no signals for tone, and our brains backfill this information in a biased manner.


You emailed me less than an hour ago (I found out about it here, just now) and then tried to dunk on me for not replying. I think we can save ourselves some time and disengage.


Respond to my skywriting on the subject of error-handling, you coward! That plane was expensive and my opinions are important!


If I'd seen the skywriting before you complained, I would have! I like getting email--- err, skywriting messages! But I only check the, uh, sky a couple times a day!


That's reasonable. My bad.


Delayed ACK seems like the better default to me: whether it's telnet or web servers, network programming is almost always request/response. Delaying the ACK so that part of the response is ready seems like the correct choice. In today's network programming, how often is tinygram really an issue?

In this case I would consider the bug to be git lfs. Even if Nagle's was enabled I would still consider it a bug, because of the needless syscall overhead of doing 50 byte writes.


Actually if you're sending a file or something, do you really need Nagle's algorithm? It seems like the real mistake might be not using a large enough buffer for writing to the socket, but I could be speaking out my ass.

There's actually a lot of prevailing wisdom that suggests disabling Nagle's algorithm is (often) a good idea. While the problem with latency is caused by delayed ACKs, the sender can't do anything about that, because it's the receiver side that controls this.

Not saying that it's necessarily good that the standard library defaults to this... But this post paints the decision in an oddly uncharitable light. That said, I can't find the original thread where this was discussed, if there ever was one, so I have no idea why they chose to do this, and perhaps it shouldn't be this way by default.


It's often a good idea when the application has its own buffering, as is common in many languages and web frameworks which implement some sort of 'reader' interface which can alternate symbols of "chunks" and "flushes" or only emit entire payloads (a single chunk). With scatter-gather support for IO, it's generally OK for the application to produce small chunks followed by a flush. Those application layer frameworks want Nagle's algorithm turned off at the TCP layer to avoid double-buffering and incurring extra latency.

Go however is disabling Nagle's by default as opposed to letting it be a framework level decision.


This is a great point. Why is Git LFS uploading a large file in 50 byte chunks?


Ideally, large files would upload in MTU-sized packets, which Nagle's algorithm will often give you; otherwise you may have a small amount of additional overhead at the boundary, where the larger chunk may not be divisible into MTU-sized packets.

Edit: I mostly work in embedded (systems that don't run git-lfs), perhaps my view isn't sensible here.


Dividing data into MTU-sized packets is the job of the TCP stack - or even the driver or NIC in the case of offloads. Userspace software shouldn't deal with MTUs and should always use buffer sizes that make sense for the application - e.g. 64kB or even more. Otherwise the stack wouldn't be very efficient, with every tiny piece of data causing a syscall and independent processing by the networking stack.


Right; it sounds to me like the real bug is that git-lfs isn't buffering writes to the network driver. Correct me if I'm wrong but if git-lfs was buffering its writes (or using sendfile) then Nagle's algorithm wouldn't matter.


It matters less often - it can still matter at the end of each write buffer. Larger write-buffers remove a lot of chances for this to happen.

If the application can buffer the entire file or use sendfile, probably best to disable Nagle's algorithm so the last packet goes out immediately. Nginx does this.

Another option is turning off Nagle's algorithm at the end of each transfer, and on at the start of the next, but this causes extra syscalls.


I do not know Go. But what if there are so many high level abstractions in the Go language that it operates on streams directly?


The standard convention is to slap bufio.Reader/bufio.Writer on streams to make them more performant.

Though how LFS ends up with ~50 byte chunks is probably something very, very, dumb in the LFS code itself. Better to fix that mistake than to paper over it.


bufio is for adding buffering regardless of source/dest. Better in this case is ReaderFrom (which will also be used transparently by io.Copy) to let the socket control the buffering and apply even more optimizations. For something like git-lfs I could expect sendfile to provide a huge improvement, depending on the underlying storage.
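A minimal sketch of that approach (upload, conn and path are hypothetical names, not git-lfs code):

  // upload streams the whole file via io.Copy. *net.TCPConn implements
  // io.ReaderFrom, so the socket drives the transfer (using sendfile where
  // the platform supports it) instead of many tiny userspace writes.
  func upload(conn net.Conn, path string) error {
      f, err := os.Open(path)
      if err != nil {
          return err
      }
      defer f.Close()
      _, err = io.Copy(conn, f)
      return err
  }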


The footnote has a brief note about delayed ACKs but it's not like the creator of the socket can control whether the remote is delaying ACKs or not. If ACKs are delayed from the remote, you're eating the bad Nagle's latency.

The TCP_NODELAY behavior is settable and documented here [1]. It might be better to more prominently display this behavior, but it is there. Not sure what's up with the hyperbolic title or what's so interesting about this article. Bulk file transfers are far from the most common use of a socket and most such implementations use application-level buffering.

[1]: https://pkg.go.dev/net#TCPConn.SetNoDelay


The title is hyperbolic because a real person got frustrated and wrote about it, the article is interesting because a real person got frustrated at something many of us can imagine encountering but not so many successfully dig into and understand.

“Mad at slow, discovers why slow” is a timeless tale right up there with “weird noise at night, discovers it was a fan all along”, I think it’s just human nature to appreciate it.


> There's actually a lot of prevailing wisdom that suggests disabling Nagle's algorithm is (often) a good idea.

Because even in mediocre networks it is a good idea.

Don’t write a small amount of data if you want (or in this case even need) to send a large amount of data!


Some prior discussion about why turn on TCP_NODELAY: https://jvns.ca/blog/2015/11/21/why-you-should-understand-a-...

John Nagle's comments about it: https://news.ycombinator.com/item?id=10608356


IMO the real problem is that the socket API is insufficient, and the Nagle algorithm is a kludge around that.

When sending data, there are multiple logical choices:

1. This is part of a stream of data but more is coming soon (once it gets computed, once there is buffer space, or simply once the sender loops again).

2. This is the end of a logical part of the stream, and no more is coming right now.

3. This is latency-sensitive.

For case 1, there is no point in sending a partially full segment. Nagle may send a partial segment, which is silly. For case 2, Nagle is probably reasonable, but may be too conservative. For case 3, Nagle is wrong.

But the socket API is what it is, no one seems to want to fix this, and we’re stuck with a lousy situation.


I'm pretty convinced that every foundational OS abstraction that we use today, most of which were invented in the 70's or 80's, is wrong for modern computing environments. It just sucks less for some people than for other people.

I do think Golang's choice of defaulting to TCP_NODELAY is probably right - they expect you to have some understanding that you should probably send large packets if you want to send a lot of stuff, and you likely do not want packets being Nagled if you have 20 bytes you want to send now. TCP_QUICKACK also seems wrong in a world with data caps - the unnecessary ACKs are going to add up.

Issuing a SEND syscall every 50 bytes is also horrendously CPU-inefficient, and certainly should be expected to trigger pathological cases.

At this point, the OS is basically expected to guess what you actually want to do from how you incant around their bad abstractions, so it's not surprising that sending megabytes of data 50 bytes at a time would trigger some weird slowdowns.


> Issuing a SEND syscall every 50 bytes is also horrendously CPU-inefficient

This is the real crime here. The fact that it maxed out at 2.5MB/s might be quite literally due to a CPU limit.

If you are streaming a large amount of data, you should use a user space buffer anyway, especially if you have small chunks. In Golang, buffers are standard practice and a one-liner to add.


*pedantry warning*

In practice, buffers are more than a one-liner, as you probably want to deal with flushing them at some out-of-band moment (+1 line) as well as handle the error from that (+3 lines).
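For concreteness, a rough sketch of the one-liner plus those extra lines (sendBuffered, conn and chunks are hypothetical names; assumes the standard bufio package):

  // sendBuffered accumulates small chunks in a 64 KiB userspace buffer so
  // they go out in large writes, then flushes and handles the flush error.
  func sendBuffered(conn net.Conn, chunks [][]byte) error {
      w := bufio.NewWriterSize(conn, 64<<10)
      for _, chunk := range chunks {
          if _, err := w.Write(chunk); err != nil {
              return err
          }
      }
      return w.Flush() // the out-of-band flush, with its own error to handle
  }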


> Issuing a SEND syscall every 50 bytes is also horrendously CPU-inefficient

io_uring is supposed to help with that


This seems like it should be very simple to fix without having to do much to the API. Just implement a flush() function for TCP sockets that tells the stack to kick the current buffer out to the wire immediately. It seems so obvious that I think I must be missing something. Why didn't this appear in the 80s?


It’s not portable but Linux has a TCP_CORK socket option that does this.
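From Go, that looks roughly like the sketch below (Linux-only; assumes golang.org/x/sys/unix; setCork is a hypothetical helper - cork before a burst of small writes, uncork afterwards to flush the tail):

  import (
      "net"

      "golang.org/x/sys/unix"
  )

  // setCork toggles TCP_CORK on the connection. While corked, the kernel
  // holds back partial segments; uncorking flushes whatever is pending.
  func setCork(c *net.TCPConn, on bool) error {
      raw, err := c.SyscallConn()
      if err != nil {
          return err
      }
      v := 0
      if on {
          v = 1
      }
      var sockErr error
      if err := raw.Control(func(fd uintptr) {
          sockErr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_CORK, v)
      }); err != nil {
          return err
      }
      return sockErr
  }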


Here's how to emulate TCP_CORK using TCP_NODELAY, from [0]:

- unset the TCP_NODELAY flag on the socket

- Call send() zero or more times to add your outgoing data into the Nagle-queue

- set the TCP_NODELAY flag on the socket

- call send() with the number-of-bytes argument set to zero, to force an immediate send of the Nagle-queued data

[0] https://stackoverflow.com/a/22118709


Wow, that's not awkward at all.


It's a downside of the "everything is a file" mindset. As all abstractions are, it's leaky.

Nagle's algorithm is elegant because it allows poorly written applications to saturate a PHY.

Disabling it requires the application layer to implement its own buffer.

If I had a time machine and access to the early *nixes, I'd extend Nagle's algorithm and the kernel to treat fsync() as a signal to flush immediately.


> But the socket API is what it is, no one seems to want to fix this, and we’re stuck with a lousy situation.

Linux/FreeBSD/... have had the TCP corking API for what, 20 years?


IMO MSG_MORE is a substantially better interface. Sadly it seems to be rarely used.


My colleague added MSG_MORE support throughout libnbd[1]. It proved quite an elegant way to solve a common problem: You want to assemble a message in whatever protocol you're using, but it's probably being assembled across many functions (or in the case of libnbd, states in a complicated state machine), and using expanding buffers or whatever is a pain. So instead we let the kernel assemble it, or allow the kernel to make the decision to group the data or send it. The down side is multiple socket calls, but combining it with io_uring is a possibility to avoid this.

[1] https://gitlab.com/search?search=MSG_MORE&nav_source=navbar&...


Oh that is truly elegant, I didn't know about that.

Basically you set the MSG_MORE flag when you call `send` if you know you will have more data to send very soon, so the kernel is free to wait to form an optimally-sized packet instead of sending many small packets every time you run that syscall.


Latency can be affected by both CPU load and network congestion, so it's possible that Nagle's algorithm can help in Case 3. It's really trial and error to see what works best in practice.


The article is way too opinionated about "golang is doing it wrong" for a decision that has neither a right nor a wrong answer.

Nagle can make sense for some applications, but also has drawbacks for others - as countless articles about the interaction with delayed ACKs and 40ms pauses (which are pretty huge in the days of the modern internet) describe.

If one uses application-side buffering and syscalls which transmit all available data at once, enabling NODELAY seems like a valid choice. And that pattern is the one used by Go's http libraries, all TLS libraries (you want to encrypt a 16kB record anyway), and probably most other applications using TCP. It's rare to see anything doing direct syscalls with tiny payloads.

The main question should be why LFS has this behavior - which also isn’t great from an efficiency standpoint. But that question is best discussed in a bug report, and not a blog post of this format.


I prefer reliability over latency, always. The world won’t fall apart in 200ms, let alone 40ms. If you’re doing something where latency does matter (like stocks) then you probably shouldn’t be using TCP, honestly (avoid the handshake!)

When it comes to code, readability and maintainability are more important. If your code is reading chunks of a file and then sending them out as packets, you won't know the MTU or changes to the MTU along the path. Send your chunk and let Nagle optimize it.

Further, principle of least surprise always applies. The OS default is for Nagle to be enabled. For a language to choose a different default (without providing a reason), and one that actively is harmful in poor network conditions at that, was truly surprising.


TCP is always reliable, the choice of this algorithm will never impact this - it will only impact performance (bandwidth/latency) and efficiency.

Enabling nagle by default will lead to elevated latencies with some protocols that don't require the peer to send a response (and thereby a piggybacked ACK) after each packet. Even a "modern" TLS1.3 0RTT handshake might fall into that category. This is a performance degradation.

The scenario that is described in the blog post where too many small packets due to nothing aggregating them causing elevated packet loss is a different performance degradation, and nothing else.

Both of those can be fixed - the former only by enabling TCP_NODELAY (since the client won't have control over servers), the second by either keeping TCP_NODELAY disabled *or* by aggregating data in userspace (e.g. using a BufferedWriter - which a lot of TLS stacks might integrate by default).

> The world won’t fall apart in 200ms, let alone 40ms.

You might be underestimating the latency sensitivity of the modern internet. Websites are using CDNs to get to a typical latency in the 20ms range. If this suddenly increases to 40ms, the internet experience of a lot of people might get twice as bad as it is at the moment. 200ms might directly push the average latency into what is currently the P99.9 percentile.

And it would get even worse for intra-datacenter use cases, where the average is in the 1ms range - and where accumulated latencies would still end up being noticeable to users (the latency of any RPC call is the accumulated latency of upstream calls).

> If your code is reading chunks of a file then sending it to a packet, you won’t know the MTU or changes to the MTU along the path

Sure - you don't have to. As mentioned, you would just read into an intermediate application buffer of a reasonable size (definitely bigger than 16kB or 10 MTUs) and let the OS deal with it. A loop along `n = read(socket, buffer); write(socket, buffer[0..n])` will not run into the described issue if the buffer is reasonably sized and will be a lot more CPU efficient than doing tiny syscalls and expecting all aggregation to happen in TCP send buffers.


Much of the world is doing OK with TCP and TLS, but with session resumption and long-lived connections. Many links will be marked bad in 200 ms and retries or new links issued. Imagine you are doing 20k / second / CPU. That is four thousand backed-up calls for no reason, just randomness.


> I prefer reliability over latency, always.

I imagine all the engineers who serve millions/billions of requests per second disagree with adding 200ms to each request, especially since their datacenter networks are reliable.

> Send your chunk and let Nagle optimize it.

Or you could buffer yourself and save dozens/hundreds of expensive syscalls. If adding buffering makes your code unreadable, your code has bigger maintainability problems.


I’ve done quite a bit of testing on my shitty network (plus a test bench using Docker and plumba) in the last 24 hours — I’m not finished so take the rest of this with a grain of salt. There will be a blog post about this in the near future… once I finish the analysis.

Random connection resets are much more likely when disabling Nagle’s algorithm. As in 2-4x more likely, especially with larger payloads. Most devs just see “latency bad” without considering the other benefits of Nagle: you won’t send a packet until you receive an ACK or the packet is full. On poor networks, you always see terrible latency (even with Nagle’s disabled, 200-500ms is the norm) and with Nagle’s the throughput is a bit higher than without, even with proper buffering on the application side.


> And that pattern is the one that is used by GOs http libraries

I don't think that is correct. In https://news.ycombinator.com/item?id=34213383, I notice that Go's HTTP/2 library would write the HEADERS frame, the DATA frame, and the terminal HEADERS frame in 3 different syscalls. In a sample application using the Go's HTTP/2 library, a gRPC response without Nagle's algorithm would transmit 497 bytes over 6 packets, while a gRPC response with Nagle's algorithm would transmit 275 bytes over 2 packets.

With a starting point where both Nagle's algorithm and delayed ack are enabled, I guess this is the order of preference:

1. delayed ack disabled, applications do the right thing by buffering accordingly - ideal performance, but it is difficult to disable delayed ack, and it may require a lot of work to fix the applications.

2a. Nagle's algorithm disabled, applications do the right thing by buffering accordingly - almost ideal performance (may perform worse than #1 over bad connections), but it may require a lot of work to fix the applications.

2b. delayed ack disabled, real world applications - almost ideal performance (may have higher syscall overhead than #1), but it is difficult to disable delayed ack.

3. Nagle's algorithm disabled, real world application - not ideal as some applications can suffer from high packet overhead, e.g. git-lfs, and this is where we are at with Go.

4. baseline - far from ideal as many applications can suffer from high latency due to bad interaction between Nagle's algorithm and delayed ack.

I would say Go has made the right trade-off, albeit with a slight hint of "we know better than you". Going forward, it is probably cheaper for the Linux kernel to come up with a better API to disable delayed ack (i.e. to achieve #2b) than to get the affected applications to do the right thing by buffering accordingly (i.e. to achieve #1 or #2a). We will see how soon https://github.com/git-lfs/git-lfs/issues/5242 can be resolved.

In the meantime, #2b can actually be achieved with an "SRE approach" by patching the kernel to remove delayed ack and patching the Go library to remove the `setNoDelay` call. Something for OP to try?


I just learnt about "ip route change ROUTE quickack 1" from https://news.ycombinator.com/item?id=10662061, so we don't even need to patch the kernel. This makes 2b a really attractive option.


I'm using Go's default HTTP client to make a few requests per second. I set a context timeout of a few seconds for each request. There are random 16 minute intervals where I only get the error `context deadline exceeded`.

From what I found, Go's default client uses HTTP/2 by default. When a TCP connection stops working, it relies on the OS to decide when to time out the connection. Over HTTP/1.1, it closes the connection itself [1] on timeout and makes a new connection.

In Linux, I guess the timeout for a TCP connection depends on `tcp_retries2` which defaults to 15 and corresponds to a time of ~15m40s [2].

This can be simulated by making a client and some requests and then blocking traffic with an `iptables` rule [3]. My solution for now is to use a client that only uses HTTP/1.1.

[1] https://github.com/golang/go/issues/36026#issuecomment-56902...

[2] https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/

[3] https://github.com/golang/go/issues/30702
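
For reference, a minimal sketch of the HTTP/1.1-only workaround mentioned above: net/http documents that a non-nil (even empty) TLSNextProto map disables automatic HTTP/2 on a transport. The timeout value here is arbitrary.

    import (
        "crypto/tls"
        "net/http"
        "time"
    )

    // http1OnlyClient never upgrades to HTTP/2, so a dead TCP connection is
    // closed on timeout instead of lingering for ~15 minutes.
    func http1OnlyClient() *http.Client {
        tr := &http.Transport{
            // A non-nil (even empty) TLSNextProto map disables HTTP/2.
            TLSNextProto: make(map[string]func(authority string, c *tls.Conn) http.RoundTripper),
        }
        return &http.Client{Transport: tr, Timeout: 10 * time.Second}
    }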


You can configure the HTTP/2 client to use a timeout + heartbeat.

https://go.googlesource.com/net/+/master/http2/transport.go
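
Concretely, golang.org/x/net/http2 exposes ReadIdleTimeout and PingTimeout; a rough sketch (the values are arbitrary, check the linked transport.go for the exact semantics in your version):

    import (
        "net/http"
        "time"

        "golang.org/x/net/http2"
    )

    func newPingingClient() (*http.Client, error) {
        tr := &http.Transport{}
        h2, err := http2.ConfigureTransports(tr)
        if err != nil {
            return nil, err
        }
        h2.ReadIdleTimeout = 15 * time.Second // send a PING if no frames arrive for 15s
        h2.PingTimeout = 5 * time.Second      // close the conn if the PING isn't answered in 5s
        return &http.Client{Transport: tr}, nil
    }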


That's a big file. Mind pointing to a specific line number?


https://go.googlesource.com/net/+/master/http2/transport.go#...

Looks like it got cut off when I originally pasted it.


That sounds like there is pooling going on without invalidating the pooled connection when a timeout happens. I've actually seen a lot of libraries in other languages do a similar thing (in my experience some of the Elixir libraries don't have good pool invalidation for HTTP connections). Having a default invalidation policy that handles all situations is a bit difficult, but I think a default policy that invalidates on any timeout is much better than a default policy that never invalidates on a timeout, as long as invalidation just means evicting the connection from the pool and not tearing down other channels on the HTTP/2 connection. For example, you could have a timeout on an HTTP/2 connection that affects only an individual channel while data is still flowing through the other channels.


Wow. Can you easily change the tcp connection timeout?


You can. It’s trivial once you know it’s possible. Not sure why it’s not set by default. https://go.googlesource.com/net/+/master/http2/transport.go


To be clear, this is for http/2, not tcp. You can very easily set read and write deadlines on tcp conns, but you can’t detect if a peer has disappeared without data. You can set keepalive but it’s not reliable and varies wildly between OSs.

You need a heartbeat or ping message together with an advancing deadline to detect dead peers reliably.
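
A minimal sketch of that pattern over a raw net.Conn, assuming a made-up "PING\n" framing and that the peer sends something back; real code would use the protocol's own ping/pong messages:

    import (
        "net"
        "time"
    )

    // watch pings the peer every interval and requires *some* inbound data
    // within 2*interval; otherwise the advancing read deadline fires and the
    // peer is presumed dead.
    func watch(conn net.Conn, interval time.Duration) error {
        done := make(chan struct{})
        defer close(done)
        go func() {
            t := time.NewTicker(interval)
            defer t.Stop()
            for {
                select {
                case <-done:
                    return
                case <-t.C:
                    if _, err := conn.Write([]byte("PING\n")); err != nil {
                        return // write failed; the reader will notice via its deadline
                    }
                }
            }
        }()
        buf := make([]byte, 4096)
        for {
            // Advance the deadline on every successful read.
            if err := conn.SetReadDeadline(time.Now().Add(2 * interval)); err != nil {
                return err
            }
            if _, err := conn.Read(buf); err != nil {
                return err // includes i/o timeout: peer presumed dead
            }
        }
    }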


HTTP/2 supports a heartbeat in the protocol using PING frames. But I guess a lot of clients probably don’t support it or use it by default.


As a counter-argument, I've run into serious issues that were caused by TCP delay being enabled by default, so I ended up disabling it. I actually think having it disabled by default is the right choice, assuming you have the control to re-enable it if you need to.

Also, in my opinion, if you want to buffer your writes, then buffer them in the application layer. Don't rely on the kernel to do it for you.


The kernel has to buffer everything you send in a sliding window, to retry missed acks. Userspace buffering only reduces syscalls.

A lot of people with strong preferences about segment boundaries and timing are arguing with TCP and probably shouldn’t be using it.


> Userspace buffering only reduces syscalls.

"only". The kernel also buffers disk writes, but god help you if you're writing files to disk byte by byte.


I talked a bit about that in the post. Use your own buffers if possible, but there are times you can't do that reliably (proxies come to mind) where you'd have to basically implement an application-specific Nagle's algorithm. If you find yourself writing something similar, it's probably better to let the kernel do it and keep your code simpler to reason about.


If you are writing a serious proxy you should be working at either a much lower level (eg splice) or a much higher level (ReadFrom, Copy). If you’re messing around with TCPConn parameters and manual buffer sizes you’ve already lost.


Not just network proxies, but possibly proxying/transforming device I/O (like USB-over-Ethernet).


Goalposts are receding, but this is exactly the higher level I mentioned. Use io.Copy, and if you need any kind of transforms implement them as Readers.
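
A quick sketch of that shape: io.Copy handles the buffering (and can use ReadFrom/WriteTo, hence sendfile/splice, when no transform is in the way), and any transform is just an io.Reader wrapper. upperReader below is a made-up example transform.

    import "io"

    // upperReader is an illustrative transform: it upper-cases ASCII letters
    // as bytes pass through.
    type upperReader struct{ r io.Reader }

    func (u upperReader) Read(p []byte) (int, error) {
        n, err := u.r.Read(p)
        for i := 0; i < n; i++ {
            if p[i] >= 'a' && p[i] <= 'z' {
                p[i] -= 'a' - 'A'
            }
        }
        return n, err
    }

    func proxy(dst io.Writer, src io.Reader) error {
        _, err := io.Copy(dst, upperReader{r: src})
        return err
    }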


I haven't thought about this hard, but would a proxy not serve its clients best by being as transparent as possible, meaning it forwards packets whenever it receives them from either side? I think this would imply setting no_delay on all proxies by default. If either side of the connection has a delay, then the delay will be honored because the proxy will receive packets later than it would otherwise.


IFF you are LAN->LAN or even DC->DC, NoDelay is usually better nowadays. If you are having to retransmit at that level you have far larger problems somewhere else.

If you're buffering at the abstracted transport level, Same.


Because you're supposed to have buffering at a different layer.


Networking is the place where I notice how tall modern stacks are getting the most.

Debugging networking issues inside of Kubernetes feels like searching for a needle in a haystack. There are so, so many layers of proxies, sidecars, ingresses, hostnames, internal DNS resolvers, TLS re/encryption points, and protocols that tracking down issues can feel almost impossible.

Even figuring out issues with local WiFi can be incredibly difficult. There are so many failure modes and many of them are opaque or very difficult to diagnose. The author here resorted to WireShark to figure out that 50% of their packets were re-transmissions.

I wonder how many of these things are just inherent complexity that comes with different computers talking to each other and how many are just side effects of the way that networking/the internet developed over time.


Kubernetes has no inherent or required proxies or sidecars or ingresses, or TLS re-encryption points.

Those are added by “application architects”, or “security architects” and existed long before Kubernetes, for the same debatable reasons: they read about it in a book or article and thought it was a neat idea to solve a problem. Unfortunately, they may not understand the tradeoffs deeply, and may have created more problems than were solved.


There's been a highly annoying kubectl port-forward heisenbug open for several years which smells an awful lot like one of these dark Go network layer corners. You get a good connection established and some data flows, but at some random point it decides to drop. It's not annoying enough for any wizards to fix. I immediately thought of this bug when Nagle in Go came up here.

https://github.com/kubernetes/kubernetes/issues/74551


Wireshark exists since forever.


Wireshark doesn't tell you anything about what's wrong with your code. It just tells you "yup, the code is doing something wrong!"

Figuring that out in Kubernetes ... yeah, good luck with that.


And that or tcpdump should be the first thing you grab to diagnose a network issue.


Tcpdump to dump, yes, but wireshark is better to visualize.


Go was explicitly designed for writing servers. This means two things are normally true:

- latency matters, for delivering a response to a client

- the network is probably a relatively good datacenter network (high bandwidth, low packet loss/retransmission)

Between these things, I think the default is reasonable, even if not what most would choose. As long as it’s documented.

The fact that other languages have other defaults, or the fact that people use Go for all sorts of other things like system software, doesn’t invalidate the decision made by the designers.


> the network is probably a relatively good datacenter network (high bandwidth, low packet loss/retransmission)

The first lesson I learned about Distributed Systems Engineering is the network is never reliable. A system (or language) designed with the assumption the network is reliable will tank.

But I also don’t agree that Go was written with that assumption. Google has plenty of experience in distributed systems, and their networks are just as fundamentally unreliable as any.


“Relatively” may have needed some emphasis here, but in general, networking done by mostly the same boxes operated by the same people, in the same climate controlled building, are going to be far more reliable than home networks, ISPs running across countries, regional phone networks, etc.

Obviously nothing is perfect, but applications deploying in data centres should probably make the trade offs that give better performance on “perfect” networks, at the cost of poorer performance on bad networks. Those deploying on mobile devices or in home networks may better suit the opposite trade offs.


> The first lesson I learned about Distributed Systems Engineering is the network is never reliable

Yep, and it's a good rule. It's the one Google applies across datacenters.

... but within a datacenter (i.e. where most Go servers are speaking to each other, and speaking to the world-accessible endpoint routers, which are not written in Go), the fabric is assumed to be very clean. If the fabric is not clean, that's a hardware problem that SRE or HwOps needs to address; it's not generally something addressed by individual servers.

(In other words, were the kind of unreliability the article author describes here on their router to occur inside a Google datacenter, it might be detected by the instrumentation on the service made of Go servers, but the solution would be "If it's SRE-supported, SRE either redistributes load or files a ticket to have someone in the datacenter track down the offending faulty switch and smash it with a hammer.")


Relatively reliable. Not "shitty". If you've got a datacenter network that can be described as "shitty", fix your network rather than blaming Go.


This is an embarrassing response. The second lesson you should’ve learned as a systems engineer, long before any distributed stuff, is “turn off Nagle’s algorithm.” (The first being “it’s always DNS”.)

When the network is unreliable larger TCP packets ain’t gonna fix it.


Usually you have control over one of them only. If you run the whole network, sure, fix that instead. But if you don't, sending fewer larger packets can actually improve the situation even if it doesn't fix it.


Fewer packets yes, but I've been on several networks where sending large packets ends up with bad reordering and dropping behavior.


But it will at least let it get out of slow-start.


It's strange you're getting hammered for this. Everyone in 6.824 would probably agree with you. https://pdos.csail.mit.edu/6.824/

Let's weigh the engineering tradeoffs. If someone is using Go for high-performance networking, does the gain from enabling NDELAY by default outweigh the pain caused by end users?

Defaults matter; doubly so for a popular language like Go.


I have worked on networked projects ranging from modern datacenters to ca. 2005 consumer-grade ADSL in Ohio to cellular networks in rural South Asia.

There are situations where you want Nagle's algorithm on; when you have stable connections but noisy transmission, streams of data with no ability to buffer, and no application-level latency requirements. There are not many such situations. It is not any of these, and it's certainly not within any datacenter.


Nagle's algorithm also really screws with distributed systems - you are going to be sending quite a few packets with time bounds, and you REALLY don't want them getting Nagled.

In fact, Nagle's algorithm is a big part of why a lot of programmers writing distributed systems think that datacenter networks are unreliable.


I don't think this is correct. 6.824 emphasizes reliability over latency. They mention it in several places: https://pdos.csail.mit.edu/6.824/labs/guidance.html

> It should be noted that tweaking timeouts rarely fixes bugs, and that doing so should be a last resort. We frequently see students willing to keep making arbitrary tweaks to their code (especially timeouts) rather than following a careful debugging process. Doing this is a great way to obscure underlying bugs by masking them instead of fixing them; they will often still show up in rare cases, even if they appear fixed in the common case.

> In particular, in Raft, there are wide ranges of timeouts that will let your code work. While you CAN pick bad timeout values, it won't take much time to find timeouts that are functional.

Their unit tests are quite worthwhile to read, if only to absorb how many ways latency assumptions can bite you.

It's true that in the normal case, it's good to have low latency. But correctly engineered distributed systems won't reorganize themselves due to a ~200ms delay.

To put it another way, if a random 200ms fluctuation causes service disruptions, your system probably wasn't going to work very well to begin with. Blaming it on Nagle's algorithm is a punt.


In my decades of experience in telco, capital markets, and core banking, unexplained latency spikes of hundreds of ms are usually analyzed to death as they can have ripple effects. I’ve had 36-hour severity 1 incidents with multiple VPs taking notes on 24/7 conference calls when a distributed system starts showing latency spikes in the 400ms range.

No, the system isn’t going haywire, but 200-400ms is concerning inside a datacenter for core apps.

But let’s forget IT apps, let’s talk about the network. In a network 200ms is catastrophic.

Presumably you know BGP is the very popular distributed system that converges Internet routes?

Inside a datacenter, Bidirectional Forwarding Detection (BFD) is used to drop BGP convergence times to sub-second if you’re using it as an IGP. BFD is also useful with other protocols, but anyway. It has heartbeats of 100-300ms. If there’s a fluctuation of the network lasting 3x that interval, it will drop the link and trigger a round of convergence. This is essential in core networks or telco 4G/5G transport networks.

Of course, flapping can be the consequence of setting too low an interval. Tradeoffs.

Back to the original point, I’ve contributed to the code of equity and bond trading apps, telco apps, core banking systems. And cloud/Kubernetes systems. All RPC distributed systems. Every. Single. One. That performed well… For 30 years! Has enabled TCP_NODELAY. Except when serving up large amounts of streaming data. And the reason fundamentally is that most of the time you have less control over client settings (delayed TCP acks), so it’s easier to control the server.


That is all well and good in an academic setting. Many distributed systems in the real world like having time bounds under 200 ms for certain things like Paxos consensus within a datacenter. It turns out that latency, at some level, is equivalent to reliability, and 200 milliseconds is almost always well beyond that level.


I’m not sure what else to say than “this isn’t true.” 6.824’s labs have been paxos-based for at least the better part of a decade, and at no point did they emphasize latency as a key factor in reliability of distributed systems. If anything, it’s the opposite.

Dismissing rtm as “academic” seems like a bad bet. He’s rarely mistaken. If something were so fundamental to real-world performance, it certainly wouldn’t be missing from his course.


I'll be sure to tell my former colleagues (who build distributed storage systems at Google) that they are wrong about network latency being an important factor in the reliability of their distributed systems because an MIT course said so.

I'm not insinuating that your professor doesn't know the whole picture - I'm sure he does research in the area, which would mean that he is very familiar with the properties of datacenter networks, and he likely does research into how to make distributed systems fast. I'm suggesting that he may not be telling it to you because it would complicate his course beyond the point where it is useful for your learning.


Tell you what. If you ask your colleague “Do you feel that a 100ms delay will cause our distributed storage system to become less reliable?” and they answer yes, I’ll venmo you $200. If you increase it to 200ms and they say yes, I’ll venmo you $100. No strings attached, and I’ll take you at your word. But you have to actually ask them, and the phrasing should be as close as possible.

If we were talking >1s delays, I might agree. But from what I know about distributed systems, it seems $200-unlikely that a Googler whose primary role is distributed systems would claim such a thing.

The other possibility is that we’re talking past each other, so maybe framing it as a bet will highlight any diffs.

Note that the emphasis here is “reliability,” not performance. That’s why it’s worth it to me to learn a $200 lesson if I’m mistaken. I would certainly agree as a former gamedev that a 100ms delay degrades performance.


Raft can easily be tuned to handle any amount of latency. The paper even discusses this. The issue is “how long are you willing to be in a leaderless state”, and for some applications it’s very tight. If your application needs a leader to make a distributed decision, but one is currently unavailable, the application might not know how to handle that, or it might block until a leader becomes available, causing throughput issues.

However, you shouldn’t be using TCP for latency-sensitive applications IMHO. Firstly, TCP requires a handshake on any new connection. This takes time. Secondly, if 3 packets are sent and the first one is lost, you won’t get those last packets for a couple hundred ms anyway (default retransmission times). So you’re better off using something like UDP. So, if you need the properties of TCP, you aren’t doing latency-sensitive anything.


TCP within datacenters is tuned to have shorter retransmit times, and tends to use long-lived connections.

See this RFC for more info on retransmit times: https://www.rfc-editor.org/rfc/rfc6298


I'll give you a few examples, and maybe I'll run a casual poll next time we get a beer. No venmo.

I will point out that leader election generally has very long timeouts (seconds), but a common theme here is that you do lots of things that are not leader election but have short timeouts which can cause systems to reconfigure because the system generally wants to run in a useful configuration, not just a safe configuration.

In a modern datacenter, 100 milliseconds is ample time to determine whether a server is "available" and retry on a different server - servers and network switches can get slow, and when they get THAT slow, something is clearly wrong, so it is better to drain them and move that data somewhere else. When the control plane hears about these timeout failures from clients, it dutifully assumes that a server is offline and drains it.

Usually, this works well: the machine-to-machine latency within a datacenter is way less than 100 microseconds, and if you include the OS stack under heavy load, it might get all the way to 1 millisecond. Something almost always is wrong if a very simple server can't respond within 10-100 milliseconds. This results in 10-100 millisecond response times meaning "not available" at the lower layers of the stack. As I mentioned before, enough reports of "unavailable" result in a machine being drained, and a critical number of these results in an outage.

Attack of the killer microseconds is a good paper that addresses the issue here (albeit obliquely): https://dl.acm.org/doi/10.1145/3015146

Here are a couple of examples:

* There is a very important 10 ms timeout in Colossus (distributed filesystem) to determine the liveness of a disk server - I have seen one instance where enough of a cell broke this timeout due to a software change, and made the entire cell go read-only. In another instance, a small test cell went down due to this timeout under one experiment.

* Another cell went down under load due to a different 10 ms liveness timeout thanks to the misapplication of Nagle's algorithm (although not to networking) - I forget if it was a test cell or something customer-facing.

* Bigtable (NoSQL database) has a similar timeout under 100 ms (but greater than 10 ms) for its tablet servers. I'm sure Spanner (NewSQL database) has the same.


At 200 ms you start to assume the other end is dead and retry. You don’t get sub second consumer response times with random 200 ms delays on data center to data center calls.


The default last_server_contact timeout for Consul is 200ms. Can I have $200?


Maybe. The sticking point for me is that I’ve implemented enough distributed system protocols to know that even if a server occasionally drops out, the overall reliability of the service isn’t affected. I would be very curious to hear from someone in the field if they feel differently.

It’s easy to assume that a server dropout = less reliable network. But even if a leader election were happening every minute, it seems unlikely to drastically affect any ops in flight.

But sure, if they agree I’ll venmo you $200 too.


We operate a large Consul cluster (Consul is great, but we abuse the hell out of it). Frequent leader elections have been responsible for outages. Don't worry about the $200, I'm just fucking with you, but I don't think you're on very firm ground with this line of argument. It's fun to watch though, so I do hope you keep going with it. :)


Hmm. Thank you for the datapoint. It’s why I scaled the bet down to $100 for 200ms.

I think it’s worth uncovering whether a 100ms delay could result in an outage. If I were on call, it’d be hard to sleep knowing that was true.

The root claim is of course that disabling NDELAY can result in an outage. It still seems $200-unlikely that this could be true. Certainly it might cause performance problems, but the claim was reliability. Outages would put it firmly in the “unreliable” section of the Venn diagram.

My claim about 1min leader reelections is admittedly more suspicious. It’s surprising the reelections caused outages. But I suppose if there were a lot of long-running operations that needed a total order, frequent reelections would hose that.


In fairness, I don't know if we kept the default. I'm responding to two independent things at this point: first, there are definitely systems where 200ms delays have rippling impacts, and second, leader elections aren't always benign.

(Consul would, I'm sure, converge eventually regardless of the election frequency, but that doesn't mean everything that relies on Consul will tolerate those delays).

I don't have much of a take here, beyond that I don't think you can extrapolate as much from what's on the 6.824 pages as you might have done here. Certainly, in a system where 200ms is the difference between "healthy" and "not healthy" status on a peer relationship, I'd think you'd want Nagle disabled. But I haven't thought carefully about this, or looked that closely at the typical packet flow between Consul nodes. I could be wrong about all of this; more reason not to give me any money.

Later

Per the comment upthread, I haven't even bothered to check which parts of this packet flow are even TCP to begin with.


I've never directly used Consul's internals, but I'm guessing it uses Stubby, which is built on top of TCP.


It does Serf over UDP, but I get fuzzy on the integration of Serf and Consul.


Raft and the Consul RPC API use TCP, Serf uses both TCP and UDP.

While the Consul RPC API may have grown options to use gRPC (I forget now), Raft uses length-prefixed msgpack PDUs.


Whoops, I thought this was a Google product, given the discussion. Stubby is basically GRPC internal to Google.


it seems to me like systems like these are the exception rather than the rule. you can always turn off nagle's algorithm if you have something really latency-sensitive, but it should not be off by default.

200 ms is not the end of the world in most cases, it's far better than relying on everything doing its own buffering correctly and suffering a massive performance penalty when something inevitably doesn't.


I have to disagree: 200 ms is usually most of your latency budget in my experience. 200 ms delays randomly kill your p99 numbers and harm the customers. Most internet traffic is in the data center, not to the edge. And I assume Fastly, Akamai, and Cloudflare are all aware of how to tune for slow last miles.


It was originally designed as a C/C++ replacement, not necessarily for servers. If I remember right the first major application it was used for was log processing (displacing Google’s in-house language Sawzall) rather than servers.


Everything developed at Google is intended for transforming protobufs. And how are you going to get some protobufs in the first place? /s


Next step: build a language on top of protobuf.


Golang has burned me more than once with bizarre design decisions that break things in a user hostile way.

The last one we ran into was a change in Go 1.15 where servers that presented a TLS certificate with the hostname encoded into the CN field instead of the more appropriate SAN field always fail validation.

The behavior could be disabled however that functionality was removed in 1.18 with no way to opt back into the old behavior. I understand why SAN is the right way to do it but in this case I didn’t control the server.

Developers at Google probably never have to deal with 3rd parties with shitty infrastructure but a lot of us do.

Here’s a bug in rke that’s related https://github.com/rancher/rke2/issues/775


The x509 package has unfortunately burned me several times, this one included. It is so anal about non-fatal errors that Google themselves forked it (and asn1) to improve usability.

https://github.com/google/certificate-transparency-go


Sorry for the late response, but thank you so much for showing me this.


It also doesn’t play well with split tunnel VPN’s on macOS that are configured for particular DNS suffixes. If you have a VPN that is only active for connections in a particular domain, git-lfs (and I think any go software, by default) will try to use your non-VPN connection for connections that should be on the VPN.

I don’t know why it is, exactly… but I think it’s related to Golang intentionally avoiding using the system libc and implementing its own low-level TCP/IP functions, leading to it not using the system configuration which tells it which interface to use for which connections.

Edit: now that I think about it, I think the issue is with DNS… macOS can be configured such that some subdomains (like on a VPN) are resolved with different DNS servers than others, which helps isolate things so that you only use your VPN’s DNS server for connections that actually need it. Go’s DNS resolution ignores this configuration system and just uses the same server for all DNS resolution, hence the issue.


Go’s choice to default to its own TCP/IP implementation has bitten me personally to the level of requiring a machine restart.

The Go IPv6 DNS resolution on MacOS can cause all DNS requests on the system to begin to fail until a restart.

https://github.com/golang/go/issues/52839


Not to understate the impact of the bug, but this is not the default for Go. It is used if CGo is disabled, as the issue you linked to describes.


That is the default for Go if cross-compiled, however, and most software compiled on a CI server is cross-compiled to Darwin.

Fortunately Go 1.20 fixes this, using the system resolver even without CGo on Darwin platforms [1].

[1]: https://go-review.googlesource.com/c/go/+/446178


The OS network stack is crashing and this is Go's fault? Is Go holding the network stack wrong?


To be fair, "getaddrinfo is _the_ path" is a shitty situation.

- It's a synchronous interface. Things like getaddrinfo_a are barely better. It has forced people to do stuff like https://c-ares.org/ for ages, which has suffered from "is not _the_ path" issues for as long

- It's a less featured interface than, for example, https://wiki.freedesktop.org/www/Software/systemd/writing-re...


This explains why several of my Go programs needed the occasional restart because of terribly slow transfers over mobile networks.

These weird decisions that go against the norm are exactly why I hate writing Go. There are hidden footguns everywhere, and the only way to prevent them is to role-play as a Google backend dev in a hurry.


>This explains why several of my Go programs needed the occasional restart because of terribly slow transfers over mobile networks.

It doesn't explain that. Why would this cause you to need to restart your applications? At most it will just decrease performance of that transfer.


From my experience most TCP using projects existing for a longer time disable Nagle's Algorithm sooner or later, we did so at Proxmox VE in 2013:

https://git.proxmox.com/?p=pve-manager.git;a=commitdiff;h=fd...

Most of the time it just makes things worse nowadays, so yes, having it disabled by default makes IMO sense.


Meanwhile almost every project I work on is latency sensitive, and I’ve lost track of how many times the fix to bad performance was “disable Nagle’s algorithm”.

Honestly the correct solution here is probably “there is no default value, the user must explicitly specify on or off”. Some things just warrant a coder to explicitly think about it upfront.


It’s delayed ack on the client side which adds that slowdown. The spec allows the client to wait up to 500 ms to send it.


Delayed ACKs send an ACK every other packet, so a lone packet may have to wait up to 200ms for its ACK. If you have enough data for two packets then you won’t even notice a delay (probably most data these days, unless you have jumbo frames all the way to the client).

If you control the client, you can turn on quick ACKs and still use Nagle’s algorithm to batch packets.
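
On Linux, "turn on quick ACKs" means setting TCP_QUICKACK; a hedged sketch via golang.org/x/sys/unix (Linux-only, and note the kernel can fall back to delayed ACKs again later, so latency-sensitive code tends to re-apply this around reads):

    import (
        "net"

        "golang.org/x/sys/unix"
    )

    func setQuickAck(c *net.TCPConn) error {
        raw, err := c.SyscallConn()
        if err != nil {
            return err
        }
        var optErr error
        ctlErr := raw.Control(func(fd uintptr) {
            // 1 = ACK immediately instead of delaying up to ~200ms.
            optErr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_QUICKACK, 1)
        })
        if ctlErr != nil {
            return ctlErr
        }
        return optErr
    }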


The problem does not seem to be that TCP_NODELAY is on, but that the packets sent carry only 50 bytes of payload. If you send a large file, then I would expect that you invoke send() with page-sized buffers. This should give the TCP stack enough opportunity to fill the packets with a reasonable amount of payload, even in the absence of Nagle's algorithm. Or am I missing something?


Even if the application is making 50 byte sends, why aren't these getting coalesced once the socket's buffer is full? I understand that Nagle's algorithm will send the first couple packets "eagerly", but I would have expected that once the transmit window is full they start getting coalesced, since they are being buffered anyways.

Disabling Nagle's algorithm should be trading network usage for latency. But it shouldn't reduce throughput.


> Even if the application is making 50 byte sends why aren't these getting coalesced once the socket's buffer is full?

Because maybe the 50 bytes are latency sensitive and need to be at the recipient as soon as possible?

> I understand that Nagle's algorithm will send the first couple packets "eagerly" […] Disabling Nagle's algorithm should be trading network usage for latency

No, Nagle's algorithm will delay outgoing TCP packets in the hope that more data will be provided to the TCP connection, that can be shoved into the delayed packet.

The issue here is not Go's default setting of TCP_NODELAY. There is a use case for TCP_NODELAY. Just like there is a use case for disabling TCP_NODELAY, i.e., Nagle's algorithm (see RFC 896). So any discussion about the default behavior appears to be pointless.

Instead, I believe the application or an underlying library is to blame, because I don’t see why applications performing a bulk transfer of data using “small” (a few bytes) writes is anything but a bad design. Not writing large (e.g., page-sized) chunks of data into the file descriptor of the socket, especially when you know that multiple more of these chunks are to come, just kills performance on multiple levels.

If I understand the situation the blog post describes correctly, then git-lfs is sending a large (50 MiB?) file in 50 bytes chunks. I suspect this is because git-lfs (or something between git-lfs and the Linux socket, e.g., a library) issues writes to the socket with 50 bytes of data from the file.


> Because maybe the 50 bytes are latency sensitive and need to be at the recipient as soon as possible?

The difference in latency between a 50 byte and 1500 byte packet is miniscule. If you have the data available in the socket buffer I don't see why you wouldn't want to send it in a single packet.

The latency benefit of TCP_NODELAY should be that it isn't waiting for user space to write more data, not that it is sending short packets.


Modern programming does buffering at the class level rather than the system-call level. Even if Nagle solves the problem of sending lots of tiny packets, it doesn't solve the problem of making many inefficient system calls. Plus, the best buffer size and flush policy can only be determined by application logic. If I want smart lights to pulse in sync with music heard by a microphone, delaying to optimize network bandwidth makes no sense. So providing a raw interface with well-defined behavior by default and taking care of things like buffering in wrapper classes is the right thing to do.


> the best buffer size and flush policy can only be determined by application logic

That's not really true. The best result can be obtained by the OS, especially if you can use splice instead of explicit buffers. Or sendfile. There's way too much logic in this to expect each app to deal with this, or even things it doesn't really know about like current IO pressure, or the buffering and caching for a given attached device.

Then there are things you just can't know about. You know about your MTU, for example, but won't be monitoring the changes for the given connection. The kernel already knows how to scale the buffers appropriately, so it can do the flushes in a better way than the app (if you're after throughput, not latency).


> The kernel already knows how to scale the buffers appropriately, so it can do the flushes in a better way than the app (if you're after throughput, not latency)

Well, how can the OS know if I'm after throughput or latency? It would be very wrong to simply assume that all or even most apps would prioritize throughput; at modern network speeds throughput often is sufficient and user experience is dominated by latency (both on consumer and server side), so as the parent post says, this policy can only be determined by application logic, since OS doesn't know about what this particular app needs with respect to throughput vs latency tradeoffs.


> how can the OS know if I'm after throughput or latency

Because you tell it by enabling / disabling buffering (Nagle).

And most apps do prefer throughput. Those that don't really know that they prefer latency.

> since OS doesn't know about what this particular app needs with respect to throughput vs latency tradeoffs.

I think you're mixing up determining what you want (app choice) with how to achieve that best (OS information). I was responding to the parent talking about flushing and buffer sizes specifically.


I kind of wonder if these applications are forced to do their own buffering because they have disabled Nagle's algorithm?

The old adage about people who attempt to avoid TCP ending up reinventing TCP and re-learning the lessons from the 70s...


You missed the part about many inefficient system calls. You want buffering to happen before the thing that has a relatively high per-call overhead.


If you want smart lights to pulse in sync with your microphone you shouldn’t be using TCP in the first place, here UDP is a lot more suitable.

TCP reconstructs the order, meaning a glitch in a single packet will propagate as delay for the following packets, and in the worst case accumulate into a big congestion.


I talked a bit about that in the post. When you know the network is reliable, it’s a non-issue. When you need to send a few small packets, disable Nagles. When you need to send a bunch of tiny packets across an unknown network (aka the internet) use Nagles.


Those who want more fundamental background on the matter can check this excellent seminal paper by Van Jacobson and Michael Karels [1].

In one of the Computerphile podcasts on the history of Internet congestion, it's claimed to be the most influential paper about the Internet, and apparently it has more than 9000 citations as of today [2].

Some trivia: based on this research work, Van, together with Steven McCanne, also created BPF, the Berkeley Packet Filter, while at Berkeley. It was later adopted by the Linux community and extended into eBPF, and the rest is history [3].

[1]Congestion Avoidance and Control:

https://ee.lbl.gov/papers/congavoid.pdf

[2]Internet Congestion Collapse - Computerphile:

https://youtu.be/edUN8OabWCQ

[3]Berkeley Packet Filter:

https://en.m.wikipedia.org/wiki/Berkeley_Packet_Filter


First URL has an extra `&l` on it, that 404s. Thanks for the links!


You just have to be very careful with the algorithms in that paper; they had some serious problems (apart from their basic inability to deal with faster links). I like this old but fairly damning analysis from an early Linux TCP developer:

https://ftp.gwdg.de/pub/linux/tux/net/ip-routing/README.rto


I ran into a similar phantom-traffic problem from Go ignoring the Linux default for TCP keepalives and sending them every 15 seconds, very wasteful for mobile devices. While I quite like the rest of Go, I don't see why they have to be so opinionated and ignore the OS in their network defaults.

My PR fixing that in Caddy: https://github.com/caddyserver/caddy/pull/4865
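
For context, the Go-side knob is net.Dialer.KeepAlive; per the net docs, zero means Go's default (currently 15 seconds) and a negative value disables Go's probes entirely. A hedged sketch of overriding it on an HTTP transport (this shows the general mechanism, not what the Caddy PR does):

    import (
        "net"
        "net/http"
        "time"
    )

    func newTransport() *http.Transport {
        d := &net.Dialer{
            Timeout:   30 * time.Second,
            KeepAlive: -1, // disable Go's 15s keepalive probes; defer to the OS settings
        }
        return &http.Transport{DialContext: d.DialContext}
    }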


To be fair, the Linux default of 2h does not work in most enterprise or cloud environments. One frequently encounters load balancers, firewalls, and other proxies that drop connections after around 5-15 minutes. 15 seconds sounds very aggressive though.


The default of 2h is not just a Linux default; it's straight up from the RFC.

https://www.rfc-editor.org/rfc/rfc9293.html#name-tcp-keep-al...

> Keep-alive packets MUST only be sent when no sent data is outstanding, and no data or acknowledgment packets have been received for the connection within an interval (MUST-26). This interval MUST be configurable (MUST-27) and MUST default to no less than two hours (MUST-28).


Thanks for that PR!! We greatly appreciate it.


What has this to do with the Go language? Runtime defaults don't always work for every possible situation, particularly when the runtime provides much more over a kernel interface. Investigate performance issues and if some default doesn't work for you, you can always change it.


Principle of least surprise. Nagle’s is disabled in Go, except in Windows. The OS default is to have it enabled. I thought this was probably some weird accidental configuration in git-lfs. Then it turned into “aha, this is the source of all my problems on my shitty wifi”


It reminded me of the time when Rust ignored SIGPIPE (obviously a good choice for servers) but did it universally. That's of course also violating the principle of least surprise when interrupting a pipe suddenly causes Rust to spew some exceptions.


Sshfs sets the nodelay tcp flag to off by default precisely because it's designed to transfer files and not interactive traffic, that is single keystrokes in a terminal.

This thread from 2006 could be interesting. It's about the different performances of scp and sftp https://openssh-unix-dev.mindrot.narkive.com/proARDEN/sftp-p...

Meta: the negative in nodelay makes it hard to follow some comments sometimes because of double negatives. The general best practice is to refrain from using negatives in names. This might have been TCP_GROUP_PACKETS?


OP didn't link to the issue, so here it is:

https://github.com/caddyserver/caddy/issues/5276

also, OP didn't mention that it's extremely easy to configure this, with Go itself:

https://godocs.io/net#TCPConn.SetNoDelay
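
For completeness, the opt-out itself is tiny; something along these lines (dialWithNagle is just an illustrative wrapper):

    import (
        "log"
        "net"
    )

    func dialWithNagle(addr string) (*net.TCPConn, error) {
        conn, err := net.Dial("tcp", addr)
        if err != nil {
            return nil, err
        }
        tcp := conn.(*net.TCPConn)
        // false = clear TCP_NODELAY, i.e. turn Nagle's algorithm back on.
        if err := tcp.SetNoDelay(false); err != nil {
            log.Printf("SetNoDelay: %v", err)
        }
        return tcp, nil
    }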


> OP didn't mention that it's extremely easy to configure this

Maybe not explicitly, but it was definitely mentioned:

> From there, I went into the git-lfs codebase. I didn’t see any calls to setNoDelay


That's not the same function...


In the article the author talks of WiFi interference.

Try using MAC filtering. In previous experiments it drastically improved throughput.

I know the mac address can be spoofed, provides no security and can be a pain to set up when everything is WiFi enabled, but it really helps.

All those other WiFi gadgets that belong to your neighbours are continuously trying to log in, and being rejected, all the time!


While you are at it, probably downgrade those ARP broadcasts to unicasts. Your home Wi-Fi router probably already knows all the IP address MAC address mapping; so no need for devices to send those stupid ARP broadcasts to everything.


Ironically, I imagine one of the side effects of remote work will be that choices like this don't happen as much... because it's much less likely that all your in-house language devs will do all of their performance testing on your corporate WiFi, and at least some will use congested home networks and catch this sooner, or never write it at all.

There's no such thing as a perfect language for all situations - but given that Go was not designed to run solely on low-latency clusters, one wishes it had been further tested in other environments.


This is a bit of a hyperbolic title and post, but it does seem like a real issue that the Golang devs should address. Letting the socket do its thing seems like the right way to go, although I'm not an expert in networking.

Any ideas from the devs or other networking experts here in HN?


I suspect the current behaviour will have to stay as it is because the universe of stuff that could break as a result of changing it is completely unknowable


So it’s not hyperbolic, and actually describes things as they are?


Calling it "evil" is hyperbolic.


At this point, this problem has caused me dozens if not hundreds of hours of waiting on slow transfers. It is evil. Maybe I should disable all my neighbor’s routers or convince the local authorities to open more 5Ghz spectrum… but it is what it is.


I dunno.. I am not a networking expert by any stretch, but it does seem consistent with Golang's philosophy that devs should have a deep understanding of the various levels of the stack they're working in.

Though TFA does make a fair point that in reality this doesn't happen, and slow software abounds as a result.


Disabling Nagle by default is definitely the right decision. Git LFS does the wrong thing by sending out a file in 50 byte chunks. It should be sending MTU-sized chunks.


EDIT: I originally linked to the wrong review. It's been there since the initial commit of networking: https://github.com/golang/go/commit/e8a02230f215efb075cccd41...


This is actually the review for adding back the ability to turn NODELAY on and off, it was actually in the networking code from the start https://github.com/golang/go/blob/e8a02230/src/lib/net/net.g...


Thanks! I noticed that right after I posted it. Unfortunately my non-procrastination setting kicked in and I couldn't delete it before anyone saw it.


Lol “back by popular demand”

At least it wasn’t for my initial thoughts when seeing PRs around that code “to speed up unit tests”. I’d love to see the discussions though.


Is this a problem in Go itself? Isn’t this something git-lfs should be changing in lfs only?

It seems reasonable to prefer a short delay by default, but when you are sending multi-megabyte files (lfs’s entire use case) it seems like it would be better to make the connection more reliable (e.g. nobody cares about 200ms extra delay).


git-lfs authors agree and point out regular git also disables Nagle.

https://github.com/git-lfs/git-lfs/issues/5242


> Once that was fixed, I saw 600MB per second on my internal network and outside throughput was about the same as wired.

Is the author talking about megabits or really megabytes? 112MB/s is the fastest real speed you will get on a gigabit network. I feel like the author meant to write Mbit instead of MB/s everywhere?


Good find. Yeah 800Mbits.



It's been in the code base from the start: https://github.com/golang/go/blob/e8a02230/src/lib/net/net.g...


Relatedly there was a previous HN post and discussion about Delayed ACKs and TCP_NODELAY where John Nagle himself chimed in:

https://news.ycombinator.com/item?id=10608356


Thanks for this.

I've been troubleshooting a nasty issue with RTSP streams and while I'm fairly confident golang is not responsible, this has highlighted a potential root cause for the behaviour we've been seeing (out of order packets, delayed acks).


I have an email from Nagle himself, c. 1997 telling me that it was probably a bad idea.

And I've disabled it in every server I've written since.


You can just ask him here; he's the 12th busiest user on HN ('Animats, the name of his ragdoll physics engine).


He's even here on an adjacent thread!


I'm not following.

Let's say the socket is set to TCP_NODELAY, and the transfer starts at 50 KiB/s. After a couple seconds, shouldn't the application have easily outpaced the network, and buffered enough data in the kernel such that the socket's send buffer is full, and subsequent packets are able to be full? What causes the small packets to persist?


This is the question I had from the start and I'm surprised that I had to scroll this far down.

Nagle's algorithm is about what do to when the send buffer isn't full. It is supposed to improve network efficiency in exchange for some latency. Why is it affecting throughput?

Is Linux remembering the size of the send calls in the out buffer and for some reason insisting on sending packets of those sizes still? I can't imagine why it would do that. If anything it sounds like a kernel bug to me.

For large transfers it still likely makes sense to always send full packets (until the end) like TCP_CORK but it seems that it should be unnecessary in most cases.


Because of this post I looked up how I disable Nagle's algorithm on Windows. I've now done it (according to the instructions at least). Let's see how it goes. I'm in central Europe on gigabit ethernet and fiber, with more than 50% of my traffic going over IPv6 and most European sites under 10ms away.


> not to mention nearly 50% of every packet was literally packet headers

I was just looking at a similar issue with grpc-go, where it would somehow send a HEADERS frame, a DATA frame, and a terminal HEADERS frame in 3 different packets. The grpc server is a golang binary (lightstep collector), which definitely disables Nagle's algorithm as shown by strace output, and the flag can't be flipped back via the LD_PRELOAD trick (e.g. with a flipped version of https://github.com/sschroe/libnodelay) as the binary is statically linked.

I can't reproduce this with a dummy grpc-go server, where all 3 frames would be sent in the same packet. So I can't blame Nagle's algorithm, but I am still not sure why the lightstep collector behaves differently.


Found the root cause from https://github.com/grpc/grpc-go/commit/383b1143 (original issue: https://github.com/grpc/grpc-go/issues/75):

    // Note that ServeHTTP uses Go's HTTP/2 server implementation which is
    // totally separate from grpc-go's HTTP/2 server. Performance and
    // features may vary between the two paths.
The lightstep collector serves both gRPC and HTTP traffic on the same port, using the ServeHTTP method from the comment above. Unfortunately, Go's HTTP/2 server doesn't have the improvements mentioned in https://grpc.io/blog/grpc-go-perf-improvements/#reducing-flu.... The frequent flushes mean it can suffer from high latency with Nagle enabled, or from high packet overhead with Nagle disabled.

tl;dr: blame bradfitz instead :)


One specific thing I wonder about is how this setting effects Docker, specifically when pushing/pulling images around.

In both the GitHub Docker and Moby organizations, "SetNoDelay" doesn't return any results. I wonder if performance could be improved by making connections with `connection.SetNoDelay(false)`.


I have a hypothesis here. Go is a language closely curated by Google, and the primary use of Go in Google is to write concurrent Protobuf microservices, which is exactly the case of exchanging lots of small packets on very reliable networks.


Nagle's algorithm is designed to stop packlets.

If you're not sending a lot of packlets you shouldn't be using Nagle's algorithm. It's on by default in systems because without it interactive shells get weird, and there are few things more annoying to sysadmins than weird terminal behavior, especially when shit is hitting the fan.


But it seems that it shouldn't be limiting packets to 50 bytes (which is apparently the size of buffers used by the application in send/write). Once the send buffer is full the Kernel should be sending full packets.


What's a packlet?


I don't know Golang, but what does the function in git-lfs that writes to the socket look like? Is it writing in 50-byte chunks? Why?

Because I guess even with TCP_NODELAY, if I submit reasonably huge chunks of data (e.g. 4K, 64K...) to the socket, they will get split into reasonably-sized packets.


The code in question seems to be this portion of SendMessageWithData in ssh/protocol.go [1]:

  buf := make([]byte, 32768)
  for {
    n, err := data.Read(buf)
    if n > 0 {
      err := conn.pl.WritePacket(buf[0:n])
      if err != nil {
        return err
      }
    }
    if err != nil {
      break
    }
  }
The write packet size seems to be determined by how much data the reader returns at a time. That could backfire if the reader were e.g. something like line at a time (no idea if something like that exists in Golang), but that does not seem to be the case here.

[1] https://github.com/git-lfs/git-lfs/blob/d3716c9024083a45771c...



SACKs are the second most important/useful TCP extension after window scaling. SACKs have had basically universal support for more than a decade (like, 95% of the traffic on the public internet negotiated SACKs in 2012). Anyone writing a new production TCP stack without SACKs is basically committing malpractice.


I've learned the hard way to avoid git-lfs at all costs.

Main issue is that git-lfs is NOT "it just works".

The migration process if you mistakenly in/excluded a file is quite painful and bug prone.

I'd rather just exclude big blobs from git if possible.


Side-note: I wonder why the author has decided to include overflow:hidden in an effort to hide the page scroll bar.


Must be the theme. It’s a shitty theme but I can’t be bothered to get a better one atm. It’s pretty far down the todo list.


When using TCP_NODELAY, do you need to ensure your writes are a multiple of the maximum segment size? For example, if the MSS is 1400 and you are doing writes of 1500 bytes, does this mean you will be sending packets of size 1400 and 100?


What if there are jumbo frames all the way to the client? You are throwing away a lot of bandwidth. What if there is VXLAN, like in k8s? You’ll be sending two packets, one tiny and one full. Use Nagle and send what you have when you have it. Let the TCP stack do its job. Work on optimization when it is actually impactful to do so. Sending a packet is cheaper than reading a DB.


The big reason for no-delay is the really bad interaction between Nagle's algorithm and delayed ACK for request-response protocols like the start of a TLS connection. It's possible for the second handshake packet the client/server sends to be delayed significantly because one of the parties has delayed ACK enabled.

Ideally, the application could just signal to the OS that the data needs to be flushed at certain points. TCP_NODELAY almost lets you do this, but the problem is it applies to all write()s, including ones that don't need to be flushed. For example, if you are an HTTP server sending a 250MB response then only the last write needs to be 'flushed'. Linux has some non-POSIX options that give you more control, like TCP_CORK (via setsockopt), which lets you signal these boundaries explicitly, or MSG_MORE, which is a bit more convenient to use.
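
A rough, Linux-only sketch of the TCP_CORK approach from Go; the standard library doesn't expose the option, so this goes through SyscallConn and golang.org/x/sys/unix, and withCork is just an illustrative helper, not how any particular project does it:

    import (
        "net"

        "golang.org/x/sys/unix"
    )

    // withCork corks the connection, runs send (which may issue many small
    // writes), then uncorks, which flushes any remaining partial segment.
    func withCork(c *net.TCPConn, send func() error) error {
        raw, err := c.SyscallConn()
        if err != nil {
            return err
        }
        setCork := func(on int) error {
            var optErr error
            if err := raw.Control(func(fd uintptr) {
                optErr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_CORK, on)
            }); err != nil {
                return err
            }
            return optErr
        }
        if err := setCork(1); err != nil {
            return err
        }
        defer setCork(0) // uncork: flush whatever is left, even a partial segment
        return send()
    }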


Please add links to the GitHub issues in the blog


this has been known forever, very inflammatory article imo.


interesting article


> I would absolutely love to discover the original code review for this and why this was chosen as a default. If the PRs from 2011 are any indication, it was probably to get unit tests to pass faster. If you know why this is the default, I’d love to hear about it!

Please hold while I pick my fallen jaw up off the floor.

The parents of the Internet work at Google. How could this defect make it to production and live for 12+ years in the wild? I guess nothing fixes itself, but this shatters the myth of Google(r) superiority. It turns out people are universally entities comprised of sloppy, error-prone wetware.

At the very least there should be a comment in caps and in the documentation describing why this default was chosen and in what circumstances it's ill-advised. I'm not claiming to be remarkably exceptional and even I bundle such information on the first pass when writing the initial code (my rule: to ensure a good future, any unusual or non-standard defaults deserve at least a minimal explanation) (Full-Disclosure: I was rejected after round 1 of Google code screens 3 times, though have been hired to other FAANG/like companies).

Yeesh.

p.s. Be sure to brace yourself before reading https://news.ycombinator.com/item?id=34179426#34180015


> It turns out people are universally entities comprised of sloppy, error-prone wetware.

The line from Agent K in 'Men In Black' comes to mind here.

More jobs than not, I left with at least one 3+ month old PR of changes for stability I was 'not allowed to merge because we didn't have the bandwidth to regression (or do cross-ecosystem-update-on-lib)'. Yes I made sure to explain to my colleagues why I did them and why I was mentioning them before I left.

Most eventually got applied.

> (I've been rejected after round 1 of Google code screens 3 times, though have been hired to other FAANG-like companies). Sheesh.

I've found that the companies that hire based on quality-of-bullshitting sometimes pay more, but are far less satisfying than companies that hire on quality-of-language-lawyering (i.e. you understand the caveats of a given solution rather than sugar coating them).


> Please hold while I pick my fallen jaw up off the floor.

> p.s. Be sure to brace yourself before reading https://news.ycombinator.com/item?id=34179426#34180015

Both of these snide comments assume that the speculative explanations are correct, which they very well may not be.


Google's interview level is set so they don't need to fire too many bad people; it's not about being superior (err on the side of caution when hiring).

This might change now in this downturn, but when I was working at Google in 2008, we were the only tech company where nobody was fired because of the recession (there were offices closed, and people had the option to relocate, although not everybody took that option).

If you compare it with Facebook, they just fired a lot of people.

In short: you probably just didn't have luck, you should try again when you can.


Google designs for Google. In their world everyone uses a latest gen MacBook with maxed out RAM on gigabit fiber.


The default is glinux, most of the company are using chromebooks.


First half yes, second half no. Everyone quickly finds out that Chromebooks can't hack it spec-wise, even for simple Chrome Remote Desktop.


As a software engineer at Google, I can say that all of my work is done on a Chromebook remoted into a gLinux desktop.

Macbooks are not allowed unless you get explicit exceptions for specific business reasons (QA iOS apps, iOS dev work, etc).


Did that happen in 2022? I'm a Xoogler as of spring 2022 and when I left everyone on my team used a macbook, several of them new macbooks, and I know several people got exceptions to get more powerful macbooks during WFH.


The latest chromebooks are actually really great*. Many of my team members who were on mac are switching back to ChromeOS for convenience.

* Great if you have a remote linux workstation to do the heavy compilation and test runs


The "On gigabit fiber" part is true, though.


Most engineers have work desktops which run GLinux and they also have macbooks.


I said the company, not engineers. And macbooks are used as chromebooks; I haven't used anything outside of Chrome/term. The dev environment is glinux. osx/m1 is not supported without getting exceptions and isn't worth the trouble.


Google has more end users on slow networks and old devices than almost anyone. Throttle your browser with the browser tools and see what loads quicker, google.com or a website of your choice. Once you've loaded google.com, do a search.


Does it matter from the server point of view?


How can you call it a defect when it might have been a deliberate decision? Your whole post sounds like you're upset Google didn't hire you lmao


The entire post is embarrassing and makes me think that Google made the correct decision. Also, it seems that people that want to change the default behaviour can simply use the TCPConn.SetNoDelay function.


Decisions deserve documentation (because a footgun warning is preferable to spontaneous unintended penetration).


It is documented. https://pkg.go.dev/net#TCPConn.SetNoDelay

> The default is true (no delay), meaning that data is sent as soon as possible after a Write.
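For completeness, a minimal sketch (the address is a placeholder) of going the other way: passing false to SetNoDelay re-enables Nagle's algorithm for that one connection.

    package main

    import (
        "log"
        "net"
    )

    func main() {
        conn, err := net.Dial("tcp", "example.com:80") // placeholder address
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        // Go sets TCP_NODELAY by default; SetNoDelay(false) turns Nagle's
        // algorithm back on for this connection only.
        if tcp, ok := conn.(*net.TCPConn); ok {
            if err := tcp.SetNoDelay(false); err != nil {
                log.Fatal(err)
            }
        }
    }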


Huh, really? There is a public API to change the behavior; that's about it. There would be a million pages of documentation by now if every decision needed documentation.


As the amount of confusion and back-and-forth in this story thread proves, such topics deserve more attention rather than being lumped in alongside less consequential matters. Ideally the goal is to spread the knowledge and expertise to as many humans as possible by making it accessible.


The only thing this story thread proves is that young folks aren’t being taught basic networking or distributed systems functionality and history.

TCP options and disabling Nagle’s algorithm was a topic you learn when introduced to RPCs, maybe in 3rd or 4th year, at least in the 90s.


It’s not a defect, and it’s not unusual to enable TCP_NODELAY.

As a default, it’s a design decision. It’s documented in the Golang Net library.

I remember learning all of this stuff in 1997 in my first Java job and witnessing same shock and horror at TCP_NODELAY being disabled (!) by default when most server developers had to enable it to get any reasonable latency for their RPC type apps, because most clients had delayed TCP ACKs on by default. Which should never be used with Nagle’s algorithm!

This Internet folklore gets relearned by every new generation. Golang’s default has decades of experience in building server software behind the decision to enable it. As many other threads here have explained, including Nagle himself.


> The parents of the Internet work at Google. How could this defect make it to production and live for 12+ years in the wild?

Google is a big company; the “parents of the internet”, insofar as they work at Google, probably work nowhere near this, in terms of scope of work.


Would be naive to think corporate incentives are not influencing code and protocols:

> HTTP/3 was standardized 6 months ago and Google has been using it widely for years, but it's not supported by Go.

> WebTransport originally included a P2P/ICE component, but no longer does.

> HTTP/3 doesn't even have an option to work without certificate authorities.


> HTTP/3 doesn't even have an option to work without certificate authorities.

Unencrypted HTTP is dead for any serious purpose. Any remaining use is legacy, like code written in Basic.

With Letsencrypt on one hand, and single-binary utilities to run your own local CA on the other hand, this should pose no problem.


> this should pose no problem.

It poses a stack of problems a foot high.

Some random examples:

Docker, Kubernetes, etc... use HTTP by default. Not HTTPS or HTTP/3. Unencrypted HTTP 1.1! This is because containers are snapshots and can't contain certificates. Injecting certificates is a pain in the butt, because there is no standardised mechanism for it.

Okay! You inserted a certificate! For... what name? Is it the "site host name", or the "server name"? Either one you pick will be wrong for something. Many web apps expect to see a host header on the backend that matches the frontend, and will poop themselves if you give them a per-machine (or per-container) certificate. I've seen cloud load balancers that have the opposite problem and expect valid per-machine certificates!

If you pick per-machine certificates, then by definition you have to man-in-the-middle, which breaks a handful of apps that require (and enforce!) end-to-end cryptography.

Okay, fine, you have Let's Encrypt issuing per-site certificates, automatically, via your public endpoint. Nothing could be easier! Right up until someone in secops says that you also need to make the non-production sites have "private endpoints". Now you need two distinct mechanisms for certificate issuance, one internal-only and one public. Double the fun.

It just goes on and on: You'll also likely have to deal with CDNs, API gateways, Lambda/Functions, S3 / blob accounts, legacy virtual machines, management endpoints, infrastructure consoles, and so on. Some of these have integrated issuance/renewal capability, some don't. Some break because of your DNS CAA records. Some don't. Some send notifications before expiry, some don't. And so forth...

As a random example, I recently had to deal with a GIS product that shall not be named that requires an HTTPS REST API to set or change its certificates. Yes. You heard me. HTTPS. To set a valid certificate, you first have to automate against an HTTPS endpoint with an invalid certificate, restart the service, do a multi-minute wait in a retry loop, and then continue the automation. Failure to handle any one of the dozen failure scenarios and corner cases will lead to a dead service that won't start at all. Fun stuff.

Automated certificate issuance for complex architectures is definitely not a solved problem in general.


What are the downsides of using http/1.1 or unencrypted http2 from a docker container?

I’m imagining an application server in a docker container talking to a load balancer, in the same data center. I can see some advantages to http2 (head of line blocking, header compression and multiplexing probably bring some performance benefits). But why do you want http3?


Most http/2 implementations enforce valid certificates, just like http/3.

gRPC requires http/2.

Some software like the aforementioned accursed GIS product refuse to work over unencrypted HTTP. They even ignore the load balancer headers like X-Forwarded-Proto just to be extra irritating.


From my experience with developing gRPC-based microservices, I don't remember certificates being such a big deal.

Mount a filesystem subtree with them inside a container; problem basically solved.


This isn't even wrong; however, you've confused access to certificates with their issuance, validity, and rotation for a given runtime, which is the OP's point: it's very complicated.

There are utilities like Let's Encrypt and Kubernetes Cert Manager that make this somewhat easier, if their defaults work for you. But the devil is in the details.


The real crime is using HTTP for internal network communication.


Instead of?


A custom Enterprise PKI made up of hand-rolled shell scripts poking proprietary public cloud key vaults with OpenSSL, the bastion of quality and robustness.


That's not much worse than the abomination you described in your initial comment.


I see what you did there.


> this shatters the myth of Google(r) superiority. It turns out people are universally entities comprised of sloppy, error-prone wetware.

Golang was created with the specific goal of sidestepping what had become a bureaucratic C++ "readability" process within Google, so yes. Goodhart's law in action.


The problem with C++ is not getting readability, but footguns! footguns everywhere! Plus the compile time.


That’s not at all true. Go has readability as well.


It has it now. For a long, long time, readability for Golang at Google was "Read some of the other Go code out there. Try to make it look like that."

(I don't have enough historical knowledge to comment on the notion that Go was invented to sidestep the need to get more team members readability in C++ though).


Googlers' network environment would be extremely good, so it's not weird.


I think one of the most insightful things I've learned in life is that books, movies, articles, etc. have warped my perception of the "elites." When you split hairs, there is certainly a difference in skill/knowledge _but_ at the end of the day, everyone will make mistakes. (error-prone wetware, haha)

I totally get it though. I mean, as a recent example, look at FTX. I knew SBF and was close to working for Alameda (didn't want to go to Hong Kong tho). Over the years I thought that I was an idiot for missing out and that everyone there was a genius. Turns out they weren't and not only that _everyone_ got taken for a ride. VCs throwing money, celebrities signing to say anything, politicians shaking hands, etc.

Funny, I did see a leaked text when Elon was trying to buy Twitter, SBF was trying to be part of it and someone didn't actually think he had the money, so maybe someone saw the BS.

All that aside tho, yea, this is something I forget and "re-learn" all the time. A bit concerning if you think about it too much! I wonder if that's the same for other fields of work. I mean, if there was an attack on a power grid, how many people in the US would even know _how_ to fix it? Are the systems legacy? I've seen some code bases where one file could be deleted and it would take tons of hours to even figure out what went wrong, lol.


There's nothing elite about being a programmer at any of the big tech companies. It's software engineering and design. It's the same everywhere, just different problem domains.

I've worked with some of the highest ranking people in multiple large tech companies. The truth is there is no "elite". CTOs of the biggest companies in the world are just like you and me.


>There's nothing elite about being a programmer at any of the big tech companies. It's software engineering and design. It's the same everywhere

I just can't agree with this. I have worked with tons of companies and generally, the "sweet-spot" is new mid-sized firms. There is a considerable difference in quality, on almost every metric when working with a bad firm. I've worked with a Fortune 10 company and it was one of the worst applications of "software and design" I've ever seen.

1000 layers of bureaucracy and relatively bad salaries. I'm not looking to speak ill of anyone but we shouldn't pretend you can hire an army of top notch SDEs for bottom of the barrel pay.

The result is a mess.

>I've worked with some of the highest ranking people in multiple large tech companies. The truth is there is no "elite". CTOs of the biggest companies in the world are just like you and me.

I can certainly agree with this in a sense. Everyone makes mistakes. Nobody is "genius" like you see in movies. However, there is a difference in skill and experience (save nepotism or pure luck). If you want to say we all have the same potential, I 100% agree. As it stands though, if you took the "average" developer and I mean truly the _average_, not skewed by personal experience, the average FAANG dev is going to be "better."

I mean, look at how many programmers can't fizzbuzz.


TLDR: Golang uses TCP_NODELAY by default on sockets. Seems wild. I guess it's time to disable TCP_NODELAY in Linux to fix bad software.


Yeah, let's just remove TCP_NODELAY and fuck all latency-sensitive applications.


Actual latency sensitive apps can always use SOCK_RAW and implement their own TCP. In fact, for serious low latency you need to bypass the entire kernel stack too, like DPDK.


“I know, instead of using TCP_NODELAY we can just roll our own ad-hoc reimplementation of TCP!”


For serious though. TCP has a handshake to make a connection. UDP doesn’t and if you’re going for low-latency, you probably don’t want retransmission anyway, since if you miss the packet it’s too late to do anything about it. If you need the guarantees from TCP, you probably aren’t actually solving a low-latency problem.


Low latency SQL transactions? Not microseconds, but a ms or two is doable unless you have transient connections or Nagle + delayed acks.


My goodness. It (git-lfs, which triggered this investigation) essentially insists on sending each write as a tiny individual packet (resulting in umpteen thousands of them) instead of using TCP's built-in batching mechanism (Nagle's algorithm).


I believe it just emits at least one packet on each 'write' system call. As long as your 'write' invocations are larger blocks, I'd expect you'd see very little difference with TCP_NODELAY enabled or disabled. I've always assumed you want to limit system calls, so I'd always taken it as better practice to encode to a buffer and invoke 'write' on larger blocks. So this feels like a combination of issues.

Regardless, overriding a socket parameter like this should be well documented by Golang if that's the desired intent.
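For illustration, a minimal sketch of the buffering approach mentioned above (the address and payload are placeholders): wrapping the connection in a bufio.Writer turns thousands of small application-level writes into a handful of large write() calls, whatever the TCP_NODELAY setting.

    package main

    import (
        "bufio"
        "fmt"
        "log"
        "net"
    )

    func main() {
        conn, err := net.Dial("tcp", "example.com:9000") // placeholder address
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        // 64 KiB buffer: the tiny writes below are coalesced in userspace,
        // so the kernel sees a few large writes instead of thousands of
        // small ones.
        w := bufio.NewWriterSize(conn, 64*1024)
        for i := 0; i < 10000; i++ {
            fmt.Fprintf(w, "record %d\n", i)
        }
        if err := w.Flush(); err != nil { // push out the final partial buffer
            log.Fatal(err)
        }
    }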


If you want to buffer, you can still buffer. There’s no advantage letting the OS do it, and decades of documented disadvantages.


Whether this is the right or wrong thing depends 100% on what you’re trying to do. For many applications you want to send your message immediately because your next message depends on the response.


Very rarely this is the case. From the application’s perspective yes. From a packet perspective… no. The interface is going to send packets and they’ll end up in a few buffers after going through some wires. If something goes wrong along the way, they’ll be retransmitted. But the packets don’t care about the response, except an acknowledgment the packets were received. If you send 4000 byte messages when the MTU is 9000, you’re wasting perfectly good capacity. If you had Nagle’s turned on, you’d send one 8040 byte packet. With Nagle’s you don’t have to worry about the MTU, you write your data to the kernel and the rest is magically handled for you.


They're really in a bubble at Google.


Nice find. It would also help to suggest a workaround. Perhaps "overloading" the function? Not a Golang expert here. But providing a solution (other than waiting for upstream) would be beneficial for others.


There is a public, documented API to turn Nagle back on.

(Please don’t.)


Can you elaborate? Your suggestion not to turn it back on would result in the OP having to suffer slow upload speeds despite having available bandwidth an order of magnitude larger. How is that a good outcome?


The correct solution is to leave Nagle off but do larger writes. This will improve performance on all networks, not only noisy ones, and with no overhead.

Go provides ReaderFrom for the general case of letting writers control the level of buffering, and this also provides massive benefits beyond better TCP flow control (e.g. splice and sendfile are used where applicable).
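As a hedged sketch of that (the file name and address are placeholders): handing the transfer to io.Copy lets the connection's ReadFrom take over, which on Linux can use sendfile/splice for an *os.File source instead of shuttling every byte through userspace.

    package main

    import (
        "io"
        "log"
        "net"
        "os"
    )

    func main() {
        f, err := os.Open("bigfile.bin") // placeholder file
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        conn, err := net.Dial("tcp", "example.com:9000") // placeholder address
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        // io.Copy detects that *net.TCPConn implements io.ReaderFrom and
        // calls conn.ReadFrom(f), so the data is sent in large chunks with
        // few syscalls (using sendfile under the hood where possible).
        if _, err := io.Copy(conn, f); err != nil {
            log.Fatal(err)
        }
    }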


It is the correct default, and anyone who states otherwise has not spent a sufficient number of hours debugging obscure network latency issues, especially when they interact with any kind of complex software stack on top of them.



