* What are the reasons for disabling TCP timestamps by default? (If you can answer) will they eventually be enabled by default? (The reason I'm asking is that Linux uses the TS field as storage for syncookies, and without it, it will drop the WScale and SACK options, greatly degrading Windows TCP performance in case of a SYN flood.[1])
* I've noticed "Pacing Profile : off" in the `netsh interface tcp show global` output. Is that the same as tcp pacing in fq qdisc[2]? (If you can answer) will it be eventually enabled by default?
Windows historically defaulted to accepting timestamps when negotiated by the peer, but didn't initiate the negotiation. There are benefits to timestamps and one downside (12 bytes of overhead per packet). Re. syncookies: that's an interesting problem, but under a severe SYN attack, degraded performance is not going to be the biggest worry for the server. We might turn them on for the other benefits, but there are no committed plans. Re. pacing profile: no, that's pacing implemented at the TCP layer itself (unlike the fq qdisc) and is an experimental knob, off by default.
re. syncookies: Linux by default starts issuing syncookies when the listening socket's backlog overflows, so it may be accidentally triggered even by a small connection spike. (This, of course, is not an excuse for a service misconfiguration, but it is quite common: somaxconn on Linux before 5.4 used to be 128, and many services use the default.)
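To make that concrete, a minimal sketch of why fixing only the application side isn't enough; the kernel silently clamps whatever backlog you pass to listen():

    /* The backlog passed to listen() is silently clamped to
     * net.core.somaxconn (128 on Linux before 5.4, 4096 since),
     * so a generous application-side value alone doesn't help. */
    #include <sys/socket.h>

    int make_listener(int fd) {
        return listen(fd, 4096); /* effective backlog = min(4096, somaxconn) */
    }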
re: pacing: Awesome!! I would guess it is similar to Linux's "internal implementation for pacing"[1]. Looking forward to it eventually graduating from being experimental! As a datapoint: enabling pacing on our Edge hosts (circa 2017) resulted in a ~17% reduction in packet loss (w/ CUBIC) and even fully eliminated queue drops on our shallow-buffered routers. There were a couple of road bumps (e.g. "tcp: do not pace pure ack packets"[2]), but Eric Dumazet fixed all of them very quickly.
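For anyone who wants to try the Linux side of this, per-socket pacing can be capped via SO_MAX_PACING_RATE (honored by the fq qdisc, or by TCP's internal pacing on newer kernels). A minimal sketch; the ~1 Gbit/s cap here is just an example value:

    #include <stdint.h>
    #include <sys/socket.h>

    /* Cap this socket's pacing rate at ~1 Gbit/s (the option takes bytes/sec). */
    static int cap_pacing(int fd) {
        uint32_t rate = 125u * 1000 * 1000;
        return setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE,
                          &rate, sizeof(rate));
    }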
Thanks for the heads up. We will investigate to see what fraction of connections end up losing these options.
Pacing TCP is certainly on our roadmap. Our QUIC implementation MsQuic paces by default already.
Do you have any details on how or when Microsoft will roll out QUIC in Windows? Will it work by just specifying the QUIC protocol when creating a socket, like with TCP?
I have a question: why is it that when opening two sockets on Windows and connecting them through TCP, there is about a 40% difference in transfer rate when sending from socket A to B compared to sending from B to A?
That's not expected. Are you using loopback sockets or are these sockets on different endpoints? Is this unidirectional or bidirectional traffic, i.e. are you doing both transfers from A to B and B to A simultaneously?
I see the issue both on the same computer (so loopback) and on two computers on the same LAN. I tested with unidirectional traffic. It's pretty easy to test for yourself. The problem appears to be with the TCP protocol implementation; for UDP, the transfer speed is the same in both directions.
I cannot comment on queuing disciplines and limits in future products. Re. TCP_NOTSENT_LOWAT, you may want to look at the Ideal Send Backlog API, which allows an application to keep just slightly more than the BDP queued, maintaining maximum throughput while minimizing the amount of data queued: https://docs.microsoft.com/en-us/windows/win32/winsock/sio-i...
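For anyone unfamiliar with that API, the query side boils down to roughly this (a sketch with error handling elided; applications are expected to re-query when an SIO_IDEAL_SEND_BACKLOG_CHANGE notification fires, since the estimate moves with the path):

    #include <winsock2.h>
    #include <ws2tcpip.h>

    /* Ask the stack how much send backlog is "ideal" for this connection
     * right now (the value tracks the path's BDP estimate). */
    static ULONG query_isb(SOCKET s) {
        ULONG isb = 0;
        DWORD bytes = 0;
        if (WSAIoctl(s, SIO_IDEAL_SEND_BACKLOG_QUERY, NULL, 0,
                     &isb, sizeof(isb), &bytes, NULL, NULL) != 0)
            return 0;
        return isb; /* keep roughly this many bytes queued on the socket */
    }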
I got hit by the exact same issue described in the Fermilab paper, namely packet reordering caused by Intel drivers. It took me several days to diagnose the problem. Interestingly enough, the problem virtually disappeared when running tcpdump, which, after a lot of reading on the innards of the Linux TCP stack and prodding with eBPF, eventually led me to conjecture that it was a scheduling/core-placement issue. Pinning my process clearly made the problem disappear, and then finding the paper nailed it.
Networks are not my specialty (I come from a math background, am self-taught, and had always dismissed them as mere plumbing), but I have to say that I came out of this difficult (for me) investigation with a great appreciation for networking in general, and now enjoy reading anything I can find about it.
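For reference, the pinning mentioned above can be done with taskset -c N from the shell, or programmatically; a minimal Linux sketch:

    #define _GNU_SOURCE
    #include <sched.h>

    /* Pin the calling process/thread to one core so the scheduler can no
     * longer migrate it and break the NIC-queue <-> CPU affinity. */
    static int pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return sched_setaffinity(0, sizeof(set), &set);
    }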
It's never too late to learn, and I have yet to find something in software engineering which is not interesting once you take a closer look at it!
I ended up reading quite a bit about congestion control while investigating a different issue (sending data from a 25 Gb/s box to a 1 Gb/s one over a 10 Gb/s, 100 ms-latency link didn't work well, since the bigger NIC would saturate the smaller one, which then dropped packets, and this caused the TCP window to shrink significantly), and it was extremely interesting. The whole problem of having multiple agents, with different strategies and incomplete information, competing to maximize their network throughput also reminded me of economics.
TCP pacing essentially solved the problem (and where it isn't available, brutally traffic-shape).
The original symptom was very low throughput, which is what prompted the investigation. Without tcpdump: low throughput and high reordering; with tcpdump: high throughput (which is why I couldn't figure out what was going on).
I'd be very interested if someone with kernel experience could tell me what's specific about tcpdump.
Pcap captures are not multithreaded, so you are effectively pinning to a single core.
This entire thread is interesting because it is highlighting a similar problem with my virtualized router. When I pinned the router VM to specific CPUs, the problem went away. I switched to OpenStack, which doesn't give me the best control over CPU capabilities, and the problem has manifested in a worse form.
My uninformed opinion is that there are underlying concurrency problems with multithreaded userland/kernel interaction and some NIC drivers (consumer Intel and Broadcom hardware).
Ah! Thanks for this.
Being single-threaded does not prevent your process from being migrated from one core to another, though, no? Or do you mean that pcap captures are pinned?
I had a similar issue with Windows kernels "recently" (2016~?)...
I don't have the memory or patience to write a long and inspiring blog post, but it comes down to:
Even with IOCP/multiple threads: network traffic is single-threaded in the kernel; even worse, there's a mutex there, putting the effective PPS limit on Windows at something like 1.1M at 3.0 GHz.
The task of this machine was /basically/ a connection multiplexer with some TLS offloading; so listen on a socket, get an encrypted connection, check your connection pool and forward where appropriate.
Our machine basically sat waiting (in kernel space) for this lock 99.7% of the time; 0.3% was spent on SSL handshaking.
We solved our "issue" by spreading the load over many more machines and giving them low-core-count, high-clock-speed Xeons instead of the normal complement of 20-vCPU Xeons.
AFAIK that issue persists; I'd be interested to know if someone else managed to coerce Windows into doing the right thing here.
I did some work optimizing a similar problem, but simpler and on another OS[1]. The basic concept that worked was Receive Side Scaling (RSS), which was developed by Microsoft for Windows Server. Did you come across that? It needs support in the NIC and the driver, but Intel GigE cards do it, so you don't need the really fancy cards. I don't know what the interface is like on Windows, but inbound RSS for FreeBSD is pretty easy, and skimming the Windows docs, it seemed like you could do more advanced things there.
The harder part was aligning the outgoing connections; for max performance, you want all of the related connections pinned to the same CPU, so that there's no inter-CPU messaging. For me, that meant a frontend connection needed to hash to the same NIC queue as the backend connection; for you, that would be all of the demultiplexed connections on the same queue as the multiplexed connection. Windows may have an API to make connections that will hash properly; FreeBSD didn't (doesn't?), so my code had to manage the local source IP and port when connecting to remote servers so that the connection would hash as needed. Assuming a lot of connections, you end up needing to self-manage source IP and port anyway, and at least HAProxy has code for that already, but running the RSS hash to qualify ports was new development, and a bit tricky because bulk calculating it gets costly.
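For the curious, "running the RSS hash to qualify ports" looks roughly like the sketch below: compute the software Toeplitz hash over candidate 4-tuples until one lands on the queue you want. This is a sketch, not my original code: the 40-byte key has to be read from the NIC first, and the (hash & 0x7f) % num_queues step models a 128-entry round-robin indirection table, an assumption that varies by NIC and OS:

    #include <stddef.h>
    #include <stdint.h>

    /* Software Toeplitz hash, as specified for RSS NICs. */
    static uint32_t toeplitz_hash(const uint8_t *key, size_t keylen,
                                  const uint8_t *data, size_t datalen) {
        uint32_t hash = 0;
        uint64_t window = 0;  /* sliding 64-bit register of key bits */
        size_t keyidx;

        for (keyidx = 0; keyidx < 8 && keyidx < keylen; keyidx++)
            window |= (uint64_t)key[keyidx] << (56 - 8 * keyidx);

        for (size_t i = 0; i < datalen; i++) {
            for (int bit = 7; bit >= 0; bit--) {
                if (data[i] & (1u << bit))
                    hash ^= (uint32_t)(window >> 32); /* top 32 key bits */
                window <<= 1;
            }
            if (keyidx < keylen)            /* refill the emptied low byte */
                window |= (uint64_t)key[keyidx++];
        }
        return hash;
    }

    /* tuple = {src ip[4], dst ip[4], src port[2], dst port[2]}, network order.
     * Scan the ephemeral range for a source port that hashes to wanted_queue. */
    static int pick_source_port(const uint8_t key[40], uint8_t tuple[12],
                                unsigned wanted_queue, unsigned num_queues) {
        for (unsigned port = 32768; port < 61000; port++) {
            tuple[8] = (uint8_t)(port >> 8);
            tuple[9] = (uint8_t)(port & 0xff);
            uint32_t h = toeplitz_hash(key, 40, tuple, 12);
            if ((h & 0x7f) % num_queues == wanted_queue)
                return (int)port;
        }
        return -1; /* nothing in the range hashed where we wanted */
    }

The "bulk calculating it gets costly" part is visible here: it's 96 bit-tests per candidate tuple, unless you precompute per-byte lookup tables.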
Once I got everything set up well with respect to CPUs, things got a lot better; I still had some kernel bottlenecks, though. I wouldn't know how to resolve that for Windows, but there were some easy wins for FreeBSD.
Low core count is the right way to go, though; I think the NICs I used could only do 16-way RSS hashing, so my dual 14-core Xeons (2690v4) weren't a great fit; 12 cores were 100% idle all the time. Something with a power-of-two core count would be best.
Email in profile if you want to continue the discussion off HN (or after it fizzles out here).
[1] Load balancing/proxying, but no TLS and no multiplexing, on FreeBSD.
Do you actually use RSS via options RSS / options PCBGROUP? I've tried it several times, and it's just so hard to get right and have matching cores / RX rings, etc. I've made it work with a local patch to nginx, but it was so fragile that I abandoned it.
I had been thinking that RSS/PCBGROUP was totally abandoned and could potentially be removed.
I no longer work where I did this (and it's been shut down, as it was a transitional proxy), so I can't be 100% sure what the kernel configuration was; I was able to release patches on the HAProxy mailing list, although they weren't incorporated, but at least I can reference them [1].
But yes, I think I ended up using both RSS and PCBGROUP. This was on a server running only one application (plus like sshd and crond and whatever), so it was dead simple to line up listen-socket RSS and CPU affinity; I had a config generator script that would look at the number of configured queues and tell HAProxy process 0 to bind to CPU 0 and RSS queue 0, up until I ran out of RSS queues; we needed a config generator script anyway, because the backend configuration was subject to frequent changes. If it was only listen sockets, RSS would have been sufficient without needing PCBGROUP, but locking around opening new outgoing sockets was a bottleneck, and PCBGROUP helped considerably, though it was still a bottleneck. This was on FreeBSD 12.
Edit: I also found some patches[2] I sent to freebsd-transport that I don't know if anyone saw; I don't remember if I updated the patches after this... I know I tried some more stuff that I wasn't able to get working. Don't apply these patches blindly, but these were some of the things I had to fiddle with anyway. I think I saw there was some stuff in 13 that likely made outgoing connections better.
On further reflection, I just want to emphasize how much of an improvement RSS/PCBGROUP and the couple of minor tweaks made for this use case. With unmodified FreeBSD 12 and HAProxy and the load we had, there was basically zero concurrency available: you could run as many processes (or threads) as you wanted, and the capacity would be the same, and it was sad; I think we could only get about 100k clients on a box before it would run out of steam.
With everything tweaked, we got to 2M clients per server, and actually it was hard to find the limit, because I wasn't able to direct enough traffic to the machines under test.
The software and configuration changes weren't big, but they had a huge impact. On the other hand, if RSS and PCBGROUP weren't in the kernel, I don't think I would have been able to add something similar, and we would have had to do something wild and crazy (or try Linux and see if it would do the job). Now, I really did want to write a raw-packet TCP proxy in userspace, but I knew it would be a lot easier to manage and quicker to get working with something off the shelf.
Of course, maybe there's a better solution to the root bottleneck, which was always opening a new outgoing tcp connection; even with all the tweaks, that was still the bottleneck, but fixing that needs someone more skilled than me, and I guess it's a pretty niche use case to be opening so many outgoing sockets. Accepting tons of sockets is way more common and way more optimized.
Sounds like you didn't have receive side scaling enabled; by default flows are queued to core 0 to prevent reordering. If you enable RSS, your flows will be hashed to core-specific queues.
It's inaccurate to describe traffic processing as single-threaded in the kernel.
Do you know whether it's a single thread for all network devices, or just per device? It would be interesting if this ended up being a driver level constraint or something that can be fixed by having multiple NICs in the machine.
Did it have to be Windows? This is the sort of thing Linux or *BSD boxes are better suited for. I wouldn't even consider a Windows machine for the task unless there's some sort of licensed software you need to run on it to get the job done.
> This is the sort of thing Linux or *BSD boxes are better suited for.
Definitely, though enabling conntrack on Linux has similar characteristics (it forces a single thread with some kind of internal mutex), though it can do 5x the bandwidth.
We tried having stateful firewalls in front of our Windows boxen; that's how I know.
If you're on a recent Windows system, you should have pktmon [1] available. I believe it's the "netsh trace" successor and has a much nicer command line. And you no longer need an external tool to convert the trace to .pcapng format.
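For reference, the basic flow on recent builds looks roughly like this (the syntax has shifted between Windows 10 releases, so check pktmon help):

    pktmon start --capture
    pktmon stop
    pktmon etl2pcap PktMon.etl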
PktMon is the next generation tool in newer Windows 10 versions and brings many of the same benefits referred to in this blog - particularly being able to view packet captures and traces together in the same text file.
Cool article, but I'm not impressed by Dropbox's upload speed on my Windows computer, at all.
I just tested right now with Dropbox, Google Drive, and OneDrive, all with their native desktop apps. I simply put a 300 MB file in the folder and let it sync.
DB: 500 KiB/s
GD: 3 MiB/s
OD: 11 MiB/s (my max bandwidth, on a 100 Mbps line)
I don't know what causes the disparity here, but I have been annoyed by this for years, and it's the same across multiple computers I use at different locations.
Another funny thing is if you just use the webpage, both GD and DB can reach 100Mbps easily.
Edit: I should mention Google's DriveFS can reach max speed too, but it's not available for my personal account (which uses the "Backup and Sync from Google" app).
Is Google Drive using QUIC? If so, then it's using the same BBR congestion control as a BBR-enabled TCP stack, and BBR, whose algorithm does not treat loss as congestion, will help a lot.
It would be interesting to re-try the experiment on Linux or FreeBSD using BBR as the TCP stack and see if the results are any better for dropbox.
FWIW, my corp OpenVPN is kinda terrible. My upload speeds via the VPN did not improve at all when I moved and upgraded from 10 Mb/s to 1 Gb/s upstream. When I switched to BBR, my bandwidth went from ~8 Mb/s to ~60 Mb/s, which I think is the limit of the corp VPN endpoint.
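For anyone wanting to try the same thing per socket rather than system-wide (sysctl net.ipv4.tcp_congestion_control=bbr), a minimal Linux sketch:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Opt a single socket into BBR. The tcp_bbr module must be loaded, and
     * non-root callers need it in net.ipv4.tcp_allowed_congestion_control. */
    static int use_bbr(int fd) {
        const char cc[] = "bbr";
        return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, cc, strlen(cc));
    }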
Strange. Dropbox has no problem hitting mid-50s MiB/s if not more on my gigabit connection. I wonder if it's a routing issue and your path to their datacenters is bad?
Tried changing the upload speed to "no limit"; it doesn't make much difference.
Ping result:
    Pinging nsf-env-1.dropbox-dns.com [162.125.3.12] with 32 bytes of data:
    Reply from 162.125.3.12: bytes=32 time=27ms TTL=55
    Reply from 162.125.3.12: bytes=32 time=27ms TTL=55
    Reply from 162.125.3.12: bytes=32 time=27ms TTL=55
    Reply from 162.125.3.12: bytes=32 time=27ms TTL=55

    Ping statistics for 162.125.3.12:
        Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
    Approximate round trip times in milli-seconds:
        Minimum = 27ms, Maximum = 27ms, Average = 27ms
App Ver. 122.4.4867
Is the OS being Win7 a factor? (Work computer, can't update [yet]).
Oh, Windows 7? That does explain it. The Windows TCP stack really improved in 8.1. The main change there is the auto send buffer tuning, which allows automatic growing of SNDBUF. Let me see if we can put in a dirty hack^W^Wworkaround for Windows < WIN2012R2SERVER that would unconditionally set SO_SNDBUF to something like 1 MB.
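The workaround being described would boil down to something like this sketch on the client side; 1 MB is the figure mentioned above, not a tuned constant:

    #include <winsock2.h>

    /* On Windows versions without send-buffer auto-tuning (pre-8.1 /
     * pre-2012R2), force a roomier SNDBUF so a single connection can
     * fill a high-BDP path. */
    static int bump_sndbuf(SOCKET s) {
        int sndbuf = 1 * 1024 * 1024;
        return setsockopt(s, SOL_SOCKET, SO_SNDBUF,
                          (const char *)&sndbuf, sizeof(sndbuf));
    }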
Yeah, Windows 7's speeds with single connections are very, very slow. It took me years to figure out that the reason was bad TCP settings, because it worked fine when I used a multi-connection downloader. What fixed it for me was running these 3 commands in a CMD with admin permissions:
    netsh interface tcp set heuristics disabled
    netsh int tcp set global autotuninglevel=normal
    netsh int tcp set global congestionprovider=ctcp
Google are migrating Backup and Sync to DriveFS soon [0], but you can upgrade right now. Now, I don't remember how I did it, but I do have Drive FS on my personal account.
Good to know! I'll definitely try it later, but I currently have a backup job (one-way photo backup, not GD sync) set up on my second GDrive account which I don't want to touch... yet.
Yea, Dropbox on my Macs has continuously been outrageously slow at uploading. Everything else is multiples faster.
Dropbox does at least resume fairly reliably though, so I can generally ignore it the whole time... unless I have something I want to sync ASAP. Then I sometimes use the web UI and cross my fingers that I don't get a connection hiccup ಠ_ಠ
> Edit: I should mention Google's DriveFS can reach max speed too, but it's not available for my personal account (which uses the "Backup and Sync from Google" app).
That thing is far too aggressive about network bandwidth. It will upload 20 files at the same time and the speed limit setting doesn't work.
It's not hyperbole. I make a backup of my computer, with the output being a bunch of 500MB files. And I would then copy or move those files into a folder on the file stream drive. It's not entirely consistent, and it used to do less, but with some update it decided that it should upload way too many files at once. I've had to switch to an entirely different program to upload those files sequentially.
What is the program that you are using? I am currently using odrive for macOS, since they don't have DriveFS support for Apple Silicon. odrive works OK; it just has a weird file conflict sometimes.
I'm using rclone and honestly prefer it at this point, and there are others as well, so while an official client would be nice, it's no longer a concern for me.
I disagree. I had my machine backed up to Google Drive using their Backup and Sync program, and when I got a new machine there was no reasonable way to restore the data from the old machine to the new machine using Google Drive. Sure, I can copy data from my old machine, but what if it was lost or stolen? If the app can't handle this use case, what's the point of it? The only way to restore the files is in small chunks using the web-based interface, which is not reasonable for tens of thousands of files and hundreds of gigabytes.
The workaround was to back everything up to the "Google Drive" folder since this seems to be the only folder that Backup and Sync can actually restore.
Had a somewhat similar issue, but with Drive File Stream.
At one point I set it up to use my second SSD as the local storage. Then I needed that SSD elsewhere, so I just took it out. It was impossible to restart the damn thing. It kept complaining about missing folders. I even tried uninstalling and reinstalling it, but it kept its settings.
Since I barely used that machine, if ever, and I'm not particularly familiar with Windows, I never really looked into how to completely clean up the configuration. But the point is that there clearly are some pretty stupid decisions about some products.
For future reference, most Windows apps keep their settings in the AppData/Roaming folder in your user folder. There's a useful shortcut as an env variable: type %APPDATA% into Explorer to go straight there.
System-wide settings and state should be stored in C:\ProgramData.
To be fair it is called Backup and Sync not Backup, Sync and Restore.
But on a more useful note how I have handled this in the past is to download the complete Google Drive data using Google Takeout. Not the greatest solution but it has worked.
Are you saying the files were no longer available in Google Drive? Did you download the Drive for desktop client to try restoring files or just try reinstalling the Backup and Sync client?
I believe simply installing the Google Drive for Desktop client would have been what you wanted to do. As the name suggests, the Backup and Sync client is, well, primarily for backing up and syncing your data automatically to Google Drive and Google Photos.
Really? Google Drive sync has been hot garbage for me. Before that program came along, everything was fine and dandy, but Drive sync constantly stumbles over its own feet, restarts, and fails to upload and download files. I'm longing for rsync or even FTP after trying to use Google Drive to move data.
Do they? I constantly see Dropbox taking days to sync files that are 30 KB in size. Or doing dumbfounding things like downloading all files, then re-uploading all files, when I set sync to "online only" on a folder if just one of the files is not set to online only.
Maybe they have grand academic visions and papers, but I've been using them for well over a decade and I feel the client quality has gone downhill over the past few years. They keep adding unnecessary stuff like a redundant file browser while the core service suffers.
Maybe my usage stays in the golden path, but I've been using them for ten years too and I have no complaints about the core functionality. My only real complaint is that they've been adding lots of features I don't care about, getting slightly pushy about convincing you to try them, etc. But I haven't seen the core stuff actually go downhill.
The real root cause for all that flow director mess and core balancing is that there's a huge disconnect between how the hardware works and what the socket API offers by default.
The scaling model of the hardware is rather simple: hash over the packet headers and assign a queue based on that. Each queue should then be pinned to a core by pinning its interrupts, so you get easy flow-level scaling. That's called RSS. It's simple and effective.
What it means is: the hardware decides which core handles which flow. I wonder why the article doesn't mention RSS at all?
Now, the socket API works in a different way: your application decides which core handles which socket, and hence which flow. So you get cache misses if you don't take into account how the hardware is hashing your flows. That's bad. You can do some workarounds by using Flow Director to explicitly redirect flows to the cores that handle them, but that's just not really an elegant solution (and the Flow Director lookup tables are small-ish).
I didn't follow kernel development regarding this recently, but there should be some APIs to get a mapping from a connection tuple to the core it gets hashed to on RX (hash function should be standardized to Toeplitz IIRC, the exact details on which fields and how they are put into the function are somewhat hardware- and driver-specific but usually configurable). So you'd need to take this information into account when scheduling your connections to cores. If you do that you don't get any cache misses and don't need to rely on the limited capabilities of explicit per-flow steering.
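One such API on Linux is SO_INCOMING_CPU (3.19+): rather than recomputing the Toeplitz hash yourself, you can ask which CPU the kernel processed a socket's packets on, and move (or pin) the handling thread there. A sketch:

    #include <sys/socket.h>

    /* Ask where RX processing for this socket landed, so the application
     * thread can be scheduled onto the same core. */
    static int rx_cpu(int fd) {
        int cpu = -1;
        socklen_t len = sizeof(cpu);
        if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len) != 0)
            return -1;
        return cpu;
    }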
Note that this problem will mostly go away once TAPS finally replaces BSD sockets :)
We didn't mention RSS/RPS in the post mostly because they are stable (albeit relatively ineffective in terms of L2 cache misses). Flow Director, OTOH, breaks that stability and causes a lot of migrations, and hence a lot of reordering.
Anyways, nice reference for TAPS! For those wanting to dig into it a bit more, consider reading an introductory paper (before the myriad of RFC drafts from the TAPS Working Group): https://arxiv.org/pdf/2102.11035.pdf
I appreciate seeing a support and engineering org going this deep to resolve this kind of issue. Normally this is the stuff you waste hours on with a support org only to get told to clear your cookies and cache one more time.
In particular, the collaboration with Microsoft was great. I wonder what it took to make that happen.
Has Dropbox ever experimented with SCTP or other protocols that don't enforce strict ordering of packets? I know some middleboxes struggle with SCTP (they expect TCP or UDP), but in that case you can do SCTP over UDP or have a fallback.
Sadly, middleboxes are a real problem, esp. with our Enterprise customers. We had this problem even with the HTTP/2 rollout, so there is even a special HTTP/1.1-only mode in the Desktop Client for environments where h2 is disabled.
In the future we are planning to add HTTP/3 support, which will give us pretty much the same benefits as SCTP with better middlebox compatibility.
Theoretically, UDP would be the best choice if you had the time & money to spend on building a very application-specific layer on top that replicates many of the semantics of TCP. I am not aware of any apps that require 100% of the TCP feature set, so there is always an opportunity to optimize.
You would essentially be saying "I know TCP is great, but we have this one thing we really prefer to do our way so we can justify the cost of developing an in-house mostly-TCP clone and can deal with the caveats of UDP".
If you know your communications channel is very reliable, UDP can be better than TCP.
Now, I am absolutely not advocating that anyone go out and do this. If you are trying to bring a product like Dropbox to market (and you don't have their budget), the last thing you want to do is play games with low-level network abstractions across thousands of potential client device types. TCP is an excellent fit for this use case.
It's an ideal application of TCP. Dropbox servers are continually flooded by traffic from clients, so the good congestion behavior from TCP is valuable. There is also less need to implement error detection/correction/retransmission in higher layers.
I am not saying that it should be done from scratch.
But most of the research done in recent years on protocols used on the web tends to be built on top of UDP rather than TCP, for many historical reasons.
In theory TCP would be the better choice, but in practice this is more complex than you assume.
I think that many people have a knee-jerk reaction when talking about TCP vs UDP, but they probably don't know as much as they think... (parrots)
The onus is on you to explain why. Arguments against: smaller payloads per packet, and missing out on all the TCP algorithms already implemented in hardware en route.
"Dropbox is used by many creative studios, including video and game productions. These studios’ workflows frequently use offices in different time zones to ensure continuous progress around the clock. "
Honestly, I don't understand the orgs that don't go with the OneDrive/O365 suite. What product value does Dropbox have when competing within Microsoft's own ecosystem?
I wonder how the Dropbox developers managed to get in contact with the Windows core TCP team. Maybe I'm too cynical, but I'm surprised that Microsoft would go out of their way to work with a competitor like this.
Even if OneDrive vs Dropbox is important, this is a win for Windows in general. People will switch OSes because the TCP throughput is better on the other side; it's easy to measure and easy to compare and makes a nice item in a pros and cons list.
Fixing something like this can help lots of use cases, but may have been difficult to spot, so I'm sure the Windows TCP team was thrilled to get the detailed, reproducible report.
Interesting. Is the Dropbox client still an obfuscated python app? I'm curious if they spawn new processes for simultaneous uploads since they probably aren't threading.
> On one hand, Dropbox Desktop Client has just a few settings. On the other, behind this simple UI lies some pretty sophisticated Rust code with multi-threaded compression, chunking, and hashing. On the lowest layers, it is backed up by HTTP/2 and TLS stacks.
The Windows client I have installed appears to be a native app using Qt 5 and Qt5WebEngine (embedded Chromium) with an absolutely bonkers number of threads (240). It's possible there's still Python in there, but I suspect not; their UI has been completely overhauled since the Python days.
How come Linux doesn't have this issue? Why did Microsoft have to fix TCP with the RACK-TLP RFC when both the Linux and macOS implementations did fine already?
It's called R(ecent) ACK(nowledgment), and yes, the work came out of Google. This is the single biggest change to TCP loss recovery in a decade. It is now a Standards Track RFC: https://datatracker.ietf.org/doc/html/rfc8985. The Windows implementation was one of the earliest amongst a handful, and Microsoft participated in the standardization.
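The core idea, sketched very loosely (after RFC 8985, with a single RTT estimate standing in for the per-ACK RACK.rtt): loss is inferred from time rather than duplicate-ACK counts, which is what makes it robust to the reordering discussed elsewhere in this thread.

    #include <stdint.h>

    struct segment { uint64_t xmit_time_us; int sacked; };

    /* A segment is deemed lost once a segment sent at/after it has been
     * (S)ACKed, and RTT + a reordering window has elapsed since it was sent. */
    static int rack_lost(const struct segment *seg,
                         uint64_t newest_delivered_xmit_us,
                         uint64_t now_us, uint64_t rtt_us, uint64_t reo_wnd_us) {
        return !seg->sacked &&
               seg->xmit_time_us <= newest_delivered_xmit_us &&
               now_us - seg->xmit_time_us >= rtt_us + reo_wnd_us;
    }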
Yeah, I have no idea either why Microsoft would want to remove Message Analyzer completely, even if they could not maintain it. You can still download it through the Internet Archive.
Supposedly Microsoft is working on adding to the existing Windows Performance Analyzer (great GUI tool for ETW performance tracing) to display ETW packet captures, which will succeed Message Analyzer and Network Monitor: https://techcommunity.microsoft.com/t5/networking-blog/intro...
You are spot on. PktMon is the next generation tool in newer Windows 10 versions and brings many of the same benefits referred to in this blog - particularly being able to view packet captures and traces together in the same text file. And WPA is also very useful when analyzing performance problems.