* What are the reasons for disabling TCP timestamps by default? (If you can answer) will they eventually be enabled by default? (The reason I'm asking is that Linux uses the TS field as storage for syncookies, and without it, it will drop the WScale and SACK options, greatly degrading Windows TCP performance in case of a SYN flood.[1])
* I've noticed "Pacing Profile : off" in the `netsh interface tcp show global` output. Is that the same as tcp pacing in fq qdisc[2]? (If you can answer) will it be eventually enabled by default?
Windows historically defaulted to accepting timestamps when negotiated by the peer, but didn't initiate the negotiation. There are benefits to timestamps and one downside (12 bytes of overhead per packet). Re. syncookies: that's an interesting problem, but under a severe SYN attack, degraded performance is not going to be the biggest worry for the server. We might turn them on for the other benefits, but there are no committed plans. Re. pacing profile: no, that's pacing implemented at the TCP layer itself (unlike the fq qdisc) and is an experimental knob, off by default.
re. syncookies: Linux by default starts issuing syncookies when the listening socket's backlog overflows, so it may be accidentally triggered even by a small connection spike. (This, of course, is not an excuse for a service misconfiguration, but it is quite common: somaxconn on Linux before 5.4 used to be 128, and many services use the default.)
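To make that concrete, a minimal sketch of why fixing only the application side isn't enough; the kernel silently clamps whatever backlog you pass to listen():

    /* The backlog passed to listen() is silently clamped to
     * net.core.somaxconn (128 on Linux before 5.4, 4096 since),
     * so a generous application-side value alone doesn't help. */
    #include <sys/socket.h>

    int make_listener(int fd) {
        return listen(fd, 4096); /* effective backlog = min(4096, somaxconn) */
    }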
re: pacing: Awesome!! I would guess it is similar to Linux's "internal implementation for pacing"[1]. Looking forward to it eventually graduating from being experimental! As a datapoint: enabling pacing on our Edge hosts (circa 2017) resulted in a ~17% reduction in packet loss (w/ CUBIC) and even fully eliminated queue drops on our shallow-buffered routers. There were a couple of road bumps (e.g. "tcp: do not pace pure ack packets"[2]), but Eric Dumazet fixed all of them very quickly.
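For anyone who wants to try the Linux side of this, per-socket pacing can be capped via SO_MAX_PACING_RATE (honored by the fq qdisc, or by TCP's internal pacing on newer kernels). A minimal sketch; the ~1 Gbit/s cap here is just an example value:

    #include <stdint.h>
    #include <sys/socket.h>

    /* Cap this socket's pacing rate at ~1 Gbit/s (the option takes bytes/sec). */
    static int cap_pacing(int fd) {
        uint32_t rate = 125u * 1000 * 1000;
        return setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE,
                          &rate, sizeof(rate));
    }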
Thanks for the heads up. We will investigate to see what fraction of connections end up losing these options.
Pacing TCP is certainly on our roadmap. Our QUIC implementation MsQuic paces by default already.
Do you have any details on how or when Microsoft will roll out QUIC in Windows? Will it work by just specifying the QUIC protocol when creating a socket, like with TCP?
I have a question: why is it that when opening two sockets on Windows and connecting them through TCP, there is about a 40% difference in transfer rate when sending from socket A to B compared to sending from B to A?
That's not expected. Are you using loopback sockets or are these sockets on different endpoints? Is this unidirectional or bidirectional traffic, i.e. are you doing both transfers from A to B and B to A simultaneously?
I see the issue both on the same computer (so loopback) and on two computers on the same LAN. I tested with unidirectional traffic. It's pretty easy to test for yourself. The problem appears to be with the TCP protocol implementation; for UDP, the transfer speed is the same in both directions.
I cannot comment on queuing disciplines and limits in future products. Re. TCP_NOTSENT_LOWAT, you may want to look at the Ideal Send Backlog API, which allows an application to keep just slightly more than the BDP queued, maintaining maximum throughput while minimizing the amount of data queued: https://docs.microsoft.com/en-us/windows/win32/winsock/sio-i...
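For anyone unfamiliar with that API, the query side boils down to roughly this (a sketch with error handling elided; applications are expected to re-query when an SIO_IDEAL_SEND_BACKLOG_CHANGE notification fires, since the estimate moves with the path):

    #include <winsock2.h>
    #include <ws2tcpip.h>

    /* Ask the stack how much send backlog is "ideal" for this connection
     * right now (the value tracks the path's BDP estimate). */
    static ULONG query_isb(SOCKET s) {
        ULONG isb = 0;
        DWORD bytes = 0;
        if (WSAIoctl(s, SIO_IDEAL_SEND_BACKLOG_QUERY, NULL, 0,
                     &isb, sizeof(isb), &bytes, NULL, NULL) != 0)
            return 0;
        return isb; /* keep roughly this many bytes queued on the socket */
    }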
I got hit by the exact same issue described in the Fermilab paper, namely packet reordering caused by Intel drivers. It took me several days to diagnose the problem. Interestingly enough, the problem virtually disappeared when running tcpdump, which, after a lot of reading on the innards of the Linux TCP stack and prodding with eBPF, eventually led me to conjecture that it was a scheduling/core-placement issue. Pinning my process clearly made the problem disappear, and then finding the paper nailed it.
Networks are not my specialty (I come from a math background, am self-taught, and had always dismissed them as mere plumbing), but I have to say that I came out of this difficult (for me) investigation with a great appreciation for networking in general, and now enjoy reading anything I can find about it.
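For reference, the pinning mentioned above can be done with taskset -c N from the shell, or programmatically; a minimal Linux sketch:

    #define _GNU_SOURCE
    #include <sched.h>

    /* Pin the calling process/thread to one core so the scheduler can no
     * longer migrate it and break the NIC-queue <-> CPU affinity. */
    static int pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return sched_setaffinity(0, sizeof(set), &set);
    }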
It's never too late to learn, and I have yet to find something in software engineering which is not interesting once you take a closer look at it!
I ended up reading quite a bit about congestion control while investigating a different issue (sending data from a 25 Gb/s box to a 1 Gb/s one over a 10 Gb/s, 100 ms-latency link didn't work well, since the bigger NIC would saturate the smaller one, which then dropped packets, and this caused the TCP window to shrink significantly), and it was extremely interesting. The whole problem of having multiple agents, with different strategies and incomplete information, competing to maximize their network throughput also reminded me of economics.
TCP pacing essentially solved the problem (and where it isn't available, brutally traffic-shape).
The original symptom was very low throughput, which is what prompted the investigation. Without tcpdump: low throughput and high reordering; with tcpdump: high throughput (which is why I couldn't figure out what was going on).
I'd be very interested if someone with kernel experience could tell me what's specific about tcpdump.
Pcap captures are not multithreaded, so you are effectively pinning to a single core.
This entire thread is interesting because it is highlighting a similar problem with my virtualized router. When I pinned the router VM to specific CPUs, the problem went away. I switched to OpenStack, which doesn't give me the best control over CPU capabilities, and the problem has manifested in a worse form.
My uninformed opinion is that there are underlying concurrency problems with multithreaded userland/kernel interaction and some NIC drivers (consumer Intel and Broadcom hardware).
Ah! Thanks for this.
Being single-threaded does not prevent your process from being migrated from one core to another, though, no? Or do you mean that pcap captures are pinned?
I had a similar issue with Windows kernels "recently" (2016~?)...
I don't have the memory or patience to write a long and inspiring blog post, but it comes down to:
Even with IOCP/multiple threads: network traffic is single-threaded in the kernel; even worse, there's a mutex there, putting the effective PPS limit on Windows at something like 1.1M at 3.0 GHz.
The task of this machine was /basically/ a connection multiplexer with some TLS offloading; so listen on a socket, get an encrypted connection, check your connection pool and forward where appropriate.
Our machine basically sat waiting (in kernel space) for this lock 99.7% of the time; 0.3% was spent on SSL handshaking.
We solved our "issue" by spreading the load over many more machines and giving them low-core-count, high-clock-speed Xeons instead of the normal complement of 20-vCPU Xeons.
AFAIK that issue persists; I'd be interested to know if someone else managed to coerce Windows into doing the right thing here.
I did some work optimizing a similar problem, but simpler and on another OS[1]. The basic concept that worked was Receive Side Scaling (RSS), which was developed by Microsoft for Windows Server. Did you come across that? It needs support in the NIC and the driver, but Intel GigE cards do it, so you don't need the really fancy cards. I don't know what the interface is like on Windows, but inbound RSS for FreeBSD is pretty easy, and skimming the Windows docs, it seemed like you could do more advanced things there.
The harder part was aligning the outgoing connections; for max performance, you want all of the related connections pinned to the same CPU, so that there's no inter-CPU messaging. For me, that meant a frontend connection needed to hash to the same NIC queue as the backend connection; for you, that would be all of the demultiplexed connections on the same queue as the multiplexed connection. Windows may have an API to make connections that will hash properly; FreeBSD didn't (doesn't?), so my code had to manage the local source IP and port when connecting to remote servers so that the connection would hash as needed. Assuming a lot of connections, you end up needing to self-manage source IP and port anyway, and at least HAProxy has code for that already, but running the RSS hash to qualify ports was new development, and a bit tricky because bulk calculating it gets costly.
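For the curious, "running the RSS hash to qualify ports" looks roughly like the sketch below: compute the software Toeplitz hash over candidate 4-tuples until one lands on the queue you want. This is a sketch, not my original code: the 40-byte key has to be read from the NIC first, and the (hash & 0x7f) % num_queues step models a 128-entry round-robin indirection table, an assumption that varies by NIC and OS:

    #include <stddef.h>
    #include <stdint.h>

    /* Software Toeplitz hash, as specified for RSS NICs. */
    static uint32_t toeplitz_hash(const uint8_t *key, size_t keylen,
                                  const uint8_t *data, size_t datalen) {
        uint32_t hash = 0;
        uint64_t window = 0;  /* sliding 64-bit register of key bits */
        size_t keyidx;

        for (keyidx = 0; keyidx < 8 && keyidx < keylen; keyidx++)
            window |= (uint64_t)key[keyidx] << (56 - 8 * keyidx);

        for (size_t i = 0; i < datalen; i++) {
            for (int bit = 7; bit >= 0; bit--) {
                if (data[i] & (1u << bit))
                    hash ^= (uint32_t)(window >> 32); /* top 32 key bits */
                window <<= 1;
            }
            if (keyidx < keylen)            /* refill the emptied low byte */
                window |= (uint64_t)key[keyidx++];
        }
        return hash;
    }

    /* tuple = {src ip[4], dst ip[4], src port[2], dst port[2]}, network order.
     * Scan the ephemeral range for a source port that hashes to wanted_queue. */
    static int pick_source_port(const uint8_t key[40], uint8_t tuple[12],
                                unsigned wanted_queue, unsigned num_queues) {
        for (unsigned port = 32768; port < 61000; port++) {
            tuple[8] = (uint8_t)(port >> 8);
            tuple[9] = (uint8_t)(port & 0xff);
            uint32_t h = toeplitz_hash(key, 40, tuple, 12);
            if ((h & 0x7f) % num_queues == wanted_queue)
                return (int)port;
        }
        return -1; /* nothing in the range hashed where we wanted */
    }

The "bulk calculating it gets costly" part is visible here: it's 96 bit-tests per candidate tuple, unless you precompute per-byte lookup tables.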
Once I got everything set up well with respect to CPUs, things got a lot better; I still had some kernel bottlenecks, though. I wouldn't know how to resolve that for Windows, but there were some easy wins for FreeBSD.
Low core count is the right way to go, though; I think the NICs I used could only do 16-way RSS hashing, so my dual 14-core Xeons (2690v4) weren't a great fit; 12 cores were 100% idle all the time. Something with a power-of-two core count would be best.
Email in profile if you want to continue the discussion off HN (or after it fizzles out here).
[1] Load balancing/proxying, but no TLS and no multiplexing, on FreeBSD.
Do you actually use RSS via options RSS / options PCBGROUP? I've tried it several times, and it's just so hard to get right and have matching cores / RX rings, etc. I've made it work with a local patch to nginx, but it was so fragile that I abandoned it.
I had been thinking that RSS/PCBGROUP was totally abandoned and could potentially be removed.
I no longer work where I did this (and it's been shut down, as it was a transitional proxy), so I can't be 100% sure what the kernel configuration was; I was able to release patches on the HAProxy mailing list, although they weren't incorporated, but at least I can reference them [1].
But yes, I think I ended up using both RSS and PCBGROUP. This was on a server running only one application (plus like sshd and crond and whatever), so it was dead simple to line up listen-socket RSS and CPU affinity; I had a config generator script that would look at the number of configured queues and tell HAProxy process 0 to bind to CPU 0 and RSS queue 0, up until I ran out of RSS queues; we needed a config generator script anyway, because the backend configuration was subject to frequent changes. If it was only listen sockets, RSS would have been sufficient without needing PCBGROUP, but locking around opening new outgoing sockets was a bottleneck, and PCBGROUP helped considerably, though it was still a bottleneck. This was on FreeBSD 12.
Edit: I also found some patches[2] I sent to freebsd-transport that I don't know if anyone saw; I don't remember if I updated the patches after this... I know I tried some more stuff that I wasn't able to get working. Don't apply these patches blindly, but these were some of the things I had to fiddle with anyway. I think I saw there was some stuff in 13 that likely made outgoing connections better.
On further reflection, I just want to emphasize how much of an improvement RSS/PCBGROUP and the couple of minor tweaks made for this use case. With unmodified FreeBSD 12 and HAProxy and the load we had, there was basically zero concurrency available: you could run as many processes (or threads) as you wanted, and the capacity would be the same, and it was sad; I think we could only get about 100k clients on a box before it would run out of steam.
With everything tweaked, we got to 2M clients per server, and actually it was hard to find the limit, because I wasn't able to direct enough traffic to the machines under test.
The software and configuration changes weren't big, but they had a huge impact. On the other hand, if RSS and PCBGROUP weren't in the kernel, I don't think I would have been able to add something similar, and we would have had to do something wild and crazy (or try Linux and see if it would do the job). Now, I really did want to write a raw-packet TCP proxy in userspace, but I knew it would be a lot easier to manage and quicker to get working with something off the shelf.
Of course, maybe there's a better solution to the root bottleneck, which was always opening a new outgoing tcp connection; even with all the tweaks, that was still the bottleneck, but fixing that needs someone more skilled than me, and I guess it's a pretty niche use case to be opening so many outgoing sockets. Accepting tons of sockets is way more common and way more optimized.
Sounds like you didn't have receive side scaling enabled; by default flows are queued to core 0 to prevent reordering. If you enable RSS, your flows will be hashed to core-specific queues.
It's inaccurate to describe traffic processing as single-threaded in the kernel.
Do you know whether it's a single thread for all network devices, or just per device? It would be interesting if this ended up being a driver level constraint or something that can be fixed by having multiple NICs in the machine.
Did it have to be Windows? This is the sort of thing Linux or *BSD boxes are better suited for. I wouldn't even consider a Windows machine for the task unless there's some sort of licensed software you need to run on it to get the job done.
> This is the sort of thing Linux or *BSD boxes are better suited for.
Definitely, though enabling conntrack on Linux has similar characteristics (it forces a single thread with some kind of internal mutex), though it can do 5x the bandwidth.
We tried having stateful firewalls in front of our Windows boxen; that's how I know.
If you're on a recent Windows system, you should have pktmon [1] available. I believe it's the "netsh trace" successor and has a much nicer command line. And you no longer need an external tool to convert the trace to .pcapng format.
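For reference, the basic flow on recent builds looks roughly like this (the syntax has shifted between Windows 10 releases, so check pktmon help):

    pktmon start --capture
    pktmon stop
    pktmon etl2pcap PktMon.etl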
PktMon is the next generation tool in newer Windows 10 versions and brings many of the same benefits referred to in this blog - particularly being able to view packet captures and traces together in the same text file.
Cool article, but I'm not impressed by Dropbox's upload speed on my Windows computer, at all.
I just tested right now with Dropbox, Google Drive, and OneDrive, all with their native desktop apps. I simply put a 300 MB file in the folder and let it sync.
DB: 500 KiB/s
GD: 3 MiB/s
OD: 11 MiB/s (my max bandwidth, on a 100 Mbps line)
I don't know what causes the disparity here, but I have been annoyed by this for years, and it's the same across multiple computers I use at different locations.
Another funny thing is if you just use the webpage, both GD and DB can reach 100Mbps easily.
Edit: I should mention Google's DriveFS can reach max speed too, but it's not available for my personal account (which uses the "Backup and Sync from Google" app).
Is Google Drive using QUIC? If so, then it's using the same BBR congestion control as a BBR-enabled TCP stack, and BBR, whose algorithm does not treat loss as congestion, will help a lot.
It would be interesting to re-try the experiment on Linux or FreeBSD using BBR as the TCP stack and see if the results are any better for dropbox.
FWIW, my corp OpenVPN is kinda terrible. My upload speeds via the VPN did not improve at all when I moved and upgraded from 10 Mb/s to 1 Gb/s upstream. When I switched to BBR, my bandwidth went from ~8 Mb/s to ~60 Mb/s, which I think is the limit of the corp VPN endpoint.
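For anyone wanting to try the same thing per socket rather than system-wide (sysctl net.ipv4.tcp_congestion_control=bbr), a minimal Linux sketch:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Opt a single socket into BBR. The tcp_bbr module must be loaded, and
     * non-root callers need it in net.ipv4.tcp_allowed_congestion_control. */
    static int use_bbr(int fd) {
        const char cc[] = "bbr";
        return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, cc, strlen(cc));
    }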
Strange. Dropbox has no problem hitting mid-50s MiB/s if not more on my gigabit connection. I wonder if it's a routing issue and your path to their datacenters is bad?
Tried changing the upload speed to "no limit"; it doesn't make much difference.
Ping result:
    Pinging nsf-env-1.dropbox-dns.com [162.125.3.12] with 32 bytes of data:
    Reply from 162.125.3.12: bytes=32 time=27ms TTL=55
    Reply from 162.125.3.12: bytes=32 time=27ms TTL=55
    Reply from 162.125.3.12: bytes=32 time=27ms TTL=55
    Reply from 162.125.3.12: bytes=32 time=27ms TTL=55

    Ping statistics for 162.125.3.12:
        Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
    Approximate round trip times in milli-seconds:
        Minimum = 27ms, Maximum = 27ms, Average = 27ms
App Ver. 122.4.4867
Is the OS being Win7 a factor? (Work computer, can't update [yet]).
Oh, Windows 7? That does explain it. The Windows TCP stack really improved in 8.1. The main change there is the auto send buffer tuning, which allows automatic growing of SNDBUF. Let me see if we can put in a dirty hack^W^Wworkaround for Windows < WIN2012R2SERVER that would unconditionally set SO_SNDBUF to something like 1 MB.
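The workaround being described would boil down to something like this sketch on the client side; 1 MB is the figure mentioned above, not a tuned constant:

    #include <winsock2.h>

    /* On Windows versions without send-buffer auto-tuning (pre-8.1 /
     * pre-2012R2), force a roomier SNDBUF so a single connection can
     * fill a high-BDP path. */
    static int bump_sndbuf(SOCKET s) {
        int sndbuf = 1 * 1024 * 1024;
        return setsockopt(s, SOL_SOCKET, SO_SNDBUF,
                          (const char *)&sndbuf, sizeof(sndbuf));
    }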
Yeah, Windows 7's speeds with single connections are very, very slow. It took me years to figure out that the reason was bad TCP settings, because it worked fine when I used a multi-connection downloader. What fixed it for me was running these 3 commands in a CMD with admin permissions:
    netsh interface tcp set heuristics disabled
    netsh int tcp set global autotuninglevel=normal
    netsh int tcp set global congestionprovider=ctcp
Google are migrating Backup and Sync to DriveFS soon [0], but you can upgrade right now. Now, I don't remember how I did it, but I do have Drive FS on my personal account.
Good to know! I'll definitely try it later, but I currently have a backup job (one-way photo backup, not GD sync) set up on my second GDrive account which I don't want to touch... yet.
Yea, Dropbox on my Macs has continuously been outrageously slow at uploading. Everything else is multiples faster.
Dropbox does at least resume fairly reliably though, so I can generally ignore it the whole time... unless I have something I want to sync ASAP. Then I sometimes use the web UI and cross my fingers that I don't get a connection hiccup ಠ_ಠ
> Edit: I should mention Google's DriveFS can reach max speed too, but it's not available for my personal account (which uses the "Backup and Sync from Google" app).
That thing is far too aggressive about network bandwidth. It will upload 20 files at the same time and the speed limit setting doesn't work.
It's not hyperbole. I make a backup of my computer, with the output being a bunch of 500MB files. And I would then copy or move those files into a folder on the file stream drive. It's not entirely consistent, and it used to do less, but with some update it decided that it should upload way too many files at once. I've had to switch to an entirely different program to upload those files sequentially.
What is the program that you are using? I am currently using odrive for macOS, since they don't have DriveFS support for Apple Silicon. odrive works OK; it just has a weird file conflict sometimes.
I'm using rclone and honestly prefer it at this point, and there are others as well, so while an official client would be nice, it's no longer a concern for me.
I disagree. I had my machine backed up to Google Drive using their Backup and Sync program, and when I got a new machine there was no reasonable way to restore the data from the old machine to the new machine using Google Drive. Sure, I can copy data from my old machine, but what if it was lost or stolen? If the app can't handle this use case, what's the point of it? The only way to restore the files is in small chunks using the web-based interface, which is not reasonable for tens of thousands of files and hundreds of gigabytes.
The workaround was to back everything up to the "Google Drive" folder since this seems to be the only folder that Backup and Sync can actually restore.
Had a somewhat similar issue, but with Drive File Stream.
At one point I set it up to use my second SSD as the local storage. Then I needed that SSD elsewhere, so I just took it out. It was impossible to restart the damn thing. It kept complaining about missing folders. I even tried uninstalling and reinstalling it, but it kept its settings.
Since I barely used that machine, if ever, and I'm not particularly familiar with Windows, I never really looked into how to completely clean up the configuration. But the point is that there clearly are some pretty stupid decisions about some products.
For future reference, most Windows apps keep their settings in the AppData/Roaming folder in your user folder. There's a useful shortcut as an env variable: type %APPDATA% into Explorer to go straight there.
System-wide settings and state should be stored in C:\ProgramData.
To be fair it is called Backup and Sync not Backup, Sync and Restore.
But on a more useful note how I have handled this in the past is to download the complete Google Drive data using Google Takeout. Not the greatest solution but it has worked.
Are you saying the files were no longer available in Google Drive? Did you download the Drive for desktop client to try restoring files or just try reinstalling the Backup and Sync client?
I believe simply installing the Google Drive for Desktop client would have been what you wanted to do. As the name suggests, the Backup and Sync client is, well, primarily for backing up and syncing your data automatically to Google Drive and Google Photos.
Really? Google Drive sync has been hot garbage for me. Before that program came along, everything was fine and dandy, but Drive sync constantly stumbles over its own feet, restarts, and fails to upload and download files. I'm longing for rsync or even FTP after trying to use Google Drive to move data.
Do they? I constantly see Dropbox taking days to sync files that are 30 KB in size. Or doing dumbfounding things like downloading all files, then re-uploading all files, when I set sync to "online only" on a folder if just one of the files is not set to online only.
Maybe they have grand academic visions and papers, but I've been using them for well over a decade and I feel the client quality has gone downhill over the past few years. They keep adding unnecessary stuff like a redundant file browser while the core service suffers.
Maybe my usage stays in the golden path, but I've been using them for ten years too and I have no complaints about the core functionality. My only real complaint is that they've been adding lots of features I don't care about, getting slightly pushy about convincing you to try them, etc. But I haven't seen the core stuff actually go downhill.
The real root cause for all that flow director mess and core balancing is that there's a huge disconnect between how the hardware works and what the socket API offers by default.
The scaling model of the hardware is rather simple: hash over the packet headers and assign a queue based on that. Each queue should then be pinned to a core by pinning its interrupts, so you get easy flow-level scaling. That's called RSS. It's simple and effective.
What it means is: the hardware decides which core handles which flow. I wonder why the article doesn't mention RSS at all?
Now, the socket API works in a different way: your application decides which core handles which socket, and hence which flow. So you get cache misses if you don't take into account how the hardware is hashing your flows. That's bad. You can do some workarounds by using Flow Director to explicitly redirect flows to the cores that handle them, but that's just not really an elegant solution (and the Flow Director lookup tables are small-ish).
I didn't follow kernel development regarding this recently, but there should be some APIs to get a mapping from a connection tuple to the core it gets hashed to on RX (hash function should be standardized to Toeplitz IIRC, the exact details on which fields and how they are put into the function are somewhat hardware- and driver-specific but usually configurable). So you'd need to take this information into account when scheduling your connections to cores. If you do that you don't get any cache misses and don't need to rely on the limited capabilities of explicit per-flow steering.
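One such API on Linux is SO_INCOMING_CPU (3.19+): rather than recomputing the Toeplitz hash yourself, you can ask which CPU the kernel processed a socket's packets on, and move (or pin) the handling thread there. A sketch:

    #include <sys/socket.h>

    /* Ask where RX processing for this socket landed, so the application
     * thread can be scheduled onto the same core. */
    static int rx_cpu(int fd) {
        int cpu = -1;
        socklen_t len = sizeof(cpu);
        if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len) != 0)
            return -1;
        return cpu;
    }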
Note that this problem will mostly go away once TAPS finally replaces BSD sockets :)
We didn't mention RSS/RPS in the post mostly because they are stable (albeit relatively ineffective in terms of L2 cache misses). Flow Director, OTOH, breaks that stability and causes a lot of migrations, and hence a lot of reordering.
Anyways, nice reference for TAPS! For those wanting to dig into it a bit more, consider reading an introductory paper (before the myriad of RFC drafts from the TAPS Working Group): https://arxiv.org/pdf/2102.11035.pdf
I appreciate seeing a support and engineering org going this deep to resolve this kind of issue. Normally this is the stuff you waste hours on with a support org only to get told to clear your cookies and cache one more time.
In particular, the collaboration with Microsoft was great. I wonder what it took to make that happen.
Has Dropbox ever experimented with SCTP or other protocols that don't enforce strict ordering of packets? I know some middleboxes struggle with SCTP (they expect TCP or UDP), but in that case you can do SCTP over UDP or have a fallback.
Sadly, middleboxes are a real problem, esp. with our Enterprise customers. We had this problem even with the HTTP/2 rollout, so there is even a special HTTP/1.1-only mode in the Desktop Client for environments where h2 is disabled.
In the future we are planning to add HTTP/3 support, which will give us pretty much the same benefits as SCTP with better middlebox compatibility.
Theoretically, UDP would be the best choice if you had the time & money to spend on building a very application-specific layer on top that replicates many of the semantics of TCP. I am not aware of any apps that require 100% of the TCP feature set, so there is always an opportunity to optimize.
You would essentially be saying "I know TCP is great, but we have this one thing we really prefer to do our way so we can justify the cost of developing an in-house mostly-TCP clone and can deal with the caveats of UDP".
If you know your communications channel is very reliable, UDP can be better than TCP.
Now, I am absolutely not advocating that anyone go out and do this. If you are trying to bring a product like Dropbox to market (and you don't have their budget), the last thing you want to do is play games with low-level network abstractions across thousands of potential client device types. TCP is an excellent fit for this use case.
It's an ideal application of TCP. Dropbox servers are continually flooded by traffic from clients, so the good congestion behavior from TCP is valuable. There is also less need to implement error detection/correction/retransmission in higher layers.
I am not saying that it should be done from scratch.
But most of the research done in recent years on protocols used on the web tends to be built on top of UDP rather than TCP, for many historical reasons.
In theory TCP would be the better choice, but in practice this is more complex than you assume.
I think that many people have a knee-jerk reaction when talking about TCP vs UDP, but they probably don't know as much as they think... (parrots)
The onus is on you to explain why. Arguments against: smaller payloads per packet, and missing out on all the TCP algorithms already implemented in hardware en route.
"Dropbox is used by many creative studios, including video and game productions. These studios’ workflows frequently use offices in different time zones to ensure continuous progress around the clock. "
Honestly, I don't understand the orgs that don't go with the OneDrive/O365 suite. What product value does Dropbox have when competing within Microsoft's own ecosystem?
I wonder how the Dropbox developers managed to get in contact with the Windows core TCP team. Maybe I'm too cynical, but I'm surprised that Microsoft would go out of their way to work with a competitor like this.
Even if OneDrive vs Dropbox is important, this is a win for Windows in general. People will switch OSes because the TCP throughput is better on the other side; it's easy to measure and easy to compare and makes a nice item in a pros and cons list.
Fixing something like this can help lots of use cases, but may have been difficult to spot, so I'm sure the Windows TCP team was thrilled to get the detailed, reproducible report.
Interesting. Is the Dropbox client still an obfuscated python app? I'm curious if they spawn new processes for simultaneous uploads since they probably aren't threading.
> On one hand, Dropbox Desktop Client has just a few settings. On the other, behind this simple UI lies some pretty sophisticated Rust code with multi-threaded compression, chunking, and hashing. On the lowest layers, it is backed up by HTTP/2 and TLS stacks.
The Windows client I have installed appears to be a native app using Qt 5 and Qt5WebEngine (embedded Chromium) with an absolutely bonkers number of threads (240). It's possible there's still Python in there, but I suspect not; their UI has been completely overhauled since the Python days.
How come Linux doesn't have this issue? Why did Microsoft have to fix TCP with the RACK-TLP RFC when both the Linux and macOS implementations did fine already?
It's called R(ecent) ACK(nowledgment), and yes, the work came out of Google. This is the single biggest change to TCP loss recovery in a decade. It is now a Standards Track RFC: https://datatracker.ietf.org/doc/html/rfc8985. The Windows implementation was one of the earliest amongst a handful, and Microsoft participated in the standardization.
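The core idea, sketched very loosely (after RFC 8985, with a single RTT estimate standing in for the per-ACK RACK.rtt): loss is inferred from time rather than duplicate-ACK counts, which is what makes it robust to the reordering discussed elsewhere in this thread.

    #include <stdint.h>

    struct segment { uint64_t xmit_time_us; int sacked; };

    /* A segment is deemed lost once a segment sent at/after it has been
     * (S)ACKed, and RTT + a reordering window has elapsed since it was sent. */
    static int rack_lost(const struct segment *seg,
                         uint64_t newest_delivered_xmit_us,
                         uint64_t now_us, uint64_t rtt_us, uint64_t reo_wnd_us) {
        return !seg->sacked &&
               seg->xmit_time_us <= newest_delivered_xmit_us &&
               now_us - seg->xmit_time_us >= rtt_us + reo_wnd_us;
    }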
Yeah, I have no idea either why Microsoft would want to remove Message Analyzer completely, even if they could not maintain it. You can still download it through the Internet Archive.
Supposedly Microsoft is working on adding to the existing Windows Performance Analyzer (great GUI tool for ETW performance tracing) to display ETW packet captures, which will succeed Message Analyzer and Network Monitor: https://techcommunity.microsoft.com/t5/networking-blog/intro...
You are spot on. PktMon is the next generation tool in newer Windows 10 versions and brings many of the same benefits referred to in this blog - particularly being able to view packet captures and traces together in the same text file. And WPA is also very useful when analyzing performance problems.