
Are parallel downloads from S3 actually faster? My intuition says you'll hit network or disk bottlenecks before you can saturate even a single core with work.



Depends on where you're downloading from. S3 single-stream GET throughput is throttled to ~40MB/sec. An HDD can write at ~200MB/sec and SSDs can write at ~500MB/sec. NICs on old EC2 instances can do 10Gbps and the new ones can do 25Gbps.

If you're on a home internet or mobile connection downloading large files, a single download will likely saturate your connection. If you're on an EC2 instance, you should be able to do 10-100x better by parallelizing.

Source: I used to work on S3.
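
A minimal sketch of what that parallelization can look like from an EC2 instance, assuming boto3 ranged GETs spread over a thread pool (the bucket, key, and part size below are placeholders, not anything from this thread):

    # Sketch: split one large object into byte ranges, fetch them on separate
    # connections, and stitch the parts back together on disk.
    import concurrent.futures
    import boto3

    BUCKET = "my-bucket"          # placeholder
    KEY = "big-file.bin"          # placeholder
    PART_SIZE = 64 * 1024 * 1024  # 64 MiB per ranged GET

    s3 = boto3.client("s3")  # boto3 clients are thread-safe
    size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]

    def fetch_range(start):
        end = min(start + PART_SIZE, size) - 1
        resp = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={start}-{end}")
        return start, resp["Body"].read()

    with open(KEY, "wb") as out, concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
        out.truncate(size)  # pre-size the file so parts can land in any order
        for start, chunk in pool.map(fetch_range, range(0, size, PART_SIZE)):
            out.seek(start)
            out.write(chunk)

Each ranged GET is its own request, so the single-stream ceiling applies per range rather than to the whole object.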


FWIW: that 40MB/s single-stream limit doesn't seem to exist in my empirical measurements from the last two years. When did you work on S3?


It does exist, but it's 80MB/s nowadays.


> S3 single stream GET throughput is throttled to ~40MB/sec

More like 80MB/s


Yes, parallel downloads are faster in some cases. There are even extreme cases, mostly around giant EC2 nodes, where you can take this idea a step further and spin up multiple processes that each download part of a file, and really saturate your network or disk.

My favorite version of this is when you start to use shared memory of some kind to move terabytes of data from S3 to EC2 and work on it without ever hitting a disk.

Not for everyone, and often the extra milliseconds saved won't matter, but sometimes you really do need to move hundreds of gigabytes or terabytes of data as quickly as possible.
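
A hedged sketch of that shared-memory variant, assuming Python 3.8+ multiprocessing.shared_memory and boto3 (bucket, key, and segment name are made up, and no particular tool works exactly this way):

    # Sketch: each worker process writes its byte range of the object straight
    # into a shared memory block, so the data never touches disk.
    from multiprocessing import Pool, shared_memory
    import boto3

    BUCKET = "my-bucket"       # placeholder
    KEY = "huge-object.bin"    # placeholder
    SHM_NAME = "s3_download"   # placeholder segment name
    PART_SIZE = 256 * 1024 * 1024

    def fetch_part(byte_range):
        start, end = byte_range
        s3 = boto3.client("s3")  # one client per worker process
        body = s3.get_object(Bucket=BUCKET, Key=KEY,
                             Range=f"bytes={start}-{end}")["Body"].read()
        shm = shared_memory.SharedMemory(name=SHM_NAME)
        shm.buf[start:start + len(body)] = body
        shm.close()

    if __name__ == "__main__":
        size = boto3.client("s3").head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
        shm = shared_memory.SharedMemory(name=SHM_NAME, create=True, size=size)
        ranges = [(s, min(s + PART_SIZE, size) - 1) for s in range(0, size, PART_SIZE)]
        with Pool(processes=16) as pool:
            pool.map(fetch_part, ranges)
        # shm.buf now holds the whole object; hand it to the next stage, then
        # call shm.close() and shm.unlink() when finished.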


That's what my small Python package does (https://github.com/NewbiZ/s3pd). Definitely not perfect (it doesn't saturate cores because there's no event loop per process), but the download is split across multiple processes and stored in shared memory.

I've been able to saturate 20Gbps NICs on EC2 with it (32 cores).


You just described Apache Spark


Aside from what the other replies mentioned, one of the network bottlenecks that you're referring to is the per-TCP connection packet rate limit and congestion avoidance artifacts that many networks impose. The S3 frontend node that serves your request may also experience congestion from "hot" objects it has been assigned, so amortizing your connections over many S3 nodes can help. Finally, for many small objects, having multiple connections helps amortize the per-object latency overhead.

As for the disk overhead, modern NVMe SSDs can easily sustain millions of IOPS and multiple gigabytes per second of bandwidth, more than keeping up with a 40Gbps link (such as on a large EC2 instance that does have the connectivity to talk to S3 at that rate).
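
On the many-small-objects point above, a rough boto3 sketch (bucket and key names are placeholders): give each worker thread its own client so each keeps its own HTTP connection pool, and the per-object latency overlaps across requests instead of adding up.

    # Sketch: fetch many small objects concurrently, one boto3 client per
    # thread so requests spread over many TCP connections.
    import threading
    import concurrent.futures
    import boto3

    BUCKET = "my-bucket"                             # placeholder
    keys = [f"small/obj-{i}" for i in range(1000)]   # placeholder keys

    _local = threading.local()

    def client():
        if not hasattr(_local, "s3"):
            _local.s3 = boto3.session.Session().client("s3")
        return _local.s3

    def fetch(key):
        return key, client().get_object(Bucket=BUCKET, Key=key)["Body"].read()

    with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
        results = dict(pool.map(fetch, keys))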


An extreme case is when you make heavy use of S3 for EC2 workloads and want to saturate your instance's connection. This can go up to 40Gbps afaik.

Edit: I read about this recently, and if I remember correctly there's a limit of around a thousand parallel connections to S3.

AWS internally probably has higher limits for some of its services, e.g. when you query data in S3 with Athena.


Makes sense, thank you. Wouldn't you want to multiprocess the downloads instead of multithreading them? I imagine you would run into the Python global interpreter lock before being able to push through 40Gbps?


Yes, you would. The CPU and memory overhead of multiprocessing for this application is why we ended up migrating away from boto3 to the AWS Go SDK for this specific purpose (https://github.com/chanzuckerberg/s3parcp, as I mentioned in another comment). We still use boto3 in other areas, but for maxing out the network connection, golang is far more scalable.


Using multiprocessing, I've been able to quite easily saturate a 20Gbps EC2 NIC in Python: https://github.com/NewbiZ/s3pd

There is no reason why multiprocessing for IO in Python would use _crazily_ more memory than in another language, when done properly.


When you have a lot of small files, the latency on each operation, though not huge in absolute terms, can end up taking more time than the data transfer and whatever other work is being done on the content.

(At work, we had an upload job with ~800k files, ranging from <1kb to >100kb. I looked at rearranging how we stored things to avoid small files, but it ended up being a straighter shot to keep using little files and use a worker pool to make the transfer parallel.)


(Er, some files >100MB not >100KB. If they maxed out around 100KB we probably wouldn't have picked S3 to store them!)
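
A minimal sketch of that kind of worker pool with boto3 (the bucket and local directory below are made up, not the actual job):

    # Sketch: upload a directory full of small files with a thread pool so the
    # per-object request latency overlaps across workers.
    import pathlib
    import concurrent.futures
    import boto3

    BUCKET = "my-bucket"        # placeholder
    SRC = pathlib.Path("data")  # placeholder local directory
    s3 = boto3.client("s3")     # boto3 clients are thread-safe

    def upload(path):
        key = str(path.relative_to(SRC))
        s3.upload_file(str(path), BUCKET, key)
        return key

    files = [p for p in SRC.rglob("*") if p.is_file()]
    with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
        list(pool.map(upload, files))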


In this case I needed to download about 1k files of ~10MB each. Doing this with multithreading was so much faster than downloading all the files sequentially.


Yes, and if you leverage multipart upload, then downloading along those part boundaries will have zero contention and you can go extra fast.
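
For what it's worth, GetObject accepts a PartNumber parameter, so an object uploaded via multipart upload can be fetched back along the same part boundaries. A hedged boto3 sketch (bucket and key are placeholders):

    # Sketch: download an object along its original multipart-upload part
    # boundaries using GetObject's PartNumber parameter.
    import concurrent.futures
    import boto3

    BUCKET = "my-bucket"   # placeholder
    KEY = "big-file.bin"   # placeholder

    s3 = boto3.client("s3")
    head = s3.head_object(Bucket=BUCKET, Key=KEY, PartNumber=1)
    parts = head.get("PartsCount", 1)  # only present for multipart objects

    def fetch_part(n):
        return n, s3.get_object(Bucket=BUCKET, Key=KEY, PartNumber=n)["Body"].read()

    with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
        chunks = dict(pool.map(fetch_part, range(1, parts + 1)))

    with open(KEY, "wb") as out:
        for n in range(1, parts + 1):
            out.write(chunks[n])

This buffers the whole object in memory before writing, which is fine for a sketch but worth changing for really large files.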



