Downloading files from S3 with multithreading and Boto3 (emasquil.github.io)
154 points by emasquil on April 10, 2021 | 64 comments



It is a bit of a hidden gem, but Metaflow includes a Boto-based, highly parallelized, error-tolerant S3 client that Netflix routinely uses to get 10-20 Gbps of throughput between EC2 and S3.

Technically it is independent of Metaflow, so you could use it as a standalone, high-performance S3 client.

See docs here https://docs.metaflow.org/metaflow/data#store-and-load-objec...

And code here https://github.com/Netflix/metaflow/tree/master/metaflow/dat...

(I wrote it originally - AMA if curious)
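
A rough sketch of what standalone usage looks like (bucket path and keys are placeholders; see the docs above for the exact API):

    from metaflow import S3

    # Fetch many objects in parallel; they are downloaded to local
    # temp files that live for the duration of the `with` block.
    with S3(s3root='s3://my-bucket/some/prefix/') as s3:
        for obj in s3.get_many(['file-1', 'file-2', 'file-3']):
            print(obj.key, obj.path)  # obj.path is a local file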


Really interesting going through your code, thanks for sharing. Why did you opt for running the parallelization code in a separate Python process (s3op)?

You might want to update the "Caution: Overwriting data in S3" section in the docs, since S3 has offered strong read-after-write consistency since December 2020.

https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-rea...


re: separate process - fault-tolerance is a key requirement. There are a myriad of ways in which highly parallelized, network- and data-intensive code can fail, so isolating it in a separate process is safer than trying to try-except everything and hoping it works.

Good catch re: the warning about consistency! The docs were written before the change :)


Can you do a blogpost please!


The topic is close to my heart so maybe one day :)


The AWS Go SDK now has a connection-pool-based S3 download/upload manager API that allows saturating your network connection (e.g. 40 Gbit/s between EC2 and S3) using far less memory and CPU than is possible with Python.

A colleague of mine developed this tool to make this functionality available in a CLI: https://github.com/chanzuckerberg/s3parcp


With AWS it's really hard to tell which of their SDKs are at cutting-edge feature parity and which are not; that some of them are and some are not is a real shame.

Just last week I wrote basically the same thing as an ad-hoc solution using boto3, because I had tens of TB of data to pull out of Glacier and distribute across S3 buckets. It wasn't a big deal, because I'm experienced at writing parallel network code in Python and moving big data streams, and boto3 has good documentation, but things like this really shouldn't be left as an exercise for the SDK consumer.


Do you know if there is a "sync" function, just like the aws-cli has?

I've been thinking of starting to use Go to deploy some stuff that doesn't need Python as a dependency and is statically compiled :P

Edit: In this case it would be for DO Spaces. Way cheaper.


RClone does all that you require and much more.

RClone: https://rclone.org/ S3 Backend: https://rclone.org/s3/


Be cautious though: since rclone "sync" is based on file metadata (e.g. last modified), it does not recompute local ETags to know which files need to be synced.

For instance, if you "cp -a" a directory and then apply sync, it could do nothing and return success if the copied files were last modified before the ones in S3.

For our use case at work, we wanted to be _sure_ that sync always works as intended, so we ended up recomputing ETags locally and comparing them to the ones in S3 to know what to sync (we got bitten by the last-modified issue before).
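
A minimal sketch of that kind of ETag comparison (bucket, key, and path are placeholders; this only holds for objects uploaded in a single PUT, where the ETag is the plain MD5 hex digest — multipart ETags are computed differently):

    import hashlib
    import boto3

    s3 = boto3.client("s3")

    def local_etag(path):
        # Stream the file so large objects don't need to fit in memory.
        md5 = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                md5.update(chunk)
        return md5.hexdigest()

    def needs_sync(path, bucket, key):
        remote = s3.head_object(Bucket=bucket, Key=key)["ETag"].strip('"')
        return remote != local_etag(path)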


I'm not aware of a supported API in the AWS Go SDK for this, but there is a sketch here: https://github.com/aws/aws-sdk-go/tree/main/example/service/...


Excellent walkthrough, love boto. We’ve recently been using s5cmd, which we’ve found to be ridiculously faster than boto without any extra tricks.

https://github.com/peak/s5cmd


Same!


Sorry to do this here, I know it’s not Amazon support. Is there a way to copy S3 objects from bucket to bucket without sending them through the compute that’s doing the copying?

We have a use case for copying terabytes of content to buckets with different owners and it just seems wasteful to run everything through a client.


Yes: https://docs.aws.amazon.com/AmazonS3/latest/API/API_CopyObje...

I wrote many tests for it... so many tests...
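
With boto3 it's roughly this (bucket and key names are placeholders):

    import boto3

    s3 = boto3.client("s3")

    # Server-side copy: S3 copies the object internally, so no object
    # data flows through the client making the call.
    s3.copy_object(
        Bucket="destination-bucket",
        Key="path/to/object",
        CopySource={"Bucket": "source-bucket", "Key": "path/to/object"},
    )

Note that a single CopyObject call only works for objects up to 5 GB; above that you need a multipart copy, which boto3's managed copy() does for you and is still server-side.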


I’ll have to try this again, it really seemed like the performance was quite low for a 'backend' copy operation. Thanks!

edit: just tried it, definitely seems to be working. not sure what I was seeing earlier, thanks!


Distributed systems are complex, and most likely you were bound to an overloaded webserver. My knowledge of S3 is six years old now, but the copy operation is single-threaded, with a tremendous amount of hashing to ensure durability.

At scale, the performance cost is worth it given the number of checks done internally to ensure the copy is perfect. If you download and upload, you could make it faster; however, getting all the details right so that corruption doesn't happen is tricky.


You can use S3 Batch Operations, which can handle billions of files in just a few minutes.

https://docs.aws.amazon.com/AmazonS3/latest/userguide/batch-...



aws cli cp can copy between buckets.


That’s true, but the data flows through the machine executing the aws cli. The parent was asking how to copy data without it flowing through a compute instance.


I’d double check the cli code path for that parameter. The s3 api does have a copy method, which performs the operation within s3 without compute acting as an intermediary. If that’s not the case for the copy parameter, sounds like a bug that needs to be fixed in the cli tooling.


It is directly checking for s3 to s3[1] and indicates that it wants to copy...

I've read over it and I'm reasonably sure that it's going to issue CopyObject, but it would take me actually getting out paper and pen to really track it down.

The AWS CLI and Boto are a case study in overdoing class hierarchies. Not because there's any obvious AbstractSingletonProxyFactoryBean, but because no single instance stands out as "this is where they went wrong", and nevertheless the end result is a confusing mess of inheritance and objects.

[1]: https://github.com/aws/aws-cli/blob/45b0063b2d0b245b17a57fd9...


Not to mention the insane over-engineering of a Python 2.7-compatible async task-stealing IO loop, which is slow as hell and pitifully delivers a maximum of ~150 MB/s at 30% CPU core activity. That's why anyone who regularly needs to download/upload files from S3 needs an additional library (s5cmd, s3pd, etc.).


Thanks! I don’t know why I had the understanding that it worked the other way. This is useful to know!


No, when copying objects between buckets (aws s3 cp s3://... s3://... and the corresponding sync command), the AWS CLI uses CopyObject (https://docs.aws.amazon.com/AmazonS3/latest/API/API_CopyObje..., previously known as S3 PUT Copy), in which the client doesn't handle any object contents. The call stack eventually reaches https://github.com/boto/s3transfer/blob/develop/s3transfer/c... (or its multipart equivalent), where it calls the botocore binding for this API.



s5cmd[0], written in Go, performs cp in parallel to/from s3 too. I've been using it for some time and it's plenty fast.

https://github.com/peak/s5cmd


No mention of how it compares to s4cmd (which, like s3cmd, is Python).

https://github.com/bloomreach/s4cmd


I had tried both a couple of years back and at that time s5cmd was significantly faster when downloading multiple (100s) of files, hence my loyalty to it. Of course both are actively developed projects, so things might have changed since.


There is aioboto3, which wraps boto3 in asyncio calls. It's a lot simpler than all of this.


Having used aiobotocore, it's pretty good. It can be a bit sketchy about releasing connection pools, though. Normally you shouldn't run into any issues, but we've had trouble with running out of filehandles on AWS Lambda.

I would also point out that for many cases, using a ThreadPoolExecutor works fine.

I maintain a simulation engine that distributes work out to AWS Lambda. So I had two distributors, one thread-based and one async-based. (Also one based on multiprocessing that runs work on a local machine.)

From my tests, the performance is basically equivalent, which is not surprising: most of it was just waiting and then processing incoming responses. Threads work great at this, and botocore is designed to work with threads.

I eventually went with async because Python doesn't let you prioritize threads. That means that if you get a "stop" signal, in the threaded model, the command thread is competing with many worker threads. In the async model, the workers are all in one thread, so the command thread will be woken up per the switch interval[1].

So, broadly, if you're going to need a command channel that must respond in a timely fashion, I'd recommend piling workers into an event loop through async.

The other possibility is to have a command channel run in another process, but then you need to get the fork right, do IPC, etc.

[1]: https://docs.python.org/3/library/sys.html#sys.setswitchinte...
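
For reference, a minimal sketch of the thread-based approach with boto3 (bucket and keys are placeholders; the low-level client is documented as thread-safe, so one client can be shared across workers):

    from concurrent.futures import ThreadPoolExecutor
    import boto3

    s3 = boto3.client("s3")  # low-level clients are thread-safe
    keys = ["data/file-1.bin", "data/file-2.bin", "data/file-3.bin"]

    def download(key):
        # Each worker blocks on network IO, so the GIL isn't a problem here.
        s3.download_file("my-bucket", key, key.split("/")[-1])
        return key

    with ThreadPoolExecutor(max_workers=16) as pool:
        for key in pool.map(download, keys):
            print("done:", key)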


In this case I just wanted to keep it simple and do it the "native" way. I've been hearing a lot about aioboto3 and I'll surely check it out. However, it seems to have some limitations like https://github.com/boto/botocore/issues/458


That issue is just botocore not supporting async, which aiobotocore addresses. It's also very long; is there a particular comment on that issue that you wanted to point out?


I just mentioned the issue because a colleague of mine did some quick tests with aioboto3 and he told me it was downloading the files sequentially. We thought it might be related to that.

You made a great point about going async if you need a working command channel.


Have you benchmarked this solution on a high bandwidth connection? If so, what's the max throughput that it achieved?


Unfortunately I haven't. If someone wants to benchmark this solution I'd be glad to hear about that!


I find it's easier and more convenient to use aws-cli [1] (with subprocess.run) than to use Boto.

The aws-cli is multithreaded for both download and upload. There is even a setting to tweak the number of parallel requests [2].

[1] https://github.com/aws/aws-cli

[2] https://docs.aws.amazon.com/credref/latest/refdocs/setting-s...
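
For example, something like this (bucket and local path are placeholders):

    import subprocess

    # Let the CLI handle the parallelism itself; the number of parallel
    # transfers can be tuned via the max_concurrent_requests setting [2].
    subprocess.run(
        ["aws", "s3", "sync", "s3://my-bucket/some/prefix/", "./local-dir/"],
        check=True,
    )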

Edit: didn't know about s5cmd (from other commenters). Seems like a faster alternative to aws-cli.


Are parallel downloads faster in S3? My intuition says you will hit network or disk bottlenecks before being able to saturate a single core with work?


Depends on where you're downloading from. S3 single-stream GET throughput is throttled to ~40 MB/s. An HDD can write at ~200 MB/s and SSDs can write at ~500 MB/s. NICs on old EC2 instances can do 10 Gbps and the new ones can do 25 Gbps.

If you're on a home internet/mobile connection downloading large files, a single download will likely saturate your connection. If you're on an EC2 instance, you should be able to do 10-100x better by parallelizing.

Source: I used to work on S3.


FWIW: that 40 MB/s single-stream limit doesn’t seem to exist, based on my empirical measurements over the last two years. When did you work on S3?


It does exist, but it's 80MB/s nowadays.


> S3 single stream GET throughput is throttled to ~40MB/sec

More like 80MB/s


Yes, parallel downloads are faster in some cases. There are even extreme cases, mostly around giant EC2 nodes, where you can take this idea a step further, spin up multiple processes to each download parts of a file, and really saturate your network or disk.

My favorite version of this is when you start using shared memory of some fashion to move terabytes of data from S3 to EC2 and work on it without ever hitting a disk.

Not for everyone, and for sure many times the extra milliseconds saved won't matter, but sometimes you really do need to get hundreds of gigabytes or terabytes of data moved as quickly as possible.
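
A rough sketch of the byte-range idea with boto3, shown sequentially for clarity (bucket and key are placeholders; in practice each range would be fetched by its own process or thread, writing into shared memory instead of a list):

    import boto3

    s3 = boto3.client("s3")
    bucket, key = "my-bucket", "big/object.bin"

    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    chunk = 64 * 1024 * 1024  # 64 MB per range

    parts = []
    for start in range(0, size, chunk):
        end = min(start + chunk, size) - 1  # Range is inclusive
        resp = s3.get_object(Bucket=bucket, Key=key,
                             Range=f"bytes={start}-{end}")
        parts.append(resp["Body"].read())

    data = b"".join(parts)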


That's what my small Python package does (https://github.com/NewbiZ/s3pd). Definitely not perfect (it doesn't saturate cores because there's no event loop per process), but the download is split across multiple processes and stored in shared memory.

I've been able to saturate 20 Gbps NICs on EC2 with it (32 cores).


You just described Apache Spark


Aside from what the other replies mentioned, one of the network bottlenecks that you're referring to is the per-TCP connection packet rate limit and congestion avoidance artifacts that many networks impose. The S3 frontend node that serves your request may also experience congestion from "hot" objects it has been assigned, so amortizing your connections over many S3 nodes can help. Finally, for many small objects, having multiple connections helps amortize the per-object latency overhead.

As far as the disk overhead, modern NVMe SSD drives can easily sustain millions of IOPS and multiple gigabytes per second of bandwidth, more than keeping up with a 40 gbps link (such as on a large EC2 instance that does have the connectivity to talk to S3 at that rate).


An extreme case can be when you make heavy use of S3 for EC2 workloads and want to saturate your instance connection. This can go up to 40 Gbps afaik.

Edit: I read this recently, and if I remember correctly there’s a limit of around a thousand parallel connections to S3.

AWS internally probably has higher limits for some of their services, e.g. when you query data in S3 with Athena.


Makes sense, thank you. Wouldn't you want to multiprocess the downloads instead of multithreading them? I imagine you would run into the Python global interpreter lock before being able to push through 40 Gbps?


Yes, you would. The CPU and memory overhead of multiprocessing for this application is why we ended up migrating away from boto3 and to the AWS Go SDK for this specific purpose (https://github.com/chanzuckerberg/s3parcp as I mentioned in another comment). We still use boto3 in other areas, but for maxing out the network connection, golang is far more scalable.


Using multiprocessing, I've been able to quite easily saturate a 20 Gbps EC2 NIC in Python. https://github.com/NewbiZ/s3pd

There is no reason why multiprocessing for IO in Python would use _crazily_ more memory than in another language, when done properly.


When you have a lot of small files, the latency on each operation, though not huge in absolute terms, can end up taking up more time than the data transfer and whatever other work is being done on the content.

(At work, we had an upload job with ~800k files, ranging from <1kb to >100kb. I looked at rearranging how we stored things to avoid small files, but it ended up being a straighter shot to keep using little files and run a worker pool to make the transfer parallel.)


(Er, some files >100MB not >100KB. If they maxed out around 100KB we probably wouldn't have picked S3 to store them!)


In this case I needed to download about 1k files of ~10 MB each. Doing this with multithreading was so much faster than downloading all the files sequentially.


Yes, and if you leverage multipart uploads, then downloading along those part boundaries will have zero contention and you can go extra fast.
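
A rough sketch of that with boto3 (bucket and key are placeholders; PartsCount is only returned when the object was actually uploaded via multipart, and each part could be fetched from its own thread):

    import boto3

    s3 = boto3.client("s3")
    bucket, key = "my-bucket", "big/multipart-object.bin"

    # Asking HeadObject for part 1 reveals how many parts the upload had.
    head = s3.head_object(Bucket=bucket, Key=key, PartNumber=1)
    n_parts = head.get("PartsCount", 1)

    parts = []
    for part in range(1, n_parts + 1):
        resp = s3.get_object(Bucket=bucket, Key=key, PartNumber=part)
        parts.append(resp["Body"].read())

    data = b"".join(parts)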


I used Boto to upload static images from Django to S3. I ran into a few problems where the sync took so long that I stopped relying on it and just added new files locally and to the corresponding S3 folder. Kind of annoying. This was still on Django 1.6; not sure if this has been fixed by now.


I know people love s5cmd (a successor of s4cmd, which itself is a successor of the popular s3cmd), but the standard aws-cli handles parallel transfers well.


Man I had a hard time reading this post because of contrast violations. Light gray on dark gray is VERY hard to read.

It was probably a good article tho


Sorry about that! Next time I work on my blog I'll take that into account and find a better theme/color palette. In the meantime you can switch it to light mode (by clicking the button at the top right), which gives you black text on a white background.


Thank you so much for posting this. I was working on something similar and this helped a lot.


RClone does this and much more.

RClone: https://rclone.org/ S3 Backend: https://rclone.org/s3/


Nice read


Agree, I found it quite helpful!


Agreed!



