It is a bit of a hidden gem, but Metaflow includes a Boto-based, highly parallelized, error-tolerant S3 client that Netflix routinely uses to get 10-20 Gbps of throughput between EC2 and S3.
Technically it is independent from Metaflow, so you could use it as a stand-alone, high-performance S3 client.
Really interesting going through your code, thanks for sharing.
Why did you opt for running the parallelization code in a separate python process (s3op)?
You might want to update the section "Caution: Overwriting data in S3" in the docs, since S3 has offered strong read-after-write consistency since December 2020.
re: separate process - fault tolerance is a key requirement. There are myriad ways that highly parallelized, network- and data-intensive code can fail, so isolating it in a separate process is safer than trying to try/except everything and hoping for the best.
Good catch re: the warning about consistency! The docs were written before the change :)
The AWS Go SDK now has a connection pool based S3 download/upload manager API that allows saturating your (e.g. 40Gbit/s EC2-S3) network connection using far less memory and CPU than is possible with Python.
With AWS it's really hard to tell which of their SDKs are at cutting-edge feature parity and which aren't. That some of them are and some of them are not is a real shame.
Just last week I wrote basically the same thing as an ad-hoc solution using boto3, because I had tens of TB of data to pull out of Glacier and distribute across S3 buckets. It wasn't a big deal because I'm experienced at writing parallel network code in Python and moving big data streams, and boto3 has good documentation, but things like this really shouldn't be left as an exercise for the SDK consumer.
Be cautious though: since rclone "sync" is based on file metadata (e.g. last modified), it does not recompute local ETags to determine which files need to be synced.
For instance, if you "cp -a" a directory and then apply sync, it could do nothing and return success if the copied files were last modified before the ones in S3.
For our use case at work, we wanted to be _sure_ that sync always works as intended, so we ended up recomputing ETags locally and comparing them to the ones in S3 to decide what to sync (we got bitten by the last-modified issue).
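For reference, here's a minimal sketch of how one might recompute an S3-style ETag locally. It assumes the object was uploaded either in a single part (plain MD5) or in multipart with a fixed part size (MD5 of the concatenated part MD5s, suffixed with the part count); the 8 MiB default matches boto3's multipart_chunksize, but you'd need to match whatever chunk size your uploads actually used, and note that SSE-KMS encrypted objects don't expose an MD5-based ETag at all.

```python
import hashlib


def s3_etag(data: bytes, part_size: int = 8 * 1024 * 1024) -> str:
    """Compute the ETag S3 would report for this payload.

    Single-part uploads get a plain MD5 hex digest; multipart uploads
    get the MD5 of the concatenated binary part digests, suffixed with
    "-<part count>". part_size must match the chunk size used at
    upload time (8 MiB is boto3's default, assumed here).
    """
    if len(data) <= part_size:
        return hashlib.md5(data).hexdigest()
    part_digests = [
        hashlib.md5(data[i:i + part_size]).digest()
        for i in range(0, len(data), part_size)
    ]
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"
```

Comparing this against the ETag from HeadObject tells you whether the content actually differs, independent of timestamps.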
Sorry to do this here, I know it’s not Amazon support. Is there a way to copy S3 objects from bucket to bucket without sending them through the compute that’s doing the copying?
We have a use case for copying terabytes of content to buckets with different owners and it just seems wasteful to run everything through a client.
Distributed systems are complex, and most likely you were bound to an overloaded webserver. My knowledge of S3 is six years old now, but the copy operation is single-threaded, with a tremendous amount of hashing to ensure durability.
At scale, the performance cost is worth it given the number of checks done internally to ensure the copy is perfect. If you download and upload, then you could make it faster however getting all the details right such that corruption didn't happen is tricky.
That’s true but data flows through the machine executing aws cli. The parent was asking how to copy data without it flowing through a compute instance.
I’d double check the cli code path for that parameter. The s3 api does have a copy method, which performs the operation within s3 without compute acting as an intermediary. If that’s not the case for the copy parameter, sounds like a bug that needs to be fixed in the cli tooling.
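For what it's worth, a server-side copy is a one-liner if you drive the API yourself. A sketch (the helper takes an already-constructed client; you'd pass `boto3.client("s3")`): CopyObject handles objects up to 5 GB, and boto3's managed `copy` transparently switches to multipart UploadPartCopy above that, still without the bytes transiting your machine.

```python
def copy_source(bucket: str, key: str) -> dict:
    """Build the CopySource parameter that CopyObject expects."""
    return {"Bucket": bucket, "Key": key}


def server_side_copy(s3_client, src_bucket: str, dst_bucket: str, key: str) -> None:
    """Copy an object bucket-to-bucket without downloading it.

    s3_client.copy issues CopyObject (or UploadPartCopy for objects
    over 5 GB), so S3 moves the data internally; only the API calls
    go through this machine.
    """
    s3_client.copy(copy_source(src_bucket, key), dst_bucket, key)
```

The caller still needs permissions on both buckets (and, for cross-account copies, appropriate bucket policies or ACLs), but no object bytes flow through the client.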
It is directly checking for S3-to-S3 [1] and indicates that it wants to copy...
I've read over it and I'm reasonably sure that it's going to issue CopyObject, but it would take me actually getting out paper and pen to really track it down.
The AWS CLI and Boto are a case study in overdoing class hierarchies. Not because there's any obvious AbstractSingletonProxyFactoryBean, but rather that there's no instance that stands out as "this is where they went wrong" and nevertheless the end result is a confusing mess of inheritance and objects.
Not to mention the insane over-engineering of a Python 2.7-compatible async task-stealing IO loop, which is slow as hell and pitifully delivers a maximum of ~150 MB/s at 30% CPU core activity. That's why anyone needing to regularly download/upload files from S3 needs an additional library (s5cmd, s3pd, etc.)
I had tried both a couple of years back and at that time s5cmd was significantly faster when downloading multiple (100s) of files, hence my loyalty to it. Of course both are actively developed projects, so things might have changed since.
Having used aiobotocore, it's pretty good. It can be a bit sketchy about releasing connection pools, though. Normally you shouldn't run into any issues, but we've had trouble with running out of filehandles on AWS Lambda.
I would also point out that for many cases, using a ThreadPoolExecutor works fine.
I maintain a simulation engine and it's distributing work out to AWS Lambda. So I had two distributors, one thread based and one async based. (Also one based on multiprocessing that runs work on a local machine.)
From my tests, the performance is basically equivalent, which is not surprising: most of it was just waiting and then processing incoming responses. Threads work great at this, and botocore is designed to work with threads.
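The thread-based distributor is roughly this shape (a sketch, not the actual code; the helper takes a client so you'd pass in `boto3.client("s3")`, and botocore clients are documented as thread-safe to share):

```python
import os
from concurrent.futures import ThreadPoolExecutor


def download_all(s3_client, bucket: str, keys, dest_dir: str, workers: int = 16):
    """Download many objects concurrently with a thread pool.

    One shared client, many threads: each worker spends nearly all of
    its time blocked on the network, so the GIL is not the bottleneck.
    """
    def fetch(key):
        # Flatten key paths for this sketch; a real tool would mkdir.
        path = os.path.join(dest_dir, key.replace("/", "_"))
        s3_client.download_file(bucket, key, path)
        return key

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, keys))
```

As the parent says, an async version of the same loop performs about the same; the difference only shows up in how promptly a control signal gets scheduled.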
I eventually went with async because Python doesn't let you prioritize threads. That means that if you get a "stop" signal, in the threaded model, the command thread is competing with many worker threads. In the async model, the workers are all in one thread, so the command thread will be woken up per the switch interval[1].
So, broadly, if you're going to need a command channel that must respond in a timely fashion, I'd recommend piling workers into an event loop through async.
The other possibility is to have a command channel run in another process, but then you need to get the fork right, do IPC, etc.
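The async shape I mean is roughly this (a toy sketch, not the engine's code): all workers share one event loop and yield at every await, so a coroutine watching a stop event gets scheduled promptly instead of competing with worker threads for the GIL.

```python
import asyncio


async def run_workers(jobs, handle, stop: asyncio.Event, concurrency: int = 4):
    """Drain a job queue with cooperative workers in one event loop.

    A separate "command" coroutine can set `stop` at any time; since
    workers yield at every await, the check is reached quickly without
    any thread-priority games.
    """
    queue: asyncio.Queue = asyncio.Queue()
    for job in jobs:
        queue.put_nowait(job)
    results = []

    async def worker():
        while not stop.is_set():
            try:
                job = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # queue drained
            results.append(await handle(job))

    await asyncio.gather(*(worker() for _ in range(concurrency)))
    return results
```

Setting the event mid-run makes every worker exit at its next loop iteration, which is the timely command-channel behavior described above.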
In this case I just wanted to keep it simple and do it the "native" way. I've been hearing a lot about aioboto3 and I'll surely check it out. However, it seems to have some limitations like https://github.com/boto/botocore/issues/458
That issue is just botocore not supporting async, which aiobotocore addresses. It's also very long; is there a particular comment on that issue that you wanted to point out?
I just mentioned the issue because a colleague of mine did some quick tests with aioboto3 and he told me it was downloading the files sequentially. We thought it might be related to that.
You made a great point about going async if you need a working command channel.
Depends where you're downloading from. S3 single stream GET throughput is throttled to ~40MB/sec. An HDD can write at ~200MB/sec and SSDs can write ~500MB/sec. NICs on old EC2 instances can do 10Gbps and the new ones can do 25Gbps.
If you're on a home internet/mobile connection downloading large files, a single download will likely saturate your connection. If you're on an EC2 instance, you should be able to do 10-100x better parallelizing.
Yes, parallel downloads are faster in some cases. There are even extreme cases, mostly around giant EC2 nodes where you can take this idea a step further and spin up multiple processes to each download parts of a file, and really saturate your network or disk.
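Splitting a single object across connections comes down to HTTP Range headers; here's a small sketch (threads rather than processes for brevity; the client parameter would be `boto3.client("s3")`, and a real tool would get `size` from HeadObject first):

```python
from concurrent.futures import ThreadPoolExecutor


def byte_ranges(size: int, part_size: int):
    """Split an object of `size` bytes into HTTP Range header values,
    e.g. 'bytes=0-8388607'. Each range can be fetched by its own
    GetObject call, on its own connection."""
    return [
        f"bytes={start}-{min(start + part_size, size) - 1}"
        for start in range(0, size, part_size)
    ]


def parallel_get(s3_client, bucket: str, key: str, size: int,
                 part_size: int = 8 * 1024 * 1024, workers: int = 8) -> bytes:
    """Fetch one object as concurrent ranged GETs and reassemble it,
    sidestepping the per-connection throughput cap on a single GET."""
    def fetch(rng):
        resp = s3_client.get_object(Bucket=bucket, Key=key, Range=rng)
        return resp["Body"].read()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(fetch, byte_ranges(size, part_size)))
```

The multi-process variants mentioned above do the same thing but write each range into shared memory instead of joining bytes in one interpreter.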
My favorite version of this is when you start to use shared memory of some fashion to move terabytes of data from S3 to EC2 to work on it without ever hitting a disk.
Not for everyone, and for sure many times the extra milliseconds saved won't matter, but sometimes you really do need to get hundreds of gigabytes or terabytes of data moved as quickly as possible.
That's what my small python package does (https://github.com/NewbiZ/s3pd). Definitely not perfect (not saturating cores because no use of event loop per process) but download is split in multiple processes and stored in shared memory.
I've been able to saturate 20 Gbps NICs on EC2 with it (32 cores).
Aside from what the other replies mentioned, one of the network bottlenecks that you're referring to is the per-TCP connection packet rate limit and congestion avoidance artifacts that many networks impose. The S3 frontend node that serves your request may also experience congestion from "hot" objects it has been assigned, so amortizing your connections over many S3 nodes can help. Finally, for many small objects, having multiple connections helps amortize the per-object latency overhead.
As far as the disk overhead, modern NVMe SSD drives can easily sustain millions of IOPS and multiple gigabytes per second of bandwidth, more than keeping up with a 40 gbps link (such as on a large EC2 instance that does have the connectivity to talk to S3 at that rate).
Makes sense, thank you. Wouldn't you want to multiprocess the downloads instead of multithreading? I imagine you would run into the Python global interpreter lock before being able to push through 40 Gbps?
Yes, you would. The CPU and memory overhead of multiprocessing for this application is why we ended up migrating away from boto3 and to the AWS Go SDK for this specific purpose (https://github.com/chanzuckerberg/s3parcp as I mentioned in another comment). We still use boto3 in other areas, but for maxing out the network connection, golang is far more scalable.
When you have a lot of small files, the latency on each operation, though not huge in absolute terms, can end up taking up more time than the data transfer and whatever other work is being done on the content.
(At work, we had an upload job with ~800k files, ranging from <1kb to >100kb. I looked at rearranging how we stored things to avoid small files, but it ended up a straighter shot to continue to use little files but use a worker pool to make the transfer parallel.)
In this case I needed to download about 1k files of size ~ 10mb. Doing this with multithreading was so much faster than downloading all the files sequentially.
I used Boto to upload static images from Django to S3. I ran into a few problems where the sync took so long that I stopped relying on it, and instead just added new files both locally and to the corresponding S3 folder by hand. Kind of annoying. This was still on Django 1.6, so I'm not sure whether it has been fixed since.
I know people love s5cmd (a successor to s4cmd, which is itself a successor to the popular s3cmd), but the standard aws-cli handles parallel transfers well.
Sorry about that! Next time I work on my blog I'll take that into account and find a better theme/color palette. In the meantime you can change it to light mode (clicking on the top right button) and there you have black text on white background.
See docs here https://docs.metaflow.org/metaflow/data#store-and-load-objec...
And code here https://github.com/Netflix/metaflow/tree/master/metaflow/dat...
(I wrote it originally - AMA if curious)