I used to work for a company which used S3 as its data backend - copying lots of files in and out of S3 is how the product worked. It was also how we enabled QA/STG and the CI/CD pipelines to interact with some of that data - sanitizing it was just a matter of removing or rewriting metadata.
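For reference, that kind of metadata rewrite is a single `copy_object` call with `metadata_directive: 'REPLACE'` - roughly like the sketch below (the bucket names and the "safe" metadata keys here are made up, not our real rules):

```ruby
require 'aws-sdk-s3'

# Sketch only: bucket names and the "safe" metadata keys are invented.
# S3 metadata can't be edited in place, so sanitizing means copying the
# object with metadata_directive: 'REPLACE' and a filtered metadata hash.
def sanitize_copy(s3, src_bucket, dst_bucket, key)
  head = s3.head_object(bucket: src_bucket, key: key)
  safe = head.metadata.slice('content-origin', 'schema-version')

  s3.copy_object(
    bucket: dst_bucket,
    key: key,
    copy_source: "#{src_bucket}/#{key}",  # keys with special characters need URI-encoding
    metadata: safe,
    metadata_directive: 'REPLACE'
  )
end

s3 = Aws::S3::Client.new
sanitize_copy(s3, 'prod-data-bucket', 'stg-data-bucket', 'some/object.parquet')
```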
We never solved this problem well.
Our first go was a naive `aws s3 cp --recursive` between the source and destination buckets - it works through the copy list in a single process with very limited parallelism. Not good for performance when you have hundreds of thousands of files.
Our next go was to use some custom Ruby scripts to do it, executed under JRuby so that we could take advantage of Java threads to multithread the process. This also didn't work well - the Ruby SDK's `CopyObject` implementation seems to hang some threads if you transfer multiple files concurrently. Our team was primarily Ruby-based, so going pure Java was an option that we didn't pursue.
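For the curious, the shape of what we were attempting was roughly this - a shared queue of keys and a pool of JVM threads each issuing `copy_object` calls (bucket names and the thread count here are illustrative, not our real setup):

```ruby
require 'aws-sdk-s3'

# Minimal sketch of the threaded-copy attempt; bucket names and the
# thread count are illustrative. Under JRuby these are real JVM threads.
SRC = 'source-bucket'
DST = 'dest-bucket'

s3    = Aws::S3::Client.new
queue = Queue.new

# Producer: list every key in the source bucket onto the queue.
s3.list_objects_v2(bucket: SRC).each do |page|
  page.contents.each { |obj| queue << obj.key }
end
queue.close

# Workers: each thread pops keys and issues CopyObject calls.
workers = Array.new(16) do
  Thread.new do
    while (key = queue.pop)  # pop returns nil once the closed queue drains
      s3.copy_object(bucket: DST, key: key, copy_source: "#{SRC}/#{key}")
    end
  end
end

workers.each(&:join)
```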
We then heard about S3 Batch Operations and launched a POC, which failed quickly: the Batch Operations copy doesn't work on objects larger than 5GB. That's an immediate non-starter for our use case.
We ended up with a horrible hack after many person-weeks of work on this problem:
1) Get a list of all of the files in the source bucket, and put them on a queue.
2a) If a file is small, copy it using a native Java/Ruby threaded worker.
2b) If the file is large, fork an `aws` CLI process and have it `s3 cp ...` the file (the dispatch is sketched below).
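The size-based dispatch in 2a/2b boils down to something like this (the 500 MB cutoff and the error handling are invented; the listing/queue/thread plumbing is the same shape as the sketch above):

```ruby
# Sketch of the dispatch only; the cutoff and error handling are invented.
# Assumes the worker queue holds [key, size] pairs from the bucket listing.
LARGE_OBJECT_BYTES = 500 * 1024 * 1024

def copy_one(s3, key, size, src:, dst:)
  if size < LARGE_OBJECT_BYTES
    # Small object: one CopyObject call from the SDK does the job.
    s3.copy_object(bucket: dst, key: key, copy_source: "#{src}/#{key}")
  else
    # Large object: a single CopyObject call caps out at 5 GB, so hand the
    # file to the CLI, which does multipart copies on its own.
    ok = system('aws', 's3', 'cp', "s3://#{src}/#{key}", "s3://#{dst}/#{key}", '--quiet')
    raise "aws s3 cp failed for #{key}" unless ok
  end
end
```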
I'm not surprised at all that copying data took these seven engineers two full days.
I had to transfer 50TB+ to a new bucket, applying several rules based on file metadata. I scanned all the file names, put them in a queue, then ran a custom C# program on several of the largest instances to process the queue. It maxed out CPU and bandwidth and worked great.
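The per-object work was nothing fancy - conceptually something like this (shown in Ruby rather than C# to match the rest of the thread; the rule, names, and routing are invented):

```ruby
require 'aws-sdk-s3'

# Sketch only: the metadata rule, bucket names, and key prefix are invented.
# Each worker thread pulls a key off the queue and runs this.
def process_key(s3, key, src:, dst:)
  meta = s3.head_object(bucket: src, key: key).metadata
  # Example rule: route objects tagged as raw data under an archive prefix.
  dest_key = meta['data-class'] == 'raw' ? "archive/#{key}" : key
  s3.copy_object(bucket: dst, key: dest_key, copy_source: "#{src}/#{key}")
end
```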
I don't recommend Ruby for anything beyond the simplest websites. Don't try to make it perform; just use a better language that can handle the performance you need.
(Sarcasm) You should “buck up” and learn Rust or C. With a real language you won’t have to deal with the overhead of a runtime and will really be able to saturate the resources.