Don't thank us, thank Hurricane Electric (he.net).
20 years ago, I was on irc.lightning.net and thought it was the best-run EFnet server. The MOTD advertised their IP transit services. They became he.net, and that is why rsync.net does (most) of its IP transit with them.
No, the tradeoff is that even with volume, rsync.net storage costs $15/TB-month [0], which is ~3x the price of managed alternatives [1,2] and ~10x DIY alternatives [3]. To be fair, ZFS snapshots are "free", but whether that's worth the 3x cost will depend on your data.
The lack of bandwidth charges sounds really nice, but [1] and [3] offer that too, and unlike them, rsync.net doesn't support protocols suited to bandwidth-intensive applications (like serving large files to end users). It only supports SSH-based transfers [4].
If you look at the pricing on that rclone page[1] you'll see that at quantity it is as low as 0.5 cents per GB per month.
As for bandwidth intensive protocols, we have HPN-SSH[2] patches built into our environment which allow for high bandwidth transfers, over SSH, over long WAN links.
> If you look at the pricing on that rclone page[1] you'll see that at quantity it is as low as 0.5 cents per GB per month.
That's good but at the volume needed to reach it, I imagine that you're in the realm of "contact us and get a special deal" at other providers (e.g. AWS, GCP, Azure) too.
Plus, we haven't even got to the "you have to pay for capacity, not usage" aspect of rsync.net pricing.
Pricing is the big thing I wish you'd fix. Your service has a bunch of cool features but they're just not worth that price, not when I can get most of the same features from Hetzner for a fraction of the cost.
> As for bandwidth intensive protocols, we have HPN-SSH[2] patches built into our environment which allow for high bandwidth transfers, over SSH, over long WAN links.
That's cool (honestly, not being condescending) but my point was that for most of the reasons you'd be excited about free bandwidth, rsync.net isn't a viable solution. For example you can't host images, videos, software updates, machine learning datasets or other large binaries and serve them over HTTP on the public internet.
"rsync.net isn't a viable solution. For example you can't host images, videos, software updates, machine learning datasets or other large binaries and serve them over HTTP on the public internet."
Correct. We don't do these things and we never will.
When you 'nmap' an rsync.net storage array, you get:
22 TCP
... and that's it. There are no other services running, or offered. There are no interpreters in the environment. The filesystems are mounted noexec,nosuid.
To be fair, storage is the only thing they charge for, so lowering that price would cut directly into their income, whereas other providers also charge for bandwidth and extra support.
But having no cost other than the storage itself is a relief, and a service that has been running for nearly 20 years gives you peace of mind about its reliability. I also hear their support is good.
I would want a somewhat more modern web interface, though.
BorgBase does look interesting to me, except that only two years of operation can't tell you much about its reliability and longevity.
Google Cloud doesn't charge for egress to Google Drive either. If you use GCP to transfer files to Google Drive via rclone, you can then download them elsewhere without any charges.
+1 on rclone - one of my favorite tools in the toolbox.
With sufficient --transfers and --checkers values you can easily saturate 10-20 gigabit/second links, so with a handful of machines you can transfer 25TB in about an hour, no engineers required.
Worth noting that rclone also supports many other storage providers or even plain SSH, so it pretty much obsoletes rsync. It also has many other nifty features useful outside of the "sync files" use case.
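For reference, a sketch of a scripted invocation with those knobs turned up (remote names and flag values here are invented; tune them to your link):

    import subprocess

    # Hypothetical rclone remotes "src" and "dst"; --transfers controls how many
    # files move in parallel, --checkers how many are compared in parallel.
    subprocess.run(
        [
            "rclone", "sync", "src:source-bucket", "dst:dest-bucket",
            "--transfers", "64",
            "--checkers", "128",
            "--fast-list",   # fewer LIST calls against S3-compatible backends
            "--progress",
        ],
        check=True,  # raise if rclone exits non-zero
    )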
The source bucket had a ton of small files and most of the time was spent making network requests (O(number of files)). The suggested solution was to open a ticket for a batch operation. Would rclone have helped in this scenario?
Yep, as mentioned in the other comment, with configurable `--transfers` & `--checkers`, the only bottleneck is the amount of CPU you have available and can allocate.
> the only bottleneck is the amount of CPU you have available and can allocate.
In this case S3 will be the bottleneck. 25 TB will have around 700M files (assuming it's as dense as a Linux installation). S3 can do 5,000 operations per second, so 700M requests comes out to about 140,000 seconds - at least 40 hrs, assuming everything works at peak speed during the whole process.
Silly question... can't one just cut off or limit access to the "previous" bucket, and then give the delete whatever time it needs?
I saw in the Reddit discussion that one can set up roles/privileges (as expected), use the new user/role/group X for the move, and cut off all the others.
I understand that the original planning went bust (2h --> 48h), but what baffled me was the "It sounds more like you had 2 hours for the delete part of the operation."
Agreed, I regularly sync with servers that have >1M files and rclone makes the whole thing a joy. Used it once with S3 and it performed admirably there as well.
Doesn't rclone still need to download files from the source bucket to the server before syncing them to the destination bucket? That means it still has some upper limit (the CPU and network of wherever we run it).
S3 replication, plus emailing AWS support to apply replication to existing objects (per the Reddit discussion), seems like a better method.
I have noticed rclone deletes sometimes don't ever get cleared out of Google Drive's trash cache, but that's something you can work around with some periodic check scripts.
That trash behavior isn't specific to rclone and probably something on Google's end, but rclone is what I notice it with most.
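For what it's worth, rclone has a cleanup command that empties the trash on remotes that support it (Google Drive among them), so the periodic check can be a one-liner on a schedule; a small sketch (remote name "gdrive" is hypothetical):

    import subprocess

    # Empty the Drive trash for a (hypothetical) remote named "gdrive";
    # schedule this from cron or a systemd timer.
    subprocess.run(["rclone", "cleanup", "gdrive:"], check=True)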
I did get an update from Google recently that they are implementing automatic clearing of trash after 30 days, which may help you.
The exact wording of the relevant section of the mail:
We are writing to let you know that starting October 13, 2020, Google Drive is making a change so that its trash behaves more consistently with the rest of our G Suite services with regards to automatic deletion. This means that any file that is put into Google Drive’s ‘My Drive’ trash will be automatically deleted after 30 days. Items in trash will still continue to consume quota.
Please note that starting October 13, 2020 any files already in a user’s trash will remain there for 30 days. After the 30-day-period files that have been in the trash for longer than 30 days will begin to be automatically deleted.
What does this mean for my organization?
Any file that has been in the trash for longer than 30 days after October 13, 2020 will be automatically deleted. We will be showing in-app messaging in Drive starting September 15, 2020 and in our Editors products (such as Google Docs and Google Forms) starting September 29, 2020.
A few things to note:
* Files in shared drives trash are already automatically deleted after 30 days.
* These changes affect items that are trashed from any device and any platform.
* Files deleted from Drive File Stream will be purged from the system trash after 30 days. There is no impact to Backup and Sync behavior.
* G Suite administrators can still restore items from any emptied trash on behalf of their users for up to 25 days.
* Retention policies set by G Suite administrators in Google Vault are not affected by this change.
* These changes apply to all G Suite editions and end-users.
I can vouch for this too. I have a nightly cron job that uses rclone to sync my Apple Time Machine disk to Google Drive and I have yet to see it fail. Particularly, the Time Machine disk is a bunch of tiny (8 MB) files.
Simply copying lots of files over an extended period can break your applications elsewhere:
"Amazon S3 automatically scales in response to sustained request rates above these guidelines, or sustained request rates concurrent with LIST requests. While Amazon S3 is internally optimizing for the new request rate, you might receive HTTP 503 request responses temporarily until the optimization is complete. This might occur with increases in request per second rates, or when you first enable S3 RTC. During these periods, your replication latency might increase. The S3 RTC service level agreement (SLA) doesn’t apply to time periods when Amazon S3 performance guidelines on requests per second are exceeded."
I used to work for a company which used S3 as its data backend - copying lots of files in and out of S3 is how the product worked. It was also how we enabled QA/STG and the CI/CD pipelines to interact with some of that data - sanitizing it was just a matter of removing or rewriting metadata.
We never solved this problem well.
Our first go was a naive `aws s3 cp --recursive` between the source and destination buckets - this copies the files sequentially, one at a time. Not good for performance when you have hundreds of thousands of files.
Our next go was to use some custom Ruby scripts to do it, executed under JRuby so that we could take advantage of Java threads to multithread the process. This also didn't work well - the Ruby SDK's `CopyObject` implementation seems to hang some threads if you transfer multiple files concurrently. Our team was primarily Ruby-based, so going pure Java was an option that we didn't pursue.
We then heard about S3 Batch Operations, and launched an unsuccessful POC, because Batch Operations don't work on files larger than 5GB. That's an immediate non-starter for our use case.
We ended up with a horrible hack after many person-weeks of work on this problem:
1) Get a list of all of the files in the source bucket, and put them on a queue.
2a) If a file is small, copy it using a native Java/Ruby threaded worker.
2b) If the file is large, fork an `aws` CLI process and have it `s3 cp ...` the file.
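A rough Python sketch of that shape, for the curious (the original was Ruby/JRuby; bucket names and the size cutoff here are made up):

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    import boto3

    SRC, DST = "source-bucket", "dest-bucket"   # hypothetical names
    BIG = 4 * 1024**3   # conservative cutoff; copy_object caps out around 5 GB

    s3 = boto3.client("s3")

    def copy_one(key, size):
        if size < BIG:
            # server-side copy; no data flows through this machine
            s3.copy_object(Bucket=DST, Key=key,
                           CopySource={"Bucket": SRC, "Key": key})
        else:
            # let the aws CLI do the multipart copy for large objects
            subprocess.run(["aws", "s3", "cp",
                            f"s3://{SRC}/{key}", f"s3://{DST}/{key}"],
                           check=True)

    with ThreadPoolExecutor(max_workers=32) as pool:
        for page in s3.get_paginator("list_objects_v2").paginate(Bucket=SRC):
            for obj in page.get("Contents", []):
                pool.submit(copy_one, obj["Key"], obj["Size"])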
I'm not surprised at all that copying data took these seven engineers two full days.
I had to transfer 50TB+ to a new bucket with several rules based on the file metadata. Scanned all the file names, put them in a queue, then ran a custom C# program on several of the largest instances to process the queue. Maxed out CPU and bandwidth and it worked great.
I don't recommend Ruby for anything beyond the simplest websites. Don't try to make it perform; just use a better language that can handle the performance you need.
(Sarcasm) You should “buck up” and learn Rust or C. With a real language you won’t have to deal with the overhead of a runtime and will really be able to saturate the resources.
> I’m surprised none of the engineers either knew about
You'll probably also be surprised to learn that the original article not only specified that link but also that they chose the first listed option, which probably wasn't a surprising choice, given that it was listed first.
From my experience, if you have a bucket with billions of objects it will still take a few days after the objects are deleted by the lifecycle rules before you're able to delete the bucket. The objects are non-recoverable, but I believe due to the eventually consistent nature of S3 you can't delete the bucket until the objects are all fully deleted.
Yeah, S3 is frankly a disaster. People use it like a filesystem and if you have billions of objects, it very quickly becomes a nightmare to manage. Size of data is a complete non-issue, it's all about the number of objects.
I will point out that deleting that much data from hard drives isn't a fast operation either. Same for tape.
Sure, you can physically destroy the drives. But you could also just use KMS on the S3 bucket objects, toss the key, and close the bucket. Gonezo.
It's strange that people just shrug their shoulders and accept this state of things in 2020.
Any system these days should allow near-instantaneous admin operations for just about any reasonable request. Subtree move, delete, or rename should all be instant. Similarly, any simple ACL changes should be instant.
Similarly, data movement should scale linearly with the volume of data, not the number of objects.
I'm definitely one of those who treat it like a filesystem :') I'd love to hear a different point of view. Would you be willing to share some thoughts on how I should be thinking about S3, so I can fully appreciate the right way of using it?
That entirely depends on what you're using it for. Mimicking a typical block-level filesystem with object storage will lead to problems beyond a few thousand objects or shallow "folder" levels.
Object storage is best for larger files with bulk updates. Either write new files entirely (best for S3) or append/replace where supported (like Azure/GCP). Any further structure and metadata should be maintained at a higher level with pointers to the raw files in the object store.
There are products like Stablebit Cloud Drive [1], ObjectiveFS [2], and other storage companies that handle all this for you. You can also look at UtahFS [3] to see open-source code for building something yourself.
I had to do this before Batch Operations, replication, and Inventory existed. Support, our SA, and our TAM were all like... good luck. They did offer to generate an inventory, but they delivered it two weeks later. Everything at the time said s3distcp was the only option. It honestly wasn't that bad, except it did a serial list before it started copying, which would have taken days, so it needed some tomfoolery.
At least when I looked into bucket replication (a long time ago - I don't think it was GA at the time), the sync operation only affected _new_ keys, not existing keys, so it wouldn't have worked for this purpose.
By default, you'd have to do something to touch the keys. But AWS support is more than happy to fire off a full init for a new replication if you ask them.
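If you'd rather not wait on support, the usual DIY "touch" is an in-place copy that replaces metadata, which counts as a new write and gets picked up by the replication rule. A minimal boto3 sketch (bucket name hypothetical; note copy_object only handles objects up to ~5 GB):

    import boto3

    BUCKET = "source-bucket"   # hypothetical
    s3 = boto3.client("s3")

    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            # Copy the object onto itself; REPLACE forces a real write,
            # which the replication rule then treats as a new object.
            s3.copy_object(
                Bucket=BUCKET,
                Key=obj["Key"],
                CopySource={"Bucket": BUCKET, "Key": obj["Key"]},
                MetadataDirective="REPLACE",
                Metadata={"touched": "1"},
            )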
GCP and Azure both have much better built-in tooling that would make this a few clicks. Their storage system design is also much better. It's unfortunate that the industry has standardized around S3 just because it's a first mover rather than pushing Amazon's product to get better.
Fun problem. Without replication, here's how I would do it.
A pipeline: S3 list worker(s) (see below for optimizations) -> SQS -> AWS Lambda (s3 sync) -> SQS -> AWS Lambda (s3 delete of the original key).
Now the secret: to make it really go vrrooom you need to sufficiently parallelize the list operations. For this, enable S3 storage analytics on the bucket a few days beforehand to get a manifest of all of the files. Use that list to partition the prefix/key space evenly into 20 or 30 (or 100, or 200) workers, each with a start and end key prefix.
With SQS, you can make sure the sync and delete actions succeed, and with lambda they'll run in a serverless fashion as fast as you can feed the queues.
Amazon will absolutely throttle you when making a lot of requests. Reach out to them ahead of time. Make your Lambda code robust and idempotent, so that on failure you can simply lean into the retries.
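A sketch of what the copy Lambda in such a pipeline could look like (bucket names and queue URL are hypothetical, read from env vars; each SQS message body is assumed to be a bare object key):

    import os

    import boto3

    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")

    SRC = os.environ["SRC_BUCKET"]            # hypothetical env config
    DST = os.environ["DST_BUCKET"]
    DELETE_QUEUE = os.environ["DELETE_QUEUE_URL"]

    def handler(event, context):
        # Triggered by SQS; each record body is an object key to move.
        for record in event["Records"]:
            key = record["body"]
            # Server-side copy; safe to retry, since copying twice is harmless.
            s3.copy_object(Bucket=DST, Key=key,
                           CopySource={"Bucket": SRC, "Key": key})
            # Hand the key to the second queue for deletion from the source.
            sqs.send_message(QueueUrl=DELETE_QUEUE, MessageBody=key)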
Something sounds fishy about this. What exactly were the 7 engineers doing during this time? And none of them thought to Google it or call support?
TBH, I won't claim I know the best way to do this, but I'm 100% sure a call to Amazon support would sort things out. Copying data between buckets is common enough I'm sure they have a good way to handle this.
I’m sure you’ve run into a less egregious version of this at some point yeah? Automation doesn’t always take longer than doing the task. But it does mean zero progress for part of it.
In any group of 7 engineers there is at least one who is more than happy to do a repetitive task. The pay is the same. Whether there is one who would rather do anything but that? I think that depends on luck, and whether the other 6 chased them off already.
The “it’s not so bad” crowd always fits in easily, but they might not be that effective when push comes to shove.
They were keeping the ssh session alive because apparently they didn't know about tools like 'screen'. Seriously. OP confirms this in the original thread.
Even if you were moving 25TB of data on a server under your physical control, it could be a significant engineering effort. Maybe not 14 person-days but at least a few.
What's amazing is that some teams assume the same operation will be more performant in the cloud, despite being wrapped in layers of HTTP-based API abstractions.
If it's within the same cloud, though, why wouldn't there be an internal shortcut so it doesn't have to go fully "out" then "back in"?
Heck, why does it have to physically move at all? Why doesn't this effectively come down to a rename within AWS' system, like when you "move" a file within the same hard drive?
That's probably possible, but not publicly exposed. Like a lot of the people on the Reddit post said, reach out to support. They have a LOT of data and knobs to turn, in my experience.
Not OP but we also had to move a relatively large S3 bucket (significantly larger than 25TB) and unfortunately AWS doesn't have a way to change the ownership of the S3 bucket. My guess is it has to do with how the underlying system stores the objects.
We ended up writing a Go program to copy the objects from one bucket to another to help parallelize the migration. This was also before AWS had announced S3 Batch Operations, so I'm not sure how much better it would be to use that today. The deletion of the bucket also took us over a week. Even though we had deleted all objects in the bucket, due to the eventually consistent nature of S3 we weren't allowed to delete the bucket until all objects were fully removed from S3. All we could get from AWS support was to wait a few more days and reach back out if we couldn't delete it then.
Edit: depending on your object naming scheme you might also run into the S3 prefix rate limits.
Your advice is completely valid, yet it is a little absurd to be in a situation where something as basic as "mv" is a support-only technology, not available to mere mortals spending millions a month.
I've seen many similar situations where something that boils down to "UPDATE SET x = y WHERE z" requires a support ticket at a minimum, or is flat-out impossible because even their internal staff don't know how to do it.
At least when I did something similar, moving the data itself was not the problem - it's entirely S3 server-side. The problem is round-trip times on the empty-body API calls themselves; the aws CLI, being Python, is slow and maxes out on HTTPS/signature overhead.
If the copy API supported wildcards, there wouldn't be any discussion about this at all.
You seem to be conflating person-days and performance?
I'll address performance first: Sure, it could be slow if I can't move the actual hard drives and I only have 1Gbps links. But of course I expect the cloud to be faster than "artificially bottlenecked at 1Gbps"! How is that expectation "amazing"?
And as for engineering effort, something would have to be deeply wrong for a 25TB transfer of one directory of normal files to take a significant amount of it. This is a problem you point rsync at, watch for a couple of minutes, and then go to lunch.
>Even if you were moving 25TB of data on a server under your physical control, it could be a significant engineering effort.
But why?
I'm genuinely curious. The other week I had an HDD throw up some errors. I copy-pasted around 1-2 TB of its contents onto another drive in Windows. Sure, it took a while, but it was all done by morning.
It didn't even occur to me that problems could arise with this.
> What's amazing is that some teams assume the same operation will be more performant in the cloud, despite being wrapped in layers of HTTP-based API abstractions.
I have worked with such developers, whose thoughts have been --- for lack of a better word --- clouded by marketing.
It should be more performant in the cloud. That flexibility and scalability is what you're paying for. There are various tools and strategies for moving this data as explained by several posts here that would've made the process much easier.
TL;DR - The delete is the hard part. You can use S3 Batch Operations with an inventory list to do the copy very quickly. Alternatively, you can set up replication and "touch" each file using a self-copy CLI command once the replication policy is in place. For the deletion, you're sort of stuck with lifecycle policies, which would take a day or so to clear out the old bucket, but could be supplemented with manual intervention, I'd imagine. There probably needs to be a better mechanism for completely wiping buckets. Last I checked there was not.
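For the wipe step, the lifecycle rule is a one-off API call; a minimal boto3 sketch (bucket name hypothetical) that expires current versions, old versions, and stale multipart uploads after a day:

    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="old-bucket",  # hypothetical
        LifecycleConfiguration={
            "Rules": [{
                "ID": "expire-everything",
                "Filter": {"Prefix": ""},          # match every key
                "Status": "Enabled",
                "Expiration": {"Days": 1},
                "NoncurrentVersionExpiration": {"NoncurrentDays": 1},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 1},
            }]
        },
    )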
Yep. Had to delete a bucket with 100 million or so objects in it a while ago, each with multiple versions, so it could have been a billion objects. It was something I needed to run every now and then (a process to clear the production bucket while getting ready for the full cutover) and have done in a few hours rather than waiting for lifecycle policies to kick in.
S3 Batch Operations can also invoke a Lambda function per object, so it's straightforward enough to create a function that calls DeleteObject() and let Lambda scale out to silly levels.
Only, that's not actually deletion, nor would it have helped their business requirement:
> It is a 3rd party application that puts data into that origin bucket. They needed the bucket to be empty before the new version gets activated. And they wouldn't use another bucket. Something out of our control
- do this on a few threads per server to avoid being throttled
- which entails keeping track of which servers you're using (there are hundreds of servers available but dns only exposes a couple per second)
...and that assumes you've already got a list of keys - if you need to also enumerate the bucket, you'll have to employ the same strategy and also partition your queries somehow to get multiple non-overlapping continuation tokens for your concurrent calls to /?list. You can also request a bucket inventory, but if you're in a hurry it's faster to do it yourself (for a 25TB bucket the turnaround on the inventory request is probably 2+ days).
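A sketch of that partitioned listing (bucket name and boundaries are invented; in practice you'd derive the boundaries from an inventory or a quick sample of keys):

    from concurrent.futures import ThreadPoolExecutor

    import boto3

    BUCKET = "source-bucket"            # hypothetical
    # Partition boundaries over the key space; None means "open ended".
    BOUNDARIES = [None, "4", "8", "c", None]

    s3 = boto3.client("s3")

    def list_partition(start, end):
        keys = []
        kwargs = {"Bucket": BUCKET}
        if start:
            kwargs["StartAfter"] = start   # begin at this worker's slice
        for page in s3.get_paginator("list_objects_v2").paginate(**kwargs):
            for obj in page.get("Contents", []):
                if end is not None and obj["Key"] >= end:
                    return keys            # crossed into the next partition
                keys.append(obj["Key"])
        return keys

    with ThreadPoolExecutor(max_workers=len(BOUNDARIES) - 1) as pool:
        futures = [pool.submit(list_partition, BOUNDARIES[i], BOUNDARIES[i + 1])
                   for i in range(len(BOUNDARIES) - 1)]
        all_keys = [key for f in futures for key in f.result()]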
I had a similar task once and tried it with awscli and s3cmd first, then took about 5 minutes to google for a multithreaded alternative and found s4cmd[1]. Problem solved. (tmux to keep the session running.)
Makes me kinda doubt at least parts of this story - not one of these 7 engineers thought about googling?
Edit to add: Agree with the basic premise though - S3 can be real slow when you actually need to work with your data.
Lots of weird comments here. I’ve done this solo with a 200tb bucket in about a day. A rando “aws s3 cp” obviously isn’t going to cut it.
They had the weird requirement where they also needed to empty the bucket, which makes things slightly more complex. You’d create a small 5 line lambda that would copy an object to the new bucket and then delete it. You’d then invoke this with a batch operation.
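A sketch of that Lambda (destination bucket hypothetical; the event/result shape follows the Batch Operations-to-Lambda contract, worth double-checking against the current docs):

    import urllib.parse

    import boto3

    s3 = boto3.client("s3")
    DST = "new-bucket"  # hypothetical destination

    def handler(event, context):
        # S3 Batch Operations hands us one task per invocation.
        task = event["tasks"][0]
        src_bucket = task["s3BucketArn"].split(":::")[-1]
        key = urllib.parse.unquote_plus(task["s3Key"])

        # Note: copy_object tops out around 5 GB; bigger objects need multipart copy.
        s3.copy_object(Bucket=DST, Key=key,
                       CopySource={"Bucket": src_bucket, "Key": key})
        s3.delete_object(Bucket=src_bucket, Key=key)

        # Report the result back in the shape Batch Operations expects.
        return {
            "invocationSchemaVersion": event["invocationSchemaVersion"],
            "treatMissingKeysAs": "PermanentFailure",
            "invocationId": event["invocationId"],
            "results": [{
                "taskId": task["taskId"],
                "resultCode": "Succeeded",
                "resultString": "",
            }],
        }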
If you can relax the empty requirement, then you’d just use the batch operation to copy without needing a lambda and set a lifecycle policy to delete the objects. You’d need to do a bit of manual work to copy-delete objects created since the last inventory ran, but that is fairly simple.
You can use a tool like rclone but that’s still going to be quite slow compared to a batch operation, especially if you have a higher number of small files.
Alternatively, you would set up replication between the buckets and just handle the deletion within the window. S3 can delete 1,000 objects per call, meaning emptying a bucket is pretty fast.
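On the 1,000-objects-per-call point, the batched delete is one request per chunk; a minimal boto3 sketch (bucket name hypothetical, keys assumed to come from a listing or inventory):

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "old-bucket"  # hypothetical

    def delete_all(keys):
        # DeleteObjects accepts at most 1,000 keys per request.
        for i in range(0, len(keys), 1000):
            chunk = keys[i:i + 1000]
            s3.delete_objects(
                Bucket=BUCKET,
                Delete={"Objects": [{"Key": k} for k in chunk], "Quiet": True},
            )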
Or, lastly, you’d use “s3 cp” with an appropriate number of threads on a machine within AWS.
I had a similar problem, except I had to download about 600GB (mainly small <30KB files) from a client's S3 to on-prem. It took a painfully long time using the AWS CLI.
Never really found any good solutions but fortunately it wasn't urgent so it wasn't a big deal. But this should be easier.
You are getting downvoted because TFA is about the S3 protocol.
All of these commands unfortunately won't help you with S3. If you want to manipulate it from the command line, you are basically stuck with the aws CLI, s3cmd/s4cmd, or scripting it yourself using a library.
I moved a ton of data in and out of S3 when I worked at $STARTUP. You just need to tune the s3 options a bit. I haven't had any problems, often saturating 1Gbps with small files, or 10Gbps with large files, using a single beefy machine.
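The "s3 options" here are presumably the aws CLI's s3 settings (max_concurrent_requests, multipart_chunksize, and friends); the boto3 counterpart looks roughly like this sketch, with invented values:

    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client("s3")

    # Roughly the boto3 equivalent of the aws CLI's s3.max_concurrent_requests
    # and s3.multipart_chunksize settings.
    config = TransferConfig(
        max_concurrency=64,                  # parallel parts/requests
        multipart_threshold=64 * 1024**2,    # switch to multipart above 64 MB
        multipart_chunksize=64 * 1024**2,
    )

    s3.download_file("some-bucket", "some/key.bin", "/tmp/key.bin", Config=config)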
Move Docs: https://rclone.org/commands/rclone_move/
S3 Docs: https://rclone.org/s3/#amazon-s3
Supports parallel server-side copies and deletes (no server-side moves, unfortunately) so this would have been much faster.