
Sorry to do this here, I know it’s not Amazon support. Is there a way to copy S3 objects from bucket to bucket without sending them through the compute that’s doing the copying?

We have a use case for copying terabytes of content to buckets with different owners and it just seems wasteful to run everything through a client.




Yes: https://docs.aws.amazon.com/AmazonS3/latest/API/API_CopyObje...

I wrote many tests for it... so many tests...
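
For reference, a minimal boto3 sketch of a server-side copy (bucket and key names are placeholders; the credentials need read access on the source and write access on the destination):

    import boto3

    s3 = boto3.client("s3")

    # Server-side copy: S3 reads the source object and writes it to the
    # destination bucket itself; the object bytes never pass through this client.
    s3.copy_object(
        CopySource={"Bucket": "source-bucket", "Key": "report.csv"},
        Bucket="dest-bucket",
        Key="report.csv",
    )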


I’ll have to try this again; it really seemed like the performance was quite low for it to be a ‘backend’ copy operation. Thanks!

edit: just tried it, definitely seems to be working. not sure what I was seeing earlier, thanks!


Distributed systems are complex, and most likely you were bound to an overloaded webserver. My knowledge of S3 is six years old now, but the copy operation is single-threaded, with a tremendous amount of hashing to ensure durability.

At scale, the performance cost is worth it given the number of checks done internally to ensure the copy is perfect. If you download and re-upload, you could make it faster; however, getting all the details right so that corruption doesn't happen is tricky.
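
One related detail: a single CopyObject call is capped at 5 GB per object, so larger objects have to go through multipart UploadPartCopy, which stays server-side and whose parts can be issued in parallel. A rough boto3 sketch, with the names and the 100 MB part size purely illustrative:

    import boto3

    s3 = boto3.client("s3")
    part_size = 100 * 1024 * 1024  # 100 MB per part (parts must be >= 5 MB)
    src = {"Bucket": "source-bucket", "Key": "big-object.bin"}

    size = s3.head_object(**src)["ContentLength"]
    upload = s3.create_multipart_upload(Bucket="dest-bucket", Key="big-object.bin")

    parts = []
    for i, start in enumerate(range(0, size, part_size), start=1):
        end = min(start + part_size, size) - 1
        # Each part is copied inside S3 from a byte range of the source object.
        resp = s3.upload_part_copy(
            Bucket="dest-bucket",
            Key="big-object.bin",
            UploadId=upload["UploadId"],
            PartNumber=i,
            CopySource=src,
            CopySourceRange=f"bytes={start}-{end}",
        )
        parts.append({"ETag": resp["CopyPartResult"]["ETag"], "PartNumber": i})

    s3.complete_multipart_upload(
        Bucket="dest-bucket",
        Key="big-object.bin",
        UploadId=upload["UploadId"],
        MultipartUpload={"Parts": parts},
    )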


You can use S3 Batch Operations, which can handle billions of objects in just a few minutes:

https://docs.aws.amazon.com/AmazonS3/latest/userguide/batch-...
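
A batch copy job boils down to a create_job call with an S3PutObjectCopy operation; a rough boto3 sketch, assuming a CSV manifest of source objects and an IAM role for the job already exist (the account ID and all ARNs below are placeholders):

    import boto3

    s3control = boto3.client("s3control")

    s3control.create_job(
        AccountId="111122223333",
        ConfirmationRequired=False,
        Priority=10,
        RoleArn="arn:aws:iam::111122223333:role/batch-copy-role",
        # Copy every object listed in the manifest into the destination bucket.
        Operation={
            "S3PutObjectCopy": {"TargetResource": "arn:aws:s3:::dest-bucket"}
        },
        # CSV manifest of "bucket,key" rows describing the source objects.
        Manifest={
            "Spec": {
                "Format": "S3BatchOperations_CSV_20180820",
                "Fields": ["Bucket", "Key"],
            },
            "Location": {
                "ObjectArn": "arn:aws:s3:::manifest-bucket/copy-manifest.csv",
                "ETag": "manifest-etag-goes-here",
            },
        },
        # Completion report written back to S3.
        Report={
            "Bucket": "arn:aws:s3:::report-bucket",
            "Prefix": "batch-copy-reports",
            "Format": "Report_CSV_20180820",
            "Enabled": True,
            "ReportScope": "AllTasks",
        },
    )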



The AWS CLI (aws s3 cp) can copy between buckets.


That’s true, but the data flows through the machine executing the AWS CLI. The parent was asking how to copy data without it flowing through a compute instance.


I’d double-check the CLI code path for that. The S3 API does have a copy operation, which performs the copy within S3 without client compute acting as an intermediary. If the CLI isn’t using it for bucket-to-bucket copies, that sounds like a bug that needs to be fixed in the CLI tooling.


It is directly checking for s3 to s3[1] and indicates that it wants to copy...

I've read over it and I'm reasonably sure that it's going to issue CopyObject, but it would take me actually getting out paper and pen to really track it down.

The AWS CLI and Boto are a case study in overdoing class hierarchies. Not because there's any obvious AbstractSingletonProxyFactoryBean; rather, no single piece stands out as "this is where they went wrong," and yet the end result is a confusing mess of inheritance and objects.

[1]: https://github.com/aws/aws-cli/blob/45b0063b2d0b245b17a57fd9...


Not to mention the insane over-engineering of a Python 2.7-compatible async task-stealing IO loop, which is slow as hell and pitifully delivers a maximum of ~150 MB/s at 30% CPU core activity. That's why anyone who regularly needs to download/upload files from S3 ends up needing an additional tool or library (s5cmd, s3pd, etc.).


Thanks! I don’t know why I had the understanding that it worked the other way. This is useful to know!


No, when copying objects between buckets (aws s3 cp s3://... s3://... and the corresponding sync command), the AWS CLI uses CopyObject (https://docs.aws.amazon.com/AmazonS3/latest/API/API_CopyObje..., previously known as S3 PUT Copy), in which the client doesn't handle any object contents. The call stack eventually reaches https://github.com/boto/s3transfer/blob/develop/s3transfer/c... (or its multipart equivalent), where it calls the botocore binding for this API.
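
The managed transfer layer on top of this picks between the two automatically; something like the following stays entirely server-side and switches to the multipart copy path once the object exceeds the configured threshold (the bucket/key names and threshold value are just examples):

    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client("s3")

    # Managed copy: issues CopyObject for small objects and switches to
    # UploadPartCopy parts above multipart_threshold; no object bytes are
    # downloaded to the client either way.
    s3.copy(
        CopySource={"Bucket": "source-bucket", "Key": "big-object.bin"},
        Bucket="dest-bucket",
        Key="big-object.bin",
        Config=TransferConfig(multipart_threshold=1024 * 1024 * 1024),  # 1 GiB
    )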




