Show HN: Gargantuan Takeout Rocket – Google Takeout Transloader to Azure (github.com/nelsonjchen)
98 points by crazysim on Feb 21, 2023 | 41 comments
Been broken for 4 months; just got back to fixing and validating it. Figured I'd repost this.

Gargantuan Takeout Rocket (GTR) is a toolkit that makes backing up a Google account to somewhere that isn't Google a lot less painful. At the moment, the only supported destination is Azure.

It's a guide, a browser extension, a Cloudflare worker to deploy, and Azure storage to configure. This sounds like buzzword creep, but believe me, every piece is extremely important.

It's serverless and very cheap to run: you can back up a Google account for about $1/TB.

Compared to renting a VPS to do this, it's much more pleasant. You aren't juggling strange URLs, renting big beefy boxes to buffer large amounts of data, or trying to log in to Google or pass URLs through a VPS. Unfortunately, not everything about the procedure can be automated, but whatever can be, is.

It's very fast: 1GB/s is the stable default and recommended speed, and you can run about three of these at a time for 3GB/s+ overall. The trick is making Azure itself download from Google into a file block, a unique API not seen in S3 or S3-like object storage.

Unfortunately, Azure has URL handling bugs and only supports HTTP 1.1, greatly limiting parallelism. We can use Cloudflare Workers to work around these issues.
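
To give a rough idea, one transload step is a single HTTP call that hands Azure a URL to pull a block from (Azure's "Put Block From URL" operation). A simplified sketch, not GTR's actual code; the account, block ID scheme, and proxy URL format below are made-up placeholders:

    // Simplified transload sketch (placeholders throughout, not GTR's code).
    const account = "myaccount";                                // placeholder storage account
    const sas = "sv=...&sig=...";                               // SAS token with write access
    const blockId = encodeURIComponent(btoa("block-000001"));   // base64, fixed-length block IDs

    // Ask Azure to fetch the bytes itself; nothing streams through this machine.
    await fetch(
      `https://${account}.blob.core.windows.net/takeout/part1.tgz` +
        `?comp=block&blockid=${blockId}&${sas}`,
      {
        method: "PUT",
        headers: {
          "x-ms-version": "2020-10-02",
          // The Worker proxy wraps the signed Takeout URL to dodge Azure's
          // URL handling bugs (this proxy URL format is made up).
          "x-ms-copy-source": "https://gtr-proxy.example.workers.dev/p/<takeout-url>",
          // Optional: "x-ms-source-range": "bytes=0-104857599" to pull one chunk.
        },
      }
    );
    // Repeat per block, then commit them in order with a Put Block List call.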

I use GTR myself with a scheduled Google Takeout every two months to back up 1.5TB of data from Google: photos, YouTube videos, etc. I can finish my backups to safe non-Google storage within 15 minutes of getting the email from Google that my Takeout is ready to download.

Unfortunately, the only destination is currently Azure, there's no encryption support, and Cloudflare is involved. That said, if you're fine with all that, this is a fine way to back up a Google and YouTube account as-is.




This is great; I'm also looking for cloud backup solutions that are very cheap.

I also absolutely hate the Chrome Web Store for prohibiting the existence of such tiny use cases (although it's kind of understandable).

But I'm also aware that Azure can't be that cheap. And thankfully, the repo includes a description of the catch:

> Restoration: For 1TB, this will cost about $100.88 (rounded up). Small price for salvation.

For those who don't know, cloud providers split costs by access tier (Hot, Archive) and by operation (read, list, write), unlike Google Drive, which has fixed pricing per TB.

So this makes use of the Archive tier's low cost, which won't work well for backup use cases that require random access (such as looking up an old photo), but is great for an additional backup that comes after a local hard disk backup.

Great solution though! Maybe with some adjustment someone can make it support random access in the future (an index that lets you restore a single file back to Hot storage).


I'm also looking at how this could make local backups easier. Intermediate, temporary staging in Cloudflare R2 might allow the use of queueable/resumable download managers against Takeout archives, since bandwidth to and from R2 is free.

https://github.com/nelsonjchen/gargantuan-takeout-rocket/iss...

Direct Google Takeout archive URLs are only valid for 15 minutes.

I'm not going to make any effort until R2 gets lifecycle rules though. It is too easy to accidentally leave too much there and get a large bill.


TIL that Azure blob storage can be told where to download blobs from, that's neat.

> Pre-GTR, the backup procedure would have taken at least 3 hours even with a VPS Setup facilitating the transfer from Google Takeout as even large instances on the cloud with large disks, much memory, and many CPUs would eventually choke with too many files being downloaded in parallel.

("VPS" above links to https://sjwheel.net/cloud/computing/2019/08/01/aws_backup.ht... , which describes spinning up an EC2 machine with enough local storage to download the whole takeout, and using that to download from google and then upload to s3)

I think one could do this "transloading" comfortably inside the AWS Lambda limits. The Go AWS SDK's s3manager.Uploader takes an io.Reader (via UploadInput's Body) and streams its contents into an S3 bucket+key. You could make a Lambda function that downloads a Takeout part and chunks it into S3. Lambda scales to zero and will be patiently waiting for you two months from now when you need it again.
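
A rough sketch of the idea, here with the JS SDK's lib-storage Upload instead of Go's s3manager (same streaming multipart approach; the bucket name and event shape are placeholders I made up):

    // Hypothetical Lambda handler: stream one Takeout part straight into S3.
    import { S3Client } from "@aws-sdk/client-s3";
    import { Upload } from "@aws-sdk/lib-storage";
    import { Readable } from "node:stream";

    const s3 = new S3Client({});

    export const handler = async (event: { takeoutUrl: string; key: string }) => {
      const res = await fetch(event.takeoutUrl);        // signed Takeout part URL
      if (!res.ok || !res.body) throw new Error(`download failed: ${res.status}`);

      // Upload chunks the stream into a multipart upload, so the archive never
      // has to fit in Lambda's memory or /tmp.
      await new Upload({
        client: s3,
        params: {
          Bucket: "my-takeout-backups",                 // placeholder bucket
          Key: event.key,
          Body: Readable.fromWeb(res.body as any),      // web stream -> Node stream
        },
      }).done();
    };

Each part just has to finish within Lambda's 15-minute cap.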


That AWS-centric idea with Lambda seems pretty reasonable cost-wise too. One issue with Cloudflare Workers is that they can only PUT 100MB at a time to any host, Cloudflare or not, an undocumented limit presumably there for anti-abuse. I don't believe a similar limitation is present in AWS Lambda.

GTR grew organically out of seeing Azure Blob Storage's download API. I was disappointed by the bugs I found and just patched over them with Cloudflare Workers rather than switching over to AWS/S3.

I could see myself going this route too with a clean slate. With the 100MB limit gone, and having to pay for CPU anyway, maybe we could also get proper GPG public-key encryption in there.


> One issue with Cloudflare Workers is that they can only PUT 100MB at a time to any host, Cloudflare or not, an undocumented limit presumably there for anti-abuse.

It's documented here:

https://developers.cloudflare.com/workers/platform/limits/#r...

This limit actually isn't specific to Workers. It's a general Cloudflare thing, because Cloudflare buffers uploads to disk in order to defend against slow loris attacks. That is, this is actually meant to protect your origin. We've occasionally discussed letting people turn this off if they don't need it but it hasn't come up very often...

(I'm the lead engineer for Workers.)


Actually, the surprise was that it applied to non-Cloudflare hosts such as Azure Storage's endpoint. Request limits against Cloudflare itself? Famous. Against another non-Cloudflare host from a Cloudflare Worker's environment? Much less famous or well known.


Yeah, that's more or less a bug. The reason for it is that outgoing requests from Workers pass through a lot of the same proxy logic as incoming requests. This made Workers a lot easier to build initially (six months from inception to beta!) but there are a lot of quirks as a result.

Outgoing requests from Workers destined for external servers (i.e. not other sites on Cloudflare) really have no reason to be buffered and thus no reason to impose these limits. We should fix that. I also suspect most Workers-based apps don't really need incoming buffering either.


Ah! I guess I missed that. It confused the CF Discord a bit but I guess it is documented.


Azure uses this functionality for the "transfer my data from GCS/S3" feature in their CLI.

Unfortunately, product-wise, they seem to have missed the memo about making an S3-compatible interface like GCS did. Oh well. At least we got GTR!


Does it have any 'restore' ability?

I am hesitant to use any backup solution that doesn't have an equally well-tested restore path. Ideally a restore path to another provider too, e.g. migrating Google Photos to iCloud Photos with as many tags/labels/metadata preserved as possible.


I don't think that Google Takeout is a restorable service. It's more of an "Oh snap, I lost my Google Account forever and lost my data" type of recovery option.


Nope, this just gets the ZIPs off and out of Google. My trouble is with the gargantuan size I have to deal with (1TB+). Getting the bytes back into a format another provider can take is best handled by other, already existing projects on GitHub.

And those other already existing projects are useless without the bytes :).


Google Takeout has added an option to use Google Drive as a destination for the backed-up files. When the backup is complete, the files appear in your GDrive.

Have you considered streaming the files from there to Azure Storage?


This might not be appropriate for users who cannot temporarily store 1TB in their GDrive.


I am not sure, but I think the files will appear even if your drive is full. So if it's just temporary, it might be okay.


Won't this also kinda temporarily brick Gmail too? The shared storage will be full.


Hmmm possibly yes


I haven't read the documentation yet. Is it possible to make it work with Wasabi instead of Azure?


At the moment it's just Azure. Azure's API is special, as mentioned in other comments.

That said, I tried PUT-ing directly to Azure but quickly discovered that Azure actually has an ingress(!) quota on bytes, not on requests like the S3 and S3-likes I've seen before.

I have not implemented an S3 target yet. I might try. I'm particularly interested in targeting Cloudflare R2 as a temporary staging area so I can download the Takeout files with a resumable or bulk download manager for a local backup without heavy bandwidth costs. AWS S3 also has its own competitive Archive tier.

Maybe keep an eye on https://github.com/nelsonjchen/gargantuan-takeout-rocket/iss...


> Azure's API is special

Yes, so is Wasabi. It doesn't charge you for ingress/egress data, so there is a clear cost advantage over similar services.

Disclaimer: I'm a Wasabi official partner in Brazil.


From another branch of replies, it does not appear there is an API to do transfers, just a transfer tool: https://news.ycombinator.com/item?id=34904775 . Nothing about any deviations from the S3 API for transloading, or any API offered at all.

As for that special-ness you've mentioned, Wasabi is actually falling behind on the "doesn't charge you for ingress/egress" point, since it comes with a serious asterisk. One of the reasons I'm far more interested in Cloudflare R2 as a possible future temporary staging platform for local downloads is that it genuinely doesn't charge for ingress/egress, whereas Wasabi says "don't egress more than you have stored, or we'll ban you". R2 does not have that limitation. In fact, I've used it to serve a 28GB file that has been downloaded many, many times with only 28GB stored in R2. Such a thing is against Wasabi's terms of use, and that threat makes it unsuitable for GTR.

As for cost as a final destination, there is no Glacier Deep Archive or Archive tier option like on AWS or Azure. It is significantly more expensive.


I don't know Wasabi well enough to comment, but GTR uses Azure because of its unique(?) ability to do cloud-initiated fetches, versus the "download to my computer, upload to the S3 endpoint" flow that every other cloud storage API uses. All of the words you haven't read yet are about trickery to exfiltrate the signed Takeout URLs so that Azure's blob storage can fetch them (along with some Cloudflare Worker trickery to work around some kind of bugs in Azure).

So, I guess the TL;DR (heh, quite literally in this case) is that if Wasabi allows you to download something like an Ubuntu ISO to Wasabi storage by giving it the ISO URL from your phone, it'll likely work for you. Otherwise, you're back in the VPS setup TFA was seeking to avoid.


Thanks for the "extended" TL;DR =)

Wasabi allows cloud-to-cloud migrations[1].

Disclaimer: I'm an official partner in Brazil.

[1] https://wasabi.com/cloud-migration-tools/


That was an objectively terrible blog post. But, because I was curious, I actually went digging into the docs for it and it seems that "WCSM" only supports fetching from S3 Compatible origins, which those pre-signed Takeout URLs are not: https://docs.wasabi.com/docs/wcsm-for-data-migration

Hypothetically one could trick the Cloudflare Workers into exposing those Takeout URLs under a fake GetObject request-response handshake, but... what a lot of work just for Wasabi.


I don't see any documentation about an API here at all. It's just "yeah, we support these sources for our transfer tool; contact our team" for them.

From the outside it looks a lot like the storage transfer job stuff I've seen on GCP, and yet also pretty far from it, since it's gated behind those infernal "talk to our team" access gates.


Nice! I want to use this to migrate some customers. I sent the bug details to some teams at MS.


I think you found my issue, but they haven't fixed it in the two-plus years since I reported it. It'll be nice to kill off one leg of the transfer though.

That said, I'm pretty happy I got acknowledgement on a Workers issue. Wasn't sure if it was a bug, but it was a bug! https://news.ycombinator.com/item?id=34891237


What exactly is the "file block" API? I can't find docs for it. Does anyone know?



If the destination is Azure, why Cloudflare Workers instead of Azure Functions?


It was just what I had on hand. I was also using it for another project.

It simply had the least danger and cost for hosting a public demo. The bandwidth costs that big cloud providers like AWS, Azure, and GCP have simply aren't there.

It also deploys in milliseconds, which was a great developer experience.

Copied and pasted from https://github.com/nelsonjchen/gtr-proxy

* Cloudflare does not charge for incoming or outgoing data. No egress or ingress charges.

* Cloudflare does not charge for CPU/memory once the request has finished processing, the response headers have been sent, and the worker is just shoveling bytes between two sockets.

* Cloudflare has the peering, compute, and scalability to handle the massive transfer from Google Takeout to Azure Storage. Many of its peering points are peered with Azure and Google with high capacity links.

* Cloudflare Workers are serverless.

* Cloudflare's free tier is generous.

* Cloudflare allows fetching and streaming of data from other URLs programmatically (see the passthrough sketch after this list).

* Cloudflare Worker endpoints are HTTP/3 compatible and can comfortably connect to HTTP 1.1 endpoints.

* Cloudflare Workers are globally deployed. If you transfer from Google in the EU to Azure in the EU, the worker proxy is also in the EU and your data stays in the EU for the whole time. Same for Australia, US, and so on.
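
For a sense of how little the proxy has to do, a passthrough Worker can be not much more than this (a minimal sketch, not the actual gtr-proxy code; the /p/ URL scheme is made up):

    // Minimal passthrough Worker sketch (not gtr-proxy itself).
    export default {
      async fetch(request: Request): Promise<Response> {
        const url = new URL(request.url);
        // Hypothetical scheme: /p/<percent-encoded target URL>
        const target = decodeURIComponent(url.pathname.replace(/^\/p\//, ""));

        // Return the upstream response as-is; bodies are streamed, so after the
        // headers go out the Worker is just shoveling bytes between two sockets.
        return fetch(target, {
          method: request.method,
          headers: request.headers,
          body: request.body,
        });
      },
    };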


They're both part of the Bandwidth Alliance, as is Backblaze.


Correct me if I'm wrong, but that's only for data reads proxied through Cloudflare since inbound bandwidth is already free.


Amazing. This will help many people to abandon Google and migrate to Azure.


Not really. This only moves the backups: if you want to use Office 365, you have to import the data there. If you need to do that, there are consultancies that specialize in it (and in moving from Office 365 to Google Workspace).


I do a lot of the migrations you mention, and while this tool in particular won't be used directly, adapting this process to our tools will definitely be helpful. In many circumstances I'd rather ingest an enterprise Google Takeout file than make N service connections to GSuite.


Interesting, I was thinking this would be useful for Atlassian Cloud but never thought about enterprise Google Workspace takeout.


Funny, it's done the opposite for me. I like Google Takeout. While downloading the archives is a pain in the ass (hence why GTR exists), Google makes it easy to pull all your Google data together in one place, which is something I don't think I've seen from other providers.


I wouldn't store my entire Google Takeout archive in Azure without encrypting it first. If someone were to hack my Azure account, then they would have everything.

What this project is doing is adding another potential point of privacy failure. My suggestion would be for users not to use the public proxy, but to modify their own proxy to GPG-encrypt each Takeout file with their public key as it is passed through to Azure Storage.


For those concerned about this, the public proxy should be considered demo-only. You might also consider using it only for the YouTube portion of your Takeout.

Unfortunately, even if people did modify their own proxies to encrypt, it would quickly blow through any CPU budget on Cloudflare Workers. When a Worker finishes processing and returns a stream object, Cloudflare basically unloads the Worker, stops CPU billing, and just handles shoveling bytes between the two sockets.

The closest thing I think would work is Azure Storage's own encryption. It's still theoretical though; I haven't tried it with block-by-block transloads, much less uploads. Unfortunately, it's symmetric, and Azure holds the symmetric key but pinky-promises to wipe it after the transfer is done. This should deter most adversaries.

https://learn.microsoft.com/en-us/azure/storage/common/stora...

https://github.com/nelsonjchen/gargantuan-takeout-rocket/iss...
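
If anyone wants to poke at it, my understanding is that the customer-provided-key flavor comes down to three extra headers on each write call. Untested with transloads; every value below is a placeholder:

    // Sketch of Azure customer-provided key headers on a write call (untested
    // with Put Block From URL transloads; all values here are placeholders).
    const blobUrl = "https://myaccount.blob.core.windows.net/takeout/part1.tgz";
    const sas = "sv=...&sig=...";
    const blockId = encodeURIComponent(btoa("block-000001"));
    const sourceUrl = "https://gtr-proxy.example.workers.dev/p/<takeout-url>";

    const key = crypto.getRandomValues(new Uint8Array(32));       // AES-256 key; keep it somewhere safe
    const keyB64 = btoa(String.fromCharCode(...key));
    const digest = await crypto.subtle.digest("SHA-256", key);
    const keySha256B64 = btoa(String.fromCharCode(...new Uint8Array(digest)));

    await fetch(`${blobUrl}?comp=block&blockid=${blockId}&${sas}`, {
      method: "PUT",
      headers: {
        "x-ms-version": "2020-10-02",
        "x-ms-copy-source": sourceUrl,
        // Azure encrypts server-side with this key and says it never persists it;
        // the same three headers must be sent again to read the blob back.
        "x-ms-encryption-key": keyB64,
        "x-ms-encryption-key-sha256": keySha256B64,
        "x-ms-encryption-algorithm": "AES256",
      },
    });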

There's also the obvious route of just downloading the archives to an Azure VM, encrypting them, throwing them back into the storage container/bucket, and deleting the unencrypted transloaded bytes. Downloading from Azure Storage to an Azure VM should be very quick. You might even be able to do it "statelessly" without intermediary files on the VM! There is still a window where the data is unencrypted, but you can close it afterwards.

The bytes become way more handleable once they're outside Google's URLs and on Azure. Use GTR to get the bytes out of Google Takeout's quagmire.


This is great. I tried doing something similar: Takeout can put the files in Google Drive for you, so I tried creating a Cloudflare Worker that reads the files from Google Drive and streams them directly to Backblaze B2 (S3-compatible).

The Worker was supposed to run for free, since the files were just streaming and no CPU time should have been used, but in practice the CF Workers kept stopping because I was exhausting the free limit. So I guess something in my code did end up using CPU.



