Show HN: Google Drive to SQLite (simonwillison.net)
305 points by simonw on Feb 21, 2022 | hide | past | favorite | 32 comments
Here's the repo: https://github.com/simonw/google-drive-to-sqlite

The README is using a trick I'm increasingly leaning on: parts of that document - the --help output and the example database schema - are automatically generated using Cog: https://nedbatchelder.com/code/cog and https://til.simonwillison.net/python/cog-to-update-help-in-r...
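For anyone who hasn't used Cog: in Markdown the generator code lives inside HTML comments, so it is invisible in the rendered document. A rough sketch of the pattern (the exact generator code in the repo may differ):

```markdown
<!-- [[[cog
import cog
cog.out("(generated output is spliced in here)")
]]] -->
(generated output is spliced in here)
<!-- [[[end]]] -->
```

Running `cog -r README.md` re-runs the Python and rewrites everything between the `]]]` marker and `[[[end]]]` in place, so the generated sections stay in sync with the code.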




We discovered (in rclone) that Google are deprecating the auth method you are using - the OOB auth here: https://github.com/simonw/google-drive-to-sqlite/blob/121509...

Here is the google page: https://developers.googleblog.com/2022/02/making-oauth-flows...

Here is the discussion on the rclone issue - there are links to more stuff there: https://github.com/rclone/rclone/issues/6000

Luckily rclone already implements the oauth webserver for other backends, but for tools like this it has suddenly made authenticating with Google a whole lot harder.


Hah, wow thanks I hadn't seen that! Just my luck to implement that the same week they announce they are deprecating it.


It's pretty hard to understand what I'm meant to do for a CLI app instead here. Are they removing all versions of the "copy and paste this code" auth flow?

Do I need to spin up a localhost web server on a port instead? How am I supposed to do that sensibly on a remote server that I'm SSHing into?

Also that announcement seemed to indicate that localhost: redirects were going to be disabled too, wouldn't that make the local web server option not work either?

I swear, obtaining an OAuth token to interact with my Google account is the new hardest problem in computer science. Oh for a GitHub-style personal access token!


I'm collecting notes on how to address this here: https://github.com/simonw/google-drive-to-sqlite/issues/39

It looks like the localhost redirects are being disabled for "iOS, Android and Chrome app OAuth client types" so hopefully I can still use those by spinning up a localhost web server after all.


My reading is that a localhost web server on a port is the way to go.

This aligns with what nearly all the other cloud providers do with oauth.
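A minimal sketch of that loopback approach using only the Python standard library (the names are my own, and the actual exchange of the code for tokens at Google's token endpoint is omitted): the app starts a throwaway local server, points the browser at the consent URL with a redirect_uri of `http://127.0.0.1:<port>/`, and waits for the redirect back with `?code=...`:

```python
import http.server
import urllib.parse

class OAuthRedirectHandler(http.server.BaseHTTPRequestHandler):
    """Capture the ?code=... parameter from the OAuth redirect."""

    def do_GET(self):
        query = urllib.parse.urlparse(self.path).query
        params = urllib.parse.parse_qs(query)
        # Stash the authorization code on the server object for the caller
        self.server.auth_code = params.get("code", [None])[0]
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"Authentication complete, you can close this tab.")

    def log_message(self, *args):
        pass  # keep the console quiet

def wait_for_auth_code():
    # Port 0 = let the OS pick a free port; the redirect_uri you send
    # to Google must then be http://127.0.0.1:<server_port>/
    server = http.server.HTTPServer(("127.0.0.1", 0), OAuthRedirectHandler)
    server.auth_code = None
    print(f"Redirect URI: http://127.0.0.1:{server.server_port}/")
    server.handle_request()  # block for exactly one request
    server.server_close()
    return server.auth_code
```

On a headless box you would still need something like rclone's port-forwarding trick to make the browser's redirect reach this server.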

> How am I supposed to do that sensibly on a remote server that I'm SSHing into?

If you want to see rclone's solution then check out

https://rclone.org/remote_setup/

It's a bit of a pain!

> Also that announcement seemed to indicate that localhost: redirects were going to be disabled too, wouldn't that make the local web server option not work either?

As you noted below, that is only for Android, iOS and ChromeOS. I hadn't noticed that before though...

> I swear, obtaining an OAuth token to interact with my Google account is the new hardest problem in computer science.

Ha ha!

Note that you can get "rclone authorize" to do it if you want, since you can set the scopes and the client ID / client secret.


Considering the gcloud SDK CLI auth uses the copy/paste-this-code flow for certain scenarios, this will be interesting.


They literally couldn't care less.

Either you're a sticky customer and you'll figure it out,

or you're a high value customer that they'll sic an engineer or two on for a week to re-tool.

They do. not. care.


The replacement flow requires gcloud on the machine using the browser to login. The command is 'gcloud auth login --no-browser'


There used to be one blog that explained how you could roll your own OAuth. Then v2 came out and they gave up.

It wasn't TJ Holowaychuk but it was from that era. Google doesn't give me shit now, but back in 2010 there was one post that everyone was working against.


gTakeout[0] also includes this file metadata and allows incremental downloads. Ingesting from a local Takeout archive as an additional source for your tool mitigates some of this auth wonkiness and the deprecations.

[0]https://takeout.google.com/


I dived into Datasette a bit last year. Simon has a whole set of tools around SQLite and data analysis with it. I haven't used any of it for anything major but I love to see what he's got going on. My Google Drive is out of control, I hope I can find the time to try this one out.


FWIW: If you find that you need something like this, consider using a different file storage product. Most likely, it's using SQLite and the schema is useful.

I used to be the lead on Syncplicity, a desktop file synchronization product. All of the local state was in SQLite. Early versions of the desktop client kept the complete path to every file in the associated row in SQLite. Later versions kept the filename and foreign key to the directory row.
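That later, normalized layout can be sketched like this (table and column names are hypothetical, not Syncplicity's actual schema). The win is that renaming a directory updates one row, and a full path is rebuilt on demand with a recursive CTE:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE directories (
    id INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES directories(id),  -- NULL for the root
    name TEXT NOT NULL
);
CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    directory_id INTEGER NOT NULL REFERENCES directories(id),
    name TEXT NOT NULL
);
""")
db.executemany("INSERT INTO directories VALUES (?, ?, ?)",
               [(1, None, "root"), (2, 1, "photos")])
db.execute("INSERT INTO files VALUES (1, 2, 'cat.jpg')")

# Walk parent_id links from the file's directory up to the root,
# concatenating names into a path as we go.
row = db.execute("""
WITH RECURSIVE chain(id, parent_id, path) AS (
    SELECT d.id, d.parent_id, d.name
    FROM directories d
    JOIN files f ON f.directory_id = d.id AND f.id = 1
    UNION ALL
    SELECT d.id, d.parent_id, d.name || '/' || chain.path
    FROM directories d JOIN chain ON d.id = chain.parent_id
)
SELECT path || '/' || (SELECT name FROM files WHERE id = 1)
FROM chain WHERE parent_id IS NULL
""").fetchone()
print(row[0])  # root/photos/cat.jpg
```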

Bittorrent Sync (I forget its new name) also used to use a SQLite database for local metadata.

And, finally: It's not that hard to write a tool to recursively scan a folder and grab all the metadata of files. You can scan 1000s of files in a matter of seconds.
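A rough sketch of such a scanner in Python (my own naming, not any particular product's code):

```python
import os
import sqlite3

def index_folder(root, db_path="files.db"):
    """Recursively scan `root` and record per-file metadata in SQLite."""
    db = sqlite3.connect(db_path)
    db.execute("""
        CREATE TABLE IF NOT EXISTS files (
            path TEXT PRIMARY KEY,
            size INTEGER,
            mtime REAL
        )
    """)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.stat(full)
            except OSError:
                continue  # broken symlink, permission error, etc.
            db.execute(
                "INSERT OR REPLACE INTO files (path, size, mtime) VALUES (?, ?, ?)",
                (full, st.st_size, st.st_mtime),
            )
    db.commit()
    db.close()
```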


So why use something else? Just because it's not that hard to write it?


Datasette and the various companion tools have all been a pleasure to work with. I've recently been using it to build my personal info archive as well as building a data analysis and visualization dashboard with Datasette and Vega lite. I've been impressed with the architecture and philosophy, as well as the code quality and ease of use in simonw's projects.

Just want to say thanks for sharing these valuable projects and keep up the great work!

I hope to be able to contribute something back to Datasette in the future. (code is at https://gitlab.wikimedia.org/releng/ddd/ but probably not anything that could be merged back into the core, it's just custom extension code on top of datasette)


https://gitlab.wikimedia.org/repos/releng/ddd is amazing! This is exactly the kind of thing I've been hoping to see teams using Datasette for.

I love that this is a public repo - I'm definitely going to be pointing people to it as an example of the kind of thing you can build.

Thanks for sharing that!


You can see it in action here: https://data.releng.team/dev/-/ddd/dashboard/project-metrics...

A lot of inspiration was taken from https://github.com/rclement/datasette-dashboards. I started out by trying to extend that project. I ended up abandoning that idea and started fresh once I realized how much my requirements diverged from the capabilities and direction of rclement's plugin.

Thanks for making such a remarkably extensible architecture and especially for the great documentation. It's been a real pleasure working with datasette and I look forward to doing some more work on datasette plugins in the future.


> Pulling the metadata—file names, sizes, file types, file owners, creation dates—into a SQLite database felt like a great way to start understanding the size and scope of what had been collected so far.

I would:

- sync the GDrive locally

- set up the "Everything" search service from voidtools

Voila. A super fast programmable index of the files.

If you want to dump it to SQLite, go ahead.

P.S. Applicable to .NET.

Note: file owners wouldn't be known.


If you have a lot of files in Google Drive, this can take a significant amount of time, or you may not be able to sync it locally at all. I push incremental backups & disk images without a retention period to Google Drive (it's unlimited storage), so there's quite a bit of data shoved in there.

It's pretty easy to get rate limited so it could take days or weeks to build an index (so having a tool that talks to the API is generally more reliable)


In my case the Google Drive folder that was shared with me has about half a TB in it, so going via the API seemed like a better option!


Agreed :p


> The README is using a trick I'm increasingly leaning on: parts of that document - the --help output and the example database schema - are automatically generated using Cog

How does this interact with IDEs? Has anyone written a VSCode extension yet? I'd probably be reluctant to use something like this without IDE support. (Though I may well be in the minority, and I appreciate that for open source tools it's not necessarily reasonable to expect people to dedicate so much of their time to DX.)


To regenerate the markdown file I run this command in the console:

    cog -r README.md
Presumably most IDEs have a mechanism that can run that command automatically, either when a file changes or just intermittently over time.

I use VS Code's markdown preview panel, which updates automatically when I manually run the "cog -r" command.
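For VS Code specifically, one option (my own suggestion, not something the parent describes) is a shell task in .vscode/tasks.json that you can run from the command palette or bind to a key:

```json
{
  "version": "2.0.0",
  "tasks": [
    {
      "label": "cog: regenerate README",
      "type": "shell",
      "command": "cog -r README.md",
      "problemMatcher": []
    }
  ]
}
```

Combined with the markdown preview panel, that gets you a one-keystroke regenerate-and-preview loop.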


Nice! I didn't know about cog.

I've developed a simple "codeblocks" tool to keep README.md up to date. In your example, here is how to insert the latest `--help` output:

    codeblocks usage-auth README.md -- google-drive-to-sqlite auth --help
Where "usage-auth" is the block language in README.md:

    Full `--help`:
    ```usage-auth
    Usage: ...
    ```
Checking is also possible:

    codeblocks --check usage-auth README.md -- google-drive-to-sqlite auth --help
It's not as generic as cog, but it does not require writing codegen scripts inside the Markdown file.

https://github.com/shamrin/codeblocks


Notable alternative for Emacs users: GitHub supports rendering .org files, so you can have a README.org using Babel (just remember to execute all code blocks before pushing).


Hm, I'm guessing... run the cog command in the VS Code terminal?


An Alfred workflow that queries this index would be fantastic. Alternatively, feed it into Spotlight.


There's a Google Drive workflow for Alfred which I've used previously.

https://www.alfredforum.com/topic/17318-google-drive-%E2%80%...


The desktop Google Drive client actually stores the file and sync info in a SQLite file.


How does one post both a link and a description in an HN submission? Was this added by a mod?


I posted the link and a first comment with that description - I guess a moderator must have merged them together, unless that's a new undocumented feature of HN?


[deleted]


TELL HN: More things like this please -- though I have no real use for this tool directly, its paradigms are super helpful in thinking through other projects I'm working on.

Thanks simonw - neat stuff


Agreed. More uses of SQLite too!

I have been working with it to gather ecom sales data across platforms for a couple of months. Such a powerful tool! I want to do more with it!



