
A trick I think would be useful to include here is running scrapers in GitHub Actions that write their results back to the repository.

This is free(!) to host, and the commit log gives an enormous amount of detail about how the scraped resource changed over time.

I wrote more about this trick here: https://simonwillison.net/2020/Oct/9/git-scraping/

Here are 267 repos that are using it: https://github.com/topics/git-scraping?o=desc&s=updated
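For anyone who hasn't set one of these up, here is a minimal sketch of the kind of workflow involved. The URL, file name, and cron schedule are placeholders rather than something from a real repo, and it assumes curl and git on a standard Ubuntu runner:

    name: Scrape latest data

    on:
      schedule:
        - cron: "0 6 * * *"   # once a day; pick a frequency the source can tolerate
      workflow_dispatch:      # allow manual runs too

    permissions:
      contents: write         # the workflow needs to push commits back

    jobs:
      scrape:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Fetch the resource
            run: curl -sSL "https://example.com/data.json" -o data.json
          - name: Commit if anything changed
            run: |
              git config user.name "github-actions[bot]"
              git config user.email "github-actions[bot]@users.noreply.github.com"
              git add data.json
              git diff --quiet --cached || (git commit -m "Latest data: $(date -u)" && git push)

The part that makes the history useful is only committing when the fetched content actually changed, so the commit log reads as a diff-over-time of the resource.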




I feel like this is bad manners. The runners are a shared resource and you risk getting their IPs blacklisted by the sites you're scraping. I think a strict reading of the GitHub Actions TOS may prohibit this sort of usage, too.

> ... for example, don't use Actions as a content delivery network or as part of a serverless application ...

> Actions should not be used for: ... any other activity unrelated to the production, testing, deployment, or publication of the software project associated with the repository where GitHub Actions are used.

> You may only access and use GitHub Actions to develop and test your application(s).

https://docs.github.com/en/site-policy/github-terms/github-t...


I initially had similar concerns, but the idea seems to be endorsed by the GitHub Developer Experience team: https://githubnext.com/projects/flat-data/
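For what it's worth, Flat Data packages the same pattern as a reusable action. A rough sketch from memory of its README — the action reference (githubocto/flat@v3) and the http_url / downloaded_filename inputs are how I recall them being documented, so double-check the project before relying on this, and the URL here is a placeholder:

    name: Flat data

    on:
      schedule:
        - cron: "0 0 * * *"
      workflow_dispatch:

    jobs:
      scheduled:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Fetch data
            uses: githubocto/flat@v3
            with:
              http_url: https://example.com/data.json   # placeholder
              downloaded_filename: data.json

As I recall, the action handles the commit-only-if-changed step itself, which is why the workflow stays this short.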


There is a repo on GitHub that does basically this and can be used as a currency conversion API (with historical rates). It scrapes all the exchange rates once a day with Actions, commits them, and you can then query the data through a CDN.

https://github.com/fawazahmed0/currency-api


Honorable mention, even though it doesn't use Actions: https://github.com/elsamuko/Shirt-without-Stripes


Hi Simon! I'll definitely consider adding that in. Also, I love Datasette!


Interesting! Any idea how likely GitHub IPs are to be blocked?



