I feel like this is bad manners. The runners are a shared resource and you risk getting their IPs blacklisted by the sites you're scraping. I think a strict reading of the GitHub Actions TOS may prohibit this sort of usage, too.
> ... for example, don't use Actions as a content delivery network or as part of a serverless application ...
> Actions should not be used for: ... any other activity unrelated to the production, testing, deployment, or publication of the software project associated with the repository where GitHub Actions are used.
> You may only access and use GitHub Actions to develop and test your application(s).
There is a repo on GitHub that does basically this and can be used as a currency conversion API (with historical rates). It scrapes all the values once a day with Actions, commits them, and you can then query the data through a CDN.
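For anyone curious what the "query it through a CDN" part looks like, here is a minimal Python sketch; the repo path and file name are made up for illustration (not the actual project), and any CDN that fronts a repo's raw files (jsDelivr, raw.githubusercontent.com, etc.) works the same way:

    import json, urllib.request

    # Hypothetical repo and file layout; the real project's paths will differ.
    url = "https://cdn.jsdelivr.net/gh/some-owner/currency-rates@main/2020-10-09.json"
    with urllib.request.urlopen(url) as r:
        rates = json.load(r)
    print(rates)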
It might be worth adding a section on distributed anonymous scrapers that use some form of messaging middleware to distribute the URLs to scrape. On the anonymity side (independent of job distribution, of course), you could walk readers through using https://github.com/aaronsw/pytorctl or even a rotating Tor proxy. This is how I scraped all those Instagram locations and metadata we discussed about five years ago. Hope you're doing well!
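A minimal sketch of one such worker, assuming a Redis list as the messaging middleware, a local Tor SOCKS proxy on port 9050, and stem standing in for the older pytorctl (requests needs the PySocks extra for socks5h proxies). The queue names and ports are assumptions for illustration:

    import redis, requests
    from stem import Signal
    from stem.control import Controller

    r = redis.Redis()
    PROXIES = {"http": "socks5h://127.0.0.1:9050",
               "https": "socks5h://127.0.0.1:9050"}

    def new_identity():
        # Ask the local Tor daemon for a fresh circuit so the exit IP rotates.
        with Controller.from_port(port=9051) as c:
            c.authenticate()
            c.signal(Signal.NEWNYM)

    while True:
        _, url = r.brpop("urls_to_scrape")       # block until a URL is queued
        resp = requests.get(url.decode(), proxies=PROXIES, timeout=30)
        r.lpush("scraped_pages", resp.text)      # hand the result back for processing
        new_identity()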
This is from 2020. Besides a small change to the "Introduction to the Command Line" section, it has not been updated.
Back in 2015, the author reported using CasperJS to scrape public LinkedIn profiles, and said it was a PITA.
Here the author recommends using WebDriver implementations, e.g., chromedriver or geckodriver, in addition to scripting-language frameworks such as Puppeteer and Selenium. Is scraping LinkedIn still a PITA?
Because the examples given are always relatively simple, i.e., not LinkedIn, I am skeptical when I see "web scraping" tutorials using Python frameworks and cURL as the only recommended options for automated public data/information retrieval from the www.[FN1,2] I use none of the above. For small tasks like the examples given in these tutorials, the approaches I use are not nearly as sophisticated/complicated, and yet they are faster and use fewer resources than Python and/or cURL. They are also easier to update when something changes. That is in part because (1) the binaries I use are smaller, (2) I do not rely on scripting languages[FN3] and third-party libraries (so there is much less code involved), (3) the programs I use start working immediately, whereas Python takes seconds to start up, and (4) compared to the programs I use, cURL as a means of sending HTTP requests is inflexible, e.g., one is limited to the "options" cURL provides, and cURL has no option for HTTP/1.1 pipelining.
1. LinkedIn's so-called "technological measures" to prevent retrieval of public information have failed. Similarly, its attempts to prevent retrieval of public information through intimidation, e.g., cease-and-desist letters and threats of CFAA claims, have failed. Tutorials on "web scraping" that extol Python frameworks should use LinkedIn as an example instead of trivial examples for which using Python is, IMHO, overkill.
2. What would be more interesting is a Rosetta Code for "web scraping" tasks. There are many, many ways to do public data/information retrieval from the www. Using scripting languages such as Python, Ruby, NodeJS, etc. and their frameworks is one way. That approach may be ideally suited to large-scale jobs, like those undertaken by what the author calls "internet companies". But for smaller tasks undertaken by individual www users for noncommercial purposes, e.g., this author's concept of "scrapism", there are also faster, less complicated and more efficient options.
yy025 is a flexible utility I wrote to generate custom HTTP from URLs. It is controlled through environment variables. nc is a TCP client, e.g., netcat. proxy is a HOSTS file entry for a localhost TLS proxy. The sequence "yy025|tcpclient" is normally contained in a shell script that adds a <base href> tag, something like
For example, Plan9 grep does not have an -o option.
This solution is fast and flexible, but not portable.
There are myriad other portable solutions using POSIX UNIX utilities such as sh, tr and sed. For small tasks like those in "web scraping" tutorials these can still be faster than Python (due to Python start up time alone).
Solution B - Use flex to make small, fast, custom utilities
Create a file called 1.l that contains
%{
#include <stdio.h>   /* fwrite, putchar */
#include <stdlib.h>  /* exit */
int fileno(FILE *);  /* not declared by stdio.h under strict ISO C */
/* jmp mirrors flex's BEGIN macro: switch to the given start condition */
#define jmp (yy_start) = 1 + 2 *
/* echo writes the matched text (yytext) to yyout */
#define echo do {if(fwrite(yytext,(size_t)yyleng,1,yyout)){}}while(0)
%}
%s xa xb
%option noyywrap noinput nounput
%%
\<h3        jmp xa;            /* "<h3" seen: inside the opening tag */
<xa>\>      jmp xb;            /* tag closed: capture the heading text */
<xb>\<      jmp 0;             /* next tag begins: back to the initial state */
<xb>[^<]*   echo;putchar(10);  /* print the captured text, newline-terminated */
.|\n        ;                  /* discard everything else */
%%
int main(){ yylex(); exit(0); }
yy4 extracts and prints the job category name and the totalcount.
We can either solve this in steps, creating intermediate files along the way, or do it as a single pipeline. I personally find that breaking a problem into discrete steps is easier.
Thanks to yy025, we are using HTTP/1.1 pipelining. This is a feature of HTTP that almost 100% of httpds support (I cannot name one that doesn't), yet neither "modern" browsers nor cURL can take advantage of it. Multiple HTTP requests are made over a single TCP connection. Unlike the Python tutorial in the video, we are not "hammering" a server with multiple TCP connections at the same time, nor are we making a number of successive TCP connections that could "trigger a block". We are following the guidance of the RFCs, which historically recommended that clients not open many connections to the same host at the same time. Here we open only one for retrieving all the jobs pages. Adding a delay between requests is unnecessary. We allow the server to return the results at its own pace. For most websites this is remarkably fast. Craigslist is an anomaly and is rather slow.
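For readers who have never seen pipelining outside of a spec, here is a standard-library-only Python sketch of what it looks like on the wire; the host and paths are placeholders, and a real client would parse each response's framing (Content-Length/chunked) instead of just reading until the connection closes:

    import socket, ssl

    host = "example.com"
    paths = ["/search?page=1", "/search?page=2", "/search?page=3"]

    ctx = ssl.create_default_context()
    with socket.create_connection((host, 443)) as raw, \
         ctx.wrap_socket(raw, server_hostname=host) as s:
        # Send every request up front over the one connection; only the
        # last request asks the server to close when it is done.
        for i, p in enumerate(paths):
            conn = "close" if i == len(paths) - 1 else "keep-alive"
            s.sendall((f"GET {p} HTTP/1.1\r\n"
                       f"Host: {host}\r\n"
                       f"Connection: {conn}\r\n\r\n").encode())
        # The server answers in order, at its own pace.
        chunks = []
        while True:
            data = s.recv(65536)
            if not data:
                break
            chunks.append(data)
    responses = b"".join(chunks)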
What are ka and ka-? yy025 sets HTTP headers according to environment variables. For example, the value of Connection is set to "close" by default. To change it, you set the corresponding environment variable before running yy025.
The "Scraping XHR" [1] explains how to inspect network requests and reproduce them with Python. I actually built har2requests [2] to automate that process!
Hi! This is a guide that I started during the pandemic but never quite finished. I’m in the process of re-writing/re-recording some parts of it to bring it back up to date, and adding in the bits that are still missing.
I'm bothered that this doesn't mention any of the ethics involved, such as checking the robots.txt file and so forth.
More than half of my traffic is from bots, so I'm paying something like half my operational expenses to support them. And we've had to do a lot of work to mitigate what would otherwise be DoS attacks from badly written (or badly intended!) bots. I think that at least a tip of the hat to avoiding damage would be appropriate in a piece like this.
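At minimum, checking robots.txt before fetching is only a few lines with the Python standard library; the URL, path and user-agent string here are placeholders:

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    agent = "my-scraper/0.1 (contact@example.com)"
    if rp.can_fetch(agent, "https://example.com/jobs"):
        pass  # fetch, ideally with a polite delay between requests
    else:
        print("robots.txt disallows this path; skip it")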
Is Python still the most common tool used for web scraping, and if so, what's the advantage over jsdom/cheerio or, say, a headless-browser-based tool like Puppeteer?
I've been using these tools for years, but I grew up in the JS world, so I'd be curious to hear people with different backgrounds/biases than mine:)
> Is Python still the most common tool used for web scraping, and if so, what's the advantage over jsdom/cheerio or, say, a headless-browser-based tool like Puppeteer?
I'm a bit of a Python zealot, but if you prefer JS then use JS. The best tool is the one you know.
I think Python became the scraping language because many people thought it was significantly easier to use than Perl, and Perl was on top because it was significantly easier than shell scripting. Any language that is only incrementally easier, or better in other respects, won't inspire people to learn something new. If we had had Node 15 years ago, maybe it would have been JS.
As far as headless browsers go, Selenium has official Python bindings. Though I kind of consider that cheating.
For my personal taste, I choose Python because I can write it off the top of my head without making syntax errors, or leaning on an IDE. It's the only language I can do that with, and I've been able to do it since about a month into using it.
For low-level stuff where you don't need the overhead of puppeteer, I doubt that there's a better solution. I do pretty much anything with https://www.python-httpx.org these days.
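A small sketch of the kind of low-level use I mean, with a shared client so connections (and TLS handshakes) are reused across requests; the URL and parameters are placeholders:

    import httpx

    with httpx.Client(headers={"User-Agent": "my-scraper/0.1"}, timeout=10.0) as client:
        for page in range(1, 4):
            resp = client.get("https://example.com/jobs", params={"page": page})
            resp.raise_for_status()
            print(page, len(resp.text))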
This is free(!) to host, and the commit log gives an enormous amount of detail about how the scraped resource changed over time.
I wrote more about this trick here: https://simonwillison.net/2020/Oct/9/git-scraping/
Here are 267 repos that are using it: https://github.com/topics/git-scraping?o=desc&s=updated
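The pattern itself is tiny. A hedged sketch of the fetch-and-commit step in Python (a scheduled GitHub Actions workflow would run this and then push; the URL, file name and commit message are placeholders):

    import subprocess, urllib.request

    URL = "https://example.com/data.json"   # the resource being tracked
    OUT = "data.json"

    with urllib.request.urlopen(URL) as r, open(OUT, "wb") as f:
        f.write(r.read())

    # Commit only when the resource actually changed, so the commit log
    # becomes a diffable history of the data.
    subprocess.run(["git", "add", OUT], check=True)
    if subprocess.run(["git", "diff", "--cached", "--quiet"]).returncode != 0:
        subprocess.run(["git", "commit", "-m", "Latest data"], check=True)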