Scrapism (lav.io)
121 points by cjlm on May 25, 2022 | 27 comments



A trick I think would be useful to include here is running scrapers in GitHub Actions that write their results back to the repository.

This is free(!) to host, and the commit log gives an enormous amount of detail about how the scraped resource changed over time.

I wrote more about this trick here: https://simonwillison.net/2020/Oct/9/git-scraping/

Here are 267 repos that are using it: https://github.com/topics/git-scraping?o=desc&s=updated
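
A minimal sketch of the kind of scrape-and-commit step such a workflow might run inside the Actions runner (the URL and filename are placeholders):

    #! /bin/sh
    # Fetch the resource and overwrite the tracked copy.
    curl -s https://example.com/data.json -o data.json
    git add data.json
    # Commit only when the content actually changed, so the log records one commit per change.
    git diff --cached --quiet || git commit -m "Latest data: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    git push
The commit-only-on-change check is what lets the commit log double as a change history.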


I feel like this is bad manners. The runners are a shared resource and you risk getting their IPs blacklisted by the sites you're scraping. I think a strict reading of the GitHub Actions TOS may prohibit this sort of usage, too.

> ... for example, don't use Actions as a content delivery network or as part of a serverless application ...

> Actions should not be used for: ... any other activity unrelated to the production, testing, deployment, or publication of the software project associated with the repository where GitHub Actions are used.

> You may only access and use GitHub Actions to develop and test your application(s).

https://docs.github.com/en/site-policy/github-terms/github-t...


I initially had similar concerns, but the idea seems to be endorsed by the GitHub Developer Experience team: https://githubnext.com/projects/flat-data/


There is a repo on GH that basically does this and can be used as a currency conversion API (with historical rates). It scrapes all values once a day with Actions, commits them, and you can then query them via a CDN.

https://github.com/fawazahmed0/currency-api


Honorable mention even if he doesn't use Actions: https://github.com/elsamuko/Shirt-without-Stripes


Hi Simon! I'll definitely consider adding that in. Also, I love Datasette!


Interesting! Any idea how likely GitHub IPs are to be blocked?


I usually create small scraping scripts for my daily life.

- Getting all of a comic's images and converting them into an e-book for my Kindle.

- Surveying info for buying a new house.

- Helping my wife collect data for her new writing.

- Transferring all my Facebook fan page posts to my personal blog.

And I have enjoyed my journey of scraping things to make my life easier and full of joy.


Wonderful, are you also sharing your scripts with everyone?


Not yet, but I plan to write a blog post about these scripts.


Hi Sam,

It might be worth adding a section on distributed anonymous scrapers that use some form of messaging middleware to distribute the URLs to scrape. Regarding the anonymous aspect (independent of job distribution, of course), you could walk them through using https://github.com/aaronsw/pytorctl or even a rotating tor proxy. This is how I scraped all those Instagram locations + metadata we discussed about five years ago. Hope you’re doing well!
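
A rough sketch of the job-distribution side, assuming Redis as the messaging middleware and torsocks for the Tor hop (the queue name, file names and tool choices here are illustrative, not what was actually used back then):

    # Producer (one machine): push the URLs to scrape onto a shared Redis list.
    while read -r url; do redis-cli LPUSH scrape:urls "$url" > /dev/null; done < url-list

    # Worker (any number of machines): block until a URL arrives, then fetch it
    # through Tor so each worker exits from a different IP.
    while :; do
        url=$(redis-cli --raw BRPOP scrape:urls 0 | tail -n 1)
        torsocks curl -s "$url" -o "$(printf '%s' "$url" | md5sum | cut -d' ' -f1).html"
    done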


This is from 2020. Besides a small change to the "Introduction to the Command Line" section, it has not been updated.

Back in 2015, the author reported using CasperJS to scrape public LinkedIn profiles, and said it was a PITA.

Here the author recommends using WebDriver implementations, e.g., chromedriver or geckodriver, in addition to scripting-language frameworks such as Puppeteer and Selenium. Is scraping LinkedIn still a PITA?

Because the examples given are always relatively simple, i.e., not LinkedIn, I am skeptical when I see "web scraping" tutorials presenting Python frameworks and cURL as the only recommended options for automated public data/information retrieval from the www.[FN1,2] I use none of the above. For small tasks like the examples given in these tutorials, the approaches I use are not nearly as sophisticated/complicated, and yet they are faster and use fewer resources than Python and/or cURL. They are also easier to update if something changes.

That is in part because (1) the binaries I use are smaller, (2) I do not rely on scripting languages[FN3] and third-party libraries (so much less code is involved), (3) the programs I use start working immediately whereas Python takes seconds to start up, and (4) compared to the programs I use, cURL as a means of sending HTTP requests is inflexible, e.g., one is limited to whatever "options" cURL provides, and cURL has no option for HTTP/1.1 pipelining.

1. LinkedIn's so-called "technological measures" to prevent retrieval of public information have failed. Similarly, its attempts to prevent retrieval of public information through intimidation, e.g., cease-and-desist letters and threats of CFAA claims, have failed. Tutorials on "web scraping" that extol Python frameworks should use LinkedIn as an example instead of trivial examples for which using Python is, IMHO, overkill.

2. What would be more interesting is a Rosetta Code for "web scraping" tasks. There are many, many ways to do public data/information retrieval from the www. Using scripting languages such as Python, Ruby, NodeJS, etc. and frameworks are one way. That approach may be ideally suited for large scale jobs, like those undertaken by what the author calls "internet companies". But for smaller tasks undertaken by individual www users for noncommercial purposes, e.g., this author's concept of "scrapism", there are also faster, less complicated and more efficient options.

3. Other than the Almquist shell.


Hi - I'd be interested to hear more details about what approaches you suggest!


Taking the examples from https://www.youtube-nocookie.com/embed/hA1ZsxE8VJg, here is how I approach the simple problems in the video without using Python or having any knowledge of CSS selectors.

Retrieving the HTML

   echo https://www.nytimes.com|yy025|nc -vv proxy 80 > 1.htm
yy025 is a flexible utility I wrote to generate custom HTTP from URLs. It is controlled through environment variables. nc is a tcpclient, such as netcat. proxy is a HOSTS file entry for a localhost TLS proxy. The sequence "yy025|tcpclient" is normally contained in a shell script that adds a <base href> tag, something like

   #! /bin/sh
   # yy025 writes the host name to fd 5 and the raw HTTP request to stdout
   yy025 5>.1 >.2
   read x < .1;
   # emit a <base href> tag so relative links in the saved page resolve
   echo "<base href=https://$x />";
   # send the request through the TLS proxy, then strip chunked transfer encoding
   nc -vv proxy 80 < .2|yy045;
yy045 is a utility that removes chunked transfer encoding.

The benefit of using separate, small programs that do one thing will be illustrated in the solution for Problem 3.


Problem 2 - Extract href value from <a> tags in NYT front page

Create a file called 2.l containing

    int fileno(FILE *);
    #define jmp (yy_start) = 1 + 2 *
    #define echo do {if(fwrite(yytext,(size_t)yyleng,1,yyout)){}}while(0)
   
   %s xa xb
   %option noyywrap noinput nounput
   %%
   \<a jmp xa;
   <xa>\40href=\" jmp xb;
   <xb>\" jmp 0;
   <xb>[^\"]* echo;putchar(10);
   .|\n
   %%
   int main(){ yylex();exit(0);}
Compile

    flex -8iCrf 2.l
    cc  -std=c89 -Wall -pedantic -I$HOME -pipe lex.yy.c -static -o yy2
And finally,

    yy2 < 1.htm
This is faster than Python and requires fewer resources.


It's hard to imagine an environment where the speed/resource difference between that approach and Python would matter.

Can't see reaching for that instead of something like

    curl -s url | htmlq a --attribute href


Problem 1 - Extract the values of <h2> tags from NYT front page

NB. In 1.htm, NYT is using the <h3> tag for headlines, not <h2> as in the 2020 video.

Solution A - Use UNIX utilities

    grep -o "<h3[^\>]*>[^\<]*" 1.htm |sed -n '/indicate-hover/s/.*\">//p'
The grep utility is ubiquitous, but the -o option is not.

https://web.archive.org/web/20201202103125/https://pubs.open...

For example, Plan9 grep does not have an -o option.

This solution is fast and flexible, but not portable.

There are myriad other portable solutions using POSIX UNIX utilities such as sh, tr and sed. For small tasks like those in "web scraping" tutorials these can still be faster than Python (due to Python start up time alone).
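
For instance, a tr + sed variant that works even when the whole page is one long line; it prints the text node that follows every opening <h3> tag, so further filtering (e.g., on the class attribute) can be piped through grep:

    tr '<' '\012' < 1.htm | sed -n 's/^h3[^>]*>//p'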

Solution B - Use flex to make small, fast, custom utilities

Create a file called 1.l that contains

    int fileno(FILE *);
    #define jmp (yy_start) = 1 + 2 *
    #define echo do {if(fwrite(yytext,(size_t)yyleng,1,yyout)){}}while(0)

   %s xa xb
   %option noyywrap noinput nounput
   %%
   \<h3 jmp xa;
   <xa>\> jmp xb;
   <xb>\< jmp 0;
   <xb>[^<]* echo;putchar(10);
   .|\n
   %%
   int main(){ yylex();exit(0);}
Then compile with something like

    flex -8iCrf 1.l 
    cc  -std=c89 -Wall -pedantic -I$HOME -pipe lex.yy.c -static -o yy1 
And finally,

    yy1 < 1.htm
This is faster than Python.

Solution C - Extract values from JSON instead of HTML

The file 1.htm contains a large proportion of what appears to be JSON.

I wrote a quick and dirty WIP JSON reformatter, called yy059, that takes web pages as input: https://news.ycombinator.com/item?id=31174088

   yy059 < 1.htm|sed -n '/promotionalHeadline\":\"[^\"]/p'|cut -d\" -f4
Sure enough, the JSON contains the headlines. One could rewrite Solution B to extract from the JSON instead of the HTML.


Problem 3 - Extract totalcount value from <span> tag in Craigslist job pages

Create a file called 3.l containing

    int fileno(FILE *);
    #define jmp (yy_start) = 1 + 2 *
   %s xa xb xc
   %option noyywrap noinput nounput
   %%
   \<ul\40id=\"jjj0\" jmp xa;
   <xa>"</ul>" yyterminate();
   <xa><a\40href=\" jmp xb;
   <xb>\" putchar(10);jmp xa;
   <xb>[^\"]* fprintf(stdout,"%s%s","https://newyork.craigslist.org",yytext);
   .|\n
   %%
   int main(){ yylex();exit(0);}
Compile

   flex -8iCrf 3.l
   cc  -std=c89 -Wall -pedantic -I$HOME -pipe lex.yy.c -static -o yy3 
yy3 extracts and prints the URLs for the job pages.

Create a file called 4.l containing

    int fileno(FILE *);
    #define jmp (yy_start) = 1 + 2 *
    #define echo do{if(fwrite(yytext,(size_t)yyleng,1,yyout)){}}while(0)
   %s xa xb xc xd xe
   %option noyywrap noinput nounput
   %%
   \<h1\40class=\"cattitle\" jmp xa;
   <xa>\<a\40href jmp xb;
   <xb>\"\> jmp xc;
   <xc>[^<]* fprintf(stdout,"%s ",yytext);jmp xd;
   <xd>\<span\40class=\"totalcount\"\> jmp xe;
   <xe>\< jmp 0;
   <xe>[0-9]* echo;putchar(10);
   .|\n
   %%
   int main(){ yylex();exit(0);}
Compile

   flex -8iCrf 4.l
   cc  -std=c89 -Wall -pedantic -I$HOME -pipe lex.yy.c -static -o yy4 
yy4 extracts and prints the job category name and totalcount.

We can either solve this in steps, creating intermediate files, or do it as a single pipeline. I personally find that breaking a problem into discrete steps is easier.

In steps

    echo http://newyork.craigslist.org|yy025|nc -vv proxy 80|yy045 > 1.htm;
    ka;yy3 < 1.htm|yy025|nc -vv proxy 80|yy045 > 2.htm;ka-;
    yy4 < 2.htm; 
As a single pipeline

    echo http://newyork.craigslist.org|yy025|nc -vv proxy 80|yy045|yy3|(ka;yy025)|nc -vv proxy 80|yy045|yy4;ka-
Shortened further by using a shell script called nc0 for the yy025|nc|yy045 sequence

    echo https://newyork.craigslist.org|nc0|yy3|(ka;nc0)|yy4
Thanks to yy025, we are using HTTP/1.1 pipelining. This is a feature of HTTP that almost 100% of httpds support (I cannot name one that doesn't), yet neither "modern" browsers nor cURL can take advantage of it. Multiple HTTP requests are made over a single TCP connection. Unlike the Python tutorial in the video, we are not "hammering" a server with multiple simultaneous TCP connections, nor are we making a number of successive TCP connections that could "trigger a block". We are following the guidance of the RFCs, which historically recommended that clients not open many connections to the same host at the same time. Here we open only one connection for retrieving all the job pages. Adding a delay between requests is unnecessary; we allow the server to return the results at its own pace. For most websites this is remarkably fast. Craigslist is an anomaly and is rather slow.
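
For readers without yy025, the idea can be illustrated with plain printf and nc: write several requests up front over one TCP connection and read the responses back in order (example.com is just a placeholder host):

    printf 'GET /page1 HTTP/1.1\r\nHost: example.com\r\nConnection: keep-alive\r\n\r\nGET /page2 HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n' | nc example.com 80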

What are ka and ka-? yy025 sets HTTP headers according to environment variables. For example, the value of Connection is set to "close" by default. To change it,

    Connection=keep-alive yy025 < url-list|nc -vv proxy 80 >0.htm
Another way is to use aliases

    alias ka="export Connection=keep-alive;set|sed -n /^Connection/p";
    alias ka-="export Connection=close;set|sed -n /^Connection/p";
    ka;yy025 < url-list|nc -vv proxy 80 >0.htm;ka-
yy025 is intended to be used with djb's envdir. Custom sets of headers can thus be defined in a directory.
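
For example (the directory and file names here are illustrative):

    mkdir -p headers/keepalive
    printf 'keep-alive' > headers/keepalive/Connection
    envdir headers/keepalive yy025 < url-list | nc -vv proxy 80 > 0.htm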

This solution uses fewer resources, both on the client side and on the server side, than a Python approach. It is probably faster, too.


Sorry if I missed something, but which programs do you use?


Good guide!

The "Scraping XHR" section [1] explains how to inspect network requests and reproduce them with Python. I actually built har2requests [2] to automate that process!

[1]: https://scrapism.lav.io/scraping-xhr/ [2]: https://github.com/louisabraham/har2requests


Thanks for sharing this - I'll check it out!


Hi! This is a guide that I started during the pandemic but never quite finished. I’m in the process of re-writing/re-recording some parts of it to bring it back up to date, and adding in the bits that are still missing.


I'm bothered that this doesn't mention any of the ethics involved, such as checking the robots.txt file and so forth.

More than half of my traffic is from bots, so I'm paying something like half my operational expenses to support them. And we've had to do a lot of work to mitigate what would otherwise be DoS attacks from badly written (or badly intended!) bots. I think that at least a tip of the hat to avoiding damage would be appropriate in a piece like this.


Tangentially related question:

Is Python still the most common tool used for web scraping and if so, what's the advantage over jsdom/cheerio or, say a headless browser based tool like puppeteer?

I've been using these tools for years, but I grew up in the JS world, so I'd be curious to hear from people with different backgrounds/biases than mine :)


> Is Python still the most common tool used for web scraping and if so, what's the advantage over jsdom/cheerio or, say a headless browser based tool like puppeteer?

I'm a bit of a Python zealot, but if you prefer JS then use JS. The best tool is the one you know.

I think Python became the scraping language because many people thought it was significantly easier to use than Perl, and Perl was on top because it was significantly easier than shell scripting. Any language that is only incrementally easier, or better in other respects, won't inspire people to learn something new. If we'd had Node 15 years ago, maybe it would have been JS.

As far as headless browsers go, Selenium has official Python bindings. Though I kind of consider that cheating.

For my personal taste, I choose Python because I can write it off the top of my head without making syntax errors, or leaning on an IDE. It's the only language I can do that with, and I've been able to do it since about a month into using it.


For low-level stuff where you don't need the overhead of puppeteer, I doubt that there's a better solution. I do pretty much anything with https://www.python-httpx.org these days.


Was happy to find that the person behind it is Sam Lavigne, one of the people behind Stupid Hackathon.



