For example, Plan 9 grep does not have an -o option.
This solution is fast and flexible, but not portable.
There are myriad other, portable solutions using POSIX UNIX utilities such as sh, tr and sed. For small tasks like those in "web scraping" tutorials, these can still be faster than Python (due to Python startup time alone).
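As a rough illustration of the portable route, here is a minimal sketch using only tr and sed. The function name h3text is made up for this example, and it assumes the headline text directly follows the <h3 ...> tag with no nested tags in between.

```shell
# h3text: print the text that directly follows each <h3 ...> tag on stdin,
# one headline per line. Hypothetical helper; assumes the headline text is
# not wrapped in further tags.
h3text() {
    # Join all input lines into one, then put the final newline back so that
    # sed always sees a complete last line.
    { tr '\n' ' '; echo; } |
    # Start a new line at every "<", so each tag begins its own line.
    tr '<' '\n' |
    # Keep only lines that begin with an h3 tag and carry non-empty text.
    sed -n 's/^h3[^>]*>\(..*\)$/\1/p'
}
```

Usage would be something like h3text < 1.htm. There is no grep -o involved, so it should run anywhere tr and sed behave as POSIX describes.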
Solution B - Use flex to make small, fast, custom utilities
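Create a file called 1.l that contains:
%{
/* fileno is POSIX, not ISO C; declare it in case <stdio.h> hides it. */
int fileno(FILE *);
/* jmp switches start condition, the same way flex's BEGIN macro does. */
#define jmp (yy_start) = 1 + 2 *
/* echo writes the current match to yyout and ignores fwrite's result. */
#define echo do {if(fwrite(yytext,(size_t)yyleng,1,yyout)){}}while(0)
%}
%s xa xb
%option noyywrap noinput nounput
%%
\<h3 jmp xa;
<xa>\> jmp xb;
<xb>\< jmp 0;
<xb>[^<]* echo;putchar(10);
.|\n
%%
int main(){ yylex();exit(0);}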
NB. In 1.htm, NYT is using the <h3> tag for headlines, not <h2> as in the 2020 video.
Then compile 1.l with flex and a C compiler and, finally, run the resulting program with 1.htm on standard input. This is faster than Python.
Solution C - Extract values from JSON instead of HTML
The file 1.htm contains a large proportion of what appears to be JSON.
I wrote a quick and dirty WIP JSON reformatter called yy059 that takes web pages as input: https://news.ycombinator.com/item?id=31174088
Sure enough, the JSON contains the headlines. One could rewrite Solution B to extract from the JSON instead of the HTML.
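To give a feel for what extracting from the JSON could look like, here is a small shell stand-in (not a flex rewrite, and not yy059). The helper name jsonval and the key name "headline" are assumptions made for the example; the actual key names in 1.htm would need to be checked. It also assumes string values contain no escaped quotes.

```shell
# jsonval KEY: print the string value of every "KEY": "..." pair on stdin.
# Hypothetical helper. Assumes values hold no escaped quotes (\") and that
# KEY contains no regular-expression metacharacters.
jsonval() {
    # Join input lines, then restore a final newline for sed.
    { tr '\n' ' '; echo; } |
    # Turn every double quote into a line break, so that each key and each
    # string value ends up alone on its own line.
    tr '"' '\n' |
    # On a line equal to KEY, expect a bare ":" on the next line and print
    # the line after that, which is the string value.
    sed -n "/^$1\$/{
n
/^[[:space:]]*:[[:space:]]*\$/{
n
p
}
}"
}
```

Usage would be something like jsonval headline < 1.htm, with "headline" swapped for whatever key the page actually uses.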