Problem 1 - Extract the values of <h2> tags from NYT front page

NB. In 1.htm, NYT is using the <h3> tag for headlines, not <h2> as in the 2020 video.

Solution A - Use UNIX utilities

    grep -o '<h3[^>]*>[^<]*' 1.htm | sed -n '/indicate-hover/s/.*">//p'

The grep utility is ubiquitous, but the -o option is not.

https://web.archive.org/web/20201202103125/https://pubs.open...

For example, Plan9 grep does not have an -o option.

This solution is fast and flexible, but not portable.

There are myriad other portable solutions using POSIX UNIX utilities such as sh, tr and sed. For small tasks like those in "web scraping" tutorials, these can still be faster than Python, if only because of Python's startup time.
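
As a hedged sketch of one such portable variant of Solution A: tr puts every tag on its own line, then sed keeps the indicate-hover <h3> lines and strips the tag. The function name h3_headlines is made up for illustration, and like the grep version it assumes no attribute value contains a literal ">".

```sh
# Hypothetical helper (name is an assumption): print the text that follows
# each <h3 ...indicate-hover...> opening tag, using only tr and sed.
h3_headlines() {
    # tr maps newline -> space and '<' -> newline, so each tag starts a line
    # (without its leading '<') and is followed by the text after it.
    # sed then selects lines that begin with an h3 tag whose attributes
    # mention indicate-hover, and deletes everything up to the first '>'.
    tr '\n<' ' \n' | sed -n '/^h3[^>]*indicate-hover/s/^[^>]*>//p'
}
```

Usage would be h3_headlines < 1.htm. Unlike grep -o, this also joins a headline that is split across input lines, because newlines are turned into spaces first.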

Solution B - Use flex to make small, fast, custom utilities

Create a file called 1.l that contains

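    int fileno(FILE *); /* declared by hand: -std=c89 typically hides it in <stdio.h> */
    #define jmp (yy_start) = 1 + 2 * /* same expansion as flex's BEGIN */
    #define echo do {if(fwrite(yytext,(size_t)yyleng,1,yyout)){}}while(0) /* like flex's ECHO */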

   %s xa xb
   %option noyywrap noinput nounput
   %%
   \<h3 jmp xa;
   <xa>\> jmp xb;
   <xb>\< jmp 0;
   <xb>[^<]* echo;putchar(10);
   .|\n
   %%
   int main(){ yylex();exit(0);}

Then compile with something like

    flex -8iCrf 1.l
    cc -std=c89 -Wall -pedantic -I$HOME -pipe lex.yy.c -static -o yy1

And finally,

    ./yy1 < 1.htm

This is faster than Python.

Solution C - Extract values from JSON instead of HTML

A large proportion of the file 1.htm appears to be JSON.

I wrote a quick-and-dirty, work-in-progress JSON reformatter called yy059 that takes web pages as input. https://news.ycombinator.com/item?id=31174088

    yy059 < 1.htm | sed -n '/promotionalHeadline":"[^"]/p' | cut -d\" -f4

Sure enough, the JSON contains the headlines. One could rewrite Solution B to extract from the JSON instead of the HTML.
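
For comparison, here is a rough sketch of pulling the same field out with only tr and sed, without yy059. It assumes the raw page contains the key and value literally as "promotionalHeadline":"..." with nothing but a colon between them; that is an assumption, not something checked against 1.htm. The function name json_headlines is made up for illustration.

```sh
# Hypothetical helper (name is an assumption): print non-empty values of
# "promotionalHeadline" keys found anywhere on standard input.
json_headlines() {
    # tr puts every double-quote-delimited token on its own line, so a key,
    # the ':' separator and the value land on three consecutive lines.
    # sed finds a line that is exactly the key, skips the separator line,
    # and prints the value line if it is not empty.
    tr '"' '\n' | sed -n '/^promotionalHeadline$/{n;n;/./p;}'
}
```

Usage would be json_headlines < 1.htm. As with cut -d\" -f4, a headline containing an escaped quote (\") gets cut short at that quote.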


