
>> suppose you have millions of web pages that you want to download and save to disk for later processing. How do you do it?

I don't know enough about the 'real way' or the 'taco bell way', but I'm interested to know: is this doable the way Ted describes in the article, via xargs and wget?




Yes, absolutely. This is how ~~we~~ many (most?) of us used to scrape web pages in the Dark Ages.


I would assume a combination of

- sed/awk to extract URLs, one per line

- xargs and wget to download each page from the previous output (rough sketch below)
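
For what it's worth, a minimal sketch of that two-step pipeline, assuming the source HTML sits in hypothetical index*.html files and urls.txt is the intermediate file. grep -o stands in for the sed/awk extraction step, and the URL pattern, the -P 8 parallelism, and the timeout are just illustrative choices:

    # Extract URLs from the saved HTML into urls.txt, one per line
    # (the regex is a rough assumption about what the markup looks like)
    grep -ohE 'https?://[^" <>]+' index*.html | sort -u > urls.txt

    # Fetch them in parallel: -n 1 passes one URL per wget invocation,
    # -P 8 runs up to eight downloads at once
    xargs -n 1 -P 8 wget -q --timeout=30 < urls.txt

wget writes each page to disk under a filename derived from its URL, and bumping -P up or down is about all the "job scheduling" this needs.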



