
>> suppose you have millions of web pages that you want to download and save to disk for later processing. How do you do it?

I don't know enough about the 'real way' or the 'taco bell way', but I'm interested to know: is this doable the way Ted describes in the article, via xargs and wget?




Yes, absolutely. This is how ~~we~~ many (most?) of us used to scrape web pages in the Dark Ages.


I would assume a combination of

- sed/awk to extract URLs, one per line

- xargs and wget to download each page from the previous output (rough sketch below)
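
For what it's worth, a minimal sketch of that two-step pipeline, assuming the source HTML sits in hypothetical index*.html files and urls.txt is the intermediate file. grep -o stands in for the sed/awk extraction step, and the URL pattern, the -P 8 parallelism, and the timeout are just illustrative choices:

    # Extract URLs from the saved HTML into urls.txt, one per line
    # (the regex is a rough assumption about what the markup looks like)
    grep -ohE 'https?://[^" <>]+' index*.html | sort -u > urls.txt

    # Fetch them in parallel: -n 1 passes one URL per wget invocation,
    # -P 8 runs up to eight downloads at once
    xargs -n 1 -P 8 wget -q --timeout=30 < urls.txt

wget writes each page to disk under a filename derived from its URL, and bumping -P up or down is about all the "job scheduling" this needs.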



