> Which means using wget as your HTTP module and a scripting language as the glue for the logic you'll ultimately need to implement to create a robust crawler (robust to failures and edge cases).
This is kind of the premise of this discussion. You don't use Hadoop to process 2GB of data, but you don't build Googlebot using bash and wget. There is a scale past which it makes sense to use the Big Data toolbox. The point is that most people never get there. Your crawler is never going to be Googlebot.
> Is wget able to check whether a previously failed page exists on disk [in some kind of index] before making any new HTTP requests? It sounds like this would try fetching every failed URL until it reaches the point where it left off before the restart. If it's not possible to maintain an index of unfetchable URLs and reasons for the failures then this would be one reason why wget wouldn't work in place of software designed for the task of crawling (as opposed to just fetching).
It really depends what you're trying to do here. If you're restarting the crawler because, e.g., your internet connection flapped while it was running or some server was temporarily giving spurious HTTP errors, then you want the failed URLs to be retried. If you're only restarting it because you had to pause it momentarily and want to carry on from where you left off, then you can easily record the last URL you tried and strip all of the previous ones from the list before restarting.
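A minimal sketch of that resume step, assuming the frontier lives in a urls.txt file and the crawler appends every URL it attempts to an attempted.log (both file names are made up for illustration):

    #!/bin/sh
    # Keep only the frontier entries that come after the last URL we tried.
    # Assumes urls.txt is the frontier and attempted.log lists every URL
    # attempted so far, one per line, in order.
    last=$(tail -n 1 attempted.log)
    awk -v last="$last" 'seen { print } $0 == last { seen = 1 }' urls.txt > urls.remaining.txt
    mv urls.remaining.txt urls.txt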
But I think what you're really running into is that we ended up talking about wget, and wget isn't really designed in the Unix tradition. The recursive mode in particular doesn't compose well. It should be at least two separate programs, one that fetches via HTTP and one that parses HTML. Then you can see the easy solution to that class of problems: when you fetch a URL, you write the URL and the retrieval status to a file, which you can parse later to do the things you're referring to.
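As a rough sketch of the fetch half under that split (the file names here are just one plausible wiring, not anything wget gives you out of the box):

    #!/bin/sh
    # Fetch step only: read URLs on stdin, save each page under pages/,
    # and log "URL<TAB>exit-status" so a later pass can retry or skip
    # failures. (wget exits 0 on success and 8 on a server error response.)
    while IFS= read -r url; do
        wget -q -x -P pages -- "$url"
        printf '%s\t%d\n' "$url" "$?" >> fetch-status.log
    done

A second script would then walk pages/ to extract links and append new URLs to the frontier.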
> If you're trying to saturate your connection with multiple wget instances, how do you make sure that you're not fetching more than one page from a single server at once (being a friendly crawler)? Or how would you honor robots.txt's Crawl-delay with multiple instances?
Give each process a FIFO to read URLs from. Then you choose which FIFO to add a URL to based on the address so that all URLs with the same address are assigned to the same process.
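A sketch of that routing with two workers, in POSIX sh; the FIFO names, worker count, and one-second delay are all arbitrary:

    #!/bin/sh
    # Two workers, each reading URLs from its own FIFO; a dispatcher hashes
    # the host so every URL for a given server goes to the same worker.
    mkfifo q0 q1

    for q in q0 q1; do
        while IFS= read -r url; do
            wget -q -- "$url"
            sleep 1    # crude politeness delay between requests
        done < "$q" &
    done

    # Keep both FIFOs open for writing for the whole run, otherwise the
    # workers would see EOF after the first URL.
    exec 3> q0 4> q1
    while IFS= read -r url; do
        host=${url#*://}; host=${host%%/*}
        n=$(( $(printf '%s' "$host" | cksum | cut -d ' ' -f 1) % 2 ))
        if [ "$n" -eq 0 ]; then
            printf '%s\n' "$url" >&3
        else
            printf '%s\n' "$url" >&4
        fi
    done < urls.txt
    exec 3>&- 4>&-
    wait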
> Give each process a FIFO to read URLs from. Then you choose which FIFO to add a URL to based on the address so that all URLs with the same address are assigned to the same process.
I wrote this in a reply to myself a moment after you posted your comment so I'll just move it here:
Regarding the last two issues I mentioned, you could sort the list of URLs by domain and split the list when the new list's length is >= n URLs and the domain on the current line is different from the domain on the previous line. As long as wget can at least honor robots.txt directives between consecutive requests to a domain, it should all work out fine.
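A rough sketch of that sort-and-split, assuming a urls.txt frontier and a chunk size of 1000 picked arbitrarily (each chunk.N file could then be fed to its own wget -i instance):

    #!/bin/sh
    # Sort the frontier so each domain's URLs are contiguous, then split it
    # into chunks of at least 1000 URLs, only breaking on a domain boundary
    # so no domain is shared between two chunks.
    sort -t/ -k3,3 urls.txt | awk -v n=1000 '
        BEGIN { chunk = 0 }
        {
            host = $0
            sub(/^[a-z]+:\/\//, "", host)   # strip the scheme
            sub(/\/.*/, "", host)           # strip the path, leaving the host
            if (count >= n && host != prev) { chunk++; count = 0 }
            print > ("chunk." chunk)
            count++
            prev = host
        }'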
It looks like an easily solvable problem however you go about it.
> It really depends what you're trying to do here.
I was thinking about HTTP requests that respond with 4xx and 5xx errors. It would need to be possible to either remove those from the frontier and store them in a separate list, or mark them with the error code so that they can be checked at some point before being passed on to wget.
Open file on disk. See that it's 404. Delete file. Re-run crawler.
You'd turn that into code by doing grep -R 404 . (or whatever the actual unique error string is) and deleting any file containing the error message. (You'd be careful not to run that recursive delete on any unexpected data.)
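Spelled out a bit, with a review step before anything gets deleted (the search string and the crawl/ directory are placeholders):

    # List files whose contents match the site's 404 marker; "Not Found"
    # stands in for whatever unique string that site's error page uses.
    grep -rl 'Not Found' crawl/ > to-delete.txt
    # Inspect to-delete.txt by hand, then remove the matched files (GNU xargs).
    xargs -d '\n' rm -- < to-delete.txt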
Really, these problems are pretty easy. It's easy to overthink it.
This isn't 1995 anymore. When you hit a 404 error, you no longer get Apache's default 404 page. You really can't count on there being any consistency between 404 pages on different sites.
If wget somehow stored the response header info to disk (e.g. "FILENAME.header-info"), you could whip something up to do what you're suggesting, though.
Yeah, wget can store response info to disk (--save-headers prepends the server's response headers to each saved file). Besides, even if it couldn't, you could still visit a 404 page of the website and figure out a unique string of text to search for.
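For instance, if the crawl was run with --save-headers (and, in newer wgets, --content-on-error so the 4xx bodies get saved at all), you could key off the status line instead of the page body; a sketch, with crawl/ as a made-up download directory:

    # Files saved with --save-headers begin with the HTTP response headers,
    # so a 404 shows up in the first few lines rather than the page body.
    find crawl/ -type f | while IFS= read -r f; do
        head -n 3 "$f" | grep -q '^HTTP/1\.[01] 404' && printf '%s\n' "$f"
    done > bad-pages.txt
    # Review bad-pages.txt, then delete as above.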