The problem I've had using jsdom for scraping web pages is that it is not very forgiving of bad HTML. There are so many pages in the wild with malformed HTML, and jsdom just pukes on it. I started using Apricot [1], which is built on node-htmlparser [2], and that has been better. I'd like to hear what others are using to scrape bad web pages.
The issue, as the above thread shows, is that the creator of node-htmlparser does not want to support illegal HTML. Unfortunately, that position isn't realistic given how much of the web is malformed.
A couple of months ago I posted a very similar article that uses node.js, request and jQuery to achieve the same goal. In this case, as long as jQuery can handle the response you shouldn't have any problem with malformed HTML...
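For anyone who hasn't tried that pattern, here's a rough sketch of the general approach (not the article's exact code). It assumes the older jsdom.env(html, scripts, callback) API plus the request module, and the URL and selector are just placeholders:

    // Rough sketch: fetch a page with request, load it into jsdom,
    // inject jQuery, then query with familiar selectors.
    // Assumes the old jsdom.env(html, scripts, callback) API.
    var request = require('request');
    var jsdom = require('jsdom');

    request('http://news.ycombinator.com/', function (err, res, body) {
      if (err) throw err;
      jsdom.env(body, ['http://code.jquery.com/jquery.js'], function (errors, window) {
        if (errors) throw errors;
        var $ = window.$;
        // Placeholder selector -- replace with whatever you actually need.
        $('td.title a').each(function () {
          console.log($(this).text());
        });
      });
    });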
Somewhat tangential, here is something I have long thought would be a very useful project, but unfortunately haven't had the time to build:
It is a scraping/crawling tool suite. It would be based on WebKit, with good scriptable plugin support (not just JS; the DOM would be exposed to other languages too), and it would consist of a few main parts.
1) What I call the Scrape-builder. This is essentially a fancy web browser, but with a rich UI that can be used to select portions of a web page, expose the appropriate DOM elements, and record how to find those elements in the page. By expose, I mean put into some sort of editor/IDE - it could be raw HTML, or some sort of description language. In the editor, the elements one wants to scrape can then be selected and put into some sort of object for later processing. This can include some form of scripting to mangle the data as needed. It can also include interactions with the JavaScript on the page, recording click macros (well, event firing and such). The point of this component is to allow content experts and non- or novice programmers to easily arrange for the "interesting" data to be selected for scraping.
2) The second component of the suite is a scraping engine. It uses the descriptions + macros + scripts from the Scrape-builder to actually pull data from the pages and turn it into data objects. These objects can then be put on a queue for later processing by backend systems/code. The scraping engine is basically a stripped-down WebKit without the rendering/layout/display bits compiled in. It just builds the DOM and executes the page's JavaScript to ultimately scrape the bits selected. It is driven by the spidering engine.
3) The spidering engine is what determines which pages to point the scraping engine at. It can be fed by external code, or by scripts from the scraping engine as a feedback mechanism (some links on a page may be part of the current scrape, others may just be fodder for a later one). It can be thought of as a work queue for the scraping engines.
The main use cases I see for this are specialized search and aggregation engines that want to get at the interesting bits of sites which don't expose a good API, or where the data is formatted but hard to semantically infer without human intervention. Sure, it wouldn't be as efficient in terms of code execution as, say, custom scraping scripts, but it would allow much faster response to page changes and make better use of programmer time by taking care of the boilerplate or almost-boilerplate parts of scraping scenarios. A rough sketch of how the pieces might fit together follows.
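To make that a little more concrete, here is a purely hypothetical sketch - everything in it (the scrapeJob fields, the scrapeEngine object, the queue) is invented for illustration, not a real API:

    // Hypothetical sketch of the suite's data flow: a "scrape description"
    // produced by the Scrape-builder, consumed by a scraping engine, with
    // discovered links fed back to the spidering work queue.
    var scrapeJob = {
      startUrl: 'http://example.com/listing',
      // Selectors produced by pointing and clicking in the Scrape-builder UI.
      selectors: { title: 'h2.item-title', price: 'span.price' },
      // Recorded interactions to fire before extraction (click macros, etc.).
      macros: [{ event: 'click', target: 'a.show-more' }],
      // Links to hand back to the spidering engine for later scraping.
      followLinks: 'a.next-page'
    };

    // The spidering engine as a simple work queue.
    var queue = [scrapeJob.startUrl];

    function crawl(scrapeEngine) {
      var url = queue.shift();
      if (!url) return;                                         // queue drained
      scrapeEngine.scrape(url, scrapeJob, function (dataObject, newUrls) {
        process.stdout.write(JSON.stringify(dataObject) + '\n'); // hand off to backend
        queue.push.apply(queue, newUrls || []);                  // feedback loop
        crawl(scrapeEngine);                                     // next item
      });
    }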
OK, one thing I'm confused about... what's the advantage of scraping with node/jQuery over a traditional scripting language like Ruby + Nokogiri or Mechanize?
It's true that this process won't render an AJAX-heavy page the way your browser does, but I've found that if you do some inspection of the page to determine the address and parameters of the backend scripts, you don't even have to pull HTML at all. You just hit the scripts directly and feed them parameters (or use Mechanize if cookie/state tracking is involved).
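For example, something along these lines - the endpoint and parameters are made up here; the real ones come from your browser's network inspector:

    // Sketch of skipping the HTML entirely and hitting the backend
    // endpoint the page's AJAX calls. URL and parameters are hypothetical.
    var request = require('request');

    request({
      url: 'http://example.com/api/search',   // hypothetical endpoint
      qs: { q: 'widgets', page: 1 },          // parameters seen in the inspector
      json: true                              // parse the JSON response for us
    }, function (err, res, data) {
      if (err) throw err;
      console.log(data);                      // already structured, no HTML parsing
    });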
Partly the advantage is that you have CSS selectors to examine your document (I don't know if Ruby/Mechanize does that - I'm just saying what is good about node+jquery), and you have a language that all web developers know about. So it's about minimising friction from doing front end web work to doing scraping work. At my company this gives us a financial advantage - we can hire basic jQuery web developers to work on our scrapers.
Thanks for sharing. This is a bit off-topic, but if you are interested in scraping web pages, you might find http://cQuery.com an interesting solution; it uses CSS selectors (much like jQuery) as its mechanism for extracting content from live web pages.
It's worth noting that Hacker News seems to temporarily block IP addresses if too many requests are made. I'm not sure whether the limit is per minute or per hour, but my IP was blocked three times while playing around with a similar script.
PhantomJS has much larger overhead (it runs a full browser). Plus it doesn't give you access to the full Node.js ecosystem (e.g. database access, etc.). You can use a PhantomJS driver for Node, but it spawns a child process to do the work, and all the interactions seem a little complicated.
And of course you can scrape interactive sites - interactive sites still basically just use HTTP to request data. Just watch the network window in Chrome's developer tools, and figure out what HTTP requests the site is making that you are interested in. Then code them into request() calls.
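For instance, replaying an XHR you spotted in the network window might look like this - the endpoint, form fields, and header are hypothetical; copy the real ones from the request details:

    // Sketch of replaying an XHR seen in Chrome's network window as a
    // plain request() call. Endpoint and payload are placeholders.
    var request = require('request');

    request.post({
      url: 'http://example.com/ajax/load-items',   // hypothetical XHR endpoint
      form: { category: 'news', offset: 20 },      // form fields the page sends
      headers: { 'X-Requested-With': 'XMLHttpRequest' }
    }, function (err, res, body) {
      if (err) throw err;
      console.log(JSON.parse(body));
    });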
Phantom doesn't need access to Node APIs. Separation of concerns. Run your Phantom script to scrape, then pipe the results to your Node script for processing. I've worked on two scraping projects using this method and it works great.
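The Phantom side can stay very small. A rough sketch (URL and selector are placeholders), run as phantomjs scrape.js | node process.js:

    // scrape.js -- open a page in PhantomJS, extract some text in the page
    // context, and print it as JSON on stdout for the Node side to consume.
    var page = require('webpage').create();

    page.open('http://example.com/', function (status) {
      if (status !== 'success') {
        console.log('failed to load page');
        phantom.exit(1);
        return;
      }

      var titles = page.evaluate(function () {
        // Runs inside the page, so the DOM (and the page's own JS) is available.
        return Array.prototype.map.call(
          document.querySelectorAll('h2.title'),
          function (el) { return el.textContent.trim(); }
        );
      });

      console.log(JSON.stringify(titles));   // piped to the Node script
      phantom.exit();
    });

On the Node side you just read stdin, JSON.parse it, and do whatever database work you need.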
Sure, it can work great if your needs are simple. But if you're posting data to forms, that data has to come from a database, and you don't know what you'll need from the DB until runtime, it can get a bit more complex.
Also worth noting that jsdom is quite slow, and has a strict HTML parser. If you want something faster that will cope with more web pages, look up cheerio.
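A quick cheerio sketch for comparison - same jQuery-style selectors, no DOM emulation (the URL and selector are just examples):

    // Minimal cheerio sketch: jQuery-like selectors over a fast, forgiving
    // HTML parser, with no browser or DOM emulation involved.
    var request = require('request');
    var cheerio = require('cheerio');

    request('http://news.ycombinator.com/', function (err, res, body) {
      if (err) throw err;
      var $ = cheerio.load(body);         // copes with messy markup
      $('td.title a').each(function () {
        console.log($(this).text());      // example selector only
      });
    });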
Remember that Phantom just runs a web browser. So writing a Phantom script is like writing an app on top of another app. This means you can use XMLHttpRequest or WebSockets to communicate with a back-end if that's necessary.
By default jsdom doesn't run JavaScript files unless they're specified within your program. However, if you call 'jsdom.jsdom' it will run the scripts referenced in the web page as well as those you specify in your program.
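As a rough sketch of what that looks like - this assumes the older jsdom API where script execution is controlled through the 'features' options; check the docs for your version, since the defaults have changed over time:

    // Sketch: ask jsdom to fetch and execute the page's own <script> tags.
    // Assumes the old jsdom.env config form with 'features'; treat as illustrative.
    var jsdom = require('jsdom');

    jsdom.env({
      url: 'http://example.com/',
      features: {
        FetchExternalResources: ['script'],    // download the page's scripts
        ProcessExternalResources: ['script']   // and actually execute them
      },
      done: function (errors, window) {
        if (errors) throw errors;
        // By now the page's inline/external scripts have had a chance to run.
        console.log(window.document.title);
      }
    });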
But will the page's scripts respond to events? For example, on Twitter after the page loads it looks for a tweet in the hashbang and, if there is one, does an XMLHttpRequest and then updates the DOM. Will jsdom be able to scrape sites like that?
"curl" can scrape sites like that. Just watch the Chrome "Network" window and see what requests get made. Then use that. Scraping doesn't need a browser.
Of course curl can. And curl was the best option 15, even 10, years ago. But to scrape a complicated site with curl (or any other non-browser scraping utility) you wind up basically rewriting the application you're attempting to scrape.
A browser's sole purpose is to run web apps. It doesn't make sense to use anything else.
At the low end of the scale I agree. But once you start scaling it's very difficult to run more than around 40 concurrent sessions with a browser-based scraper.
With a pure Node.js scraper we can run over 1000 parallel sessions per CPU.
[1] https://github.com/silentrob/Apricot
[2] https://github.com/tautologistics/node-htmlparser