Pro scraping with Node.JS

simonw · on Jan 23, 2011

I was intrigued to see what CSS selector engine it was using...

https://github.com/chriso/node.io uses https://github.com/harryf/node-soupselect

https://github.com/harryf/node-soupselect is a port of my https://github.com/simonw/soupselect library for Python

https://github.com/simonw/soupselect is a port of my getElementsBySelector function for JavaScript: http://simonwillison.net/2003/Mar/25/getElementsBySelector/

I'm always surprised to see that code still being used - it's the least complete selector library out there by a long way.

harryf · on Jan 23, 2011

> I'm always surprised to see that code still being used - it's the least complete selector library out there by a long way.

For this use case being complete doesn't matter so much, as users are just after a handy way to pull content out of a page. And having "got intimate" with the source while porting it, it's a really elegant piece of code - I have the impression that trying to make it do more would ruin it.
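To give a feel for how little a "handy" selector engine needs, here is a minimal sketch. It is not the soupselect or getElementsBySelector code; the node shape (`tag`, `attrs`, `children`, `text`) and the names `matchesToken`, `descendants` and `select` are assumptions made up for illustration. It only handles `tag`, `.class`, `#id`, `tag.class` and the descendant combinator (a space).

```javascript
// Plain-object tree standing in for a parsed page (hypothetical shape).
const page = {
  tag: "div", attrs: { id: "main" }, children: [
    { tag: "a", attrs: { class: "title" }, text: "First", children: [] },
    { tag: "p", attrs: {}, children: [
      { tag: "a", attrs: { class: "title hot" }, text: "Second", children: [] },
      { tag: "a", attrs: {}, text: "Skip", children: [] },
    ] },
  ],
};

// Does one node match one simple token such as "a", ".title", "#main" or "a.title"?
function matchesToken(node, token) {
  const m = /^([a-z0-9]*)(?:([.#])([\w-]+))?$/i.exec(token);
  if (!m) return false;
  const [, tag, kind, name] = m;
  if (tag && node.tag !== tag) return false;
  if (kind === "#") return node.attrs.id === name;
  if (kind === ".") return (node.attrs.class || "").split(/\s+/).includes(name);
  return tag !== "";
}

// All nodes below `node`, in document order.
function descendants(node) {
  return node.children.flatMap((child) => [child, ...descendants(child)]);
}

// Resolve a space-separated selector by narrowing the context one token at a time.
function select(root, selector) {
  let current = [{ children: [root] }]; // virtual document node so root itself can match
  for (const token of selector.trim().split(/\s+/)) {
    const found = new Set(); // Set avoids duplicates when contexts are nested
    for (const context of current) {
      for (const node of descendants(context)) {
        if (matchesToken(node, token)) found.add(node);
      }
    }
    current = [...found];
  }
  return current;
}

console.log(select(page, "#main a.title").map((node) => node.text));
```

Anything beyond this (attribute selectors, `>`, pseudo-classes) is where the code starts to grow, which is the trade-off being described.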

chrisohara · on Jan 23, 2011

Hi Simon, great libs - thanks! There have been many improvements added to node-soupselect and node.io though - the API is here: https://github.com/chriso/node.io/wiki/API---CSS-Selectors-a...

simonw · on Jan 23, 2011

Oh nice - the .rawtext and .striptags methods are particularly useful.

marcusramberg · on Jan 23, 2011

http://mojolicio.us is way better for this kind of stuff. Here's the synopsis example redone using Mojo:

    $ perl -Mojo -e'g("reddit.com")->dom("a.title")->each(sub { warn shift->text })'

chrisohara · on Jan 23, 2011

The one liner is cool, but I guarantee that node.js's non-blocking IO will outperform perl any day of the week. Try scraping thousands of pages at once using perl..
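For readers unfamiliar with the pattern, here is a rough sketch of what "scraping many pages at once" looks like with non-blocking IO. It uses present-day JavaScript (async/await) rather than the callback style of 2011, and it is not node.io's API: `scrapeAll`, `fetchPage`, `fakeFetch` and the default `limit` are hypothetical names for illustration. It shows the technique only and says nothing about how the two platforms compare in speed.

```javascript
// Run fetchPage over every URL with at most `limit` requests in flight.
// fetchPage is any function that returns a promise for the page body.
async function scrapeAll(urls, fetchPage, limit = 50) {
  const results = new Array(urls.length);
  let next = 0; // index of the next URL nobody has claimed yet

  // Each worker claims URLs one at a time until none are left.
  // The check and the increment run without an await in between,
  // so two workers never claim the same index.
  async function worker() {
    while (next < urls.length) {
      const i = next++;
      results[i] = await fetchPage(urls[i]);
    }
  }

  const workers = Array.from({ length: Math.min(limit, urls.length) }, worker);
  await Promise.all(workers);
  return results; // same order as `urls`
}

// Stand-in fetcher (assumption): a timer instead of a real HTTP request.
const fakeFetch = (url) =>
  new Promise((resolve) => setTimeout(() => resolve(`<html>${url}</html>`), 10));

scrapeAll(["a.example", "b.example", "c.example"], fakeFetch, 2)
  .then((pages) => console.log(pages.length, "pages fetched"));
```

The `limit` is there because firing thousands of requests with no cap tends to exhaust sockets or get you blocked by the target site.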

marcusramberg · on Jan 23, 2011

mojolicious is using a non-blocking async runloop as well =)

harryf · on Jan 23, 2011

The problem you'd have with anything that represents a page as some kind of graph is you have to construct the whole tree before you can start doing anything with it. The API largely precludes streams. Callbacks would be possible but some of the conditional CSS selectors need a complete knowledge of the page before they can be resolved.

So while GET-ting pages to scrape can benefit from async IO, you're effectively "blocked" while scraping pieces out of the page itself.
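A small sketch of why a selector like `li:last-child` resists streaming: when an `<li>` opens, or even when it closes, you cannot tell whether another sibling follows. The answer only arrives when the parent closes. The event format and the function name `lastChildLi` below are made up for illustration and do not correspond to any particular parser.

```javascript
// Hypothetical SAX-style events for: <ul><li>a</li><li>b</li></ul>
const events = [
  { type: "open", tag: "ul" },
  { type: "open", tag: "li" }, { type: "text", value: "a" }, { type: "close", tag: "li" },
  { type: "open", tag: "li" }, { type: "text", value: "b" }, { type: "close", tag: "li" },
  { type: "close", tag: "ul" },
];

// Collect the text of every <li> that is the last element child of its parent.
function lastChildLi(events) {
  const stack = [];   // elements that are currently open
  const matches = [];
  for (const ev of events) {
    if (ev.type === "open") {
      stack.push({ tag: ev.tag, text: "", lastChild: null });
    } else if (ev.type === "text") {
      if (stack.length > 0) stack[stack.length - 1].text += ev.value;
    } else if (ev.type === "close") {
      const node = stack.pop();
      // Only now, with this element closed, is its last child final.
      if (node.lastChild && node.lastChild.tag === "li") {
        matches.push(node.lastChild.text);
      }
      // This node is its parent's last child *so far*; a later sibling may replace it.
      const parent = stack[stack.length - 1];
      if (parent) parent.lastChild = node;
    }
  }
  return matches;
}

console.log(lastChildLi(events));
```

Each candidate has to be buffered until its parent closes, so for a selector rooted near the top of the document you end up holding most of the page anyway.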

thibaut_barrere · on Jan 23, 2011

Really interesting, thanks! This will probably be the first thing I will use for real projects in node.js.

Does anyone know how it compares to, say, Nokogiri or Hpricot, both in terms of speed and in terms of ability to handle crappy HTML?

chrisohara · on Jan 23, 2011

This is in response to all the node/jsdom/jquery scraping posts that are popular lately. JSDom is hopeless for scraping - try parsing some slightly malformed HTML..

DTrejo · on Jan 23, 2011

Hey Chris, I was just trying to share a few things I'd learned. I know I haven't done as much scraping as others (like yourself and richcollins). Glad I've helped get some discussion going :)

chrisohara · on Jan 23, 2011

Hey David, it wasn't a stab at your blog - your post was great - anything that builds some more interest in node.js is positive :) I just hate reading about people having trouble with JSDom and putting it down to the node platform. JSDom is an excellent parser, it just fails miserably when you feed it malformed HTML, and as we know, a majority of the internet falls into this category. I needed a framework that could scrape anything on the web so I built it myself.

tmpvar · on Jan 23, 2011

just to be clear, jsdom is not a parser. By default it uses node-htmlparser which is not very lenient.

Have you tried using Aria's html5 parser? I hear it works better with malformed markup.

tszming · on Jan 23, 2011

Currently I am using Selenium RC + jQuery for scraping. It is slow but it is the most reliable (due to using a real browser) solution you can get.

Anyway, will try your method soon.

tworats · on Jan 23, 2011

Could you provide more details on this, either in the comments here or in a blog post? I almost went with Selenium but ended up with a custom system to drive Firefox, and would love to learn more about what the Selenium-based solution looks like.

tszming · on Jan 24, 2011

Just a few lines of code, no magic needed.

Example: https://github.com/tszming/Selenium-Google-Scrapper

richcollins · on Jan 23, 2011

You should consider creating a binding to libsgml. It's written in C and it's permissive.

chrisohara · on Jan 23, 2011

Interesting, thanks. I've been looking for a decent C lib to bind with node and play around with