Pro scraping with Node.JS (github.com/chriso)
60 points by chrisohara on Jan 23, 2011 | 18 comments



I was intrigued to see what CSS selector engine it was using...

https://github.com/chriso/node.io uses https://github.com/harryf/node-soupselect

https://github.com/harryf/node-soupselect is a port of my https://github.com/simonw/soupselect library for Python

https://github.com/simonw/soupselect is a port of my getElementsBySelector function for JavaScript: http://simonwillison.net/2003/Mar/25/getElementsBySelector/

I'm always surprised to see that code still being used - it's the least complete selector library out there by a long way.


> I'm always surprised to see that code still being used - it's the least complete selector library out there by a long way.

For this use case, being complete doesn't matter so much, as users are just after a handy way to pull content out of a page. And having "got intimate" with the source while porting it, I find it a really elegant piece of code. I have the impression that trying to make it do more would ruin it.


Hi Simon, great libs, thanks! Many improvements have been added to node-soupselect and node.io, though. The API is here: https://github.com/chriso/node.io/wiki/API---CSS-Selectors-a...


Oh nice - the .rawtext and .striptags methods are particularly useful.


http://mojolicio.us is way better for this kind of stuff. Here's the synopsis example redone using Mojo:

    $ perl -Mojo -e'g("reddit.com")->dom("a.title")->each(sub { warn shift->text })'


The one-liner is cool, but I guarantee that node.js's non-blocking IO will outperform Perl any day of the week. Try scraping thousands of pages at once using Perl.


Mojolicious uses a non-blocking async run loop as well =)


The problem you'd have with anything that represents a page as some kind of graph is that you have to construct the whole tree before you can start doing anything with it. The API largely precludes streams. Callbacks would be possible, but some of the conditional CSS selectors need complete knowledge of the page before they can be resolved.

So while GET-ting pages to scrape can benefit from async IO, you're effectively "blocked" while scraping pieces out of the page itself.
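
To illustrate the point about conditional selectors, here is a minimal JavaScript sketch. It is not code from node.io or node-soupselect: the event list and the `lastChildMatches` helper are hypothetical, and it assumes a SAX-style stream of open/text/close events. It shows that a selector like `li:last-child` cannot be confirmed when the `li` itself closes, only once the parent element closes.

```javascript
// Hypothetical SAX-style event stream for: <ul><li>a</li><li>b</li></ul>
const events = [
  { type: "open", tag: "ul" },
  { type: "open", tag: "li" },
  { type: "text", value: "a" },
  { type: "close", tag: "li" },
  { type: "open", tag: "li" },
  { type: "text", value: "b" },
  { type: "close", tag: "li" },
  { type: "close", tag: "ul" },
];

// Resolve `<tag>:last-child` over the stream. Each stack frame remembers the
// most recently closed child element, because only the parent's close event
// proves that this child really was the last one.
function lastChildMatches(events, tag) {
  const stack = [{ tag: "#root", text: "", lastClosedChild: null }];
  const matches = [];
  events.forEach((event, index) => {
    const top = stack[stack.length - 1];
    if (event.type === "open") {
      stack.push({ tag: event.tag, text: "", lastClosedChild: null });
    } else if (event.type === "text") {
      top.text += event.value;
    } else if (event.type === "close") {
      const node = stack.pop();
      // The closing parent finally settles which of its children was last.
      const last = node.lastClosedChild;
      if (last && last.tag === tag) {
        matches.push({
          text: last.text,
          closedAt: last.closedAt, // event index where the child ended
          confirmedAt: index,      // event index where the match became certain
        });
      }
      // Record this element as its own parent's latest closed child.
      stack[stack.length - 1].lastClosedChild = {
        tag: node.tag,
        text: node.text,
        closedAt: index,
      };
    }
  });
  return matches;
}

const matches = lastChildMatches(events, "li");
console.log(matches);
```

In this sketch the second `li` closes at event 6 but is only confirmed as a match at event 7, when the `ul` closes. On a real page the gap between those two points can be arbitrarily large, which is why tree-based selector engines generally wait for the whole document.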


Really interesting, thanks! This will probably be the first thing I use for real projects in node.js.

Does anyone know how it compares to, say, Nokogiri or Hpricot, both in terms of speed and in terms of ability to handle crappy HTML?


This is in response to all the node/jsdom/jquery scraping posts that have been popular lately. JSDom is hopeless for scraping - try parsing some slightly malformed HTML.


Hey Chris, I was just trying to share a few things I'd learned. I know I haven't done as much scraping as others (like yourself and richcollins). Glad I've helped get some discussion going :)


Hey David, it wasn't a stab at your blog. Your post was great, and anything that builds more interest in node.js is positive :) I just hate reading about people having trouble with JSDom and putting it down to the node platform. JSDom is an excellent parser; it just fails miserably when you feed it malformed HTML, and as we know, a majority of the internet falls into this category. I needed a framework that could scrape anything on the web, so I built it myself.


Just to be clear, jsdom is not a parser. By default it uses node-htmlparser, which is not very lenient.

Have you tried using Aria's html5 parser? I hear it works better with malformed markup.


Currently I am using Selenium RC + jQuery for scraping. It is slow, but because it uses a real browser it is the most reliable solution you can get.

Anyway, will try your method soon.


Could you provide more details on this, either in the comments here or in a blog post? I almost went with Selenium but ended up with a custom system to drive Firefox. I would love to learn more about what the Selenium-based solution looks like.


Just a few lines of code, no magic needed.

Example: https://github.com/tszming/Selenium-Google-Scrapper


You should consider creating a binding to libsgml. It's written in C and it's permissive.


Interesting, thanks. I've been looking for a decent C lib to bind with node and play around with.



