> I'm always surprised to see that code still being used - it's the least complete selector library out there by a long way.
For this use case, being complete doesn't matter so much, as users are just after a handy way to pull content out of a page. And having "got intimate" with the source while porting it, it's a really elegant piece of code - I have the impression that trying to make it do more would ruin it.
The one-liner is cool, but I guarantee that node.js's non-blocking IO will outperform Perl any day of the week. Try scraping thousands of pages at once using Perl..
The problem you'd have with anything that represents a page as some kind of graph is you have to construct the whole tree before you can start doing anything with it. The API largely precludes streams. Callbacks would be possible but some of the conditional CSS selectors need a complete knowledge of the page before they can be resolved.
So while GET-ting pages to scrape can benefit from async IO, you're effectively "blocked" while scraping pieces out of the page itself.
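To make the "complete knowledge of the page" point concrete, here's a rough sketch (not how soupselect or node.io actually work). It assumes a made-up SAX-style event stream and a hypothetical `lastChildren` helper, and shows that a selector like `li:last-child` can't be answered when an `<li>` arrives, only once its parent has closed:

```javascript
// Hypothetical event stream, roughly what a SAX-style parser might emit
// for <ul><li>one</li><li>two</li></ul>. The event shape is an assumption.
const events = [
  { type: 'open', tag: 'ul' },
  { type: 'open', tag: 'li', text: 'one' },
  { type: 'close', tag: 'li' },
  { type: 'open', tag: 'li', text: 'two' },
  { type: 'close', tag: 'li' },
  { type: 'close', tag: 'ul' },
];

// Hypothetical helper: returns the text of every element matching
// "<tag>:last-child". A match can only be reported on the parent's
// close event, because until then another sibling might still arrive.
function lastChildren(events, tag) {
  const results = [];
  const stack = []; // currently open elements, innermost last
  for (const ev of events) {
    if (ev.type === 'open') {
      const node = { tag: ev.tag, text: ev.text || '', lastChild: null };
      if (stack.length > 0) {
        // Provisional only: a later sibling may overwrite this.
        stack[stack.length - 1].lastChild = node;
      }
      stack.push(node);
    } else {
      const node = stack.pop();
      // The element is closed, so its last child is now final.
      if (node.lastChild && node.lastChild.tag === tag) {
        results.push(node.lastChild.text);
      }
    }
  }
  return results;
}

console.log(lastChildren(events, 'li')); // [ 'two' ]
```

When the first `<li>` is seen it looks like the last child, and that only gets corrected once the second one shows up. So a callback-based API would have to buffer until the enclosing element closes, which for selectors anchored near the root means most of the page.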
This is in response to all the node/jsdom/jquery scraping posts that are popular lately. JSDom is hopeless for scraping - try parsing some slightly malformed HTML..
Hey Chris, I was just trying to share a few things I'd learned. I know I haven't done as much scraping as others (like yourself and richcollins). Glad I've helped get some discussion going :)
Hey David, it wasn't a stab at your blog - your post was great - anything that builds some more interest in node.js is positive :) I just hate reading about people having trouble with JSDom and putting it down to the node platform. JSDom is an excellent parser, it just fails miserably when you feed it malformed HTML, and as we know, a majority of the internet falls into this category. I needed a framework that could scrape anything on the web, so I built it myself.
Could you provide more details on this, either in the comments here or in a blog post? I almost went with Selenium but ended up with a custom system to drive Firefox, and would love to learn more about what the Selenium-based solution looks like.
https://github.com/chriso/node.io uses https://github.com/harryf/node-soupselect
https://github.com/harryf/node-soupselect is a port of my https://github.com/simonw/soupselect library for Python
https://github.com/simonw/soupselect is a port of my getElementsBySelector function for JavaScript: http://simonwillison.net/2003/Mar/25/getElementsBySelector/
I'm always surprised to see that code still being used - it's the least complete selector library out there by a long way.