Scraping made easy with jQuery and SelectorGadget (and Node.js!) (dtrejo.com)
74 points by DTrejo on Jan 22, 2011 | 16 comments



It's neat using jQuery for this, but I've found the arduous part of scraping isn't the actual parsing and extraction of data from your target page, but rather the post-processing: working around incomplete data on the page, handling errors gracefully, keeping on top of layout/URL/data changes to the target site, not hitting your target site too often, logging into the target site if necessary, respecting robots.txt, keeping users informed of scraping, sane parallelisation of requests, and the general problems associated with long-running background processes.

All tractable problems with standard solutions, but it's hard to accept the claim that using jQuery (which is still pretty neat, IMO) now makes scraping easy.
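
For illustration, here's a minimal sketch of just the "not hitting your target site too often" and "handling errors gracefully" parts in Node.js. The function names, delays, and retry counts are arbitrary choices of mine, and it assumes a recent Node with a global fetch:

  // Minimal "polite" fetching: serial requests, a pause between them, and a
  // few retries with backoff. Delay and retry counts are arbitrary.
  const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

  async function politeFetch(url, { retries = 3, delayMs = 2000 } = {}) {
    for (let attempt = 1; attempt <= retries; attempt++) {
      try {
        const res = await fetch(url, { headers: { 'User-Agent': 'my-scraper/0.1' } });
        if (!res.ok) throw new Error(`HTTP ${res.status}`);
        return await res.text();
      } catch (err) {
        console.error(`attempt ${attempt} failed for ${url}: ${err.message}`);
        if (attempt === retries) throw err;
        await sleep(delayMs * attempt); // back off a little more each attempt
      }
    }
  }

  // Fetch pages one at a time with a pause in between.
  async function crawl(urls) {
    const pages = [];
    for (const url of urls) {
      pages.push(await politeFetch(url));
      await sleep(2000);
    }
    return pages;
  }

Everything else on the list (logins, robots.txt, layout changes) still needs its own handling on top of this.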


  perl -MLWP::UserAgent -e 'map { $_ =~ s/<a href="([^"]+)">([^<]+)<\/a><span class="comhead">([^<]+)<.+?<span id=[^>]*>(\d+ points)/print "$1 $2 $3 $4\n" if($i++ < 3)/ge } LWP::UserAgent->new->get("http://news.ycombinator.com/")->content;'


It would be much more difficult to write more complex scrapers using just regexes, which is why selector-based methods like the one in the article scale well with complexity.

For example, if you wanted to scrape comments on HN and get a tree-like data structure, regexes would be much more difficult to write and maintain!
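
As a rough sketch of what that looks like with selectors, here's one way to build the tree in Node using cheerio (a jQuery-style selector library, standing in here for the article's jsdom + jQuery setup). The selectors and the indent-width convention are assumptions about HN's comment markup, not something I've verified:

  // Sketch: rebuild the HN comment tree from flat rows using indentation depth.
  // Assumes each comment row has an indent spacer image whose width encodes
  // nesting (width / 40 = depth) and a span.comment holding the text.
  const cheerio = require('cheerio');

  function commentTree(html) {
    const $ = cheerio.load(html);
    const root = { children: [] };
    const stack = [{ depth: -1, node: root }];

    $('tr.comtr').each((i, row) => {
      const width = parseInt($(row).find('td.ind img').attr('width'), 10) || 0;
      const depth = width / 40;
      const node = { text: $(row).find('span.comment').text().trim(), children: [] };

      // Pop back up to this comment's parent, then attach.
      while (stack[stack.length - 1].depth >= depth) stack.pop();
      stack[stack.length - 1].node.children.push(node);
      stack.push({ depth, node });
    });

    return root.children;
  }

The regex equivalent would have to track nesting state by hand, which is exactly the maintenance problem described above.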


It scales. I ran WorkZoo.com (Time Mag top 50 website of 2005 - sold it the same year) and we scraped over 500 job boards and aggregated the jobs into a search engine. A team of devs developed and maintained the regexes for each board; I managed them and wrote the dev tools they used to build those regexes. It was incredibly effective and maintainable.

Incidentally, the dev tools I wrote were in JavaScript, and they produced regexes that we'd test in JavaScript and deploy in Perl. The two regex flavors are close enough that this worked.

Scraper abstraction is for people too lazy to learn regex. Get a good book on regex, and learn how to use Perl's s/// regex with the 'e' modifier. It'll change your life.
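
For anyone who'd rather stay in the article's Node.js world: String.replace with a function callback plays roughly the same role as Perl's s///e, in that the replacement is computed by code and can carry side effects. A tiny sketch (the markup and names are just illustrative):

  // Rough JS analogue of s///e: the replacement is computed by a function,
  // used here purely for its side effect of collecting matches.
  const html = '<a href="http://example.com/">Example</a>';
  const links = [];
  html.replace(/<a href="([^"]+)">([^<]+)<\/a>/g, (match, href, text) => {
    links.push({ href, text }); // side effect, like the print inside s///e
    return match;               // leave the string unchanged
  });
  console.log(links); // [ { href: 'http://example.com/', text: 'Example' } ]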


... and later you begin to discover different, much better ways to solve the problems that you used to handle with regexps, and your life will get back to normality and happiness. ;-)

For a solid scraper in Perl I'd use HTML::TreeBuilder / HTML::Element. Perhaps slower than regexps, but it does real parsing and understands tag-soup HTML.


These modules are very resource intensive, and they become a bottleneck if you are scraping at high volume. I had to stop using them in one of my tools for that reason. However, for smaller jobs they are awesome and much easier to use and understand.


I do a lot of scraping using Perl. Web::Scraper is an awesome tool: http://search.cpan.org/~miyagawa/Web-Scraper-0.32/lib/Web/Sc...


I also like this Python scraping library: http://arshaw.com/scrapemark/



Python's lxml can do CSS selectors. I've used lxml for scraping and find it quite nice.


I've tried PHP scraping many different ways and settled on http://simplehtmldom.sourceforge.net . It uses jQuery-style selectors as well.


I've been doing a lot of scraping in Node. I've had much better luck using YUI + jsdom than jQuery + jsdom. Many pages would fail with jQuery, and it also leaked memory like crazy.


I suggest taking a look at node.io (https://github.com/chriso/node.io), a scraping framework written for Node.js. It uses htmlparser rather than jsdom and scales nicely. It also has support for handling timeouts, retries, etc.


Hmm why would this be?


People using Clojure can get selector-based scraping with Enlive instead of jQuery [1]. It also ends up doubling as a templating library. I use it for templating on my website, and I use it to scrape Hacker News, though the project that scraping is for isn't ready for launch [2].

1: https://github.com/cgrand/enlive

2: https://github.com/jColeChanged/mysite


And here's a truly awesome tutorial for Enlive (swannodette is a regular here as well):

http://github.com/swannodette/enlive-tutorial




