Scraping made easy with jQuery and SelectorGadget (and Node.js!) (dtrejo.com)
74 points by DTrejo on Jan 22, 2011 | 16 comments



It's neat using jQuery for this, but I've found the arduous part of scraping isn't the actual parsing and extraction of data from your target page, but rather the post-processing: working around incomplete data on the page, handling errors gracefully, keeping on top of layout/URL/data changes to the target site, not hitting your target site too often, logging into the target site if necessary, respecting robots.txt, keeping users informed of scraping, sane parallelisation of requests, and the general problems associated with long-running background processes.

All tractable problems with standard solutions, but it's hard to accept the claim that using jQuery (which is still pretty neat, IMO) now makes scraping easy.
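
For illustration, here's a minimal sketch of just the "not hitting your target site too often" and "handling errors gracefully" parts in Node.js. The function names, delays, and retry counts are arbitrary choices of mine, and it assumes a recent Node with a global fetch:

  // Minimal "polite" fetching: serial requests, a pause between them, and a
  // few retries with backoff. Delay and retry counts are arbitrary.
  const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

  async function politeFetch(url, { retries = 3, delayMs = 2000 } = {}) {
    for (let attempt = 1; attempt <= retries; attempt++) {
      try {
        const res = await fetch(url, { headers: { 'User-Agent': 'my-scraper/0.1' } });
        if (!res.ok) throw new Error(`HTTP ${res.status}`);
        return await res.text();
      } catch (err) {
        console.error(`attempt ${attempt} failed for ${url}: ${err.message}`);
        if (attempt === retries) throw err;
        await sleep(delayMs * attempt); // back off a little more each attempt
      }
    }
  }

  // Fetch pages one at a time with a pause in between.
  async function crawl(urls) {
    const pages = [];
    for (const url of urls) {
      pages.push(await politeFetch(url));
      await sleep(2000);
    }
    return pages;
  }

Everything else on the list (logins, robots.txt, layout changes) still needs its own handling on top of this.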


  perl -MLWP::UserAgent -e 'map { $_ =~ s/<a href="([^"]+)">([^<]+)<\/a><span class="comhead">([^<]+)<.+?<span id=[^>]*>(\d+ points)/print "$1 $2 $3 $4\n" if($i++ < 3)/ge } LWP::UserAgent->new->get("http://news.ycombinator.com/")->content;'


It would be much more difficult to write more complex scrapers using just regexes, which is why selector-based methods like the one in the article scale well with complexity.

For example, if you wanted to scrape comments on HN and get a tree-like data structure, regexes would be much more difficult to write and maintain!
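
As a rough sketch of what that looks like with selectors, here's one way to build the tree in Node using cheerio (a jQuery-style selector library, standing in here for the article's jsdom + jQuery setup). The selectors and the indent-width convention are assumptions about HN's comment markup, not something I've verified:

  // Sketch: rebuild the HN comment tree from flat rows using indentation depth.
  // Assumes each comment row has an indent spacer image whose width encodes
  // nesting (width / 40 = depth) and a span.comment holding the text.
  const cheerio = require('cheerio');

  function commentTree(html) {
    const $ = cheerio.load(html);
    const root = { children: [] };
    const stack = [{ depth: -1, node: root }];

    $('tr.comtr').each((i, row) => {
      const width = parseInt($(row).find('td.ind img').attr('width'), 10) || 0;
      const depth = width / 40;
      const node = { text: $(row).find('span.comment').text().trim(), children: [] };

      // Pop back up to this comment's parent, then attach.
      while (stack[stack.length - 1].depth >= depth) stack.pop();
      stack[stack.length - 1].node.children.push(node);
      stack.push({ depth, node });
    });

    return root.children;
  }

The regex equivalent would have to track nesting state by hand, which is exactly the maintenance problem described above.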


It scales. I ran WorkZoo.com (Time Mag top 50 website of 2005 - sold it the same year) and we scraped over 500 job boards and aggregated the jobs into a search engine. A team of devs developed and maintained the regexes for each board; I managed them and wrote the dev tools they used to build those regexes. It was incredibly effective and maintainable.

Incidentally, the dev tools I wrote were in JavaScript, and they produced regexes that we'd test in JavaScript and deploy in Perl. The two regex flavors are close enough that this worked.

Scraper abstraction is for people too lazy to learn regex. Get a good book on regex, and learn how to use Perl's s/// regex with the 'e' modifier. It'll change your life.
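
For anyone who'd rather stay in the article's Node.js world: String.replace with a function callback plays roughly the same role as Perl's s///e, in that the replacement is computed by code and can carry side effects. A tiny sketch (the markup and names are just illustrative):

  // Rough JS analogue of s///e: the replacement is computed by a function,
  // used here purely for its side effect of collecting matches.
  const html = '<a href="http://example.com/">Example</a>';
  const links = [];
  html.replace(/<a href="([^"]+)">([^<]+)<\/a>/g, (match, href, text) => {
    links.push({ href, text }); // side effect, like the print inside s///e
    return match;               // leave the string unchanged
  });
  console.log(links); // [ { href: 'http://example.com/', text: 'Example' } ]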


... and later you begin to discover different, much better ways to solve the problems that you used to handle with regexps, and your life will get back to normality and happiness. ;-)

For a solid scraper in Perl I'd use HTML::TreeBuilder / HTML::Element. Perhaps slower than regexps, but it does real parsing and understands tag-soup HTML.


These modules are very resource intensive, and they become a bottleneck if you are scraping at high volume. I had to stop using them in one of my tools for that reason. However, for smaller jobs they are awesome and much easier to use and understand.


I do a lot of scraping using Perl. Web::Scraper is an awesome tool: http://search.cpan.org/~miyagawa/Web-Scraper-0.32/lib/Web/Sc...


I also like this Python scraping library: http://arshaw.com/scrapemark/



Python's lxml can do CSS selectors. I've used lxml for scraping and find it quite nice.


I've tried PHP scraping many different ways and settled on http://simplehtmldom.sourceforge.net . It uses jQuery-style selectors as well.


I've been doing a lot of scraping in Node. I've had much better luck using YUI + jsdom than jQuery + jsdom. Many pages would fail with jQuery, and it also leaked memory like crazy.


I suggest taking a look at node.io (https://github.com/chriso/node.io), a scraping framework written for Node.js. It uses htmlparser rather than jsdom and scales nicely. It also has support for handling timeouts, retries, etc.


Hmm why would this be?


People using Clojure can get selector-based scraping with Enlive instead of jQuery [1]. It also ends up doubling as a templating library. I use it for templating on my website, and I use it to scrape Hacker News, though the project that scraping is for isn't ready for launch [2].

1: https://github.com/cgrand/enlive

2: https://github.com/jColeChanged/mysite


And here's a truly awesome tutorial for Enlive (swannodette is a regular here as well):

http://github.com/swannodette/enlive-tutorial




