Upton: A Web Scraping Framework (propublica.org)
198 points by t1c1 on July 22, 2013 | 61 comments



Looks like a very nice integrated solution for data scientists and researchers. It's interesting to see the different shapes tools like this take, depending on their target users. I've been happy with node.js + cheerio[1] and its simple, jQuery-like API:

    request = require 'request'
    cheerio = require 'cheerio'
    # note: request's callback is (err, response, body)
    request 'http://website.com/list_of_stories.html', (err, response, body) ->
        $ = cheerio.load(body)
        callback ($('#comments li a.commenter-name').map (i, el) -> $(el).text()).get()
Plus, if you need to handle JavaScript/AJAX, you can swap that out for jsdom/chimera with minor changes.

[1] http://npmjs.org/package/cheerio


I've found that using cheerio with request and PhantomJS makes anything possible.

https://github.com/mikeal/request
https://github.com/sgentle/phantomjs-node



Cool library, Jeremy...another Ruby scraping framework you might want to check out and improve upon is Artsy's Spidey:

https://github.com/joeyAghion/spidey

It has a similar approach but also leaves storage (and caching) up to the end user.


Thanks, Dan. I'll check that out, especially for good ideas on how to solve things I haven't solved yet. Spidering and scraping seem to be very related, but not quite the same -- and I admittedly know nothing about spidering.


Instead of (or in addition to) manually creating scrapers, you can use Diffbot to automatically extract this type of information from news articles using computer vision: http://diffbot.com/products/automatic/article. It also allows you to create rules with a WYSIWYG editor: http://diffbot.com/products/custom/


I was starting a project using Scrapy: essentially, I was going to query a page every day (a search page accessed via a POST call), get the results (a list of links), and then fetch every single result page (downloading an XML file).

The project was paused, but I'm thinking about restarting it, and I was wondering whether something like Diffbot or import.io could be useful for me. Any experience doing this kind of thing?


Does Upton have any way of dealing with JavaScript or AJAX calls? For a lot of the scraping I do in Python, this is crucial. I use Selenium's WebDriver (along with Beautiful Soup or lxml) for that now - definitely open to other options.
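
A minimal sketch of that Selenium + Beautiful Soup combination, for anyone curious; the URL and CSS selectors are made up, and it assumes Firefox with geckodriver plus `pip install selenium beautifulsoup4`:

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Firefox()
    try:
        driver.get("http://example.com/ajax-heavy-page")
        # wait for the AJAX-loaded element to appear before reading the DOM
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "#comments"))
        )
        # page_source reflects the DOM after the browser has run the page's JavaScript
        soup = BeautifulSoup(driver.page_source, "html.parser")
        names = [a.get_text(strip=True) for a in soup.select("#comments a.commenter-name")]
        print(names)
    finally:
        driver.quit()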


Nope. Sorry. :)

While I'm sure this would be possible, I think a node.js scraper would probably be a better fit for that sort of project.


PhantomJS (http://phantomjs.org) might be a better fit, perhaps with Casper (http://casperjs.org) on top to handle some of the glue code. PhantomJS is a full headless browser (it'll give you screenshots of the pages it downloads if you want them); Casper is a library that makes sequencing tasks somewhat easier.

Node, by itself, doesn't have full versions of a lot of the objects that JavaScript on the pages would refer to (the DOM, the event model, etc.); PhantomJS gives you all of that.


Or do what we do at Hubdoc, and use both Node and Phantom: Node for performance where it's possible, and Phantom where the site has been built in such a way that scraping it in Node isn't worth the effort of figuring out all the weird stuff they've done in client-side JS.

We maintain a Node to Phantom bridge for this: https://github.com/baudehlo/node-phantom-simple


Curious: you use the webserver module in PhantomJS, is that right? And that's how you do the inter-process communication? I'm curious how you chose that over WebSockets, or over HTTP polling from your PhantomJS client against a local Node server.

What about using something like node-gir, or whatever appjs does to combine the event loops of node/v8 and chromium/v8?


I wrote a backend system using Phantom and Akka to generate graphs with D3, rasterize them into PNGs, and embed them in user-specific emails.

Phantom has some quirks but overall it's pretty solid.


There's PhantomJS for that, though I've actually had the most success with Perl's WWW::Mechanize::Firefox and a headless X server.


You don't even need PhantomJS, really. You can use Python + WebKitGTK+ through the GObject bindings. The problem with PhantomJS is that it uses an old version of WebKit from an old version of QtWebKit from Qt 4.8, whenever that was released. By comparison, WebKitGTK+ can be compiled from upstream WebKit whenever you please.

If you insist on controlling WebKit through JavaScript, you can use gnome-seedjs. But this is problematic/annoying because there's no CommonJS implementation yet: in PhantomJS you can require() Node modules in the outer context; not so much in gnome-seedjs.

Also, X means it's not actually headless, even if you're using xserver-xorg-video-dummy or Xvfb. For this reason, PhantomJS got rid of the X requirement a number of versions ago.
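
For the curious, a bare-bones sketch of the Python + WebKitGTK+ route through the GObject introspection bindings; this assumes the old WebKit1 API for GTK3 (the WebKit 3.0 typelib), an X server or Xvfb as noted above, and a placeholder URL:

    import gi
    gi.require_version('Gtk', '3.0')
    gi.require_version('WebKit', '3.0')   # the WebKit1 API for GTK3
    from gi.repository import Gtk, WebKit

    def on_load_finished(view, frame):
        if frame is not view.get_main_frame():
            return
        # old trick: copy the JS-rendered DOM into the title, then read it back out
        view.execute_script("document.title = document.documentElement.outerHTML;")
        print(frame.get_title()[:500])
        Gtk.main_quit()

    window = Gtk.OffscreenWindow()   # keeps rendering off-screen, but still needs X or Xvfb
    view = WebKit.WebView()
    window.add(view)
    window.show_all()
    view.connect('load-finished', on_load_finished)
    view.load_uri('http://example.com/')
    Gtk.main()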


Do you have any other resources about using webkitgtk with python for this sort of purpose?


No, I haven't found a definitive tutorial or reference (even for WebKit itself). In general, search for "import gi.repository.webkit" and you will find relevant things. I'm not sure how the other PhantomJS developers are learning the WebKit side of things... probably just by reading code.



Or you can use a full browser with an extension to do the scraping. That way you're always up to date with the latest browser release.


For the past couple of years, I've done all my web scraping with trusty Python + Beautiful Soup or ElementTree. I've recently started doing it with Clojure + Enlive (mostly as an excuse to use Clojure for less academic exercises), and I really like it.

From that perspective, Upton looks pretty cool, especially the debug mode.


Have you taken a look at Scrapy (http://scrapy.org/)? My evolution has been from Perl to Python and I recently did a project with Scrapy that left me pretty happy.


Scrapy is really awesome. It's a soup-to-nuts queuing, fetching, and extraction workflow tool, and if I had to start a larger-than-trivial project (like an RSS reader or a shopping aggregation site), I would base my spider toolchain on it.
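
To give a flavour of that, here's a minimal spider for a recent Scrapy release, scraping HN commenter names from front-page comment links; the CSS selectors (`td.subtext a`, `a.hnuser`) are guesses at HN's markup rather than anything canonical:

    import scrapy

    class HNCommentersSpider(scrapy.Spider):
        """Follow comment-page links from the HN front page and yield commenter names."""
        name = "hn_commenters"
        start_urls = ["https://news.ycombinator.com/"]

        def parse(self, response):
            # each story's "NN comments" link in the subtext points at its item page
            for href in response.css("td.subtext a[href^='item']::attr(href)").getall():
                yield response.follow(href, callback=self.parse_item)

        def parse_item(self, response):
            for name in response.css("a.hnuser::text").getall():
                yield {"commenter": name}

Run it with `scrapy runspider hn_commenters.py -o commenters.json` and Scrapy handles the scheduling, retries and output for you.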


Yeah, I was definitely impressed by it, and I've only just got started. It felt as if I was able to get rid of all the boilerplate and just focus on specifying the next page to crawl and the information to extract.


Thanks, I hope you like it.


I've been using http://diffbot.com/ for this sort of stuff, together with http://oembed.com/


I've been loving Diffbot for my startup. The only downside is that you very quickly outgrow the free tier under any production load. It works fantastically well, though.


I use YQL to do web scraping. It lets me do something like this:

`select * from data.html.cssselect where url="www.yahoo.com" and css="#news a"`

Could you elaborate on the benefits of using Upton instead of this?


I've always used this method:

http://railscasts.com/episodes/190-screen-scraping-with-noko...

I also use Mechanize when working with sites that require session cookies and all that.

I'm also wondering what the advantages of Upton are.


AFAICT, YQL can only handle scraping individual pages that way.

Upton can scrape a whole set of pages, provided you have an index page that lists the pages you're interested in. Suppose you're interested in HN commenters on front-page posts: you could specify the front-page URL and a selector for the links to comment pages, and Upton would automatically scrape those pages and return them to you.

Upton could even write the commenter names to a CSV for you with just a filename and a CSS selector/XPath expression.

It's not stuff you couldn't do with YQL or Python/BeautifulSoup. But it's stuff that I didn't want to have to write over and over each time I wrote a new scraper.
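
For comparison, the hand-rolled requests + BeautifulSoup version of that HN-commenters example might look roughly like this (selectors and filename are illustrative); it's exactly the fetch/loop/CSV boilerplate described above:

    import csv
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    INDEX_URL = "https://news.ycombinator.com/"

    # 1. scrape the index page for links to the pages we actually care about
    index = BeautifulSoup(requests.get(INDEX_URL).text, "html.parser")
    comment_urls = [urljoin(INDEX_URL, a["href"])
                    for a in index.select("td.subtext a[href^='item']")]

    # 2. scrape each instance page and write the commenter names to a CSV
    with open("commenters.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for url in comment_urls:
            page = BeautifulSoup(requests.get(url).text, "html.parser")
            for a in page.select("a.hnuser"):
                writer.writerow([url, a.get_text(strip=True)])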


Makes sense! Thanks for clarifying that.


This looks really good. We're mainly Python-focused and have been working on a tool to try to 'train' a crawler to extract specific elements of a page. As this is a crawling thread, I hope you don't mind me asking for some advice on where to take it from here :)

Here's how it currently works:

1) It has a queue of domains that I have pre-processed. For now I've restricted it to pages that I think are e-commerce, based on $ signs, add-to-cart/basket-type links, etc.

2) There is a visual tool that I then use to select certain parts of the page, e.g. price, product, image, etc. I save these out as XPaths.

3) Once I have done one URL, I send a crawler to that domain, extract other pages that fit the profile of an e-commerce page, and try to apply the same mapping from step 2 to extract the data.

I have done a small video to show it in action:

http://www.screencast.com/t/riB3iiVMiSk

I'm not sure if I'm doing this the right way. If a site/page changes structure, then I may have to re-map the data. I was hoping that someone would have some pointers on other ways to do this. I've also had some problems with JavaScript-heavy sites.

If anyone has any knowledge of screen scraping, where it can be done more automatically, I'd really appreciate a steer!
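
For what it's worth, step 3 above (re-applying a saved XPath mapping to other pages on the same domain) boils down to something like this with lxml; the XPaths and URL are invented for illustration:

    import requests
    from lxml import html

    # a mapping like the one produced by the visual tool in step 2 (these XPaths are made up)
    FIELD_XPATHS = {
        "product": "//h1[@class='product-title']/text()",
        "price": "//span[@class='price']/text()",
        "image": "//img[@id='main-image']/@src",
    }

    def extract(url):
        doc = html.fromstring(requests.get(url).content)
        record = {}
        for field, xpath in FIELD_XPATHS.items():
            matches = doc.xpath(xpath)
            record[field] = matches[0].strip() if matches else None
        return record

    print(extract("http://shop.example.com/products/123"))

The brittleness you mention shows up exactly here: when the markup shifts, the XPaths silently return nothing, so it's worth logging or alerting whenever a field comes back as None.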


I've done almost exactly this in the past. There's a hell of a lot of fiddling in keeping the xpaths both stable and general enough to be useful.

One approach I found absolutely vital was to have a rewriting, caching proxy between the crawler and the upstream site. This proxy allowed me to rewrite the page content into something much simpler for the crawler to get to grips with (RSS or Atom, say). I used Celerity (http://celerity.rubyforge.org/) with a hacked-on Mechanize API to do the rewriting, which let me handle JS-heavy pages almost as easily as static HTML ones. My original inspiration for this was _why's Mousehole (the source for which is here: https://github.com/evaryont/mousehole, I've got no idea if it runs on recent Rubies).

The proxy also gives you somewhere to raise an alert if, all of a sudden, your scraping fails because of an upstream change.
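
Not a transparent proxy, but a small Flask service sketches the same idea in Python: the crawler asks this service for a URL, the service fetches it (with a naive on-disk cache), rewrites the page down to the handful of fields the crawler cares about (JSON here rather than the RSS/Atom described above, to keep it short), and logs a warning when extraction suddenly comes back empty. The endpoint name, cache directory and selectors are all invented:

    import hashlib
    import pathlib

    import requests
    from flask import Flask, jsonify, request
    from lxml import html

    app = Flask(__name__)
    CACHE = pathlib.Path("cache")
    CACHE.mkdir(exist_ok=True)

    def fetch(url):
        # naive on-disk cache keyed by the URL's hash
        key = CACHE / hashlib.sha1(url.encode()).hexdigest()
        if key.exists():
            return key.read_text()
        body = requests.get(url).text
        key.write_text(body)
        return body

    @app.route("/simplified")
    def simplified():
        url = request.args.get("url")
        doc = html.fromstring(fetch(url))
        items = [t.strip() for t in doc.xpath("//h1/text() | //h2/text()") if t.strip()]
        if not items:
            # the "raise an alert" hook: extraction broke, probably an upstream change
            app.logger.warning("extraction came back empty for %s", url)
        return jsonify({"url": url, "items": items})

    if __name__ == "__main__":
        app.run(port=5000)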

One tool I always intended to make some use of, but never got round to, was Ariel: http://ariel.rubyforge.org/. It looks like it ought to be able to totally remove the need to manually extract xpaths.


Thanks for this. I'll check these out


Check out Scrapy http://scrapy.org/


Recently I've been using CasperJS for my scraping needs. http://casperjs.org/


Web::Scraper is a really good one for perl. (http://search.cpan.org/~miyagawa/Web-Scraper-0.37/lib/Web/Sc...).


Typically with Perl there is more than one module for this :)

- pQuery | https://metacpan.org/module/pQuery

- Mojo::UserAgent | https://metacpan.org/module/Mojo%3a%3aUserAgent

- Scrappy | https://metacpan.org/release/Scrappy

- Web::Query | https://metacpan.org/module/Web%3a%3aQuery

- Web::Magic | https://metacpan.org/module/Web%3a%3aMagic

The above are specifically for scraping, but one shouldn't forget WWW::Mechanize & LWP.

My preference over the last few years has been pQuery. However, Web::Query is Tokuhiro's improvement on pQuery, and Mojo::UserAgent looks very nifty.


What separates this from Nokogiri? (Don't take that as criticism; more working code out in the world is better. I'm just wondering, as I use Nokogiri heavily for our company chat-bot, and couldn't tell the answer at a quick glance.)


Reposting a comment by the author from the article:

> Upton depends on Nokogiri, which is basically the BeautifulSoup port for Ruby.

> If you just used vanilla Nokogiri, you'd be responsible for writing code to fetch, save (maybe), debug and sew together all the pieces of your web scraper. Upton does a lot of that work for you, so you can skip the boilerplate.


Recently I've been using Fake Browser (fakeapp.com) for web scraping. While it's inefficient for large jobs, it's awesome for hacking together quick scripts. With Fake you can write your scraper in JavaScript, and it runs in an actual browser, so it's a very visual process. It's great for instances where you want to get past complicated authentication systems without writing code: just sign in manually and start your script.


Sounds pretty cool. Also, for those operating in "java land" there are Commons HttpClient[1] and Apache Tika[2] which, together, are a pretty potent combination for scraping web data.

[1]: http://hc.apache.org/httpcomponents-client-ga/

[2]: http://tika.apache.org/


What about import.io? Does anyone have experience using it?

Link: http://import.io/


Is this somewhat similar to embedly.com? That's what I currently use, and it seems to work fine, but I'm curious whether it's the best thing to be using. (I'm typically just grabbing the image thumbnail, but it'd be nice to grab some text if it exists too.)


I've always used Anemone with great results: http://anemone.rubyforge.org/

It seems like it's not being actively developed, but then again, I've never had a problem.

Edit: I realize this is focused on single-page scraping with data extraction. You could actually use the two together nicely.


I had never seen this before. I will check it out, at least for inspiration.

I'd say "scraping" is a little more focused on extracting data from specific pages as opposed to ALL pages as in "spidering", but the two are certainly cousins if not siblings. Anemone would probably be good at the same sorts of tasks Upton is designed for (i.e. scraping data contained on multiple pages).


Nice lib. You might take some ideas from pismo (https://github.com/peterc/pismo); it's more metadata-oriented, but it returns a Nokogiri doc as well.


Thanks, will check it out.


PHP's Simple HTML DOM parser makes scraping just like working with jQuery on the backend: http://simplehtmldom.sourceforge.net/


I used this for many years, but the memory footprint is terrible.

I would recommend QueryPath instead: it has a very small footprint and takes a fraction of the CPU time.


For anyone interested in a Node.js crawler, my company recently released roach: https://github.com/PetroFeed/roach


What I need is something that can scrape .NET sites with lots of weird and signed AJAX stuff just to populate a select control.

I'm using node.js with several libraries and nothing has worked so far.


You might want to look at one of the headless web browsers, like PhantomJS or CasperJS:

http://phantomjs.org/
http://casperjs.org/


Already tried that, and already failed.

Some JS in the page makes webkit die.

I think a headless Firefox is my only hope.


Is web scraping legal?


If anything, it might be against the TOS, but AFAIK breaking a TOS is not illegal.


Who was the guy who got jail time for breaking a TOS? I think it was iPhone-related.



If there's no signup/login process, is the TOS even enforceable?


It can be. I've just launched https://myshopdata.com, where online retailers can scrape their own content for synchronising with third-party marketplaces.


Your app looks really good.

What tech did you use if you don't mind answering?


Thanks!

My Shop Data is all PHP & MySQL, with Slim, Twig & Bootstrap. The web scraping aspect is another product of mine (forgive the clunky homepage, I'm going to turn this into an API platform) - https://grabnotify.com

GrabNotify is Node.js, Mongo, PHP, Bootstrap and PhantomJS. The (undocumented) API allows you to create a web crawler and define a JavaScript algorithm to extract the data from the page. Some retailers have dropdowns which update stock, images, etc., so the crawler can simulate mouse events and so on. My Shop Data will supply a custom crawler algorithm for each e-commerce web site through the API.

And finally, I've written an HTML-to-Markdown translator to extract page descriptions while keeping some formatting, so they're transferable to other systems that don't support HTML.

The whole legality issue of web scraping is an interesting one. I'm planning to position GrabNotify as a web crawler, page monitor and HTML-to-data tool for people who own, or have permission to scrape, the original content but need a simple way to grab and monitor the HTML as data. I'm not really interested in building a business out of scraping other people's content without their permission.



