Looks like a very nice integrated solution for data scientists and researchers. It's interesting to see the different shapes tools like this take, depending on their target users. I've been happy with node.js + cheerio[1] and its simple jQuery-like API:
Thanks, Dan. I'll check that out, especially for good ideas on how to solve things I haven't solved yet. Spidering and scraping seem to be very related, but not quite the same -- and I admittedly know nothing about spidering.
I was starting a project using Scrapy: I was essentially going to query a search page (accessed via a POST call) every day, get the results (a list of links), and then fetch every single result page (each one an XML file to download).
The project was paused but I'm thinking about restarting it, and I was wondering whether something like diffbot or import.io could be useful for me. Any experience doing this kind of thing?
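For context, here's roughly the shape of what I had in mind, sketched with Scrapy; the URL, form fields and selectors are placeholders rather than the real site:

    import scrapy

    class DailyResultsSpider(scrapy.Spider):
        name = "daily_results"

        def start_requests(self):
            # The search page sits behind a POST call; the form fields are placeholders.
            yield scrapy.FormRequest(
                "http://example.com/search",
                formdata={"query": "something", "date": "today"},
                callback=self.parse_results,
            )

        def parse_results(self, response):
            # The results page is a list of links to individual result pages.
            for href in response.xpath("//div[@class='results']//a/@href").extract():
                yield scrapy.Request(response.urljoin(href), callback=self.parse_result)

        def parse_result(self, response):
            # Each result page is an XML file; just hand the body on for storage.
            yield {"url": response.url, "xml": response.body}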
Does Upton have any way of dealing with JavaScript or ajax calls? For a lot of the scraping I do in Python, this is crucial. I use Selenium's WebDriver (along with Beautiful Soup or lxml) for that now; definitely open to other options.
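For reference, my current pattern is roughly this (the URL and selectors are just placeholders):

    import time
    from selenium import webdriver
    from bs4 import BeautifulSoup

    # Let a real browser run the page's JavaScript, then hand the rendered
    # HTML over to Beautiful Soup for the actual extraction.
    driver = webdriver.Firefox()
    driver.get("http://example.com/js-heavy-page")   # placeholder URL
    time.sleep(5)  # crude wait for the ajax content; an explicit WebDriverWait is nicer

    soup = BeautifulSoup(driver.page_source)
    for row in soup.select("table.results tr"):      # placeholder selector
        print([cell.get_text(strip=True) for cell in row.select("td")])

    driver.quit()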
Perhaps better: phantomjs (http://phantomjs.org), possibly with casper (http://casperjs.org) on top to handle some of the glue code. Phantomjs is a full headless browser (it'll give you screenshots of the pages it downloads if you want them); casper is a library that makes sequencing tasks somewhat easier.
Node, by itself, doesn't have full versions of a lot of the objects that Javascript on the pages would refer to (DOM, event model, etc.); phantomjs gives you all of that.
Or do what we do at Hubdoc, and use both Node and Phantom. Node for performance where it's possible, and Phantom where the site has been built in such a way that scraping in Node becomes not worth the effort of figuring out all the weird stuff they've done in client side JS.
Curious, you use the webserver module in phantomjs, is that right? And that's how you do the inter-process communication? I'm curious how you chose that over websockets, or over HTTP polling from your phantomjs client against a local node server..
What about using something like node-gir, or whatever appjs does to combine the event loops of node/v8 and chromium/v8?
You don't even need phantomjs, really. You can use python + webkitgtk+ through the gobject bindings. The problem with phantomjs is that it's using an old version of webkit from an old version of qtwebkit from qt 4.8, whenever that was released. By comparison, webkitgtk+ can be compiled from upstream webkit whenever you please.
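The basic pattern looks something like this; treat the exact signal and method names as assumptions, since they differ between WebKit1 and WebKit2 and across versions:

    # Rough sketch: drive WebKitGTK+ from Python via the GObject introspection
    # bindings (WebKit1-era names). Needs an X server (or Xvfb) to run.
    from gi.repository import Gtk, WebKit

    view = WebKit.WebView()

    def on_load_finished(view, frame):
        # Old trick: there's no direct "give me the rendered DOM" call, so stuff
        # the serialized DOM into the document title and read it back.
        view.execute_script("document.title = document.documentElement.outerHTML;")
        print(view.get_title())
        Gtk.main_quit()

    view.connect("load-finished", on_load_finished)
    view.load_uri("http://example.com/")   # placeholder URL
    Gtk.main()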
If you insist on controlling webkit through javascript, you can use gnome-seedjs. But this is problematic/annoying because there's no commonjs implementation yet; in phantomjs you can require() node modules in the outside context, but not so much in gnome-seedjs.
Also, needing X means it's not actually headless, even if you're using xserver-xorg-video-dummy or xvfb. For this reason, phantomjs got rid of the X requirement a number of versions ago.
No, I haven't found a definitive tutorial or reference (even for webkit). In general, look around for "import gi.repository.webkit" and you will find relevant things. I'm not sure how the other phantomjs developers are learning webkit things; probably just by reading code.
For the past couple of years, I've always done any web scraping with trusty Python + Beautiful Soup or ElementTree. I've recently started doing it with Clojure + Enlive (mostly as an excuse to use Clojure for less academic exercises) and I really like it.
From that perspective, Upton looks pretty cool, especially the debug mode.
Have you taken a look at Scrapy (http://scrapy.org/)? My evolution has been from Perl to Python and I recently did a project with Scrapy that left me pretty happy.
Scrapy is really awesome. It's a soup-to-nuts queuing, fetching and extraction workflow tool, and if I had to start a larger-than-trivial project (like an RSS reader or shopping aggregation site) I would base my spider toolchain on it.
Yea - I was definitely impressed by it and I just got started. It felt as if I was able to get rid of all the boilerplate and just focus on getting the next page that needed to be crawled and the information to extract.
Been loving Diffbot for my startup. The only downside is that you very rapidly outgrow the free tier under any production load. It works fantastically well though.
AFAICT, YQL can only handle scraping individual pages that way.
Upton can scrape a whole set of pages. If you have an index page that lists the pages you're interested in (say you're interested in HN commenters on front-page posts), you could specify the front page URL and a selector for the links to comment pages, and Upton would automatically scrape those pages and return them to you.
Upton could even write the commenter names to a CSV for you with just a filename and a CSS selector/XPath expression.
It's not stuff you couldn't do with YQL or Python/BeautifulSoup. But it's stuff that I didn't want to have to write over and over each time I wrote a new scraper.
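Roughly, the boilerplate it saves you from rewriting looks something like this in Python/BeautifulSoup (the HN selectors here are illustrative, not tested):

    import csv
    import requests
    from bs4 import BeautifulSoup

    INDEX_URL = "https://news.ycombinator.com/"

    # Fetch the index page and pull out the links to the instance pages.
    index = BeautifulSoup(requests.get(INDEX_URL).text)
    comment_links = [a["href"] for a in index.select("td.subtext a")
                     if "item?id=" in a.get("href", "")]

    # Fetch each instance page and write one row per commenter to a CSV.
    with open("commenters.csv", "w") as f:
        writer = csv.writer(f)
        for link in comment_links:
            page = BeautifulSoup(requests.get(INDEX_URL + link).text)
            for user in page.select("a.hnuser"):   # illustrative selector
                writer.writerow([link, user.get_text()])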
This looks really good. We're mainly Python-focused and have been working on a tool to 'train' a crawler to extract specific elements of a page. As this is a crawling thread, I hope you don't mind me asking for some advice on where to take it from here :)
Here's how it currently works:
1) It has a queue of domains that I have pre-processed. For now I've restricted it to pages that I think are e-commerce, based on $ signs, add-to-cart/basket-type links, etc.
2) There is a visual tool that I then use to select certain parts of the page, e.g. price, product, image. I save these out as XPaths.
3) Once I have done one URL, I send a crawler to that domain, find other pages that fit the profile of an e-commerce page, and apply the same mapping from step 2 to extract the data (roughly like the sketch below).
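Concretely, step 3 boils down to something like this (the field names and XPaths are made-up examples, not my real mapping):

    import lxml.html

    # A saved "mapping" from the visual tool: field name -> XPath.
    mapping = {
        "price":   "//span[@class='price']/text()",
        "product": "//h1[@id='product-title']/text()",
        "image":   "//img[@id='main-image']/@src",
    }

    def extract(html, mapping):
        # Apply the saved XPaths to a newly crawled page from the same domain.
        doc = lxml.html.fromstring(html)
        record = {}
        for field, xpath in mapping.items():
            matches = doc.xpath(xpath)
            record[field] = matches[0].strip() if matches else None
        return record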
I'm not sure if I'm doing this the right way. If a site/page changes structure, I may have to re-map the data. I was hoping someone would have some pointers on other ways to approach this. I've also had some problems with JavaScript-heavy sites.
If anyone knows of a way to do this kind of screen scraping more automatically, I'd really appreciate a steer!
I've done almost exactly this in the past. There's a hell of a lot of fiddling in keeping the xpaths both stable and general enough to be useful.
One approach I found absolutely vital was to have a rewriting, caching proxy between the crawler and the upstream site. This proxy allowed me to rewrite the page content into something much simpler for the crawler to get to grips with (RSS or Atom, say). I used Celerity (http://celerity.rubyforge.org/) with a hacked-on Mechanize API to do the rewriting, which let me handle JS-heavy pages almost as easily as static HTML ones. My original inspiration for this was _why's Mousehole (the source for which is here: https://github.com/evaryont/mousehole, I've got no idea if it runs on recent Rubies).
The proxy also gives you somewhere to raise an alert if, all of a sudden, your scraping fails because of an upstream change.
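In Python terms, the shape of that layer was roughly this (not the actual Celerity setup; the caching and the "simplified" output are just illustrative):

    import hashlib
    import os
    import requests
    import lxml.html

    CACHE_DIR = "cache"   # assumption: a simple on-disk cache keyed by URL hash

    def fetch(url):
        # Serve from the cache if we've seen this URL before; otherwise hit upstream.
        key = hashlib.sha1(url.encode("utf-8")).hexdigest()
        path = os.path.join(CACHE_DIR, key)
        if os.path.exists(path):
            with open(path) as f:
                return f.read()
        html = requests.get(url).text
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(path, "w") as f:
            f.write(html)
        return html

    def simplify(html):
        # Rewrite the page into something much simpler for the crawler:
        # here, just (text, link) pairs, analogous to an RSS/Atom feed.
        doc = lxml.html.fromstring(html)
        return [(a.text_content().strip(), a.get("href"))
                for a in doc.xpath("//a[@href]")]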
One tool I always intended to make some use of, but never got round to, was Ariel: http://ariel.rubyforge.org/. It looks like it ought to be able to totally remove the need to manually extract xpaths.
What separates this from Nokogiri? (Don't take that as critical, more working code out in the world is better. Just wondering, as I use Nokogiri heavily for our company chat-bot, and couldn't tell the answer at a quick glance.)
Reposting a comment by the author from the article:
> Upton depends on Nokogiri, which is basically the BeautifulSoup port for Ruby.
> If you just used vanilla Nokogiri, you'd be responsible for writing code to fetch, save (maybe), debug and sew together all the pieces of your web scraper. Upton does a lot of that work for you, so you can skip the boilerplate.
Recently I've been using Fake Browser (fakeapp.com) for web scraping. While it's inefficient for large jobs, it's awesome for hacking together quick scripts. With Fake you can write your scraper in JavaScript and it runs in an actual browser, so it's a very visual process. It's great for instances where you want to get past complicated authentication systems without writing code: just sign in manually and start your script.
Sounds pretty cool. Also, for those operating in "java land" there are Commons HttpClient[1] and Apache Tika[2] which, together, are a pretty potent combination for scraping web data.
Is this somewhat similar to embedly.com? That's what I currently use, and it seems to work fine, but I'm curious whether that's the best thing to be using. (I'm typically just grabbing the image thumbnail, but it'd be nice to grab some text if it exists too.)
I had never seen this before. I will check it out, at least for inspiration.
I'd say "scraping" is a little more focused on extracting data from specific pages as opposed to ALL pages as in "spidering", but the two are certainly cousins if not siblings. Anemone would probably be good at the same sorts of tasks Upton is designed for (i.e. scraping data contained on multiple pages).
It can be. I've just launched https://myshopdata.com, where online retailers can scrape their own content for synchronising with third-party marketplaces.
My Shop Data is all PHP & MySQL, with Slim, Twig & Bootstrap.
The web scraping aspect is another product of mine (forgive the clunky homepage, I'm going to turn this into an API platform) - https://grabnotify.com
GrabNotify is Node.js, Mongo, PHP, Bootstrap and PhantomJS. The undocumented API allows you to create a web crawler and define a JavaScript algorithm to extract the data from the page. Some retailers have dropdowns which update stock, images, and so on, so the crawler can simulate mouse events. My Shop Data will supply a custom crawler algorithm for each e-commerce web site through the API.
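As a rough illustration of the dropdown case (not GrabNotify's actual code; this is the same idea sketched with Selenium 2/3 driving PhantomJS, and the element ids are made up):

    from selenium import webdriver
    from selenium.webdriver.support.ui import Select, WebDriverWait

    driver = webdriver.PhantomJS()   # Selenium 2/3-era PhantomJS driver
    driver.get("http://shop.example.com/product/123")   # placeholder URL

    # Pick a variant from the dropdown; the site's client-side JS then swaps
    # in the stock level and images for that variant.
    Select(driver.find_element_by_id("variant")).select_by_visible_text("Large")

    # Wait for the updated stock figure to appear, then read it.
    WebDriverWait(driver, 10).until(
        lambda d: d.find_element_by_id("stock").text.strip() != "")
    print(driver.find_element_by_id("stock").text)
    driver.quit()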
And finally, I've written an HTML-to-Markdown translator to extract page descriptions while keeping some formatting, so they're transferable to other systems that don't support HTML.
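The core of the translator is just a recursive walk over the DOM that keeps a handful of tags; a toy sketch of the idea (not my actual implementation):

    import lxml.html

    def to_markdown(html):
        doc = lxml.html.fromstring(html)

        def render(node):
            if not isinstance(node.tag, str):   # skip comments etc.
                return ""
            text = node.text or ""
            for child in node:
                text += render(child) + (child.tail or "")
            if node.tag in ("strong", "b"):
                return "**%s**" % text
            if node.tag in ("em", "i"):
                return "*%s*" % text
            if node.tag == "a":
                return "[%s](%s)" % (text, node.get("href", ""))
            if node.tag == "p":
                return text + "\n\n"
            if node.tag == "br":
                return "\n"
            return text

        return render(doc).strip()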
The whole legality issue of web scraping is an interesting one. I'm planning to position GrabNotify as a web crawler, page monitor and HTML-to-data tool for people who own, or have permission to scrape, the original content but need a simple way to grab the HTML, turn it into data and monitor it. I'm not really interested in building a business around scraping other people's content without their permission.
[1] http://npmjs.org/package/cheerio