Does Upton have any way of dealing with JavaScript or ajax calls? For a lot of t...

jeremybmerrill · on July 22, 2013

Nope. Sorry. :)

While I'm sure this would be possible, I think a node.js scraper would probably be a better fit for that sort of project.

rst · on July 22, 2013

Perhaps better: phantomjs (http://phantomjs.org), possibly with casper (http://casperjs.org) on top to handle some of the glue code. Phantomjs is a full headless browser (it'll give you screenshots of the pages it downloads if you want them); casper is a library that makes sequencing tasks somewhat easier.

Node, by itself, doesn't have full versions of a lot of the objects that JavaScript on the pages would refer to (DOM, event model, etc.); phantomjs gives you all of that.
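
To illustrate the missing-DOM point, here is a minimal sketch in Python using only the standard library. The sample page, the `VisibleText` class, and `static_text` are made up for illustration. A static parser only sees the markup as served, so text that a script would write into the page at runtime never shows up:

```python
from html.parser import HTMLParser

# Hypothetical page: the price is written into the DOM by a script at runtime.
PAGE = """
<html><body>
<div id="price"></div>
<script>document.getElementById("price").textContent = "42.00";</script>
</body></html>
"""


class VisibleText(HTMLParser):
    """Collects text outside <script>/<style>, i.e. what a static scraper sees."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth of script/style elements we are inside

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is not script or style source.
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def static_text(html):
    """Return the visible text chunks found without executing any script."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Calling `static_text(PAGE)` finds no visible text, because the price only exists after the script has run in a real browser environment.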

baudehlo · on July 22, 2013

Or do what we do at Hubdoc, and use both Node and Phantom. Node for performance where it's possible, and Phantom where the site has been built in such a way that scraping in Node becomes not worth the effort of figuring out all the weird stuff they've done in client-side JS.

We maintain a Node to Phantom bridge for this: https://github.com/baudehlo/node-phantom-simple
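
The general shape of that "cheap path first, headless browser only when needed" idea might look roughly like the sketch below. It is written in Python rather than Node, it is not how node-phantom-simple or Hubdoc's code works, and every name in it (`scrape`, `extract_price`, `fetch_static`, `fetch_rendered`) is hypothetical. The two fetchers are passed in as plain callables so that any HTTP client or headless browser could sit behind them:

```python
import re


def extract_price(html):
    """Hypothetical extractor: return the price text, or None if it is absent."""
    match = re.search(r'<div id="price">([^<]+)</div>', html)
    return match.group(1) if match else None


def scrape(url, fetch_static, fetch_rendered, extract=extract_price):
    """Try the cheap static fetch first; fall back to a rendered fetch.

    fetch_static(url) and fetch_rendered(url) are assumed to return HTML
    strings. The rendered one stands in for a headless browser that has
    already executed the page's scripts.
    Returns a (result, path_used) tuple.
    """
    result = extract(fetch_static(url))
    if result is not None:
        return result, "static"
    # Static markup did not contain the data; pay for the full browser.
    return extract(fetch_rendered(url)), "rendered"
```

Pages that serve their data in the initial HTML never touch the slow path; only pages that build it in client-side script do.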

kanzure · on July 22, 2013

Curious: you use the webserver module in phantomjs, is that right? And that's how you do the inter-process communication? I'm curious how you chose that over websockets, or over HTTP polling from your phantomjs client against a local node server.

What about using something like node-gir, or whatever appjs does to combine the event loops of node/v8 and chromium/v8?

saryant · on July 22, 2013

I wrote a backend system using Phantom and Akka to generate graphs using D3 and rasterize them into PNGs and put them into user-specific emails.

Phantom has some quirks but overall it's pretty solid.

a8da6b0c91d · on July 22, 2013

There's phantomjs for that. Though I've actually had the most success with Perl's WWW::Mechanize::Firefox and a headless X server.

kanzure · on July 22, 2013

You don't even need phantomjs, really. You can use python + webkitgtk+ through the gobject bindings. The problem with phantomjs is that it's using an old version of webkit from an old version of qtwebkit from qt 4.8, whenever that was released. By comparison, webkitgtk+ can be compiled from upstream webkit whenever you please.

If you insist on controlling webkit through javascript, you can use gnome-seedjs. But this is problematic/annoying because there's no commonjs implementation yet. In phantomjs you can require() node modules in the outside context. Not so much in gnome-seedjs.

Also, X means it's not actually headless, even if you're using xserver-xorg-video-dummy or xvfb. For this reason, phantomjs got rid of the X requirement a number of versions ago.

dangayle · on July 22, 2013

Do you have any other resources about using webkitgtk with python for this sort of purpose?

kanzure · on July 23, 2013

No, I haven't found a definitive tutorial or reference (even for webkit). In general, look around for "import gi.repository.webkit" and you will find relevant things. I am not very sure how the other phantomjs developers are learning webkit things; probably just reading code.

jjoergensen · on July 22, 2013

http://en.wikipedia.org/wiki/Headless_system

wslh · on July 23, 2013

Or you can use a full browser with an extension to do the scraping. That way you stay up to date with the latest browser release.