
Does Upton have any way of dealing with JavaScript or Ajax calls? For a lot of the scraping I do in Python, this is crucial. I currently use Selenium's WebDriver (along with Beautiful Soup or lxml) for that, but I'm definitely open to other options.
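
For readers unfamiliar with the workflow described above, here is a minimal sketch of it, not the poster's actual code. The function names (`fetch_rendered_html`, `extract_links`) are hypothetical, the standard library's `html.parser` stands in for Beautiful Soup or lxml so the parsing half runs without extra packages, and it assumes Selenium plus Firefox and its driver are installed.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (stdlib stand-in for Beautiful Soup or lxml)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return the href of every <a> tag in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def fetch_rendered_html(url):
    """Load the page in a real browser and return the DOM serialized as HTML."""
    # Imported lazily so the parsing half works without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumption: Firefox and its driver are available
    try:
        driver.get(url)
        # page_source reflects scripts that have run so far; slow Ajax
        # calls may need an explicit wait before this line.
        return driver.page_source
    finally:
        driver.quit()
```

Usage would look like `extract_links(fetch_rendered_html(url))`: the browser runs the page's JavaScript, and the parser only ever sees the resulting markup.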



Nope. Sorry. :)

While I'm sure this would be possible, I think a Node.js scraper would probably be a better fit for that sort of project.


PhantomJS (http://phantomjs.org) might be a better choice, perhaps with CasperJS (http://casperjs.org) on top to handle some of the glue code. PhantomJS is a full headless browser (it'll give you screenshots of the pages it downloads if you want them); CasperJS is a library that makes sequencing tasks somewhat easier.

Node, by itself, doesn't have full versions of a lot of the objects that JavaScript on the pages would refer to (DOM, event model, etc.); PhantomJS gives you all of that.
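
The same gap affects any client that doesn't execute the page's scripts, not just Node. A rough illustration in Python (the language the original question uses), with a made-up page and hypothetical helper names: a plain HTML parser sees the `<script>` tag but never the element the script would have created.

```python
from html.parser import HTMLParser

# Made-up page: the <li> only exists after a browser runs the script.
PAGE = """<html><body>
<ul id="items"></ul>
<script>
  var li = document.createElement("li");
  li.textContent = "loaded by script";
  document.getElementById("items").appendChild(li);
</script>
</body></html>"""


class TagCounter(HTMLParser):
    """Count how many times each start tag appears in the markup."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(html):
    """Return a dict mapping tag name to number of occurrences."""
    counter = TagCounter()
    counter.feed(html)
    return counter.counts


counts = count_tags(PAGE)
# The parser finds the empty <ul> and the <script>, but no <li>,
# because nothing here has a DOM or runs the JavaScript.
print(counts.get("ul", 0), counts.get("script", 0), counts.get("li", 0))
```

A headless browser would report one `li` for the same page, which is the difference the comment above is pointing at.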


Or do what we do at Hubdoc, and use both Node and Phantom: Node for performance where it's possible, and Phantom where the site has been built in such a way that scraping in Node isn't worth the effort of figuring out all the weird stuff they've done in client-side JS.

We maintain a Node to Phantom bridge for this: https://github.com/baudehlo/node-phantom-simple


You use the webserver module in PhantomJS, is that right? And that's how you do the inter-process communication? I'm curious how you chose that over WebSockets, or over HTTP polling from your PhantomJS client against a local Node server.

What about using something like node-gir, or whatever appjs does to combine the event loops of node/v8 and chromium/v8?


I wrote a backend system using Phantom and Akka to generate graphs using D3 and rasterize them into PNGs and put them into user-specific emails.

Phantom has some quirks but overall it's pretty solid.


There's PhantomJS for that. Though I've actually had the most success with Perl's WWW::Mechanize::Firefox and a headless X server.


You don't even need PhantomJS, really. You can use Python + WebKitGTK+ through the GObject bindings. The problem with PhantomJS is that it's using an old version of WebKit from an old version of QtWebKit from Qt 4.8, whenever that was released. By comparison, WebKitGTK+ can be compiled from upstream WebKit whenever you please.

If you insist on controlling WebKit through JavaScript, you can use gnome-seedjs. But this is annoying because there's no CommonJS implementation yet. In PhantomJS you can require() Node modules in the outside context; not so much in gnome-seedjs.

Also, needing X means it's not actually headless, even if you're using xserver-xorg-video-dummy or Xvfb. For this reason, PhantomJS got rid of the X requirement a number of versions ago.


Do you have any other resources about using WebKitGTK+ with Python for this sort of purpose?


No, I haven't found a definitive tutorial or reference (even for WebKit). In general, search for "import gi.repository.webkit" and you will find relevant things. I'm not sure how the other PhantomJS developers are learning WebKit things; probably just by reading code.



Or you can use a full browser with an extension to do the scraping. That way you stay up to date with the latest browser release.



