Ghost.py - a webkit web client written in python. (jeanphix.me)
173 points by aeurielesn on June 17, 2012 | 32 comments



Individuals might also be interested in the reimplemented version of phantomjs in python (pyphantomjs): http://github.com/kanzure/pyphantomjs


pyphantomjs is not under active development anymore, unfortunately.


Good! A few days ago I was playing with mechanize to automate some form filling in WordPress posts (iTunes app details, automatically downloaded and then batch-added as post drafts). I gave up because mechanize has no AJAX/JavaScript support and turned instead to the Selenium web driver, which solved the problem in "seconds". I'll have to give Ghost.py a spin :)


Has anyone had any experience with both ghost and phantom (or any other options I may not have found), and know how they stack up in terms of rendering speed/etc? I'd imagine they're fairly similar, but if that's not the case I'd be heavily biased towards the faster of the two.


This is the Python equivalent of PhantomJS, which provides a programming interface for testing rendered web pages without the overhead of actually opening up a browser (a la Selenium).


In order for this to be usable for a broad range of projects, it must contain:

  * Cookie support (I see this is partially implemented)
  * File download support
  * Mouse movement API (move to pos X,Y - click)
  * NSPlugins support (Flash, etc)
I cannot find any reference to the latter three, so I assume they are not working yet. Once they are implemented, this will be a nice alternative to PhantomJS.


This is a wrapper around PyQt, which is a wrapper around WebKit in Qt. So if those things are supported in Qt and PyQt, there is hope.


Can it run inline Javascript as the page is loaded or do I have to explicitly tell it what JS to run? I want to scrape some pages that use JS packers to obfuscate their code so that it's only loaded by real browsers, but if I just use curl all I see is JS that needs to be evaluated before I can get anything useful out of it.


"JS packers to obfuscate their code so that it's only loaded by real browsers"

this is probably not what's happening. more likely, it's obfuscated for other reasons. curl doesn't parse or execute javascript.


It actually is in the case I'm talking about. I'm talking about illegal websites where the only money generated is by advertisements on human eyeballs. They go way out of their way to make sure no scrapers/robots can see the videos on the page since it costs them money for bandwidth. In addition to referrer checking and captcha, they also have inline javascript that evals itself to un-obfuscate itself and load the video on the page so that if someone somehow beats the first two methods and loads it by a command line interface, they still don't get the URL to the video.


pretty cool, wasn't aware of this at all. thanks for the explanation. but even if it were unpacked, curl wouldn't execute it.


On the other hand, I've learned something new about the paranoia of people who don't actually work in web development.

Explains those NoScript people quite well.


What do you mean? Who in your scenario here doesn't actually work in web development?


Can you provide some examples of such sites? I would love to learn more about this technique.


I sometimes `wget` the URLs in my spambox out of genuine curiosity as to what people are actually sending me, and there are a bunch of common patterns.

What's very surprising is the "obfuscated eval" statement -- some term which ultimately evaluates to 'window' is queried for something crazy like:

    w[(typeof 3)[4] + "va" + (document.body + "")[17]]
which ultimately is 'eval'. This is often combined with some sort of self-decrypting almost-binary-looking payload hidden in a div somewhere and requested by div.innerHTML. Replacing the "eval" with "console.log" can give you the decrypted payload, usually a redirect to another redirect to something which runs a Flash script, which is where my analysis stops.
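
To make the string arithmetic concrete, here is a small Python sketch that mimics the two JavaScript coercions by hand. The two literal strings are assumptions standing in for what a WebKit-style browser gives for `typeof 3` and `document.body + ""`; other engines may stringify the body element differently, and `rebuild_name` is just an illustrative helper.

```python
# Stand-ins for the JavaScript values (assumed, not computed by a real browser).
TYPEOF_3 = "number"                          # typeof 3
BODY_AS_STRING = "[object HTMLBodyElement]"  # document.body + ""


def rebuild_name(typeof_value: str, body_string: str) -> str:
    """Mirror (typeof 3)[4] + "va" + (document.body + "")[17]."""
    return typeof_value[4] + "va" + body_string[17]


hidden_name = rebuild_name(TYPEOF_3, BODY_AS_STRING)
print(hidden_name)  # prints: eval
```

Index 4 of "number" is "e" and index 17 of the body string is "l", so the lookup on `w` resolves to `w["eval"]`.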

I am not sure why they do this. My first thought is "to prevent being automatically taken down by The Man", but The Man could afford to automatically dispose of a computer while monitoring its network traffic, rebooting from a fixed disk image like a LiveCD afterwards. So it shouldn't be too hard to automatically discover the domains, IPs, and malicious programs involved. I don't know why you'd obfuscate a redirection.


The signature for them is that they start off with:

    eval(function(p,a,c,k,e,d){...
I've seen them most commonly used in not-quite-so-legal streaming websites online where their biggest problem is blocking robots from scraping their site and losing their advertising revenue.
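
If you want to flag these automatically, a rough sketch in Python could look like the following. The regex only checks for the opening signature quoted above (with optional whitespace); `looks_packed` and the sample strings are made up for illustration, and a real scraper would likely need more than this.

```python
import re

# Matches the opening of a packed script: eval(function(p,a,c,k,e,d)
# Whitespace between tokens is tolerated; nothing after the argument list is checked.
PACKER_RE = re.compile(
    r"eval\s*\(\s*function\s*\(\s*p\s*,\s*a\s*,\s*c\s*,\s*k\s*,\s*e\s*,\s*d\s*\)"
)


def looks_packed(script_text: str) -> bool:
    """Return True if the script text contains the packer's opening signature."""
    return PACKER_RE.search(script_text) is not None


# Hypothetical inputs, not taken from any real site.
packed_sample = "eval(function(p,a,c,k,e,d){return p}('0 1',2,2,'a|b'.split('|'),0,{}))"
plain_sample = "console.log('hello');"

print(looks_packed(packed_sample), looks_packed(plain_sample))
```

This only tells you that a script is packed; it does not unpack or run anything.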


Okay then, that's the usual daftlogic[1] output.

I thought you were talking about something new.

[1] http://www.daftlogic.com/projects-online-javascript-obfuscat...


Can this be used to suck in streaming Flash video?

There is this streaming camera of the ocean that I check often, but it's Flash and I'd love to check it from my iPhone. Could Ghost.py be used to get the Flash video? (then turn it into images by other means). Thanks.


You could just rip the RTMP stream itself, or whatever the source data is. Decompile the swf and check it out for yourself, or the source url for the video feed is probably provided in some lame xml config file. No need to write software around an entire browser to get your ocean feed.


Just turn on the Net Inspector in your favorite browser's debugger and start the stream; you'll probably see what url it's loading.


And hopefully it's not RTMPS!


With the Qt framework, maybe: http://trac.webkit.org/wiki/QtWebKitPlugins

Good luck at it, though - seems the hard way to go about it.



Anyone know - how does it handle file downloads?


Seems to get a timeout error on a lot of stuff, even if I set the timeout to ~60 seconds with wait_timeout :(


[deleted]


No, because that from import will not put 'ghost' in your namespace, only Ghost.
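
A quick way to convince yourself, sketched with a standard-library module instead of ghost.py so it runs anywhere (the choice of `collections` and `OrderedDict` is arbitrary): run the `from` import in an empty namespace and look at which names it bound.

```python
# Run the import in a fresh, empty namespace so we can inspect what it binds.
namespace = {}
exec("from collections import OrderedDict", namespace)

# exec() adds "__builtins__" on its own, so skip dunder names.
bound_names = sorted(name for name in namespace if not name.startswith("__"))
print(bound_names)  # prints: ['OrderedDict']
```

Only the imported name shows up; the module name `collections` is never bound. By the same rule, `from ghost import Ghost` would bind only `Ghost`.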


Damn, you're absolutely right. That's what I get for reading code too early in the morning.


What is a webkit web client?


WebKit is a well-known browser rendering engine. "Client" means that you can use ghost.py to "operate" or "drive" a WebKit instance as a headless browser.


Webkit is an open source browser project. It's based on a rewrite of KDE's browser, mostly done by Apple. KDE was going to merge it back into KHTML, but they found Apple hard to work with (Apple made lots of big changes without explaining them very well, and KDE didn't have the resources to keep up).

It powers Safari and Chrome, and a lot of smaller projects (mobile browsers, email clients, etc).

"WebKit" can refer to the browser (Apple's fork Safari, or a free version of Safari), the rendering engine (WebCore), or the rendering and JavaScript components (WebCore and JavaScriptCore).

Google uses WebCore (or maybe a fork) for rendering, but their own JavaScript engine.

I'm guessing ghost.py uses WebCore and JavaScriptCore, so you can find out how a page will be seen by Safari (and probably Chrome, since Google's JS engine shouldn't be that different in the way it behaves).


Giggle at the ghostie! (Ok - and now burn my karma)


I don't know which one scares me more: the fact that there are My Little Pony references on HN, or that I actually got the reference.



