Web crawling and downloading ebooks with phantomJS (debuggerstepthrough.blogspot.co.il)
61 points by gillyb on June 17, 2012 | 26 comments



What is the benefit of using PhantomJS in this case? I understand that it is very useful if content is dependent on JS running.

But that doesn't seem to be the case here. With Python I would have used a parser like lxml or BeautifulSoup (and I'm sure there is something comparable for JS) coupled with Requests' async methods. That would probably not only end up with shorter and more concise code, but also be a lot faster.
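
For what it's worth, a minimal sketch of the JS counterpart being guessed at above, assuming the third-party request and cheerio Node modules (cheerio parses HTML and exposes jQuery-style selectors with no browser engine involved); the URL and selector are placeholders:

  // Plain HTTP fetch + static HTML parse, no browser engine involved.
  var request = require('request');
  var cheerio = require('cheerio');

  request('http://example.com/books', function (err, res, body) {
    if (err) throw err;
    var $ = cheerio.load(body);  // jQuery-style selectors over the parse tree
    console.log($('h2 a').eq(0).attr('title'));
  });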


The script appears to rely on jQuery (which is presumably already included in the pages being scraped in this case). If you're already familiar with using jQuery for DOM manipulation, then using it for scraping is incredibly easy.

One advantage is that it's not always instantly obvious whether you'll need JS to execute before you can scrape a page. If you start out with a simple HTML parser and then find out that you needed the JS to run first, you're going to have to start over. If you start out using PhantomJS and then find out that you don't need any of the original JS to run, your script still works.
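
To make that concrete, a minimal PhantomJS sketch of the jQuery-backed approach; the URL here is a placeholder, and the includeJs step is only needed when the target page doesn't already ship jQuery:

  var page = require('webpage').create();

  page.open('http://example.com/books', function (status) {
    if (status !== 'success') { phantom.exit(1); }
    // Inject jQuery only if the page doesn't already load it itself.
    page.includeJs('http://code.jquery.com/jquery-1.7.2.min.js', function () {
      // evaluate() runs inside the page and returns a serializable value.
      var titles = page.evaluate(function () {
        return [$($('h2 a')[0]).attr('title'), $($('h2 a')[1]).attr('title')];
      });
      console.log(titles.join('\n'));
      phantom.exit(0);
    });
  });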


When you say it 'relies on jQuery', I think it would be more precise to say that it relies on CSS selectors. Most parsing libraries will give you a way to query an HTML parse tree that way.

I guess PhantomJS is as good a tool as any, but there is really no need to evaluate JavaScript for a bit of plain HTTP+HTML parsing.


It looks to me like it's using jQuery-specific functions. It could be done with a simpler selector engine, but in this case, it looks pretty clear that it's either jQuery or a compatible library like Zepto.


As far as I can see, the only line with selectors is:

> return [ $($('h2 a')[0]).attr('title'), $($('h2 a')[1]).attr('title') ];

which is one CSS selector used twice, plus picking an element by index. That is covered by pretty much every available HTML parsing library.


It's CSS selectors and then wrapping the DOM element again in a function to give it an `attr` method, which is jQuery style. Other libraries may use that syntax too, but I'm pretty sure it started with jQuery (and if not, was certainly popularized by it).
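
For the record, the quoted line needs no library at all; a plain-DOM equivalent (querySelectorAll is built into the WebKit that PhantomJS embeds):

  // Plain-DOM equivalent of the quoted jQuery line -- no library required.
  var links = document.querySelectorAll('h2 a');
  var titles = [links[0].getAttribute('title'),
                links[1].getAttribute('title')];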


The bulk of the code here is establishing the spider behavior, not parsing HTML or launching requests.


This technique is certainly useful in a variety of instances, and I've done the same thing with both HTMLUnit and JWebUnit in Java. The "great site you know of" appears to be filled with books that are copyrighted and for-profit, so I'm not sure you'd really want to publicize what you're doing on your blog.


Good idea, and thanks for the comment! I removed the website I was scraping from the screenshots. By the way, all the old pages on the site I was scraping were broken links anyway. But I could now make small modifications to the script and have it working on a bunch of other sites as well... :)


In my country there is no problem with sharing copyrighted content as long as you do not make an economic profit. Also, I found the example very practical as an introduction.


"In my country there is no problem in sharing copyrighted ..."

I guess that means that each e-book only has to be sold once in your country? U.S. law might be overly protective of IP, but that's an interesting problem for publishers, who in theory need to earn a profit if they're going to continue as entities, as well as for authors who need to feed their families.

This obviously wasn't a problem when books were printed on dead trees, because you'd only share copies that had been purchased, and if your friend was reading your book, you no longer had access to it. Curiously, I could rent my copy of a book to you in the U.S. without violating copyright laws.


"I guess that means that each e-book only has to be sold once in your country?"

No.

It means that you can read a book and, if you really like it, buy it. It means that you can discover new authors, topics and so on without a huge investment.

This can sound demagogic: I have never had enough money to buy all the books I wanted, nor to waste on trying to discover new books and topics. But with downloaded books I learned about tech and other fields. Eventually, I bought more books (a lot from the U.S.) than I would have if I had not discovered these topics.

Allowing private sharing (as long as there is no profit) and supporting authors are not in direct confrontation. In my humble opinion and personal experience, they are correlated.


I was not passing judgement on either you or your country's laws ... And I think the ability to try a book out before you buy it is important to the market. I think there are a lot of us who spend time in bookstores simply for this reason.

"Private sharing" and especially recommendations are also my favorite ways to find worthwhile books.


No problem, smoyer. I answered in the first person because I thought my personal experience could be an interesting answer.

Recommendations are great once you both have some favorite books in common. Because of that, I always check Amazon's "Other people also bought" section.


You could also use the CasperJS wrapper and have the script automatically download those files for you.

See http://casperjs.org/api.html#casper.download
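
Something like this minimal sketch, say; the URL and the a.ebook selector are placeholders for whatever the real page uses:

  var casper = require('casper').create();

  casper.start('http://example.com/books', function () {
    // Pull the link URL out of the page, then let Casper fetch and save it.
    var href = this.evaluate(function () {
      return document.querySelector('a.ebook').href;
    });
    this.download(href, 'book.epub');
  });

  casper.run();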


Wow! I just looked at the CasperJS API, and it looks amazing!! Tons of great utilities to help you work with the DOM. This could be great, since I would like to implement the JSON wire protocol for PhantomJS, and using CasperJS will be of great help for that task! :)


If you like PhantomJS, be sure to also check out CasperJS. I use it with jQuery, Underscore and Underscore.string.

I just wish that jQuery had support for XPath style selectors as well. Chainable XPath would be hella sweet.
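
XPath is actually available without jQuery through the browser's own document.evaluate; a small sketch pulling the same titles as the line quoted upthread:

  // '//h2/a/@title' selects the same title attributes as the jQuery line.
  var result = document.evaluate('//h2/a/@title', document, null,
                                 XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
  var titles = [];
  for (var i = 0; i < result.snapshotLength; i++) {
    titles.push(result.snapshotItem(i).value);  // .value of an attribute node
  }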


PhantomJS + CasperJS make crawling easy. I built http://sp.iderman.info to help make scraping easier.


Phantom is awesome. I tried to use it for testing, but it's too slow (10 seconds for one test). Has anyone else tried it? Any tips?


We use it for hooking our JS unit tests for http://lanyrd.com/ into our Jenkins continuous integration server. Our tests are written in QUnit (so we can run them in a regular browser), but Phantom is used to execute them from the command line as part of a Jenkins task.
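
For reference, a minimal sketch of that kind of runner script, assuming a QUnit test page at a placeholder URL; it polls for QUnit's result summary and exits non-zero on failure, which is what fails the Jenkins build:

  var page = require('webpage').create();

  page.open('http://localhost:8000/tests.html', function (status) {
    if (status !== 'success') { phantom.exit(1); }
    var interval = setInterval(function () {
      var failed = page.evaluate(function () {
        // QUnit writes its summary into #qunit-testresult when it finishes.
        var el = document.getElementById('qunit-testresult');
        var span = el && el.getElementsByClassName('failed')[0];
        return span ? parseInt(span.innerHTML, 10) : null;
      });
      if (failed !== null) {
        clearInterval(interval);
        phantom.exit(failed > 0 ? 1 : 0);
      }
    }, 100);
  });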

We also have a separate Phantom/Casper script which we use to test that our Twitter login flow is working.


At my job, we do A LOT of UI testing using Selenium. We mainly use the FirefoxDriver, and I did some basic tests comparing it to PhantomJS, and PhantomJS is much faster! If we used the plain HtmlUnitDriver then it might be as fast, but that doesn't use WebKit, nor does it have good support for C#... :(


Is using Google Chrome faster?


What do you mean? QUnit testing? Yes, always.


Wouldn't Node be helpful here?


https://github.com/chriso/node.io

Looking forward to being able to distribute jobs across multiple machines.


Don't know, I'm not that familiar with Node... but it's definitely on my 'todo' list of stuff I need to learn more about :)



