RoboBrowser: Your friendly neighborhood web scraper (github.com/jmcarp)
182 points by pmoriarty on May 12, 2016 | 58 comments



I'm surprised nobody has mentioned WWW::Mechanize, the classic Perl library [1], or the Python port of it [2], which is much closer to RoboBrowser than selenium/phantomjs/horseman.

[1] http://search.cpan.org/~ether/WWW-Mechanize-1.75/lib/WWW/Mec...

[2] https://pypi.python.org/pypi/mechanize/
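
For anyone who hasn't seen it, a minimal sketch of what the Python mechanize port looks like (the URL and form field names are made up for illustration):

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)             # mechanize obeys robots.txt by default
    br.open("https://example.com/login")    # placeholder URL
    br.select_form(nr=0)                    # pick the first form on the page
    br["username"] = "alice"                # assumed field names
    br["password"] = "secret"
    response = br.submit()
    print(response.read())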


I've used the Perl version a bit and loved it many years ago, but I've also had great success more recently with the Ruby [0] version of Mechanize.

[0] https://github.com/sparklemotion/mechanize


Mechanize is outdated and python 2 only. We tried it and switched to RoboBrowser.


I find WWW::Mechanize::Firefox an interesting alternative. The API is similar to WWW::Mechanize, but it remote controls an instance of Firefox using the MozRepl addon. This means scraping/interacting with pages using JS works fine.

http://search.cpan.org/~corion/WWW-Mechanize-Firefox-0.78/li...


In some significant ways, Python 2 is a better language than 3, though.


In what ways?


Python 3 fan myself, but there are a few things I've heard from colleagues:

1. More modules for Python 2 than Python 3. A lot of projects are forced to drop down to Python 2 so they can use the modules they want.

2. It's what they're familiar with. Generally the older developers like using Python 2 because they know exactly what they're getting, no matter what strings are attached.

3. Better syntax. Apparently some people enjoy the Python 2 syntax more than the Python 3 one. Not having to use parentheses for print statements seems to be the biggest plus, even though in my opinion the parenthesized form looks more Pythonic.
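
For reference, the print difference in question:

    # Python 2 print statement:
    #     print "hello"
    # Python 3 print function (which also works in Python 2):
    print("hello")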


I wish scrapers came in the form of a Chrome extension: it would record my webpage actions as macros, then execute those macros on a remote headless server, revisiting the pages periodically without downtime. No need to program or configure anything.


I would urge you to try nightmare [1], a web scraper that uses Electron as a headless browser. They have a plugin called daydream [2] that records your webpage actions and converts them into a nightmare script. You have to do a couple of retries, but it does work.

[1] https://github.com/segmentio/nightmare [2] https://github.com/segmentio/daydream


Selenium IDE [0] provides a nice and simple way of doing this: it's just a very simple Firefox add-on that lets you record and play back mouse movements, typing, etc. You can then refine your macro through Selenium WebDriver.

[0] http://www.seleniumhq.org/
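
Refining a recorded macro with the Python WebDriver bindings might look roughly like this (the URL and element name are placeholders):

    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get("https://example.com")          # placeholder URL
    box = driver.find_element_by_name("q")     # assumed field name
    box.send_keys("robobrowser")
    box.submit()
    print(driver.title)
    driver.quit()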


One problem with Selenium, last time I used it, was that it is very slow. Maybe this Python library fixes that (i.e. no browser window needs to open).


Alas, this is both a downside and an upside of Selenium. It's rather slow because it does need to spin up a Firefox instance, but it is very user-friendly and easy to learn because you can see exactly where you are at just by looking at the web browser.

You can run headless Selenium, and speed it up by using a static Firefox instance, but even then it'll be maybe 2-3x slower than some of the others.

The main reason this is (in my opinion) better than other solutions is that you can see the actual webpage it's loading, plus its sheer ease of use. You don't even really need any coding experience to get a simple test running.


You can record tests in Selenium or something like Capybara and replay them using something like PhantomJS, which is a headless browser-like JS execution environment that does things like generate a would-be DOM: https://github.com/jnicklas/capybara

You can also use Selenium tests with things like https://www.browserstack.com/automate , where TL;DR they run your selenium test on dozens of browser + platform combinations and send you the results, like screenshots and any javascript errors. If you're familiar with CI stuff, you can see how powerful this has the potential to be. It's non-trivial but very possible to run your own cluster of selenium nodes as well; check out the official Selenium Grid: http://www.seleniumhq.org/projects/grid/
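
Pointing an existing test at a grid hub (or a hosted service's remote endpoint) from Python is mostly a matter of swapping in a Remote driver; the hub URL here is a placeholder:

    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    driver = webdriver.Remote(
        command_executor="http://localhost:4444/wd/hub",  # your grid hub
        desired_capabilities=DesiredCapabilities.FIREFOX)
    driver.get("https://example.com")                     # placeholder URL
    print(driver.title)
    driver.quit()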


what is CI? ...CLI?


Continuous Integration.


You can use my project https://github.com/machinepublishers/jbrowserdriver which can run either headless or (what's the opposite of headless?) with a full GUI. It runs on Java using Java's built-in browser and will take a second to warm up when an instance is created, but after that you can reuse the existing browser since I added a reset() API. Performance is comparable to desktop browsers (slightly slower), and for every action it blocks until AJAX page loads finish.


Headful, I would think.


The only reason to use Selenium at all these days is because it can run your tests in multiple real browsers. If you don't need to test cross-browser support, then you can use phantomjs or one of the bazillions of other tools mentioned in this thread.


The Resurrectio Chrome extension allows you to do this with PhantomJS.


I know there are already extensions that help with generating finders for automated e2e testing; you might be able to use something like that.


Shameless plug, check out https://parsehub.com


Parsehub is pretty great, the free tier is interesting and the support is excellent.


Several startups have tried this. http://kimonolabs.com is one I had experimented with; it was recently bought and shut down by Palantir.


I use this all the time; it's excellent for less technical users. You can download a desktop version and still use it, despite the service being shut down.


Does this run JavaScript on the page? I've done quite a bit of scraping with Scrapy, and have had to use PhantomJS in many cases because static HTML alone doesn't get you what you're after.


At a glance, no - it uses Requests to fetch pages and BeautifulSoup to parse them, the latter of which only parses the HTML into a document object.

So static HTML parsing only.
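
A rough sketch of what that static-HTML workflow looks like with RoboBrowser, based on its README (the form id and field name here are assumptions):

    from robobrowser import RoboBrowser

    browser = RoboBrowser(history=True)
    browser.open("https://example.com")        # placeholder URL
    form = browser.get_form(id="search")       # assumed form id
    form["q"].value = "python"                 # assumed field name
    browser.submit_form(form)
    for link in browser.select("a"):           # BeautifulSoup-style CSS selection
        print(link.text, link.get("href"))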


I use PhantomJS as well, but (assuming you haven't already) look at CasperJS. It uses PhantomJS, but it is more friendly to use for bigger tasks.


You could use Splash for JS as well. (Disclaimer: working for Scrapinghub, the main maintainers of Scrapy/Splash.)


I've had a lot of success scraping websites with Capybara [1]. It's intended for writing acceptance tests of web apps, but it works remarkably well for scraping websites. It's written in Ruby, but the DSL it provides for interacting with web pages should be pretty understandable to anybody who's programmed before. It also supports multiple browsers, which means you can tradeoff along these axes:

- Headless vs. not

- JS support vs. not

I put a repo together with a sample script [2] for scraping leads off of a website which I will not name, but whose name rhymes with 'help'. It uses the PhantomJS browser for headless JS support. It also includes a Vagrantfile so you can avoid installing all the dependencies on your local machine.

[1]: https://github.com/jnicklas/capybara

[2]: https://github.com/toasterlovin/scraping-yalp


I love PhantomJS or SlimerJS for scraping. Everything else includes extra hassles for cookie management, JavaScript emulation, faking user agents and whatnot. Best to simply use a headless browser. Selenium seems overly complicated, too.


Interesting for unprotected websites, but it's easy to detect and block: no valid JS, no valid meta header, no valid cookie, suspicious behavior...

Selenium is a much more "elaborate" solution, but it can still be detected most of the time.

Disclosure: I'm a DataDome co-founder. If you want to detect bad bots and scrapers on your website, don't hesitate to try it out for free and share your feedback with us: https://datadome.co


I realize you have reasons not to answer this question, but out of curiosity, what sorts of things can tip off the fact that a site is getting scraped by a real browser driven by Selenium?


Of course I cannot go into much detail, but we are using behavior detection and JavaScript tracking (mouse, scroll, screen...).


After a glance, I'd say that if the page needs JavaScript you use Selenium, otherwise you use this. So this is like Selenium without JavaScript. Am I right?


Does it support sites which require a JS enabled browser?


It doesn't. To scrape (or fake-API) js-only websites you have to either:

- drive a browser (Firefox/Chrome) via the already-mentioned selenium/webdriver (potentially hiding the actual browser window in a virtual X display by wrapping the whole thing with xvfb-run; see the sketch after the links below),

- or use one of the webkit-based toolkits: phantomjs [1] or headless horseman [2].

There is also an interesting project that combines the two, i.e. it drives Firefox (or, more precisely, a slightly outdated version of Gecko) to emulate a phantomjs-compatible API. [3]

phantomjs/slimerjs are pretty popular and even have tools that run on top of them, such as casperjs [4], which is geared more toward automated website testing but can be quite good at scraping or fake-APIing too.

[1] http://phantomjs.org/

[2] https://github.com/johntitus/node-horseman

[3] https://slimerjs.org/

[4] http://casperjs.org/
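
The xvfb-run approach mentioned above amounts to an ordinary Selenium script launched under a virtual display, so no browser window ever appears (URL is a placeholder):

    # run as:  xvfb-run -a python scrape.py
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get("https://example.com/js-only-page")   # placeholder URL
    html = driver.page_source                        # DOM after JS has executed
    print(html[:200])
    driver.quit()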


I recently wrote a browser-driven scraper using Nightmare[1], which uses Electron under the hood. Another option for those who prefer python is dryscrape[2], although I haven't tried it.

[1] https://github.com/segmentio/nightmare

[2] http://dryscrape.readthedocs.io/en/latest/


Dryscrape is really cool! Thanks for sharing!


Last time I needed something like this I used selenium. And I use requests the rest of the time.


Same here. I use the Python Selenium bindings to hit a Selenium server, which gives some speed improvements. It works with Chrome/Firefox/PhantomJS, and I can inject custom JavaScript over the pages (roughly as in the sketch below). It's still about 10 seconds to load, render, and process a page.
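
Injecting JavaScript with the Python bindings looks something like this; the driver choice and the script itself are just placeholders:

    from selenium import webdriver

    driver = webdriver.PhantomJS()     # or Firefox()/Chrome(), local or Remote
    driver.get("https://example.com")  # placeholder URL
    link_count = driver.execute_script(
        "return document.querySelectorAll('a').length;")
    print(link_count)
    driver.quit()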


What benefit does it provide in comparison to Scrapy?


From what I can tell, having only recently started to use Scrapy, a lot more "magic", shall we say, happens in the background, so long procedures that could run a few hundred lines using bs4/requests/mechanize/etc. can be reduced to a lot less. Looking at RoboBrowser, it seems like it will reduce some of the coding effort, but not to the extent that Scrapy does.


I think the main difference is that Scrapy is async - it downloads pages in parallel by default, so it is more efficient. But async APIs can be harder to use - you need callbacks or generators everywhere - so sync packages (like RoboBrowser) can be easier to get started with.
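
A minimal Scrapy spider sketch showing that callback style (URLs and selectors are made up for illustration):

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"
        start_urls = ["https://example.com/page/1"]    # placeholder

        def parse(self, response):
            for href in response.css("a::attr(href)").extract():
                # each follow-up request gets its own callback
                yield scrapy.Request(response.urljoin(href),
                                     callback=self.parse_item)

        def parse_item(self, response):
            yield {"title": response.css("title::text").extract_first()}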


Hmm, I can see why I'd want to use this library over piecing together requests and BS4 myself for every project. I love how simple the examples look.

I have a project I'm working on that will involve scraping many different websites on a daily basis. My only scraping experience so far is using cheerio[0] to scrape a single page with a 1,000 row HTML table. Should I start with something BS-based like this or should I jump straight into Scrapy? Or are there any other alternatives I should try?

[0]: https://github.com/cheeriojs/cheerio


If you want to scrape many websites on a daily basis, have a look at https://www.apifier.com as an alternative.

Disclaimer: I'm a cofounder there


I've used robobrowser for a project, where I needed to log in to a website and subsequently access pages as a logged in user. It worked well and I like the API. For "simple" scrapers that require authentication or some form of user interaction this is a good tool. If I need to scrape many pages from a site as fast as possible, I'd probably go for Scrapy though.


I'd like to write a small scraper for a website that uses NTLM authentication; the headers it sends are:

    HTTP/1.1 401 Unauthorized
    Server: Microsoft-IIS/8.5
    WWW-Authenticate: NTLM
    WWW-Authenticate: Negotiate
    ...
Does RoboBrowser support these kinds of protocols? I tried to get it to work with Scrapy, but it seemed non-trivial...


It's been years since I've used it, but I think cntlm can do this. Point your Scrapy code at the cntlm instance, and it should handle all of the NTLM headers for you.
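
Another possible route, since RoboBrowser rides on requests: hand it a pre-configured session using the requests-ntlm add-on. This is an untested sketch; it assumes RoboBrowser accepts a session argument, and the domain, credentials and URL are placeholders.

    import requests
    from requests_ntlm import HttpNtlmAuth
    from robobrowser import RoboBrowser

    session = requests.Session()
    session.auth = HttpNtlmAuth("DOMAIN\\user", "password")  # placeholder creds
    browser = RoboBrowser(session=session)    # assumes a session kwarg exists
    browser.open("https://intranet.example.com/")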


Just curious... what is everyone using scrapers for?

I've done a lot of work scraping various sites and I can tell you this: basing any product on your ability to aggregate data via scraping will not work in the long run.

Eventually you will be asked not to scrape and then you'll get sued if you don't stop.

Case law is not in your favor here. See Craigslist v. 3Taps.


I had a quick look into the repository and unfortunately, it doesn't support WebSockets. Does anyone know of a browser automation library/framework that does support WebSockets?


What are the differences (advantages) compared to Selenium WebDriver, and why should I use it?


If this doesn't run JS, then what's its edge over the requests lib?


Could someone explain what this is for, maybe with a couple of examples? This is getting to be a problem on HN.


Really? It's right there on the main Github page, a 3 sentence description and 6 code examples.


I know, I read it. It's for "browsing the web without a standalone web browser," and I'm sure that if that was something I had needed, I would have said "Oh! How lovely!" But, since I didn't have that need already, I'm not clear why someone would want that. And I'd like to know! So could you give me a couple of practical use cases? "User stories," if you're into that?


Here are a couple of use cases:

* Let's say you are Google and you want to test whether the site is working correctly every day. You could code up a Python script that opens www.google.com, searches for "facebook" and makes sure that the first result points to www.facebook.com. This script can be configured to run every day, and if someone accidentally pushes an update to the site that causes www.facebook.com to not show up as the top result, the script automatically reverts the site back to its original state. This means users continue to get the best search results even if an engineer made a mistake with the ranking algorithm.

* Let's say you are eBay and you want to make sure that the prices for products on your site are competitive with those at Amazon. You can code up a Python script which searches for some products that customers regularly buy, like an iPhone, and extracts the lowest offered price at Amazon. It can then compare that with the lowest offered price of an iPhone on eBay. If the lowest offered price on eBay is much higher than that at Amazon, you can offer a discount. This convinces the customer that they are getting competitive offers from eBay and stops them from writing off eBay when they want to shop online.
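
A toy version of the second use case, using requests + BeautifulSoup (the URL and the CSS class are made up for illustration):

    import requests
    from bs4 import BeautifulSoup

    resp = requests.get("https://example.com/search?q=phone")   # placeholder
    soup = BeautifulSoup(resp.text, "html.parser")
    prices = [float(tag.text.strip().lstrip("$")) for tag in soup.select(".price")]
    if prices:
        print("lowest competitor price:", min(prices))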


A dead simple example - scraping data from a webpage that doesn't have an API. You could go down the wrong route of trying to parse the HTML yourself and end up implementing a lot of logic manually, OR you could use this wonderful library.


Automation: go to this page, fill in the form, push submit, receive the result, process the result, send the processed result to another program for further analysis, and finally emit an alert when attention is needed.



