I'm surprised nobody has mentioned WWW::Mechanize - the classic Perl library [1] or the Python port of it [2], which is much closer to RoboBrowser than Selenium/PhantomJS/Horseman.
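For anyone who hasn't tried the Python port, the flow is roughly this (the URL, form name and field below are made-up placeholders, not a real site):

    # Rough sketch using the Python mechanize port; URL, form and field names are hypothetical.
    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)        # mechanize obeys robots.txt by default
    br.open("https://example.com/search")
    br.select_form(name="search")      # pick the (made-up) search form
    br["q"] = "robobrowser"            # fill in a text field
    response = br.submit()
    print(response.read()[:200])

It's the same "stateful browser object" idea RoboBrowser uses, just with Perl heritage.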
I find WWW::Mechanize::Firefox an interesting alternative. The API is similar to WWW::Mechanize, but it remote controls an instance of Firefox using the MozRepl addon. This means scraping/interacting with pages using JS works fine.
Python 3 fan myself, but there are a few things I've heard from colleagues:
1. More modules for Python 2 than Python 3. A lot of projects are forced to fall back to Python 2 so they can use the modules they want.
2. It's what they're familiar with. Generally the older developers like using Python 2 because they know exactly what they're getting, no strings attached.
3. Better syntax. Apparently some people enjoy the Python 2 syntax more than the Python 3 one. Not using brackets for print statements seems to be the biggest plus, even though, in my opinion, the parenthesized Python 3 version looks more Pythonic.
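For anyone who hasn't seen it, the syntax point in question is just this:

    # In Python 2, print is a statement:
    #     print "hello"
    # In Python 3 it's a regular function call (this line also works in
    # Python 2 for a single argument):
    print("hello")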
I wish scrapers came in the form of a Chrome extension: it would record my webpage actions as macros, then execute the macros on a remote headless server with no downtime, revisiting the pages periodically. No need to program or configure anything.
I would urge you to try Nightmare [1]; it's a web scraper that uses Electron as a headless browser, and there's a plugin called Daydream [2] that records your webpage actions and converts them into a Nightmare script. You have to do a couple of retries, but it does work.
Selenium IDE [0] provides a nice and simple way of doing this; it's just a very simple Firefox addon that lets you record and play back mouse movements, typing, etc. You can then improve your macro with Selenium WebDriver.
Alas, this is both a downside and an upside of Selenium. It's rather slow because it does need to spin up a Firefox instance, but it is very user-friendly and easy to learn, because you can see exactly where you are just by looking at the web browser.
You can run Selenium headless, and speed it up by reusing a persistent Firefox instance, but even then it'll be maybe 2-3x slower than some of the others.
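One way to go headless with the same Selenium API (assuming the phantomjs binary is installed and on your PATH) is just to swap the driver:

    # Headless Selenium sketch: drive PhantomJS instead of desktop Firefox.
    # Assumes the phantomjs binary is installed and on PATH.
    from selenium import webdriver

    driver = webdriver.PhantomJS()
    driver.get("https://example.com/")   # placeholder URL
    print(driver.title)
    driver.quit()

You lose the "watch it click around" benefit, of course, which is the whole trade-off.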
The only reason this is, in my opinion, better than other solutions is that you can see the actual webpage it's loading, and the sheer ease of use. You don't even really need any coding experience to get a simple test running.
You can record tests in Selenium or something like Capybara and replay them using something like PhantomJS, which is a headless, browser-like JS execution environment that does things like build the DOM a real page load would: https://github.com/jnicklas/capybara
You can also use Selenium tests with things like https://www.browserstack.com/automate , where (TL;DR) they run your Selenium test on dozens of browser + platform combinations and send you the results, like screenshots and any JavaScript errors. If you're familiar with CI, you can see how powerful this has the potential to be. It's non-trivial but very possible to run your own cluster of Selenium nodes as well; check out the official Selenium Grid: http://www.seleniumhq.org/projects/grid/
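Pointing an existing test at a grid (your own or a hosted one) is mostly a matter of using the remote driver; the hub URL and capabilities below are placeholders:

    # Sketch: run a Selenium test against a Grid hub instead of a local browser.
    # The hub URL and capabilities are placeholders, not a real endpoint.
    from selenium import webdriver

    driver = webdriver.Remote(
        command_executor="http://localhost:4444/wd/hub",
        desired_capabilities={"browserName": "firefox", "platform": "ANY"},
    )
    driver.get("https://example.com/")
    assert "Example" in driver.title
    driver.quit()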
You can use my project https://github.com/machinepublishers/jbrowserdriver which can run either headless or (what's the opposite of headless?) with a full GUI. It runs on Java using Java's built-in browser and takes a second to warm up when an instance is created, but after that you can reuse the existing browser, since I added a reset() API. Performance is comparable to desktop browsers (slightly slower), and every action blocks until AJAX page loads finish.
The only reason to use Selenium at all these days is that it can run your tests in multiple real browsers. If you don't need to test cross-browser support, then you can use PhantomJS or one of the bazillion other tools mentioned in this thread.
Does this run the JavaScript on the page? I've done quite a bit of scraping with Scrapy, and have had to use PhantomJS in many cases because the static HTML doesn't contain what you're after.
I've had a lot of success scraping websites with Capybara [1]. It's intended for writing acceptance tests of web apps, but it works remarkably well for scraping. It's written in Ruby, but the DSL it provides for interacting with web pages should be pretty understandable to anybody who's programmed before. It also supports multiple browsers, which means you can trade off along these axes:
- Headless vs. Not
- JS support vs. Not
I put a repo together with a sample script [2] for scraping leads off of a website which I will not name, but whose name rhymes with 'help'. It uses the PhantomJS browser for headless JS support. It also includes a Vagrantfile so you can avoid installing all the dependencies on your local machine.
I love PhantomJS or SlimerJS for scraping. Everything else includes extra hassles for cookie management, JavaScript emulation, faking user agents and whatnot. Best to simply use a headless browser. Selenium seems overly complicated, too.
Interesting for unprotected websites, but it's easy to detect and block: no valid JS execution, no valid meta headers, no valid cookies, suspicious behavior...
Selenium is a more "elaborate" solution, but it can still be detected most of the time.
Disclosure: I'm DataDome co-founder.
If you want to detect bad bots and scrapers on your website, don't hesitate to try it out for free and share your feedback with us: https://datadome.co
I realize you have reasons not to answer this question, but out of curiosity, what sorts of things can tip a site off to the fact that it's being scraped by a real browser driven by Selenium?
After a quick look, I'd say that if the page needs JavaScript you use Selenium, otherwise you use this. So this is like Selenium without JavaScript. Am I right?
It doesn't. To scrape (or fake-API) js-only websites you have to either:
- drive a browser (Firefox/Chrome) via the already-mentioned Selenium/WebDriver (potentially hiding the actual browser window in a virtual X display by wrapping the whole thing with xvfb-run; see the sketch below),
- or use one of the webkit-based toolkits: phantomjs [1] or headless horseman [2].
There is also an interesting project that combines the two, i.e. it drives Firefox (or, more precisely, a slightly outdated version of Gecko) to emulate a PhantomJS-compatible API. [3]
phantomjs/slimerjs are pretty popular and even have tools that run on top of them, such as casperjs [4], which is geared more toward automated website testing but can be quite good at scraping or fake-APIing too.
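As a sketch of the "virtual X" option from the first bullet (assuming Xvfb plus the pyvirtualdisplay and selenium packages are installed; the URL is a placeholder):

    # Run a real Firefox inside a virtual X display so no window ever appears.
    from pyvirtualdisplay import Display
    from selenium import webdriver

    display = Display(visible=0, size=(1280, 800))
    display.start()
    try:
        driver = webdriver.Firefox()
        driver.get("https://example.com/")   # placeholder URL
        print(driver.title)
        driver.quit()
    finally:
        display.stop()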
I recently wrote a browser-driven scraper using Nightmare[1], which uses Electron under the hood. Another option for those who prefer python is dryscrape[2], although I haven't tried it.
Same here. I use the Python Selenium bindings against a Selenium server for some speed improvements. It works with Chrome/Firefox/PhantomJS and can inject custom JavaScript into the pages. It still takes about 10 seconds to load, render, and process a page.
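Injecting the JS is a one-liner once you have a driver; the page and the scraped expression here are made up:

    # Sketch of injecting custom JavaScript via Selenium's execute_script.
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get("https://example.com/")   # placeholder URL
    # Run arbitrary JS in the page and get the result back as a Python list.
    links = driver.execute_script(
        "return Array.prototype.map.call(document.querySelectorAll('a'), "
        "function (a) { return a.href; });"
    )
    print(links)
    driver.quit()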
From what I can tell, having only recently started to use Scrapy, a lot more "magic", shall we say, happens in the background, so long procedures that could be a few hundred lines using bs4/requests/mechanize/etc. can be shrunk down to a lot less. Looking at RoboBrowser, it seems like it will reduce some of the coding effort, but not to the extent that Scrapy does.
I think the main difference is that Scrapy is async - it downloads pages in parallel by default, so it is more efficient. But async APIs can be harder to use - you need callbacks or generators everywhere - so sync packages (like RoboBrowser) can be easier to get started with.
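A minimal spider shows the callback style; the start URL and selectors are placeholders:

    # Minimal Scrapy spider sketch illustrating the callback/async style.
    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"
        start_urls = ["https://example.com/"]   # placeholder

        def parse(self, response):
            # Scrapy downloads pages concurrently and calls back into parse().
            for title in response.xpath("//h2/text()").extract():
                yield {"title": title}
            for href in response.xpath("//a/@href").extract():
                yield scrapy.Request(response.urljoin(href), callback=self.parse)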
Hmm, I can see why I'd want to use this library over piecing together requests and BS4 myself for every project. I love how simple the examples look.
I have a project I'm working on that will involve scraping many different websites on a daily basis. My only scraping experience so far is using cheerio[0] to scrape a single page with a 1,000 row HTML table. Should I start with something BS-based like this or should I jump straight into Scrapy? Or are there any other alternatives I should try?
I've used RoboBrowser for a project where I needed to log in to a website and subsequently access pages as a logged-in user. It worked well and I like the API. For "simple" scrapers that require authentication or some form of user interaction it's a good tool. If I need to scrape many pages from a site as fast as possible, I'd probably go for Scrapy, though.
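The login flow was roughly this (the URL, form id and field names here are made up; a real site will differ):

    # RoboBrowser login sketch; URL, form id and field names are hypothetical.
    from robobrowser import RoboBrowser

    browser = RoboBrowser(history=True, parser="html.parser")
    browser.open("https://example.com/login")

    form = browser.get_form(id="login-form")   # or get_form(action=...)
    form["username"].value = "me@example.com"
    form["password"].value = "secret"
    browser.submit_form(form)

    # The session cookies stick around, so later pages see you as logged in.
    browser.open("https://example.com/account")
    print(browser.select("h1"))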
It's been years since I've used it, but I think cntlm can do this. Point your Scrapy code at the cntlm instance, and it should handle all of the NTLM headers for you.
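If it helps, on the Scrapy side that's just a matter of routing requests through the local proxy; 3128 is cntlm's usual default port, so adjust to whatever your cntlm.conf listens on:

    # Sketch: send Scrapy requests through a local cntlm proxy.
    import scrapy

    class ProxiedSpider(scrapy.Spider):
        name = "proxied"

        def start_requests(self):
            yield scrapy.Request(
                "https://example.com/",                    # placeholder URL
                meta={"proxy": "http://127.0.0.1:3128"},   # picked up by HttpProxyMiddleware
                callback=self.parse,
            )

        def parse(self, response):
            self.logger.info("Fetched %s via the proxy", response.url)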
Just curious... what is everyone using scrapers for?
I've done a lot of work scraping various sites and I can tell you this: basing any product on your ability to aggregate data via scraping will not work in the long run.
Eventually you will be asked not to scrape and then you'll get sued if you don't stop.
Case law is not in your favor here. See Craigslist v. 3Taps.
I had a quick look into the repository and unfortunately, it doesn't support WebSockets. Does anyone know of a browser automation library/framework that does support WebSockets?
I know, I read it. It's for "browsing the web without a standalone web browser," and I'm sure that if that was something I had needed, I would have said "Oh! How lovely!" But, since I didn't have that need already, I'm not clear why someone would want that. And I'd like to know! So could you give me a couple of practical use cases? "User stories," if you're into that?
* Let's say you are Google and you want to test whether the site is working correctly every day. You could code up a Python script that opens www.google.com, searches for "facebook", and makes sure that the first result points to www.facebook.com. This script can be configured to run every day, and if someone accidentally pushes an update to the site that causes www.facebook.com to stop showing up as the top result, the script automatically reverts the site back to its original state. This means users continue to get the best search results even if an engineer made a mistake with the ranking algorithm.
* Let's say you are eBay and you want to make sure that the prices for products on your site are competitive with those at Amazon. You can code up a Python script which searches for some products that customers regularly buy, like an iPhone, and extracts the lowest price offered at Amazon. It can then compare that with the lowest price offered for an iPhone on eBay. If eBay's lowest price is much higher than Amazon's, you can offer a discount. This convinces customers that they are getting competitive offers from eBay and stops them from writing off eBay when they want to shop online.
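A toy version of that second idea might look like this; the URLs and the .price selector are entirely made up:

    # Hypothetical price-check sketch; URLs and selectors are placeholders.
    import requests
    from bs4 import BeautifulSoup

    def lowest_price(search_url):
        html = requests.get(search_url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        prices = [float(tag.text.strip("$")) for tag in soup.select(".price")]
        return min(prices) if prices else None

    ours = lowest_price("https://shop.example.com/search?q=iphone")
    theirs = lowest_price("https://competitor.example.com/s?k=iphone")
    if ours and theirs and ours > theirs:
        print("Consider a discount: ours=%.2f vs theirs=%.2f" % (ours, theirs))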
A dead simple example - scrape data from a webpage that doesn't have an API. You could go down the wrong route of trying to parse the HTML yourself and end up implementing a lot of logic manually, OR you could use this wonderful library.
automation: go to this page, fill in the form, push submit, receive the result, process the result, send the processed result to another program for further analysis, and finally emit an alert when attention is needed.
[1] http://search.cpan.org/~ether/WWW-Mechanize-1.75/lib/WWW/Mec...
[2] https://pypi.python.org/pypi/mechanize/