I love ScrapingHub (and use them) but these tips go completely against my own experience.
Whenever I've tried to extract data like that inside spiders, I would invariably (and 50,000 URLs later) come to the realization that my .parse() code did not cover some weird edge case on the scraped resource and that all the data extracted was now basically untrustworthy and worthless. How do you re-run all that with more robust logic? Restart from URL #1.
The only solution I've found is to completely decouple scraping from parsing: parse() captures the URL, the response body, and the request and response headers, and then runs off with the loot.
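Roughly what I mean, as a minimal sketch (the start URL and field names are just placeholders, and it assumes a reasonably recent Scrapy):

    import scrapy

    class RawCaptureSpider(scrapy.Spider):
        name = "raw_capture"
        start_urls = ["https://example.com/"]  # placeholder

        def parse(self, response):
            # Grab the loot only; all the real parsing happens offline, later.
            yield {
                "url": response.url,
                "status": response.status,
                "request_headers": response.request.headers.to_unicode_dict(),
                "response_headers": response.headers.to_unicode_dict(),
                "body": response.text,
            }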
Once you've secured it though, these libraries look great.
PS: If you haven't used ScrapingHub you definitely should give it a try; they let you use their awesome & finely-tuned infrastructure completely for free. One of my first spiders crawled 180,000 pages and extracted 50,000 items for $0.
> code did not cover some weird edge case on the scraped resource and that all data extracted was now basically untrustworthy and worthless.
Your data should not be worthless just because you don't catch some edge cases early. Sure, there are always edge cases, but the best way to handle them is to have proper validation logic in Scrapy pipelines: if an item is missing required fields, or you get invalid values (e.g. prices as a sequence of characters without digits), you should detect that immediately and not after 50k URLs. The rule of thumb is "never trust data from the internet", so always validate it carefully.
If you have validation and still encounter edge cases, you can be sure they are actual weird outliers that you can either choose to ignore or somehow try to force into your model of the content.
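For instance, a bare-bones item pipeline along these lines (the required fields are just an illustration) drops anything suspicious the moment it is scraped:

    from scrapy.exceptions import DropItem

    REQUIRED_FIELDS = ["title", "price"]  # hypothetical fields for the example

    class ValidationPipeline(object):
        def process_item(self, item, spider):
            # Reject items that are missing required fields.
            for field in REQUIRED_FIELDS:
                if not item.get(field):
                    raise DropItem("missing field %r in %r" % (field, item))
            # Reject prices that contain no digits at all.
            if not any(ch.isdigit() for ch in item["price"]):
                raise DropItem("invalid price %r" % item["price"])
            return item

Hook it up via ITEM_PIPELINES in settings.py and bad items get dropped (and logged) as soon as they appear, not 50k URLs later.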
Hmm, I'll have to investigate that, any tips for libraries to use for validation that tie well into scrapy?
What do you do if you discover that your parsing logic needs to be changed after you've scraped a few thousand items? Re-run your spiders on the URLs that raised errors?
The very first thing I do with every scraping project is enable the HttpCacheMiddleware[0]. After downloading a page once, subsequent runs will automatically pull it from the local cache. This makes it way faster to experiment, and doesn't increase the burden on the website.
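In settings.py it's just a couple of lines (these are the standard cache settings; tune to taste):

    # settings.py
    HTTPCACHE_ENABLED = True
    HTTPCACHE_DIR = "httpcache"       # stored under the project's .scrapy directory
    HTTPCACHE_EXPIRATION_SECS = 0     # 0 means cached responses never expire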
We actually do what you describe as well sometimes. In particular when we scrape sites with robust bot counter-measures to save on Crawlera [1] usage, or on crawls that take long enough that there's a genuine possibility that the site might change before you're done.
On the one hand, there's no shortage of users who want to crawl popular sites to monitor e.g. search engine rankings or prices. Which is kind of shady in some sense, or not -- when there's no API there's no other way...
On the other hand, there are also areas of the web where crawlers are simply not welcome. For instance, DARPA uses a number of our technologies to monitor the dark web for criminal activities:
As an "early" programmer playing with web scraping with the Nokogiri gem, I've been wondering about this aspect (although haven't encountered it yet).
Are there legal implications to scraping a site that actively tries to prevent bots from scraping it? I mean, if the data is publicly accessible on the web, could they go after you?
I don't plan on doing this for any malicious reasons or anything, and like I said, I haven't encountered it yet. Just having the "what if" thought of what my legal risks might be if I'm playing around with this and whether a site could come after me.
> Are there legal implications to scraping a site that actively tries to prevent bots from scraping it? I mean, if the data is publicly accessible on the web, could they go after you?
When we do projects, the baseline is: if Google can see it, we can too. So from a legal standpoint, if Google is covered, so are we.
From a legal standpoint firms do go after web scrapers. And lose more often than not. The exception is when you're logged in when you crawl. In that case you've implicitly accepted the terms of use. Some companies aggressively sue when you're logged in while scraping, so it's best to stay on the safe side. Further reading on the topic:
Microdata is indeed awesome, but it's not always there. One website had location tags for me to extract on 97% of its pages, except for a few where the <meta> tags were just missing. I wound up using AlchemyAPI's entity extraction to try to get the location out of the text instead.
Thanks for the blogpost! I actually did not know about any of the 3 libraries and am gonna start using them.
Storage is cheap relative to hammering a server (or working around the bounds of rate limiting), so always save what you collect, then re-parse it as many times as you like while you play around with getting the extraction right.
That's my preferred design too. The first job of any external data capture process is to capture the full-fidelity source data. Everything else belongs in a follow-on job.
Extremely well designed framework. It can cover more than 90% of use cases in my opinion. I'm currently working on a project written in Scala that requires a lot of scraping, and I feel really guilty that I'm not using Scrapy :(
I think you can still combine the two. For example, Scrapy can sit behind a service/server to which you send a request (with the same args as if you were running it as a script, plus a callback URL), and after the items get collected Scrapy can call your callback URL, sending all items in JSON format to your Scala app. Or, if you want to be sure to avoid memory issues, you can send each item to the Scala app as it gets collected. Basically, the idea is to wrap Scrapy spiders with web service features -- then you can use them in combination with any other technology. Or you can use Scrapy Cloud to run your spiders at http://scrapinghub.com/.
In the project I work on we do have the usual periodic crawls and use ScrapyRT to let the frontend trigger realtime scrapes of specific items, all of this using the same spider code.
Edit: Worth noting that we trigger the realtime scrapes via AMQP.
> Or if you want to avoid memory issues for sure, you can send each item to Scala app as it gets collected.
I've done something similar: you can just add a pipeline (item pipeline? not sure of the right terminology) which posts the data off somewhere else. You can also store the full item logs in S3 so you can go back occasionally and check nothing has been missed. Works an absolute treat.
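Something in the spirit of this sketch (the endpoint URL is made up and error handling is omitted; it also assumes the requests library, and the call is blocking, which is fine at low volumes):

    import requests

    class PostItemPipeline(object):
        ENDPOINT = "https://example.com/items"  # hypothetical callback URL

        def process_item(self, item, spider):
            # Ship each item off to the external service as soon as it's scraped.
            requests.post(self.ENDPOINT, json=dict(item), timeout=10)
            return item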
I might have a look into that option actually. I'm planning on building a pseudo-framework on top of Akka, so it might make sense to simply communicate with a Scrapy app and handle the results.
Yay! Another article on scrapy. I'm just getting started and my first goal is to scrape a tedious web-based management console that I can't get API access to, and automate some tasks.
Very glad to learn about this site Scraping Hub. Keep the war stories coming. It's technologies like these that brighten up our otherwise drab tech careers and help some of us make it through the day.
I have been using this framework for more than three years and have seen how it has evolved and made scraping so easy. The Portia project is also awesome. I have customised Scrapy for almost all my cases, like having a single spider for multiple sites and providing the rules as JSON. I think it is highly scalable with a bit of tweaking, and Scrapy lets you do that very easily.
What would you recommend for people who need to operate a cluster of scrapyd instances? Are there any recommended ways of managing the distribution of tasks to multiple scrapyd instances?
I have looked at scrapyd-cluster, but I would prefer not to add a Zookeeper cluster to my stack. Currently I'm thinking of modifying scrapyd-cluster so that it uses AMQP (via Celery) to handle task distribution/retries/etc.
I appreciate this conflicts with the scrapinghub business model, so any tips you can offer would be greatly appreciated :-)
Hi, mryan! I'm the core Frontera developer. The precise answer heavily depends on your use case (what a "task" is, scalability requirements, data flow), so please ask your question in the Frontera Google group, and we will try to address it directly.
First, you could try making use of Frontera; here are the different distribution models it provides out of the box: http://frontera.readthedocs.org/en/latest/topics/run-modes.h.... Frontera is a web crawling framework made at Scrapinghub, providing a crawl frontier and scaling/distribution capabilities. Along with a flexible queue and partitioning design, you also get document metadata storage (HBase or an RDBMS of your choice) with a simple revisiting mechanism.
Second, we have a simple Redis-based solution for scaling spiders: https://github.com/rolando/scrapy-redis. Its only dependency is Redis, so it's easy to get started quickly, but it has only one queue shared between spiders, hard-coded partitioning, and scalability limited by Redis.
You mentioned scrapy-cluster (not scrapyd-cluster, probably). It provides a more sophisticated distribution model, allowing you to separate crawls with a jobs concept within the same service, maintains separate per-spider queues (allowing you to crawl politely -- I hope you plan to do so? :), and forces you to use its prioritization model. It also allows you to use spiders of different types sharing the same Redis instance, and to prioritize requests at the cluster level. BTW, I haven't found any dependencies on Zookeeper.
None of these solutions provides provisioning out of the box. If a spider gets killed by the OOM killer, or consumes too many resources (open file descriptors, memory), you have to take care of that yourself. You could use supervisord, or upstart, or some custom process management solution. It all depends on your monitoring requirements.
Sounds like I have some more research to do - thanks for the detailed response.
Btw, you were right, I meant scrapy-cluster and not scrapyd-cluster: https://github.com/istresearch/scrapy-cluster. There is a requirement on Zookeeper/Kafka, which is the main showstopper for me.
Here's a first one: what are the best ways to detect changes in HTML sources with Scrapy, which lead to missing data in automated systems that need to be fed?
Well, missing data can result from problems at several different levels:
1) site changes caused the items that were scraped to be incomplete (missing fields) -- for this, one approach is to use an Item Validation Pipeline in Scrapy, perhaps using a JSON schema or something similar, logging errors or rejecting an item if it doesn't pass the validation.
2) site changes caused the scraping of the items itself to fail: one solution is to store the sources and monitor the spider errors -- when there are errors, you can rescrape from the stored sources (it can get a bit expensive to store sources for big crawls). Scrapy doesn't have a complete solution for this out of the box; you have to build your own. You could use the HTTP cache mechanism and build a custom cache policy: http://doc.scrapy.org/en/latest/topics/downloader-middleware...
3) site changes altered the navigation structure, and the pages to be scraped were never reached: this is the worst one. It's similar to the previous case, but it's one that you want to detect earlier -- saving the sources doesn't help much, since it happens early in the crawl, so you want to be monitoring for it.
One good practice is to split the crawl in two: one spider does the navigation and pushes the links of the pages to be scraped into a queue or something, and another spider reads the URLs from that queue and just scrapes the data.
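A rough sketch of that split, using Redis as the queue (the selectors, queue name and URLs are made up, and it assumes a reasonably recent Scrapy plus the redis package):

    import redis
    import scrapy

    r = redis.StrictRedis()  # assumes a local Redis instance

    class NavigationSpider(scrapy.Spider):
        """Walks the site navigation and only queues up item URLs."""
        name = "navigator"
        start_urls = ["https://example.com/catalog"]  # placeholder

        def parse(self, response):
            for href in response.css("a.item::attr(href)").getall():
                r.lpush("pages_to_scrape", response.urljoin(href))
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, self.parse)

    class ItemSpider(scrapy.Spider):
        """Pops queued URLs and only does the data extraction."""
        name = "item_scraper"

        def start_requests(self):
            # One-shot drain of the queue; a real setup would poll or block.
            while True:
                url = r.rpop("pages_to_scrape")
                if url is None:
                    break
                yield scrapy.Request(url.decode(), callback=self.parse)

        def parse(self, response):
            yield {"url": response.url, "title": response.css("title::text").get()}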
Hey, not sure if I understood what you mean. Did you mean:
1) detect pages that had changed since the last crawl, to avoid recrawling pages that hadn't changed?
2) detect pages that have changed their structure, breaking the spider that crawls them?
1) detect pages that had changed since the last crawl, to avoid recrawling pages that hadn't changed?
You could use the deltafetch[1] middleware. It ignores requests to pages with items extracted in previous crawls.
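Assuming the standalone scrapy-deltafetch package, enabling it is roughly two settings (100 is just the commonly used middleware priority):

    # settings.py
    SPIDER_MIDDLEWARES = {
        "scrapy_deltafetch.DeltaFetch": 100,
    }
    DELTAFETCH_ENABLED = True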
> 2) detect pages that have changed their structure, breaking the spider that crawls them?
This is a tough one, since most of the spiders are heavily based on the HTML structure. You could use Spidermon [2] to monitor your spiders. It's available as an addon in the Scrapy Cloud platform [3], and there are plans to open source it in the near future. Also, dealing automatically with pages that change their structure is in the roadmap for Portia [4].
> 1) detect pages that had changed since the last crawl, to avoid recrawling pages that hadn't changed?
Usually web clients use HTTP ETags (https://en.wikipedia.org/wiki/HTTP_ETag) for that, as far as I'm aware. If a web app/server lacks that ability, you can compute your own hash and check it yourself, instead of handling that condition at the network layer.
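A rough sketch of the do-it-yourself version, assuming you keep a mapping of URL -> last seen hash somewhere (here just an in-memory dict):

    import hashlib

    seen_hashes = {}  # in practice this would live in a database or key-value store

    def body_changed(url, body_bytes):
        """Return True if the page body differs from what we saw last time."""
        digest = hashlib.sha1(body_bytes).hexdigest()
        if seen_hashes.get(url) == digest:
            return False
        seen_hashes[url] = digest
        return True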
That looks pretty cool. I was planning on writing something similar in Scala, but I'm not sure if I have enough experience with the language to get it done.
If you're lazy (and if you're into FP you must be hehehe), just use that Clojure library. Calling Clojure code from Java is easy, and I'm sure it's not much harder from Scala.
Man, I feel old. Does anyone remember learning web scraping from one section of Fravia's site? Ever try to move forward from that to write a fully fledged search engine? These memories are from 15 years ago... quite amusing how much hasn't changed. In hindsight it was probably easier back then due to the lack of JS-reliant pages, less awareness of automation, and fewer scraper-detection algorithms.
What I need is an API scraper; Scrapy seems to be mostly for HTML. I know how to look at network requests in Chrome dev tools and at the JS functions to understand the shape of a REST API, so I need something to plan the exploration of the argument space. For example, if you want to scrape Airbnb, you look at their API and find there is a REST call taking a lat/long box. I need something to automatically explore an area: if the API only gives the first 50 results and a call hits that number, it should schedule 4 calls with boxes of half the size, and so on. If the request has cursors, you should be able to tell the scraper how to follow them. I don't know what the best tool for that is.
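The box-splitting part itself is easy enough to hand-roll; something like this sketch (fetch_box and the 50-result cap are placeholders for whatever the real API does) -- it's the scheduling, rate-limiting and cursor-following around it that I'd like a tool for:

    RESULT_CAP = 50  # hypothetical per-request limit

    def fetch_box(lat_min, lat_max, lng_min, lng_max):
        """Placeholder for the real REST call; returns a list of results."""
        raise NotImplementedError

    def scrape_area(lat_min, lat_max, lng_min, lng_max):
        results = fetch_box(lat_min, lat_max, lng_min, lng_max)
        if len(results) < RESULT_CAP:
            return results
        # Hit the cap: split the box into four quadrants and recurse.
        lat_mid = (lat_min + lat_max) / 2.0
        lng_mid = (lng_min + lng_max) / 2.0
        collected = []
        for lo_lat, hi_lat in ((lat_min, lat_mid), (lat_mid, lat_max)):
            for lo_lng, hi_lng in ((lng_min, lng_mid), (lng_mid, lng_max)):
                collected.extend(scrape_area(lo_lat, hi_lat, lo_lng, hi_lng))
        return collected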
Are you talking about something that would map out the API? The closest analogy to what I understood from your text is http://nodered.org/ in terms of design; I haven't seen anything like this for scraping yet. Definitely an interesting product to be made here if it doesn't exist yet.
I do like this article for its discussion of these libraries. On another note...
Am I the only one who dislikes Scrapy? I think it's basically the iOS of scraping tools: It's incredibly easy to setup and use, and then as soon as you need to do something even minutely non-standard it reveals itself to be frustratingly inflexible.
I do a lot of scraping specific pages and often have to auth, form-fill, refresh, recurse, use a custom SSL/TLS adapter, etc., in order to get what I'm after. I'm sure Scrapy would be great if I just had a giant queue of GET requests. Also, don't get me started on the Reactor.
We use Scrapy for a few projects and it is really, really good. They have a commercial side to them, which is fine, but for anyone doing crawling/scraping I'd strongly recommend it. Good article also.
Great to see Scrapy getting some love. It's really well done and it scales well (used it to scrape ~2m job posts from ATS & government job banks in 2-3 hours).
Here are slides for a talk I gave about an interesting approach to scraping in Clojure [1]. This framework works really well when you have hierarchical data that's a few pages deep. Another highlight is the decoupling of parsing and downloading pages.
I love scraping the web and producing structured data from web pages. The only downside of XPath or similar extraction approaches is the need for constant maintenance. If I had enough knowledge about machine learning, I would like to write a framework that analyses similar pages and finds the structure of the data without being told which parts of the page should be extracted.
Quick question: could you use Scrapy to scrape specific individual pages from thousands (or millions) of sites, or would you be better off using a search engine crawler like Nutch for this? I want to crawl the first page of a number of specific sites and was looking into the technologies for this.
Yes. Each spider in Scrapy has a "start_urls" parameter/method, so you'd just need to fill that up with all your domains and make sure the spider has the freedom to crawl across domains. Each URL would be accessed, you'd do whatever you want to do, and when the spider has visited them all, it would quit.
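Something along these lines, as a sketch (sites.txt is a hypothetical file with one URL per line; reasonably recent Scrapy assumed):

    import scrapy

    # Hypothetical file listing one site URL per line.
    with open("sites.txt") as f:
        SITES = [line.strip() for line in f if line.strip()]

    class FrontPageSpider(scrapy.Spider):
        name = "front_pages"
        start_urls = SITES  # no allowed_domains, so the spider is free to hit them all

        def parse(self, response):
            # Only the landing page of each site is fetched; no link following.
            yield {"url": response.url, "title": response.css("title::text").get()}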
> would you be better off using a search engine crawler like Nutch for this?
Fwiw we recently met a client who tried Scrapy + Frontera vs Nutch, and their assessment was that Scrapy + Frontera is twice as fast. Here's a deck on Frontera FYI:
What kinds of careers often deal with web scraping and doing these sorts of tasks? I am really interested in the field and some of you seem to be real experts in this field.
- developers who want to build a data-based product (e.g. a travel agency website that finds the best deals from airline companies);
- lawyers can use it to structure the data from Judgments and Laws (so that they are able to query the data for things like: which judges have interpreted this law in their judgments) (more on this: http://blog.scrapinghub.com/2016/01/13/vizlegal-rise-of-mach...)
- (data-)journalists who work on investigative data-based articles (they use it to gather the data to build visualizations, infographics, and also to support their arguments).
- real estate agencies can use it to grab the prices of their competitors, or to get a map of what people are selling and of the areas where there is more demand.
- large companies that want to track their online reputation can scrape forums, blogs, etc, for further analysis.
- online retailers that want to keep their prices balanced with their competitors' can scrape the competitors' websites, collecting prices from them.
Nokogiri is a tag soup parser. Scrapy is a web scraping framework.
In addition to tag soup parsing, Scrapy handles a slew of things such as text encoding problems, retrying urls that fail owing to network problems (if you wish), dispatching requests across multiple spiders with a shared crawl frontier (see Frontera), shared code between similar spiders using middlewares and pipelines, and what have you.
There's a Lisp joke that goes something like every sufficiently complex piece of software in C is a slow, buggy, poorly implemented version of Lisp. Very much the same could be said about Scrapy and web scraping projects. :-)
Not that we're aware of. Most rubyists use request and a tag soup parser, without benefiting from any type of parallelization that you get from Scrapy.