Python web scraping resources (jakeaustwick.me)
158 points by dchuk on Aug 4, 2014 | 33 comments



A nice write-up. At a previous company I built a solution (was forced into Java, but ultimately the same process...) that used many of these techniques. Some suggested next steps/additional enhancements if you need to do this repeatedly and at scale:

Implement global throttling on a per-domain basis.
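
A minimal sketch of what per-domain throttling can look like (the delay value and dict-based bookkeeping are just illustrative):

    import time
    from urlparse import urlparse  # urllib.parse on Python 3

    MIN_DELAY = 2.0  # seconds between requests to the same domain (arbitrary)
    last_hit = {}    # domain -> timestamp of the last request we sent it

    def throttle(url):
        """Sleep just long enough that no domain is hit faster than MIN_DELAY."""
        domain = urlparse(url).netloc
        wait = MIN_DELAY - (time.time() - last_hit.get(domain, 0))
        if wait > 0:
            time.sleep(wait)
        last_hit[domain] = time.time()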

Consider some abstraction. I implemented an abstract fetcher (with a number of concrete fetchers that were runtime-selectable), and an abstract/concrete parser. Compose a Scraper with these two things. Allow for a runtime switch that will determine which fetcher to use (javascript enabled, straight requests, etc.). If you want to get really fancy, in your database of urls, you can flag urls that need to use a heavier, full-fledged browser.
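
Roughly what that composition can look like (class names and the parser interface are invented for illustration):

    import requests
    from selenium import webdriver

    class RequestsFetcher(object):
        """Plain HTTP fetcher for pages that don't need javascript."""
        def fetch(self, url):
            return requests.get(url).text

    class SeleniumFetcher(object):
        """Heavier fetcher that drives a real browser for javascript-heavy pages."""
        def fetch(self, url):
            driver = webdriver.Chrome()
            try:
                driver.get(url)
                return driver.page_source
            finally:
                driver.quit()

    class Scraper(object):
        """Composed from a runtime-selected fetcher and a parser."""
        def __init__(self, fetcher, parser):
            self.fetcher = fetcher
            self.parser = parser

        def scrape(self, url):
            return self.parser.parse(self.fetcher.fetch(url))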

For the fetcher, use the Selenium bindings. We tested PhantomJS and Chrome, and Chrome outperformed Phantom. It might have been the Java bindings (GhostDriver), but w/e, something you just have to test for yourself. Once we settled on Chrome, I built a Chrome plugin to block ads and other unrelated calls; those add LOADS of time. It's pretty tricky, but you can inject a list of well-crafted regexes and it drops initial load times dramatically.

For the parser, you may want to consider a fallback system. Often the particular piece of data you want (say, a title) can be found in a handful of places on the page. It will make your parsing much more reliable.
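
A fallback chain can be as simple as trying selectors in order until one matches (a sketch with lxml; the selectors are made up):

    import lxml.html

    # Ordered fallbacks for a single field; real rules would be site-specific.
    TITLE_XPATHS = [
        '//h1[@class="product-title"]/text()',
        '//meta[@property="og:title"]/@content',
        '//title/text()',
    ]

    def extract_first(html, xpaths):
        """Return the first non-empty match from an ordered list of fallback XPaths."""
        tree = lxml.html.fromstring(html)
        for xpath in xpaths:
            result = tree.xpath(xpath)
            if result:
                return result[0].strip()
        return None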

Compose a 'bot runner' from the Scraper. We had json documents that described the fields we were abstracting and all the fallback rules used to locate the needed data. Lastly, the bot runner can be expanded to include things like navigation and other fancier tricks.
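
For a rough flavour of what such a JSON document could look like (the field names and rule format here are invented for illustration, not the actual schema):

    {
        "field": "title",
        "fallbacks": [
            {"type": "xpath", "rule": "//h1[@class='product-title']/text()"},
            {"type": "xpath", "rule": "//meta[@property='og:title']/@content"},
            {"type": "xpath", "rule": "//title/text()"}
        ]
    }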

If you go for broke, build a system for generating bots, think chrome plugin.

Don't forget, pruning dead URLs is a tricky little problem but an important one.

To scale this whole operation linearly, we used a queue (redis at first, eventually kafka) and storm. Storm allowed us to arbitrarily expand and contract our bot runners.

Scraping is a problem that just about everyone encounters, and a lot of the most standard solutions seem to really fall short. Your article is an excellent start.


Curious, what was the catalyst for the switch from Redis to Kafka? Reliability that a given message was received and processed, and the ability to replay it to a different consumer in the event of failure?


Mostly just scale. Keeping your queue in memory isn't necessarily a requirement unless you have a lambda architecture or a realtime requirement (which we started with). For our larger operations, kafka (which is disk-backed, distributes with a little more ease, and is cheaper) was a nice option. For reliability, we were using some of Storm's primitives as well as the mechanisms inside qless, the queuing library we were using on top of redis. https://github.com/ChannelIQ/qless-java


I would highly recommend Scrapy if you plan on doing any serious scraping: http://scrapy.org/.


I wouldn't. I used this for a project and then quickly regretted it for the following reasons:

* XPath selectors just plain suck.

* The item pipeline is way too straitjacketed. You have to do all sorts of fucking around with settings in order to make the sequence of events work in the (programmatic) way you want it because the framework developers 'assumed' there's only one way you'd really want to do it.

* Scrapy does not play well with other projects. You can integrate it with django if you want a minimal web UI but it's a pain to do so.

* Tons of useless features. Telnet console? wtf?

* It's assumed that the 'end' of the pipeline will be some sort of serialization - to database, xml, json or something. Actually I usually just want to feed into the end of another project without any kind of serialization using plain old python objects. If I want serialization I probably want to do it myself.

* For some reason DjangoItem didn't really work (although by the time I tried to get it to work I'd kind of given up).

IMO this is a classic case of "framework that should have been a library".

Here's what I used instead after scrapping scrapy:

* mechanize - to mimic a web browser. I used requests sometimes too, but it doesn't really excel at pretending to be a web browser, so for that reason I usually used mechanize as a drop-in replacement.

* celery - to schedule the crawling / spin off multiple crawlers / rate-limiting / etc.

* pyquery - because xpath selectors suck and jquery selectors are better.

* python generators - to do pipelining.
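
The generator pipelining bit looks roughly like this (using requests and pyquery for brevity; the URL and selector are placeholders):

    import requests
    from pyquery import PyQuery

    def fetch(urls):
        # Stage 1: fetch each URL lazily.
        for url in urls:
            yield url, requests.get(url).text

    def parse(pages):
        # Stage 2: parse with jquery-style selectors.
        for url, html in pages:
            doc = PyQuery(html)
            yield {'url': url, 'title': doc('title').text()}

    # Nothing runs until the chained generators are consumed.
    for item in parse(fetch(['http://example.com/'])):
        print(item)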

I'm largely happy with the outcome. The code is less straitjacketed, easier to understand and easier to integrate into other projects if necessary (you don't have the headache of trying to get two frameworks to play together nicely).


I agree. XPath and friends are too low-level. Scrapy is nice for controlling HTTP/asset downloads, though in fact even this sucks. If you're serious about scraping you must go with a headless approach, which also means Python-only doesn't work.


Hey,

Good feedback, thanks!

> XPath selectors just plain suck.

Scrapy supports CSS selectors.
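
For example, something along these lines (a standalone sketch using Selector directly):

    from scrapy.selector import Selector

    html = '<html><head><title>Example</title></head><body><a href="/x">x</a></body></html>'
    sel = Selector(text=html)
    print(sel.css('title::text').extract())    # ['Example']
    print(sel.css('a::attr(href)').extract())  # ['/x']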

> The item pipeline is way too straitjacketed. You have to do all sorts of fucking around with settings in order to make the sequence of events work in the (programmatic) way you want it because the framework developers 'assumed' there's only one way you'd really want to do it.

Could you please give an example?

> Scrapy does not play well with other projects. You can integrate it with django if you want a minimal web UI but it's a pain to do so.

This is true. But it is a pain to integrate any event-loop based app with another app that is not event-loop based. It is also true that Scrapy is not easy to plug into an existing event loop (e.g. if you already have a twisted or tornado-based service), but that should be fixed soon.

> Tons of useless features. Telnet console? wtf?

Telnet console is a Twisted feature; it came almost for free, and it is useful to debug long-running spiders (which can run hours and days).

> It's assumed that the 'end' of the pipeline will be some sort of serialization - to database, xml, json or something. Actually I usually just want to feed into the end of another project without any kind of serialization using plain old python objects. If I want serialization I probably want to do it myself.

If you don't want serialization then you want a single process both for crawling and for other tasks. This rules out synchronous solutions - you can't e.g. integrate a crawler with django efficiently without serialization. If you just want to do some post-processing then I don't see why putting code in a Scrapy spider is worse than putting it in another script and calling Scrapy from that script.

> For some reason DjangoItem didn't really work (although by the time I tried to get it to work I'd kind of given up).

This may be true... I don't quite get what it is for :)

> IMO this is a classic case of "framework that should have been a library".

It can't be a library like requests or mechanize for technical reasons - to make crawling efficient Scrapy uses an event loop. It can (and should) be a library for twisted/tornado/asyncio; it is possible to use Scrapy as such a library now, but it is not straightforward; this should (and will) be simplified.

> * mechanize - to mimic a web browser. I used requests sometimes too, but it doesn't really excel at pretending to be a web browser, so for that reason I usually used mechanize as a drop-in replacement.
> * celery - to schedule the crawling / spin off multiple crawlers / rate-limiting / etc.
> * pyquery - because xpath selectors suck and jquery selectors are better.
> * python generators - to do pipelining.

Celery is also not the easiest piece of software. Scrapy is just a single Python process that doesn't require any databases, etc.; Celery requires you to deploy a broker and have a place to store task results; it is also less efficient for IO-bound tasks.


>Scrapy supports CSS selectors.

Still far inferior to JQuery selectors.

>Could you please give an example?

The example I'm thinking of is when I was trying to create a pipeline that would output a skeleton configuration file when you passed one switch and would process and serialize the data parsed when you passed another. It was possible but kludgy.

>But it is a pain to integrate any event-loop based app with another app that is not event-loop based.

That's not where the pain lies. It's more the fact that it has its own weird configuration/setup quirks (e.g. its own settings.py, reliance on environment variables, executables).

>If you don't want serialization then you want a single process both for crawling and for other tasks. This rules out synchronous solutions - you can't e.g. integrate a crawler with django efficiently without serialization.

I don't really want scrapy doing process handling at all. It's not particularly good at it. Celery is much better.

Using other code to do serialization also doesn't necessitate running it on the same process. You can import the django ORM wherever you want and use it to save to the DB. I know you can do that - but, again, kludgy.

>It can't be a library like requests or mechanize for technical reasons - to make crawling efficient Scrapy uses event loop.

I get that. It should have been more like twisted from the outset though. The developers were clearly inspired by django and that led them down a treacherous path.

>It can (and should) be a library for twisted/tornado/asyncio; it is possible to use Scrapy as a such library now, but this is not straightforward; this should (and will) be simplified.

Well, that's good I suppose. I still think that it focuses on bringing together a bunch of mediocre modules for which, individually, you can find much better equivalents. Also, (unlike django) tight, seamless integration between those modules doesn't really gain you much.

>Celery is also not the easiest piece of software.

The problem it is solving (distributed task processing) is not an easy problem. Celery is not simple, but it is GOOD.

>Scrapy is just a single Python process that doesn't require any databases, etc. Celery requires to deploy a broker and have a place to store task results; it is also less efficient for IO-bound tasks.

A) You can use redis as a broker and that's trivial to set up (see the snippet below). I always have a redis available anyway because I always need a cache of some kind (even when crawling!).

B) My crawling tasks are never I/O bound or CPU bound. They're bound by the rate limiting imposed upon me by the websites I'm trying to crawl.

C) I'm usually using celery anyway. I still have to do task processing that DOESN'T involve crawling. Where do I put that code when I'm using scrapy?
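
For (A), the setup really is small; a sketch (the broker URL and rate limit are just examples):

    import requests
    from celery import Celery

    app = Celery('crawler', broker='redis://localhost:6379/0')

    @app.task(rate_limit='10/m')  # Celery's built-in rate limiting, per worker
    def crawl(url):
        return requests.get(url).status_code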


> The example I'm thinking of is when I was trying to create a pipeline that would output a skeleton configuration file when you passed one switch and would process and serialize the data parsed when you passed another. It was possible but kludgy.

I don't get it - how is creating a configuration file related to the processing of the items, why would you do it in items pipeline?

> It's more the fact that it has its own weird configuration/setup quirks (e.g. its own settings.py, reliance on environment variables, executables).

It is possible to create a Crawler from any settings object (not just a module), and Scrapy does not rely on executables AFAIK. But this all is poorly documented. Also, there is an ongoing GSoC project to make settings easier and more "official".

> I don't really want scrapy doing process handling at all.

Scrapy doesn't handle processes, it is single-threaded and uses a single process. This means that you can use e.g. a shared in-process state.

> Using other code to do serialization also doesn't necessitate running it on the same process. You can import the django ORM wherever you want and use it to save to the DB. I know you can do that - but, again, kludgy.

You can't move Python objects between processes without serialization. Why is using django ORM kludgy in Scrapy but not in Celery?

> The problem it is solving (distributed task processing) is not an easy problem. Celery is not simple, but it is GOOD.

You don't necessarily need distributed task processing to do web crawling. Celery is a great piece of software, and it is developing nicely, but you always pay for complexity. For example, I faced the following problems when I was using Celery:

* When redis was used as a broker its memory usage was growing infinitely. Lots of debugging, found a reason and a hacky way to overcome it (https://github.com/celery/celery/issues/436). The issue was fixed, but apparently there is still a similar issue when MongoDB is used as a broker.

* Celery stopped processing without anything useful in the logs (and of course Celery's error-sending facilities failed and I didn't have external monitoring) - it turned out a unicode exception was eaten. A couple of days of nightmarish debugging; see https://github.com/celery/celery/issues/92.

* I implemented an email sender using Celery + RabbitMQ once. I think I was sending email text to tasks as parameters. Never do that (just use an MTA :)! When a large batch of emails was sent at once RabbitMQ used all memory, corrupted its database, and dropped the queue; I couldn't find a way to check which emails were sent and which were not. This was 100% my fault, but it shows that a complex setup is not your friend.

Crawling tasks differ - e.g. if you need to crawl many different websites (which is not uncommon) you will almost certainly be IO and CPU limited. Scrapy is not a system for distributed task processing, it is just an event-loop based crawler. I'm not saying your way to solve the problem is wrong; if you already use celery it makes a lot of sense to use it for crawling as well. But I don't agree that going distributed turtles all the way down with celery+redis+DB for storage+... is easier or more efficient than using plain Scrapy. A lot of tasks can be solved by writing a spider, getting a json file with data and doing whatever one wants with it (upload to DB, etc).


This.

The Scrapy tutorial is good if you just want to use scrapy to crawl a site and extract a bunch of information, one time.

If you want to do scraping as a small part of another Python project, then it can be easier just to use Scrapy's HtmlXPathSelector, which is more forgiving than a real XML parser.

    import urllib2
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import TextResponse
    
    url = 'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/'
    my_xpath = '//title/text()'
    
    # fetch the page ourselves, then wrap the body so Scrapy's selector can parse it
    req = urllib2.Request(url, headers={'User-Agent': 'Mozilla or whatever'})
    body = urllib2.urlopen(req).read()
    response = TextResponse(url=url, body=body, encoding='utf-8')
    hxs = HtmlXPathSelector(response)
    result = hxs.select(my_xpath).extract()


HtmlXPathSelector is just a very thin wrapper around lxml; it doesn't add anything parsing-wise. You might as well just use lxml directly if you don't already have scrapy as a dependency.

https://github.com/scrapy/scrapy/tree/master/scrapy/selector
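
A rough equivalent of the snippet above using lxml directly:

    import urllib2
    import lxml.html

    url = 'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/'
    req = urllib2.Request(url, headers={'User-Agent': 'Mozilla or whatever'})
    tree = lxml.html.fromstring(urllib2.urlopen(req).read())
    print(tree.xpath('//title/text()'))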


pyquery's a good alternative too. It's a slightly larger wrapper around lxml that lets you use jquery selectors.


lxml's `cssselect` method is nice for this - I found that with `xpath` and `cssselect` I have no need for anything else. I use cssselect for simple queries, like "a.something" - which would be needlessly verbose in XPath - and xpath for more complex ones, for example when I need access to axes or I want to apply some simple transform to the data before processing it in Python. Worked very well for me.
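
Roughly what that split looks like in practice (the markup and selectors are illustrative):

    import lxml.html

    html = '<div><a class="something" href="/a">first</a><a href="/b">second</a></div>'
    tree = lxml.html.fromstring(html)

    # cssselect for the simple case
    links = tree.cssselect('a.something')
    print([a.get('href') for a in links])  # ['/a']

    # xpath when you need axes or other XPath-only features
    print(tree.xpath('//a[@class="something"]/following-sibling::a/@href'))  # ['/b']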


Doh! Too late to edit or delete my original comment :(

(And I can't downvote my own comment.)


I wish every post/tool out there on scraping covered obeying robots.txt. It's a crap standard, but it's what we've got.

"Just ignore it" is a great way to identify yourself as a crappy netizen.


I'm glad that all of the sites you target want your scraper to access them. The goal in many cases where one would use a scraper is to access information not provided in an API or otherwise encased in HTML. Most of their robots.txt are "User-Agent: *\nDisallow: /\n"


Then we have no right to scrape that content.

Why is there an implied right to scrape?


As a self-taught Python and Ruby programmer, I really appreciate this.


Clean and direct write-up. It reminds me of when I used to work on crawling/scraping. It covers most of the topics you need to know for web crawling.


Very nice, thanks for posting =).

Can people suggest any additional resources/reading on scraping/crawling as well?

I was hoping to experiment with it in GoLang, but there doesn't seem to be much on crawling/scraping with GoLang, except for GoQuery (https://github.com/PuerkitoBio/goquery)


Little distributed web scraper project I created a while back if anyone is interested / needs resources : https://github.com/Diastro/Zeek


Really nice. I was just going to comment on how similar this was to scraping I did during my days working in SEO, and then I saw your username! Long time no talk. What's up man? Sounds like we should get in touch.


In-depth article! I had to learn most of this stuff the hard way; I could have used this a couple of weeks ago :) Proxies nowadays are really cheap. Isn't ignoring robots.txt opening the door to lawsuits? Scraping copyrighted material should be avoided too in my opinion, but I guess that only matters if you get caught :)


I need to scrape an e-commerce site this weekend, and this will be a great resource to keep bookmarked. Thanks.


Check out http://diffbot.com/products/automatic/product/ to do this fully automated.


I'll self-promote too then. https://screenslicer.com


Very cool! Probably one of the first web scrapers I've run across with a front end where you can easily show users / investors etc. what it does. And it seems to work pretty well. I really like your output display as well.


The Python selenium bindings are also nice if you absolutely have to deal with information put together by javascript code.
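
A minimal sketch (the driver choice and URL are placeholders):

    from selenium import webdriver

    driver = webdriver.Firefox()  # or webdriver.Chrome(), webdriver.PhantomJS(), ...
    try:
        driver.get('http://example.com/')
        # page_source is the DOM after javascript has run, not the raw HTML
        html = driver.page_source
    finally:
        driver.quit()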


I actually cover this (very briefly) here: http://jakeaustwick.me/python-web-scraping-resource/#thesite...

I should dedicate a section to it though, will stick that on my to-do list.


I've always opted for Ghost.py rather than selenium, as I've found it uses less memory and is pretty capable. Admittedly my scrapes are normally pretty targeted and not over 10k+ sites.

One query I had from your piece: "When I've found myself in the unfortunate place of getting my proxies banned before on certain sites, they have been more the happy to switch them out for new IP's for me."

If you're getting banned from sites, is it not time to leave them alone? If they don't want your traffic (which the admin/system has judged as too much), should you really be circumventing it? With that and the robots.txt bit (which admittedly you justified), you've got to be careful not to slip into a bit of a grey area with scraping, which people regard suspiciously in the first place.


Can't the Python code just call PhantomJS directly? That's what I did.


Have you thought about using a headless web browser like casperjs?


PhantomJS can be run through Selenium.
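
e.g., assuming the phantomjs binary is on your PATH:

    from selenium import webdriver

    driver = webdriver.PhantomJS()  # talks to phantomjs via ghostdriver
    driver.get('http://example.com/')
    print(driver.title)
    driver.quit()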



