Beating Google With CouchDB, Celery and Whoosh (Part 1) (andrewwilkinson.wordpress.com)
74 points by markokocic on Oct 6, 2011 | 7 comments



http://andrewwilkinson.wordpress.com/2011/09/29/beating-goog... - part two actually contains the interesting content.


Nice post.

For those considering something like this, you might want to consider using scrapy, a Python web crawler, instead of rolling your own crawler.

I remember looking for something like this a year ago and finding that project. I've used it for a few things, and it does a nice job of abstracting away most of the core scraping architecture while leaving room to extend it as necessary. After playing with it for a few weeks, you realize just how easy it is to grab a bunch of data from the web when you need it.
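For a feel of how little code that takes, here's a minimal Scrapy spider sketch. The spider name, start URL, and item fields are placeholders of mine, not anything from the article:

    import scrapy

    class PageSpider(scrapy.Spider):
        name = "pages"
        start_urls = ["http://example.com/"]

        def parse(self, response):
            # Emit the URL and raw body of each page we visit.
            yield {"url": response.url, "body": response.text}

            # Follow every link on the page; Scrapy handles request
            # deduplication and scheduling for us.
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)

Run it with "scrapy runspider spider.py -o pages.json" and you get a JSON dump of everything it crawled.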


I would also recommend Haystack for anyone wanting to take this to the next step. Since Django is already being used, he could swap Whoosh for Solr if needed.
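For anyone unfamiliar with Haystack: the search backend is a setting, so moving from Whoosh to Solr is a configuration change rather than a code change. Roughly, with Haystack 2.x (the paths and URLs here are placeholders):

    # settings.py
    HAYSTACK_CONNECTIONS = {
        "default": {
            "ENGINE": "haystack.backends.whoosh_backend.WhooshEngine",
            "PATH": "/var/search/whoosh_index",
        },
    }

    # Later, to move to Solr, change only the connection:
    # HAYSTACK_CONNECTIONS = {
    #     "default": {
    #         "ENGINE": "haystack.backends.solr_backend.SolrEngine",
    #         "URL": "http://127.0.0.1:8983/solr/default",
    #     },
    # }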


Whoosh is fast enough for small amounts of data, or when searches do not need to happen in real time. But I agree, something like Lucene/Solr or Sphinx would be a better fit for indexing large amounts of content (e.g., the web pages a crawler like this produces).
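To make the trade-off concrete, this is about all it takes to index and search with Whoosh: pure Python, no server to run. The index directory and field names are just illustrative:

    import os

    from whoosh.fields import ID, TEXT, Schema
    from whoosh.index import create_in
    from whoosh.qparser import QueryParser

    # Define a schema for crawled pages and create an on-disk index.
    schema = Schema(url=ID(stored=True, unique=True), body=TEXT)
    if not os.path.exists("indexdir"):
        os.mkdir("indexdir")
    ix = create_in("indexdir", schema)

    # Add a document and commit the write.
    writer = ix.writer()
    writer.add_document(url="http://example.com/", body="hello search world")
    writer.commit()

    # Parse and run a query against the "body" field.
    with ix.searcher() as searcher:
        query = QueryParser("body", ix.schema).parse("search")
        for hit in searcher.search(query):
            print(hit["url"])

That convenience is the appeal; it just doesn't hold up at crawler scale the way a JVM Lucene deployment does.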


Seems like using Cloudant's BigCouch fork, with its built-in Lucene indexing, might be a better choice for this.
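If I remember Cloudant's search API correctly, you define the index with a JavaScript function in a design document and query it over HTTP. A hedged sketch (the database URL, index name, and use of the requests library are my own placeholders):

    import requests

    DB = "https://account.cloudant.com/pages"  # placeholder database URL

    # Design document with a Lucene index function over each page body.
    design_doc = {
        "indexes": {
            "by_body": {
                "index": "function(doc) { if (doc.body) { index('default', doc.body); } }"
            }
        }
    }
    requests.put(DB + "/_design/search", json=design_doc)

    # Query the Lucene index through the _search endpoint.
    resp = requests.get(DB + "/_design/search/_search/by_body",
                        params={"q": "celery"})
    print(resp.json().get("rows"))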


Thanks - I was reading the article thinking "what I'd really want is CouchDB with built-in Lucene indexing and searching" - and there it is! :-)


Unfortunately, Whoosh is too slow for any nontrivial amount of content. You're much better off using Solr or ElasticSearch.
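For comparison, here's the same index-and-search flow against ElasticSearch using the official Python client. The index name and document shape are placeholders:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Index one crawled page, then refresh so it's searchable immediately.
    es.index(index="pages", document={"url": "http://example.com/",
                                      "body": "hello search world"})
    es.indices.refresh(index="pages")

    # Full-text match query against the body field.
    result = es.search(index="pages", query={"match": {"body": "search"}})
    for hit in result["hits"]["hits"]:
        print(hit["_source"]["url"])

Unlike Whoosh, the index lives in a separate server process, so it scales independently of your Python code.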



