Beating Google With CouchDB, Celery and Whoosh (Part 1) (andrewwilkinson.wordpress.com)
74 points by markokocic on Oct 6, 2011 | 7 comments



http://andrewwilkinson.wordpress.com/2011/09/29/beating-goog... - part two actually contains the interesting content.


Nice post.

For those considering something like this, you might want to consider using scrapy, a Python web crawler, instead of rolling your own crawler.

I remember looking for something like this a year ago and finding that project. I've used it for a few things, and it does a nice job of abstracting away most of the core scraping architecture while leaving room to extend it as necessary. After playing with it for a few weeks, you realize just how easy it is to grab a bunch of data from the web when you need it.
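For a feel of how little code that takes, here's a minimal Scrapy spider sketch. The spider name, start URL, and item fields are placeholders of mine, not anything from the article:

    import scrapy

    class PageSpider(scrapy.Spider):
        name = "pages"
        start_urls = ["http://example.com/"]

        def parse(self, response):
            # Emit the URL and raw body of each page we visit.
            yield {"url": response.url, "body": response.text}

            # Follow every link on the page; Scrapy handles request
            # deduplication and scheduling for us.
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)

Run it with "scrapy runspider spider.py -o pages.json" and you get a JSON dump of everything it crawled.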


I would also recommend Haystack for anyone wanting to take this to the next step. Since Django is already being used, he could swap Whoosh for Solr if needed.
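For anyone unfamiliar with Haystack: the search backend is a setting, so moving from Whoosh to Solr is a configuration change rather than a code change. Roughly, with Haystack 2.x (the paths and URLs here are placeholders):

    # settings.py
    HAYSTACK_CONNECTIONS = {
        "default": {
            "ENGINE": "haystack.backends.whoosh_backend.WhooshEngine",
            "PATH": "/var/search/whoosh_index",
        },
    }

    # Later, to move to Solr, change only the connection:
    # HAYSTACK_CONNECTIONS = {
    #     "default": {
    #         "ENGINE": "haystack.backends.solr_backend.SolrEngine",
    #         "URL": "http://127.0.0.1:8983/solr/default",
    #     },
    # }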


Whoosh is fast enough for small amounts of data, or when searches do not need to happen in real time. But I agree, something like Lucene/Solr or Sphinx would be a better fit for indexing large amounts of content (e.g., the web pages a crawler like this produces).
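To make the trade-off concrete, this is about all it takes to index and search with Whoosh: pure Python, no server to run. The index directory and field names are just illustrative:

    import os

    from whoosh.fields import ID, TEXT, Schema
    from whoosh.index import create_in
    from whoosh.qparser import QueryParser

    # Define a schema for crawled pages and create an on-disk index.
    schema = Schema(url=ID(stored=True, unique=True), body=TEXT)
    if not os.path.exists("indexdir"):
        os.mkdir("indexdir")
    ix = create_in("indexdir", schema)

    # Add a document and commit the write.
    writer = ix.writer()
    writer.add_document(url="http://example.com/", body="hello search world")
    writer.commit()

    # Parse and run a query against the "body" field.
    with ix.searcher() as searcher:
        query = QueryParser("body", ix.schema).parse("search")
        for hit in searcher.search(query):
            print(hit["url"])

That convenience is the appeal; it just doesn't hold up at crawler scale the way a JVM Lucene deployment does.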


Seems like using Cloudant's BigCouch fork, with its built-in Lucene indexing, might be a better choice for this.
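If I remember Cloudant's search API correctly, you define the index with a JavaScript function in a design document and query it over HTTP. A hedged sketch (the database URL, index name, and use of the requests library are my own placeholders):

    import requests

    DB = "https://account.cloudant.com/pages"  # placeholder database URL

    # Design document with a Lucene index function over each page body.
    design_doc = {
        "indexes": {
            "by_body": {
                "index": "function(doc) { if (doc.body) { index('default', doc.body); } }"
            }
        }
    }
    requests.put(DB + "/_design/search", json=design_doc)

    # Query the Lucene index through the _search endpoint.
    resp = requests.get(DB + "/_design/search/_search/by_body",
                        params={"q": "celery"})
    print(resp.json().get("rows"))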


Thanks - I was reading the article thinking "what I'd really want is CouchDB with built-in Lucene indexing and searching" - and there it is! :-)


Unfortunately, Whoosh is too slow for any nontrivial amount of content. You're much better off using Solr or ElasticSearch.
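For comparison, here's the same index-and-search flow against ElasticSearch using the official Python client. The index name and document shape are placeholders:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Index one crawled page, then refresh so it's searchable immediately.
    es.index(index="pages", document={"url": "http://example.com/",
                                      "body": "hello search world"})
    es.indices.refresh(index="pages")

    # Full-text match query against the body field.
    result = es.search(index="pages", query={"match": {"body": "search"}})
    for hit in result["hits"]["hits"]:
        print(hit["_source"]["url"])

Unlike Whoosh, the index lives in a separate server process, so it scales independently of your Python code.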



