How I Wrote a Search Engine in 6 Weeks

okeumeni · on April 21, 2008

I would suggest that you read the article again and revise your judgement. Here is why.

I founded a search engine (web, images, video and business search), now in alpha, with limited resources as well; trust me, I know how hard it is.

Simple questions: how good is your crawler? (You shouldn’t implement a scheme for each web site, though I understand that for your business it kinda makes sense.)

How much information do you have in your repository? (You should be thinking in terabytes; if not, you are still far from any trouble.) Ferret is a great indexing tool, but how much data can it index? How scalable is it? Looking at the tech behind Ferret, how many resources does it use? How good is your relevancy model? (This question is tightly linked to your indexing.)

I loved your idea. Just one thing: work on the relevancy again. I searched for ‘ruby on rails’ and got results related only to ‘ruby’ first, with the relevant ones after. I would also suggest caching images to enhance the user experience. Please don’t take my review personally.
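
To make the relevancy point concrete, here is a minimal sketch of one common fix: score a result higher when it covers more of the query terms, and add a bonus when the whole query appears as a contiguous phrase. This is only an illustration under stated assumptions. The titles are made up, the function names `tokenize`, `score`, and `rank` are hypothetical, and none of this reflects how Ferret or the site in question actually scores results.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query, doc):
    """Toy score: fraction of distinct query terms found in the document,
    plus 1.0 if the query tokens appear contiguously (a phrase match)."""
    q_terms = tokenize(query)
    d_terms = tokenize(doc)
    if not q_terms:
        return 0.0
    distinct = set(q_terms)
    coverage = sum(1 for t in distinct if t in d_terms) / len(distinct)
    n = len(q_terms)
    # Slide a window of the query's length over the document tokens.
    phrase = any(d_terms[i:i + n] == q_terms
                 for i in range(len(d_terms) - n + 1))
    return coverage + (1.0 if phrase else 0.0)


def rank(query, docs):
    """Return docs sorted from highest to lowest score."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)


# Hypothetical product titles, for illustration only.
docs = [
    "Ruby pendant necklace",
    "Agile Web Development with Ruby on Rails",
    "Programming Ruby",
]

for title in rank("ruby on rails", docs):
    print(title)
```

With this scoring, a title containing the full phrase outranks titles that match only ‘ruby’. A real engine would combine something like this with term weighting, but the idea of rewarding full-query and phrase matches is the same.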

Readmore · on April 21, 2008

That's good advice. I'll admit the scaling aspects of Ferret, and Rails at this point, scare me quite a bit. My goal right now is to continue to expand my index and learn what's working and what's not.

As was mentioned in another comment, I'm looking into other, longer-term indexing solutions (possibly SOLR), so hopefully that will help in the relevancy area. I'm also working on an in-house ranking algo to better sort results.

I appreciate your feedback.

tyohn · on April 21, 2008

6 weeks? Why so long? (Just kidding.) I am building a search engine as my current project. It took all of a weekend to get it up and running. I didn't do it alone; my friend helped, so I guess maybe that doesn't count ;) We've been crawling for a little under 3 weeks and I keep making interface and search-results tweaks, but otherwise it works "ok". I am in the process of switching it over to S3 (maybe EC2); after that change I think I'll open it up to the public.

Readmore · on April 21, 2008

Sounds cool. Let me know when you launch; I'd like to take a look.

kradic · on April 21, 2008

Why do they use a cartoon of Ann Coulter as their logo?

Readmore · on April 21, 2008

Haha. I hadn't ever looked at her like that.... now I may have to change it.

henning · on April 21, 2008

Are you sure it's Ann Coulter? It definitely has the thousand-yard stare, but I don't see an Adam's apple or muscular, veiny hands.

gojomo · on April 21, 2008

Thanks for sharing your experience.

For doing product search at the few-hundred-thousand-item scale, I would suggest SOLR rather than Nutch from the Lucene family.

You'd need to do your own crawling/scraping, but the indexing is solid, simple, and flexible. (SOLR's pedigree is from CNET's own product search.)

Readmore · on April 21, 2008

Thanks for the info. I have looked at Solr and it looks great. I'll give it a try and write up my thoughts. [Edited to correct iPhone typing mistakes]

fizx · on April 21, 2008

  a. You wrote a crawler, not a search engine.  
  b. Ferret will bite you in the ass.  
  c. For a really good off-the-shelf crawler, look at Heritrix.

Readmore · on April 21, 2008

Since you can go to www.embought.com and search for products, I would have to say you're wrong. If I had only written a crawler, I would have a nice collection of webpages on my hard drive and nothing more. Why are you so down on Ferret? What problems have you had with it? Just making an off-the-cuff statement without facts to back it up doesn't make you look very credible.

petercooper · on April 21, 2008

Just on the Ferret side of things, it does have a pretty bad reputation in some circles. Google for 'ferret "corrupted index"'; there are 59 results for that limited query alone.

I don't use Ferret myself (I tried it once; it seemed pretty good), but I'm very well read in the Ruby community (I run Ruby Inside and RubyFlow) and I've seen more than enough people saying bad things about Ferret (corrupted indexes, concurrency issues, and whatnot) to personally avoid it. Solr and Sphinx seem to get a far better rap.

..

I should add that I had a play with your site after reading your article, and it's pretty good. There's a lot of trash out there in this field and you've pulled together a good site. Kudos.

Readmore · on April 21, 2008

Thanks! I really like Ruby Inside as well; you have a lot of great info there.