Full-Text Search in JavaScript

jka · on Oct 18, 2015

There's a pretty neat little project called 'lunr.js' which can provide a fairly fully-featured JavaScript search engine for use in the browser.

It supports multi-field search, stop-word removal, tf-idf - but no Okapi BM25 alas.

http://lunrjs.com/
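For anyone unfamiliar with the terms, here is a rough sketch of stop-word removal plus tf-idf scoring. It is written in Python for brevity and is not lunr's implementation; the stop-word list, the function names, and the sample documents are all made up for illustration.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not lunr's list).
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "and", "to"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def tf_idf_scores(query, docs):
    """Score each document as the sum of tf * idf over the query terms."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    query_terms = tokenize(query)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append(sum(tf[t] * math.log(n / df[t])
                          for t in query_terms if t in df))
    return scores

docs = ["the cat sat in the hat", "a dog is a dog", "the cat and the dog"]
scores = tf_idf_scores("cat hat", docs)
```

The first document matches both query terms and scores highest; the second matches neither and scores zero.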

azdle · on Oct 18, 2015

There's also http://elasticlunr.com/. I'm not sure what algorithm it uses, but the site says "Elasticlunr.js use quite the same scoring mechanism as Elasticsearch, and also this scoring mechanism is used by lucene.", so maybe? I can't say I know that much about this topic...

I actually just used that library to build search for the docs site where I work, and I have to say it works really well. It's a fully static site (hosted on GitHub Pages), and all the search is done in the browser based on a JSON "index" file that I generate alongside the rest of the site. http://docs.exosite.com/
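The general pattern (build an index at site-generation time, ship it as JSON, query it on the client) can be sketched roughly like this. This is Python rather than the actual elasticlunr setup, and the page paths, function names, and index layout are hypothetical.

```python
import json
from collections import defaultdict

def build_index(pages):
    """Build step: map each term to the sorted list of pages containing it,
    then serialize to JSON so it can be shipped as a static file."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return json.dumps({term: sorted(urls) for term, urls in index.items()})

def search(index_json, query):
    """Client step: load the JSON index and return pages matching ALL terms."""
    index = json.loads(index_json)
    results = None
    for term in query.lower().split():
        urls = set(index.get(term, []))
        results = urls if results is None else results & urls
    return sorted(results or [])

# Hypothetical pages standing in for the generated site.
pages = {
    "/api": "http api reference",
    "/guide": "getting started guide for the api",
}
index_json = build_index(pages)
hits = search(index_json, "api guide")
```

A real index would also store term frequencies or positions for ranking; this one only supports boolean AND matching.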

olivernn · on Oct 18, 2015

I'm fairly sure elasticlunr is a fork of lunr; I still seem to have the most commits even! [1]

I'll have to take a look and see what @weixsong added, perhaps there are some changes that I can merge upstream.

[1] https://github.com/weixsong/elasticlunr.js/graphs/contributo...

olivernn · on Oct 18, 2015

The last time this article was on HN I took a look at adding Okapi BM25 to lunr. From what I remember the changes didn't seem too huge; it's just a matter of finding the time to sit down and implement it!
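For reference, the BM25 weighting itself is small. Below is a minimal sketch in Python, not the change considered for lunr: the function name and sample documents are invented, `k1=1.2` and `b=0.75` are just commonly used defaults, and the idf uses the non-negative variant with 1 added inside the logarithm.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query terms with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n  # average document length
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            # Non-negative idf variant (1 added inside the log).
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            # Term frequency saturates via k1 and is normalized by length via b.
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores

docs = [
    ["search", "engine"],
    ["search", "search", "search", "engine", "index", "browser"],
    ["static", "site"],
]
scores = bm25_scores(["search"], docs)
```

Compared with plain tf-idf, tripling the term frequency here gives much less than triple the score, which is the saturation behaviour BM25 is known for.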

ninjakeyboard · on Oct 18, 2015

I built a similar tf-idf search with cosine similarity in Scala, if anyone is curious: https://github.com/jasongoodwin/tfidf-search
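The idea, roughly: turn each document and the query into tf-idf weighted vectors and rank by the cosine of the angle between them. The sketch below is in Python and is not taken from the linked repository; the helper names and sample documents are assumptions.

```python
import math
from collections import Counter

def idf_table(docs):
    """Inverse document frequency for every term in the tokenized corpus."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return {t: math.log(n / c) for t, c in df.items()}

def vectorize(tokens, idf):
    """Sparse tf-idf vector; terms unknown to the corpus get weight 0."""
    return {t: c * idf.get(t, 0.0) for t, c in Counter(tokens).items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [
    ["scala", "search", "index"],
    ["scala", "actor", "model"],
    ["bread", "recipe"],
]
idf = idf_table(docs)
doc_vectors = [vectorize(d, idf) for d in docs]
query_vector = vectorize(["search", "index"], idf)
sims = [cosine(query_vector, v) for v in doc_vectors]
```

Because cosine similarity divides by the vector lengths, long documents are not favoured just for containing more words.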

jb1991 · on Oct 18, 2015

I think all the HN love is giving this web server a rough time.

JonnieCache · on Oct 18, 2015

https://archive.is/EU35A

stephanheijl · on Oct 18, 2015

It looks pretty cool, but it slowed my browser down to a crawl when I had it open in the background. That seems like something that needs optimizing.

meeper16 · on Oct 18, 2015

An approach more along the lines of machine learning would be to use what word2vec originated from at Berkeley Lab https://www.kaggle.com/c/word2vec-nlp-tutorial/forums/t/1234...

andor · on Oct 18, 2015

I'd call that Information Retrieval, not Machine Learning. Indexing documents doesn't "learn" any more than a file system storing files, and TF-IDF and BM25 are simply weighting functions.

But of course, it's an interesting topic. If you want to learn more, the first edition of the most widely used textbook is free:

http://www-nlp.stanford.edu/IR-book/

draker · on Oct 18, 2015

Novice question: Is the "www-" just part of the given subdomain or is there something else that occurs using that syntax?

"http://nlp.stanford.edu/IR-book/" resolves to the same resource.

quesera · on Oct 18, 2015

No magic. "www-nlp" and "nlp" are explicitly configured to point to the same IP address in the DNS.

Then (in the simplest case) the responding webserver is configured to treat the two hostnames (and possibly others) as identical and serve the same files.

So I guess there is magic. CNAMEs and virtual hosts and HTTP/1.1.
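As a toy illustration of the virtual-host half: the server looks at the HTTP/1.1 `Host` header and maps any alias to the same site. This is a Python sketch, not how any particular webserver is configured; the `/srv/nlp` path and the table names are hypothetical.

```python
# Hypothetical alias table: extra hostnames mapped to a canonical hostname.
ALIASES = {"www-nlp.stanford.edu": "nlp.stanford.edu"}

# Hypothetical document roots, keyed by canonical hostname.
DOCROOTS = {"nlp.stanford.edu": "/srv/nlp"}

def resolve_docroot(host_header):
    """Pick the document root for a request based on its Host header."""
    host = host_header.split(":")[0].lower()  # drop any port, ignore case
    canonical = ALIASES.get(host, host)
    return DOCROOTS.get(canonical)  # None means no matching site

same_site = (resolve_docroot("www-nlp.stanford.edu")
             == resolve_docroot("nlp.stanford.edu:80"))
```

Both hostnames end up at the same files, which is why either URL returns the same page.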

ddevault · on Oct 18, 2015

www-nlp.stanford.edu is a CNAME that points to nlp.stanford.edu. Tip: use the `dig` tool to examine DNS records.

ninjakeyboard · on Oct 18, 2015

Isn't even linear regression, e.g., considered to be machine learning? http://aimotion.blogspot.ca/2011/10/machine-learning-with-py...

dang · on Oct 18, 2015

Ok, we took "Machine Learning: " out of the title.

don_draper · on Oct 18, 2015

As more data is indexed the system "learns" what makes a relevant search based upon the words in each document. Though I agree the definition of learning in a machine learning context is kind of grey.

amirouche · on Oct 18, 2015

It doesn't learn; it's fully supervised and doesn't have a feedback loop.

I'm not saying it's not interesting. It's a step forward compared to the other articles in my collection [1][2][3] on the same subject, and it makes it easy to dive into the metrics and heuristics used in IR.

[1] https://www.youtube.com/watch?v=gJwFHSeFg44

[2] http://aakashjapi.com/fuckin-search-engines-how-do-they-work...

[3] https://class.coursera.org/nlp/lecture

eivarv · on Oct 18, 2015

I don't know if the definition is really that grey.

Tom Mitchell's popular definition (from his book "Machine Learning"), which is even quoted in the opening of the Wikipedia article on Machine Learning [2], comes to mind:

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E" [1]

[1]: Mitchell, T. (1997). Machine Learning, McGraw Hill. ISBN 0-07-042807-7, p.2.

[2]: https://en.wikipedia.org/wiki/Machine_learning#Overview