Full-Text Search in JavaScript (burakkanber.com)
87 points by kiril-me on Oct 18, 2015 | 18 comments



There's a pretty neat little project called 'lunr.js' which can provide a fairly fully-featured JavaScript search engine for use in the browser.

It supports multi-field search, stop-word removal, tf-idf - but no Okapi BM25 alas.

http://lunrjs.com/


There's also http://elasticlunr.com/. I'm not sure what algorithm it uses, but the site says "Elasticlunr.js use quite the same scoring mechanism as Elasticsearch, and also this scoring mechanism is used by lucene.", so maybe? I can't say I know that much about this topic...

I actually just used that library to build search for the docs site where I work, and I have to say it works really well. It's a fully static site (hosted on GitHub Pages), and all the search is done in the browser based on a JSON "index" file that I generate alongside the rest of the site. http://docs.exosite.com/


I'm fairly sure elasticlunr is a fork of lunr; I even still seem to have the most commits! [1]

I'll have to take a look and see what @weixsong added, perhaps there are some changes that I can merge upstream.

[1] https://github.com/weixsong/elasticlunr.js/graphs/contributo...


The last time this article was on HN I took a look at adding Okapi BM25 to lunr. From what I remember the changes don't seem too huge; it's just a matter of finding the time to sit down and implement it!


I built a similar tf-idf search with cosine similarity in Scala, if anyone is curious: https://github.com/jasongoodwin/tfidf-search
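
For anyone who wants the gist without reading Scala, here is a rough JavaScript sketch of tf-idf vectors compared by cosine similarity. It is not taken from the linked repository; the function names (`tokenize`, `tfidfVectors`, `cosine`) and the plain `log(N / df)` weighting are assumptions made for illustration.

```javascript
// Hypothetical helpers for illustration only.

// Split text into lowercase alphanumeric tokens.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Build one sparse tf-idf vector (a Map of term -> weight) per document.
function tfidfVectors(docs) {
  const tokenized = docs.map(tokenize);

  // Document frequency: in how many documents does each term appear?
  const df = new Map();
  for (const tokens of tokenized) {
    for (const term of new Set(tokens)) {
      df.set(term, (df.get(term) || 0) + 1);
    }
  }

  return tokenized.map((tokens) => {
    // Raw term counts for this document.
    const vec = new Map();
    for (const term of tokens) {
      vec.set(term, (vec.get(term) || 0) + 1);
    }
    // Scale each count by log(N / df), so terms found everywhere weigh 0.
    for (const [term, tf] of vec) {
      vec.set(term, tf * Math.log(docs.length / df.get(term)));
    }
    return vec;
  });
}

// Cosine of the angle between two sparse vectors; 0 if either is all zeros.
function cosine(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const [term, w] of a) {
    normA += w * w;
    if (b.has(term)) dot += w * b.get(term);
  }
  for (const w of b.values()) normB += w * w;
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}
```

Ranking a query would then amount to turning the query into a vector with the same weights and sorting documents by `cosine` against it.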


I think all the HN love is giving this web server a rough time.



It looks pretty cool, but it slowed my browser down to a crawl when I had it open in the background. That seems like something that needs optimizing.


An approach more along the lines of machine learning would be to use what word2vec originated from at Berkeley Lab https://www.kaggle.com/c/word2vec-nlp-tutorial/forums/t/1234...


I'd call that Information Retrieval, not Machine Learning. Indexing documents doesn't "learn" any more than a file system storing files, and TF-IDF and BM25 are simply weighting functions.

But of course, it's an interesting topic. If you want to learn more, the first edition of the most widely used textbook is free:

http://www-nlp.stanford.edu/IR-book/
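
To make "simply weighting functions" concrete, here is a minimal JavaScript sketch of the two weights mentioned above. The function names are made up for this example, the BM25 IDF shown is one common variant, and `k1 = 1.2`, `b = 0.75` are commonly used defaults assumed here; nothing is fitted or updated from feedback.

```javascript
// Hypothetical helpers for illustration; not the article's or lunr's code.

// Classic tf-idf: raw term frequency times log(N / df).
// tf = occurrences of the term in the document,
// N  = number of documents, df = documents containing the term.
function tfidf(tf, N, df) {
  return tf * Math.log(N / df);
}

// One common BM25 IDF variant; the "1 +" keeps it non-negative.
function bm25Idf(N, df) {
  return Math.log(1 + (N - df + 0.5) / (df + 0.5));
}

// Okapi BM25 weight for one term in one document.
// k1 controls how quickly repeated occurrences saturate;
// b controls how strongly long documents are penalized.
function bm25(tf, N, df, docLen, avgDocLen, k1 = 1.2, b = 0.75) {
  const lengthNorm = k1 * (1 - b + b * (docLen / avgDocLen));
  return bm25Idf(N, df) * (tf * (k1 + 1)) / (tf + lengthNorm);
}
```

Unlike tf-idf, the BM25 weight levels off as `tf` grows and shrinks for documents longer than average, but both are fixed formulas over counts.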


Novice question: Is the "www-" just part of the given subdomain or is there something else that occurs using that syntax?

"http://nlp.stanford.edu/IR-book/" resolves to the same resource.


No magic. "www-nlp" and "www" are explicitly configured to point to the same IP address in the DNS.

Then (in the simplest case) the responding webserver is configured to treat the two hostnames (and possibly others) as identical and serve the same files.

So I guess there is magic: CNAMEs, virtual hosts, and HTTP/1.1.


www-nlp.stanford.edu is a CNAME that points to nlp.stanford.edu. Tip: use the `dig` tool to examine DNS records.


Isn't even linear regression, for example, considered to be machine learning? http://aimotion.blogspot.ca/2011/10/machine-learning-with-py...


Ok, we took "Machine Learning: " out of the title.


As more data is indexed the system "learns" what makes a relevant search based upon the words in each document. Though I agree the definition of learning in a machine learning context is kind of grey.


It doesn't learn; it's fully supervised and doesn't have a feedback loop.

I'm not saying it's not interesting. It's a step forward compared to the other articles in my collection [1][2][3] on the same subject, and it makes it easy to dive into the metrics and heuristics used in IR.

[1] https://www.youtube.com/watch?v=gJwFHSeFg44

[2] http://aakashjapi.com/fuckin-search-engines-how-do-they-work...

[3] https://class.coursera.org/nlp/lecture


I don't know if the definition is really that grey.

Tom Mitchell's popular definition (from his book "Machine Learning"), which is even quoted in the opening of the Wikipedia article on Machine Learning [2], comes to mind:

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E" [1]

[1]: Mitchell, T. (1997). Machine Learning, McGraw Hill. ISBN 0-07-042807-7, p.2.

[2]: https://en.wikipedia.org/wiki/Machine_learning#Overview



