
>Isn't this a concern as the main objective of search is to provide accurate results?

If it returns 99% of the relevant results and misses some documents because of a weird bug or encoding issue, it could very well be good enough for their purposes. Heck, even 90% could be fine depending on what they do (e.g. serving articles on a content site).

For other uses, like police or medical records, they'd obviously need 100% recall.




I know "good enough" is probably not just a good idea at a startup; it's possibly mandatory, since there's only so much time and money. But as a user/consumer/customer/target demographic, I can't begin to describe how much I disdain knowing that something exists on a site but being unable to find it with search, particularly when I know the exact title. Reddit's search several years ago was quite bad and left a sour taste in my mouth.


I'm already cringing at people in this thread talking about "language detection" and "stemming" as if there are good, easy solutions to them.

Take your favorite language detector, like cld2. Apply it to some real-world text, like random posts on Twitter. Did it detect the languages correctly? Welp, there goes that idea.

(Tweets are too short, you say? Tough. Search queries are shorter. You probably aren't lucky enough for your domain's text to be complete articles from the Wall Street Journal, which is what the typical NLP algorithm was trained on.)
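To get a concrete feel for this, here's a minimal sketch using the pycld2 bindings (one common way to call cld2 from Python; the sample strings are invented for illustration). On long, clean prose the detector does fine; on short, query-like fragments the reliability flag tends to drop or the guess is simply wrong:

    # pip install pycld2
    import pycld2 as cld2

    samples = [
        # long, clean English: detection usually works
        "Breaking: the committee has released its full report on the proposed merger.",
        # short fragments, like tweets or search queries: much shakier
        "lol no way",
        "jajaja que bueno",
        "golf hotel bravo",
    ]

    for text in samples:
        is_reliable, _, details = cld2.detect(text)
        # details is a tuple of (languageName, languageCode, percent, score)
        lang_name, lang_code, percent, score = details[0]
        print(f"{text!r:75} -> {lang_code} (reliable={is_reliable})")

Run it on your own domain's text before trusting the detector: the interesting number isn't accuracy on news articles, it's accuracy on inputs the length of your actual queries.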

Stemming will always be difficult and subtle. It's useful but it isn't even linguistically well-defined, so you'll have to tweak it a lot. If stemming seems easy, you haven't looked at where it goes wrong for your use case.
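A quick illustration with NLTK's Porter stemmer (a standard English stemmer, not necessarily what any given search engine uses; the word groups are classic textbook examples). It shows both failure modes at once:

    # pip install nltk
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    # Overstemming: unrelated words collapse to the same stem, so a
    # search for one surfaces documents about the others.
    for word in ("universe", "university", "universal"):
        print(word, "->", stemmer.stem(word))   # all three -> "univers"

    # Understemming: related forms keep distinct stems, so a search
    # for one misses documents that use the other.
    for word in ("ran", "running", "run"):
        print(word, "->", stemmer.stem(word))   # "ran" stays "ran"

Neither direction is a bug you can patch once; which conflations are acceptable depends entirely on your corpus and your users' queries.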



