
Isn't this basically what PRECIS (RFC 7564) is about? There are open source implementations of that, like golang.org/x/text/secure/precis for Go (including the predefined profiles for e.g. usernames) or Unicode::Precis for Perl.



In the search engine context, the problem to be solved is that both French and English speakers are likely to type [cafe] and not [café] -- the French speaker because they might be on an English keyboard, or because they know it's not ambiguous.

In the search space, therefore, when you index the word 'café', you also index 'cafe' with a smaller weight. And when you see the query [café], you expand the query to ('café' OR 'cafe'-with-smaller-weight).

And you don't want to do either of these if the two words are actually different!
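To make that concrete, here is a rough Python sketch of the index-time and query-time expansion. It is only an illustration under stated assumptions, not how any particular search engine does it: the 0.5 weight, the `NO_FOLD` exception set, and all function names are made up for the example, and the "different word" check is a hand-written list (French 'mûr', ripe, vs 'mur', wall) where a real system would presumably need per-language data.

```python
import unicodedata
from collections import defaultdict

# Hypothetical weight for the accent-stripped variant; a tuning choice.
FOLDED_WEIGHT = 0.5

# Hypothetical exception list: stripping the accent yields a different word
# (French 'mûr' = ripe, 'mur' = wall), so these are never expanded.
NO_FOLD = {"mûr"}


def fold_accents(word):
    """Strip combining marks: decompose (NFD), drop the marks, recompose (NFC)."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", stripped)


def expand(word):
    """Return [(term, weight)]: the word itself, plus its folded form at lower weight."""
    word = unicodedata.normalize("NFC", word)
    terms = [(word, 1.0)]
    folded = fold_accents(word)
    if folded != word and word not in NO_FOLD:
        terms.append((folded, FOLDED_WEIGHT))
    return terms


def add_document(index, doc_id, text):
    """Index every word of the document under its expanded terms."""
    for word in text.lower().split():
        for term, weight in expand(word):
            index[term][doc_id] = max(index[term].get(doc_id, 0.0), weight)


def search(index, query_word):
    """Expand the query the same way (an OR over the terms) and rank by best match."""
    scores = defaultdict(float)
    for term, query_weight in expand(query_word.lower()):
        for doc_id, index_weight in index.get(term, {}).items():
            scores[doc_id] = max(scores[doc_id], query_weight * index_weight)
    return sorted(scores.items(), key=lambda item: -item[1])


index = defaultdict(dict)
add_document(index, 1, "café")
add_document(index, 2, "cafe")
print(search(index, "cafe"))
print(search(index, "café"))
```

With this setup, a query for [cafe] still finds the document containing 'café', just ranked below the exact match, and the reverse holds for [café].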

As an example of this in the wild, the ElasticSearch docs talk about the issue: https://www.elastic.co/guide/en/elasticsearch/guide/current/...

PRECIS appears to be aimed more at figuring out whether two usernames are 'the same'.



