Hacker News

Looks like a neat library.

I’m curious if there’s a tl;dr on how this is better than or different from Fuse, which is a very popular, established client-side fuzzy searching library.

https://github.com/krisk/fuse




Hi, thank you for your interest! I believe that most fuzzy search implementations lack accuracy in one aspect or another. The primary goals of my library are accuracy and query performance. However, I haven't looked into Fuse yet. I'm highly interested in hearing feedback from people who have tried both libraries with their datasets.


What is your definition of "accuracy" within the context of fuzzy search?


It's subjective, I have to admit. I would say a search is accurate if most people find what they are looking for in their dataset on the first try.

Distance definitions such as the Levenshtein and Damerau-Levenshtein distances provide a solid basis for discussions on accuracy. However, they are costly to compute and hence not widely adopted in fuzzy search libraries.
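For illustration, here is a textbook dynamic-programming sketch of both distances (my own example code, not taken from either library); the Damerau-Levenshtein variant shown is the common optimal-string-alignment form:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic Wagner-Fischer DP over two rows: insertions, deletions,
    # substitutions each cost 1.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def osa_distance(a: str, b: str) -> int:
    # Optimal-string-alignment variant of Damerau-Levenshtein:
    # additionally counts a transposition of two adjacent characters
    # as a single edit.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(levenshtein("kitten", "sitting"))  # → 3
print(osa_distance("acb", "abc"))        # → 1 (one adjacent transposition)
```

Both run in O(|a|·|b|) time per pair, which is why filters that cheaply rule out most candidates matter so much at scale.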

I started by using the known filter equation for the Levenshtein distance and computed a quality score with a lightweight formula. Then I realized that the filter equation can be extended to the Damerau-Levenshtein distance by sorting the characters of the 3-grams.
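A rough sketch of a q-gram count filter of this kind (my own illustration, with invented names, not the library's actual code): if two strings are within edit distance k, they must share at least max(|x|, |y|) − q + 1 − k·q q-grams, so a low overlap count lets candidates be rejected without running the expensive distance DP. Sorting the characters inside each gram makes the gram insensitive to adjacent transpositions, which is one plausible reading of the extension described above:

```python
from collections import Counter

def qgrams(s: str, q: int = 3, sort_grams: bool = False) -> Counter:
    grams = (s[i:i + q] for i in range(len(s) - q + 1))
    if sort_grams:
        # Sorting each gram's characters makes it unchanged under an
        # adjacent transposition inside the gram.
        grams = ("".join(sorted(g)) for g in grams)
    return Counter(grams)

def passes_filter(x: str, y: str, k: int, q: int = 3,
                  sort_grams: bool = False) -> bool:
    # Multiset intersection counts the shared q-grams.
    shared = sum((qgrams(x, q, sort_grams) & qgrams(y, q, sort_grams)).values())
    threshold = max(len(x), len(y)) - q + 1 - k * q
    return shared >= threshold
```

For example, "abcdef" and "abdcef" differ by one transposition: with sorted grams the pair survives a k=1 filter, while with plain grams it is filtered out, since a transposition costs two plain edits.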

In my tests, this implementation worked well. Please let me know how it works for you if you test it.


> It's subjective, I have to admit. I would say a search is accurate if most people find what they are looking for in their dataset in the first try.

I'd say a search is accurate if it finds what most closely matches the query, for some definition of "matches". A search is useful if most people can find what they are looking for on the first try.

That is, a search being accurate doesn't necessarily translate to usefulness, if people don't (or can't) know how to write those accurate queries.

I'd imagine this is why fuzzy searches exist. Fuzzier queries allow for a larger spectrum of possible matches, which means a larger set of queries can turn up those results someone is looking for. Queries do not have to be as precise, and writing useful queries is easier.

But to me it seems diametrically opposed to accuracy. Usefulness is a much more intuitive measure, because the query does not have to be perfectly accurate in order to find the right result.

Alternatively, you could focus on the quality of ranking of the returned matches: how often the correct result is near the top (and how near) when the user finds what they are looking for. Ideally you want this as high as possible.
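One way to operationalize that idea, sketched with a hypothetical `search` function and hand-labeled (query, expected result) pairs; the metric names and shapes here are standard (recall@k, mean reciprocal rank), but the harness is my own illustration:

```python
def evaluate(search, pairs, k=5):
    # search: query -> ranked list of results
    # pairs: list of (query, expected_result)
    hits, rr_sum = 0, 0.0
    for query, expected in pairs:
        results = search(query)
        if expected in results[:k]:
            hits += 1                                # counted for recall@k
        if expected in results:
            rr_sum += 1.0 / (results.index(expected) + 1)  # reciprocal rank
    n = len(pairs)
    return {"recall@k": hits / n, "mrr": rr_sum / n}
```

The hard part remains building the labeled pairs, which runs straight into the "what is the correct result" question below.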


Thank you for your explanation, I get your point. Regarding your last suggestion: I think it would be great to measure how often the correct result is near the top. However, don't we face the same issue as before? What is the correct result? Is it the term the user has in mind when writing the query? But what if they make a typo, and the term with the typo also exists in the dataset? Or they just type half of the term they are thinking of, but there are many terms in the dataset with the same prefix?

So, in the end, I believe it's worthwhile to try different implementations and share our subjective experiences.


Yes, you are right.


For those looking for options, I found uFuzzy to be a lightweight and good-enough alternative to Fuse: https://swyxkit.netlify.app/ufuzzy-search


I had a similar question. I've been using Fuse for years and have had almost no qualms with the data it returns after some light tuning.


Glad to hear that it works well, will look into it.



