Show HN: Frontend Fuzzy Search

trescenzi · 2024-02-15T13:21:44 1708003304

Have you considered building the search piece in wasm for performance? A few years back I was looking for a library like this and couldn’t find one that seemed fast enough for what I wanted. I ended up building one in rust and it’s easily 2x as fast as the js code I started with.

It’s not nearly as good a fuzzy search as this is as I just kinda tinkered with it till it felt ok for the purpose but it could be way better.

Source: https://github.com/trescenzi/brainstorm Site: https://brainstorm.tcrez.xyz/

kmschaal · 2024-02-15T13:46:24 1708004784

Hi, thanks for sharing! No, I haven't taken wasm into consideration. It would be interesting to see the performance gain, in particular for indexing.

jwr · 2024-02-15T15:14:23 1708010063

I should probably post the code for my n-gram fuzzy search with an inverted index. Without tests, the entire code fits on a single screen (I love ClojureScript). It works on the backend (JVM) and the frontend (Javascript) and has been in production use for the last [checks notes] 8 years.

zodvik · 2024-02-15T19:55:58 1708026958

Please do, that would be awesome.

funcDropShadow · 2024-02-16T13:31:24 1708090284

RoyalSloth · 2024-02-15T11:00:50 1707994850

Good job, this looks really well done. One thing I am wondering, however, is how easy it would be to integrate a pre-generated inverted index file into this fuzzy searcher?

Context: Say I have a bunch of blog posts from which I create an inverted index at "export to html" time, in order to avoid indexing the blog content at runtime on every page visit. Is there a way to persist the internal state of the fuzzy search across different page requests (e.g, /blog-post-1, /blog-post-2), so it only builds its internal state/index once?

The pre-generated inverted index could be quite large and I would like to avoid parsing it on every page request.

kmschaal · 2024-02-15T11:21:06 1707996066

Thank you for your kind feedback! That's a great idea. I have implemented a memento object that can be used for transferring the serialized state of the searcher. The intent of the implementation was to transfer the searcher between a web worker and the main thread. You could try to serve the Memento from the server and store it in an index db. You may have a look at the web worker example I provide in the repository.

RoyalSloth · 2024-02-15T11:53:37 1707998017

> You could try to serve the Memento from the server and store it in an index db

I am not sure what you mean by "store it in an index db", but I was thinking about using the searcher on a static website (no real backend, only a fileserver serving pre-generated html files). So if I understand you correctly, in order for this to work I would have to cache Memento via a local storage and load it on every page load/search request.

Unfortunately the index would change over time, thus one would have to detect this somehow and regenerate Memento as well.

kmschaal · 2024-02-15T12:15:38 1707999338

Sorry for the confusion. I think you could generate the memento each time you compile your blog into HTML. The memento can be stored as a json file and served statically by your fileserver. When a user visits your page, retrieve the memento from the server. Then, initialize the searcher with it. In this way you avoid indexing content at runtime.

As a bonus you could cache the memento in the local storage or session storage.

Kalabasa · 2024-02-15T14:16:30 1708006590

When i did my static site search function some time ago, I used Elasticlunr. I was able to pregenerate the index file as a big json file that is loaded at the client.

http://elasticlunr.com/

terpimost · 2024-02-15T17:25:15 1708017915

Thank you. We need more libs like that. I just researched the field yesterday and https://github.com/leeoniya/uFuzzy looked pretty good. But there is a gap in the market of such libs. Just few allow to send the whole html document, serialize and deserialize index to be used in browser, highlighting the matches is desired feature.

Most importantly very few fuzzy search libs can get a simple substring match as a priority, which is understandable but not helpful. Imagine searching for “xample” and not having “example” among the results.

kmschaal · 2024-02-15T20:13:23 1708028003

Thank you for your comment! It is indeed a problem in plain fuzzy search libraries (like this one) that substring matches can have a lower quality than unequal strings of similar length. A solution to that is to implement a higher level search controller that queries the fuzzy searcher, as well as a suffix array searcher. The controller than mixes the matches and returns the best matches across the two searchers. One can even add more searchers, e.g. a phonetic one. With the correct parameters this approach works well.

metayrnc · 2024-02-15T13:06:02 1708002362

Almost all frontend fuzzy search libraries I tried fail in correcting small typos (whoch is pretty common when searching). From the demo it seems this library can handle it. I will definitely be trying it out on my own project. Thanks for sharing it!

kmschaal · 2024-02-15T13:49:38 1708004978

Thank you, sounds great!

newusertoday · 2024-02-16T06:38:18 1708065498

Any comparison data with https://github.com/nextapps-de/flexsearch? Also what player are using to play/pause gif in your readme?

kmschaal · 2024-02-16T07:51:40 1708069900

Hi, thank you for your questions. Unfortunately there are no comparisons yet. The gif is simply a looped screen recording created with Camtasia. Best regards!

tkcranny · 2024-02-15T08:08:14 1707984494

Looks like a neat library.

I’m curious if there’s a tl;dr on how this is better or different to Fuse, which is a very popular established client side fuzzy searching library.

https://github.com/krisk/fuse

kmschaal · 2024-02-15T08:45:51 1707986751

Hi, thank you for your interest! I believe that most fuzzy search implementations lack accuracy in one aspect or another. The primary goals of my library are accuracy and query performance. However, I haven't looked into Fuse yet. I'm highly interested in hearing feedback from people who have tried both libraries with their datasets.

LoganDark · 2024-02-15T09:58:22 1707991102

What is your definition of "accuracy" within the context of fuzzy search?

kmschaal · 2024-02-15T10:32:09 1707993129

It's subjective, I have to admit. I would say a search is accurate if most people find what they are looking for in their dataset in the first try.

Distance definitions such as the Levenshtein and Damerau-Levenshtein distances provide a solid basis for discussions on accuracy. However, they are costly to compute and hence not widely adopted in fuzzy search libraries.

I started by using the known filter equation for the Levenshtein distance and computed a quality score with a leightweight formula. Then, I realized that the filter equation can be extended to the Damerau-Levenshtein distance by sorting the characters of the 3-grams.

In my tests, this implementation worked well. Please let me know how it works for you if you test it.

LoganDark · 2024-02-16T00:04:59 1708041899

> It's subjective, I have to admit. I would say a search is accurate if most people find what they are looking for in their dataset in the first try.

I'd say a search is accurate if it finds what most closely matches the query, for some definition of "matches". A search is useful if most people can find what they are looking for on the first try.

That is, a search being accurate doesn't necessarily translate to usefulness, if people don't (or can't) know how to write those accurate queries.

I'd imagine this is why fuzzy searches exist. Fuzzier queries allow for a larger spectrum of possible matches, which means a larger set of queries can turn up those results someone is looking for. Queries do not have to be as precise, and writing useful queries is easier.

But to me it seems diametrically opposed to accuracy. Usefulness is a much more intuitive measure, because the query does not have to be perfectly accurate in order to find the right result.

Alternatively, you could focus on the quality of ranking of the returned matches: how often the correct result is near the top (and how near) when the user finds what they are looking for. Ideally you want this as high as possible.

kmschaal · 2024-02-16T16:09:21 1708099761

Thank you for your explanation, I get your point. Regarding your last suggestion: I think it would be great to measure how often the correct result is near the top. However, don't we face the same issue as before? What is the correct result? Is it the term the user has in mind when writing the query? But what if they make a typo, and the term with the typo also exists in the dataset? Or they just type half of the term they are thinking of, but there are many terms in the dataset with the same prefix?

So, in the end, I believe it's worthwhile to try different implementations and share our subjective experiences.

LoganDark · 2024-02-16T18:53:12 1708109592

Yes, you are right.

swyx · 2024-02-15T09:03:49 1707987829

for those looking for options i found ufuzzy a lightweight and good enough version of fuse: https://swyxkit.netlify.app/ufuzzy-search

si1entstill · 2024-02-15T16:06:42 1708013202

I had a similar question. I've been using fuse for years and have had almost no qualms with the data it returns after some light tuning.

kmschaal · 2024-02-15T20:19:56 1708028396

Glad to hear that it works well, will look into it.

giovannibonetti · 2024-02-15T19:14:13 1708024453

Great library, thanks for sharing it. As someone else mentioned, perhaps looking into WASM if you have time might be interesting. One DB that has been getting in this space is DuckDB, which recently announced WASM support for extensions like full-text search [1]. On the other hand, I haven't seen it in practice to check if it is a good benchmark or if it has different use cases.

[1] https://duckdb.org/2023/12/18/duckdb-extensions-in-wasm.html

_heimdall · 2024-02-15T12:44:07 1708001047

Nice to see this as a framework-agnostic library!

I have always been a bit unsure when to reach for client-side code for festures like search, filtering, and pagination though. The features are powerful for sure and if the site is otherwise static I can see some value there, but it means having to ship all searchable data to the browser. It feels like an easy security risk to avoid, and offloading that work to a backend where the state lives feels more straightforward (as long as you have a hosted backend at all).

jdthedisciple · 2024-02-26T13:07:54 1708952874

Nice work- Hats off!

efilife · 2024-02-16T08:25:06 1708071906

I've always used Minisearch for stuff like this. Would be cool to see comparisons on this project's github page

lacoolj · 2024-02-15T20:19:11 1708028351

oh very cool. would love to see the comparison/benchmark against the library I've used in projects for years (Fuse - https://github.com/krisk/Fuse).

Keep up the great work!

kmschaal · 2024-02-17T16:38:59 1708187939

Thank you! I will do a comparison in the near future and update the readme.

henio · 2024-02-15T08:26:09 1707985569

Very nice lib, with proper documentation!

kmschaal · 2024-02-15T08:46:05 1707986765

Thank you!

leeoniya · 2024-02-20T02:05:56 1708394756

neat!

i will add it to the uFuzzy benchmarks :)