PruningRadixTrie – Faster Radix trie for prefix search and auto-complete (github.com/wolfgarbe)
188 points by g0xA52A2A 9 months ago | 32 comments



It's 1000x compared to a very naive baseline: exhaustive traversal of the matching subtree. Nobody in practice would do that; storing the maximum score for each subtree so you can prune is a very common technique, definitely not novel here.
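For anyone who hasn't seen the technique, here is a minimal sketch in Python (my own illustration on an uncompressed trie rather than a radix trie, not the repo's implementation): each node stores the best score in its subtree, and a best-first top-k query never expands a subtree that cannot beat the results already found.

```python
import heapq
import itertools

class Node:
    def __init__(self):
        self.children = {}   # char -> Node
        self.score = None    # score of a term ending here, if any
        self.max_score = 0   # best score anywhere in this subtree

class PruningTrie:
    def __init__(self):
        self.root = Node()

    def insert(self, term, score):
        node = self.root
        node.max_score = max(node.max_score, score)
        for ch in term:
            node = node.children.setdefault(ch, Node())
            node.max_score = max(node.max_score, score)
        node.score = score

    def top_k(self, prefix, k=10):
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Best-first traversal ordered by subtree max_score; subtrees
        # that cannot beat the k-th best result are never expanded.
        tie = itertools.count()  # tiebreaker so Nodes are never compared
        heap = [(-node.max_score, next(tie), prefix, node)]
        results = []
        while heap and len(results) < k:
            neg, _, term, n = heapq.heappop(heap)
            if n is None:        # a completed term surfaced from the heap
                results.append((term, -neg))
                continue
            if n.score is not None:
                heapq.heappush(heap, (-n.score, next(tie), term, None))
            for ch, child in n.children.items():
                heapq.heappush(heap,
                               (-child.max_score, next(tie), term + ch, child))
        return results
```

When a completed term is popped, no unexplored subtree can outscore it, so the loop stops after k results instead of walking the whole matching subtree.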

Even doing a Levenshtein-bounded beam search along the trie is pretty much common practice, see for example this paper [1] from 2011.
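The general idea, roughly (my own Python illustration on a plain dict-based trie, not the paper's method or data layout): advance one row of the Levenshtein DP table per trie edge, and prune any branch whose row minimum already exceeds the distance bound.

```python
END = "$"  # marks the end of a stored term

def build_trie(terms):
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def fuzzy_search(root, query, max_dist):
    """All stored terms within Levenshtein distance max_dist of query."""
    results = []

    def walk(node, ch, prev_row, term):
        # Extend the DP table by one row for edge character `ch`.
        row = [prev_row[0] + 1]
        for c in range(1, len(query) + 1):
            row.append(min(row[c - 1] + 1,       # insertion
                           prev_row[c] + 1,      # deletion
                           prev_row[c - 1] + (query[c - 1] != ch)))  # substitution
        if END in node and row[-1] <= max_dist:
            results.append((term, row[-1]))
        if min(row) <= max_dist:  # prune: no extension can recover
            for nch, child in node.items():
                if nch != END:
                    walk(child, nch, row, term + nch)

    first_row = list(range(len(query) + 1))
    for ch, child in root.items():
        if ch != END:
            walk(child, ch, first_row, ch)
    return results
```

With a tight bound, whole subtrees are cut off as soon as every cell of the current DP row exceeds it, which is what keeps this practical on large dictionaries.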

There are more sophisticated, space-efficient approaches as well, for example this paper [2] I co-authored in 2013.

[1] https://www.microsoft.com/en-us/research/wp-content/uploads/...

[2] https://www.microsoft.com/en-us/research/wp-content/uploads/...


What is state of the art currently?


Here is a more recent paper where I am one of the authors: http://www.vldb.org/pvldb/vol9/p828-deng.pdf



Lucene's WFST is absurdly fast


And also: are there implementations to look at? Or libraries/open source DBs/search engines that use these?


Is the `PruningRadixTrie` the same as the Completion Trie?


Part of the optimization here seems to be that you only return the top 10 results. This might result in a really annoying UX though if you can't page through other results.

For example, to tag someone on LinkedIn in a post you have to type their name (this comes up when I share a blog post on LinkedIn by someone I'm not connected with). But if their name doesn't come up within the top 10 results, there seems to be literally no way to tag them.

And this isn't a permissions issue about tagging people I'm not connected with either. If the person I'm trying to tag who I'm not connected with has an uncommon name, I can normally find them in the auto complete list and tag them. If they have a common name, I often can't.

I'm not posting this to ask for solutions, just to say that being arbitrarily limited to top results can sometimes be infuriating.


OP's structure returns the entire trie if you set the minimum rank to -INF, so that's not an issue with the data structure but with the UX, I would think? The data structure can produce all the results; it's just a matter of knowing when the user wants to see more.

You'd need something like a button for 'show more results', or perhaps use a timeout ('if no user action in 10s, assume the desired result is missing & show more results') that increases the _n_ limit (say, doubling each time). Then the user only pays for what they use, and the common case remains accelerated.
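That escalation policy is simple to sketch (hypothetical names, Python used only for illustration; the real thing would live in UI code):

```python
def escalating_limits(start=10, cap=10_000):
    """Yield result-count limits that double on each 'show more' request."""
    n = start
    while n < cap:
        yield n
        n *= 2
    yield cap

def paged_query(query_fn, prefix):
    """Wrap a top-n query so each repeated call for the same prefix
    fetches twice as many results as the previous one.

    query_fn(prefix, n) is assumed to return the n best completions.
    """
    limits = escalating_limits()
    def more():
        return query_fn(prefix, next(limits))
    return more
```

The first call stays on the fast top-10 path; only a user who keeps asking for more pays for the progressively larger queries.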


You could use scroll event(s) to trigger more results.


He keeps a sorted frequency count. It doesn't have to be the top 10; you can query an arbitrary k.


Not "part", this is the whole basis of the data structure.


It seems to me that a hybrid solution using a pruning trie for prefixes and a normal trie for almost complete completions would work.


You mean you do both in parallel, asynchronously, return the first top 10 results, and by the time the user gets to the point where they need the rest of the results, the classical trie is done, or at least close to ready?


The fine-grained results look like:

- 1444x faster for single character prefixes

- 252x faster for two character prefixes

- 55x faster for three character prefixes

- ~20x faster for 4 and 5 character prefixes

- <= 5x faster for longer prefixes

I used to work on a production auto-complete system operating at over 100k peak QPS. For prefixes of length one and two we would not even bother hitting the server, just from a quality perspective, not because of latency/throughput considerations. Btw, up until 3 characters, you could store everything in an in-memory hash map. 20x speedup on length 4 and 5 prefixes is still very impressive, but not quite a 1000x speedup either.
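To illustrate that last point (my own sketch, not how any particular production system did it): for prefixes up to three characters there are few enough distinct keys that you can precompute the answers outright and serve them from a hash map.

```python
import heapq
from collections import defaultdict

def build_prefix_table(scored_terms, max_prefix_len=3, k=10):
    """Map every prefix of length <= max_prefix_len to its top-k terms.

    scored_terms is a list of (term, score) pairs.
    """
    buckets = defaultdict(list)
    for term, score in scored_terms:
        for n in range(1, min(len(term), max_prefix_len) + 1):
            buckets[term[:n]].append((score, term))
    # Keep only the k highest-scoring terms per prefix, best first.
    return {prefix: [term for _, term in heapq.nlargest(k, pairs)]
            for prefix, pairs in buckets.items()}
```

Short-prefix queries, which match the most terms and are therefore the worst case for any trie traversal, then become a single dictionary lookup.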


I also worked on a production auto-complete feature for a web app a while back, and I couldn't agree more with the quality sentiment. One or two characters is almost never enough to give a meaningful result. Using history or similar users' searches is much more effective than trying to guess what someone meant by "th".


> One or two characters is almost never enough to give a meaningful result.

TFA acknowledges that and mentions the exception:

> While a prefix of length=1 is not very useful for the Latin alphabet, it does make sense for CJK languages


How does that fare for non-English queries? E.g. are two chars still not enough for Chinese languages?


I've seen 1.5 chars/token used as a rule of thumb for estimating token counts in Chinese text.


Here is what I hate in modern completion UIs.

You type one character, and the thing scampers off like a rabid squirrel, hogging the CPU, so that the UI becomes unresponsive to me typing the remaining characters.

If you just prioritize the interactive response, the algorithm is almost moot.

When the user types m, you might as well show a completely random selection of 20 things that start with m. Or nothing. Take your time; the user will type more stuff.

When there are enough characters of context so that the selection is pruned down, there you need to be responsive. The user starts to see meaningful things in the list. When they type another character to refine it, they want it updated instantly.

Completion that is too eager is exactly like a person who doesn't wait for you to finish talking, jumps to conclusions about what you're going to say and starts replying.

Here is another thing I hate: when I see a completion I like that is itself not complete, let me continue! Sometimes the completion gives me the word I was going to type, but there is more; i.e. I wanted just to complete a part of my typing, not to choose that item as the result of the query.

I find myself manually typing the word I see in the completion list, because I know that if I click on it, it will search for that, which is not what I want.

The completion should always go into the query box, where the user can refine it before submitting it. Or at least make it an option.

It doesn't actually cost an extra click. All it means is that if I use the down arrow to go into the completions, the currently selected completion is copied to the query box. Then I can type more characters, or hit Enter to dispatch the query. If all I want is to dispatch the completion, it's the same number of keystrokes: down arrow to pick a completion, then Enter.

Do not have it so that when I hit down arrow, the text I typed stays the same, so that the only choice I have is to hit Enter to dispatch "microsoft" or else continue my original text from "micr". Maybe I want the word "microsoft" and then continue typing " teams" or whatever.


I wrote something like this for contact auto-completion. Running in the browser, we had to ensure both that tree construction was fast and that completion was instantaneous. So the next level of optimization is to amortize tree construction over searches (using the observation that most of the tree is never accessed).

https://github.com/superhuman/trie-ing
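The amortization idea can be sketched like this (my own illustration in Python, not the linked library's actual code): a node keeps its terms as a flat list and only splits into children the first time a query walks through it, so untouched parts of the tree are never built.

```python
class LazyTrie:
    """Trie node that defers splitting into children until first visited."""

    def __init__(self, terms, depth=0):
        self.terms = terms       # all terms passing through this node
        self.depth = depth       # how many characters have been consumed
        self.children = None     # built lazily on first query

    def _expand(self):
        buckets = {}
        for t in self.terms:
            if len(t) > self.depth:
                buckets.setdefault(t[self.depth], []).append(t)
        self.children = {ch: LazyTrie(ts, self.depth + 1)
                         for ch, ts in buckets.items()}

    def complete(self, prefix):
        if len(prefix) <= self.depth:
            return self.terms    # every term here already matches the prefix
        if self.children is None:
            self._expand()       # pay the construction cost only when visited
        child = self.children.get(prefix[self.depth])
        return child.complete(prefix) if child else []
```

Construction is O(1) up front; each query only materializes the nodes along its own path, which matches the observation that most of the tree is never accessed.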


interesting. i made something much more stupid, but brutally effective: https://github.com/leeoniya/uFuzzy

i guess if you wanted results ordered by contact frequency you can just keep the original haystack sorted by frequency, and would get the Richard match first. or keep the frequency in an adjacent lookup and sort/pick the result afterwards.

PSA: don't use the ultra popular Fuse.js for fuzzy matching, it's super slow and has pretty terrible match quality. e.g. https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uF...


If you're interested in a TypeScript fork of this that also supports deletion, see here: https://github.com/shortwave/trie

There are also a couple of bug fixes in there, for example: https://github.com/shortwave/trie/commit/1e7045d89cc20011251...


In the game "Keep Talking and Nobody Explodes", one of the puzzles has you match a limited set of characters in each position against a dictionary of possible words, a task I was having a lot of trouble doing in the limited time provided. I ended up rewriting the dictionary as a prefix tree and including it as an addendum to the manual.

It was neat seeing these algorithms work at human scale and how much better it made the process.


This is interesting. I am curious whether it's possible to return more than 10 results and how that would look.

Also, whether it makes sense to use something like this for code autocomplete - I would prefer to be able to just scroll to find everything.

I think, though, that in the case of code, parsing the code and adding things to the dictionary may be more expensive than looking through it.


Inspiring, but we use Postgres full text search for all autocomplete functionality and it has worked just fine.


Can you elaborate a bit on this? I am a noob at full-text search, by the way, who might have to implement this at {work} soon.


What is the state of the art in longest prefix matching, e.g. for router prefixes in routing tables?


Nice piece of engineering and surely worth a read. The ideas are hardly novel, tho.


Or put a cache in front.


This is a data structure for implementing dictionary lookup, the sort of thing you'd use to make a cache...


I don’t think a trie makes a good cache.



