Fast Levenshtein distance using a Trie (stevehanov.ca)
107 points by rcfox on Jan 15, 2011 | 16 comments



Coincidentally, just today I implemented the greedy packed trie algorithm from Liang's thesis on LaTeX hyphenation [1]. The idea is to store the trie into a big array, overlapping the storage of your child nodes with one another where possible. This reduces the memory footprint of the trie (when compared to using a separate hash table for each node) and increases the lookup speed as you don't need a hash table, you just index directly for each character. You can also collapse common suffixes into the same state, saving a bunch more space. Would be a nice enhancement to this algorithm; probably more so if it were written in C/C++.

[1] http://www.tug.org/docs/liang/liang-thesis.pdf
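
Roughly, lookup in such a packed trie ends up looking like this. This is a hypothetical sketch with invented names (base offsets, parallel arrays), not the code from the thesis or from TeX:

  # Hypothetical sketch of lookup in a Liang-style packed trie (names invented).
  # A packing step is assumed to have filled three parallel arrays so that
  # state + ord(ch) indexes directly into one big array, with trie_char used
  # to check which state actually owns a slot:
  def packed_trie_lookup(word, trie_char, trie_link, trie_value, root=0):
      state, found = root, None
      for ch in word:
          slot = state + ord(ch)                 # direct index, no hash table
          if slot >= len(trie_char) or trie_char[slot] != ch:
              break                              # slot is owned by another state
          if trie_value[slot] is not None:
              found = trie_value[slot]           # e.g. hyphenation pattern data
          state = trie_link[slot]                # base offset of the next state
      return found

The ownership check against trie_char is what lets different states safely overlap in the same region of the array.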


The author makes a factual error regarding Norvig's version. He writes:

> Peter Norvig approaches the problem using a different way of thinking. He first stores his dictionary in a hash-table for fast lookup. Then he goes through hundreds, or even thousands of combinations of spelling mutations of the target word and checks if each one is in the dictionary. This system is clever, but breaks down quickly if you want to find words with an error greater than 1.

Norvig writes [1]:

> The literature on spelling correction claims that 80 to 95% of spelling errors are an edit distance of 1 from the target. As we shall see shortly, I put together a development corpus of 270 spelling errors, and found that only 76% of them have edit distance 1. Perhaps the examples I found are harder than typical errors. Anyway, I thought this was not good enough, so we'll need to consider edit distance 2. That's easy: just apply edits1 to all the results of edits1:
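For reference, the edit-distance-2 step he describes is just edits1 applied to everything edits1 returns; roughly, paraphrasing the Python from [1]:

  # Paraphrased from Norvig's post [1]: generate all strings within edit
  # distance 1, then apply the same function again for distance 2.
  alphabet = 'abcdefghijklmnopqrstuvwxyz'

  def edits1(word):
      splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
      deletes    = [a + b[1:]               for a, b in splits if b]
      transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
      replaces   = [a + c + b[1:]           for a, b in splits if b for c in alphabet]
      inserts    = [a + c + b               for a, b in splits for c in alphabet]
      return set(deletes + transposes + replaces + inserts)

  def edits2(word):
      return set(e2 for e1 in edits1(word) for e2 in edits1(e1))

For a typical word that second set runs into the tens of thousands of candidates, which is the real cost of pushing the approach past distance 1.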

I wrote some code for this a few months ago and tried tries, which I ultimately rejected in favor of a fast Levenshtein implementation. On a side note, the author misses one other possible avenue, which is a family of algorithms using bit vector operations [2] [3] to measure edit distance.

Anyway IIRC my code went the opposite way. It traversed the word then looked in the trie for matches. This quickly broke down as edit distance increased and I suspect this is what the author is observing in the other implementation.

The author's approach of going the other way is uninteresting idea and I might retrofit that to my code to see how it performs.

[1]: http://norvig.com/spell-correct.html

[2]: http://portal.acm.org/citation.cfm?id=846095

[3]: http://www.odinlake.net/papers/onsjo-cudastrmatch-sigal09.pd...
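
To give a flavour of the bit vector family, here is my own sketch of a Myers/Hyyro-style recurrence (not the code from [2] or [3]): each column of the DP table is packed into two bit masks of vertical +1/-1 deltas, so a whole column is updated with a handful of word-level operations.

  # Sketch of a Myers/Hyyro-style bit-vector edit distance. Each DP column is
  # represented by two bit masks of vertical +1 / -1 deltas, so one column
  # update costs only a few bitwise operations.
  def bitvector_levenshtein(a, b):
      m = len(a)
      if m == 0:
          return len(b)
      peq = {}
      for i, ch in enumerate(a):                 # bit i set where a[i] == ch
          peq[ch] = peq.get(ch, 0) | (1 << i)
      mask = (1 << m) - 1
      high = 1 << (m - 1)
      vp, vn, score = mask, 0, m                 # +1 deltas, -1 deltas, D[m][0]
      for ch in b:
          eq = peq.get(ch, 0)
          xv = eq | vn
          xh = (((eq & vp) + vp) ^ vp) | eq
          ph = vn | (~(xh | vp) & mask)          # horizontal +1 deltas
          mh = vp & xh                           # horizontal -1 deltas
          if ph & high:
              score += 1
          elif mh & high:
              score -= 1
          ph = ((ph << 1) | 1) & mask
          mh = (mh << 1) & mask
          vp = mh | (~(xv | ph) & mask)
          vn = ph & xv
      return score                               # Levenshtein(a, b)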


Interesting, thanks.

BTW, do you use speech recognition? I think you typod "an interesting" as "uninteresting" at the end there. :)


iPad. I spotted that later, too late to fix. :(


This is great. For a prepackaged implementation that works with any language with a Sphinx API, check out misc/suggest in the Sphinx tarball.

http://sphinxsearch.com/downloads/release/

It also lets you feed in your own dictionary from your index, and it works great for spelling suggestions. I haven't benchmarked it, but it seems extremely fast, as Sphinx is with most things.

Here's the suggest text file from the sphinx 0.9.9 source:

0) What's this. This sample shows how to implement simple keyword correction suggestions (ie. "did you mean") using Sphinx.

1) Requirements. You will need Sphinx, MySQL, and PHP CLI.

2) Quickstart. (Skip first indexer command to use bundled sample.)

  indexer YOURINDEX --config YOURCONFIG.CONF --buildstops dict.txt 100000 --buildfreqs
  cat dict.txt | php suggest.php --builddict > dict.sql
  mysql -u root test < dict.sql
  indexer --config suggest.conf --all
  searchd --config suggest.conf
  php suggest.php --query sphynx


Is the upper bound

  O(<number of words> * <max word length>^2)

correct? It seems to me that the upper bound should be

  O(<number of words> * <max word length>)^2

Looks like a typo.

In fact there is another fairly simple optimization one can make if one is interested only in computing the distance, and not in the actual edit sequence. If you look at the algorithm for filling up the table, you will see that each row refers only to the row before it. So one can work with just two consecutive rows, thereby substantially reducing the space complexity (though unfortunately not the time complexity).
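
A sketch of what that looks like, nothing clever, just the textbook recurrence with the table cut down to two rows:

  # Textbook Levenshtein keeping only two rows: O(len(a) * len(b)) time,
  # O(len(b)) space.
  def levenshtein_two_rows(a, b):
      prev = list(range(len(b) + 1))             # row for the empty prefix of a
      for i, ca in enumerate(a, 1):
          curr = [i]
          for j, cb in enumerate(b, 1):
              cost = 0 if ca == cb else 1
              curr.append(min(prev[j] + 1,          # deletion
                              curr[j - 1] + 1,      # insertion
                              prev[j - 1] + cost))  # substitution / match
          prev = curr
      return prev[-1]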

The time complexity can, however, be reduced on average to O(n * d), where n is the length of the longer word and d is the edit distance between the two words; this optimization is due to Ukkonen. There have been other optimizations since then; the best I believe gives a time complexity of O(n + d^2).
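
To illustrate where those savings come from, here is only the simple fixed-threshold cutoff (fill a diagonal band of width 2k+1 and give up once the whole band exceeds k), not Ukkonen's full O(n * d) algorithm:

  # Only the simple fixed-threshold band, not Ukkonen's full algorithm:
  # cells farther than k from the diagonal can never end up <= k, so skip them.
  INF = float('inf')

  def levenshtein_within(a, b, k):
      """Edit distance if it is <= k, otherwise None."""
      n, m = len(a), len(b)
      if abs(n - m) > k:
          return None
      prev = [j if j <= k else INF for j in range(m + 1)]
      for i in range(1, n + 1):
          lo, hi = max(0, i - k), min(m, i + k)
          curr = [INF] * (m + 1)
          if lo == 0:
              curr[0] = i
          for j in range(max(1, lo), hi + 1):
              cost = 0 if a[i - 1] == b[j - 1] else 1
              curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
          if min(curr[lo:hi + 1]) > k:
              return None                        # the whole band exceeded k
          prev = curr
      return prev[m] if prev[m] <= k else None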


The consecutive row optimisation is an interesting one and in practice, it has a large impact on the runtime, even though the theoretical time complexity does not change.

Having a smaller table => it can fit in the cache => better temporal locality of reference => faster runtime.


Very true. Those fast matrix and linear algebra libraries (BLAS, LAPACK) are fast not because they use different algorithms, but because they optimize for cache reuse. It's the same reason why binary search trees sometimes perform better than (unoptimized) hash tables.


It looks like you get this algorithm almost directly if you just memoise a very simple recursive Levenshtein distance function.

(This insight could be implemented directly in Haskell by using e.g. the MemoTrie package)

The standard imperative Levenshtein algorithm (with dynamic programming) is also basically just a memoised version of that recursive version -- except that the memoisation is cleared immediately after the result you originally asked for is returned.
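
In Python terms, a minimal sketch of what I mean, with lru_cache standing in for MemoTrie:

  # Memoised recursive definition; the cache plays the role of the DP table.
  from functools import lru_cache

  def levenshtein(a, b):
      @lru_cache(maxsize=None)
      def d(i, j):                               # distance between a[:i] and b[:j]
          if i == 0:
              return j
          if j == 0:
              return i
          cost = 0 if a[i - 1] == b[j - 1] else 1
          return min(d(i - 1, j) + 1,            # delete from a
                     d(i, j - 1) + 1,            # insert into a
                     d(i - 1, j - 1) + cost)     # substitute / match
      return d(len(a), len(b))

(For very long strings you would hit Python's recursion limit; the iterative table avoids that.)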


I had a post on HN a while back about an extension of this problem: matching the search term against all substrings of the corpus strings (eg searching for a misquote in a list of books).

http://news.ycombinator.com/item?id=1982351
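
The textbook tweak for that variant, for what it's worth (my own sketch, not the code discussed in that thread): let a match start anywhere in the corpus string by zeroing the first row, and read the best score off the last row:

  # Standard DP tweak for approximate substring matching: row 0 is all zeros,
  # so a match may start at any position in `text`; the answer is the minimum
  # of the last row.
  def min_substring_distance(pattern, text):
      prev = [0] * (len(text) + 1)               # matching can start anywhere
      for i, pc in enumerate(pattern, 1):
          curr = [i]                             # but the whole pattern must match
          for j, tc in enumerate(text, 1):
              cost = 0 if pc == tc else 1
              curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
          prev = curr
      return min(prev)                           # best distance to any substring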


This post impressed me a lot. Steve has improved a famous algorithm for a special but very common case by combining it with one of the coolest data structures out there. And (applying a pinch of poetic freedom here) yielded results that surpass those of the academics and giants of the field in versatility and power. All with a very terse piece of code that fits within a blog post. Very inspiring!


> And (applying a pinch of poetic freedom here) yielded results that surpass those of the academics and giants of the field in versatility and power.

I (along with a dozen other students) took an IR course where we had to implement a search engine from scratch. We used tries both to provide spelling suggestions and to build inverted indices for text. That was 7 years ago.

Not to discredit Steve, he's done great work and his article is a great resource, but it's not the groundbreaking discovery you make it out to be.

PS. Tries are awesome. <3


  yielded results that surpass those of the academics and giants of the field
That's an impressive pinch :)


Awesome.

I love tries. I implemented one last year for a bog-standard auto-complete suggestion box. While they may not be the fastest or most efficient data structures in the world, they're simple to write and have very reasonable performance.
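
For anyone who hasn't written one, the whole thing really is only a few lines; a rough sketch (not my actual code):

  # Rough sketch of a dict-of-dicts trie for prefix auto-complete,
  # just to show how little is needed.
  def trie_insert(root, word):
      node = root
      for ch in word:
          node = node.setdefault(ch, {})
      node['$'] = True                           # end-of-word marker

  def trie_complete(root, prefix):
      node = root
      for ch in prefix:                          # walk down to the prefix node
          if ch not in node:
              return []
          node = node[ch]
      results = []
      def collect(n, so_far):                    # gather every word below it
          if '$' in n:
              results.append(prefix + so_far)
          for ch, child in n.items():
              if ch != '$':
                  collect(child, so_far + ch)
      collect(node, '')
      return results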


Did I misunderstand what he said or is he hosting rhymebrain.com on a netbook ??




