Hacker News
Levenshtein Distance (wikipedia.org)
215 points by tosh on Aug 22, 2019 | 126 comments



We wrote a generalization of the Levenshtein distance algorithm for regular expressions in Hyperscan (https://github.com/intel/hyperscan), which allows you to do Levenshtein distance matching over regular expressions. It is not necessarily a cheap thing to do, given that incautious use of edit distances and permissive regular expressions can result in automata that match almost anything.

I believe that the first widely available approximate regex matcher was Tre (https://laurikari.net/tre/). However, while everybody wants to talk about approximate matching like they have something to say, most algorithmists act like they forgot about Tre.


agrep is from 1988. I heard about it some 10 years later.

https://en.wikipedia.org/wiki/Agrep says "A more recent agrep is the command-line tool provided with the TRE regular expression library. TRE agrep is more powerful than Wu-Manber agrep since it allows weights and total costs to be assigned separately to individual groups in the pattern."


You're right. I can't believe I forgot agrep (so addled was I with my "forgot about tre" pun). Our departmental Unixen machines had agrep in '91, I think.


Here's a helpful tip in case it comes in handy one day:

If you need to run Levenshtein Distance over tens of thousands of strings, perhaps as part of a fuzzy search, and you only need to match on distances of 0 or 1, not any distance in general, then there's an optimization on Levenshtein Distance where you hard code the function just for distances of 0 or 1. It's not difficult to do and the payoff might enable new applications of what would otherwise be too costly.
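
A minimal sketch of what such a hard-coded check can look like (illustrative only, written for this comment rather than taken from any particular library): with the distance capped at 1 you never need the DP table, just one linear pass.

    def within_edit_distance_one(a, b):
        # True iff Levenshtein distance(a, b) <= 1, in O(len) time and O(1) space.
        if a == b:
            return True
        la, lb = len(a), len(b)
        if abs(la - lb) > 1:
            return False
        if la == lb:
            # equal lengths: exactly one substitution is allowed
            return sum(x != y for x, y in zip(a, b)) == 1
        if la > lb:
            a, b, la, lb = b, a, lb, la   # make `a` the shorter string
        i = j = 0
        skipped = False
        while i < la and j < lb:
            if a[i] == b[j]:
                i += 1
                j += 1
            elif skipped:
                return False
            else:
                skipped = True
                j += 1                    # skip one character of the longer string
        return True                       # any leftover character of b is the single allowed insertion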


If you're fuzzily looking up members in a fixed corpus, you can also employ the trick from https://github.com/wolfgarbe/SymSpell, which is essentially just to, for each string in your corpus, enumerate all of the variants of that string with one letter removed, and put both the original string and each variant in a hashtable as a key, with the original string as the value. To do a lookup, you do the same enumeration (original query plus each variant with one letter dropped), and look all of them up in your hashtable. The values you get out of that are a list of all of the strings in your corpus at edit distances 0 or 1, and potentially some at edit distance 2; you can do a subsequent Levenshtein calculation on each to weed out the distance-2 strings, but you only have to do it on this massively reduced set rather than on your whole corpus.

So like, to index a string "hello", you'd add all of {"hello":"hello","ello":"hello","hllo":"hello","helo":"hello","hell":"hello"} into your table, and for the query "gello", you'd look up all of ["gello", "ello", "gllo", "gelo", "gell"], and get a match on "ello"->"hello", then do a Levenshtein calculation dist("gello","hello") to confirm it's within ED=1 (it is), and be done. (Bonus: the same method works with Damerau-Levenshtein distance as well.)
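
A rough sketch of that index (hypothetical helper names), following the "hello"/"gello" example above; note the lookup can also return a few distance-2 strings, which is why the final Levenshtein check is still needed:

    from collections import defaultdict

    def deletion_variants(word):
        # the word itself plus every variant with exactly one letter removed
        yield word
        for i in range(len(word)):
            yield word[:i] + word[i + 1:]

    def build_index(corpus):
        index = defaultdict(set)
        for word in corpus:
            for variant in deletion_variants(word):
                index[variant].add(word)
        return index

    def lookup(index, query):
        # candidates at edit distance 0 or 1 (plus possibly a few at 2);
        # verify each with a real Levenshtein calculation afterwards
        candidates = set()
        for variant in deletion_variants(query):
            candidates |= index.get(variant, set())
        return candidates

    index = build_index(["hello"])
    print(lookup(index, "gello"))   # {'hello'}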


I've heard this called a deletion neighborhood


Does this work if you want to look up multiple words and grade them?

For example, if the input query is "ello" and there is a dict.txt with millions of entries, it should output the words that are closest to the query: "hell", "hello", "fellowship", "mellow" and their respective matching scores (e.g. 0.8, 0.9, 0.3, 0.7 respectively). The scores are just made-up numbers in this case.

(edited for clarity)


In which case do you wind up with distance=2? I figured it'd jump out at me in an obvious way, but it hasn't.


Can't think of a real-word example, but abc would match bcd in the above algo, as both reduce to bc when one letter is deleted, yet they are distance 2. So, any case where one letter is deleted and another is added at a different position.


"Boat" and "oats" are a real-world example with Levenshtein distance 2 that would pass through the algo.


Yeah, that's the crux -- two different real words storing the same variant but having omitted different letters.


Haha, I knew the answer was going to be obvious, and it was so obviously obvious. Thanks for answering. :)


Ha ha, I imagined this algorithm for spell checking almost 20 years ago when I was wondering how Google did it.


There's a specialized algorithm for running Levenshtein distance over tens of thousands of strings called bitap[1]. I wrote an implementation in Rust a while back[2].

But your advice still applies! Run time scales with max distance.

[1] https://en.wikipedia.org/wiki/Bitap_algorithm [2] https://github.com/heyimalex/bitap
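
For reference, a sketch of the exact-match core of bitap (shift-and), written independently of the linked crate: one bit per pattern position, one register update per text character. The approximate version keeps k+1 such registers, one per allowed error, which is why run time scales with the maximum distance.

    def shift_and(pattern, text):
        # returns end indices of exact matches of pattern in text
        m = len(pattern)
        assert 0 < m <= 64, "classic formulation fits the pattern in one machine word"
        mask = {}
        for j, ch in enumerate(pattern):
            mask[ch] = mask.get(ch, 0) | (1 << j)
        r, hits = 0, []
        for i, ch in enumerate(text):
            r = ((r << 1) | 1) & mask.get(ch, 0)
            if r & (1 << (m - 1)):
                hits.append(i)
        return hits

    print(shift_and("abc", "xxabcxabc"))   # [4, 8]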


Interesting. I used a similar trick to compute a lower bound on the Levenshtein distance. I was able to skip the real Levenshtein distance calculation if the lower bound already exceeded a threshold (I used it for clustering).



Is there a comprehensive guide that explains the differences between the different libraries for fuzzy/approximate string matching?

There are SymSpell, agrep, fuzzywuzzy, closestmatch, fuzzysearch, fuzzyset, SimString, ukkonen and others on GitHub.

The problem is I don't have a clue which one to use for which case, what the cases are, or what the differences between the libraries are.

A writeup illustrating several cases with a recommendation of a library and a practical example for each one would be very helpful.

Btw, if anyone has experience with this type of stuff I would be interested in talking to you (mail is in profile).


There's also metaphone[1] for words that might sound similar but are spelled differently, for when Levenshtein might be too sensitive to spelling alone.

For example, someone might have mistyped "berry" as "barry", but you don't want to match it against "terry" (both are at a Levenshtein distance of 1).

[1] https://en.wikipedia.org/wiki/Metaphone


Thanks, I was searching the comments to see if anyone suggested a phonetic variation. The default Linux spelling checker is usually good enough, but sometimes I only remember how a word sounds, and the close letter variations it suggests are clearly wildly incorrect, while a search engine or phone is able to figure out exactly the word I mean nearly all of the time.


It's not exactly what you are looking for, but I did a blog post about the TRE regex engine which does approximate matching and included links to some other good blog posts: http://ducktape.blot.im/tre-a-regex-engine-with-approximate-...

Also, my favorite library for approx matching, which is heavy duty in some cases, is https://github.com/Martinsos/edlib

It has Python bindings and Nim bindings, among others.


Author of fuzzysearch here.

Most of these libraries focus on calculating distance metrics between pairs of strings, or finding the nearest match to a string among many candidates.

On the other hand, fuzzysearch is used for finding near matches to a string within much longer strings or texts. Common use cases are fuzzy searches in DNA sequences and long texts such as books and articles.


I found the readme of https://github.com/tdebatty/java-string-similarity particularly helpful, and of course the (Java) lib itself. It's not comprehensive by any means, but it gives a good starting point


At my first professional programming gig the company I worked for was selling items on Amazon. The listings were big ticket items that had several sellers competing on them. Any seller could change some of the details of the listing, so suddenly all their competition would be shipping the item with wrong specs. I wrote a program that periodically grabbed the titles of our live Amazon listings and compared them to our database of what we actually wanted the title to read. I used Levenshtein Distance as a sort of "severity of change" metric, the program would sort the changes accordingly and send an email to a person on the sales team. It was fun implementing Levenshtein Distance, but it wasn't a perfect metric for this use case. Some of the most severe changes would be a single digit changing; e.g. "2gb" vs "8gb"


You can modify the basic algorithm to weight changes differently. I had the opposite case--differing numbers were more likely to be what the user was looking for. I made digit changes cost 1 point, one string being longer cost 1 point per character (suffixes were another case where it was likely the objective) and everything else cost 2.
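
A sketch of that kind of cost tweak; the exact costs below are guesses meant to mirror the description (digit-for-digit substitutions cost 1, other substitutions 2, every extra character 1), not anyone's production weighting:

    def weighted_edit_distance(a, b):
        INDEL = 1   # each inserted/deleted character, so a longer string pays 1 per extra char

        def sub_cost(x, y):
            if x == y:
                return 0
            if x.isdigit() and y.isdigit():
                return 1    # digit swaps are the "cheap" case here
            return 2

        prev = [j * INDEL for j in range(len(b) + 1)]
        for i, ca in enumerate(a, 1):
            cur = [i * INDEL]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + INDEL,
                               cur[j - 1] + INDEL,
                               prev[j - 1] + sub_cost(ca, cb)))
            prev = cur
        return prev[-1]

    print(weighted_edit_distance("2gb ram", "8gb ram"))    # 1: digit-for-digit swap
    print(weighted_edit_distance("2gb ram", "2gb ram x"))  # 2: two extra characters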


Someday I want to write a diff utility specifically for code. Whitespace addition/removal cost like .1, addition/removal of a curly brace .5. There are a handful of other things default diff does that make it really annoying for code.


I did this for a school project, because students would "cheat" on group projects by exclusively changing whitespace in diffs, or moving code around, thereby making it look like they had large commits.

We tracked code "moves", whitespace changes, and "trivial" changes via Levenshtein distance--which isn't great for capturing most simple renamings--and then marked them up separately in an "augmented diff".

We briefly looked into doing AST edit-distance, but that turned out to be infeasible for many reasons.

Over-all it helped the prof grade faster, at least in the degenerate cases.


Measuring a student's impact on a project by the number of lines of code they committed can be very misleading.

Sure, in the extremes it will give sensible results (someone wrote only 3 lines, or wrote 90% of the code base).

But minor, questionably motivated refactorings can create huge commits, while someone may come up with some great design while walking around for hours and sketching on paper, which all ultimately boils down to a short function.

Dumb metrics like this cannot replace actually talking with the students, asking questions about their choices and alternatives, having them do presentations or lab notebooks/logbooks etc.


> Dumb metrics like this cannot replace actually talking with the students, asking questions about their choices and alternatives, having them do presentations or lab notebooks/logbooks etc.

As said in the parent:

>> Over-all it helped the prof grade faster, at least in the degenerate cases.


Diff already has a number of options for ignoring whitespace changes:

https://www.gnu.org/software/diffutils/manual/html_node/Whit...


I never actually used any of these, but have stumbled upon some tools which create/apply diffs on ASTs of your source rather than lines[1]. I think the keywords to search is "semantic diff/patch/merge", "AST diff" or some combination of these.

[1] https://lwn.net/Articles/315686/


Same thing: "Math, 4th grade" and "Math, 5th grade" are very different books, even if they are by the same author :)


I was once asked in an interview how I'd find a row in a relational database where the value in one column was the shortest levenshtein distance away from a search key, and I was absolutely stumped when I realized I didn't really know a sort/search that optimized that way. Of course, by the time I realized I could email the interviewer for the answer an awkward amount of time had passed.

So now I ask HN: is there any "good" way to do this short of some kind of answer like "find a database that natively offers levenshtein distance search"? Are there actually sorting algorithms for sorting by levenshtein distance (other than just using L as a sort function in a traditional sort function)? Are there search algorithms that work well for searching by L distance?


Depending on the RDBMS and the target text (ideally, you’re looking to compare an entire column to the search text, and in a known character set, and known casing, etc.), I’d take one of two approaches:

1) write a UDF—in C, or for SQL Server, in pretty much any managed language—that returns the Levenshtein distance between a passed param and a column value, with a “max distance” param to aid in optimizing (e.g., maybe it doesn’t bother calculating the distance if the length difference exceeds the max Levenshtein distance); see the SQLite sketch after this comment

Then you could write a query that looks something like:

    SELECT product, description FROM products WHERE levenshtein('fobar', product, 2) <= 2
-or-

2) if you only care about a distance of 1 or 2, create a set of secondary tables where you’d pre-calculate the permutations of every word, with a FK pointing back to the PK of the actual row.

Alternately, a combination of a Levenshtein distance function and an RDBMS that already supports fuzzy text matching seems (to me) to be ideal.
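
As a toy illustration of approach 1, here is roughly what registering such a UDF looks like in SQLite from Python; the products table, the data, and the 3-argument signature are assumptions made up for the example:

    import sqlite3

    def levenshtein(a, b, max_dist):
        # plain DP distance with a cheap length-difference bail-out,
        # mirroring the "max distance" param idea above
        if abs(len(a) - len(b)) > max_dist:
            return max_dist + 1
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (product TEXT, description TEXT)")
    conn.execute("INSERT INTO products VALUES ('foobar', 'a foo for your bar')")
    conn.create_function("levenshtein", 3, levenshtein)

    rows = conn.execute(
        "SELECT product, description FROM products "
        "WHERE levenshtein('fobar', product, 2) <= 2").fetchall()
    print(rows)   # [('foobar', 'a foo for your bar')]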


I think sometimes it is easy to forget that SQL is but an abstraction over algorithmic processing and shaping of data. A UDF is only a way to extend the built-in algorithms, albeit one which may not be fully integrated with a given query engine architecture and therefore cannot be fully optimized without access to the engine's internals.

I would argue your answer is the best as far as doing an actual Levenshtein distance calculation goes.

That being said, sort orders are implemented algorithmically by the query engine. ORDER BY in SQL Server, as I recall, chooses among up to 7 different built-in algorithms, depending on the case. But those algorithms are optimized for the engine and have access to the internals.

I can think of two interesting ways to look at this. One is how you could use an open source database to extend the built-in sort algorithms in the query engine, perhaps controlled by collation settings, to get a native Levenshtein sort implemented that has access to internals. Collations define the rules for sort order, but this would probably be within a column, not against a value (although if you're extending the query engine, why not). See [1] for an example of collation settings and how they control the rules of sort order - perhaps a SET command to define Levenshtein order and then a native extension to the Postgres query engine, and now you can sort a column by Levenshtein distance.

The other thing I was thinking of was word vectors (though that plays a similar role to Levenshtein distance), and in a relational database that still requires something special. Maybe some of the full text search implementations in databases could do this out of the box.

[1] https://docs.microsoft.com/en-us/sql/relational-databases/co...


I don't think that there is a way to index by levenshtein distance generally because there is a many-to-many mapping between search terms and keys to match and there is potentially an infinite number of either.

What I've seen databases that need to do fuzzy matching on strings do more often is index a list of substrings of keys and match on those. So hypothetically if I had a row identified with the string 'abc', I would have a separate table with the strings: 'a', 'b', 'c', 'ab', 'bc', 'abc' all mapped to the row in question and an index built on them.

If you had a cap on the edit distance you knew was acceptable for searches, you could do something similar with levenshtein distance, where you could save every string with an edit distance less than say, 3, to a separate table along with the edit distance. This would be expensive in terms of space though, if your keys were very large. I suppose you could optimize it by just pre-computing the distance on a prefix (say the first three or four characters) and using that to narrow the search space.


Isn't that an n-gram search? I think postgres can do this.


I know that Elasticsearch does it natively. I'm not sure about Postgres. SQL Server is my main RDBMS currently.


Yep - using the pg_trgm extension. https://www.postgresql.org/docs/current/pgtrgm.html


Wouldn't call it 'good' but it would be better than applying L on each row.

Instead, enumerate all unique variants of the search term, calculate each one's Levenshtein distance, and then search through for a match... doing it one by one, starting from distance 0 up to the max, and returning the first match.

If the variable driving the complexity is the number of rows, then it's better to enumerate all the variants up front and do a single search over all rows, ordering results by distance and returning only the first match.


I wonder if the Wagner-Fischer algorithm could be vectorized using substring functions... It doesn't seem trivial. https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorit...


Postgres has a function to calculate levenshtein distance.


Levenshtein distance is a metric [1] and discrete metrics can be indexed with BK trees, taking advantage of the triangle inequality. For a detailed explanation, see [2].

I know of no real world implementation of this, however.

[1] https://en.wikipedia.org/wiki/Metric_(mathematics) [2] http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK...
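
For what it's worth, a small BK-tree sketch (illustrative, not tuned for production) showing how the triangle inequality prunes the search:

    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    class BKTree:
        # children are keyed by their distance to the parent, so only children
        # with edge labels in [d - tol, d + tol] can contain matches
        def __init__(self, dist=levenshtein):
            self.dist = dist
            self.root = None            # (word, {edge_distance: child})

        def add(self, word):
            if self.root is None:
                self.root = (word, {})
                return
            node = self.root
            while True:
                d = self.dist(word, node[0])
                child = node[1].get(d)
                if child is None:
                    node[1][d] = (word, {})
                    return
                node = child

        def query(self, word, tol):
            results, stack = [], [self.root] if self.root else []
            while stack:
                node_word, children = stack.pop()
                d = self.dist(word, node_word)
                if d <= tol:
                    results.append((d, node_word))
                for edge, child in children.items():
                    if d - tol <= edge <= d + tol:   # triangle inequality prune
                        stack.append(child)
            return results

    tree = BKTree()
    for w in ["hell", "hello", "help", "shell", "felt"]:
        tree.add(w)
    print(tree.query("helo", 1))   # (1, 'hell'), (1, 'hello'), (1, 'help') in some order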


If you only care about the row with the shortest distance (and assuming no precomputed optimizations) you can save compute cycles by exiting the algorithm as soon as the distance is guaranteed to be larger or equal to the shortest distance thus far. For very large searches the algorithmic complexity drops to roughly linear in the number of rows, the length of the search string, and the shortest distance thus far (the distance to any strings whose length differs by more than the shortest distance thus far need not be computed at all).
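
A sketch of that early-exit idea, assuming a plain linear scan over the candidates: the running best distance doubles as a length prefilter and as a bound that aborts the DP once a whole row reaches it (every later cell can only be at least the row minimum).

    def bounded_levenshtein(a, b, bound):
        # returns the distance if it is < bound, else None as early as possible
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            if min(cur) >= bound:
                return None                 # cannot beat the best match so far
            prev = cur
        return prev[-1] if prev[-1] < bound else None

    def closest(query, corpus):
        best, best_dist = None, float("inf")
        for s in corpus:
            if abs(len(s) - len(query)) >= best_dist:
                continue                    # length difference alone rules it out
            d = bounded_levenshtein(query, s, best_dist)
            if d is not None:
                best, best_dist = s, d
        return best, best_dist

    print(closest("leviathan", ["levitation", "lesothan", "leviathans"]))   # ('leviathans', 1)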


Not sure how it's supposed to be solved, but I thought of a cute hack in the special case where the character set of the strings is limited, e.g. lowercase ASCII only.

Let A be the alphabet used. Precompute the "letter sum" of each string and store it. Then, given two strings with letter sums c1 and c2, they cannot be within Levenshtein distance k if |c1 - c2| > k|A|. This could allow you to skip a lot of rows.
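
A tiny sketch of that prefilter, assuming lowercase letters valued 1..26 (so |A| = 26):

    def letter_sum(s, alphabet="abcdefghijklmnopqrstuvwxyz"):
        # any single edit changes this sum by at most |A|
        return sum(alphabet.index(c) + 1 for c in s)

    def maybe_within(sum_a, sum_b, k, alphabet_size=26):
        # necessary (not sufficient) condition for edit distance <= k
        return abs(sum_a - sum_b) <= k * alphabet_size

    corpus = {w: letter_sum(w) for w in ["cat", "dog", "zebra"]}
    qsum, k = letter_sum("cap"), 1
    print([w for w, s in corpus.items() if maybe_within(qsum, s, k)])
    # ['cat', 'dog'] -- 'zebra' is skipped without any distance calculation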


I may be misunderstanding you, but an orderless sum of characters in a string won't be an effective prefilter for any string similarity algo.

Think "atc" and "cat". Same sum.


The sum forms a necessary but not sufficient condition for being within a certain Levenshtein distance. In your example, the inequality I gave above does not apply since |c1-c2| = 0. You would have to calculate the Levenshtein distance. In cases where the inequality is satisfied, you do not have to calculate the Levenshtein distance at all.

The idea is that any edit operation on a string will at most change the letter sum by |A|.


That may be the wrong question to ask. If you do not need the exact Levenshtein distance, but just some measure of similarity, then there are a lot of options. For example, Postgres has index-supported similarity based on the number of common trigrams.

https://www.postgresql.org/docs/11/pgtrgm.html


Yeah, pg_trgm works really well for detecting typos or almost-matches.


I can't think of any clever optimizations, unless there are some convenient constraints on the database or search query values. You could just precompute a lot of stuff to help you narrow down the search, but that's really not going to be feasible unless you have some constraints or some metrics on what sort of searches are being performed.


Suffix trees/arrays are a cool data structure for fuzzy text matching. There is a Linux tool called 'sary' that implements approximate grep, but it takes some time and space to convert plain text to suffix array format.

http://sary.sourceforge.net/


Sort/index the db table by string length, start your search with the closest-length strings (bigger and smaller), and then stop if you find a Levenshtein distance that's smaller than the length difference between the key and the next closest-length string? In many (most?) cases you'd still end up searching the whole table, but at least in some cases you could end the search early.



BTW, this algorithm is implemented in ClickHouse for fuzzy string matching, employing a rather clever technique for SIMD optimization: https://github.com/yandex/ClickHouse/blob/a9cfe4ce91b5cdfedb...


My first job at Microsoft was to make a system that could catalog the “contracts” or “extension points” installed in Windows. I’d define a category of extensions, such as a COM object registration, and my program used normalized Levenshtein distance to warn about potential typos. After removing all matching field names from the instance & class, I compared the edit distance in the remaining cross product. If any had a normalized distance < N it produced a warning. It even helped me find a mistyped GUID in a prerelease copy of Windows.


We use the Damerau-Levenshtein edit distance for fuzzy matching in our data prep tool. Here is a post about how it works: http://blog.easymorph.com/2017/08/fuzzy-matching-and-merging...

Damerau-Levenshtein [1] is more practical than plain Levenshtein as it includes transposition - a common cause of typos.

[1] https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_di...


I love this algorithm -- it's so elegant and succinct.

Slightly off-topic, but one of the algorithms I'm most proud of coming up with as an engineer was an offshoot of Levenshtein distance where I had to find all pairs of strings in a set of ~1m strings which were fairly similar. 1m x 1m x O(n^2) is very slow. There turned out be a good way to figure out a lower bound on distances using some precomputation, and that eliminated almost all of the pairwise O(n^2) Levenshtein computations.

If anyone's interested, here's the description: https://www.quora.com/What-is-the-best-programming-algorithm...


Your sample use case in your 'At Google' response is eerie.

> "At Google, say we had 1 million plaintext passwords..."


That wasn't the actual use case and we didn't store passwords in plaintext. But I didn't want to give away the actual use case. In hindsight, passwords are definitely a bad fake example.


The only application I've used this for so far is a technique for defining and recognizing gestures. They're called "flash gestures" (or that's what the page where I read about them called them): you find the closest of 8 directions for each part of the gesture, encode the gesture as a sequence of digits 1-8 corresponding to those closest directions, then use the Levenshtein distance algorithm to find the closest matching pattern.

EDIT: Found the reference I used to implement this.

https://books.google.com/books?id=_1rSBQAAQBAJ&pg=PA85&lpg=P...
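
A rough sketch of the encoding step as I understand it; the coordinate convention (y grows downward), the direction codes, and the templates are all made up for illustration:

    import math

    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def encode(points):
        # quantize each segment of the stroke to the nearest of 8 directions, coded '1'..'8'
        codes = []
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            angle = math.atan2(y1 - y0, x1 - x0)             # -pi .. pi
            sector = int(round(angle / (math.pi / 4))) % 8   # nearest 45-degree step
            codes.append(str(sector + 1))
        return "".join(codes)

    def recognize(points, templates):
        # templates: gesture name -> pre-encoded direction string
        code = encode(points)
        return min(templates, key=lambda name: levenshtein(code, templates[name]))

    stroke = [(0, 0), (0, 10), (0, 20), (10, 20)]            # down, down, right: an "L"
    templates = {"L": "331", "line": "3333", "swipe_right": "111"}
    print(recognize(stroke, templates))                      # 'L'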


Here's my attempt to implement Levenshtein Distance in Go, with lots of comments:

https://github.com/cbartley/diffy/blob/master/src/diffy/diff...


Very nice! If you're looking for one more variant to add: LD with a limit on the acceptable distance can be done more efficiently, and is nearly always what you want in practice.

I wrote a commented version years back that might be a handy reference; it's since been deprecated because we moved it out of StringUtils, but the original code is here https://github.com/apache/commons-lang/blob/master/src/main/...


> This implementation only computes the distance if it's less than or equal to the threshold value, returning -1 if it's greater. The advantage is performance: unbounded distance is O(nm), but a bound of k allows us to reduce it to O(km) time by only computing a diagonal stripe of width 2k + 1 of the cost table. It is also possible to use this to compute the unbounded Levenshtein distance by starting the threshold at 1 and doubling each time until the distance is found; this is O(dm), where d is the distance.

That's pretty cool, especially the doubling scheme. I'm using a modified form of Levenshtein Distance for comparing lines when diffing files, and that's pretty expensive since code files that are thousands of lines long are not uncommon. Since you are usually comparing one file to another version of itself, the differences are often small though, so an incremental approach would really pay off.
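
For reference, a sketch of the thresholded idea quoted above, written independently of the linked Java code: only a diagonal stripe of width 2k + 1 is filled in, and -1 is returned when the distance exceeds k.

    def levenshtein_limited(a, b, k):
        if abs(len(a) - len(b)) > k:
            return -1
        INF = k + 1                      # any value > k is "too far"; clamp to keep numbers small
        prev = [j if j <= k else INF for j in range(len(b) + 1)]
        for i in range(1, len(a) + 1):
            lo, hi = max(1, i - k), min(len(b), i + k)
            cur = [INF] * (len(b) + 1)
            if i - k <= 0:
                cur[0] = i
            for j in range(lo, hi + 1):
                cur[j] = min(prev[j] + 1,                           # deletion
                             cur[j - 1] + 1,                        # insertion
                             prev[j - 1] + (a[i - 1] != b[j - 1]),  # substitution
                             INF)
            prev = cur
        return prev[len(b)] if prev[len(b)] <= k else -1

    print(levenshtein_limited("kitten", "sitting", 2))   # -1 (the real distance is 3)
    print(levenshtein_limited("kitten", "sitting", 3))   # 3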


We used this when building an app to determine 'meaningful' discrepancies between preliminary radiology reports (done by residents) and final reports (done by attending radiologists). It's a fun algorithm to implement that isn't as simple as it seems. I think there are lots of uses for it.


+1


Is Levenshtein distance a "metric" in the sense of "metric spaces"? For example, the taxicab metric.

https://en.wikipedia.org/wiki/Taxicab_geometry

https://en.wikipedia.org/wiki/Metric_(mathematics)

Also, this looks like it could be related to Hamming codes? Error-correcting codes.

https://en.wikipedia.org/wiki/Hamming_distance


Yes, it is a metric. The following is a BS "proof by restatement of requirements".

1. Always >= 0 (what is a negative character edit?)

2. It satisfies identity, in that the distance is 0 iff no edits are required to make the strings equal, i.e. iff they are equal (applies to either ordering of strings, A->B and B->A)

3. It is symmetric (run the edits backwards to get B->A instead of A->B)

4. It satisfies the triangle inequality. Exercise left to the reader, but intuitive, since the edit distance is always the "shortest path" through character changes between two strings A and B, and deviating to visit an intermediate string C would not decrease the number of edits: A->C->B is never shorter than the original path A->B

I'm sleep deprived due to "offspring induced insomnia" so take this with a grain of salt.


I became intimately acquainted with Levenshtein's distance for my summer internship project, and have implemented it myself a few times for interviews. It was central to an event de-duplication system I built. Amazing how such a seemingly trivial idea can actually be a) non-trivial to implement and b) be so powerful.

EDIT: By 'seemingly trivial', I mean that if you explain it to a lay person it seems easy at first blush - not that it is actually trivial.


The problem with "trivial" is that it's in the eye of the beholder. And everything can seem "trivial" after you've studied it long enough ;-) For instance, I'd say the matrix implementation of Levenshtein is quite trivial...


I've used Levenshtein distance to sort fuzzysearch results by order of relevance. Worked very well combined with a special-case to sort exact matches first.


I used the Jaro-Winkler algorithm[0] for this and it worked pretty well. It ranks common prefixes higher.

[0]: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance


Why would a special case be necessary? The edit distance for two identical strings is zero.


It's an autocompletion for emoji, so several letters will be missing regardless.

In that case, a long word like "helicopter" will be placed at a greater distance than "elipsis" despite being what I want "heli" to actually match, so I just added a "startsWith" check to override such issues.


Have you experimented with n-grams for approximate matching? I've found it works very well, especially if you add special characters to the alphabet to mark "start of string" and "end of string": like "heli" would have trigrams {"^he", "hel", "eli", "li$"}, which would certainly be a better trigram match to "helicopter" than to "elipsis" (sic).

https://en.wikipedia.org/wiki/N-gram#n-grams_for_approximate...
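
A small sketch of that boundary-marked trigram idea, scored with Jaccard similarity over the trigram sets (the "^" and "$" padding characters are arbitrary markers):

    def trigrams(s):
        padded = "^" + s + "$"
        return {padded[i:i + 3] for i in range(len(padded) - 2)}

    def similarity(a, b):
        ta, tb = trigrams(a), trigrams(b)
        return len(ta & tb) / len(ta | tb)

    print(similarity("heli", "helicopter"))   # ~0.27
    print(similarity("heli", "elipsis"))      # 0.10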


I haven't, mostly because it's fast enough already and I don't really feel like implementing this stuff myself.


Postgres has support for trigrams and indexes on them. I used them to help match addresses at a previous company, worked fantastically for the part we used them for.


It's a client-side library.


Weight missing characters very low.


Probably a case of speed: edit distance can take longer to calculate even when the distance is 0.


Yes, I liked its implementation in the fuzzywuzzy Python library. Worked like a charm when trying to match street addresses and business names.


I use the Levenshtein distance for NLP data cleaning/spell correction. I also use a Bloom filter that can tell me when a word definitely does not exist in my corpus, so I don't have to run the algorithm for every word.


This link was gray for me because I recently used Levenshtein distance to sort an arbitrary list of strings by finding a "comparison word" that gave the ordering that I wanted. Definitely more of a cutesy (and difficult to explain) way of doing things than a practical one, but CASE statements are so boring. It was nice that Presto had it built in.

Did some more digging to see if there was a unique comparison word that would enable me to sort any group of strings but found that it was impossible. Still a fun afternoon of learning.


The names 'edit distance' and 'levenshtein distance' are a bit unfortunate.

When I was trying to match batches of OCR'd strings to known strings, I knew some algorithm like this had to exist, but I could not find it. My searches kept turning up Jaccard similarity and Hamming distance again and again.

(Once I found it and began working on an implementation, I started noticing my compiler would find mistyped variable names and suggest the correctly-spelled variable for me.)


There are a lot of unhelpfully named metrics, algorithms, etc. but I actually think “edit distance” is fairly illustrative as these things go.

Another good one is Manhattan distance (or better yet, city block distance) in place of Hamming distance. It gives a nice mental picture of the number of road segments you’d need to travel in a city to get from point A to point B.


One step further (multi-layer Manhattan distance) will also get you optimal DNA, RNA and protein sequence alignment (Myers, Gotoh and others, 1977-1981). The edit path (often discarded) can be repurposed for simulating RNA and protein folding, as well as calculating DNA melting points (used for DNA-DNA hybridization studies).


Also known as the "taxi cab" metric... as in how a taxi cab would traverse a city along parallel/perpendicular streets.


Googling "distance between two strings" has Levenshtein distance as the top result.


What type of algorithm does Algolia use for its search (HN search specifically)?

Is it anything like this (typeahead example)? https://rawgit.com/jeancroy/FuzzySearch/master/demo/autocomp...

What is the name for this kind of approximate string matching?


Algolia's engine relies on an optimised Damerau-Levenshtein + prefix matching algorithm.


For some reason the wikipedia article does not mention the more efficient Myers diff approach. I have recently implemented it here: https://jasonhpriestley.com/array-distance-and-recursion (for edit distance but Levenshtein should work similarly)


I'm using Levenshtein distance combined with Trie/DAWG in my iOS keyboard, to detect typos and offer suggestions


If you consider strings as lists of characters, you can abstract Levenshtein to work on any two lists. For example, I use this with lists of types to provide a Hoogle-like search for some programming language's stdlib. Ironically, I don't know what Hoogle actually uses, but I think they at least considered such an approach.


I'm surprised this page and the page on "string metrics" don't say more about the mathematical implications of defining a distance metric on strings. Something interesting must pop up if you apply theorems about metric spaces from analysis and topology to the set of strings with edit distance.


This metric induces the discrete topology, which isn't very useful. Regarding its metric structure, you could apply the contraction mapping theorem but I can't think of any interesting contractions off the top of my head.

I'm just spit-balling here... but when a natural metric induces the discrete topology then it's probably more useful to look at connectivity graphs induced by the metric. For instance the graph on all strings with two strings connected by an edge if their edit distance equals 1. Restrict to words in a dictionary and you have a basic spelling suggestion algorithm.

Edit: the analyst in me really wants a translation-invariant abelian group structure ... then we can do Fourier analysis...


> Restrict to words in a dictionary and you have a basic spelling suggestion algorithm.

Yep.

https://norvig.com/spell-correct.html


I’m kind of tacking this on here, but does anyone have a good recommendation for notes or a book on applied topology for someone with a decent but not extensive background in applied math? In particular, my goal is to more deeply understand manifold learning techniques like MDS and UMAP.


The upside of it being metric is that you can optimise lookups by storing the strings in a BK-tree.


The implications are more practical, rather than theoretical. As others have said, the theoretical implications are mostly for continuous spaces, which strings are not.

From the practical side, there are a lot of algorithms designed to work on vectors that can work on sets of strings. It depends on whether the algorithm really needs to use the underlying vector, or whether it just uses distances.

For example: Clustering! By using string metrics, you can group sets of strings into clusters using standard algorithms. Some algorithms like K-means won't work (in general, you can't find the string half-way in-between two strings), but some algorithms can apply, especially hierarchical clustering.

I think you could apply MDS to sets of strings the same way, but I'm a little less familiar with the guts of that algorithm.
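
A sketch of that clustering use, assuming SciPy is acceptable; the linkage step only ever sees the precomputed pairwise distances, never any "coordinates" for the strings (the word list and the cut threshold are made up):

    import itertools
    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    words = ["color", "colour", "colr", "centre", "center", "banana"]
    n = len(words)
    dist = np.zeros((n, n))
    for i, j in itertools.combinations(range(n), 2):
        dist[i, j] = dist[j, i] = levenshtein(words[i], words[j])

    Z = linkage(squareform(dist), method="average")     # agglomerative clustering
    labels = fcluster(Z, t=2.5, criterion="distance")   # cut where clusters are > 2.5 apart
    for word, label in zip(words, labels):
        print(label, word)
    # color/colour/colr group together, centre/center group together, banana stands alone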


Yes. I have used clustering on Levenshtein distance in this way to find misspellings in user-entered data fields, in order to canonicalize the data for more analysis.


Had to use this to improve the speed of our admin user search after the table got fairly large (>1mil). The biggest problem was that it's a general search that searched display_name as well as email. We ended up cleverly omitting the email search unless '@' was included in the search string.


This was the first algorithm I ever searched for on wikipedia, back when I first learned to code. Good memories.


And then there's that strange yet oddly persistent certainty that back then it was called the Levenshtain distance.


lucky you :) there was no wikipedia when I first ran into this algorithm. Foxpro for DOS & Novell Netware had an implementation of Levenshtein distance that I ended up using as part of a college project on database basics. I used it to detect typos in data entry by comparing entered names with all names already in the table. It was blazingly fast because we never exceeded 30 names (the size of our class) and my profs were super impressed because they thought I did something way more complicated than I had.


Ugh. At a past job I was tasked with matching records, based on street addresses and business names, from various systems. Levenshtein Distance was pretty instrumental in helping find potential matches once we had cleaned up the address format and business name as much as possible.


I did the same thing. I was merging our billing db with that of a business we had just bought. Due to the nature of our business, customers could exist in both.

So I had a routine that normalized the address, as best I could anyway (it didn't help that it was all just one varchar field). Then I implemented Levenshtein distance, messed with the weighting a bit to fit our particular data, and away it went. Saved a bunch of headaches. It wasn't perfect, but it was better than hand-matching a couple thousand accounts.


Another cool metric that's used commonly when measuring quality difference between two texts is a BLEU score (https://en.wikipedia.org/wiki/BLEU)


Yup, it's used a lot in machine translation to compare different systems, and to a lesser extent in natural language generation (where it's completely inappropriate...)


For numerical sequences, see dynamic time warping (DTW) algorithm.

https://en.wikipedia.org/wiki/Dynamic_time_warping


What other languages have it built in?

https://www.php.net/manual/en/function.levenshtein.php



We have implemented and are using this in MarkLogic:

https://docs.marklogic.com/spell.levenshteinDistance


I like using Levenshtein distance for building out Bayesian networks. It works well in conjunction with other distance/similarity metrics that are better suited to continuous data.


I stumbled across the concept of Damerau-Levenshtein about two years ago. Applying it to a large database of very messy consumer data in Pandas returned near-miraculous results!


OT: Why are so many mathematical objects named after some dudes and not after what they represent, i.e. in this case edit distance? Never understood this obsession.


Historically, at least to my eyes, this seems to derive from how mathematics has been cited by mathematicians.

Before a thing is called 'Levenshtein Distance' it will be used by someone saying 'calculating distance between these two strings the way Levenshtein did in [0]'. Eventually, when enough people do this, or someone gets particularly bold, the method will instead be given its name (occasionally, the original author will be so bold, but that seems rare). If I were looking to procrastinate more than I already am I would go find some examples of this, examples that unfortunately don't spring immediately to mind though I'm sure I remember their shadows.

As to why they are not named for what they represent - it's only after the thing has been invented, tested, and named that we realise that it is the correct or best way to do a thing. Often there are multiple ways of doing things (this is Levenshtein's edit distance) and so they are named to distinguish them from each other. The survivor is not renamed just because the others are forgotten.

[0] not a real reference


I believe Turing Machines are an example. Turing originally called them "automatic machines" or a-machines.


I think the main reason is that there is no short phrase like 'edit distance' that can be used to describe the function/structure of most mathematical objects/concepts/etc.


I think it might be better this way, using people's names, basically because names are fairly meaningless but easier to remember than a UUID. Otherwise we'd constantly be feeling that things were misnamed. E.g. something might have been the best edit distance algorithm twenty years ago, but there's been a better one for 19 years, and it has to have a convoluted name due to the primacy of the first name.


103 comments and not one mention of line coding!

This distance metric is exceptionally useful when picking the states to map to in line codes with error correction/detection.



This is a major part of how https://safepass.me/ works.


I used this to implement a spam detection system a few years back. It was super helpful.


Anyone know of a library that is good for OCR post processing (correcting errors)?


What's the difference between Levenshtein distance and Hamming distance?


https://en.wikipedia.org/wiki/Edit_distance#Formal_definitio...

> Given two strings a and b on an alphabet Σ (e.g. the set of ASCII characters, the set of bytes [0..255], etc.), the edit distance d(a, b) is the minimum-weight series of edit operations that transforms a into b. One of the simplest sets of edit operations is that defined by Levenshtein in 1966:
> Insertion of a single symbol. If a = uv, then inserting the symbol x produces uxv. This can also be denoted ε→x, using ε to denote the empty string.
> Deletion of a single symbol changes uxv to uv (x→ε).
> Substitution of a single symbol x for a symbol y ≠ x changes uxv to uyv (x→y).

...

> Other variants of edit distance are obtained by restricting the set of operations. Longest common subsequence (LCS) distance is edit distance with insertion and deletion as the only two edit operations, both at unit cost. Similarly, by only allowing substitutions (again at unit cost), Hamming distance is obtained; this must be restricted to equal-length strings.
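
A quick illustration of the difference: Hamming counts substitutions only (and needs equal lengths), while Levenshtein can also insert and delete.

    def hamming(a, b):
        assert len(a) == len(b)
        return sum(x != y for x, y in zip(a, b))

    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    # "shifted" strings: every position differs, but one deletion plus one
    # insertion lines them back up
    print(hamming("abcd", "bcda"))       # 4
    print(levenshtein("abcd", "bcda"))   # 2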


ripgrep [Discussion] Fuzzy Search

  https://github.com/BurntSushi/ripgrep/issues/1053


I've been playing around with the data made available [1] by the states that are part of the Streamlined Sales Tax system [2].

The boundary file address records include, among other things:

  address range start
  address range end
  odd/even/both flag
  directional prefix
  street name
  street suffix
  directional postfix
  zip code
The address range specifies the range of numerical addresses that the record applies to, and the odd/even/both flag indicates if it is for just the odd addresses, just the even addresses, or both.

The direction prefix and suffix are supposed to be one of {blank, N, S, E, W, NE, NW, SE, SW}, but some states botch that and I've seen EA, EB, NO, SO, and WE end up in there.

The suffix field is for things like LANE, AVE, ST, and the like. It is a 4 character field, so a street that residents think of as ELM AVENUE would be name=ELM suffix=AVE in the boundary file. Or it would be name="ELM AVENUE" with no suffix. (There are even some with AVENUE in the name and something else in the suffix, such as ROOSEVELT AVENUE ALY in Albany GA. The ALY is the suffix). (There are 318 distinct suffixes across all the SST states, for the curious [3]).

Anyway, my point is that the data is kind of loose.

I want to be able to take what a user puts on an address form:

  address
  city
  state
  zip code
and find a good match in the boundary files. Most people manage to put something reasonable in the city, state, and zip code fields, and manage to put their street address at the start of the address field, so I can pretty reliably get the zip code and the street number. It's the rest of the address line that is problematic.

At first I was trying to figure out what in there might be prefix, postfix, and suffix to isolate just the street name. But because of both the aforementioned looseness of the boundary data, and differences between how the user might think of their address and how their state thinks of it, this didn't work too well.

I was considering keeping an index that maps words that occur in street names to those street names, and then looking for the longest word in the user supplied address and looking up all streets in the user's zip code that contain that word and that include the user's street number. If that returned more than one match, the idea was then to look at the prefix, postfix, and suffix and see how well those matched with things in the user address. Looking at a few addresses that had given me trouble, though, it looked like this plan to resolve them would still be tricky and unreliable.

So then I decided to try a Levenshtein distance approach. Given zip, street number, street, where street is whatever the user included in the "address" field after the street number, I pull up all records in my boundary DB that match on zip code and street number. I don't even include street name in the query.

Then for each result, I take the prefix, name, suffix, and postfix and join them separated by space. I then simply take the one with the smallest Levenshtein distance from what the user supplied. With this, I no longer care if something is name=ELM, suffix=AVE or name="ELM AVE", suffix="".

This has worked remarkably well. Even if the user throws on stuff that doesn't belong (some people manage to put the city, state, and/or zip in the address box) it usually doesn't throw it off. It raises the Levenshtein distance of the correct match by sometimes a lot...but it also raises the incorrect match distances too, so the correct one still wins.
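
A stripped-down sketch of that matching step (the field names, rows, and normalization are hypothetical):

    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def best_street_match(user_street, candidate_rows):
        # candidate_rows: boundary records already filtered on zip code + street number;
        # joining prefix/name/suffix/postfix means it no longer matters whether a state
        # stored name=ELM suffix=AVE or name="ELM AVE" suffix=""
        def joined(row):
            return " ".join(p for p in (row["prefix"], row["name"],
                                        row["suffix"], row["postfix"]) if p)
        return min(candidate_rows,
                   key=lambda row: levenshtein(user_street.upper(), joined(row)))

    rows = [
        {"prefix": "", "name": "ELM", "suffix": "AVE", "postfix": ""},
        {"prefix": "E", "name": "MORRILL", "suffix": "AVE", "postfix": ""},
    ]
    print(best_street_match("elm avenue", rows))   # picks the ELM AVE row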

The boundary files also have secondary address fields, for things like apartments, unit, suites, and the like. These are an abbreviation (APT, e.g.) and low and high ends of a range, and an odd/even/both flag. Unfortunately the ranges for these aren't always numeric. E.g., there is 695 E MORRILL AVE in Columbus, OH, which has secondary fields of "AP", "A", "M", and the odd/even/both flag set to both. I assume that means Apartments A-M.

So far, I've ignored the secondary address fields. Every time I've seen more than one result for the lowest Levenshtein distance, the matches have been close enough to each other to be in the same taxing districts, so would have the same rate and the tax would be allocated to the same entities, so further resolution is not necessary.

At some point I'll take my scripts that clean up the raw rate and boundary files and transform them into a form more suitable for DB use, and that make sqlite and mysql databases from those, and that do the address matching and also sales tax lookups by address, and put 'em on Github, but that will probably not be very soon.

[1] https://www.streamlinedsalestax.org/Shared-Pages/rate-and-bo...

[2] https://www.streamlinedsalestax.org/

[3] https://pastebin.com/tQLGTGS9



