I thought I would post a practical application, FWIW. I recently applied edit distance to a problem I was facing on my free dating site. Being free, we attract an inordinate number of scammers. I've been developing an arsenal of tools to help identify and track suspects in order to delete their accounts before they do too much damage. The most successful to date has used edit distance.
Scammers tend to post identical or slightly altered messages to a large number of users. As you might expect, this has the effect of quickly polluting the valid ecosystem. The solution I employed was to delay sending messages between users long enough to capture spammer patterns and compare the messages to one another.
I use the Levenshtein distance algorithm (which has O(n^2) performance). If I get a similarity over a certain threshold, I quarantine the message for further analysis. To keep the quadratic cost under control, I bound the problem by capping the number of messages used for analysis. It has been incredibly effective.
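In case it's useful to anyone, here's roughly the kind of check I mean, as a minimal Python sketch rather than my actual production code; the function names and the 0.8 threshold are made up for illustration:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalize edit distance into a 0..1 similarity score."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

# Hypothetical usage: hold a new message if it is too close to a recent one.
THRESHOLD = 0.8  # illustrative value, not the real one

def should_quarantine(new_msg: str, recent_msgs: list[str]) -> bool:
    return any(similarity(new_msg, m) > THRESHOLD for m in recent_msgs)
```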
Dick Lipton's blog has a huge collection of really amazing kiddie-pool tastes of, and pointers to, awesome nuggets of theoretical computer science. One of his posts that I want to walk through carefully when I have the time is the one on the quantum method for proving various nifty algorithmic complexity results, etc.
It is indeed amazing. But why put so much effort into a blog? Makes you wonder whether this blog is intended primarily to attract more PhD students to the field.
Prof. Lipton is, after all, a professor. From his blog, it seems that he truly likes teaching and exposing ideas. An excerpt from his blog:
"I have worked in the area of theory of computation since 1973. I find the whole area exciting and want to share some of that excitement with you."
Many times, interesting ideas which can be explained to a non-expert are scattered in countless papers and books, and it's hard for someone new in the field to put all the pieces together by himself. Thus, a blog seems a great idea:
* it's much faster and more informal than peer-reviewed publications.
* it allows one to communicate ideas to a broader audience.
* it allows one to have almost immediate feedback from the readers.
* it's open to everyone who happens to be interested!
It makes a lot more sense to publish scientific results on the web than on paper: hyperlinks, more graphs, more data, lower costs, more feedback. Scientists still publish in journals because of inertia and an antiquated incentive structure (grant money is proportional to the number of published papers, especially in high-profile journals). I hope we'll eventually make the transition, and possibly, so does the author of that blog.
What a coincidence! I have been looking at the edit distance problem because it can be applied to computer vision. One use is that if you have a stereo image pair, you can figure out the depth of objects in the scene by matching corresponding features. The edit distance problem helps with figuring out which pixels in one stereo image correspond to pixels in the other.
I am hoping with a bit more effort I can understand the dynamic programming side of it better.
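For anyone else curious about the connection, here's how I currently understand the mapping, as a toy sketch of my own rather than anything from the article: treat a pair of corresponding scanlines as two "strings" of intensities and align them with the edit-distance recurrence, where a substitution costs the intensity difference and gaps model occluded pixels. The constant occlusion penalty is an assumption for illustration.

```python
def align_scanlines(left, right, occlusion_cost=20):
    """Edit-distance-style DP over two scanlines of pixel intensities."""
    n, m = len(left), len(right)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                # "Substitution": matching two pixels costs their intensity difference.
                match = abs(left[i - 1] - right[j - 1])
                cost[i][j] = min(cost[i][j], cost[i - 1][j - 1] + match)
            if i > 0:
                # "Deletion": pixel visible in the left scanline but occluded in the right.
                cost[i][j] = min(cost[i][j], cost[i - 1][j] + occlusion_cost)
            if j > 0:
                # "Insertion": pixel visible in the right scanline but occluded in the left.
                cost[i][j] = min(cost[i][j], cost[i][j - 1] + occlusion_cost)
    # A traceback over this table recovers which pixels match, i.e. the disparities.
    return cost[n][m]

# e.g. align_scanlines([10, 12, 200, 50], [10, 200, 48, 47])
```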
I spent a year and a half as an undergraduate converting a custom C edit distance implementation into Java for a computer-assisted language instruction system. This is a really nice article discussing the topic. I think the interesting facet of dynamic programming is how it exposes the trade-off between space and time. In his creative book on algorithms, Udi Manber gives a nice description of a general approach to designing algorithms with dynamic programming.
I haven't a clue; this was back around '99-'01, and it wasn't just the edit distance algorithm but also all the UI code surrounding it and optimizing things for learning a foreign language. Basically, it was a quiz program in (e.g.) German which compared the student's answer to the proper answer and then generated suggestions when appropriate. We did a lot of tweaking of that code, iterating through versions, starting with an attempt to just port some C code and going on from there. I was pretty green at the time and it wasn't a full-time job, so it's not like implementing edit distance took, you know, 70 man-weeks...
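To make the space/time point concrete, here's a small sketch along the lines of what such a quiz program might do (simplified and from memory, not the actual code): keep the full O(n*m) table rather than just two rows, because the traceback over the whole table is what turns the distance into concrete suggestions.

```python
def edit_script(answer: str, correct: str) -> list[str]:
    """Full DP table plus traceback: more space than the two-row version,
    but the extra space buys an explicit list of suggested edits."""
    n, m = len(answer), len(correct)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if answer[i - 1] == correct[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete answer[i-1]
                          d[i][j - 1] + 1,         # insert correct[j-1]
                          d[i - 1][j - 1] + cost)  # keep or substitute
    # Trace back from the bottom-right corner to list the edits in order.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (0 if answer[i - 1] == correct[j - 1] else 1):
            if answer[i - 1] != correct[j - 1]:
                ops.append(f"replace '{answer[i - 1]}' with '{correct[j - 1]}'")
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(f"delete '{answer[i - 1]}'")
            i -= 1
        else:
            ops.append(f"insert '{correct[j - 1]}'")
            j -= 1
    return list(reversed(ops))

# e.g. edit_script("Huas", "Haus") suggests fixing a transposed German spelling.
```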
A completely different approach to the problem is to minimize the number of necessary distance computations in the first place. There are various indexing algorithms which make finding nearest neighbours feasible in large datasets. Incidentally, I implemented two of them. :)
For real usage, you would probably want to reimplement them with performance in mind (and in a different language), though. My implementation is dog slow. But on the plus side, it is very readable and heavily documented.
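I don't know which two index structures the parent implemented, but a BK-tree is one common choice for metric distances like Levenshtein, so here's a rough sketch of that idea (illustrative only, with the distance function passed in):

```python
class BKTree:
    """BK-tree sketch: an index over a metric distance (e.g. Levenshtein) that
    uses the triangle inequality to prune candidates, so a radius query
    touches far fewer strings than a linear scan would."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None  # node = (word, {edge_distance: child_node})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d in node[1]:
                node = node[1][d]       # follow the existing edge
            else:
                node[1][d] = (word, {}) # hang the word off a new edge
                return

    def search(self, query, radius):
        """Return (distance, word) pairs within `radius` edits of `query`."""
        results, stack = [], [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            d = self.distance(query, word)
            if d <= radius:
                results.append((d, word))
            # Triangle inequality: only subtrees whose edge label lies in
            # [d - radius, d + radius] can contain matches.
            for edge, child in children.items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return results

# Usage with any edit-distance function `dist`:
#   tree = BKTree(dist)
#   for w in corpus: tree.add(w)
#   tree.search("helo", radius=1)
```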
I know of other labs using edit distance to compare people's timelines (to estimate the cost of shifting from one lifestyle to another). And the most widely used application of edit distance is the comparison of DNA/RNA/protein sequences for similarity, both for identification purposes and for the study of their evolution.