FuzzyWuzzy: Fuzzy String Matching in Python

mopoke · on July 9, 2011

Maybe I'm being a hypersensitive Brit, but "Fuzzy Wuzzy" is a pretty offensive term in the UK.

See top entry: http://www.urbandictionary.com/define.php?term=fuzzy+wuzzy

Not to take anything away from the tech - that looks awesome and I can already think of a few uses for it.

Terretta · on July 9, 2011

I think the children's puzzler "Fuzzy Wuzzy was a bear, but Fuzzy Wuzzy had no hair. So Fuzzy Wuzzy wasn't fuzzy, was he?" is better known, despite the Urban Dictionary's selection-biased votes.

baha_man · on July 9, 2011

Not in the UK. I don't think the tongue-twister is well known here, but the phrase is used in the other sense in the TV show Dad's Army[1], which is still shown on the BBC. However, it's only used by one character[2], who's supposed to be 70 at the time World War II starts. So, I think it's a very outdated expression, and not likely to cause offence when used in the context of a programming library. I've certainly never heard the expression used other than in the TV show (where, as far as I know, it's still not censored - it's understood that the character's views were outdated even at the time the show was set).

The term is from a Kipling poem[3]:

"A derogatory term for a black person, especially one with fuzzy hair... From... one of Rudyard Kipling's... poems, written in 1918. The poem is in the voice of an unsophisticated British soldier and expresses admiration rather than contempt, although expressed in terms that sound patronizing today."

[1] http://en.wikipedia.org/wiki/Dads_Army

[2] http://en.wikipedia.org/wiki/Lance-Corporal_Jack_Jones

[3] http://www.phrases.org.uk/meanings/146100.html

nestlequ1k · on July 9, 2011

Guess there's not enough brits here for that to matter. It'd be a tad different if it started with an N.

jdietrich · on July 10, 2011

As a scouser, I'm always amused by Mizage Software's window management program Divvy.

Seriously people, Google your package names or you might end up looking like a divvy.

http://www.urbandictionary.com/define.php?term=divvy

aonic · on July 8, 2011

I did something similar for product matching across a Yahoo! store with products in an Amazon merchant account.

I had a set of products from Yahoo! that needed their equivalent product in a set of products from Amazon. I indexed all the Amazon products into Xapian and let the search functionality do its magic by using the Yahoo product title as the search keyword. It also had a scoring mechanism and worked flawlessly for my needs.
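A rough sketch of the same "use one title as the query against the other catalogue" idea, for readers without Xapian at hand. It swaps in the standard library's `difflib.get_close_matches` for a real search index, so it is not the setup described above; the `best_match` helper and the sample titles are made up for illustration.

```python
from difflib import get_close_matches

def best_match(title, candidates, cutoff=0.6):
    """Return the candidate most similar to `title`, or None below `cutoff`.

    Hypothetical helper: difflib stands in for a search index such as Xapian.
    """
    # Compare case-insensitively, but hand back the original spelling.
    lowered = {c.lower(): c for c in candidates}
    hits = get_close_matches(title.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None

# Invented sample data, not from any real store.
amazon_titles = ["ACME Widget 3000 (Blue)", "Acme Gadget 100"]
print(best_match("Acme Widget 3000 Blue", amazon_titles))
```

A linear scan like this compares every pair, so it only suits small catalogues; an index is what makes the approach practical at scale.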

plainOldText · on July 9, 2011

While reading this article I started laughing in amazement (if that is even possible). It is delightful to discover something you knew you wanted, delivered to you free, courtesy of others.

acslater00 · on July 9, 2011

Well if you like, you can thank us by buying a very expensive ticket on SeatGeek =P

ahi · on July 9, 2011

I heartily recommend "Introduction to Information Retrieval": http://www.amazon.com/Introduction-Information-Retrieval-Chr...

Skim it once to collect vocabulary, then use it as a reference for IR algorithms.

ianl · on July 8, 2011

I can remember the pain of doing this as a first-year intern at a sporting odds aggregation site. The biggest challenge was dealing with the invalid XML and the non-standardized naming scheme: Montreal Canadians, The Habs, etc.

Our eventual solution was to use a trained matcher, but obviously it was not ideal since human intervention was required :(

acslater00 · on July 8, 2011

Yeah completely non-standard names (like nicknames, abbreviations, acronyms) are a real pain to deal with, and string matching just completely fails on them. We (seatgeek) handle it the low tech way -- a giant list of name aliases that we run through during pre-processing. Not exactly worthy of a blog post, but it does the job well enough.
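For anyone curious what an alias pass like that might look like, here is a minimal sketch. The table contents and the `normalize_name` helper are hypothetical; the actual SeatGeek list and pre-processing code are not shown in this thread.

```python
# Hypothetical alias table mapping known variants to one canonical name.
ALIASES = {
    "the habs": "montreal canadiens",
    "habs": "montreal canadiens",
    "montreal canadians": "montreal canadiens",
    "ny yankees": "new york yankees",
}

def normalize_name(raw):
    """Lowercase, collapse whitespace, then replace a known alias."""
    key = " ".join(raw.lower().split())
    # Unknown names pass through unchanged (apart from case and spacing).
    return ALIASES.get(key, key)

print(normalize_name("The  Habs"))
```

Running this before any fuzzy comparison means the string matcher only ever sees canonical names for the cases the list covers.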

alexitosrv · on July 9, 2011

I did something similar, in Oracle and PostgreSQL, for a governmental entity. Its main purpose was to perform data fusion, where a set of not so dissimilar records represented the same person in several heterogeneous data sources. It was fun because of the concepts involved, but not so much because of the syntactic sugar of the SQL involved.

It's great to now have this in Python.

Terretta · on July 9, 2011

I found Google Refine saves some programming:

http://google-opensource.blogspot.com/2010/11/announcing-goo...

skawaii · on July 8, 2011

This looks pretty awesome. I remember when I was thinking about making a quote website back in college. I had just learned about the Levenshtein distance algorithm in a class and was excited about finding a real-life (read: non-contrived) scenario to apply it to.

Anyway, this looks like a really useful library. Glad it's freely available.

ecito · on July 8, 2011

Whoa, this looks really useful. Is there a gem for Ruby that does this? I've just been doing the first 'String Similarity' step using Levenshtein distance.
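For reference, the Levenshtein step mentioned here is short enough to write out. This is a plain-Python sketch of the textbook dynamic-programming version (in Python rather than Ruby, to match the library under discussion); it is not taken from FuzzyWuzzy's source.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # Keep the shorter string as `b` so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # prints 3
```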

chime · on July 9, 2011

> fuzz.partial_ratio("YANKEES", "NEW YORK YANKEES") ⇒ 100

From what I can see, this will also give 100 for 'NEW', 'KEES', 'YANK' - all of which could mean something completely different. How do they deal with this?
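The concern is easy to reproduce with a simplified stand-in for a partial ratio: slide the shorter string along the longer one and keep the best `difflib` score. This `partial_ratio_sketch` function is an assumption made for illustration, not FuzzyWuzzy's actual implementation, which may differ in detail.

```python
from difflib import SequenceMatcher

def partial_ratio_sketch(short, long):
    """Best 0-100 score of `short` against any same-length window of `long`."""
    if len(short) > len(long):
        short, long = long, short
    if not short:
        return 0
    best = 0.0
    for start in range(len(long) - len(short) + 1):
        window = long[start:start + len(short)]
        best = max(best, SequenceMatcher(None, short, window).ratio())
    return int(round(100 * best))

# Every query is an exact substring of the target, so each line prints 100.
for query in ("YANKEES", "NEW", "KEES", "YANK"):
    print(query, partial_ratio_sketch(query, "NEW YORK YANKEES"))
```

Any exact substring scores 100 under this scheme, which is why a partial score alone cannot tell a team name from a fragment.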

josegonzalez · on July 9, 2011

Context. We also know dates and times - more or less at least, there may be some conversion to UTC if necessary - as well as other information about the event - categories, locations etc.

On occasion there are false positives, in which case Our algorithm is the Borg. They will be assimilated. Their grammatical and syntactical distinctiveness will be added to our own. Resistance is futile.

__mark · on July 9, 2011

It's not often people sell tickets to a yanking of new knees? At least that would be my guess, that they also look for keywords.

nsomaru · on July 9, 2011

As a Python programmer, would you guys recommend Google Refine vs FuzzyWuzzy vs Febrl (http://sourceforge.net/projects/febrl/)

Purpose: Find duplicates in messy data sets with names and physical addresses

saygt · on July 9, 2011

Awesome! I was just about to start searching for something like this. Thank you HN

kragen · on July 9, 2011

Looks pretty useful! I wonder if a simple application of TF/IDF could improve the results by giving you better token weights. (Then you'd have to be comparing token sets, of course, rather than strings.)
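One way that idea could look, as a sketch only: weight each token by a smoothed inverse document frequency computed over the known names, then compare token sets with a weighted Jaccard score. The function names, the smoothing formula, the default weight for unseen tokens, and the sample corpus are all assumptions made for illustration.

```python
import math
from collections import Counter

def idf_weights(documents):
    """Map each token to a smoothed inverse document frequency."""
    doc_count = len(documents)
    doc_freq = Counter(
        tok for doc in documents for tok in set(doc.lower().split())
    )
    return {
        tok: math.log((1 + doc_count) / (1 + df)) + 1.0
        for tok, df in doc_freq.items()
    }

def weighted_overlap(a, b, weights, default=1.0):
    """Weighted Jaccard similarity of the two token sets, in [0, 1]."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    if not union:
        return 0.0
    total = sum(weights.get(t, default) for t in union)
    shared = sum(weights.get(t, default) for t in set_a & set_b)
    return shared / total

# Invented corpus: "new" and "york" are common, team nicknames are rare.
corpus = ["new york yankees", "new york mets", "new york giants", "boston red sox"]
weights = idf_weights(corpus)
print(weighted_overlap("new york yankees", "new york mets", weights))
print(weighted_overlap("yankees", "new york yankees", weights))
```

With these weights, sharing only "new york" counts for less than it would under a plain token count, while sharing the rarer "yankees" counts for more.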

johnrob · on July 8, 2011

Thanks, this will be useful for many screen scraping tasks!

john2x · on July 9, 2011

Thanks for the explanations. Very helpful.

on July 8, 2011

[deleted]

wisty · on July 8, 2011

Well, it expands difflib. It looks a bit like what google-refine does.