Python Fuzzy Matching with FuzzyWuzzy — Core Concepts

Master FuzzyWuzzy's four scoring functions — simple ratio, partial ratio, token sort, and token set — to handle messy real-world text matching in Python.

FuzzyWuzzy is a Python library that scores how similar two strings are on a scale from 0 to 100. By default it builds on Python’s difflib.SequenceMatcher (switching to python-Levenshtein when that package is installed) and adds convenience functions tailored to common real-world matching problems.

For the underlying theory, see Python String Similarity Algorithms.

Installation Note

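FuzzyWuzzy can optionally use python-Levenshtein for speed; without it, the library falls back to the slower difflib.SequenceMatcher. The modern replacement is rapidfuzz, which is faster, is MIT-licensed rather than GPL, and provides a largely compatible API. One difference to watch for: recent rapidfuzz versions do not lowercase or strip punctuation unless you pass a processor, so scores can differ from FuzzyWuzzy’s on the same input. Both are covered here.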
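Because both libraries expose modules named fuzz and process, a script can try rapidfuzz first and fall back to FuzzyWuzzy. The snippet below is a minimal sketch of that pattern, not an official recipe. BACKEND is a hypothetical variable name used for illustration, and the final fallback to None is only there so the snippet runs when neither package is installed.

```python
# Prefer rapidfuzz when it is installed, otherwise fall back to fuzzywuzzy.
try:
    from rapidfuzz import fuzz, process
    BACKEND = "rapidfuzz"
except ImportError:
    try:
        from fuzzywuzzy import fuzz, process
        BACKEND = "fuzzywuzzy"
    except ImportError:
        # Neither package is available; install one with pip before matching.
        fuzz = process = None
        BACKEND = None

print(f"Fuzzy matching backend: {BACKEND}")
```

Scores from the two back ends are not guaranteed to be identical, so it is worth re-checking thresholds after switching.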

The Four Scoring Functions

Simple Ratio

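Compares the two strings directly using SequenceMatcher and returns twice the number of matching characters divided by the combined length of both strings, scaled to 0-100.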

“New York Mets” vs “New York Meats” scores around 96 — very close.

This works well when both strings are roughly the same length and structure.
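The arithmetic behind that score can be reproduced with the standard library alone. The sketch below is an approximation of the idea, assuming the difflib back end; simple_ratio is a hypothetical helper name, not part of FuzzyWuzzy.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Return 2 * matches / (len(a) + len(b)), scaled to an integer 0-100."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# 13 characters match and the two strings hold 27 characters in total,
# so the raw ratio is 2 * 13 / 27, roughly 0.963.
score = simple_ratio("New York Mets", "New York Meats")
print(score)  # 96
```

With FuzzyWuzzy installed, the corresponding library call is fuzz.ratio(a, b).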

Partial Ratio

Finds the best substring match. It slides the shorter string along the longer one and returns the highest score.

“Yankees” vs “New York Yankees” scores 100 with partial ratio, because “Yankees” appears intact within the longer string. Simple ratio scores only around 61 on the same pair, because the nine unmatched characters of “New York ” count against it.

Best for: Matching when one string is a subset of the other.
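The sliding-window idea can be sketched with difflib as a brute-force loop. This is an illustration of the concept only: the library picks candidate windows more cleverly, and ratio and partial_ratio here are hypothetical stand-ins rather than the library's functions.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Plain character-level similarity as an integer 0-100."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def partial_ratio(a: str, b: str) -> int:
    """Best score of the shorter string against each same-length window of the longer."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0
    width = len(shorter)
    return max(
        ratio(shorter, longer[start:start + width])
        for start in range(len(longer) - width + 1)
    )


print(partial_ratio("Yankees", "New York Yankees"))  # 100
print(ratio("Yankees", "New York Yankees"))          # 61
```

With FuzzyWuzzy installed, the corresponding library call is fuzz.partial_ratio(a, b).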

Token Sort Ratio

Splits both strings into words, sorts them alphabetically, then compares. This neutralizes word order differences.

“John Smith Jr.” vs “Jr. Smith John” scores 100, because after lowercasing, stripping punctuation, and sorting the words, both strings become the same sequence: “john jr smith”.

Best for: Names, titles, or phrases where word order varies.
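A minimal sketch of the sort-then-compare step, using only the standard library. The tokens helper is an assumption: it lowercases and keeps ASCII letters and digits, which only approximates the library's own preprocessing, and all function names are hypothetical.

```python
import re
from difflib import SequenceMatcher


def tokens(text: str) -> list:
    # Assumption: lowercase and keep runs of ASCII letters/digits only.
    return re.findall(r"[a-z0-9]+", text.lower())


def ratio(a: str, b: str) -> int:
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def token_sort_ratio(a: str, b: str) -> int:
    """Compare the two strings after sorting their words alphabetically."""
    sorted_a = " ".join(sorted(tokens(a)))
    sorted_b = " ".join(sorted(tokens(b)))
    return ratio(sorted_a, sorted_b)


print(" ".join(sorted(tokens("John Smith Jr."))))          # john jr smith
print(token_sort_ratio("John Smith Jr.", "Jr. Smith John"))  # 100
```

With FuzzyWuzzy installed, the corresponding library call is fuzz.token_sort_ratio(a, b).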

Token Set Ratio

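The most forgiving scorer. It splits each string into a set of unique words, separates the words both strings share (the intersection) from the words unique to each, then compares the shared words against the shared words plus each string’s leftovers and keeps the best score.

“Los Angeles Lakers basketball” vs “Lakers Los Angeles” scores 100 because every word in the shorter string also appears in the longer one, so the extra word “basketball” does not lower the score.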

Best for: Records with inconsistent detail levels — one entry has extra context the other doesn’t.
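The set-based comparison can be sketched as follows. This is a simplified reading of the approach, assuming ASCII-only tokenization and the difflib back end; the helper names are hypothetical and the library's own implementation may differ in detail.

```python
import re
from difflib import SequenceMatcher


def tokens(text: str) -> set:
    # Assumption: lowercase and keep runs of ASCII letters/digits only.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def ratio(a: str, b: str) -> int:
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def token_set_ratio(a: str, b: str) -> int:
    """Best of three comparisons built from shared and unshared words."""
    words_a, words_b = tokens(a), tokens(b)
    if not words_a or not words_b:
        return 0
    common = " ".join(sorted(words_a & words_b))
    # Shared words followed by whatever is unique to each string.
    combined_a = (common + " " + " ".join(sorted(words_a - words_b))).strip()
    combined_b = (common + " " + " ".join(sorted(words_b - words_a))).strip()
    return max(
        ratio(common, combined_a),
        ratio(common, combined_b),
        ratio(combined_a, combined_b),
    )


# The shorter string's words are all shared, so one comparison is an exact match.
print(token_set_ratio("Los Angeles Lakers basketball", "Lakers Los Angeles"))  # 100
```

With FuzzyWuzzy installed, the corresponding library call is fuzz.token_set_ratio(a, b).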

Choosing the Right Scorer

Situation	Scorer	Why
Two similar-length full names	Simple ratio	Direct comparison works
Short query vs long record	Partial ratio	Finds the needle in the haystack
Same words, different order	Token sort	Order-independent
Extra words in one string	Token set	Ignores extras
Not sure	Token set	Most forgiving default

Extracting Best Matches

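Beyond comparing two strings, FuzzyWuzzy provides process.extract() to find the best matches from a list of choices. You supply a query and a list, and it returns the top matches (five by default, adjustable with the limit argument) as (choice, score) pairs, best first. When you only need the single best candidate, process.extractOne() returns just that pair.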

This is the primary API for searching — you rarely compare strings one pair at a time in practice.
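To show what a ranked lookup does under the hood, here is a small standard-library sketch. extract_top and score are hypothetical helpers that mimic the shape of the result, not the library's implementation; in particular they use a plain character ratio, while FuzzyWuzzy's process.extract() uses a weighted scorer by default, so real scores can differ.

```python
from difflib import SequenceMatcher


def score(query: str, choice: str) -> int:
    """Case-insensitive character similarity as an integer 0-100."""
    matcher = SequenceMatcher(None, query.lower(), choice.lower())
    return int(round(100 * matcher.ratio()))


def extract_top(query: str, choices: list, limit: int = 5) -> list:
    """Return up to `limit` (choice, score) pairs, best score first."""
    scored = [(choice, score(query, choice)) for choice in choices]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


teams = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
results = extract_top("new york jets", teams, limit=2)
print(results)  # "New York Jets" comes first with a score of 100
```

With FuzzyWuzzy installed, the comparable call is process.extract("new york jets", teams, limit=2). Note that rapidfuzz returns (choice, score, index) triples instead of pairs.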

Threshold Selection

The right threshold depends on your domain:

Names and addresses: 85-90 (typos are small)
Product catalog matching: 75-85 (abbreviations and variations are common)
Free-text descriptions: 60-75 (paraphrasing causes bigger differences)

Start with 85, test against labeled examples, and adjust. Too high misses valid matches. Too low floods you with false positives.
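Testing a threshold against labeled examples can be as simple as counting hits and misses at each candidate value. The sketch below assumes a plain difflib score and a tiny, made-up labeled set; precision_recall and the sample pairs are hypothetical and only illustrate the procedure.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> int:
    return int(round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()))


def precision_recall(labeled_pairs, threshold):
    """Treat score >= threshold as a predicted match and compare with the labels."""
    tp = fp = fn = 0
    for a, b, is_match in labeled_pairs:
        predicted = score(a, b) >= threshold
        if predicted and is_match:
            tp += 1
        elif predicted:
            fp += 1  # predicted a match that the label says is wrong
        elif is_match:
            fn += 1  # missed a true match
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Hypothetical labeled examples: (string_a, string_b, is_true_match).
labeled = [
    ("Jon Smith", "John Smith", True),
    ("Acme Corp", "Acme Corporation", True),
    ("Jane Doe", "John Roe", False),
]

for threshold in (60, 70, 85):
    precision, recall = precision_recall(labeled, threshold)
    print(threshold, round(precision, 2), round(recall, 2))
```

On this toy set, 85 misses the abbreviated company name (lower recall), 60 lets the unrelated pair through (lower precision), and 70 gets both right. A real evaluation needs far more examples drawn from your own data.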

Common Misconception

“FuzzyWuzzy handles all fuzzy matching needs.” FuzzyWuzzy is excellent for short strings — names, addresses, product titles. For document-level similarity, it’s the wrong tool. TF-IDF cosine similarity or embedding-based approaches work better for paragraphs and beyond. Always match the tool to the text length.

One Thing to Remember

FuzzyWuzzy offers four scorers for four scenarios — simple ratio for similar strings, partial for subsets, token sort for reordered words, and token set for uneven detail — pick the one that fits your data’s messiness.
