Python String Similarity Algorithms — Deep Dive
String similarity is foundational to search, deduplication, record linkage, and spell checking. This deep dive implements the major algorithms from scratch, compares their behavior, and covers production-grade approaches for real datasets.
Levenshtein Distance: The Dynamic Programming Classic
The Levenshtein distance between two strings is the minimum number of single-character insertions, deletions, or substitutions to transform one into the other.
def levenshtein(s: str, t: str) -> int:
"""Classic O(mn) dynamic programming implementation."""
m, n = len(s), len(t)
# Only need two rows at a time
prev = list(range(n + 1))
curr = [0] * (n + 1)
for i in range(1, m + 1):
curr[0] = i
for j in range(1, n + 1):
cost = 0 if s[i - 1] == t[j - 1] else 1
curr[j] = min(
curr[j - 1] + 1, # insertion
prev[j] + 1, # deletion
prev[j - 1] + cost, # substitution
)
prev, curr = curr, prev
return prev[n]
print(levenshtein("kitten", "sitting")) # 3
print(levenshtein("python", "pyhton")) # 2 (transposition = 2 edits)
Damerau-Levenshtein: Adding Transpositions
Standard Levenshtein treats “ab” → “ba” as two operations (delete + insert). Damerau-Levenshtein adds transposition as a single operation, which better models real typos:
def damerau_levenshtein(s: str, t: str) -> int:
"""Optimal string alignment variant."""
m, n = len(s), len(t)
d = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(m + 1):
d[i][0] = i
for j in range(n + 1):
d[0][j] = j
for i in range(1, m + 1):
for j in range(1, n + 1):
cost = 0 if s[i - 1] == t[j - 1] else 1
d[i][j] = min(
d[i - 1][j] + 1,
d[i][j - 1] + 1,
d[i - 1][j - 1] + cost,
)
if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
return d[m][n]
print(damerau_levenshtein("python", "pyhton")) # 1 (one transposition)
difflib.SequenceMatcher: The Standard Library Option
Python ships with difflib.SequenceMatcher, which uses the Ratcliff/Obershelp algorithm to find matching blocks:
from difflib import SequenceMatcher
def similarity_ratio(a: str, b: str) -> float:
return SequenceMatcher(None, a, b).ratio()
print(similarity_ratio("python", "pyhton")) # 0.667
print(similarity_ratio("hello world", "hello wrold")) # 0.909
# The autojunk parameter affects long strings
sm = SequenceMatcher(None, "a" * 200 + "b", "a" * 200 + "c")
print(sm.ratio()) # May give unexpected results due to junk heuristic
sm_no_junk = SequenceMatcher(None, "a" * 200 + "b", "a" * 200 + "c", autojunk=False)
print(sm_no_junk.ratio()) # More accurate for repetitive strings
Performance Note
SequenceMatcher is O(n²) in the worst case but uses heuristics to speed up common cases. For short strings (<1000 characters), it’s adequate. For bulk comparison, use compiled C libraries.
N-gram Similarity
N-grams capture local structure. The Dice coefficient (also called Sørensen-Dice) over bigrams is popular for name matching:
def bigram_dice(s: str, t: str) -> float:
"""Dice coefficient over character bigrams."""
if len(s) < 2 or len(t) < 2:
return 1.0 if s == t else 0.0
s_bigrams = [s[i:i+2] for i in range(len(s) - 1)]
t_bigrams = [t[i:i+2] for i in range(len(t) - 1)]
s_set = set(s_bigrams)
t_set = set(t_bigrams)
overlap = len(s_set & t_set)
return 2 * overlap / (len(s_set) + len(t_set))
print(bigram_dice("python", "pyhton")) # 0.5
print(bigram_dice("night", "nacht")) # 0.25
print(bigram_dice("python", "python")) # 1.0
Jaccard vs Dice
Jaccard uses |A ∩ B| / |A ∪ B| while Dice uses 2|A ∩ B| / (|A| + |B|). Dice always produces higher scores for partial matches, making it more lenient. Choose based on your tolerance for false positives.
Cosine Similarity with TF-IDF
For comparing longer texts or documents, token-level cosine similarity outperforms character-level metrics:
from collections import Counter
import math
def cosine_similarity(a: str, b: str) -> float:
"""Word-level cosine similarity."""
vec_a = Counter(a.lower().split())
vec_b = Counter(b.lower().split())
intersection = set(vec_a) & set(vec_b)
dot_product = sum(vec_a[w] * vec_b[w] for w in intersection)
mag_a = math.sqrt(sum(v ** 2 for v in vec_a.values()))
mag_b = math.sqrt(sum(v ** 2 for v in vec_b.values()))
if mag_a == 0 or mag_b == 0:
return 0.0
return dot_product / (mag_a * mag_b)
print(cosine_similarity(
"the quick brown fox jumps",
"the fast brown fox leaps"
)) # ~0.6 — shares 3 of 5 words
For production use, scikit-learn’s TfidfVectorizer handles tokenization, weighting, and vectorization efficiently:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity as sklearn_cosine
corpus = [
"python string matching algorithms",
"python text similarity methods",
"java database connection pooling",
]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)
similarities = sklearn_cosine(tfidf_matrix[0:1], tfidf_matrix)
print(similarities)
# [[1.0, ~0.3, 0.0]] — first doc is most similar to second
Phonetic Algorithms
Sometimes strings sound alike but look different (“Smith” vs “Smyth”). Phonetic algorithms encode words by pronunciation:
# Using the 'jellyfish' library
# pip install jellyfish
import jellyfish
# Soundex: classic 4-character code
print(jellyfish.soundex("Smith")) # S530
print(jellyfish.soundex("Smyth")) # S530 — match!
# Metaphone: more accurate than Soundex
print(jellyfish.metaphone("Smith")) # SM0
print(jellyfish.metaphone("Smyth")) # SM0
# Not fooled by visual similarity without phonetic match
print(jellyfish.soundex("Smith")) # S530
print(jellyfish.soundex("Snith")) # S530 — still matches (S-N-TH)
Benchmark: Choosing by Speed
import timeit
pairs = [("algorithm", "altruistic"), ("python", "pyhton"), ("abcdef", "azced")]
# Pure Python Levenshtein
t_lev = timeit.timeit(
lambda: [levenshtein(a, b) for a, b in pairs], number=10000
)
# difflib SequenceMatcher
t_sm = timeit.timeit(
lambda: [SequenceMatcher(None, a, b).ratio() for a, b in pairs], number=10000
)
# Bigram Dice
t_dice = timeit.timeit(
lambda: [bigram_dice(a, b) for a, b in pairs], number=10000
)
print(f"Levenshtein: {t_lev:.2f}s")
print(f"SequenceMatcher: {t_sm:.2f}s")
print(f"Bigram Dice: {t_dice:.2f}s")
# Typical: Dice < Levenshtein < SequenceMatcher
For high-volume workloads, C-accelerated libraries like rapidfuzz (drop-in replacement for fuzzywuzzy without the GPL license) provide 10-100x speedups.
Production Pattern: Candidate Filtering
Comparing every pair in a large dataset is O(n²). Use a two-stage approach:
from collections import defaultdict
def ngram_index(strings: list[str], n: int = 3) -> dict[str, set[int]]:
"""Build an inverted index of n-grams to string indices."""
index = defaultdict(set)
for i, s in enumerate(strings):
s_lower = s.lower()
for j in range(len(s_lower) - n + 1):
index[s_lower[j:j+n]].add(i)
return index
def find_candidates(query: str, index: dict, threshold: int = 2) -> set[int]:
"""Find strings sharing at least `threshold` n-grams with query."""
query_lower = query.lower()
candidates = Counter()
for j in range(len(query_lower) - 2):
ngram = query_lower[j:j+3]
for idx in index.get(ngram, set()):
candidates[idx] += 1
return {idx for idx, count in candidates.items() if count >= threshold}
# Index 100K names, then find similar ones for a query
# Only candidates pass to the expensive similarity function
This reduces comparison count from millions to hundreds, making real-time fuzzy matching feasible.
Algorithm Selection Matrix
| Scenario | Algorithm | Complexity | Library |
|---|---|---|---|
| Typo correction | Damerau-Levenshtein | O(mn) | rapidfuzz |
| Name deduplication | Jaro-Winkler | O(mn) | jellyfish |
| Document similarity | TF-IDF cosine | O(n) after vectorization | scikit-learn |
| Phonetic matching | Double Metaphone | O(n) | jellyfish |
| General purpose | SequenceMatcher | O(n²) worst | difflib (stdlib) |
| Fast fuzzy matching | Bigram Dice + index | O(n) per query | Custom / rapidfuzz |
One Thing to Remember
String similarity is a spectrum of algorithms optimized for different dimensions of “closeness” — character edits, token overlap, phonetics, or structure — and production systems combine fast indexing with precise scoring for scalable matching.
See Also
- Python Fuzzy Matching Fuzzywuzzy Find out how Python's FuzzyWuzzy library matches messy, misspelled text — like a friend who understands you even when you mumble.
- Python Regex Lookahead Lookbehind Learn how Python regex can peek ahead or behind without grabbing text — like checking what's next in line without stepping forward.
- Python Regex Named Groups Learn how Python regex named groups let you label the pieces you capture — like putting name tags on your search results.
- Python Regex Patterns Discover how Python regex patterns work like a secret code for finding hidden text treasures in any document.
- Python Regular Expressions Learn how Python can find tricky text patterns fast, like spotting every phone number hidden in a messy page.