Python String Similarity Algorithms — Deep Dive

Implement and benchmark string similarity algorithms in Python — from Levenshtein edit distance to TF-IDF cosine similarity, with performance tradeoffs and production patterns.

String similarity is foundational to search, deduplication, record linkage, and spell checking. This deep dive implements the major algorithms from scratch, compares their behavior, and covers production-grade approaches for real datasets.

Levenshtein Distance: The Dynamic Programming Classic

The Levenshtein distance between two strings is the minimum number of single-character insertions, deletions, or substitutions to transform one into the other.

def levenshtein(s: str, t: str) -> int:
    """Classic O(mn) dynamic programming implementation."""
    m, n = len(s), len(t)
    # Only need two rows at a time
    prev = list(range(n + 1))
    curr = [0] * (n + 1)

    for i in range(1, m + 1):
        curr[0] = i
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(
                curr[j - 1] + 1,      # insertion
                prev[j] + 1,          # deletion
                prev[j - 1] + cost,   # substitution
            )
        prev, curr = curr, prev

    return prev[n]

print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("python", "pyhton"))   # 2 (transposition = 2 edits)

Damerau-Levenshtein: Adding Transpositions

Standard Levenshtein treats “ab” → “ba” as two operations (delete + insert). Damerau-Levenshtein adds transposition as a single operation, which better models real typos:

def damerau_levenshtein(s: str, t: str) -> int:
    """Optimal string alignment variant."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,
                d[i][j - 1] + 1,
                d[i - 1][j - 1] + cost,
            )
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)

    return d[m][n]

print(damerau_levenshtein("python", "pyhton"))  # 1 (one transposition)

difflib.SequenceMatcher: The Standard Library Option

Python ships with difflib.SequenceMatcher, which uses the Ratcliff/Obershelp algorithm to find matching blocks:

from difflib import SequenceMatcher

def similarity_ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

print(similarity_ratio("python", "pyhton"))          # 0.667
print(similarity_ratio("hello world", "hello wrold")) # 0.909

# The autojunk parameter affects long strings
sm = SequenceMatcher(None, "a" * 200 + "b", "a" * 200 + "c")
print(sm.ratio())  # May give unexpected results due to junk heuristic

sm_no_junk = SequenceMatcher(None, "a" * 200 + "b", "a" * 200 + "c", autojunk=False)
print(sm_no_junk.ratio())  # More accurate for repetitive strings

Performance Note

SequenceMatcher is O(n²) in the worst case but uses heuristics to speed up common cases. For short strings (<1000 characters), it’s adequate. For bulk comparison, use compiled C libraries.

N-gram Similarity

N-grams capture local structure. The Dice coefficient (also called Sørensen-Dice) over bigrams is popular for name matching:

def bigram_dice(s: str, t: str) -> float:
    """Dice coefficient over character bigrams."""
    if len(s) < 2 or len(t) < 2:
        return 1.0 if s == t else 0.0

    s_bigrams = [s[i:i+2] for i in range(len(s) - 1)]
    t_bigrams = [t[i:i+2] for i in range(len(t) - 1)]

    s_set = set(s_bigrams)
    t_set = set(t_bigrams)

    overlap = len(s_set & t_set)
    return 2 * overlap / (len(s_set) + len(t_set))

print(bigram_dice("python", "pyhton"))    # 0.5
print(bigram_dice("night", "nacht"))      # 0.25
print(bigram_dice("python", "python"))    # 1.0

Jaccard vs Dice

Jaccard uses |A ∩ B| / |A ∪ B| while Dice uses 2|A ∩ B| / (|A| + |B|). Dice always produces higher scores for partial matches, making it more lenient. Choose based on your tolerance for false positives.

Cosine Similarity with TF-IDF

For comparing longer texts or documents, token-level cosine similarity outperforms character-level metrics:

from collections import Counter
import math

def cosine_similarity(a: str, b: str) -> float:
    """Word-level cosine similarity."""
    vec_a = Counter(a.lower().split())
    vec_b = Counter(b.lower().split())

    intersection = set(vec_a) & set(vec_b)
    dot_product = sum(vec_a[w] * vec_b[w] for w in intersection)

    mag_a = math.sqrt(sum(v ** 2 for v in vec_a.values()))
    mag_b = math.sqrt(sum(v ** 2 for v in vec_b.values()))

    if mag_a == 0 or mag_b == 0:
        return 0.0
    return dot_product / (mag_a * mag_b)

print(cosine_similarity(
    "the quick brown fox jumps",
    "the fast brown fox leaps"
))  # ~0.6 — shares 3 of 5 words

For production use, scikit-learn’s TfidfVectorizer handles tokenization, weighting, and vectorization efficiently:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity as sklearn_cosine

corpus = [
    "python string matching algorithms",
    "python text similarity methods",
    "java database connection pooling",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)
similarities = sklearn_cosine(tfidf_matrix[0:1], tfidf_matrix)
print(similarities)
# [[1.0, ~0.3, 0.0]] — first doc is most similar to second

Phonetic Algorithms

Sometimes strings sound alike but look different (“Smith” vs “Smyth”). Phonetic algorithms encode words by pronunciation:

# Using the 'jellyfish' library
# pip install jellyfish
import jellyfish

# Soundex: classic 4-character code
print(jellyfish.soundex("Smith"))   # S530
print(jellyfish.soundex("Smyth"))   # S530 — match!

# Metaphone: more accurate than Soundex
print(jellyfish.metaphone("Smith"))  # SM0
print(jellyfish.metaphone("Smyth"))  # SM0

# Not fooled by visual similarity without phonetic match
print(jellyfish.soundex("Smith"))    # S530
print(jellyfish.soundex("Snith"))    # S530 — still matches (S-N-TH)

Benchmark: Choosing by Speed

import timeit

pairs = [("algorithm", "altruistic"), ("python", "pyhton"), ("abcdef", "azced")]

# Pure Python Levenshtein
t_lev = timeit.timeit(
    lambda: [levenshtein(a, b) for a, b in pairs], number=10000
)

# difflib SequenceMatcher
t_sm = timeit.timeit(
    lambda: [SequenceMatcher(None, a, b).ratio() for a, b in pairs], number=10000
)

# Bigram Dice
t_dice = timeit.timeit(
    lambda: [bigram_dice(a, b) for a, b in pairs], number=10000
)

print(f"Levenshtein: {t_lev:.2f}s")
print(f"SequenceMatcher: {t_sm:.2f}s")
print(f"Bigram Dice: {t_dice:.2f}s")
# Typical: Dice < Levenshtein < SequenceMatcher

For high-volume workloads, C-accelerated libraries like rapidfuzz (drop-in replacement for fuzzywuzzy without the GPL license) provide 10-100x speedups.

Production Pattern: Candidate Filtering

Comparing every pair in a large dataset is O(n²). Use a two-stage approach:

from collections import defaultdict

def ngram_index(strings: list[str], n: int = 3) -> dict[str, set[int]]:
    """Build an inverted index of n-grams to string indices."""
    index = defaultdict(set)
    for i, s in enumerate(strings):
        s_lower = s.lower()
        for j in range(len(s_lower) - n + 1):
            index[s_lower[j:j+n]].add(i)
    return index

def find_candidates(query: str, index: dict, threshold: int = 2) -> set[int]:
    """Find strings sharing at least `threshold` n-grams with query."""
    query_lower = query.lower()
    candidates = Counter()
    for j in range(len(query_lower) - 2):
        ngram = query_lower[j:j+3]
        for idx in index.get(ngram, set()):
            candidates[idx] += 1
    return {idx for idx, count in candidates.items() if count >= threshold}

# Index 100K names, then find similar ones for a query
# Only candidates pass to the expensive similarity function

This reduces comparison count from millions to hundreds, making real-time fuzzy matching feasible.

Algorithm Selection Matrix

Scenario	Algorithm	Complexity	Library
Typo correction	Damerau-Levenshtein	O(mn)	`rapidfuzz`
Name deduplication	Jaro-Winkler	O(mn)	`jellyfish`
Document similarity	TF-IDF cosine	O(n) after vectorization	`scikit-learn`
Phonetic matching	Double Metaphone	O(n)	`jellyfish`
General purpose	SequenceMatcher	O(n²) worst	`difflib` (stdlib)
Fast fuzzy matching	Bigram Dice + index	O(n) per query	Custom / `rapidfuzz`

One Thing to Remember

String similarity is a spectrum of algorithms optimized for different dimensions of “closeness” — character edits, token overlap, phonetics, or structure — and production systems combine fast indexing with precise scoring for scalable matching.

pythonstringssimilarityalgorithmstext-processingadvanced