Python String Similarity Algorithms — Deep Dive

String similarity is foundational to search, deduplication, record linkage, and spell checking. This deep dive implements the major algorithms from scratch, compares their behavior, and covers production-grade approaches for real datasets.

Levenshtein Distance: The Dynamic Programming Classic

The Levenshtein distance between two strings is the minimum number of single-character insertions, deletions, or substitutions to transform one into the other.

def levenshtein(s: str, t: str) -> int:
    """Classic O(mn) dynamic programming implementation."""
    m, n = len(s), len(t)
    # Only need two rows at a time
    prev = list(range(n + 1))
    curr = [0] * (n + 1)

    for i in range(1, m + 1):
        curr[0] = i
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(
                curr[j - 1] + 1,      # insertion
                prev[j] + 1,          # deletion
                prev[j - 1] + cost,   # substitution
            )
        prev, curr = curr, prev

    return prev[n]

print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("python", "pyhton"))   # 2 (transposition = 2 edits)

Damerau-Levenshtein: Adding Transpositions

Standard Levenshtein treats “ab” → “ba” as two operations (delete + insert). Damerau-Levenshtein adds transposition as a single operation, which better models real typos:

def damerau_levenshtein(s: str, t: str) -> int:
    """Optimal string alignment variant."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,
                d[i][j - 1] + 1,
                d[i - 1][j - 1] + cost,
            )
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)

    return d[m][n]

print(damerau_levenshtein("python", "pyhton"))  # 1 (one transposition)

difflib.SequenceMatcher: The Standard Library Option

Python ships with difflib.SequenceMatcher, which uses the Ratcliff/Obershelp algorithm to find matching blocks:

from difflib import SequenceMatcher

def similarity_ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

print(similarity_ratio("python", "pyhton"))          # 0.667
print(similarity_ratio("hello world", "hello wrold")) # 0.909

# The autojunk parameter affects long strings
sm = SequenceMatcher(None, "a" * 200 + "b", "a" * 200 + "c")
print(sm.ratio())  # May give unexpected results due to junk heuristic

sm_no_junk = SequenceMatcher(None, "a" * 200 + "b", "a" * 200 + "c", autojunk=False)
print(sm_no_junk.ratio())  # More accurate for repetitive strings

Performance Note

SequenceMatcher is O(n²) in the worst case but uses heuristics to speed up common cases. For short strings (<1000 characters), it’s adequate. For bulk comparison, use compiled C libraries.

N-gram Similarity

N-grams capture local structure. The Dice coefficient (also called Sørensen-Dice) over bigrams is popular for name matching:

def bigram_dice(s: str, t: str) -> float:
    """Dice coefficient over character bigrams."""
    if len(s) < 2 or len(t) < 2:
        return 1.0 if s == t else 0.0

    s_bigrams = [s[i:i+2] for i in range(len(s) - 1)]
    t_bigrams = [t[i:i+2] for i in range(len(t) - 1)]

    s_set = set(s_bigrams)
    t_set = set(t_bigrams)

    overlap = len(s_set & t_set)
    return 2 * overlap / (len(s_set) + len(t_set))

print(bigram_dice("python", "pyhton"))    # 0.5
print(bigram_dice("night", "nacht"))      # 0.25
print(bigram_dice("python", "python"))    # 1.0

Jaccard vs Dice

Jaccard uses |A ∩ B| / |A ∪ B| while Dice uses 2|A ∩ B| / (|A| + |B|). Dice always produces higher scores for partial matches, making it more lenient. Choose based on your tolerance for false positives.

Cosine Similarity with TF-IDF

For comparing longer texts or documents, token-level cosine similarity outperforms character-level metrics:

from collections import Counter
import math

def cosine_similarity(a: str, b: str) -> float:
    """Word-level cosine similarity."""
    vec_a = Counter(a.lower().split())
    vec_b = Counter(b.lower().split())

    intersection = set(vec_a) & set(vec_b)
    dot_product = sum(vec_a[w] * vec_b[w] for w in intersection)

    mag_a = math.sqrt(sum(v ** 2 for v in vec_a.values()))
    mag_b = math.sqrt(sum(v ** 2 for v in vec_b.values()))

    if mag_a == 0 or mag_b == 0:
        return 0.0
    return dot_product / (mag_a * mag_b)

print(cosine_similarity(
    "the quick brown fox jumps",
    "the fast brown fox leaps"
))  # ~0.6 — shares 3 of 5 words

For production use, scikit-learn’s TfidfVectorizer handles tokenization, weighting, and vectorization efficiently:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity as sklearn_cosine

corpus = [
    "python string matching algorithms",
    "python text similarity methods",
    "java database connection pooling",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)
similarities = sklearn_cosine(tfidf_matrix[0:1], tfidf_matrix)
print(similarities)
# [[1.0, ~0.3, 0.0]] — first doc is most similar to second

Phonetic Algorithms

Sometimes strings sound alike but look different (“Smith” vs “Smyth”). Phonetic algorithms encode words by pronunciation:

# Using the 'jellyfish' library
# pip install jellyfish
import jellyfish

# Soundex: classic 4-character code
print(jellyfish.soundex("Smith"))   # S530
print(jellyfish.soundex("Smyth"))   # S530 — match!

# Metaphone: more accurate than Soundex
print(jellyfish.metaphone("Smith"))  # SM0
print(jellyfish.metaphone("Smyth"))  # SM0

# Not fooled by visual similarity without phonetic match
print(jellyfish.soundex("Smith"))    # S530
print(jellyfish.soundex("Snith"))    # S530 — still matches (S-N-TH)

Benchmark: Choosing by Speed

import timeit

pairs = [("algorithm", "altruistic"), ("python", "pyhton"), ("abcdef", "azced")]

# Pure Python Levenshtein
t_lev = timeit.timeit(
    lambda: [levenshtein(a, b) for a, b in pairs], number=10000
)

# difflib SequenceMatcher
t_sm = timeit.timeit(
    lambda: [SequenceMatcher(None, a, b).ratio() for a, b in pairs], number=10000
)

# Bigram Dice
t_dice = timeit.timeit(
    lambda: [bigram_dice(a, b) for a, b in pairs], number=10000
)

print(f"Levenshtein: {t_lev:.2f}s")
print(f"SequenceMatcher: {t_sm:.2f}s")
print(f"Bigram Dice: {t_dice:.2f}s")
# Typical: Dice < Levenshtein < SequenceMatcher

For high-volume workloads, C-accelerated libraries like rapidfuzz (drop-in replacement for fuzzywuzzy without the GPL license) provide 10-100x speedups.

Production Pattern: Candidate Filtering

Comparing every pair in a large dataset is O(n²). Use a two-stage approach:

from collections import defaultdict

def ngram_index(strings: list[str], n: int = 3) -> dict[str, set[int]]:
    """Build an inverted index of n-grams to string indices."""
    index = defaultdict(set)
    for i, s in enumerate(strings):
        s_lower = s.lower()
        for j in range(len(s_lower) - n + 1):
            index[s_lower[j:j+n]].add(i)
    return index

def find_candidates(query: str, index: dict, threshold: int = 2) -> set[int]:
    """Find strings sharing at least `threshold` n-grams with query."""
    query_lower = query.lower()
    candidates = Counter()
    for j in range(len(query_lower) - 2):
        ngram = query_lower[j:j+3]
        for idx in index.get(ngram, set()):
            candidates[idx] += 1
    return {idx for idx, count in candidates.items() if count >= threshold}

# Index 100K names, then find similar ones for a query
# Only candidates pass to the expensive similarity function

This reduces comparison count from millions to hundreds, making real-time fuzzy matching feasible.

Algorithm Selection Matrix

ScenarioAlgorithmComplexityLibrary
Typo correctionDamerau-LevenshteinO(mn)rapidfuzz
Name deduplicationJaro-WinklerO(mn)jellyfish
Document similarityTF-IDF cosineO(n) after vectorizationscikit-learn
Phonetic matchingDouble MetaphoneO(n)jellyfish
General purposeSequenceMatcherO(n²) worstdifflib (stdlib)
Fast fuzzy matchingBigram Dice + indexO(n) per queryCustom / rapidfuzz

One Thing to Remember

String similarity is a spectrum of algorithms optimized for different dimensions of “closeness” — character edits, token overlap, phonetics, or structure — and production systems combine fast indexing with precise scoring for scalable matching.

pythonstringssimilarityalgorithmstext-processingadvanced

See Also

  • Python Fuzzy Matching Fuzzywuzzy Find out how Python's FuzzyWuzzy library matches messy, misspelled text — like a friend who understands you even when you mumble.
  • Python Regex Lookahead Lookbehind Learn how Python regex can peek ahead or behind without grabbing text — like checking what's next in line without stepping forward.
  • Python Regex Named Groups Learn how Python regex named groups let you label the pieces you capture — like putting name tags on your search results.
  • Python Regex Patterns Discover how Python regex patterns work like a secret code for finding hidden text treasures in any document.
  • Python Regular Expressions Learn how Python can find tricky text patterns fast, like spotting every phone number hidden in a messy page.