Python difflib Text Comparison — Core Concepts

Practical guide to difflib: generating diffs, computing similarity ratios, and matching user input with get_close_matches.

What difflib provides

The difflib module has three main capabilities:

Generating diffs — show line-by-line or character-level differences between two texts
Similarity scoring — compute how alike two sequences are (0.0 to 1.0)
Fuzzy matching — find the closest matches from a list of possibilities

Generating diffs

unified_diff — the standard format

import difflib

old = ["alpha\n", "beta\n", "gamma\n", "delta\n"]
new = ["alpha\n", "BETA\n", "gamma\n", "epsilon\n"]

diff = difflib.unified_diff(old, new, fromfile="v1.txt", tofile="v2.txt")
print("".join(diff))

Output:

--- v1.txt
+++ v2.txt
@@ -1,4 +1,4 @@
 alpha
-beta
+BETA
 gamma
-delta
+epsilon

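Lines starting with - were removed, lines starting with + were added, and lines starting with a space are unchanged context. The @@ line gives the line ranges the hunk covers in each file. This is the same format git diff produces by default.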
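Inputs often arrive as whole strings rather than lists of lines. The following is a minimal sketch, assuming a hypothetical helper named diff_text (not part of difflib): splitlines(keepends=True) keeps the trailing newlines that unified_diff expects, and the n parameter sets how many context lines surround each change.

```python
import difflib

def diff_text(old_text, new_text, context_lines=1):
    """Return a unified diff of two strings (hypothetical helper)."""
    # keepends=True preserves "\n" so the joined diff has proper line breaks.
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(
            old_lines, new_lines,
            fromfile="old", tofile="new",
            n=context_lines,  # lines of context around each change
        )
    )

result = diff_text("a\nb\nc\nd\ne\n", "a\nb\nC\nd\ne\n")
print(result)
```

With one line of context, only the lines directly around the change appear in the hunk; the untouched first and last lines are left out.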

context_diff — more context

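context_diff shows each change with surrounding context, using *** and --- markers for the old and new versions and ! for changed lines. It’s the classic Unix diff -c format. Despite the name, it shows no more context than unified_diff by default (both use three lines, set by the n parameter); it is just more verbose, which is why unified format is more common today.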
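For comparison, here is a short sketch that runs the same two lists from the unified_diff example through context_diff. The variable name context_lines is only illustrative.

```python
import difflib

old = ["alpha\n", "beta\n", "gamma\n", "delta\n"]
new = ["alpha\n", "BETA\n", "gamma\n", "epsilon\n"]

# Same inputs as the unified_diff example, rendered in context format.
context_lines = list(
    difflib.context_diff(old, new, fromfile="v1.txt", tofile="v2.txt")
)
print("".join(context_lines))
```

The old and new versions of each hunk are printed as separate blocks, with changed lines prefixed by "! ".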

ndiff — character-level precision

diff = difflib.ndiff(old, new)
print("".join(diff))

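For lines that are similar but not identical, the output includes ? guide lines whose ^, + and - markers point to the exact character positions that changed. Lines that differ too much, such as beta and BETA above, are reported as a plain removal and addition with no guide lines. Useful when differences are subtle (typos, whitespace).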
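A small sketch of the intraline markers, using two made-up lines that differ by a single character so that ndiff treats them as a near match:

```python
import difflib

old_lines = ["The quick brown fox\n"]
new_lines = ["The quick brown fax\n"]

# Near-identical lines trigger the "? " guide lines.
guide = list(difflib.ndiff(old_lines, new_lines))
print("".join(guide), end="")
```

Each "? " line sits under the line it annotates, with a ^ below the character that was replaced.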

Similarity scoring with SequenceMatcher

from difflib import SequenceMatcher

s = SequenceMatcher(None, "Python programming", "Pyhton programing")
print(s.ratio())  # 0.914... (16 matched characters: 2 * 16 / 35)

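ratio() returns a float between 0.0 (nothing in common) and 1.0 (identical). The algorithm finds the longest contiguous matching block, repeats the search on the pieces to the left and right of it, and computes 2 * M / T, where M is the total number of matched elements and T is the combined length of both sequences. Because of how blocks are chosen, the result can depend on the order of the arguments.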
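The formula can be checked by hand. This sketch sums the sizes returned by get_matching_blocks() and compares the result with ratio(); the names matched and manual_ratio are only illustrative.

```python
from difflib import SequenceMatcher

a, b = "abcd", "bcde"
matcher = SequenceMatcher(None, a, b)

# get_matching_blocks() ends with a zero-length sentinel, which adds nothing.
matched = sum(block.size for block in matcher.get_matching_blocks())
manual_ratio = 2.0 * matched / (len(a) + len(b))

print(matched)          # 3 (the shared block "bcd")
print(manual_ratio)     # 0.75
print(matcher.ratio())  # 0.75
```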

Practical uses

Duplicate detection: flag articles or records with ratio > 0.85
Typo detection: catch near-miss inputs
Version comparison: quantify how much a document changed between revisions
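As an illustration of the duplicate-detection case, here is a minimal sketch assuming a hypothetical find_near_duplicates helper and the 0.85 threshold mentioned above. It compares every pair, so the cost grows quadratically with the number of texts.

```python
from difflib import SequenceMatcher
from itertools import combinations

def find_near_duplicates(texts, threshold=0.85):
    """Return (i, j, score) for every pair scoring above the threshold."""
    pairs = []
    for (i, first), (j, second) in combinations(enumerate(texts), 2):
        score = SequenceMatcher(None, first, second).ratio()
        if score > threshold:
            pairs.append((i, j, score))
    return pairs

# Made-up sample data for the sketch.
titles = [
    "How to install Python on Windows",
    "How to install Python on Windows 10",
    "Baking sourdough bread at home",
]
duplicates = find_near_duplicates(titles)
print(duplicates)
```

Only the first two titles are flagged; the right threshold depends on the data and is worth tuning against known duplicates.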

The junk parameter

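The first argument to SequenceMatcher is a function that identifies “junk” elements: items, such as spaces or blank lines, that should not be used to anchor a match. Junk is not stripped from the input and still counts toward the sequence lengths in ratio(); it only affects which blocks get aligned. Pass None to treat nothing as junk. Common use: keeping whitespace from driving the alignment.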

s = SequenceMatcher(lambda x: x == " ", "a b c", "a  b  c")
print(s.ratio())  # junk changes the alignment; the score is not guaranteed to rise
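The effect of junk is easiest to see on the longest matching block rather than on the ratio. This sketch uses two short strings and compares the result with and without a junk function; the variable names are only illustrative.

```python
from difflib import SequenceMatcher

a, b = " abcd", "abcd abcd"

plain = SequenceMatcher(None, a, b)
spaces_as_junk = SequenceMatcher(lambda ch: ch == " ", a, b)

# Without junk, the leading space is part of the longest block.
print(plain.find_longest_match(0, len(a), 0, len(b)))
# Match(a=0, b=4, size=5)

# With spaces as junk, the block has to start on a non-junk element.
print(spaces_as_junk.find_longest_match(0, len(a), 0, len(b)))
# Match(a=1, b=0, size=4)
```

The junk filter moves the match from " abcd" in the middle of the second string to the "abcd" at its start.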

Fuzzy matching with get_close_matches

from difflib import get_close_matches

commands = ["status", "start", "stop", "restart", "stats"]
user_input = "statr"

matches = get_close_matches(user_input, commands, n=3, cutoff=0.6)
print(matches)  # ['stats', 'start', 'status'] (best first; ties broken by string order)

Parameters:

n: maximum number of matches to return (default 3)
cutoff: minimum similarity ratio (default 0.6)

This is the mechanism behind “did you mean…?” prompts in CLIs and search interfaces. Results come back sorted from most to least similar, and the approach works well for small to medium vocabularies.
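A minimal sketch of such a prompt, assuming a hypothetical suggest helper and a fixed command list. The message wording is only an example.

```python
from difflib import get_close_matches

KNOWN_COMMANDS = ["status", "start", "stop", "restart", "stats"]

def suggest(command, known=KNOWN_COMMANDS):
    """Return a 'did you mean' message, or None if no hint applies."""
    if command in known:
        return None  # valid command, nothing to suggest
    matches = get_close_matches(command, known, n=1, cutoff=0.6)
    if not matches:
        return None  # nothing similar enough
    return f"Unknown command {command!r}. Did you mean {matches[0]!r}?"

print(suggest("stpo"))  # Unknown command 'stpo'. Did you mean 'stop'?
print(suggest("xyz"))   # None
```

Raising cutoff makes suggestions rarer but more trustworthy; lowering it does the opposite.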

HTML diff

HtmlDiff generates a side-by-side HTML table highlighting changes:

from difflib import HtmlDiff

old = "The quick brown fox\njumps over\nthe lazy dog".splitlines()
new = "The quick red fox\nleaps over\nthe lazy cat".splitlines()

html = HtmlDiff().make_file(old, new, fromdesc="Original", todesc="Revised")
with open("diff.html", "w", encoding="utf-8") as f:  # make_file declares utf-8 by default
    f.write(html)

Open diff.html in a browser and you get a color-coded side-by-side comparison. Useful for generating reports from automated tests or document review workflows.
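When the comparison needs to be embedded in an existing page, make_table returns just the table markup. The sketch below also turns on contextual output so that long unchanged stretches are hidden. The sample lines and the wrapcolumn value are arbitrary choices for illustration.

```python
from difflib import HtmlDiff

old = ["line 1", "line 2", "line 3"]
new = ["line 1", "line two", "line 3"]

differ = HtmlDiff(wrapcolumn=60)  # wrap long lines at 60 columns

# make_table returns only the <table> markup, without the <html> wrapper.
table_html = differ.make_table(
    old, new, fromdesc="Original", todesc="Revised",
    context=True, numlines=1,  # changed rows plus 1 line of context
)
print(table_html)
```

The table relies on the CSS classes that make_file normally ships with, so the host page needs its own styles for the highlighting to show.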

Common misconception

Many developers assume difflib is only for text files. SequenceMatcher works on any sequences whose elements are hashable: lists of numbers, lists of tuples, even lists of database rows converted to tuples. It operates on sequences, not strings specifically.

from difflib import SequenceMatcher

old_config = [("host", "localhost"), ("port", 5432), ("ssl", True)]
new_config = [("host", "db.prod.com"), ("port", 5432), ("ssl", True)]

s = SequenceMatcher(None, old_config, new_config)
for op, i1, i2, j1, j2 in s.get_opcodes():
    if op != "equal":
        print(f"{op}: {old_config[i1:i2]} → {new_config[j1:j2]}")
# replace: [('host', 'localhost')] → [('host', 'db.prod.com')]
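The elements do need to be hashable, which rules out raw dicts. A minimal sketch of one workaround, assuming a hypothetical freeze helper that converts each row to a tuple of sorted items:

```python
from difflib import SequenceMatcher

old_rows = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bob"}]
new_rows = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Rob"}]

def freeze(row):
    """Turn a dict into a hashable tuple of sorted (key, value) pairs."""
    return tuple(sorted(row.items()))

try:
    SequenceMatcher(None, old_rows, new_rows)
    raised = False
except TypeError:
    raised = True  # dicts are unhashable, so indexing the sequence fails

matcher = SequenceMatcher(
    None,
    [freeze(row) for row in old_rows],
    [freeze(row) for row in new_rows],
)
changes = [op for op in matcher.get_opcodes() if op[0] != "equal"]
print(changes)  # [('replace', 1, 2, 1, 2)]
```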

When to use difflib vs alternatives

Need	difflib	Alternative
Line-by-line file diff	✅	Unix `diff` command
Fuzzy string matching at scale	❌ (slow for 100k+ items)	`rapidfuzz`, `thefuzz` (formerly `fuzzywuzzy`)
Structural diff (JSON, XML)	❌	`deepdiff`, `xmldiff`
Character-level edit distance	Partial	`python-Levenshtein`
Git-style diffs	✅ (unified format)	`pygit2`, `gitpython`

difflib is great for small to medium datasets and prototyping. For high-throughput fuzzy matching, rapidfuzz (implemented in C++) is typically much faster, often by one to two orders of magnitude depending on the workload.
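Before reaching for another library, some of the cost can be trimmed within difflib itself. SequenceMatcher offers quick_ratio() and real_quick_ratio(), which are cheap upper bounds on ratio(), and it caches its analysis of the second sequence. The sketch below, assuming a hypothetical best_match helper, uses both in roughly the way get_close_matches does internally.

```python
from difflib import SequenceMatcher

def best_match(query, candidates, cutoff=0.6):
    """Return (candidate, score) for the best match at or above cutoff, else None."""
    matcher = SequenceMatcher()
    matcher.set_seq2(query)  # seq2 is analysed once and reused for every candidate
    best = None
    for candidate in candidates:
        matcher.set_seq1(candidate)
        # Cheap upper bounds first; ratio() is only computed for survivors.
        if matcher.real_quick_ratio() < cutoff or matcher.quick_ratio() < cutoff:
            continue
        score = matcher.ratio()
        if score >= cutoff and (best is None or score > best[1]):
            best = (candidate, score)
    return best

result = best_match("statr", ["status", "stop", "restart"])
print(result)
```

This helps most when the majority of candidates are clearly dissimilar; it does not change the underlying cost of ratio() for the ones that pass the filters.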

The one thing to remember: difflib gives you diffs, similarity scores, and fuzzy matching in one standard library module — perfect for CLIs, reports, and prototypes without any dependencies.
