Python difflib Text Comparison — Core Concepts
What difflib provides
The difflib module has three main capabilities:
- Generating diffs — show line-by-line or character-level differences between two texts
- Similarity scoring — compute how alike two sequences are (0.0 to 1.0)
- Fuzzy matching — find the closest matches from a list of possibilities
Generating diffs
unified_diff — the standard format
import difflib
old = ["alpha\n", "beta\n", "gamma\n", "delta\n"]
new = ["alpha\n", "BETA\n", "gamma\n", "epsilon\n"]
diff = difflib.unified_diff(old, new, fromfile="v1.txt", tofile="v2.txt")
print("".join(diff))
Output:
--- v1.txt
+++ v2.txt
@@ -1,4 +1,4 @@
 alpha
-beta
+BETA
 gamma
-delta
+epsilon
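Lines starting with - were removed, lines starting with + were added, and lines starting with a single space are unchanged context. The @@ header gives the line ranges the hunk covers in each file. This is the same unified format that git diff prints.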
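In practice the inputs are often whole strings rather than ready-made lists of lines. The following is a minimal sketch of a wrapper for that case; the helper name `unified` and the labels "old" and "new" are illustrative choices, not part of difflib. It relies on `splitlines(keepends=True)` to keep the newlines that `unified_diff` expects, and on the `n` parameter to control how many context lines surround each change.

```python
import difflib


def unified(old_text: str, new_text: str, context: int = 3) -> str:
    """Return a unified diff of two multi-line strings (hypothetical helper)."""
    # keepends=True preserves the trailing "\n" on each line, which
    # unified_diff expects when the default lineterm is used.
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    diff = difflib.unified_diff(
        old_lines, new_lines, fromfile="old", tofile="new", n=context
    )
    return "".join(diff)


report = unified("alpha\nbeta\ngamma\n", "alpha\nBETA\ngamma\n", context=1)
print(report)
```

When the two inputs are identical, `unified_diff` yields nothing, so the helper returns an empty string. That makes it easy to use as a quick "did anything change?" check.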
context_diff — more context
context_diff shows changes with surrounding context using *** and --- markers. It’s the classic Unix diff -c format. Unified format is more common today.
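For comparison, here is a small sketch that runs `context_diff` on the same two lists used in the unified example. The variable names are illustrative. Changed lines are prefixed with `! `, and the old and new versions of each hunk are printed as separate blocks.

```python
import difflib

old_lines = ["alpha\n", "beta\n", "gamma\n", "delta\n"]
new_lines = ["alpha\n", "BETA\n", "gamma\n", "epsilon\n"]

# context_diff takes the same basic arguments as unified_diff.
context_text = "".join(
    difflib.context_diff(old_lines, new_lines, fromfile="v1.txt", tofile="v2.txt")
)
print(context_text)
```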
ndiff — character-level precision
diff = difflib.ndiff(old, new)
print("".join(diff))
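For changed lines that are similar enough to be paired up, ndiff adds guide lines starting with ? that use ^, + and - to point at the exact characters that differ. This is useful when differences are subtle (typos, whitespace). The changed lines in this particular example differ too much to be paired, so here the output shows only the - and + lines.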
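The character-level guide lines are easiest to see on two lines that differ by a single character. This is a small sketch with made-up input; it collects any line beginning with `? ` so the markers can be inspected separately.

```python
import difflib

old_lines = ["colour\n"]
new_lines = ["color\n"]

lines = list(difflib.ndiff(old_lines, new_lines))
print("".join(lines), end="")

# Guide lines start with "? " and sit under the line they annotate.
guide_lines = [line for line in lines if line.startswith("? ")]
print(f"{len(guide_lines)} guide line(s) found")
```

ndiff only emits guide lines when it judges two lines to be close matches, so very different lines appear as a plain removal followed by a plain addition.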
Similarity scoring with SequenceMatcher
from difflib import SequenceMatcher
s = SequenceMatcher(None, "Python programming", "Pyhton programing")
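print(s.ratio()) # 0.914...
ratio() returns a float between 0.0 (nothing in common) and 1.0 (identical). The algorithm finds the longest contiguous matching block, then repeats the search on the pieces to the left and right of that block. The score is 2 * M / T, where M is the number of matched elements and T is the total number of elements in both sequences. Here M = 16 and T = 18 + 17 = 35, which gives 32 / 35.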
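The formula can be checked by hand with `get_matching_blocks()`, which returns the matched blocks that ratio() counts. A minimal sketch, reusing the two strings from the example above:

```python
from difflib import SequenceMatcher

a = "Python programming"
b = "Pyhton programing"
s = SequenceMatcher(None, a, b)

# Each block has a .size; the final block is a zero-length sentinel.
matches = sum(block.size for block in s.get_matching_blocks())
total = len(a) + len(b)
manual = 2.0 * matches / total

print(matches, total)
print(manual, s.ratio())
```

Both printed scores should agree, since ratio() is computed from the same matched-element count.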
Practical uses
- Duplicate detection: flag articles or records with ratio > 0.85
- Typo detection: catch near-miss inputs
- Version comparison: quantify how much a document changed between revisions
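As a rough sketch of the duplicate-detection idea, the snippet below compares every pair of strings and keeps the pairs that score above a threshold. The function name `find_near_duplicates`, the sample titles, and the 0.85 default are illustrative assumptions. Because it compares all pairs, it only suits small collections.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_near_duplicates(texts, threshold=0.85):
    """Return (i, j, score) for each pair scoring above threshold (hypothetical helper)."""
    pairs = []
    # combinations() visits each unordered pair exactly once.
    for (i, a), (j, b) in combinations(enumerate(texts), 2):
        score = SequenceMatcher(None, a, b).ratio()
        if score > threshold:
            pairs.append((i, j, score))
    return pairs


titles = [
    "Python difflib tutorial",
    "Python difflib tutorial!",
    "Cooking with cast iron",
]
duplicates = find_near_duplicates(titles)
for i, j, score in duplicates:
    print(f"{titles[i]!r} ~ {titles[j]!r} ({score:.3f})")
```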
The junk parameter
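The first argument to SequenceMatcher, isjunk, is either None or a one-argument function that returns True for “junk” elements: items the matcher should not use as anchors when lining up the two sequences. A common use is to treat spaces as junk. Junk elements still count toward the lengths used by ratio(), so filtering changes which blocks get matched rather than guaranteeing a higher score.
s = SequenceMatcher(lambda x: x == " ", "a b c", "a b c")
print(s.ratio()) # 1.0 here, with or without the junk function, because the two strings are identical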
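The effect of junk is easier to see in which block gets matched than in the score. The sketch below follows the example in the difflib documentation for `find_longest_match`. Without a junk function the longest match starts on the blank; with blanks marked as junk, the matcher anchors on the letters instead.

```python
from difflib import SequenceMatcher

a, b = " abcd", "abcd abcd"

plain = SequenceMatcher(None, a, b)
blanks_are_junk = SequenceMatcher(lambda ch: ch == " ", a, b)

# Explicit bounds keep this compatible with older Python versions.
m_plain = plain.find_longest_match(0, len(a), 0, len(b))
m_junk = blanks_are_junk.find_longest_match(0, len(a), 0, len(b))

print(m_plain)  # Match(a=0, b=4, size=5): " abcd" matched against the tail of b
print(m_junk)   # Match(a=1, b=0, size=4): "abcd" matched against the head of b
```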
Fuzzy matching with get_close_matches
from difflib import get_close_matches
commands = ["status", "start", "stop", "restart", "stats"]
user_input = "statr"
matches = get_close_matches(user_input, commands, n=3, cutoff=0.6)
print(matches) # ['stats', 'start', 'status']
Parameters:
- n: maximum number of matches to return (default 3)
- cutoff: minimum similarity ratio (default 0.6)
This is the “did you mean…?” feature in CLIs and search interfaces. It’s surprisingly effective for small to medium vocabulary sizes.
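A “did you mean…?” prompt can be sketched in a few lines. The helper names `suggest` and `unknown_command_message` and the message wording are hypothetical; only `get_close_matches` comes from difflib. With n=1 the function returns the single best candidate, or None when nothing reaches the cutoff.

```python
from difflib import get_close_matches

COMMANDS = ["status", "start", "stop", "restart", "stats"]


def suggest(user_input, commands=COMMANDS):
    """Return the closest known command, or None if nothing is close enough."""
    matches = get_close_matches(user_input, commands, n=1, cutoff=0.6)
    return matches[0] if matches else None


def unknown_command_message(user_input):
    """Build an error message, adding a suggestion when one exists."""
    guess = suggest(user_input)
    if guess is None:
        return f"Unknown command: {user_input!r}"
    return f"Unknown command: {user_input!r}. Did you mean {guess!r}?"


print(unknown_command_message("stopp"))
print(unknown_command_message("xyz"))
```

Raising the cutoff makes suggestions rarer but more trustworthy; lowering it does the opposite. The right value depends on how similar the commands are to each other.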
HTML diff
HtmlDiff generates a side-by-side HTML table highlighting changes:
from difflib import HtmlDiff
old = "The quick brown fox\njumps over\nthe lazy dog".splitlines()
new = "The quick red fox\nleaps over\nthe lazy cat".splitlines()
html = HtmlDiff().make_file(old, new, fromdesc="Original", todesc="Revised")
with open("diff.html", "w", encoding="utf-8") as f:  # make_file declares charset utf-8 by default
    f.write(html)
Open diff.html in a browser and you get a color-coded side-by-side comparison. Useful for generating reports from automated tests or document review workflows.
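When the diff needs to be embedded in an existing page rather than written as a standalone file, `make_table` returns only the table markup. A brief sketch, with illustrative variable names; `context=True` with `numlines=1` limits the output to changed lines plus one line of context on each side, and `wrapcolumn` wraps long lines.

```python
from difflib import HtmlDiff

old_lines = "The quick brown fox\njumps over\nthe lazy dog".splitlines()
new_lines = "The quick red fox\nleaps over\nthe lazy cat".splitlines()

differ = HtmlDiff(wrapcolumn=40)
table_html = differ.make_table(
    old_lines,
    new_lines,
    fromdesc="Original",
    todesc="Revised",
    context=True,  # show only changed regions
    numlines=1,    # with one line of context around each
)
print(table_html[:200])
```

The table relies on CSS classes that `make_file` would normally include, so the host page needs its own styles for the highlighting to show.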
Common misconception
Many developers assume difflib is only for text files. In fact, SequenceMatcher works on any pair of sequences whose elements are hashable: lists of numbers, lists of tuples, even lists of database rows. It operates on sequences in general, not on strings specifically.
from difflib import SequenceMatcher
old_config = [("host", "localhost"), ("port", 5432), ("ssl", True)]
new_config = [("host", "db.prod.com"), ("port", 5432), ("ssl", True)]
s = SequenceMatcher(None, old_config, new_config)
for op, i1, i2, j1, j2 in s.get_opcodes():
    if op != "equal":
        print(f"{op}: {old_config[i1:i2]} → {new_config[j1:j2]}")
# replace: [('host', 'localhost')] → [('host', 'db.prod.com')]
When to use difflib vs alternatives
| Need | difflib | Alternative |
|---|---|---|
| Line-by-line file diff | ✅ | Unix diff command |
| Fuzzy string matching at scale | ❌ (slow for 100k+ items) | rapidfuzz, thefuzz (formerly fuzzywuzzy) |
| Structural diff (JSON, XML) | ❌ | deepdiff, xmldiff |
| Character-level edit distance | Partial | python-Levenshtein |
| Git-style diffs | ✅ (unified format) | pygit2, gitpython |
difflib is great for small to medium datasets and prototyping. For high-throughput fuzzy matching, rapidfuzz (C-accelerated) is typically much faster, though the exact speedup depends on the data and the scorer used.
The one thing to remember: difflib gives you diffs, similarity scores, and fuzzy matching in one standard library module — perfect for CLIs, reports, and prototypes without any dependencies.