Python difflib Text Comparison — Core Concepts

What difflib provides

The difflib module has three main capabilities:

  1. Generating diffs — show line-by-line or character-level differences between two texts
  2. Similarity scoring — compute how alike two sequences are (0.0 to 1.0)
  3. Fuzzy matching — find the closest matches from a list of possibilities

Generating diffs

unified_diff — the standard format

import difflib

old = ["alpha\n", "beta\n", "gamma\n", "delta\n"]
new = ["alpha\n", "BETA\n", "gamma\n", "epsilon\n"]

diff = difflib.unified_diff(old, new, fromfile="v1.txt", tofile="v2.txt")
print("".join(diff))

Output:

--- v1.txt
+++ v2.txt
@@ -1,4 +1,4 @@
 alpha
-beta
+BETA
 gamma
-delta
+epsilon

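Lines starting with - were removed, lines starting with + were added, and lines starting with a space are unchanged context. The @@ header gives the starting line and length of the hunk in each file. This is the same format git diff produces by default.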
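In practice the inputs are often whole strings rather than prepared lists. The following is a minimal sketch, assuming both texts are already in memory (old_text, new_text and the file labels are illustrative names). splitlines(keepends=True) keeps the trailing newlines that unified_diff expects, and n sets how many context lines surround each change.

```python
import difflib

old_text = "alpha\nbeta\ngamma\ndelta\n"
new_text = "alpha\nBETA\ngamma\nepsilon\n"

# keepends=True preserves "\n" so the joined diff prints one entry per line.
diff_lines = list(
    difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile="v1.txt",
        tofile="v2.txt",
        n=0,  # no context lines: only the changed lines appear in each hunk
    )
)
print("".join(diff_lines))
```

With n=0 each change becomes its own hunk; the default n=3 merges nearby changes into one hunk, as in the output above.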

context_diff — more context

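context_diff shows changes with surrounding context using *** and --- section markers, and flags changed lines with !. It’s the classic Unix diff -c format. It does not show more surrounding lines than unified_diff: both default to three lines of context (n=3). It is simply more verbose, because the old and new versions of each hunk are printed separately. Unified format is more common today.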
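For comparison, here is a small sketch that runs context_diff on the same two lists used in the unified_diff example (the file labels are the same illustrative names):

```python
import difflib

old = ["alpha\n", "beta\n", "gamma\n", "delta\n"]
new = ["alpha\n", "BETA\n", "gamma\n", "epsilon\n"]

# The old hunk is printed in full, then the new hunk; "! " marks changed lines.
context_lines = list(
    difflib.context_diff(old, new, fromfile="v1.txt", tofile="v2.txt")
)
print("".join(context_lines))
```

The same four input lines take noticeably more output lines here than in unified format, which is the main reason unified diffs became the default.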

ndiff — character-level precision

diff = difflib.ndiff(old, new)
print("".join(diff))

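When a removed line and an added line are similar enough, ndiff inserts guide lines starting with ? that use ^, - and + to point at the exact characters that changed. The two lists above differ too much per line to trigger them (for example, "beta" and "BETA" share no letters), so this output contains only -, + and unchanged lines. The guides are useful when differences are subtle (typos, whitespace).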
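To see the intraline guides, the two lines need to be close to each other. A minimal sketch with a one-letter change (the sentences are made up for illustration):

```python
import difflib

before = ["the quick brown fox\n"]
after = ["the quick brawn fox\n"]

# The lines are nearly identical, so ndiff adds "? " guide lines
# that point at the character that was replaced.
ndiff_lines = list(difflib.ndiff(before, after))
print("".join(ndiff_lines))
```

Each ? line sits directly under the - or + line it annotates, so the output is meant to be read in a monospaced font.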

Similarity scoring with SequenceMatcher

from difflib import SequenceMatcher

s = SequenceMatcher(None, "Python programming", "Pyhton programing")
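print(s.ratio())  # 0.9142...

ratio() returns a float between 0.0 (nothing in common) and 1.0 (identical). The algorithm finds the longest contiguous matching block, repeats the search on the pieces to its left and right, and then computes 2 * matches / total_elements, where matches is the number of matched elements and total_elements is the combined length of both sequences. Here that is 2 * 16 / 35.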
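The formula can be checked by hand from the matching blocks. This sketch recomputes the score from get_matching_blocks() and compares it with ratio(); the variable names are illustrative.

```python
from difflib import SequenceMatcher

a = "Python programming"
b = "Pyhton programing"

matcher = SequenceMatcher(None, a, b)
blocks = matcher.get_matching_blocks()  # the last block is a zero-length sentinel

# Show which slices of `a` were matched.
for block in blocks:
    print(block, repr(a[block.a:block.a + block.size]))

matches = sum(block.size for block in blocks)
total = len(a) + len(b)
manual_ratio = 2.0 * matches / total
print(manual_ratio, matcher.ratio())
```

Because only contiguous blocks found by this recursive search are counted, the score can depend on argument order and is not an edit distance.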

Practical uses

  • Duplicate detection: flag articles or records with ratio > 0.85
  • Typo detection: catch near-miss inputs
  • Version comparison: quantify how much a document changed between revisions
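As a rough sketch of the duplicate-detection idea, the helper below compares every pair of strings and reports those above the 0.85 threshold mentioned in the list. The function name near_duplicates and the sample titles are hypothetical, and the pairwise loop is quadratic, so it only suits small collections.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicates(texts, threshold=0.85):
    """Return (i, j, ratio) for each pair of texts scoring above threshold."""
    pairs = []
    for (i, left), (j, right) in combinations(enumerate(texts), 2):
        score = SequenceMatcher(None, left, right).ratio()
        if score > threshold:
            pairs.append((i, j, score))
    return pairs


titles = [
    "How to parse JSON in Python",
    "How to parse JSON in Python 3",
    "Sorting dictionaries by value",
]
flagged = near_duplicates(titles)
print(flagged)
```

The right threshold depends on the data; 0.85 is a starting point to tune, not a rule.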

The junk parameter

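The first argument to SequenceMatcher is a function that identifies “junk” elements: items the matcher should not use as anchors when it looks for matching blocks. Pass None to disable it. Junk changes how the two sequences are aligned; junk elements that sit next to a match are still absorbed into it and still count toward ratio(), so filtering does not by itself raise the score. Common use: stop runs of whitespace from driving the alignment.

s = SequenceMatcher(lambda x: x == " ", "a b c", "a  b  c")
print(s.ratio())  # 0.8333..., the same as with None for this pair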
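One way to see what the junk function actually does is to build two matchers on the same pair and compare their matching blocks. This is a small sketch; plain and junked are illustrative names, and bjunk is the matcher attribute holding the elements of the second sequence that were classified as junk.

```python
from difflib import SequenceMatcher

a, b = "a b c", "a  b  c"

plain = SequenceMatcher(None, a, b)
junked = SequenceMatcher(lambda ch: ch == " ", a, b)

print(junked.bjunk)                   # elements of b treated as junk
print(plain.get_matching_blocks())    # alignment without a junk function
print(junked.get_matching_blocks())   # alignment anchored on non-space characters
print(plain.ratio(), junked.ratio())  # the two scores are equal for this pair
```

Printing the blocks side by side shows where each matcher anchored. The junk function is a tool for steering the alignment, not for boosting the similarity score.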

Fuzzy matching with get_close_matches

from difflib import get_close_matches

commands = ["status", "start", "stop", "restart", "stats"]
user_input = "statr"

matches = get_close_matches(user_input, commands, n=3, cutoff=0.6)
print(matches)  # ['stats', 'start', 'status']

Parameters:

  • n: maximum number of matches to return (default 3)
  • cutoff: minimum similarity ratio (default 0.6)

This is the “did you mean…?” feature in CLIs and search interfaces. It’s surprisingly effective for small to medium vocabulary sizes.
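A “did you mean” message can be built in a few lines. The sketch below is one possible shape, not a standard recipe: the suggest function and the message wording are hypothetical, and only get_close_matches comes from difflib.

```python
from difflib import get_close_matches

COMMANDS = ["status", "start", "stop", "restart", "stats"]


def suggest(user_input, commands=COMMANDS):
    """Return None for a known command, otherwise an error message."""
    if user_input in commands:
        return None
    candidates = get_close_matches(user_input, commands, n=3, cutoff=0.6)
    if candidates:
        return f"Unknown command {user_input!r}. Did you mean: {', '.join(candidates)}?"
    return f"Unknown command {user_input!r}."


print(suggest("statr"))
print(suggest("xyz"))
```

Raising cutoff makes suggestions stricter; lowering it returns more candidates at the cost of weaker matches.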

HTML diff

HtmlDiff generates a side-by-side HTML table highlighting changes:

from difflib import HtmlDiff

old = "The quick brown fox\njumps over\nthe lazy dog".splitlines()
new = "The quick red fox\nleaps over\nthe lazy cat".splitlines()

html = HtmlDiff().make_file(old, new, fromdesc="Original", todesc="Revised")
with open("diff.html", "w", encoding="utf-8") as f:
    f.write(html)

Open diff.html in a browser and you get a color-coded side-by-side comparison. Useful for generating reports from automated tests or document review workflows.
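When the comparison has to be embedded in an existing page, make_table returns only the table markup rather than a full document. A short sketch, assuming the host page supplies its own styling (make_file bundles the CSS, make_table does not); wrapcolumn, context and numlines are optional settings chosen here for illustration.

```python
from difflib import HtmlDiff

old_lines = "The quick brown fox\njumps over\nthe lazy dog".splitlines()
new_lines = "The quick red fox\nleaps over\nthe lazy cat".splitlines()

differ = HtmlDiff(wrapcolumn=60)  # wrap long lines at 60 columns
table_html = differ.make_table(
    old_lines,
    new_lines,
    fromdesc="Original",
    todesc="Revised",
    context=True,  # show only changed regions plus surrounding lines
    numlines=2,    # number of context lines around each change
)
print(table_html[:200])
```

For long files, context=True keeps the report focused on the regions that changed.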

Common misconception

Many developers assume difflib is only for text files. It works on any sequence of hashable items: lists of numbers, tuples, or database rows represented as tuples. SequenceMatcher operates on sequences, not strings specifically; the only requirement is that the elements can be hashed and compared for equality.

from difflib import SequenceMatcher

old_config = [("host", "localhost"), ("port", 5432), ("ssl", True)]
new_config = [("host", "db.prod.com"), ("port", 5432), ("ssl", True)]

s = SequenceMatcher(None, old_config, new_config)
for op, i1, i2, j1, j2 in s.get_opcodes():
    if op != "equal":
        print(f"{op}: {old_config[i1:i2]} → {new_config[j1:j2]}")
# replace: [('host', 'localhost')] → [('host', 'db.prod.com')]
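The same approach works for plain numbers. This sketch compares two lists of integers (old_ids and new_ids are illustrative) and also shows the one restriction: elements must be hashable, so a list of lists is rejected.

```python
from difflib import SequenceMatcher

old_ids = [1, 2, 3, 4, 5]
new_ids = [1, 2, 4, 5, 6]

matcher = SequenceMatcher(None, old_ids, new_ids)
opcodes = matcher.get_opcodes()
for tag, i1, i2, j1, j2 in opcodes:
    print(tag, old_ids[i1:i2], new_ids[j1:j2])

# Elements are stored in a dict internally, so unhashable items fail.
rejected = False
try:
    SequenceMatcher(None, [[1, 2]], [[1, 2]])
except TypeError as exc:
    rejected = True
    print("unhashable elements are rejected:", exc)
```

Converting inner lists to tuples is usually enough to make such data comparable.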

When to use difflib vs alternatives

Need                            | difflib                     | Alternative
Line-by-line file diff          | ✅                          | Unix diff command
Fuzzy string matching at scale  | ❌ (slow for 100k+ items)   | rapidfuzz, fuzzywuzzy
Structural diff (JSON, XML)     | ❌                          | deepdiff, xmldiff
Character-level edit distance   | Partial                     | python-Levenshtein
Git-style diffs                 | ✅ (unified format)         | pygit2, gitpython

difflib is great for small to medium datasets and prototyping. For high-throughput fuzzy matching, rapidfuzz (C-accelerated) is typically 10-100x faster.

The one thing to remember: difflib gives you diffs, similarity scores, and fuzzy matching in one standard library module — perfect for CLIs, reports, and prototypes without any dependencies.
