Data Sanitization in Python — Core Concepts

Sanitization vs. validation

These are related but different concepts. Validation checks whether data meets expectations — is this a valid email? Is this number in range? Invalid data gets rejected. Sanitization transforms data to make it safe — stripping HTML tags, escaping special characters, normalizing Unicode. Data passes through changed.

In practice, you use both: validate first (reject obviously wrong data), then sanitize what passes (clean it for safe use).
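
The split can be sketched in a few lines. The function names and the email pattern below are illustrative assumptions, and the pattern is deliberately loose rather than a complete email validator.

```python
import re
import unicodedata

# Deliberately simple pattern for illustration; not a full email validator.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def validate_email(value: str) -> bool:
    """Validation: answer yes or no without changing the data."""
    return EMAIL_RE.fullmatch(value) is not None


def sanitize_display_name(value: str) -> str:
    """Sanitization: transform the data and let it through."""
    value = unicodedata.normalize("NFC", value)
    return " ".join(value.split())


print(validate_email("alice@example.com"))           # True
print(validate_email("not-an-email"))                # False
print(sanitize_display_name("  Alice \t  Smith\n"))  # Alice Smith
```

The validator never modifies its input and the sanitizer never rejects it, which keeps the two responsibilities easy to test separately.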

Why sanitization matters

Untrusted data flows into your application from many sources: form fields, URL parameters, API request bodies, uploaded files, webhook payloads, and even database records that were stored before sanitization was added.

If this data reaches your templates without escaping, you get cross-site scripting (XSS). If it reaches SQL queries without parameterization, you get injection attacks. If it reaches shell commands, you get command injection. Sanitization is the practice of ensuring data is safe for its specific destination.

Context-dependent escaping

The same data needs different treatment depending on where it goes:

HTML context: Characters like <, >, &, ", and ' must be escaped to prevent script injection. Python’s html.escape() handles this.
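
As a minimal sketch of HTML-body escaping with the standard library (the helper name render_comment is hypothetical):

```python
from html import escape


def render_comment(comment: str) -> str:
    """Wrap untrusted text in a paragraph, escaping it for an HTML body."""
    # escape() converts &, <, > and, by default, both quote characters.
    return f"<p>{escape(comment)}</p>"


print(render_comment('<script>alert("hi")</script>'))
# <p>&lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt;</p>
```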

SQL context: Use parameterized queries (prepared statements) rather than string interpolation. The database driver handles escaping.
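
A small illustration using the standard-library sqlite3 module and an in-memory database. The table and the find_user helper are made up for this example; other drivers use different placeholder styles (for example %s).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))


def find_user(conn: sqlite3.Connection, name: str) -> list:
    """Look up users by name with a bound parameter."""
    # The ? placeholder keeps the value separate from the SQL text,
    # so quotes inside `name` cannot change the query structure.
    cur = conn.execute("SELECT id, name FROM users WHERE name = ?", (name,))
    return cur.fetchall()


print(find_user(conn, "alice"))        # [(1, 'alice')]
print(find_user(conn, "' OR '1'='1"))  # []
```

The injection attempt is treated as a literal name that matches no row.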

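URL context: Special characters need percent-encoding. Python’s urllib.parse.quote() handles this; note that it leaves "/" unescaped by default, so pass safe="" when encoding a single path segment or query value.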
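
A short sketch of percent-encoding with the standard library. The sample search term is invented; urlencode() is shown as one convenient way to build a whole query string.

```python
from urllib.parse import quote, urlencode

term = "rock & roll/jazz?"

# Default: "/" is treated as safe and left alone.
print(quote(term))            # rock%20%26%20roll/jazz%3F

# For a single path segment or query value, encode "/" as well.
print(quote(term, safe=""))   # rock%20%26%20roll%2Fjazz%3F

# urlencode() builds a query string and encodes spaces as "+".
print(urlencode({"q": term}))  # q=rock+%26+roll%2Fjazz%3F
```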

JSON context: Ensure strings are properly escaped within JSON. Python’s json.dumps() handles this automatically.
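
A brief sketch with an invented payload. It shows that json.dumps() escapes quotes and control characters for JSON, but leaves "<" and "/" untouched, so the result still needs HTML-aware handling if it is embedded in a page.

```python
import json

payload = {"name": 'Eve "the hacker"\n</script>'}
encoded = json.dumps(payload)

print(encoded)
# Quotes and the newline are escaped, so the value round-trips exactly.
print(json.loads(encoded) == payload)  # True
# JSON escaping is not HTML escaping: the closing tag survives as-is.
print("</script>" in encoded)          # True
```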

Shell context: Avoid passing user input to shell commands entirely. If unavoidable, use shlex.quote() or pass arguments as a list to subprocess.run().
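
A sketch of both options. It runs the current Python interpreter as a stand-in for an external program, and the hostile-looking argument is invented. Note that shlex.quote() targets POSIX shells only.

```python
import shlex
import subprocess
import sys

user_arg = "hello; echo injected"

# Preferred: pass arguments as a list so no shell ever parses the input.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", user_arg],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())  # hello; echo injected

# If a shell command string is unavoidable, quote each untrusted piece.
quoted = shlex.quote(user_arg)
print(quoted)  # 'hello; echo injected'
```

In the list form, the semicolon arrives at the child process as ordinary text rather than as a command separator.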

The critical insight: there’s no single “sanitize” function that works everywhere. Escaping for HTML doesn’t help in SQL, and vice versa.

Common sanitization techniques in Python

Stripping HTML tags:

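import re

def strip_tags(text: str) -> str:
    """Remove anything that looks like an HTML tag, keeping the text between tags.

    A regex is not an HTML parser: use this for display cleanup only, and
    still escape the result before rendering it as HTML.
    """
    return re.sub(r"<[^>]+>", "", text)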

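# To allow some HTML (e.g., bold, italic), use an allowlist sanitizer such as nh3
# (bleach is deprecated).
import nh3

# Example untrusted input
user_input = '<b>Hi</b> <script>alert(1)</script><a href="https://example.com" onclick="x()">link</a>'
clean_html = nh3.clean(
    user_input,
    tags={"b", "i", "em", "strong", "a", "p"},  # Allowed elements
    attributes={"a": {"href"}},  # Allowed attributes per element
)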

Normalizing whitespace and Unicode:

import unicodedata

def normalize_text(text: str) -> str:
    """Normalize Unicode to NFC and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    # split() with no arguments also drops leading and trailing whitespace,
    # so no separate strip() is needed.
    return " ".join(text.split())
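
To see why normalization matters, here is a small standalone demonstration with invented sample strings. It also contrasts NFC with NFKC, which additionally folds compatibility characters. Whether NFKC is appropriate depends on the application.

```python
import unicodedata

composed = "caf\u00e9"     # "é" as a single code point
decomposed = "cafe\u0301"  # "e" followed by a combining acute accent

# The two strings render identically but compare unequal until normalized.
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

fullwidth = "\uff21\uff22\uff23"  # fullwidth "ABC"
# NFC leaves compatibility characters alone; NFKC folds them to ASCII.
print(unicodedata.normalize("NFC", fullwidth) == fullwidth)  # True
print(unicodedata.normalize("NFKC", fullwidth))              # ABC
```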

Sanitizing filenames:

import re
from pathlib import PurePosixPath

def safe_filename(name: str) -> str:
    """Remove path separators and dangerous characters."""
    name = PurePosixPath(name).name  # Strip POSIX directory components
    # Keep word characters (letters, digits, underscore), spaces, hyphens, dots.
    # Backslashes, tabs, and newlines are dropped, which also removes
    # Windows-style separators.
    name = re.sub(r"[^\w \-.]", "", name)
    name = name.strip(". ")  # Remove leading/trailing dots and spaces (no hidden files)
    return name or "unnamed"
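
As an additional safeguard, one option is to confirm that the final path still resolves inside the intended directory before writing. This is a sketch under assumptions: resolve_upload_path and the upload directory are hypothetical, and Path.is_relative_to() requires Python 3.9 or later.

```python
from pathlib import Path


def resolve_upload_path(upload_dir: Path, filename: str) -> Path:
    """Return the target path, refusing anything outside upload_dir."""
    base = upload_dir.resolve()
    # resolve() collapses ".." segments; an absolute filename replaces base.
    target = (base / filename).resolve()
    if not target.is_relative_to(base):
        raise ValueError(f"path escapes upload directory: {filename!r}")
    return target
```

With this check, a name such as "../../etc/passwd" raises ValueError, while "report.pdf" resolves to a path under the upload directory.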

Framework-level sanitization

Django auto-escapes variables in templates by default. Writing {{ user_name }} in a template automatically escapes HTML characters. You must explicitly mark content as safe with |safe if you want raw HTML rendered — and that’s a signal to review the source of that data.

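Jinja2 (used by Flask) auto-escapes only when autoescaping is enabled: a bare jinja2.Environment defaults to autoescape=False, while Flask turns it on for templates rendered with render_template() whose names end in .html, .htm, .xml, or .xhtml. FastAPI with Jinja2 templates follows the same pattern, so check how the environment is configured rather than assuming escaping is active.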

Pydantic validates and coerces data types but doesn’t sanitize strings for HTML/SQL. You need explicit sanitization for string fields that will be rendered in HTML.

Common misconception

Many developers believe that sanitizing input once at the boundary is sufficient. This “sanitize on input” approach breaks when the same data is used in multiple contexts. A string safe for HTML display might be dangerous in a SQL query or a shell command. The better practice is output encoding — sanitize data at the point of use, appropriate to the destination context.
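
A compact illustration with an invented value: the raw string is stored unchanged and encoded separately for each destination. The last lines show one typical failure of sanitize-on-input, where a value escaped at storage time is escaped again at render time.

```python
import json
from html import escape
from urllib.parse import quote

raw = 'Tom & "Jerry" <3'  # stored exactly as the user typed it

# Encode per destination, at the moment of output.
html_out = f"<p>{escape(raw)}</p>"
url_out = "/search?q=" + quote(raw, safe="")
json_out = json.dumps({"title": raw})

print(html_out)  # <p>Tom &amp; &quot;Jerry&quot; &lt;3</p>
print(url_out)   # /search?q=Tom%20%26%20%22Jerry%22%20%3C3
print(json_out)  # {"title": "Tom & \"Jerry\" <3"}

# Failure mode of sanitize-on-input: an already-escaped value is escaped again.
stored_escaped = escape(raw)
print(escape(stored_escaped))
# Tom &amp;amp; &amp;quot;Jerry&amp;quot; &amp;lt;3
```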

The defense-in-depth approach

No single layer is enough. Combine these:

  1. Validate input at the boundary (reject bad data).
  2. Use framework defaults (auto-escaping in templates, parameterized queries).
  3. Sanitize at the point of output when you’re constructing strings for specific contexts.
  4. Monitor for anomalies in stored data.

This layered approach means a failure in one layer doesn’t compromise the entire system.

The one thing to remember: Sanitize data based on where it’s going, not just where it came from — HTML escaping, SQL parameterization, and URL encoding are all different operations for different contexts.
