PII Detection in Python — Deep Dive
Microsoft Presidio: the standard toolkit
Presidio is an open-source PII detection and anonymization framework from Microsoft. It combines regex patterns, NER models, and checksum validators into a configurable pipeline.
# pip install presidio-analyzer presidio-anonymizer
# python -m spacy download en_core_web_lg
from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern
from presidio_anonymizer import AnonymizerEngine

# Loads the built-in recognizers and the default spaCy model
analyzer = AnalyzerEngine()

# The card number below is a common test number that passes the Luhn
# checksum; Presidio discards card-like digit runs that fail it.
text = """
Dear Support,
My name is Sarah Chen and my email is sarah.chen@example.com.
My SSN is 123-45-6789 and my credit card is 4111-1111-1111-1111.
Please update my account at 742 Evergreen Terrace, Springfield.
"""

results = analyzer.analyze(text=text, language="en")
for result in sorted(results, key=lambda r: r.start):
    print(f" {result.entity_type:20} | score={result.score:.2f} | "
          f"'{text[result.start:result.end]}'")

Illustrative output (exact entities, spans, and scores depend on the Presidio version and spaCy model; Presidio's validators may also reject obviously fake values such as the sample SSN):
 PERSON               | score=0.85 | 'Sarah Chen'
 EMAIL_ADDRESS        | score=1.00 | 'sarah.chen@example.com'
 US_SSN               | score=0.85 | '123-45-6789'
 CREDIT_CARD          | score=1.00 | '4111-1111-1111-1111'
 LOCATION             | score=0.85 | '742 Evergreen Terrace, Springfield'
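The checksum validators mentioned earlier are what separate a real card number from a random 16-digit string. As a rough illustration, here is a standalone Luhn check in plain Python. The function name `luhn_valid` and the 12-digit minimum are assumptions made for this sketch; it is not Presidio's own code.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 12:  # assumed lower bound for a payment card number
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_valid("4111-1111-1111-1111"))  # True: common test card number
print(luhn_valid("4532-1234-5678-9012"))  # False: arbitrary digits
```

A regex alone would accept both strings; the checksum is what lets a detector assign a high score to the first and drop the second.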
Custom recognizers for domain-specific PII
Standard recognizers miss industry-specific identifiers. Add custom pattern recognizers for internal ID formats, medical record numbers, or country-specific identifiers:
from presidio_analyzer import PatternRecognizer, Pattern

# Detect UK National Insurance Numbers (e.g., AB 12 34 56 C)
uk_nino_recognizer = PatternRecognizer(
    supported_entity="UK_NINO",
    name="UK National Insurance Number",
    patterns=[
        Pattern(
            name="uk_nino",
            regex=r"\b[A-CEGHJ-PR-TW-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b",
            score=0.7,
        ),
    ],
    # Context terms are compared against individual lowercased tokens,
    # so single lowercase words are the safer choice than phrases.
    context=["national", "insurance", "nino", "ni"],
)

# Detect internal employee IDs (e.g., EMP-2024-00142)
employee_id_recognizer = PatternRecognizer(
    supported_entity="EMPLOYEE_ID",
    name="Internal Employee ID",
    patterns=[
        Pattern(
            name="emp_id",
            regex=r"\bEMP-\d{4}-\d{5}\b",
            score=0.9,
        ),
    ],
)

# Register with the analyzer
analyzer.registry.add_recognizer(uk_nino_recognizer)
analyzer.registry.add_recognizer(employee_id_recognizer)
The context parameter boosts the confidence score when nearby words match: “NI number AB 12 34 56 C” gets a higher score than the same pattern appearing without context.
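To make the boosting idea concrete, here is a deliberately simplified sketch in plain Python. It is not how Presidio implements context enhancement: the function name `boost_with_context`, the five-token window, and the 0.35 boost are all assumptions chosen for illustration.

```python
import re


def boost_with_context(
    text: str,
    start: int,
    base_score: float,
    context_words: list[str],
    window: int = 5,      # assumed: how many preceding tokens to inspect
    boost: float = 0.35,  # assumed: score added when a context word is found
) -> float:
    """Raise `base_score` if a context word appears shortly before `start`."""
    preceding = re.findall(r"\w+", text[:start].lower())[-window:]
    if any(word.lower() in preceding for word in context_words):
        return min(1.0, base_score + boost)
    return base_score


with_context = "My NINO is AB 12 34 56 C"
without_context = "Ref AB 12 34 56 C"
words = ["nino", "insurance"]

print(boost_with_context(with_context, with_context.index("AB"), 0.7, words))
print(boost_with_context(without_context, without_context.index("AB"), 0.7, words))
```

The first call is capped at 1.0 and the second stays at the base score of 0.7, which is the behaviour the context list is meant to produce.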
Building a log sanitizer
Application logs are a major PII leak vector. Here’s a log filter that redacts PII from each record before the handler emits it:
import logging
import re

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig


class PIILogFilter(logging.Filter):
    """Filter that redacts PII from log messages."""

    def __init__(self, analyzer: AnalyzerEngine, anonymizer: AnonymizerEngine):
        super().__init__()
        self.analyzer = analyzer
        self.anonymizer = anonymizer
        self.operators = {
            "DEFAULT": OperatorConfig("replace", {"new_value": "[REDACTED]"}),
            "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": "[EMAIL]"}),
            "PHONE_NUMBER": OperatorConfig("replace", {"new_value": "[PHONE]"}),
            "CREDIT_CARD": OperatorConfig("replace", {"new_value": "[CARD]"}),
            "US_SSN": OperatorConfig("replace", {"new_value": "[SSN]"}),
            "PERSON": OperatorConfig("replace", {"new_value": "[NAME]"}),
        }
        # Fast pre-filter: skip messages unlikely to contain PII
        self._quick_patterns = re.compile(
            r"@|(?:\d[-.\s]?){9,}|\b[A-Z][a-z]+\s[A-Z][a-z]+\b"
        )

    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()
        # Quick check: if no PII-like patterns, skip expensive analysis
        if not self._quick_patterns.search(msg):
            return True
        results = self.analyzer.analyze(text=msg, language="en")
        if results:
            anonymized = self.anonymizer.anonymize(
                text=msg, analyzer_results=results, operators=self.operators
            )
            record.msg = anonymized.text
            record.args = None  # prevent re-formatting
        return True


# Setup
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
handler = logging.StreamHandler()
handler.addFilter(PIILogFilter(analyzer, anonymizer))
logger = logging.getLogger("app")
logger.setLevel(logging.INFO)  # the default level (WARNING) would drop info()
logger.addHandler(handler)

# This log message contains PII that gets redacted automatically.
# The card is a Luhn-valid test number; invalid ones are not tagged as cards.
logger.info("User Sarah Chen (sarah@example.com) reported issue with card 4111-1111-1111-1111")
# Expected output, roughly: "User [NAME] ([EMAIL]) reported issue with card [CARD]"
The quick regex pre-filter matters for performance: running Presidio on every log line in a high-throughput application would be expensive. The pre-filter skips messages that are obviously PII-free (plain status messages, metrics, etc.) and errs on the side of sending anything suspicious to the full analyzer.
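It is worth checking what the pre-filter actually lets through before relying on it. The sketch below reuses the same regular expression on a handful of made-up log messages; the helper name `needs_full_analysis` and the sample messages are assumptions for illustration.

```python
import re

# Same pattern as the pre-filter in PIILogFilter
QUICK_PII = re.compile(r"@|(?:\d[-.\s]?){9,}|\b[A-Z][a-z]+\s[A-Z][a-z]+\b")


def needs_full_analysis(message: str) -> bool:
    """Return True if the message should be passed to the full analyzer."""
    return QUICK_PII.search(message) is not None


samples = [
    "GET /health 200 in 12ms",
    "cache miss for key user:42",
    "User Sarah Chen logged in",
    "Password reset sent to sarah@example.com",
    "Job started at 2024-01-15 10:30:00",  # timestamp digits trigger the filter
    "Payment Service restarted",           # capitalized pair triggers the filter
]

for message in samples:
    print(f"{needs_full_analysis(message)!s:5} | {message}")
```

The last two messages show the cost of a permissive pre-filter: timestamps and capitalized service names still reach the analyzer. That is the safe direction to be wrong in, but it means the savings depend on how your application phrases its log messages.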
Database scanning for PII discovery
Discovering PII across database tables helps build a data map for GDPR compliance:
from dataclasses import dataclass

from presidio_analyzer import AnalyzerEngine
from sqlalchemy import inspect, text
from sqlalchemy.ext.asyncio import AsyncSession


@dataclass
class PIIFinding:
    table: str
    column: str
    row_id: str
    entity_type: str
    score: float
    sample: str  # redacted sample for context


class DatabasePIIScanner:
    def __init__(self, session: AsyncSession, analyzer: AnalyzerEngine):
        self.session = session
        self.analyzer = analyzer

    @staticmethod
    def _redact(value: str, results) -> str:
        """Mask every detected span so the stored sample holds no raw PII."""
        chars = list(value)
        for r in results:
            chars[r.start:r.end] = "*" * (r.end - r.start)
        masked = "".join(chars)
        return masked[:50] + "..." if len(masked) > 50 else masked

    async def scan_table(
        self, table_name: str, columns: list[str],
        id_column: str = "id", sample_limit: int = 10000,
    ) -> list[PIIFinding]:
        findings = []
        # Identifiers are interpolated into the SQL string, so table and
        # column names must come from a trusted source (e.g., the inspector).
        col_list = ", ".join([id_column] + columns)
        query = text(
            f"SELECT {col_list} FROM {table_name} LIMIT :limit"
        )
        result = await self.session.execute(query, {"limit": sample_limit})
        rows = result.fetchall()
        for row in rows:
            row_id = str(row[0])
            for i, col in enumerate(columns):
                value = row[i + 1]
                if not value or not isinstance(value, str):
                    continue
                if len(value) < 3:
                    continue
                results = self.analyzer.analyze(
                    text=value, language="en", score_threshold=0.7
                )
                if not results:
                    continue
                sample = self._redact(value, results)
                for r in results:
                    findings.append(PIIFinding(
                        table=table_name,
                        column=col,
                        row_id=row_id,
                        entity_type=r.entity_type,
                        score=r.score,
                        sample=sample,
                    ))
        return findings

    async def scan_all_text_columns(self) -> dict[str, list[PIIFinding]]:
        """Auto-discover and scan all text columns across all tables."""

        def text_columns_by_table(sync_conn) -> dict[str, list[str]]:
            # The inspector is sync-only, so it runs inside run_sync()
            insp = inspect(sync_conn)
            return {
                table_name: [
                    col["name"]
                    for col in insp.get_columns(table_name)
                    if str(col["type"]).startswith(("VARCHAR", "TEXT", "CHAR"))
                ]
                for table_name in insp.get_table_names()
            }

        conn = await self.session.connection()
        tables = await conn.run_sync(text_columns_by_table)

        all_findings = {}
        for table_name, text_columns in tables.items():
            if text_columns:
                findings = await self.scan_table(table_name, text_columns)
                if findings:
                    all_findings[table_name] = findings
        return all_findings
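Raw findings are one row per detection, which is too granular for a data map. One possible way to roll them up is to count entity types per table and column, sketched below. The `Finding` class is a hypothetical stand-in that carries only the three `PIIFinding` fields this step needs, and `build_data_map` is an assumed helper name.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class Finding:
    """Hypothetical stand-in for PIIFinding with only the fields used here."""
    table: str
    column: str
    entity_type: str


def build_data_map(findings: list[Finding]) -> dict[tuple[str, str], dict[str, int]]:
    """Count detected entity types per (table, column) pair."""
    data_map: dict[tuple[str, str], Counter] = defaultdict(Counter)
    for finding in findings:
        data_map[(finding.table, finding.column)][finding.entity_type] += 1
    return {key: dict(counts) for key, counts in data_map.items()}


findings = [
    Finding("users", "email", "EMAIL_ADDRESS"),
    Finding("users", "email", "EMAIL_ADDRESS"),
    Finding("tickets", "body", "PERSON"),
    Finding("tickets", "body", "EMAIL_ADDRESS"),
]

for (table, column), counts in build_data_map(findings).items():
    print(f"{table}.{column}: {counts}")
```

A summary like this answers the question a data map needs answered ("which columns hold which kinds of personal data") without copying any values out of the database.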
Streaming PII detection for APIs
For real-time APIs, run lightweight detection on request/response bodies:
import json
import logging

from fastapi import FastAPI, Request, Response
from presidio_analyzer import AnalyzerEngine
from starlette.concurrency import run_in_threadpool
from starlette.middleware.base import BaseHTTPMiddleware

logger = logging.getLogger(__name__)


class PIIDetectionMiddleware(BaseHTTPMiddleware):
    """Middleware that flags or blocks responses containing PII."""

    def __init__(self, app, analyzer: AnalyzerEngine, mode: str = "log"):
        super().__init__(app)
        self.analyzer = analyzer
        self.mode = mode  # "log" or "block" ("redact" is not implemented here)

    async def dispatch(self, request: Request, call_next):
        response = await call_next(request)
        # Only scan JSON responses
        content_type = response.headers.get("content-type", "")
        if "application/json" not in content_type:
            return response

        # Buffer the streamed body so it can be scanned and re-sent
        body = b""
        async for chunk in response.body_iterator:
            body += chunk
        text_body = body.decode("utf-8", errors="ignore")

        # analyze() is CPU-bound, so keep it off the event loop
        results = await run_in_threadpool(
            self.analyzer.analyze,
            text=text_body, language="en", score_threshold=0.8,
        )
        if results:
            entity_types = {r.entity_type for r in results}
            if self.mode == "log":
                logger.warning(
                    f"PII detected in response: {entity_types}",
                    extra={"path": request.url.path},
                )
            elif self.mode == "block":
                return Response(
                    content=json.dumps({"error": "Response contains PII"}),
                    status_code=500,
                    media_type="application/json",
                )

        # The original body iterator is consumed, so rebuild the response
        return Response(
            content=body,
            status_code=response.status_code,
            headers=dict(response.headers),
            media_type=response.media_type,
        )


# Setup
app = FastAPI()
app.add_middleware(PIIDetectionMiddleware, analyzer=AnalyzerEngine(), mode="log")
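Scanning the serialized JSON as one string means keys, punctuation, and numbers all go through the analyzer. A possible refinement is to parse the body and scan only string values, keeping a path for each so a finding can be traced to a field. The sketch below uses only the standard library; `iter_string_values` and the `$.a.b[0]` path notation are assumptions for illustration.

```python
import json
from typing import Any, Iterator


def iter_string_values(node: Any, path: str = "$") -> Iterator[tuple[str, str]]:
    """Yield (path, value) for every string value in a parsed JSON document."""
    if isinstance(node, str):
        yield path, node
    elif isinstance(node, dict):
        for key, value in node.items():
            yield from iter_string_values(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_string_values(value, f"{path}[{index}]")


body = b'{"user": {"name": "Sarah Chen", "age": 41}, "tags": ["vip", "beta"]}'
fields = list(iter_string_values(json.loads(body)))

for field_path, value in fields:
    print(f"{field_path}: {value!r}")
```

Each yielded value could then be passed to the analyzer on its own, and a warning could name the offending field (`$.user.name`) instead of only the entity type.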
Performance optimization
Presidio with spaCy NER models processes roughly 1,000-5,000 characters per second on a single CPU core. The figure varies widely with model size and text length, so benchmark on your own data. For high-throughput scenarios:
Batch processing: Analyze multiple texts in a single call rather than one at a time. Presidio provides a BatchAnalyzerEngine that accepts lists and dictionaries of texts.
Model selection: en_core_web_sm (about 15 MB) is roughly 3x faster than en_core_web_lg (about 560 MB) with somewhat lower accuracy on named entities; exact sizes and speedups vary by spaCy version. Use the smaller model for high-volume scanning and the larger model for thorough audits.
Pre-filtering: Use fast regex checks before invoking NER models. If a text block contains no @, no digit sequences, and no capitalized word pairs, it’s unlikely to contain PII.
Column-level classification: For databases, scan a sample of rows to classify columns (e.g., “this column contains emails 98% of the time”). Then apply targeted, fast pattern matching instead of full analysis on every row.
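The column-level classification idea can be sketched with nothing more than a regex and a ratio. In the example below, `classify_column`, the simplified `EMAIL_RE` pattern, and the 0.9 threshold are all assumptions; a real classifier would cover more entity types and use a sample drawn from the database.

```python
import re

# Simplified email pattern, good enough to classify a column
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(?:\.[\w-]+)+$")


def classify_column(
    values: list, pattern: re.Pattern, threshold: float = 0.9
) -> tuple[bool, float]:
    """Return (is_match, ratio) for the share of non-empty values matching."""
    non_empty = [v.strip() for v in values if isinstance(v, str) and v.strip()]
    if not non_empty:
        return False, 0.0
    hits = sum(1 for v in non_empty if pattern.match(v))
    ratio = hits / len(non_empty)
    return ratio >= threshold, ratio


email_column = [f"user{i}@example.com" for i in range(19)] + ["n/a"]
notes_column = ["called twice", "see a@example.com", "", None, "resolved"]

print(classify_column(email_column, EMAIL_RE))
print(classify_column(notes_column, EMAIL_RE))
```

Columns that clear the threshold can be handled with the cheap pattern on every row. Mixed columns such as free-text notes fall below it and still need full analysis.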
Tradeoffs
Speed vs. accuracy: Regex-only detection can be orders of magnitude faster but misses names and context-dependent PII. NER-based detection catches more but adds latency. Most systems use a tiered approach: fast regex for real-time paths, full NER for batch jobs.
False positives vs. false negatives: In healthcare or finance, missing PII (false negative) can mean regulatory violations. In a blog platform, excessive false positives (flagging “Dr. Who” as a person) create unnecessary friction. Tune thresholds per domain.
Multi-language support: PII patterns are language-dependent. Phone numbers, national IDs, and address formats vary by country. Presidio supports multiple languages, but each one needs its own NLP model and configured recognizers. A global system needs recognizer sets per locale.
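The tiered approach from the first tradeoff can be expressed as a function that always runs cheap patterns and only calls a slower detector when one is supplied. This is a minimal sketch under stated assumptions: the two regexes are simplified, `detect` is an assumed name, and `stub_name_tier` is a hypothetical stand-in for an NER call such as Presidio's analyzer.

```python
import re
from typing import Callable, Optional

Detection = tuple[str, int, int]  # (entity_type, start, end)

# Tier 1: simplified patterns for structured identifiers
STRUCTURED_PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def detect(
    text: str,
    deep_tier: Optional[Callable[[str], list[Detection]]] = None,
) -> list[Detection]:
    """Run the regex tier always; run the deep tier only if one is given."""
    found = [
        (entity, match.start(), match.end())
        for entity, pattern in STRUCTURED_PATTERNS.items()
        for match in pattern.finditer(text)
    ]
    if deep_tier is not None:
        found.extend(deep_tier(text))
    return sorted(found, key=lambda d: d[1])


def stub_name_tier(text: str) -> list[Detection]:
    """Hypothetical stand-in for an NER model; it finds one fixed name."""
    index = text.find("Sarah Chen")
    return [("PERSON", index, index + len("Sarah Chen"))] if index >= 0 else []


message = "Sarah Chen <sarah@example.com>, SSN 123-45-6789"
print([d[0] for d in detect(message)])                  # real-time path
print([d[0] for d in detect(message, stub_name_tier)])  # batch path
```

The real-time path returns only the structured identifiers, while the batch path also picks up the name. Per-domain tuning then becomes a matter of deciding which paths get the deep tier and which score thresholds apply to its results.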
The one thing to remember: Production PII detection layers fast regex pre-filters for structured patterns with NER models for names and context-dependent entities, runs continuously across logs, databases, and API responses, and must be tuned per domain to balance false positives against missed detections.