PII Detection in Python — Deep Dive

Build a production PII detection pipeline in Python using Microsoft Presidio, custom regex recognizers, spaCy NER models, and automated redaction for logs and databases

Microsoft Presidio: the standard toolkit

Presidio is an open-source PII detection and anonymization framework from Microsoft. It combines regex patterns, NER models, and checksum validators into a configurable pipeline.

# pip install presidio-analyzer presidio-anonymizer
# python -m spacy download en_core_web_lg

from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()

text = """
Dear Support,
My name is Sarah Chen and my email is sarah.chen@example.com.
My SSN is 123-45-6789 and my credit card is 4532-1234-5678-9012.
Please update my account at 742 Evergreen Terrace, Springfield.
"""

results = analyzer.analyze(text=text, language="en")

for result in sorted(results, key=lambda r: r.start):
    print(f"  {result.entity_type:20} | score={result.score:.2f} | "
          f"'{text[result.start:result.end]}'")

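Illustrative output (exact entities and scores depend on the Presidio version and spaCy model, and Presidio's validators reject some placeholder values, such as card numbers that fail the Luhn check):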

  PERSON               | score=0.85 | 'Sarah Chen'
  EMAIL_ADDRESS        | score=1.00 | 'sarah.chen@example.com'
  US_SSN               | score=0.85 | '123-45-6789'
  CREDIT_CARD          | score=1.00 | '4532-1234-5678-9012'
  LOCATION             | score=0.85 | '742 Evergreen Terrace, Springfield'
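The checksum validators mentioned earlier matter when choosing sample data: a card-shaped number only counts as a credit card if it passes the Luhn check. The sketch below is a standalone illustration of that check, not Presidio's own code, and `luhn_valid` is a hypothetical helper name.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 12:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(luhn_valid("4111-1111-1111-1111"))  # True: widely used test number
print(luhn_valid("4532-1234-5678-9012"))  # False: checksum fails
```

A recognizer backed by this kind of validator will typically discard a placeholder that fails the check, so examples and unit tests are more reliable with Luhn-valid test numbers.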

Custom recognizers for domain-specific PII

Standard recognizers miss industry-specific identifiers. Add custom pattern recognizers for internal ID formats, medical record numbers, or country-specific identifiers:

from presidio_analyzer import PatternRecognizer, Pattern

# Detect UK National Insurance Numbers (e.g., AB 12 34 56 C)
uk_nino_recognizer = PatternRecognizer(
    supported_entity="UK_NINO",
    name="UK National Insurance Number",
    patterns=[
        Pattern(
            name="uk_nino",
            regex=r"\b[A-CEGHJ-PR-TW-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b",
            score=0.7,
        ),
    ],
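    # Context terms are compared against lowercased lemmas of nearby tokens,
    # so lowercase single-word terms are the safest choice.
    context=["national", "insurance", "nino"],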
)

# Detect internal employee IDs (e.g., EMP-2024-00142)
employee_id_recognizer = PatternRecognizer(
    supported_entity="EMPLOYEE_ID",
    name="Internal Employee ID",
    patterns=[
        Pattern(
            name="emp_id",
            regex=r"\bEMP-\d{4}-\d{5}\b",
            score=0.9,
        ),
    ],
)

# Register with the analyzer
analyzer.registry.add_recognizer(uk_nino_recognizer)
analyzer.registry.add_recognizer(employee_id_recognizer)

The context parameter boosts the confidence score when words near the match appear in the context list: a NINO-shaped string that follows a matching term gets a higher score than the same pattern appearing without context.
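To make the effect of context concrete, here is a toy model of the idea. It is not Presidio's actual enhancer, which works on lemmas produced by the NLP engine; the `boost` value, the `window` size, and the `score_matches` helper are assumptions chosen for illustration.

```python
import re

NINO_RE = re.compile(r"\b[A-CEGHJ-PR-TW-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b")
CONTEXT_WORDS = {"national", "insurance", "nino"}  # illustrative list


def score_matches(text: str, base: float = 0.7, boost: float = 0.25,
                  window: int = 5) -> list[tuple[str, float]]:
    """Score each regex match, boosting it when a context word precedes it."""
    scored = []
    for m in NINO_RE.finditer(text):
        # Look at up to `window` words immediately before the match
        preceding = re.findall(r"[a-z]+", text[:m.start()].lower())[-window:]
        hit = any(word in CONTEXT_WORDS for word in preceding)
        scored.append((m.group(), min(1.0, base + boost) if hit else base))
    return scored


print(score_matches("My national insurance number is AB 12 34 56 C"))
print(score_matches("Reference AB 12 34 56 C on the invoice"))
```

The first call returns the boosted score because "national" and "insurance" fall inside the window; the second keeps the base score.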

Building a log sanitizer

Application logs are a major PII leak vector. Here’s a log filter that redacts each record before the handler it is attached to emits it:

import logging
import re
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

class PIILogFilter(logging.Filter):
    """Filter that redacts PII from log messages."""
    
    def __init__(self, analyzer: AnalyzerEngine, anonymizer: AnonymizerEngine):
        super().__init__()
        self.analyzer = analyzer
        self.anonymizer = anonymizer
        self.operators = {
            "DEFAULT": OperatorConfig("replace", {"new_value": "[REDACTED]"}),
            "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": "[EMAIL]"}),
            "PHONE_NUMBER": OperatorConfig("replace", {"new_value": "[PHONE]"}),
            "CREDIT_CARD": OperatorConfig("replace", {"new_value": "[CARD]"}),
            "US_SSN": OperatorConfig("replace", {"new_value": "[SSN]"}),
            "PERSON": OperatorConfig("replace", {"new_value": "[NAME]"}),
        }
        # Fast pre-filter: skip messages unlikely to contain PII
        self._quick_patterns = re.compile(
            r"@|(?:\d[-.\s]?){9,}|\b[A-Z][a-z]+\s[A-Z][a-z]+\b"
        )
    
    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()
        
        # Quick check: if no PII-like patterns, skip expensive analysis
        if not self._quick_patterns.search(msg):
            return True
        
        results = self.analyzer.analyze(text=msg, language="en")
        if results:
            anonymized = self.anonymizer.anonymize(
                text=msg, analyzer_results=results, operators=self.operators
            )
            record.msg = anonymized.text
            record.args = None  # prevent re-formatting
        return True

# Setup
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

handler = logging.StreamHandler()
handler.addFilter(PIILogFilter(analyzer, anonymizer))

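logger = logging.getLogger("app")
logger.setLevel(logging.INFO)  # the default effective level (WARNING) would drop info()
logger.addHandler(handler)

# This log message contains PII that gets redacted automatically
# (the card number is a Luhn-valid test value, so the checksum validator accepts it)
logger.info("User Sarah Chen (sarah@example.com) reported issue with card 4111-1111-1111-1111")
# Expected output (exact spans depend on the NER model):
# "User [NAME] ([EMAIL]) reported issue with card [CARD]"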

The quick regex pre-filter matters for performance: running Presidio on every log line in a high-throughput application would be expensive. The pre-filter skips messages that are very unlikely to contain PII (plain status messages, metrics, etc.), at the cost of missing PII that matches none of its patterns, such as a name written entirely in lowercase.
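A quick way to see what the pre-filter lets through is to run its pattern over a few representative messages. The sample messages below are made up for illustration; the pattern is the one used in PIILogFilter.

```python
import re

# Same pattern as PIILogFilter._quick_patterns
QUICK = re.compile(r"@|(?:\d[-.\s]?){9,}|\b[A-Z][a-z]+\s[A-Z][a-z]+\b")

samples = [
    "cache hit ratio 0.93",
    "GET /health 200 in 12ms",
    "reset link sent to sarah@example.com",
    "callback requested at 202-555-0143 ext 2",
    "Connection Refused by upstream",
    "user sarah chen logged in",
]

for msg in samples:
    verdict = "analyze" if QUICK.search(msg) else "skip"
    print(f"{verdict:8} | {msg}")
```

The first two messages are skipped and the next two go to full analysis. "Connection Refused" also triggers analysis because it looks like a capitalized word pair, which only costs time. The last message is skipped even though it contains a name, which is the kind of miss to weigh when tuning the pattern.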

Database scanning for PII discovery

Discovering PII across database tables helps build a data map for GDPR compliance:

from dataclasses import dataclass
from presidio_analyzer import AnalyzerEngine
from sqlalchemy import text, inspect
from sqlalchemy.ext.asyncio import AsyncSession

@dataclass
class PIIFinding:
    table: str
    column: str
    row_id: str
    entity_type: str
    score: float
    sample: str  # redacted sample for context

class DatabasePIIScanner:
    def __init__(self, session: AsyncSession, analyzer: AnalyzerEngine):
        self.session = session
        self.analyzer = analyzer
        self.batch_size = 500
    
    async def scan_table(
        self, table_name: str, columns: list[str],
        id_column: str = "id", sample_limit: int = 10000,
    ) -> list[PIIFinding]:
        findings = []
        col_list = ", ".join([id_column] + columns)
        
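        # Identifiers cannot be bound as parameters, so table and column names
        # must come from a trusted source (e.g. schema inspection), never user input.
        query = text(
            f"SELECT {col_list} FROM {table_name} LIMIT :limit"
        )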
        result = await self.session.execute(query, {"limit": sample_limit})
        rows = result.fetchall()
        
        for row in rows:
            row_id = str(row[0])
            for i, col in enumerate(columns):
                value = row[i + 1]
                if not value or not isinstance(value, str):
                    continue
                if len(value) < 3:
                    continue
                
                results = self.analyzer.analyze(
                    text=value, language="en", score_threshold=0.7
                )
                
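                # Mask every detected span so the finding never stores raw PII
                chars = list(value)
                for r in results:
                    chars[r.start:r.end] = "*" * (r.end - r.start)
                masked = "".join(chars)
                sample = masked[:50] + "..." if len(masked) > 50 else masked

                for r in results:
                    findings.append(PIIFinding(
                        table=table_name,
                        column=col,
                        row_id=row_id,
                        entity_type=r.entity_type,
                        score=r.score,
                        sample=sample,
                    ))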
        
        return findings
    
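    async def scan_all_text_columns(self) -> dict[str, list[PIIFinding]]:
        """Auto-discover and scan all text columns across all tables."""

        def list_text_columns(sync_conn) -> dict[str, list[str]]:
            # SQLAlchemy's inspector is synchronous, so it runs via run_sync()
            insp = inspect(sync_conn)
            return {
                table_name: [
                    col["name"]
                    for col in insp.get_columns(table_name)
                    if str(col["type"]).startswith(("VARCHAR", "TEXT", "CHAR"))
                ]
                for table_name in insp.get_table_names()
            }

        # inspect() does not accept an AsyncEngine or AsyncConnection directly
        conn = await self.session.connection()
        columns_by_table = await conn.run_sync(list_text_columns)

        all_findings = {}
        for table_name, text_columns in columns_by_table.items():
            if text_columns:
                findings = await self.scan_table(table_name, text_columns)
                if findings:
                    all_findings[table_name] = findings

        return all_findings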

Streaming PII detection for APIs

For real-time APIs, run lightweight detection on request/response bodies:

import json
import logging

from fastapi import FastAPI, Request, Response
from presidio_analyzer import AnalyzerEngine
from starlette.middleware.base import BaseHTTPMiddleware

logger = logging.getLogger(__name__)

class PIIDetectionMiddleware(BaseHTTPMiddleware):
    """Middleware that flags or blocks responses containing PII."""
    
    def __init__(self, app, analyzer: AnalyzerEngine, mode: str = "log"):
        super().__init__(app)
        self.analyzer = analyzer
        # "log" or "block"; a "redact" mode would also need an AnonymizerEngine
        # and is not implemented in this example
        self.mode = mode
    
    async def dispatch(self, request: Request, call_next):
        response = await call_next(request)
        
        # Only scan JSON responses
        content_type = response.headers.get("content-type", "")
        if "application/json" not in content_type:
            return response
        
        body = b""
        async for chunk in response.body_iterator:
            body += chunk
        
        text_body = body.decode("utf-8", errors="ignore")
        results = self.analyzer.analyze(
            text=text_body, language="en", score_threshold=0.8
        )
        
        if results:
            entity_types = {r.entity_type for r in results}
            
            if self.mode == "log":
                logger.warning(
                    f"PII detected in response: {entity_types}",
                    extra={"path": request.url.path},
                )
            elif self.mode == "block":
                return Response(
                    content=json.dumps({"error": "Response contains PII"}),
                    status_code=500,
                    media_type="application/json",
                )
        
        return Response(
            content=body,
            status_code=response.status_code,
            headers=dict(response.headers),
            media_type=response.media_type,
        )

Performance optimization

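Throughput for Presidio with spaCy NER models depends heavily on the model, text length, and hardware. The figure of roughly 1,000-5,000 characters per millisecond on modern hardware is a rough estimate, so benchmark on your own data. For high-throughput scenarios: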

Batch processing: Analyze multiple texts in a single call rather than one at a time. Presidio provides a BatchAnalyzerEngine for this, which wraps an AnalyzerEngine and accepts an iterable of texts.

Model selection: en_core_web_sm (roughly 15 MB) is about 3x faster than en_core_web_lg (roughly 560 MB), with somewhat lower accuracy on named entities; exact sizes and speeds vary by spaCy version. Use the smaller model for high-volume scanning and the larger model for thorough audits.

Pre-filtering: Use fast regex checks before invoking NER models. If a text block contains no @, no digit sequences, and no capitalized word pairs, it’s unlikely to contain PII.

Column-level classification: For databases, scan a sample of rows to classify columns (e.g., “this column contains emails 98% of the time”). Then apply targeted, fast pattern matching instead of full analysis on every row.
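Column-level classification can be sketched with nothing more than regular expressions and a match ratio. The patterns, the `min_ratio` default, and the `classify_column` helper below are hypothetical and deliberately simple; a real system would likely reuse its recognizer patterns.

```python
import re
from typing import Optional

# Hypothetical, deliberately simple patterns for the fast path
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "US_SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def classify_column(values: list[str], min_ratio: float = 0.9) -> Optional[str]:
    """Return the entity type matching at least `min_ratio` of non-empty values."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return None
    for entity_type, pattern in PATTERNS.items():
        hits = sum(1 for v in non_empty if pattern.match(v))
        if hits / len(non_empty) >= min_ratio:
            return entity_type
    return None


sample = ["a@example.com", "b@example.org", "", "c@example.net", "n/a"]
print(classify_column(sample, min_ratio=0.7))  # EMAIL_ADDRESS (3 of 4 non-empty values)
print(classify_column(["hello", "world"]))     # None
```

Once a column is classified this way, later scans can apply only the matching pattern to each row and reserve full NER analysis for free-text columns that return None.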

Tradeoffs

Speed vs. accuracy: Regex-only detection can be orders of magnitude faster but misses names and context-dependent PII. NER-based detection catches more but adds latency. A common compromise is a tiered approach: fast regex for real-time paths, full NER for batch jobs.
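One way to structure the tiers is to make the NER step an optional callable. This is a minimal sketch under that assumption: `regex_scan`, `tiered_scan`, and the `fake_ner` stand-in are hypothetical names, and in practice the second tier would wrap a Presidio or spaCy call.

```python
import re
from typing import Callable, Optional

Finding = tuple[str, int, int]  # (entity_type, start, end)

REGEX_TIER = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def regex_scan(text: str) -> list[Finding]:
    """Tier 1: cheap structured patterns only."""
    return [
        (entity, m.start(), m.end())
        for entity, pattern in REGEX_TIER.items()
        for m in pattern.finditer(text)
    ]


def tiered_scan(
    text: str, ner_scan: Optional[Callable[[str], list[Finding]]] = None
) -> list[Finding]:
    """Always run tier 1; add tier 2 findings when an NER callable is supplied."""
    findings = regex_scan(text)
    if ner_scan is not None:
        findings.extend(ner_scan(text))
    return sorted(findings, key=lambda f: f[1])


# Stand-in for a real NER call, used only to keep the sketch runnable
def fake_ner(text: str) -> list[Finding]:
    start = text.find("Sarah Chen")
    return [("PERSON", start, start + len("Sarah Chen"))] if start >= 0 else []


msg = "Sarah Chen wrote from sarah.chen@example.com"
print(tiered_scan(msg))            # real-time path: regex only
print(tiered_scan(msg, fake_ner))  # batch path: regex plus NER
```

The real-time path finds only the email address, while the batch path also reports the name.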

False positives vs. false negatives: In healthcare or finance, missing PII (false negative) can mean regulatory violations. In a blog platform, excessive false positives (flagging “Dr. Who” as a person) create unnecessary friction. Tune thresholds per domain.
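Per-domain tuning can be as simple as a threshold table. The values below are hypothetical and would need to be calibrated against labeled data; with Presidio, the chosen value could be passed as the score_threshold argument to analyze.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    entity_type: str
    score: float


# Hypothetical thresholds: a lower value keeps more findings (fewer misses)
DOMAIN_THRESHOLDS = {
    "healthcare": 0.4,
    "finance": 0.5,
    "blog": 0.85,
}


def keep_detections(detections: list[Detection], domain: str,
                    default: float = 0.7) -> list[Detection]:
    """Keep detections whose score meets the threshold for `domain`."""
    threshold = DOMAIN_THRESHOLDS.get(domain, default)
    return [d for d in detections if d.score >= threshold]


found = [Detection("PERSON", 0.6), Detection("EMAIL_ADDRESS", 1.0)]
print([d.entity_type for d in keep_detections(found, "healthcare")])  # ['PERSON', 'EMAIL_ADDRESS']
print([d.entity_type for d in keep_detections(found, "blog")])        # ['EMAIL_ADDRESS']
```

The healthcare setting accepts the low-confidence name to avoid a miss, while the blog setting drops it to avoid a false positive.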

Multi-language support: PII patterns are language-dependent. Phone numbers, national IDs, and address formats vary by country. Presidio supports multiple languages, but each one needs its own NLP model and configured recognizers. A global system needs recognizer sets per locale.

The one thing to remember: Production PII detection layers fast regex pre-filters for structured patterns with NER models for names and context-dependent entities, runs continuously across logs, databases, and API responses, and must be tuned per domain to balance false positives against missed detections.
