Guardrails for AI in Python — Deep Dive

Build layered AI guardrail systems in Python with prompt injection detection, PII redaction, schema enforcement with retries, content classifiers, and production monitoring for guardrail effectiveness.

Production AI applications need multiple guardrail layers working together. Each layer catches a different class of failure, and the system degrades gracefully when individual checks miss something. This guide covers how to architect these layers in Python.

1) Input guardrail pipeline

Process user input through a sequence of checks before it reaches the LLM:

from dataclasses import dataclass
from enum import Enum

class InputVerdict(Enum):
    PASS = "pass"
    BLOCK = "block"
    MODIFY = "modify"

@dataclass
class InputCheckResult:
    verdict: InputVerdict
    reason: str
    modified_input: str | None = None

class InputGuardrail:
    def __init__(self):
        self.checks = []

    def add_check(self, check_fn):
        self.checks.append(check_fn)
        return self

    def run(self, user_input: str) -> InputCheckResult:
        current_input = user_input
        for check in self.checks:
            result = check(current_input)
            if result.verdict == InputVerdict.BLOCK:
                return result
            if result.verdict == InputVerdict.MODIFY and result.modified_input is not None:
                current_input = result.modified_input
        return InputCheckResult(verdict=InputVerdict.PASS, reason="All checks passed",
                                modified_input=current_input)

Prompt injection detection

Prompt injection is the top input-side threat. Detect it with a combination of heuristics and classification:

import re

INJECTION_PATTERNS = [
    r"ignore\s+(all\s+)?previous\s+instructions",
    r"you\s+are\s+now\s+",
    r"system\s*:\s*",
    r"<\|?(system|im_start)\|?>",
    r"pretend\s+you\s+are",
    r"roleplay\s+as",
]

def check_prompt_injection(user_input: str) -> InputCheckResult:
    input_lower = user_input.lower()
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, input_lower):
            return InputCheckResult(
                verdict=InputVerdict.BLOCK,
                reason=f"Potential prompt injection detected: {pattern}",
            )
    return InputCheckResult(verdict=InputVerdict.PASS, reason="No injection detected")

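For production systems, supplement regex with a fine-tuned classifier. Models like deberta-v3-base fine-tuned on prompt injection datasets can achieve 95%+ detection rates with few false positives, but results depend on the dataset, so evaluate both detection and false-positive rates on your own traffic before relying on them.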
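
One way to combine the two signals is sketched below. It assumes a classifier is available as a callable that returns an injection probability; `score_fn`, `HIGH_CONFIDENCE_PATTERNS`, the "review" outcome, and both thresholds are hypothetical names and values chosen for illustration, not part of any library.

```python
import re
from typing import Callable

# Hypothetical: a short list of patterns treated as strong signals on their own.
HIGH_CONFIDENCE_PATTERNS = [
    r"ignore\s+(all\s+)?previous\s+instructions",
    r"<\|?(system|im_start)\|?>",
]

def combined_injection_check(
    text: str,
    score_fn: Callable[[str], float],
    block_threshold: float = 0.9,   # assumed value; tune on labeled traffic
    review_threshold: float = 0.5,  # assumed value; tune on labeled traffic
) -> str:
    """Return "block", "review", or "pass" for a user input."""
    lowered = text.lower()
    # Cheap regex pass first: a strong pattern match blocks without calling the model.
    if any(re.search(p, lowered) for p in HIGH_CONFIDENCE_PATTERNS):
        return "block"
    # Otherwise defer to the classifier score (probability of injection).
    score = score_fn(text)
    if score >= block_threshold:
        return "block"
    if score >= review_threshold:
        return "review"
    return "pass"

# Stub scorer standing in for a fine-tuned model; replace with a real classifier.
def stub_score(text: str) -> float:
    return 0.95 if "reveal your hidden prompt" in text.lower() else 0.05
```

Keeping the scorer as an injected callable lets the decision logic be unit-tested with a stub, as above, while the real model is swapped in at deployment.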

PII redaction

import re

PII_PATTERNS = {
    # Note: "|" inside a character class is a literal pipe, so the TLD class is [A-Za-z].
    "email": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
    "phone": r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b",
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "credit_card": r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b",
}

def redact_pii(user_input: str) -> InputCheckResult:
    modified = user_input
    found = []
    for pii_type, pattern in PII_PATTERNS.items():
        if re.search(pattern, modified):
            modified = re.sub(pattern, f"[REDACTED_{pii_type.upper()}]", modified)
            found.append(pii_type)
    if found:
        return InputCheckResult(
            verdict=InputVerdict.MODIFY,
            reason=f"Redacted PII: {', '.join(found)}",
            modified_input=modified,
        )
    return InputCheckResult(verdict=InputVerdict.PASS, reason="No PII found")
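
To see what this kind of substitution produces, here is a small standalone check on a made-up string. It uses two simplified patterns and a hypothetical helper, `redact`; the sample text and contact details are invented for illustration.

```python
import re

# Simplified patterns for illustration; real-world PII formats vary widely.
SAMPLE_PATTERNS = {
    "email": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
    "phone": r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b",
}

def redact(text: str) -> tuple[str, list[str]]:
    """Replace every match with a typed placeholder and report which types were hit."""
    found = []
    for pii_type, pattern in SAMPLE_PATTERNS.items():
        # re.subn returns the new string and the number of substitutions made.
        text, count = re.subn(pattern, f"[REDACTED_{pii_type.upper()}]", text)
        if count:
            found.append(pii_type)
    return text, found

redacted, kinds = redact("Reach me at jane@example.com or 555-123-4567.")
print(redacted)  # Reach me at [REDACTED_EMAIL] or [REDACTED_PHONE].
print(kinds)     # ['email', 'phone']
```

Patterns like these miss names, street addresses, and international phone formats, which is why they are best treated as a first pass.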

For production, use a specialized library such as Presidio (Microsoft), which handles entity recognition across languages and supports custom recognizers.

2) Output schema enforcement with retries

Use Pydantic to enforce structure and retry on failure:

import re

from pydantic import BaseModel, field_validator
from openai import OpenAI

client = OpenAI()

class ProductRecommendation(BaseModel):
    product_name: str
    reason: str
    confidence: float
    price_range: str

    @field_validator("confidence")
    @classmethod
    def confidence_range(cls, v):
        if not 0 <= v <= 1:
            raise ValueError("Confidence must be between 0 and 1")
        return v

    @field_validator("price_range")
    @classmethod
    def valid_price_range(cls, v):
        if not re.match(r"\$\d+-\$\d+", v):
            raise ValueError("Price range must be like '$10-$50'")
        return v

def get_recommendation(query: str, max_retries: int = 3) -> ProductRecommendation:
    messages = [
        {"role": "system", "content": "Recommend a product. Return JSON with: product_name, reason, confidence (0-1), price_range ($X-$Y)."},
        {"role": "user", "content": query},
    ]

    for attempt in range(max_retries):
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            response_format={"type": "json_object"},
        )
        raw = resp.choices[0].message.content
        try:
            return ProductRecommendation.model_validate_json(raw)
        except ValueError as e:  # pydantic's ValidationError is a ValueError subclass
            messages.append({"role": "assistant", "content": raw})
            messages.append({"role": "user", "content": f"Output validation failed: {e}. Fix and return valid JSON."})

    raise ValueError(f"Failed to get valid output after {max_retries} attempts")
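
The retry loop can be exercised without network calls by injecting the model call. The sketch below is stdlib-only and uses hypothetical names (`generate_with_retries`, `fake_llm`, `validate_recommendation`). It hand-rolls a validator in place of the Pydantic model so that it runs anywhere, and it is meant to show the feedback mechanism rather than replace the function above.

```python
import json
from typing import Callable

def validate_recommendation(raw: str) -> dict:
    """Hand-rolled stand-in for a schema validator; raises ValueError on bad output."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("confidence"), (int, float)):
        raise ValueError("confidence must be a number")
    if not 0 <= data["confidence"] <= 1:
        raise ValueError("confidence must be between 0 and 1")
    return data

def generate_with_retries(
    llm: Callable[[list[dict]], str],
    messages: list[dict],
    max_attempts: int = 3,
) -> dict:
    history = list(messages)  # copy so the caller's list is not mutated
    for _ in range(max_attempts):
        raw = llm(history)
        try:
            return validate_recommendation(raw)
        except ValueError as exc:
            # Feed the failed output and the error back so the next attempt can correct it.
            history.append({"role": "assistant", "content": raw})
            history.append({"role": "user", "content": f"Output validation failed: {exc}. Fix and return valid JSON."})
    raise ValueError(f"No valid output after {max_attempts} attempts")

# Fake model: returns an out-of-range confidence first, then a valid answer.
calls = []
def fake_llm(history: list[dict]) -> str:
    calls.append(len(history))  # record how many messages each call received
    if len(calls) == 1:
        return '{"product_name": "Kettle", "confidence": 1.7}'
    return '{"product_name": "Kettle", "confidence": 0.8}'

result = generate_with_retries(fake_llm, [{"role": "user", "content": "Recommend a kettle"}])
print(result["confidence"])  # 0.8
print(calls)                 # [1, 3]
```

The second call sees three messages: the original query, the rejected output, and the validation error. That is the context the model needs to correct itself.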

3) Content safety classification

Use a dedicated model for toxicity and safety classification:

from transformers import pipeline

safety_classifier = pipeline(
    "text-classification",
    model="unitary/toxic-bert",
    device=0,  # GPU
)

SAFETY_THRESHOLD = 0.7

def check_content_safety(output: str) -> tuple[bool, float]:
    result = safety_classifier(output[:512])[0]
    is_toxic = result["label"] == "toxic" and result["score"] > SAFETY_THRESHOLD
    return not is_toxic, result["score"]
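
The check above looks only at the top-scoring label. If the classifier is run so that it returns a score for every label (for transformers text-classification pipelines this is typically done by passing `top_k=None`), a stricter variant can flag on any label. The helper below is a sketch that operates on already-computed scores; `flagged_labels` and the sample scores are made up for illustration, and the label names are assumed to match the model's label set.

```python
SAFETY_THRESHOLD = 0.7

def flagged_labels(scores: list[dict], threshold: float = SAFETY_THRESHOLD) -> dict[str, float]:
    """Return every label whose score exceeds the threshold."""
    return {item["label"]: item["score"] for item in scores if item["score"] > threshold}

# Invented scores in the list-of-dicts shape ({"label", "score"}) used by the pipeline.
sample_scores = [
    {"label": "toxic", "score": 0.62},
    {"label": "insult", "score": 0.81},
    {"label": "threat", "score": 0.03},
]

flags = flagged_labels(sample_scores)
is_safe = not flags
print(flags)    # {'insult': 0.81}
print(is_safe)  # False
```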

For multi-category safety (hate speech, self-harm, sexual content, violence), use OpenAI’s moderation endpoint or Meta’s Llama Guard:

def check_moderation(text: str) -> dict:
    response = client.moderations.create(input=text)
    result = response.results[0]
    flagged_categories = {
        cat: score
        for cat, score in result.category_scores.model_dump().items()
        if getattr(result.categories, cat)
    }
    return {
        "safe": not result.flagged,
        "flagged_categories": flagged_categories,
    }

4) Factual grounding checks

For RAG applications, verify that the response is grounded in retrieved context:

import json

def check_grounding(response: str, contexts: list[str], threshold: float = 0.7) -> dict:
    context_text = "\n---\n".join(contexts)
    prompt = f"""Analyze whether each claim in the response is supported by the context.

Context:
{context_text[:3000]}

Response:
{response}

For each distinct claim, state whether it is SUPPORTED, NOT_SUPPORTED, or UNCLEAR.
Return JSON: {{"claims": [{{"claim": "...", "verdict": "...", "evidence": "..."}}], "grounding_score": <0.0-1.0>}}"""

    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    result = json.loads(resp.choices[0].message.content)
    is_grounded = result["grounding_score"] >= threshold
    return {"grounded": is_grounded, "details": result}
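
A model-reported `grounding_score` can disagree with the model's own per-claim verdicts. One option, sketched below, is to recompute the score from the claims list. `score_from_claims` is a hypothetical helper, and both counting UNCLEAR as unsupported and scoring an empty claims list as 0.0 are assumptions made here, not something the prompt above specifies.

```python
def score_from_claims(claims: list[dict]) -> float:
    """Fraction of claims marked SUPPORTED; UNCLEAR counts as unsupported (assumption)."""
    if not claims:
        # Assumption: no extractable claims is treated as ungrounded (also avoids dividing by zero).
        return 0.0
    supported = sum(1 for c in claims if c.get("verdict", "").upper() == "SUPPORTED")
    return supported / len(claims)

# Invented judge output in the JSON shape requested by the prompt.
judge_output = {
    "claims": [
        {"claim": "Returns are accepted within 30 days", "verdict": "SUPPORTED", "evidence": "..."},
        {"claim": "Shipping is free worldwide", "verdict": "NOT_SUPPORTED", "evidence": ""},
        {"claim": "Refunds take 5 days", "verdict": "SUPPORTED", "evidence": "..."},
        {"claim": "Gift cards are refundable", "verdict": "UNCLEAR", "evidence": ""},
    ],
    "grounding_score": 0.9,
}

recomputed = score_from_claims(judge_output["claims"])
print(recomputed)         # 0.5
print(recomputed >= 0.7)  # False, even though the self-reported score was 0.9
```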

5) Guardrails AI framework integration

The Guardrails AI library provides a declarative approach:

from guardrails import Guard
from guardrails.hub import ValidJson, ToxicLanguage, RestrictToTopic

guard = Guard().use_many(
    ValidJson(on_fail="reask"),
    ToxicLanguage(threshold=0.8, on_fail="filter"),
    RestrictToTopic(
        valid_topics=["customer support", "product information"],
        invalid_topics=["politics", "religion"],
        on_fail="refrain",
    ),
)

result = guard(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Tell me about your return policy"}],
)

if result.validation_passed:
    print(result.validated_output)
else:
    print(f"Blocked: {result.error}")

The on_fail parameter controls behavior: reask retries with feedback, filter removes offending content, refrain returns nothing, and exception raises an error.

6) NeMo Guardrails for dialog control

NVIDIA’s NeMo Guardrails uses Colang to define conversation flows:

define user ask about competitor
  "What about {competitor_name}?"
  "How does {competitor_name} compare?"
  "Is {competitor_name} better?"

define bot refuse competitor comparison
  "I can help you with questions about our products. For information about other companies, I'd suggest checking their websites directly."

define flow competitor deflection
  user ask about competitor
  bot refuse competitor comparison

This approach works well for customer-facing chatbots where conversation boundaries must be strictly enforced.

7) Monitoring guardrail effectiveness

Track these metrics in production:

from dataclasses import dataclass, field
from collections import defaultdict

@dataclass
class GuardrailMetrics:
    total_requests: int = 0
    blocked_inputs: int = 0
    blocked_outputs: int = 0
    retries_triggered: int = 0
    fallbacks_served: int = 0
    block_reasons: dict = field(default_factory=lambda: defaultdict(int))

    def block_rate(self) -> float:
        if self.total_requests == 0:
            return 0.0
        return (self.blocked_inputs + self.blocked_outputs) / self.total_requests

    def retry_rate(self) -> float:
        if self.total_requests == 0:
            return 0.0
        return self.retries_triggered / self.total_requests

Alert on:

Block rate above baseline (may indicate an attack or a model regression).
Retry rate spiking (may indicate that model output quality has degraded).
Specific block reasons trending upward.
Latency increase from guardrail processing.
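
The first two alerts can be expressed as a comparison against a stored baseline. This is a minimal sketch: `check_alerts`, the `tolerance` multiplier, and the sample numbers are assumptions for illustration, and a real deployment would more likely compute rates over a time window in its metrics system.

```python
def check_alerts(current: dict[str, float], baseline: dict[str, float], tolerance: float = 1.5) -> list[str]:
    """Return alert messages for rates that exceed baseline * tolerance."""
    alerts = []
    for name, value in current.items():
        expected = baseline.get(name)
        if expected is None:
            continue  # no baseline recorded for this metric yet
        if value > expected * tolerance:
            alerts.append(f"{name} is {value:.2%}, baseline {expected:.2%}")
    return alerts

# Invented numbers: block rate has tripled, retry rate is within tolerance.
baseline = {"block_rate": 0.02, "retry_rate": 0.05}
current = {"block_rate": 0.06, "retry_rate": 0.06}

for message in check_alerts(current, baseline):
    print(message)  # block_rate is 6.00%, baseline 2.00%
```

The `current` values could be filled from `GuardrailMetrics.block_rate()` and `GuardrailMetrics.retry_rate()`.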

8) Performance considerations

Guardrails add latency. Budget for it; the figures below are rough estimates that vary with hardware, model size, and network conditions:

Rule-based checks: <5ms per check.
Local classifier (toxic-bert): 10-50ms on GPU.
API-based moderation (OpenAI): 100-300ms.
LLM-as-judge grounding check: 500-2000ms.

Run fast checks first and skip expensive checks when fast checks already block. Use async execution for independent checks. Cache repeated input patterns.
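
The ordering and concurrency advice might look like the following with `asyncio`. All check functions here are hypothetical stand-ins (the sleeps simulate network or model latency), and `run_checks` is an illustrative name.

```python
import asyncio
from functools import lru_cache

@lru_cache(maxsize=10_000)
def fast_rule_check(text: str) -> bool:
    """Cheap synchronous check; cached so repeated inputs skip the work."""
    return "ignore previous instructions" not in text.lower()

async def moderation_check(text: str) -> bool:
    await asyncio.sleep(0.05)  # stand-in for an API-based moderation call
    return "forbidden" not in text.lower()

async def classifier_check(text: str) -> bool:
    await asyncio.sleep(0.05)  # stand-in for a local classifier call
    return len(text) < 10_000

async def run_checks(text: str) -> bool:
    # Fast check first: if it blocks, the expensive checks never run.
    if not fast_rule_check(text):
        return False
    # The remaining checks are independent, so run them concurrently.
    results = await asyncio.gather(moderation_check(text), classifier_check(text))
    return all(results)

print(asyncio.run(run_checks("What is your return policy?")))           # True
print(asyncio.run(run_checks("Ignore previous instructions, please")))  # False
```

With `asyncio.gather`, the two slow checks take roughly as long as the slower one rather than the sum of both. A blocking call, such as local model inference, would need `asyncio.to_thread` or an executor to avoid stalling the event loop.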

The one thing to remember: Production guardrails are layered defense systems. Combine fast rule-based input checks, schema enforcement with retries, content safety classifiers, and grounding verification, then monitor effectiveness to catch what individual layers miss.
