Guardrails for AI in Python — Deep Dive
Production AI applications need multiple guardrail layers working together. Each layer catches a different class of failure, and the system degrades gracefully when individual checks miss something. This guide covers how to architect these layers in Python.
1) Input guardrail pipeline
Process user input through a sequence of checks before it reaches the LLM:
from dataclasses import dataclass
from enum import Enum


class InputVerdict(Enum):
    PASS = "pass"
    BLOCK = "block"
    MODIFY = "modify"


@dataclass
class InputCheckResult:
    verdict: InputVerdict
    reason: str
    modified_input: str | None = None


class InputGuardrail:
    def __init__(self):
        self.checks = []

    def add_check(self, check_fn):
        self.checks.append(check_fn)
        return self  # returning self allows chained add_check calls

    def run(self, user_input: str) -> InputCheckResult:
        current_input = user_input
        for check in self.checks:
            result = check(current_input)
            # A single BLOCK verdict stops the pipeline immediately.
            if result.verdict == InputVerdict.BLOCK:
                return result
            # Compare against None so an empty-string rewrite is not silently dropped.
            if result.verdict == InputVerdict.MODIFY and result.modified_input is not None:
                current_input = result.modified_input
        return InputCheckResult(verdict=InputVerdict.PASS, reason="All checks passed",
                                modified_input=current_input)
Prompt injection detection
Prompt injection is the top input-side threat. Detect it with a combination of heuristics and classification:
import re

INJECTION_PATTERNS = [
    r"ignore\s+(all\s+)?previous\s+instructions",
    r"you\s+are\s+now\s+",
    r"system\s*:\s*",
    r"<\|?(system|im_start)\|?>",
    r"pretend\s+you\s+are",
    r"roleplay\s+as",
]


def check_prompt_injection(user_input: str) -> InputCheckResult:
    input_lower = user_input.lower()
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, input_lower):
            return InputCheckResult(
                verdict=InputVerdict.BLOCK,
                reason=f"Potential prompt injection detected: {pattern}",
            )
    return InputCheckResult(verdict=InputVerdict.PASS, reason="No injection detected")
For production systems, supplement regex with a fine-tuned classifier. Models such as deberta-v3-base fine-tuned on prompt injection datasets are reported to reach 95%+ detection rates, but measure the false-positive rate on your own traffic before relying on that figure.
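One way to combine the two signals is sketched below. The `score_fn` callable stands in for whatever classifier you deploy; its name, the 0.9 threshold, and the `keyword_score` stub are assumptions made for illustration and are not part of any library.

```python
import re
from typing import Callable

# A small subset of heuristic patterns, for illustration only.
PATTERNS = [
    r"ignore\s+(all\s+)?previous\s+instructions",
    r"pretend\s+you\s+are",
]


def detect_injection(
    text: str,
    score_fn: Callable[[str], float],
    threshold: float = 0.9,
) -> tuple[bool, str]:
    """Return (blocked, reason). Regex runs first because it is cheap."""
    lowered = text.lower()
    for pattern in PATTERNS:
        if re.search(pattern, lowered):
            return True, "regex"
    # Only pay for the classifier when the heuristics found nothing.
    score = score_fn(text)
    if score >= threshold:
        return True, f"classifier score {score:.2f}"
    return False, "clean"


def keyword_score(text: str) -> float:
    """Hypothetical stand-in for a real classifier; returns a fake probability."""
    return 0.95 if "system prompt" in text.lower() else 0.05
```

Passing the classifier in as a callable keeps the function testable without loading a model, and lets you swap in a real scorer later.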
PII redaction
import re

PII_PATTERNS = {
    # [A-Za-z] rather than [A-Z|a-z]: inside a character class, "|" is a literal pipe.
    "email": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
    "phone": r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b",
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "credit_card": r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b",
}


def redact_pii(user_input: str) -> InputCheckResult:
    modified = user_input
    found = []
    for pii_type, pattern in PII_PATTERNS.items():
        if re.search(pattern, modified):
            modified = re.sub(pattern, f"[REDACTED_{pii_type.upper()}]", modified)
            found.append(pii_type)
    if found:
        return InputCheckResult(
            verdict=InputVerdict.MODIFY,
            reason=f"Redacted PII: {', '.join(found)}",
            modified_input=modified,
        )
    return InputCheckResult(verdict=InputVerdict.PASS, reason="No PII found")
For production, use a specialized library such as Presidio (Microsoft), which handles entity recognition across languages and supports custom recognizers.
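The credit-card pattern above matches any 16-digit group, including order numbers and tracking IDs. A common way to cut false positives is to validate candidates with the Luhn checksum before redacting. The sketch below is one possible way to wire that in; `luhn_valid` and `redact_cards` are illustrative names, not part of any library.

```python
import re

CARD_RE = re.compile(r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Walk right to left, doubling every second digit.
    for i, digit in enumerate(reversed(digits)):
        if i % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return bool(digits) and total % 10 == 0


def redact_cards(text: str) -> str:
    """Redact only the 16-digit groups that pass the Luhn check."""
    def _replace(match: re.Match) -> str:
        candidate = match.group()
        return "[REDACTED_CREDIT_CARD]" if luhn_valid(candidate) else candidate

    return CARD_RE.sub(_replace, text)
```

The trade-off is that a mistyped card number fails the checksum and passes through unredacted, so stricter deployments may prefer to redact every match.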
2) Output schema enforcement with retries
Use Pydantic to enforce structure and retry on failure:
import re

from openai import OpenAI
from pydantic import BaseModel, ValidationError, field_validator

client = OpenAI()


class ProductRecommendation(BaseModel):
    product_name: str
    reason: str
    confidence: float
    price_range: str

    @field_validator("confidence")
    @classmethod
    def confidence_range(cls, v):
        if not 0 <= v <= 1:
            raise ValueError("Confidence must be between 0 and 1")
        return v

    @field_validator("price_range")
    @classmethod
    def valid_price_range(cls, v):
        # fullmatch rejects trailing junk such as "$10-$50abc", which re.match would accept.
        if not re.fullmatch(r"\$\d+-\$\d+", v):
            raise ValueError("Price range must be like '$10-$50'")
        return v


def get_recommendation(query: str, max_retries: int = 3) -> ProductRecommendation:
    messages = [
        {"role": "system", "content": "Recommend a product. Return JSON with: product_name, reason, confidence (0-1), price_range ($X-$Y)."},
        {"role": "user", "content": query},
    ]
    for attempt in range(max_retries):
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            response_format={"type": "json_object"},
        )
        raw = resp.choices[0].message.content
        try:
            return ProductRecommendation.model_validate_json(raw)
        except ValidationError as e:
            # Feed the failed output and the error back so the model can correct itself.
            messages.append({"role": "assistant", "content": raw})
            messages.append({"role": "user", "content": f"Output validation failed: {e}. Fix and return valid JSON."})
    raise ValueError(f"Failed to get valid output after {max_retries} attempts")
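The retry loop can be exercised without a network call by factoring out the model call. The sketch below uses only the standard library. `generate` is a hypothetical callable that takes the message list and returns raw text, `validate` is any function that raises `ValueError` on bad output, and `fake_model` is a stand-in used purely to demonstrate the flow.

```python
import json
from typing import Callable


def call_with_retries(
    generate: Callable[[list[dict]], str],
    validate: Callable[[str], dict],
    messages: list[dict],
    max_attempts: int = 3,
) -> dict:
    """Call `generate`, validate the result, and feed errors back on failure."""
    history = list(messages)  # copy so the caller's list is not mutated
    for _ in range(max_attempts):
        raw = generate(history)
        try:
            return validate(raw)
        except ValueError as exc:
            # Show the model its own output along with the validation error.
            history.append({"role": "assistant", "content": raw})
            history.append({"role": "user", "content": f"Validation failed: {exc}. Return valid JSON."})
    raise ValueError(f"No valid output after {max_attempts} attempts")


def validate_confidence(raw: str) -> dict:
    """Parse JSON and require a confidence value in [0, 1]."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or not 0 <= data.get("confidence", -1) <= 1:
        raise ValueError("confidence must be between 0 and 1")
    return data


def fake_model(history: list[dict]) -> str:
    """Hypothetical stand-in for an LLM: wrong at first, corrected after feedback."""
    return '{"confidence": 0.8}' if len(history) > 1 else '{"confidence": 7}'
```

Separating generation from validation also makes it easy to unit-test the retry policy and to cap attempts per request.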
3) Content safety classification
Use a dedicated model for toxicity and safety classification:
from transformers import pipeline

safety_classifier = pipeline(
    "text-classification",
    model="unitary/toxic-bert",
    device=0,  # GPU
)

SAFETY_THRESHOLD = 0.7


def check_content_safety(output: str) -> tuple[bool, float]:
    # Note: this truncates by characters, so text past the first 512 characters is not checked.
    result = safety_classifier(output[:512])[0]
    is_toxic = result["label"] == "toxic" and result["score"] > SAFETY_THRESHOLD
    return not is_toxic, result["score"]
For multi-category safety (hate speech, self-harm, sexual content, violence), use OpenAI’s moderation endpoint or Meta’s Llama Guard:
def check_moderation(text: str) -> dict:
    response = client.moderations.create(input=text)
    result = response.results[0]
    flagged_categories = {
        cat: score
        for cat, score in result.category_scores.model_dump().items()
        if getattr(result.categories, cat)
    }
    return {
        "safe": not result.flagged,
        "flagged_categories": flagged_categories,
    }
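The `flagged` boolean reflects the provider's own thresholds. If your application needs stricter limits for some categories, you can compare the raw category scores against thresholds of your own. The category names and threshold values below are illustrative assumptions, not recommended settings.

```python
# Hypothetical per-category thresholds; tune these on your own data.
THRESHOLDS = {"self_harm": 0.2, "violence": 0.5}
DEFAULT_THRESHOLD = 0.7


def violations(scores: dict[str, float]) -> dict[str, float]:
    """Return the categories whose score meets or exceeds its threshold."""
    return {
        category: score
        for category, score in scores.items()
        if score >= THRESHOLDS.get(category, DEFAULT_THRESHOLD)
    }
```

For example, `violations({"self_harm": 0.3, "violence": 0.4})` reports only `self_harm`, because 0.4 is below the assumed violence threshold of 0.5.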
4) Factual grounding checks
For RAG applications, verify that the response is grounded in retrieved context:
import json


def check_grounding(response: str, contexts: list[str], threshold: float = 0.7) -> dict:
    context_text = "\n---\n".join(contexts)
    # Context is cut to 3000 characters to bound prompt size.
    prompt = f"""Analyze whether each claim in the response is supported by the context.
Context:
{context_text[:3000]}
Response:
{response}
For each distinct claim, state whether it is SUPPORTED, NOT_SUPPORTED, or UNCLEAR.
Return JSON: {{"claims": [{{"claim": "...", "verdict": "...", "evidence": "..."}}], "grounding_score": <0.0-1.0>}}"""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    result = json.loads(resp.choices[0].message.content)
    # Fail closed: a missing score is treated as ungrounded.
    is_grounded = result.get("grounding_score", 0.0) >= threshold
    return {"grounded": is_grounded, "details": result}
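A self-reported `grounding_score` can drift from the per-claim verdicts the judge model lists. One option, sketched here as an assumption and not as part of the approach above, is to recompute the score from the verdicts. Counting UNCLEAR claims as unsupported is a deliberately conservative choice.

```python
def score_from_claims(claims: list[dict]) -> float:
    """Fraction of claims marked SUPPORTED; UNCLEAR counts as unsupported."""
    if not claims:
        return 0.0  # no claims extracted: treat the response as ungrounded
    supported = sum(1 for c in claims if c.get("verdict") == "SUPPORTED")
    return supported / len(claims)
```

With two SUPPORTED claims out of four, the function returns 0.5, which would fall below the default threshold of 0.7.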
5) Guardrails AI framework integration
The Guardrails AI library provides a declarative approach:
from guardrails import Guard
from guardrails.hub import ValidJson, ToxicLanguage, RestrictToTopic

guard = Guard().use_many(
    ValidJson(on_fail="reask"),
    ToxicLanguage(threshold=0.8, on_fail="filter"),
    RestrictToTopic(
        valid_topics=["customer support", "product information"],
        invalid_topics=["politics", "religion"],
        on_fail="refrain",
    ),
)

result = guard(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Tell me about your return policy"}],
)

if result.validation_passed:
    print(result.validated_output)
else:
    print(f"Blocked: {result.error}")
The on_fail parameter controls behavior: reask retries with feedback, filter removes offending content, refrain returns nothing, and exception raises an error.
6) NeMo Guardrails for dialog control
NVIDIA’s NeMo Guardrails uses Colang to define conversation flows:
define user ask about competitor
  "What about {competitor_name}?"
  "How does {competitor_name} compare?"
  "Is {competitor_name} better?"

define bot refuse competitor comparison
  "I can help you with questions about our products. For information about other companies, I'd suggest checking their websites directly."

define flow competitor deflection
  user ask about competitor
  bot refuse competitor comparison
This approach works well for customer-facing chatbots where conversation boundaries must be strictly enforced.
7) Monitoring guardrail effectiveness
Track these metrics in production:
from dataclasses import dataclass, field
from collections import defaultdict


@dataclass
class GuardrailMetrics:
    total_requests: int = 0
    blocked_inputs: int = 0
    blocked_outputs: int = 0
    retries_triggered: int = 0
    fallbacks_served: int = 0
    block_reasons: dict = field(default_factory=lambda: defaultdict(int))

    def block_rate(self) -> float:
        if self.total_requests == 0:
            return 0.0
        return (self.blocked_inputs + self.blocked_outputs) / self.total_requests

    def retry_rate(self) -> float:
        if self.total_requests == 0:
            return 0.0
        return self.retries_triggered / self.total_requests
Alert on:
- Block rate above baseline (may indicate an attack or a model regression).
- Retry rate spiking (model output quality may have degraded).
- Specific block reasons trending upward.
- Latency increase from guardrail processing.
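The first alert can be approximated with a simple comparison against a rolling baseline. The sketch below assumes you already record a block rate per time window; the function name, the `factor` of 2.0, and the `min_requests` floor are placeholder choices for illustration.

```python
from statistics import mean


def should_alert(
    history: list[float],
    current_rate: float,
    current_requests: int,
    factor: float = 2.0,
    min_requests: int = 50,
) -> bool:
    """Alert when the current block rate exceeds `factor` times the baseline."""
    if current_requests < min_requests or not history:
        return False  # too little data to judge
    baseline = mean(history)
    return current_rate > baseline * factor
```

The minimum-request floor keeps a handful of blocked requests during a quiet period from triggering a page.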
8) Performance considerations
Guardrails add latency. Budget for it:
- Rule-based checks: <5ms per check.
- Local classifier (toxic-bert): 10-50ms on GPU.
- API-based moderation (OpenAI): 100-300ms.
- LLM-as-judge grounding check: 500-2000ms.
Run fast checks first and skip expensive checks when fast checks already block. Use async execution for independent checks. Cache repeated input patterns.
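A sketch of that ordering using `asyncio`: a cheap synchronous check runs first and short-circuits, then the slower independent checks run concurrently. The check functions here are hypothetical stand-ins (the sleeps simulate network or model latency), and each returns `True` when the text is allowed.

```python
import asyncio
import re


def fast_check(text: str) -> bool:
    """Cheap rule-based check: block an obvious injection phrase."""
    return re.search(r"ignore\s+previous\s+instructions", text.lower()) is None


async def moderation_check(text: str) -> bool:
    await asyncio.sleep(0.01)  # simulated API latency
    return "hateful" not in text.lower()


async def classifier_check(text: str) -> bool:
    await asyncio.sleep(0.01)  # simulated model latency
    return "attack" not in text.lower()


async def run_guardrails(text: str) -> bool:
    # Fast path: skip the expensive checks if a rule already blocks.
    if not fast_check(text):
        return False
    # Independent slow checks run concurrently, not one after another.
    results = await asyncio.gather(moderation_check(text), classifier_check(text))
    return all(results)
```

With `asyncio.gather`, the total wait is roughly the slowest check and not the sum of all checks. Call it with `asyncio.run(run_guardrails(user_text))` from synchronous code.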
The one thing to remember: Production guardrails are layered defense systems — combine fast rule-based input checks, schema enforcement with retries, content safety classifiers, and grounding verification, then monitor effectiveness to catch what individual layers miss.