Python Response Generation — Deep Dive

Build production response generation systems in Python — template engines, retrieval rankers, grounded LLM generation, and safety guardrails.

Response Generation Architecture

Response generation sits at the end of the chatbot pipeline. It receives an action (what to say) and context (conversation history, slot values, API results) and produces the actual text the user sees. In production, this layer must be fast, reliable, and safe — because it is the only thing the user directly experiences.

Template-Based Systems

Advanced Jinja2 Patterns

Production template systems use Jinja2’s full feature set:

from jinja2 import Environment, FileSystemLoader, select_autoescape

env = Environment(
    loader=FileSystemLoader("templates/"),
    autoescape=select_autoescape(),
    trim_blocks=True,
    lstrip_blocks=True,
)

# Contents of templates/booking_confirmed.j2, shown inline as a string
TEMPLATE = """
{% set greeting = ["Great news!", "Awesome!", "All set!"] | random %}
{{ greeting }}

{% if passengers == 1 %}
Your flight to {{ destination }} on {{ date }} is confirmed.
{% else %}
{{ passengers }} seats to {{ destination }} on {{ date }} — confirmed!
{% endif %}

Booking reference: {{ booking_id }}

{% if special_requests %}
We've noted your requests:
{% for req in special_requests %}
  • {{ req }}
{% endfor %}
{% endif %}
"""

Template Registry Pattern

Manage hundreds of templates with a registry:

from dataclasses import dataclass
from jinja2 import Environment, BaseLoader

@dataclass
class ResponseTemplate:
    action: str
    template: str
    variations: list[str]
    channel_overrides: dict[str, str]  # channel -> template

class TemplateRegistry:
    def __init__(self):
        self.templates: dict[str, ResponseTemplate] = {}
        self.env = Environment(loader=BaseLoader())

    def register(self, action: str, template: str,
                 variations: list[str] | None = None,
                 channel_overrides: dict[str, str] | None = None):
        self.templates[action] = ResponseTemplate(
            action=action,
            template=template,
            variations=variations or [],
            channel_overrides=channel_overrides or {},
        )

    def render(self, action: str, context: dict,
               channel: str = "default") -> str:
        entry = self.templates.get(action)
        if not entry:
            return "I'm not sure how to respond to that."

        # Channel-specific override
        if channel in entry.channel_overrides:
            tmpl_str = entry.channel_overrides[channel]
        elif entry.variations:
            import random
            tmpl_str = random.choice([entry.template] + entry.variations)
        else:
            tmpl_str = entry.template

        template = self.env.from_string(tmpl_str)
        return template.render(**context)

Channel-Specific Formatting

Different platforms need different formatting:

class ChannelFormatter:
    @staticmethod
    def format_for_channel(text: str, channel: str) -> dict:
        if channel == "slack":
            return {"text": text, "mrkdwn": True}
        elif channel == "telegram":
            return {"text": text, "parse_mode": "HTML"}
        elif channel == "whatsapp":
            # WhatsApp uses its own markup: bold is *text*, not **text**
            text = text.replace("**", "*")  # bold
            text = text.replace("• ", "- ")  # bullets
            return {"text": text}
        return {"text": text}
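
Telegram's HTML parse mode expects `&`, `<`, and `>` in ordinary text to be escaped, so any text sent with it should be escaped first. The sketch below assumes templates use `**bold**` as a channel-neutral markup (as the WhatsApp branch above implies) and converts it per channel; `render_bold` is a hypothetical helper, not part of any platform SDK.

```python
import html
import re

BOLD = re.compile(r"\*\*(.+?)\*\*")


def render_bold(text: str, channel: str) -> str:
    """Convert neutral **bold** markup into the channel's own syntax."""
    if channel == "telegram":
        # Escape &, < and > first so user data cannot break HTML parsing,
        # then wrap bold spans in <b> tags.
        escaped = html.escape(text, quote=False)
        return BOLD.sub(r"<b>\1</b>", escaped)
    if channel in ("slack", "whatsapp"):
        # Both use single asterisks for bold.
        return BOLD.sub(r"*\1*", text)
    # Unknown channel: strip the markers and send plain text.
    return BOLD.sub(r"\1", text)


print(render_bold("**Gate change**: A12 -> B7 & boarding < 10 min", "telegram"))
print(render_bold("**Gate change**: A12 -> B7", "whatsapp"))
```

Escaping before inserting the `<b>` tags matters: doing it the other way round would escape the tags themselves.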

Retrieval-Based Generation

Candidate Scoring with Sentence Transformers

from sentence_transformers import SentenceTransformer, util
import torch

class RetrievalResponder:
    def __init__(self, responses: list[dict], model_name: str = "all-MiniLM-L6-v2"):
        self.encoder = SentenceTransformer(model_name)
        self.responses = responses
        self.response_texts = [r["text"] for r in responses]
        self.embeddings = self.encoder.encode(
            self.response_texts, convert_to_tensor=True
        )

    def get_response(self, context: str, top_k: int = 3) -> list[dict]:
        query_emb = self.encoder.encode(context, convert_to_tensor=True)
        scores = util.cos_sim(query_emb, self.embeddings)[0]
        # torch.topk returns both the top scores (.values) and their positions (.indices)
        top = torch.topk(scores, k=min(top_k, len(self.responses)))

        results = []
        for score, idx in zip(top.values, top.indices):
            results.append({
                "text": self.responses[idx.item()]["text"],
                "score": score.item(),
                "metadata": self.responses[idx.item()].get("metadata", {}),
            })
        return results
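
To make the scoring step concrete, here is a dependency-free sketch of what the cosine-similarity and top-k calls compute. The three-dimensional vectors are hand-written stand-ins for encoder output, and `rank_candidates` is an illustrative name; real embeddings from all-MiniLM-L6-v2 have far more dimensions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_candidates(query: list[float], candidates: dict[str, list[float]],
                    top_k: int = 3) -> list[tuple[str, float]]:
    """Score every candidate against the query and keep the best top_k."""
    scored = [(text, cosine(query, vec)) for text, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional "embeddings"; real ones come from the sentence encoder.
candidates = {
    "You can change your flight in the app.": [0.9, 0.1, 0.0],
    "Checked bags cost extra on basic fares.": [0.0, 1.0, 0.2],
    "Our lounge opens at 5 am.": [0.1, 0.0, 1.0],
}
query = [1.0, 0.2, 0.0]  # stands in for the embedding of "how do I rebook?"

for text, score in rank_candidates(query, candidates, top_k=2):
    print(f"{score:.2f}  {text}")
```

Because cosine similarity ignores vector length, a long and a short response with the same direction score identically; only the angle to the query matters.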

Hybrid Retrieval + Template

Use retrieval for the conversational part and templates for structured data:

class HybridResponder:
    def __init__(self, retrieval: RetrievalResponder, registry: TemplateRegistry):
        self.retrieval = retrieval
        self.registry = registry

    def respond(self, action: str, context: dict, conversation_text: str) -> str:
        # Structured data via template
        factual_part = self.registry.render(action, context)

        # Conversational wrapper via retrieval
        candidates = self.retrieval.get_response(conversation_text, top_k=1)
        if candidates and candidates[0]["score"] > 0.7:
            conversational_part = candidates[0]["text"]
            return f"{conversational_part}\n\n{factual_part}"

        return factual_part

LLM-Based Generation

Grounded Generation Pattern

The key to reliable LLM responses is grounding — providing verified data and instructing the model to use only that data:

import openai

class GroundedGenerator:
    def __init__(self, model: str = "gpt-4o-mini"):
        self.model = model
        self.client = openai.OpenAI()

    def generate(self, action: str, structured_data: dict,
                 conversation_history: list[dict],
                 persona: str = "friendly customer service agent") -> str:

        system_prompt = f"""You are a {persona} for an airline.
Generate a natural response for the action: {action}

VERIFIED DATA (use ONLY these facts):
{self._format_data(structured_data)}

RULES:
- Include all verified data points in your response
- Do NOT invent any facts, numbers, or dates not in the verified data
- Keep the response concise (2-4 sentences)
- Match the conversation's tone
- Do not start with "I" """

        messages = [
            {"role": "system", "content": system_prompt},
            *conversation_history[-5:],  # Last 5 messages (user and assistant combined)
            {"role": "user", "content": f"Generate response for: {action}"},
        ]

        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=0.7,
            max_tokens=200,
        )
        # content can be None (e.g. on a refusal); return "" so validation fails cleanly
        return response.choices[0].message.content or ""

    def _format_data(self, data: dict) -> str:
        return "\n".join(f"- {k}: {v}" for k, v in data.items())
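
It can help to inspect exactly what the model will see before spending an API call. The sketch below rebuilds the message list with no network access; `build_messages` and `max_history` are hypothetical names, and the system prompt is shortened to the grounding section only.

```python
def format_data(data: dict) -> str:
    """Render verified facts as one bullet per key, as _format_data does."""
    return "\n".join(f"- {k}: {v}" for k, v in data.items())


def build_messages(action: str, data: dict, history: list[dict],
                   max_history: int = 5) -> list[dict]:
    """Assemble the chat messages without calling any API."""
    system_prompt = (
        f"Generate a natural response for the action: {action}\n\n"
        "VERIFIED DATA (use ONLY these facts):\n"
        f"{format_data(data)}"
    )
    return [
        {"role": "system", "content": system_prompt},
        *history[-max_history:],  # most recent messages only
        {"role": "user", "content": f"Generate response for: {action}"},
    ]


history = [
    {"role": "user", "content": "Did my booking go through?"},
    {"role": "assistant", "content": "Let me check that for you."},
]
data = {"booking_id": "AB123", "destination": "Lisbon", "date": "2024-05-12"}

for message in build_messages("confirm_booking", data, history):
    print(message["role"], "->", message["content"][:60])
```

Keeping prompt assembly in a pure function like this makes it straightforward to unit-test that every verified fact reaches the prompt.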

Output Validation

Never send LLM output to users without validation:

import re

class ResponseValidator:
    def __init__(self, structured_data: dict):
        self.data = structured_data

    def validate(self, response: str) -> tuple[bool, list[str]]:
        issues = []

        # Check all critical data points are present
        for key in ["booking_id", "date", "destination"]:
            if key in self.data:
                value = str(self.data[key])
                if value.lower() not in response.lower():
                    issues.append(f"Missing critical data: {key}={value}")

        # Check for hallucinated numbers not in source data
        numbers_in_response = set(re.findall(r"\b\d+\b", response))
        numbers_in_data = set()
        for v in self.data.values():
            numbers_in_data.update(re.findall(r"\b\d+\b", str(v)))
        # Small counts ("1", "2", "3") are exempted to cut false positives
        hallucinated = numbers_in_response - numbers_in_data - {"1", "2", "3"}
        if hallucinated:
            issues.append(f"Potentially hallucinated numbers: {hallucinated}")

        # Check for banned phrases
        banned = ["as an ai", "i cannot", "i don't have access"]
        for phrase in banned:
            if phrase in response.lower():
                issues.append(f"Banned phrase detected: '{phrase}'")

        return len(issues) == 0, issues
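
The number check is the least obvious of the three, so here it is in isolation. This is a standalone sketch of the same set-difference idea; `unsupported_numbers` is an illustrative name and the booking data is made up.

```python
import re

NUMBER = re.compile(r"\b\d+\b")


def unsupported_numbers(response: str, data: dict,
                        allowed: frozenset[str] = frozenset({"1", "2", "3"})) -> set[str]:
    """Return numbers in the response that appear nowhere in the source data."""
    in_response = set(NUMBER.findall(response))
    in_data: set[str] = set()
    for value in data.values():
        in_data.update(NUMBER.findall(str(value)))
    # `allowed` mirrors the validator's exemption for small counts.
    return in_response - in_data - allowed


data = {"booking_id": "AB123", "date": "2024-05-12", "destination": "Lisbon"}
good = "Booking AB123 to Lisbon on 2024-05-12 is confirmed."
bad = "Booking AB123 to Lisbon on 2024-05-12 departs from gate 47."

print(unsupported_numbers(good, data))
print(unsupported_numbers(bad, data))
```

Two limitations follow from the regex. Digits glued to letters, such as the 123 in AB123, are not matched at all because there is no word boundary inside the token, and numbers written as words ("forty-seven") pass unnoticed. The check is a cheap first filter, not proof that a response is grounded.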

Fallback to Templates

When LLM generation fails validation, fall back to templates:

import logging

logger = logging.getLogger(__name__)

class SafeResponder:
    def __init__(self, llm: GroundedGenerator, templates: TemplateRegistry,
                 validator_class=ResponseValidator):
        self.llm = llm
        self.templates = templates
        self.validator_class = validator_class

    def respond(self, action: str, context: dict,
                history: list[dict], channel: str = "default") -> str:
        # Try LLM first
        try:
            llm_response = self.llm.generate(action, context, history)
            validator = self.validator_class(context)
            valid, issues = validator.validate(llm_response)
            if valid:
                return llm_response
            # Log issues for monitoring
            logger.warning(f"LLM response failed validation: {issues}")
        except Exception as e:
            logger.error(f"LLM generation failed: {e}")

        # Fallback to template
        return self.templates.render(action, context, channel)
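
The same try, validate, fall back flow can be exercised without an LLM by passing plain callables. In this sketch `respond_safely` and `flaky_llm` are hypothetical stand-ins; the function also returns which path produced the text, which is one possible way to feed a fallback-rate metric.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def respond_safely(generate: Callable[[], str],
                   validate: Callable[[str], tuple[bool, list[str]]],
                   fallback: Callable[[], str]) -> tuple[str, str]:
    """Return (text, method), where method is "llm" or "template"."""
    try:
        candidate = generate()
        ok, issues = validate(candidate)
        if ok:
            return candidate, "llm"
        logger.warning("LLM response failed validation: %s", issues)
    except Exception:
        # Any generation error (timeout, rate limit, bad payload) ends up here.
        logger.exception("LLM generation failed")
    return fallback(), "template"


def flaky_llm() -> str:
    raise TimeoutError("simulated upstream timeout")


text, method = respond_safely(
    generate=flaky_llm,
    validate=lambda reply: (True, []),
    fallback=lambda: "Your booking AB123 is confirmed.",
)
print(method, "->", text)
```

The fallback itself is deliberately outside the try block: if template rendering fails, that is a bug to surface rather than something to swallow.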

Response Enrichment

Rich Messages

Beyond plain text, chatbots send buttons, cards, carousels, and quick replies:

from dataclasses import dataclass

@dataclass
class Button:
    title: str
    payload: str

@dataclass
class Card:
    title: str
    subtitle: str
    image_url: str | None = None
    buttons: list[Button] | None = None

@dataclass
class BotResponse:
    text: str | None = None
    buttons: list[Button] | None = None
    cards: list[Card] | None = None
    quick_replies: list[str] | None = None

    def to_channel_format(self, channel: str) -> dict:
        if channel == "slack":
            return self._to_slack()
        elif channel == "telegram":
            return self._to_telegram()
        return {"text": self.text}

    def _to_slack(self) -> dict:
        blocks = []
        if self.text:
            blocks.append({"type": "section", "text": {"type": "mrkdwn", "text": self.text}})
        if self.buttons:
            blocks.append({
                "type": "actions",
                "elements": [
                    {"type": "button", "text": {"type": "plain_text", "text": b.title},
                     "value": b.payload}
                    for b in self.buttons
                ],
            })
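        return {"blocks": blocks}

    def _to_telegram(self) -> dict:
        # Each button becomes its own inline-keyboard row (Telegram Bot API)
        payload = {"text": self.text or ""}
        if self.buttons:
            payload["reply_markup"] = {
                "inline_keyboard": [
                    [{"text": b.title, "callback_data": b.payload}]
                    for b in self.buttons
                ]
            }
        return payload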

Monitoring Response Quality

Track these metrics in production:

Response length distribution — Sudden changes indicate template or prompt issues
Fallback rate — How often the LLM fails validation and templates take over
User satisfaction signals — Thumbs up/down, conversation completion rates
Generation latency — P50 and P99 for LLM responses vs. templates

import time
from dataclasses import dataclass

@dataclass
class ResponseMetrics:
    action: str
    generation_method: str  # "template", "retrieval", "llm"
    latency_ms: float
    was_validated: bool
    passed_validation: bool
    response_length: int
    channel: str
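
A minimal sketch of how such records might be collected and aggregated follows. `Sample` is a hypothetical trimmed-down version of the record above, the percentile uses the simple nearest-rank method, and the latency values are invented for illustration.

```python
import math
import time
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical trimmed-down record: just what the aggregates need."""
    generation_method: str   # "template", "retrieval", or "llm"
    latency_ms: float
    was_validated: bool
    passed_validation: bool


def timed(fn):
    """Call fn() and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn()
    return result, (time.perf_counter() - start) * 1000.0


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; values must be non-empty."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def fallback_rate(samples: list[Sample]) -> float:
    """Share of validated responses that failed validation."""
    validated = [s for s in samples if s.was_validated]
    if not validated:
        return 0.0
    return sum(not s.passed_validation for s in validated) / len(validated)


samples = [
    Sample("llm", 310.0, True, True),
    Sample("llm", 420.0, True, False),
    Sample("llm", 1900.0, True, True),
    Sample("template", 0.4, False, False),
]
llm_latencies = [s.latency_ms for s in samples if s.generation_method == "llm"]
print("fallback rate:", fallback_rate(samples))
print("P50:", percentile(llm_latencies, 50), "P99:", percentile(llm_latencies, 99))
```

With only a handful of samples P99 is simply the slowest request, so tail percentiles become meaningful only once there are enough data points per window.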

Performance Comparison

Method	Latency	Cost/Message	Quality	Safety
Template	<1ms	$0	Predictable	Very High
Retrieval	5-15ms	$0	Natural	High
LLM (GPT-4o-mini)	200-500ms	$0.0001-0.001	Very Natural	Medium
LLM (GPT-4o)	500-2000ms	$0.005-0.02	Best	Medium

The production pattern: templates for transactional messages (confirmations, errors, data readouts), retrieval for FAQ and common scenarios, LLM for open-ended conversation and tone adaptation — with validation and template fallback as safety nets.
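
One way to express that routing is a small lookup from action to generation method. The action names and the two category sets below are hypothetical; a real bot would derive them from its own action inventory.

```python
from enum import Enum


class Method(Enum):
    TEMPLATE = "template"
    RETRIEVAL = "retrieval"
    LLM = "llm"


# Hypothetical action names, for illustration only.
TRANSACTIONAL = {"booking_confirmed", "payment_failed", "flight_status"}
FAQ = {"baggage_policy", "checkin_times"}


def choose_method(action: str) -> Method:
    """Route an action to the cheapest generation method that fits it."""
    if action in TRANSACTIONAL:
        return Method.TEMPLATE      # exact data, no room for paraphrase errors
    if action in FAQ:
        return Method.RETRIEVAL     # pre-written, reviewed answers
    return Method.LLM               # open-ended; validate, then fall back


for action in ("booking_confirmed", "baggage_policy", "small_talk"):
    print(action, "->", choose_method(action).value)
```

Defaulting unknown actions to the LLM path is a design choice that favors flexibility; a more conservative system might default to a template and opt actions into LLM generation explicitly.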

The one thing to remember: Production response generation layers templates (for safety and speed), retrieval (for natural pre-written answers), and LLM generation (for flexibility) — with output validation as the mandatory safety net before any generated text reaches the user.
