Prompt Chaining in Python — Deep Dive

Build production prompt chains in Python with structured output parsing, retry logic, parallel fan-out, cost tracking, and real-world patterns from LangChain and plain-function architectures.

Prompt chaining is fundamentally a dataflow problem: you define a directed graph of LLM calls and deterministic transforms, then execute it while managing latency, cost, and error propagation. Python’s flexibility makes it the default language for this, whether you use a framework or roll your own.

1) Designing chain topology

Before writing code, sketch the dependency graph. Each node is either an LLM call or a deterministic transform (parsing, validation, DB lookup). Edges carry typed data.

Key questions:

Which steps are independent and can run concurrently?
Where do you need human-in-the-loop checkpoints?
What is the maximum acceptable latency for the full chain?

A common real-world topology for a customer-support bot: classify intent → retrieve relevant docs → generate draft answer → check policy compliance → format response. Steps 2 and 4 can involve tool calls rather than LLM calls.
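One way to answer the concurrency question is to write the graph down as data and let the standard library group it into layers. The sketch below uses graphlib.TopologicalSorter (Python 3.9+). The step names are hypothetical labels for the customer-support topology above, and execution_layers is an illustrative helper, not part of any framework.

```python
from graphlib import TopologicalSorter

# Hypothetical step names for the customer-support topology described above.
# Each key maps to the set of steps it depends on.
deps = {
    "classify_intent": set(),
    "retrieve_docs": {"classify_intent"},
    "generate_draft": {"retrieve_docs"},
    "check_policy": {"generate_draft"},
    "format_response": {"check_policy"},
}

def execution_layers(graph: dict[str, set[str]]) -> list[list[str]]:
    """Group steps into layers; steps in one layer do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    layers = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        layers.append(ready)
        sorter.done(*ready)
    return layers

layers = execution_layers(deps)
for i, layer in enumerate(layers, start=1):
    print(i, layer)
```

For this strictly linear chain every layer holds a single step, so there is nothing to parallelize. Any layer with more than one entry marks a fan-out opportunity.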

2) Structured output between steps

The glue between steps must be reliable. Unstructured text passed between prompts leads to parsing failures downstream.

from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class ExtractedFacts(BaseModel):
    facts: list[str]
    confidence: float

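def extract_step(article: str) -> ExtractedFacts:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            # JSON mode constrains the reply to JSON, not to a particular schema,
            # so name the keys the Pydantic model expects.
            {"role": "system", "content": (
                "Extract key facts as JSON with keys "
                '"facts" (list of strings) and "confidence" (a number).'
            )},
            {"role": "user", "content": article},
        ],
        response_format={"type": "json_object"},
    )
    # content can be None; an empty string makes validation fail loudly here.
    return ExtractedFacts.model_validate_json(resp.choices[0].message.content or "")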

Using Pydantic between steps gives you validation at every boundary. If the model returns malformed JSON, or valid JSON with missing or wrongly typed fields, a ValidationError is raised at that step rather than garbage propagating downstream.
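To make the boundary check concrete, here is a small sketch that needs no API call and assumes Pydantic v2 (the version that provides model_validate_json). The parse_facts helper and the 0 to 1 range on confidence are assumptions added for illustration; the model above leaves confidence unconstrained.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError

class ExtractedFacts(BaseModel):
    facts: list[str]
    # Assumption for this sketch: confidence is a probability in [0, 1].
    confidence: float = Field(ge=0.0, le=1.0)

def parse_facts(raw: str) -> Optional[ExtractedFacts]:
    """Return a validated model, or None if the payload breaks the contract."""
    try:
        return ExtractedFacts.model_validate_json(raw)
    except ValidationError as exc:
        # exc.errors() lists each failing field; log it or feed it to a retry step.
        print(f"boundary check failed: {exc.error_count()} error(s)")
        return None

good = parse_facts('{"facts": ["Water boils at 100 C at sea level"], "confidence": 0.9}')
wrong_type = parse_facts('{"facts": "not a list", "confidence": 0.9}')
out_of_range = parse_facts('{"facts": ["x"], "confidence": 1.7}')
truncated = parse_facts('{"facts": ["cut off mid-str')
```

Only the first payload yields a model. The other three fail at the boundary for different reasons: a wrong type, a violated constraint, and invalid JSON.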

3) Retry and error recovery

LLM calls fail in two ways: infrastructure errors (timeouts, rate limits) and semantic errors (wrong format, hallucinated content). Handle them differently:

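import openai
import tenacity

@tenacity.retry(
    stop=tenacity.stop_after_attempt(3),
    wait=tenacity.wait_exponential(min=1, max=10),
    # The OpenAI SDK raises its own exception types rather than the built-ins,
    # so list both. openai.APITimeoutError is a subclass of APIConnectionError.
    retry=tenacity.retry_if_exception_type((
        TimeoutError, ConnectionError,
        openai.APIConnectionError, openai.RateLimitError,
    )),
)
def robust_llm_call(prompt: str) -> str:
    # Infrastructure retries are handled by tenacity.
    # call_model is a placeholder for your own single-request wrapper.
    return call_model(prompt)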

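def step_with_self_correction(prompt: str, validator, max_fixes: int = 2) -> str:
    result = robust_llm_call(prompt)
    # max_fixes + 1 validations, so the final correction is checked too.
    for attempt in range(max_fixes + 1):
        errors = validator(result)  # expected to be falsy when the output is valid
        if not errors:
            return result
        if attempt == max_fixes:
            break  # corrections exhausted; do not pay for another call
        # Restate the task so the model has context for the fix.
        fix_prompt = (
            f"Task:\n{prompt}\n\n"
            f"Your previous output had these issues: {errors}. Fix them.\n\n"
            f"Previous output:\n{result}"
        )
        result = robust_llm_call(fix_prompt)
    raise ValueError(f"Step failed validation after {max_fixes} correction attempts")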

Self-correction loops are powerful, but cap the number of iterations: each retry costs tokens and latency. Log every retry for cost auditing.
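The validator argument is left abstract above. As a sketch of one possible shape, the code below builds a validator that returns a list of problems (empty when valid) and runs a capped correction loop against a scripted stand-in for the model, so it executes without an API key. The names json_keys_validator, correct_until_valid, and fake_call are hypothetical, and the returned fix count is one simple hook for the cost logging mentioned above.

```python
import json

def json_keys_validator(required: set[str]):
    """Build a validator that returns a list of problems (empty means valid)."""
    def validate(output: str) -> list[str]:
        try:
            data = json.loads(output)
        except json.JSONDecodeError as exc:
            return [f"not valid JSON: {exc.msg}"]
        if not isinstance(data, dict):
            return ["top-level value must be a JSON object"]
        return [f"missing key: {k}" for k in sorted(required - set(data))]
    return validate

def correct_until_valid(call, prompt: str, validator, max_fixes: int = 2):
    """Capped correction loop with the model call injected.

    Returns (valid_output, corrections_used).
    """
    result = call(prompt)
    for fixes_used in range(max_fixes + 1):
        errors = validator(result)
        if not errors:
            return result, fixes_used
        if fixes_used == max_fixes:
            break  # budget spent; stop before making another call
        result = call(
            f"Task:\n{prompt}\n\nPrevious output:\n{result}\n\nFix these issues: {errors}"
        )
    raise ValueError(f"validation failed after {max_fixes} correction attempts")

# Scripted stand-in for the model: the first reply is truncated, the second is valid.
replies = iter(['{"facts": ["a"]', '{"facts": ["a"], "confidence": 0.8}'])

def fake_call(_prompt: str) -> str:
    return next(replies)

validator = json_keys_validator({"facts", "confidence"})
output, fixes = correct_until_valid(fake_call, "Extract facts", validator)
print(f"valid after {fixes} correction(s)")
```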

4) Parallel fan-out with asyncio

When steps are independent, run them concurrently:

import asyncio
from openai import AsyncOpenAI

aclient = AsyncOpenAI()

async def summarize_chunk(chunk: str) -> str:
    resp = await aclient.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize:\n{chunk}"}],
    )
    return resp.choices[0].message.content

async def map_reduce_summarize(chunks: list[str]) -> str:
    summaries = await asyncio.gather(*[summarize_chunk(c) for c in chunks])
    combined = "\n\n".join(summaries)
    resp = await aclient.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Synthesize these summaries into one:\n{combined}"}],
    )
    return resp.choices[0].message.content

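Concurrency can cut wall-clock time by roughly 3-5x on map-reduce patterns, because the map phase takes about as long as its slowest call instead of the sum of all calls. The actual speedup depends on the number of chunks and on your rate limits. Use asyncio.Semaphore to cap the number of in-flight requests so you stay within those limits.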
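A minimal sketch of the semaphore pattern follows. The bounded_gather helper is a hypothetical name, and fake_summarize stands in for summarize_chunk so the example runs offline. The limit of 3 is arbitrary; pick a value that matches your provider's rate limits.

```python
import asyncio

async def bounded_gather(coros, limit: int):
    """Run coroutines concurrently with at most `limit` in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:  # waits here while `limit` calls are already running
            return await coro

    # gather returns results in the same order as its inputs
    return await asyncio.gather(*(run(c) for c in coros))

# Counters used only to observe concurrency in this demo.
in_flight = 0
peak = 0

async def fake_summarize(chunk: str) -> str:
    """Stand-in for summarize_chunk; sleeps instead of calling an API."""
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)  # simulated network latency
    in_flight -= 1
    return chunk.upper()

chunks = [f"chunk {i}" for i in range(10)]
summaries = asyncio.run(bounded_gather([fake_summarize(c) for c in chunks], limit=3))
print(f"peak concurrency: {peak}")
```

In map_reduce_summarize, the same helper could replace the bare asyncio.gather call without changing the reduce step.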

5) Cost and latency tracking

Production chains need observability. Track per-step metrics:

import time

class ChainMetrics:
    def __init__(self):
        self.steps: list[dict] = []

    def record(self, step_name: str, tokens_in: int, tokens_out: int, duration: float):
        self.steps.append({
            "step": step_name,
            "tokens_in": tokens_in,
            "tokens_out": tokens_out,
            "duration_s": round(duration, 3),
            "cost_usd": self._estimate_cost(tokens_in, tokens_out),
        })

    def _estimate_cost(self, t_in: int, t_out: int) -> float:
        # Rates are USD per 1M tokens (2.5 for input, 10.0 for output); adjust per model.
        return (t_in * 2.5 + t_out * 10.0) / 1_000_000

    @property
    def total_cost(self) -> float:
        return sum(s["cost_usd"] for s in self.steps)

Wrap every LLM call in a timing context manager and feed results to your metrics collector. This data drives decisions about which steps to cache, which to run on cheaper models, and where latency budgets are blown.
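The timing context manager could look like the sketch below. It takes any callable with the same signature as ChainMetrics.record, so in real code you would pass metrics.record. Here a plain function collects the calls to keep the example self-contained. The name timed_step and the token counts are made up for illustration; with the OpenAI SDK the counts would typically come from resp.usage.prompt_tokens and resp.usage.completion_tokens.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed_step(record, step_name: str):
    """Time the wrapped block, then call record(step_name, tokens_in, tokens_out, duration)."""
    usage = {"tokens_in": 0, "tokens_out": 0}  # the caller fills these in
    start = time.perf_counter()
    try:
        yield usage
    finally:
        # Runs even if the step raises, so failed calls are still measured.
        duration = time.perf_counter() - start
        record(step_name, usage["tokens_in"], usage["tokens_out"], duration)

# Stand-in collector with the same signature as ChainMetrics.record.
calls = []

def record(step_name: str, tokens_in: int, tokens_out: int, duration: float) -> None:
    calls.append((step_name, tokens_in, tokens_out, duration))

with timed_step(record, "summarize") as usage:
    time.sleep(0.01)  # stand-in for the LLM call
    usage["tokens_in"], usage["tokens_out"] = 1200, 150  # made-up counts

print(calls[0][:3])
```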

6) Framework vs. plain functions

Plain functions: best when your chain is stable, your team is small, and you want full control. A list of callables with a runner loop is easy to debug and test.

LangChain LCEL: useful when you need built-in streaming, tracing (LangSmith), and a large ecosystem of pre-built retrievers and tools. The overhead is justified for complex RAG chains.

Mirascope / Instructor: lighter-weight libraries focused on structured extraction. A good middle ground when you want Pydantic integration without a full framework.

The choice depends on team size and chain complexity. For chains under five steps, plain functions are almost always simpler.
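For the plain-function option, the "list of callables with a runner loop" can be as small as the sketch below. The run_chain name and the three deterministic steps are illustrative; in a real chain some steps would wrap an LLM call.

```python
from typing import Any, Callable

Step = Callable[[Any], Any]

def run_chain(steps: list[tuple[str, Step]], initial: Any):
    """Feed each step's output into the next and keep a trace for debugging."""
    value = initial
    trace = []
    for name, step in steps:
        try:
            value = step(value)
        except Exception as exc:
            # Name the failing step so errors point at a node in the graph.
            raise RuntimeError(f"step {name!r} failed") from exc
        trace.append((name, value))
    return value, trace

# Deterministic stand-ins for real steps.
steps = [
    ("normalize", str.strip),
    ("split", lambda text: text.split(". ")),
    ("count", len),
]
result, trace = run_chain(steps, "  First fact. Second fact. Third fact  ")
print(result)
```

Because the trace keeps every intermediate value, a failing run can be inspected step by step without a framework-specific debugger.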

7) Testing chains

Test each step in isolation with recorded LLM responses (use VCR.py or pytest-recording). For integration tests, run the full chain against a cheaper model (gpt-4o-mini) to verify topology without burning expensive tokens.
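VCR.py and pytest-recording replay responses at the HTTP layer. The same idea can be sketched with only the standard library by injecting the client and handing the step a canned reply. The classify_intent step and the recorded_response helper are hypothetical. The fake object mimics only the attribute path the step reads (choices[0].message.content), not the full SDK response.

```python
import json
from types import SimpleNamespace
from unittest.mock import MagicMock

def classify_intent(client, message: str) -> str:
    """Hypothetical step under test; the client is injected so tests can swap it."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": 'Reply with JSON: {"intent": "..."}'},
            {"role": "user", "content": message},
        ],
    )
    return json.loads(resp.choices[0].message.content)["intent"]

def recorded_response(content: str):
    """Build an object exposing the attribute path the step reads."""
    message = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])

def test_classify_intent_parses_recorded_reply():
    client = MagicMock()
    client.chat.completions.create.return_value = recorded_response('{"intent": "refund"}')
    assert classify_intent(client, "I want my money back") == "refund"
    # The step should make exactly one model call.
    client.chat.completions.create.assert_called_once()

test_classify_intent_parses_recorded_reply()
```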

Snapshot-test the intermediate data structures between steps. If a model upgrade changes the format of step 2’s output, your step 3 tests catch it before production does.
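Since exact LLM text varies between runs, one option is to snapshot the shape of the intermediate data (key names and types) rather than its content. The sketch below is a hand-rolled illustration with hypothetical helper names, not the API of any snapshot-testing plugin. It writes the snapshot on first run and compares on later runs.

```python
import json
import tempfile
from pathlib import Path

def shape_of(value):
    """Reduce a value to its structure: key names and type names, no content."""
    if isinstance(value, dict):
        return {k: shape_of(v) for k, v in sorted(value.items())}
    if isinstance(value, list):
        # Assumes homogeneous lists; the first element represents the rest.
        return [shape_of(value[0])] if value else []
    return type(value).__name__

def check_snapshot(data, path: Path) -> bool:
    """Return True if the shape matches the stored snapshot (created on first run)."""
    current = json.dumps(shape_of(data), indent=2, sort_keys=True)
    if not path.exists():
        path.write_text(current)
        return True
    return path.read_text() == current

with tempfile.TemporaryDirectory() as tmp:
    snap = Path(tmp) / "step2_output.json"
    first = check_snapshot({"facts": ["a", "b"], "confidence": 0.9}, snap)   # writes snapshot
    same = check_snapshot({"facts": ["c"], "confidence": 0.4}, snap)         # same shape
    drifted = check_snapshot({"facts": "a; b", "confidence": "high"}, snap)  # format changed

print(first, same, drifted)
```

The third call fails the check because facts became a string and confidence became a label, which is the kind of drift a model upgrade can introduce.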

8) Real-world tradeoffs

Concern	Single prompt	Chain
Latency	One round trip	Multiple, but each shorter
Cost	One call, potentially large	Multiple smaller calls; may cost more total
Debuggability	Opaque	Step-level visibility
Reliability	All-or-nothing	Retry per step
Flexibility	Must re-prompt for changes	Swap individual steps

Teams at companies like Anthropic, Notion, and Replit have publicly discussed moving from monolithic prompts to chains as their products matured. The pattern scales better because each step is independently testable and replaceable.

The one thing to remember: Production prompt chains are dataflow graphs with typed boundaries between steps — invest in structured output parsing, per-step retries, parallel execution where possible, and cost tracking to keep chains reliable and affordable.
