Prompt Chaining in Python — Deep Dive
Prompt chaining is fundamentally a dataflow problem: you define a directed graph of LLM calls and deterministic transforms, then execute it while managing latency, cost, and error propagation. Python’s flexibility makes it the default language for this, whether you use a framework or roll your own.
1) Designing chain topology
Before writing code, sketch the dependency graph. Each node is either an LLM call or a deterministic transform (parsing, validation, DB lookup). Edges carry typed data.
Key questions:
- Which steps are independent and can run concurrently?
- Where do you need human-in-the-loop checkpoints?
- What is the maximum acceptable latency for the full chain?
A common real-world topology for a customer-support bot: classify intent → retrieve relevant docs → generate draft answer → check policy compliance → format response. Steps 2 and 4 can involve tool calls rather than LLM calls.
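The dependency graph described above can be written down as data before any prompts exist. The following is a minimal sketch, assuming Python 3.9+ for the standard-library `graphlib` module; the step names and the `execution_waves` helper are hypothetical, chosen to mirror the customer-support example. It groups steps into "waves" so you can see which ones could run concurrently.

```python
from graphlib import TopologicalSorter

# Hypothetical step names for the customer-support chain.
# Each key maps to the set of steps it depends on.
DEPENDENCIES = {
    "classify_intent": set(),
    "retrieve_docs": {"classify_intent"},
    "generate_draft": {"retrieve_docs"},
    "check_policy": {"generate_draft"},
    "format_response": {"check_policy"},
}


def execution_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group steps into waves; steps in one wave do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

For this strictly linear chain every wave holds a single step, which tells you there is nothing to parallelize. A graph where two steps share the same single dependency would place both in one wave, marking them as candidates for concurrent execution.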
2) Structured output between steps
The glue between steps must be reliable. Unstructured text passed between prompts leads to parsing failures downstream.
```python
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class ExtractedFacts(BaseModel):
    facts: list[str]
    confidence: float

def extract_step(article: str) -> ExtractedFacts:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            # JSON mode does not enforce a schema, so name the expected keys.
            {
                "role": "system",
                "content": (
                    "Extract key facts as JSON with keys 'facts' "
                    "(list of strings) and 'confidence' (number from 0 to 1)."
                ),
            },
            {"role": "user", "content": article},
        ],
        response_format={"type": "json_object"},
    )
    # Raises pydantic.ValidationError if the output does not match the model.
    return ExtractedFacts.model_validate_json(resp.choices[0].message.content)
```
Using Pydantic between steps gives you validation at every boundary. If the model returns malformed JSON, you catch it immediately rather than propagating garbage downstream.
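To make the "validation at every boundary" idea concrete without a network call, here is a minimal sketch that uses only the standard library as a stand-in for the Pydantic model. The `Facts` dataclass, `BoundaryError`, and `parse_facts` are hypothetical names for illustration; in a real chain the Pydantic model would do this work for you.

```python
import json
from dataclasses import dataclass


@dataclass
class Facts:
    facts: list[str]
    confidence: float


class BoundaryError(ValueError):
    """Raised when a step's raw output does not match the expected shape."""


def parse_facts(raw: str) -> Facts:
    """Parse and validate one step's raw output before the next step sees it."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise BoundaryError(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise BoundaryError("expected a JSON object")
    facts = data.get("facts")
    confidence = data.get("confidence")
    if not isinstance(facts, list) or not all(isinstance(f, str) for f in facts):
        raise BoundaryError("'facts' must be a list of strings")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise BoundaryError("'confidence' must be a number")
    return Facts(facts=facts, confidence=float(confidence))
```

Because the failure surfaces as a single exception type at the boundary, the calling code can decide in one place whether to retry the step, ask the model to correct itself, or abort the chain.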
3) Retry and error recovery
LLM calls fail in two ways: infrastructure errors (timeouts, rate limits) and semantic errors (wrong format, hallucinated content). Handle them differently:
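```python
import tenacity

@tenacity.retry(
    stop=tenacity.stop_after_attempt(3),
    wait=tenacity.wait_exponential(min=1, max=10),
    # These are built-in exception types. SDKs often raise their own
    # (for example openai.APITimeoutError, openai.RateLimitError); add those
    # here if call_model uses such an SDK.
    retry=tenacity.retry_if_exception_type((TimeoutError, ConnectionError)),
)
def robust_llm_call(prompt: str) -> str:
    # Infrastructure retries are handled by tenacity.
    # call_model is a placeholder for your own model-calling function.
    return call_model(prompt)

def step_with_self_correction(prompt: str, validator, max_fixes: int = 2) -> str:
    result = robust_llm_call(prompt)
    # Validate the initial output plus the output of every correction attempt.
    for attempt in range(max_fixes + 1):
        errors = validator(result)
        if not errors:
            return result
        if attempt == max_fixes:
            break  # no correction attempts left
        fix_prompt = (
            f"Your previous output had these issues: {errors}. Fix them.\n\n"
            f"Original output:\n{result}"
        )
        result = robust_llm_call(fix_prompt)
    raise ValueError(f"Step failed validation after {max_fixes} correction attempts")
```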
Self-correction loops are powerful, but always cap the number of iterations: each retry costs tokens and latency. Log every retry for cost auditing.
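The self-correction pattern expects a validator that returns a list of problems, with an empty list meaning "valid". Below is a minimal sketch of such a validator together with a capped correction loop that logs each attempt. All names here (`json_keys_validator`, `correct_with_logging`, the logger name) are hypothetical, and the model call is injected as a plain callable so the example runs without any API.

```python
import json
import logging
from typing import Callable

logger = logging.getLogger("chain.retries")  # hypothetical logger name


def json_keys_validator(required: set[str]) -> Callable[[str], list[str]]:
    """Build a validator that returns a list of problems (empty means valid)."""
    def validate(output: str) -> list[str]:
        try:
            data = json.loads(output)
        except json.JSONDecodeError as exc:
            return [f"invalid JSON: {exc.msg}"]
        if not isinstance(data, dict):
            return ["top-level value must be an object"]
        missing = sorted(required - set(data))
        return [f"missing key: {key}" for key in missing]
    return validate


def correct_with_logging(
    call: Callable[[str], str],
    prompt: str,
    validator: Callable[[str], list[str]],
    max_fixes: int = 2,
) -> str:
    """Run a step, asking for at most max_fixes corrections, logging each one."""
    result = call(prompt)
    for attempt in range(max_fixes + 1):
        errors = validator(result)
        if not errors:
            return result
        if attempt == max_fixes:
            break  # correction budget exhausted
        logger.warning("correction %d/%d: %s", attempt + 1, max_fixes, errors)
        result = call(f"Fix these issues: {errors}\n\nPrevious output:\n{result}")
    raise ValueError(f"validation failed after {max_fixes} correction attempts")
```

Injecting `call` keeps the loop testable: a scripted fake can return a bad output first and a good one later, which exercises the correction path deterministically.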
4) Parallel fan-out with asyncio
When steps are independent, run them concurrently:
```python
import asyncio
from openai import AsyncOpenAI

aclient = AsyncOpenAI()

async def summarize_chunk(chunk: str) -> str:
    resp = await aclient.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize:\n{chunk}"}],
    )
    return resp.choices[0].message.content

async def map_reduce_summarize(chunks: list[str]) -> str:
    summaries = await asyncio.gather(*[summarize_chunk(c) for c in chunks])
    combined = "\n\n".join(summaries)
    resp = await aclient.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Synthesize these summaries into one:\n{combined}"}],
    )
    return resp.choices[0].message.content
```
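On map-reduce patterns, concurrency can cut wall-clock time by roughly 3-5x when the per-chunk calls dominate the run time and rate limits allow it. Use asyncio.Semaphore to cap the number of in-flight requests and respect API rate limits.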
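One way to apply the semaphore advice is to wrap each task so that only a fixed number run at once. This is a minimal sketch: `gather_bounded` and `fake_summarize` are hypothetical names, and the fake worker sleeps instead of calling an API so the example is runnable as-is.

```python
import asyncio
from typing import Awaitable, Callable


async def gather_bounded(
    items: list[str],
    worker: Callable[[str], Awaitable[str]],
    max_concurrency: int = 5,
) -> list[str]:
    """Run worker over items concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item: str) -> str:
        async with semaphore:  # waits here while the limit is reached
            return await worker(item)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(run_one(item) for item in items))


async def fake_summarize(chunk: str) -> str:
    """Stand-in for an LLM call; sleeps briefly, then truncates the chunk."""
    await asyncio.sleep(0.01)
    return chunk[:10]
```

In a real chain, `worker` would be something like the `summarize_chunk` coroutine, and `max_concurrency` would be tuned to your provider's rate limits.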
5) Cost and latency tracking
Production chains need observability. Track per-step metrics:
```python
import time

class ChainMetrics:
    def __init__(self):
        self.steps: list[dict] = []

    def record(self, step_name: str, tokens_in: int, tokens_out: int, duration: float):
        self.steps.append({
            "step": step_name,
            "tokens_in": tokens_in,
            "tokens_out": tokens_out,
            "duration_s": round(duration, 3),
            "cost_usd": self._estimate_cost(tokens_in, tokens_out),
        })

    def _estimate_cost(self, t_in: int, t_out: int) -> float:
        # Example rates in USD per million tokens; adjust per model.
        return (t_in * 2.5 + t_out * 10.0) / 1_000_000

    @property
    def total_cost(self) -> float:
        return sum(s["cost_usd"] for s in self.steps)
```
Wrap every LLM call in a timing context manager and feed results to your metrics collector. This data drives decisions about which steps to cache, which to run on cheaper models, and where latency budgets are blown.
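A timing context manager along these lines could look like the sketch below. `timed_step` is a hypothetical helper; it accepts any callable with the same parameters as the `record` method shown earlier (step name, input tokens, output tokens, duration), so a metrics collector's bound method can be passed in directly.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator

# Same parameters as a record(step_name, tokens_in, tokens_out, duration) method.
RecordFn = Callable[[str, int, int, float], None]


@contextmanager
def timed_step(record: RecordFn, step_name: str) -> Iterator[dict]:
    """Time the enclosed block and report it, even if the block raises."""
    usage = {"tokens_in": 0, "tokens_out": 0}  # the caller fills these in
    start = time.perf_counter()
    try:
        yield usage
    finally:
        duration = time.perf_counter() - start
        record(step_name, usage["tokens_in"], usage["tokens_out"], duration)
```

Usage would be along the lines of `with timed_step(metrics.record, "classify") as usage:`, setting `usage["tokens_in"]` and `usage["tokens_out"]` from the response inside the block. With the OpenAI Chat Completions API, those counts are available as `resp.usage.prompt_tokens` and `resp.usage.completion_tokens`.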
6) Framework vs. plain functions
Plain functions — best when your chain is stable, team is small, and you want full control. A list of callables with a runner loop is easy to debug and test.
LangChain LCEL — useful when you need built-in streaming, tracing (LangSmith), and a large ecosystem of pre-built retrievers and tools. Overhead is justified for complex RAG chains.
Mirascope / Instructor — lighter-weight libraries focused on structured extraction. Good middle ground when you want Pydantic integration without a full framework.
The choice depends on team size and chain complexity. For chains under five steps, plain functions are almost always simpler.
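For the plain-functions option, the "list of callables with a runner loop" can be as small as the following sketch. `run_chain` and the `Step` alias are hypothetical names; the steps in a real chain would be functions that call a model or transform data, each taking the previous step's output.

```python
from typing import Any, Callable

Step = Callable[[Any], Any]


def run_chain(
    steps: list[tuple[str, Step]], initial: Any
) -> tuple[Any, list[tuple[str, Any]]]:
    """Run named steps in order, feeding each output into the next step.

    Returns the final value and a trace of (step name, output) pairs.
    """
    trace: list[tuple[str, Any]] = []
    value = initial
    for name, step in steps:
        try:
            value = step(value)
        except Exception as exc:
            # Attach the step name so failures are easy to locate.
            raise RuntimeError(f"step {name!r} failed") from exc
        trace.append((name, value))
    return value, trace
```

The trace is what makes this style easy to debug: every intermediate value is available for logging or for assertions in tests, with no framework internals in between.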
7) Testing chains
Test each step in isolation with recorded LLM responses (use VCR.py or pytest-recording). For integration tests, run the full chain against a cheaper model (gpt-4o-mini) to verify topology without burning expensive tokens.
Snapshot-test the intermediate data structures between steps. If a model upgrade changes the format of step 2’s output, your step 3 tests catch it before production does.
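As a rough illustration of testing one step in isolation, the sketch below injects the model call as a parameter and replays a hand-recorded response. This is a simplified stand-in for what VCR.py or pytest-recording automate; `classify_intent`, the recorded string, and the test names are all hypothetical.

```python
import json
from typing import Callable


def classify_intent(message: str, llm: Callable[[str], str]) -> dict:
    """Hypothetical chain step: ask the model for an intent, return typed data."""
    raw = llm(f"Classify the intent of this message as JSON:\n{message}")
    data = json.loads(raw)
    return {"intent": data["intent"], "confidence": float(data["confidence"])}


# A response captured once from a real run and stored alongside the test.
RECORDED_RESPONSE = '{"intent": "refund", "confidence": 0.92}'


def test_classify_intent_snapshot():
    result = classify_intent("I want my money back", llm=lambda _prompt: RECORDED_RESPONSE)
    # Snapshot of the intermediate structure that the next step depends on.
    assert result == {"intent": "refund", "confidence": 0.92}


def test_classify_intent_rejects_missing_key():
    try:
        classify_intent("hello", llm=lambda _prompt: '{"intent": "greeting"}')
    except KeyError:
        pass
    else:
        raise AssertionError("expected KeyError for a missing 'confidence' key")
```

Both functions follow pytest's naming convention, so pytest would collect them as-is. The second test documents how the step behaves when the model's output format drifts, which is the failure a model upgrade is most likely to cause.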
8) Real-world tradeoffs
| Concern | Single prompt | Chain |
|---|---|---|
| Latency | One round trip | Multiple, but each shorter |
| Cost | One call, potentially large | Multiple smaller calls; may cost more total |
| Debuggability | Opaque | Step-level visibility |
| Reliability | All-or-nothing | Retry per step |
| Flexibility | Must re-prompt for changes | Swap individual steps |
Teams at companies like Anthropic, Notion, and Replit have publicly discussed moving from monolithic prompts to chains as their products matured. The pattern scales better because each step is independently testable and replaceable.
The one thing to remember: Production prompt chains are dataflow graphs with typed boundaries between steps — invest in structured output parsing, per-step retries, parallel execution where possible, and cost tracking to keep chains reliable and affordable.