Agent Frameworks in Python — Deep Dive
Agent frameworks in Python abstract the loop of reasoning, tool use, and observation. But production agents need more than a loop — they need reliability, observability, cost control, and graceful failure handling. This guide covers how to build agents that work in the real world.
1) Building a custom ReAct agent
Before using a framework, understand the core pattern:
import json

from openai import OpenAI

client = OpenAI()

def react_agent(query: str, tools: dict, system_prompt: str, max_steps: int = 8) -> str:
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query},
    ]
    # Each entry in `tools` is expected to look like {"fn": callable, "schema": {...}}
    tool_defs = [
        {"type": "function", "function": {"name": name, **spec["schema"]}}
        for name, spec in tools.items()
    ]
    # Only send the `tools` parameter when at least one tool is defined
    extra = {"tools": tool_defs} if tool_defs else {}
    for step in range(max_steps):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            **extra,
        )
        msg = response.choices[0].message
        # Drop None-valued fields so the message can be sent back as-is
        messages.append(msg.model_dump(exclude_none=True))
        if not msg.tool_calls:
            return msg.content  # Final answer
        for tc in msg.tool_calls:
            tool = tools.get(tc.function.name)
            if not tool:
                result = f"Error: unknown tool {tc.function.name}"
            else:
                try:
                    args = json.loads(tc.function.arguments)
                    result = str(tool["fn"](**args))
                except Exception as e:
                    # Report the failure to the model instead of crashing the loop
                    result = f"Error: {e}"
            # Every tool call needs a matching "tool" message with its id
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": result,
            })
    return "Agent reached maximum steps without completing the task."
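This roughly 40-line agent handles the core loop: call the model, run any requested tools, feed the results back, and stop when the model answers without tool calls. Everything frameworks add (memory, state, multi-agent coordination) builds on this foundation.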
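The `tools` argument of `react_agent` is never spelled out. One shape that would fit the loop, assuming each entry carries a callable under `fn` and the description plus JSON Schema parameters under `schema`, is sketched below. `get_weather` and `dispatch` are hypothetical names used only for illustration; `dispatch` mirrors the loop's error handling so the tool side can be tried without an API call.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool: a real one would call a weather service."""
    return f"Sunny in {city}"


# Assumed shape: "fn" is the callable, "schema" holds the fields that sit
# next to "name" inside a function-tool definition.
tools = {
    "get_weather": {
        "fn": get_weather,
        "schema": {
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
}


def dispatch(tools: dict, name: str, raw_arguments: str) -> str:
    """Run one tool call the way the agent loop does, returning errors as text."""
    tool = tools.get(name)
    if not tool:
        return f"Error: unknown tool {name}"
    try:
        args = json.loads(raw_arguments)
        return str(tool["fn"](**args))
    except Exception as e:  # bad JSON, wrong arguments, or a failing tool
        return f"Error: {e}"


# Same comprehension as in react_agent: one definition per registered tool.
tool_defs = [
    {"type": "function", "function": {"name": name, **spec["schema"]}}
    for name, spec in tools.items()
]

print(dispatch(tools, "get_weather", '{"city": "Oslo"}'))
```

Returning errors as strings rather than raising keeps the conversation going: the model sees the failure as an observation and can retry with different arguments.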
2) LangGraph: agents as state machines
LangGraph models agents as directed graphs where nodes are functions and edges are conditional transitions:
from operator import add
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph, END

# search_tool, generate_draft, and evaluate_quality are placeholders that are
# assumed to be defined elsewhere in the application.

class AgentState(TypedDict):
    messages: Annotated[list, add]  # reducer: new messages are appended
    current_step: str
    research_results: list[str]
    draft: str

def research_node(state: AgentState) -> dict:
    """Gather information using search tools."""
    query = state["messages"][-1]["content"]
    results = search_tool(query)
    return {"research_results": results, "current_step": "draft"}

def draft_node(state: AgentState) -> dict:
    """Write a draft based on research."""
    context = "\n".join(state["research_results"])
    draft = generate_draft(context, state["messages"][-1]["content"])
    return {"draft": draft, "current_step": "review"}

def review_node(state: AgentState) -> dict:
    """Review the draft and decide if it needs more research."""
    quality = evaluate_quality(state["draft"])
    if quality < 0.7:
        return {"current_step": "research"}  # Loop back
    return {"current_step": "done"}

def should_continue(state: AgentState) -> str:
    """Route to the next node name, or END once the review passes."""
    if state["current_step"] == "done":
        return END
    return state["current_step"]

graph = StateGraph(AgentState)
graph.add_node("research", research_node)
graph.add_node("draft", draft_node)
graph.add_node("review", review_node)
graph.add_edge("research", "draft")
graph.add_edge("draft", "review")
graph.add_conditional_edges("review", should_continue)
graph.set_entry_point("research")
agent = graph.compile()
LangGraph’s explicit state management makes complex workflows debuggable. You can checkpoint state, replay from any node, and add human-in-the-loop approvals at specific edges.
3) Multi-agent patterns with CrewAI
When a task needs different skills, multiple agents can collaborate:
from crewai import Agent, Task, Crew

researcher = Agent(
    role="Research Analyst",
    goal="Find accurate, recent data on the topic",
    backstory="Senior analyst at a research firm with 10 years of experience",
    tools=[search_tool, web_scraper],
    llm="gpt-4o",
)

writer = Agent(
    role="Technical Writer",
    goal="Write clear, accurate content based on research",
    backstory="Technical writer who specializes in making complex topics accessible",
    tools=[],
    llm="gpt-4o",
)

research_task = Task(
    description="Research {topic} and compile key findings with sources",
    agent=researcher,
    expected_output="Bullet-pointed research findings with URLs",
)

writing_task = Task(
    description="Write a 500-word article based on the research findings",
    agent=writer,
    expected_output="Published-quality article with introduction, body, and conclusion",
    context=[research_task],
)

crew = Crew(agents=[researcher, writer], tasks=[research_task, writing_task])
result = crew.kickoff(inputs={"topic": "quantum computing advances in 2026"})
Multi-agent systems shine when tasks naturally decompose into roles. They struggle when roles overlap or when agents need tight coordination.
4) Tool governance and safety
Production agents need guardrails on tool access:
import time
from dataclasses import dataclass
from enum import Enum

class ToolPermission(Enum):
    READ = "read"
    WRITE = "write"
    EXECUTE = "execute"

@dataclass
class ToolPolicy:
    name: str
    permissions: set[ToolPermission]
    rate_limit: int  # max calls per minute
    requires_approval: bool = False
    allowed_arguments: dict | None = None  # allowlist

class GovernedToolRegistry:
    def __init__(self):
        self.tools: dict[str, dict] = {}
        self.policies: dict[str, ToolPolicy] = {}
        self.call_log: list[dict] = []

    def register(self, name: str, fn, schema: dict, policy: ToolPolicy):
        self.tools[name] = {"fn": fn, "schema": schema}
        self.policies[name] = policy

    def execute(self, name: str, arguments: dict, context: dict) -> str:
        policy = self.policies.get(name)
        if not policy:
            return "Error: tool not registered"
        # Rate limit check: count this tool's calls in the last 60 seconds
        now = time.time()
        recent_calls = sum(
            1 for log in self.call_log
            if log["tool"] == name and now - log["time"] < 60
        )
        if recent_calls >= policy.rate_limit:
            return f"Rate limit exceeded for {name}"
        # Approval check. Assumption: the caller sets context["approved"] = True
        # after a human signs off; otherwise an approval-gated tool could never run.
        if policy.requires_approval and not context.get("approved"):
            return f"APPROVAL_REQUIRED: {name}({arguments})"
        # Log before executing so failed calls also appear in the audit trail
        self.call_log.append({"tool": name, "args": arguments, "time": now})
        # Execute
        result = self.tools[name]["fn"](**arguments)
        return str(result)
Key governance rules:
- Read operations are generally safe. Write operations need higher scrutiny.
- Financial transactions and data deletions should require human approval.
- Rate-limit all external API calls to prevent runaway costs.
- Log every tool call for audit trails.
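`ToolPolicy.allowed_arguments` is declared in the registry but never checked. A minimal sketch of one way to enforce it follows, assuming the allowlist maps each argument name to the collection of values it may take. That mapping and the `check_arguments` helper are assumptions for illustration, not part of the registry above.

```python
from typing import Optional


def check_arguments(arguments: dict, allowed: Optional[dict]) -> Optional[str]:
    """Return an error string if an argument falls outside the allowlist, else None."""
    if allowed is None:
        return None  # no allowlist configured: accept everything
    for key, value in arguments.items():
        if key not in allowed:
            return f"Error: argument '{key}' is not permitted"
        if value not in allowed[key]:
            return f"Error: value {value!r} is not permitted for '{key}'"
    return None


# Hypothetical allowlist for a read-only database query tool.
allowlist = {"table": ["orders", "customers"], "limit": [10, 50, 100]}

print(check_arguments({"table": "orders", "limit": 10}, allowlist))
print(check_arguments({"table": "users"}, allowlist))
```

A check like this would sit in `execute` before the tool function runs, so a rejected call never reaches the external system.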
5) Memory architecture
Production agents need layered memory:
import time

class AgentMemory:
    def __init__(self, vector_store, kv_store):
        self.conversation_history: list[dict] = []  # short-term
        self.vector_store = vector_store  # semantic long-term
        self.kv_store = kv_store  # exact long-term

    def add_message(self, role: str, content: str):
        self.conversation_history.append({"role": role, "content": content})
        # Also store in long-term for future sessions
        self.vector_store.add(content, metadata={"role": role, "time": time.time()})

    def recall(self, query: str, k: int = 5) -> list[str]:
        """Semantic recall from long-term memory."""
        return self.vector_store.search(query, top_k=k)

    def get_fact(self, key: str) -> str | None:
        """Exact recall of stored facts."""
        return self.kv_store.get(key)

    def store_fact(self, key: str, value: str):
        """Store a specific fact for exact recall."""
        self.kv_store.set(key, value)

    def get_context_window(self, max_tokens: int = 4000) -> list[dict]:
        """Return recent history that fits in the context window."""
        result = []
        token_count = 0
        # Walk backwards from the newest message until the budget is used up
        for msg in reversed(self.conversation_history):
            msg_tokens = len(msg["content"]) // 4  # rough estimate
            if token_count + msg_tokens > max_tokens:
                break
            result.insert(0, msg)  # keep chronological order
            token_count += msg_tokens
        return result
6) Reliability patterns
Agents fail in production. Build resilience:
Timeout budgets — allocate a total time budget and per-step limits. If research takes too long, skip to drafting with available data.
Fallback chains — if the primary model fails, fall back to a simpler model or a pre-computed response.
Checkpointing — save agent state after each step. On failure, resume from the last checkpoint instead of starting over.
Dead letter queues — when an agent fails after all retries, save the task to a queue for human review rather than losing it.
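Three of these patterns (a total time budget, a fallback chain, and a dead letter queue) can be combined in a few lines. The sketch below is one possible arrangement, not a reference implementation: `run_with_fallbacks`, `flaky_primary`, `cheap_fallback`, and the in-memory `dead_letters` deque are all hypothetical names, and the budget is only checked between attempts. Cutting off a call that is already running would need the model client's own timeout setting.

```python
import time
from collections import deque
from typing import Callable, Optional

# Stand-in for a durable dead letter queue (a database table or message
# queue in a real deployment).
dead_letters: deque = deque()


def run_with_fallbacks(
    task: str,
    handlers: list[Callable[[str], str]],
    total_budget_s: float = 30.0,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Try each handler in order until one succeeds or the time budget runs out."""
    deadline = clock() + total_budget_s
    errors = []
    for handler in handlers:
        if clock() >= deadline:
            errors.append("time budget exhausted")
            break
        try:
            return handler(task)
        except Exception as e:
            # Remember why this handler failed, then move on to the next one.
            errors.append(f"{handler.__name__}: {type(e).__name__}: {e}")
    # Nothing worked: park the task for human review instead of losing it.
    dead_letters.append({"task": task, "errors": errors})
    return None


def flaky_primary(task: str) -> str:
    """Hypothetical primary model call that always times out."""
    raise TimeoutError("primary model timed out")


def cheap_fallback(task: str) -> str:
    """Hypothetical simpler model or pre-computed response."""
    return f"fallback answer for {task}"


print(run_with_fallbacks("summarize report", [flaky_primary, cheap_fallback]))
```

Passing the clock in as a parameter keeps the budget logic testable without real waiting.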
7) Cost management
Agents are expensive. A complex task might make 10-20 LLM calls, and each call resends the accumulated messages and tool results, so input tokens grow with every step. Control costs by:
- Setting maximum step limits per task.
- Using cheaper models for planning and routing, expensive models only for final generation.
- Caching tool results within a session.
- Monitoring per-task cost and alerting on outliers.
- Implementing budget caps per user or per task type.
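Two of the controls above, per-task budget caps and session-level caching of tool results, can be sketched as follows. The prices are hypothetical placeholders, not any provider's real rates, and `CostTracker`, `BudgetExceeded`, and `cached_call` are illustrative names rather than part of a library.

```python
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a task has used up its budget."""


def _default_prices() -> dict:
    # HYPOTHETICAL (input, output) prices in USD per 1M tokens.
    return {"small": (0.15, 0.60), "large": (2.50, 10.00)}


@dataclass
class CostTracker:
    budget_usd: float
    prices: dict = field(default_factory=_default_prices)
    spent_usd: float = 0.0

    def record(self, model: str, input_tokens: int, output_tokens: int) -> float:
        """Add the cost of one completed LLM call and return it."""
        in_price, out_price = self.prices[model]
        cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
        self.spent_usd += cost
        return cost

    def check(self) -> None:
        """Call before each new step; stops the task once the cap is reached."""
        if self.spent_usd >= self.budget_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.4f} of ${self.budget_usd:.4f}"
            )


def cached_call(cache: dict, fn, **kwargs):
    """Reuse a tool result within a session; kwargs values must be hashable."""
    key = (fn.__name__, tuple(sorted(kwargs.items())))
    if key not in cache:
        cache[key] = fn(**kwargs)
    return cache[key]


tracker = CostTracker(budget_usd=0.01)
tracker.check()                      # under budget, so the step may run
tracker.record("large", 1000, 500)   # token counts would come from the API response
print(f"spent so far: ${tracker.spent_usd:.4f}")
```

Recording cost after each call and checking before the next one means a task can overshoot its cap by at most one call, which is usually an acceptable trade for not having to predict output length in advance.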
8) When not to use agent frameworks
Agent frameworks add complexity. Avoid them when:
- A single LLM call solves the problem reliably.
- The task has a fixed, known sequence of steps (use prompt chaining instead).
- Latency requirements are under 2 seconds (agents typically take 10-60 seconds).
- The cost of errors is very high and you need deterministic behavior.
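For the fixed-sequence case, prompt chaining needs no framework at all. A minimal sketch, assuming the model is wrapped in a plain `llm(prompt) -> str` callable (here replaced by a hypothetical `fake_llm` stub so the example runs offline):

```python
from typing import Callable


def run_chain(llm: Callable[[str], str], templates: list[str], initial_input: str) -> str:
    """Feed each step's output into the next template's {input} slot."""
    text = initial_input
    for template in templates:
        text = llm(template.format(input=text))
    return text


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; it just wraps the prompt."""
    return f"<{prompt}>"


steps = ["Summarize: {input}", "Translate to French: {input}"]
print(run_chain(fake_llm, steps, "quarterly report"))
```

The sequence of steps is fixed in code, so the number of LLM calls, and therefore the latency and cost, is known before the task starts.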
Start with the simplest approach that works. Add agent capabilities incrementally as you identify specific tasks that benefit from dynamic tool selection and planning.
The one thing to remember: Agent frameworks turn LLMs into autonomous problem solvers with tools and memory — but production agents need governance, reliability patterns, cost controls, and the discipline to use simpler approaches when agents are overkill.