AI Agents Architecture — Deep Dive

MCTS for agent planning, context compression strategies, agent memory architectures, multi-agent protocols, tool reliability engineering, and failure mode taxonomy.

LATS: Language Agent Tree Search

Zhou et al. (2023), “Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models” (LATS), combined Monte Carlo Tree Search (MCTS) with LLM-based agents for principled exploration of multi-step reasoning and action spaces.

MCTS adapted for agents:

Selection: Navigate the tree from root using UCT (Upper Confidence Bound applied to Trees): $$UCT(s, a) = Q(s, a) + c\sqrt{\frac{\ln N(s)}{N(s, a)}}$$

Where $Q(s, a)$ is the estimated value of taking action $a$ in state $s$, $N(s)$ is the visit count for state $s$, $N(s, a)$ is the number of times action $a$ has been taken from $s$, and $c$ balances exploration against exploitation.
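The selection rule above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names, the `children` layout, and the default `c=1.4` are assumptions, and unvisited actions are given an infinite score so that each one is tried at least once.

```python
import math

def uct_score(q, n_parent, n_child, c=1.4):
    """UCT(s, a) = Q(s, a) + c * sqrt(ln N(s) / N(s, a))."""
    if n_child == 0:
        return math.inf  # assumption: unvisited actions are always tried first
    return q + c * math.sqrt(math.log(n_parent) / n_child)

def select_action(children, n_parent, c=1.4):
    """children maps action -> (Q, visit count); returns the highest-UCT action."""
    return max(
        children,
        key=lambda a: uct_score(children[a][0], n_parent, children[a][1], c),
    )

# Hypothetical node with three candidate actions
children = {"search": (0.6, 10), "open_file": (0.5, 2), "run_tests": (0.0, 0)}
print(select_action(children, n_parent=12))
```

With `c=0` the rule degenerates to pure exploitation (pick the highest $Q$); larger values of `c` push the search toward rarely tried actions.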

Expansion: At a leaf node, generate $k$ candidate actions using the LLM. Execute each to produce new states.

Evaluation: Use the LLM to score each reached state with a value function:

Prompt: "Given the goal and current state, rate progress 0-10. Current state: [state]"
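The LLM's reply to a prompt like this is free text, so it has to be turned into a number before it can serve as a value estimate. A minimal sketch, assuming the reply contains the rating as its first number; the function name and the neutral fallback of 0.5 for unparseable replies are illustrative choices, not part of LATS.

```python
import re

def parse_value(reply, default=0.5):
    """Extract the first number from an LLM reply and rescale 0-10 to [0, 1]."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return default  # assumption: treat an unparseable reply as neutral
    score = float(match.group())
    score = max(0.0, min(10.0, score))  # clamp out-of-range ratings
    return score / 10.0

print(parse_value("Rating: 7/10. The agent has located the failing test."))
```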

Backpropagation: Update $Q$ values along the path from the evaluated node to root.
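A minimal sketch of the backpropagation step, assuming each node stores a visit count and a running mean of the values backed up through it. The `Node` class and the incremental-mean update are a common MCTS formulation used here for illustration; they are not taken from the LATS codebase.

```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.visits = 0   # N: number of evaluations backed up through this node
        self.q = 0.0      # Q: running mean of those evaluations

def backpropagate(node, value):
    """Walk from the evaluated node up to the root, updating N and Q."""
    while node is not None:
        node.visits += 1
        node.q += (value - node.q) / node.visits  # incremental mean
        node = node.parent

# Two leaves under the same child of the root, evaluated at 0.8 and 0.4
root = Node()
child = Node(parent=root)
leaf_a, leaf_b = Node(parent=child), Node(parent=child)
backpropagate(leaf_a, 0.8)
backpropagate(leaf_b, 0.4)
print(root.visits, round(root.q, 2))
```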

Key differences from standard MCTS: LLM generates actions (not a fixed action space), LLM provides value estimates (not a reward function), and the “simulation” phase involves executing real tool calls.

In the original paper, LATS was evaluated with GPT-3.5 and GPT-4 on HotPotQA (multi-hop question answering), HumanEval and MBPP (programming), and WebShop (web navigation), where it outperformed single-path baselines such as ReAct and Reflexion.

Context Compression for Long-Running Agents

Long agent runs accumulate context: observations, tool outputs, reasoning traces. Over 100+ steps, the accumulated history can exceed the model's context window, and even when it fits, every extra token adds cost.

Rolling summarization: Periodically summarize old context:

[Summary of steps 1-20]: Successfully gathered requirements, 
identified 3 API endpoints needed, implemented auth module.
[Current context, steps 21-25]: Implementing data fetching...

The summary loses some detail but preserves key facts. It is typically written by a separate LLM call whose prompt tells the model which critical information to retain.
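Rolling summarization can be sketched as a function that folds everything except the most recent steps into one summary entry. The names `roll_up` and `keep_recent` are illustrative, and `fake_summarize` is a stand-in for the separate LLM call.

```python
def roll_up(history, summarize, keep_recent=5):
    """Replace all but the last `keep_recent` entries with one summary entry."""
    if len(history) <= keep_recent:
        return list(history)
    old, recent = history[:-keep_recent], history[-keep_recent:]
    summary = f"[Summary of {len(old)} earlier entries]: {summarize(old)}"
    return [summary] + recent

def fake_summarize(entries):
    # Stand-in for an LLM call that would write a real summary
    return f"{len(entries)} steps completed"

history = [f"step {i}: did thing {i}" for i in range(1, 26)]
compressed = roll_up(history, fake_summarize)
print(compressed[0])
```

Because the summary is itself an entry in the history, calling `roll_up` again later folds the old summary into the new one, which is where detail is gradually lost.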

Episodic memory with retrieval: Store all past observations in a vector database. At each step, retrieve the $k$ most relevant past observations via semantic search. This allows “remembering” any past information without it always being in context.
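A toy version of retrieval-backed episodic memory is sketched below. To stay self-contained it uses word-count vectors and cosine similarity as a stand-in for a learned embedding model and a real vector database; `EpisodicStore`, `embed`, and `cosine` are hypothetical names.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in for an embedding model: a bag-of-words count vector
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

class EpisodicStore:
    def __init__(self):
        self.items = []  # list of (text, vector) pairs

    def add(self, text):
        self.items.append((text, embed(text)))

    def search(self, query, k=3):
        """Return the k stored observations most similar to the query."""
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

store = EpisodicStore()
store.add("auth token expires after 3600 seconds")
store.add("database schema has users table")
store.add("rate limit is 100 requests per minute")
print(store.search("when does the auth token expire", k=1))
```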

Critical path preservation: Track which observations are “on the critical path” — directly used in reasoning that led to the current state. Preserve these verbatim; compress or drop others.
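One way to sketch critical path preservation is to give every observation an ID, record which IDs each reasoning step cites, and truncate only the uncited ones. The function name, the `max_chars` cutoff, and truncation as the compression method are all assumptions made for illustration.

```python
def compress_off_path(observations, used_ids, max_chars=40):
    """Keep cited observations verbatim; truncate long uncited ones.

    observations: dict of observation id -> text
    used_ids: ids cited by the reasoning that led to the current state
    """
    result = {}
    for obs_id, text in observations.items():
        if obs_id in used_ids or len(text) <= max_chars:
            result[obs_id] = text
        else:
            result[obs_id] = text[:max_chars] + " [truncated]"
    return result

observations = {
    "o1": "x" * 100,  # verbose output that no reasoning step relied on
    "o2": "API key lives in env var SERVICE_TOKEN, rotate weekly, never log it",
    "o3": "ok",
}
compressed = compress_off_path(observations, used_ids={"o2"})
print(compressed["o1"])
```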

Hierarchical context: Distinguish levels of abstraction:

Level 3 (retain): Current task state, last action result, key facts established
Level 2 (summarize): Recent reasoning traces
Level 1 (drop): Verbose intermediate tool outputs
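The three levels can be applied mechanically once each context entry carries a level tag. In this sketch the `Entry` dataclass and `build_context` function are hypothetical, and `summarize` stands in for an LLM call.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    level: int  # 3 = retain, 2 = summarize, 1 = drop
    text: str

def build_context(entries, summarize):
    """Keep level-3 text verbatim, fold level-2 text into one summary, drop level 1."""
    context = [e.text for e in entries if e.level == 3]
    reasoning = [e.text for e in entries if e.level == 2]
    if reasoning:
        context.append("[Summary of recent reasoning]: " + summarize(reasoning))
    return context

entries = [
    Entry(3, "Task: fix login bug"),
    Entry(1, "<2,000 lines of test log>"),
    Entry(2, "Suspect session cookie"),
    Entry(2, "Cookie lacks Secure flag"),
    Entry(3, "Last action: patched cookie settings"),
]
context = build_context(entries, summarize=lambda texts: " / ".join(texts))
print(context)
```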

Agent Memory Architecture: A Multi-Layer View

Drawing from cognitive science (for example, Anderson’s ACT-R cognitive architecture), agent memory can be structured as:

Procedural memory (the agent’s “skills”):

Stored in the LLM’s weights (fine-tuned for agent behavior)
Tool usage patterns, code generation style, planning strategies
Persistent across sessions — doesn’t need external storage

Semantic memory (factual knowledge):

LLM’s parametric knowledge
Augmented with vector database retrieval (RAG)
Updated via: periodic fine-tuning, retrieval from curated knowledge bases

Episodic memory (specific past events):

Stored in structured format: {task, actions, outcomes, lessons_learned}
Retrieved by semantic similarity to current task
Enables “learning from experience” without fine-tuning

Working memory (current context):

The context window + current tool states
Cleared between sessions (unless persisted)
Bottleneck: limited size, high cost per token

Implementation pattern:

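class AgentMemory:
    """Episodic store, semantic knowledge base, and working-memory buffer."""

    def __init__(self, episodic_db, semantic_kb, max_context_chars=8000):
        # Both stores are assumed to expose search(query, k) -> list
        self.episodic_db = episodic_db      # Past episodes
        self.semantic_kb = semantic_kb      # Knowledge base
        self.current_context = []           # Working memory
        self.max_context_chars = max_context_chars

    def context_size(self):
        # Character count as a cheap proxy for token count
        return sum(len(str(obs)) for obs in self.current_context)

    def compress_context(self):
        # Placeholder policy: drop the oldest entries until under budget.
        # A real agent would summarize them instead of discarding them.
        while len(self.current_context) > 1 and self.context_size() > self.max_context_chars:
            self.current_context.pop(0)

    def add_observation(self, obs):
        self.current_context.append(obs)
        if self.context_size() > self.max_context_chars:
            self.compress_context()

    def retrieve_relevant(self, query):
        episodes = self.episodic_db.search(query, k=3)
        knowledge = self.semantic_kb.search(query, k=5)
        return episodes + knowledge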

Multi-Agent Communication Protocols

Multi-agent systems need structured communication to avoid message confusion and enable coordination.

OpenAI’s Assistants API threads: A thread is a persistent conversation history, stored separately from the assistant that runs on it. An orchestrator can give each subagent its own thread and collect the results when the run completes; the API itself does not spawn subagents. Thread isolation prevents context pollution between agents.

Directed messaging: Messages routed to specific agents by role:

{
  "from": "orchestrator",
  "to": "code_agent",
  "content": "Implement the following function: ...",
  "message_id": "msg_123",
  "parent_id": "task_456",
  "priority": "high"
}
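A message like the one above needs something to deliver it. The sketch below is one possible in-memory router with one inbox per agent. The `Message` and `Router` classes are hypothetical, the fields `sender` and `recipient` correspond to `from` and `to` in the JSON (since `from` is a reserved word in Python), and putting high-priority messages at the front of the queue is an assumed policy.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class Message:
    sender: str                      # "from" in the JSON example
    recipient: str                   # "to"
    content: str
    message_id: str
    parent_id: Optional[str] = None  # task or message this one belongs to
    priority: str = "normal"

class Router:
    def __init__(self):
        self.inboxes = defaultdict(deque)  # one queue per recipient

    def send(self, msg):
        inbox = self.inboxes[msg.recipient]
        if msg.priority == "high":
            inbox.appendleft(msg)  # assumed policy: high priority jumps the queue
        else:
            inbox.append(msg)

    def receive(self, agent):
        inbox = self.inboxes[agent]
        return inbox.popleft() if inbox else None

router = Router()
router.send(Message("orchestrator", "code_agent", "Write docs", "msg_122"))
router.send(Message("orchestrator", "code_agent", "Implement the function",
                    "msg_123", parent_id="task_456", priority="high"))
print(router.receive("code_agent").message_id)
```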

Shared workspace model (CrewAI, AutoGen): Agents share a common state object (e.g., a file system, a shared memory store). Each agent reads/writes the shared state. Requires conflict resolution when multiple agents modify the same state.
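One common way to handle the conflict-resolution requirement is optimistic concurrency: every key carries a version number, and a write succeeds only if the writer saw the latest version. The sketch below is a single-process illustration with hypothetical names; it is not how CrewAI or AutoGen implement shared state, and a multi-threaded version would also need a lock around `write`.

```python
class SharedWorkspace:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); version 0 means the key does not exist yet."""
        return self._data.get(key, (0, None))

    def write(self, key, value, expected_version):
        """Write only if nobody else has written since the caller's read."""
        current_version, _ = self.read(key)
        if current_version != expected_version:
            return False  # conflict: caller should re-read, merge, and retry
        self._data[key] = (current_version + 1, value)
        return True

ws = SharedWorkspace()
version_a, _ = ws.read("plan")   # agent A reads
version_b, _ = ws.read("plan")   # agent B reads the same version
print(ws.write("plan", "A's plan", version_a))  # A writes first
print(ws.write("plan", "B's plan", version_b))  # B's write is now stale
```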

Agent protocols: MCP (Model Context Protocol, Anthropic, 2024) standardizes how agents connect to tools and data sources, while the A2A (Agent-to-Agent) protocol (Google, 2025) standardizes how agents communicate with each other. Together they enable interoperability across frameworks.

Tool Reliability Engineering

Production agent systems often fail because their tools fail. Handling this robustly requires:

Retry with exponential backoff: Most transient tool failures (network, rate limits) resolve with retry:

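# `retry` stands for any decorator that re-runs the call with exponential
# backoff; `tools` is the agent's registry mapping tool names to tool objects.
@retry(max_attempts=3, backoff_factor=2)
def call_tool(tool_name, params):
    return tools[tool_name].execute(params)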
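A decorator with that shape can be written in plain Python. The version below is a sketch under stated assumptions: `TransientToolError` is a hypothetical exception that tools raise for retryable failures, `base_delay` is an added parameter, and `sleep` is injectable so the example runs instantly. Only transient errors are retried; anything else propagates immediately.

```python
import functools
import time

class TransientToolError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""

def retry(max_attempts=3, backoff_factor=2, base_delay=1.0, sleep=time.sleep):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except TransientToolError:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the failure
                    # Delays grow as base_delay * backoff_factor^(attempt - 1)
                    sleep(base_delay * backoff_factor ** (attempt - 1))
        return wrapper
    return decorator

delays = []          # records requested delays instead of actually sleeping
calls = {"n": 0}

@retry(max_attempts=3, backoff_factor=2, sleep=delays.append)
def flaky_search(query):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientToolError("rate limited")
    return f"results for {query}"

print(flaky_search("agent memory"), delays)
```

In production it is common to add random jitter to each delay so that many agents retrying at once do not hit the service in lockstep.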

Tool result validation: Validate that tool outputs match the expected schema before passing them to the LLM. Malformed outputs can cause the LLM to hallucinate or misinterpret results.
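A minimal validation sketch, assuming tool results are dictionaries and a schema is a mapping from field name to expected Python type. `validate_result` and `SEARCH_SCHEMA` are hypothetical; real systems often use a schema library for this instead.

```python
def validate_result(result, schema):
    """Return a list of problems; an empty list means the result is valid."""
    if not isinstance(result, dict):
        return [f"expected object, got {type(result).__name__}"]
    problems = []
    for field, expected in schema.items():
        if field not in result:
            problems.append(f"missing field: {field}")
        elif not isinstance(result[field], expected):
            actual = type(result[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems

SEARCH_SCHEMA = {"url": str, "title": str, "rank": int}

print(validate_result({"url": "https://example.com", "rank": "1"}, SEARCH_SCHEMA))
```

When the list is non-empty, the agent can pass the problems to the LLM as an explicit tool error rather than passing along the malformed payload.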

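Graceful degradation: When a tool is unavailable, say so and fall back to a weaker source rather than failing silently. For example: “I tried to search the web but the search API is unavailable. I’ll answer from my training knowledge, but note this information may be outdated.”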

Idempotency: For tools that have real-world effects (sending email, writing files), ensure calls are idempotent or explicitly confirm with users before repeated attempts.
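One way to make a side-effecting tool safe to retry is to derive an idempotency key from the tool name and parameters and skip the call if that key has already succeeded. The sketch below keeps keys in memory; `IdempotentExecutor` is a hypothetical name, and a real deployment would need durable storage so the record survives restarts.

```python
import hashlib
import json

class IdempotentExecutor:
    def __init__(self):
        self._completed = {}  # idempotency key -> stored result

    def run(self, tool_name, params, action):
        """Run action(**params) at most once per (tool_name, params) pair."""
        payload = json.dumps([tool_name, params], sort_keys=True)
        key = hashlib.sha256(payload.encode()).hexdigest()
        if key not in self._completed:
            # Stored only after success, so a failed call can still be retried
            self._completed[key] = action(**params)
        return self._completed[key]

sent = []

def send_email(to, subject):
    sent.append((to, subject))
    return f"sent to {to}"

executor = IdempotentExecutor()
params = {"to": "ops@example.com", "subject": "Deploy finished"}
executor.run("send_email", params, send_email)
executor.run("send_email", params, send_email)  # retried call: no second email
print(len(sent))
```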

Tool timeouts: Set explicit timeouts per tool category, for example 30s for code execution, 5s for web search, and 10s for database queries. Never allow a single tool call to block indefinitely.
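Per-category timeouts can be sketched with the standard library's `concurrent.futures`. The `TIMEOUTS` table mirrors the example values above, and `call_with_timeout` is a hypothetical helper. One caveat is baked into this approach: the caller stops waiting, but a thread that is already running is not interrupted, so hard limits on code execution usually need a subprocess or sandbox.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

TIMEOUTS = {"code_execution": 30.0, "web_search": 5.0, "database_query": 10.0}
_pool = ThreadPoolExecutor(max_workers=4)

def call_with_timeout(fn, category, *args, timeouts=TIMEOUTS):
    """Run fn(*args) in a worker thread and give up after the category's limit."""
    limit = timeouts[category]
    future = _pool.submit(fn, *args)
    try:
        return future.result(timeout=limit)
    except FutureTimeout:
        future.cancel()  # only effective if the call has not started yet
        raise TimeoutError(f"{category} call exceeded {limit}s")

print(call_with_timeout(lambda: "ok", "web_search"))

try:
    # A 0.5s call against a 0.05s limit, to show the failure path quickly
    call_with_timeout(time.sleep, "web_search", 0.5, timeouts={"web_search": 0.05})
except TimeoutError as exc:
    print(exc)
```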

Benchmarks and Evaluation

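SWE-bench (2023): 2,294 real GitHub issues from 12 popular Python repositories. The agent must understand the issue, navigate the codebase, write a fix, and verify that it passes the tests. In the original paper the best model (Claude 2 with retrieval) resolved about 2% of issues; by spring 2024, agent scaffolds such as SWE-agent with GPT-4 had reached roughly 12% on the full test set.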

ALFWorld: Text-based household task planning (pick up knife, put it in drawer). Tests multi-step planning in a simulated environment.

WebArena (2023): Web navigation benchmark in which the agent completes realistic tasks on self-hosted, fully functional websites (e-commerce, a social forum, GitLab-style code hosting, content management). Tests multi-step web interaction.

AgentBench: Diverse tasks across code generation, database querying, web browsing, game playing. Measures average normalized performance.

GAIA (2023): The “General AI Assistants” benchmark, with tasks requiring careful reasoning, multi-step actions, and factual accuracy. Human performance is ~92%. GPT-4 with plugins scored ~15% at release, and top leaderboard entries had reached roughly 65% by the end of 2024.

One thing to remember: Building reliable agents is a systems engineering problem as much as an ML problem — the agent’s intelligence is bounded by the reliability and expressiveness of its tools, the richness of its memory architecture, and the robustness of its failure handling, not just the capability of the underlying LLM.
