Python Conversation Memory — Deep Dive
Why Memory Architecture Matters
A chatbot without memory is a stateless function: text in, text out. Every production chatbot needs memory, but the right memory strategy depends on conversation length, latency budget, cost constraints, and whether you are building a task-oriented bot or an open-ended conversational agent.
Memory Types in Practice
Buffer Memory
The simplest approach: store every turn and pass the full history to the model or dialog manager.
from dataclasses import dataclass, field


@dataclass
class BufferMemory:
    turns: list[dict] = field(default_factory=list)

    def add(self, role: str, content: str):
        self.turns.append({"role": role, "content": content})

    def get_history(self) -> list[dict]:
        return self.turns.copy()

    def to_prompt(self) -> str:
        return "\n".join(f"{t['role']}: {t['content']}" for t in self.turns)
When to use: Short conversations (under 20 turns) where token cost is not a concern. This is the default for most chatbot prototypes.
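Limitation: History size grows linearly with the number of turns. A 100-turn conversation averaging 50 tokens per turn carries 5,000 tokens of history, which leaves less room in the context window for the model's response. Because the full history is resent on every turn, the total tokens processed over the whole conversation grow quadratically.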
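The arithmetic can be checked with a short sketch. It assumes a flat 50 tokens per turn (the figure used in the example) and that the full history is resent on every turn; `history_tokens` and `cumulative_tokens` are illustrative helper names, not part of any library.

```python
def history_tokens(turns: int, avg_tokens_per_turn: int = 50) -> int:
    """Tokens of history carried in the prompt once `turns` turns exist."""
    return turns * avg_tokens_per_turn


def cumulative_tokens(turns: int, avg_tokens_per_turn: int = 50) -> int:
    """Total history tokens sent over the whole conversation,
    assuming the full history is resent on every turn."""
    return sum(history_tokens(t, avg_tokens_per_turn) for t in range(1, turns + 1))


print(history_tokens(100))     # 5000
print(cumulative_tokens(100))  # 252500
```

The second number is the one that shows up on the bill: each individual prompt stays at or below 5,000 history tokens, but the conversation as a whole has sent about fifty times that.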
Sliding Window Memory
Keep only the most recent N turns:
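from dataclasses import dataclass, field


@dataclass
class WindowMemory:
    max_turns: int = 20
    turns: list[dict] = field(default_factory=list)

    def add(self, role: str, content: str):
        self.turns.append({"role": role, "content": content})
        # Drop the oldest turns once the window is full.
        if len(self.turns) > self.max_turns:
            self.turns = self.turns[-self.max_turns:]

    def get_history(self) -> list[dict]:
        return self.turns.copy()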
When to use: Medium-length conversations where you need bounded memory. Customer service bots typically use windows of 10-20 turns.
Limitation: Information from early turns is permanently lost. With the default 20-turn window, a name the user mentioned in turn 1 is dropped as soon as turn 21 is added.
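A minimal sketch of that drop-off. It uses `collections.deque` with `maxlen`, which is an alternative way to get the same bounded behavior as the list slicing shown earlier, not the class itself; the user name and message texts are made up for the demonstration.

```python
from collections import deque

# A deque with maxlen discards the oldest item automatically on append.
window: deque = deque(maxlen=20)

window.append({"role": "user", "content": "Hi, my name is Dana."})  # turn 1
for i in range(2, 22):  # turns 2 through 21
    window.append({"role": "user", "content": f"message {i}"})

remembers_name = any("Dana" in t["content"] for t in window)
print(len(window), remembers_name)  # 20 False
```

Anything that must survive the window, such as a name or an order number, has to live somewhere other than the turn list.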
Summary Memory
Periodically compress older turns into a summary, keeping recent turns intact:
import openai  # openai>=1.0; the API key is read from OPENAI_API_KEY


class SummaryMemory:
    def __init__(self, recent_window: int = 10, model: str = "gpt-4o-mini"):
        self.summary: str = ""
        self.recent: list[dict] = []
        self.recent_window = recent_window
        self.model = model

    def add(self, role: str, content: str):
        self.recent.append({"role": role, "content": content})
        # Compress in batches: only once the buffer exceeds twice the window.
        if len(self.recent) > self.recent_window * 2:
            self._compress()

    def _compress(self):
        # Everything except the last `recent_window` turns gets summarized.
        old_turns = self.recent[:-self.recent_window]
        old_text = "\n".join(f"{t['role']}: {t['content']}" for t in old_turns)
        prompt = (
            f"Current summary:\n{self.summary}\n\n"
            f"New conversation turns:\n{old_text}\n\n"
            "Write an updated summary capturing all important information, "
            "decisions, and context. Be concise."
        )
        response = openai.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=300,
        )
        # content can be None; keep the previous summary in that case.
        self.summary = response.choices[0].message.content or self.summary
        self.recent = self.recent[-self.recent_window:]

    def to_prompt(self) -> str:
        parts = []
        if self.summary:
            parts.append(f"Conversation summary: {self.summary}")
        parts.extend(f"{t['role']}: {t['content']}" for t in self.recent)
        return "\n".join(parts)
When to use: Long conversations (30+ turns) where you need to preserve key information without unbounded growth. Common in coaching bots, therapy bots, and complex customer journeys.
Tradeoff: Summarization adds latency and cost (an extra LLM call each time the buffer is compressed), and it is lossy: the summary may drop details that later become important.
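To see when compression fires without paying for an LLM call, the summarizer can be injected as a plain function. The sketch below is a simplified variant of the summary memory described here, not a drop-in replacement; `InjectableSummaryMemory`, `summarize`, and `fake_summarize` are hypothetical names, and the fake simply counts turns instead of calling a model.

```python
from typing import Callable


class InjectableSummaryMemory:
    """Summary memory whose summarizer is passed in (hypothetical variant)."""

    def __init__(self, summarize: Callable[[str, list], str],
                 recent_window: int = 2):
        self.summarize = summarize
        self.recent_window = recent_window
        self.summary = ""
        self.recent: list = []
        self.compressions = 0  # how many times the summarizer ran

    def add(self, role: str, content: str) -> None:
        self.recent.append({"role": role, "content": content})
        if len(self.recent) > self.recent_window * 2:
            old = self.recent[:-self.recent_window]
            self.summary = self.summarize(self.summary, old)
            self.recent = self.recent[-self.recent_window:]
            self.compressions += 1


def fake_summarize(previous: str, old_turns: list) -> str:
    # Stand-in for an LLM call: records how many turns were folded in.
    already = int(previous.split()[0]) if previous else 0
    return f"{already + len(old_turns)} turns summarized"


memory = InjectableSummaryMemory(fake_summarize, recent_window=2)
for i in range(1, 9):
    memory.add("user", f"message {i}")

print(memory.summary)       # 6 turns summarized
print(len(memory.recent))   # 2
print(memory.compressions)  # 2
```

With a window of 2, eight messages trigger two summarizer calls rather than eight, which is where the batching threshold earns its keep. The same injection point also makes the class easy to unit test.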
Entity Memory
Track specific entities across the conversation, independent of turn history:
class EntityMemory:
    def __init__(self):
        self.entities: dict[str, dict] = {}

    def update(self, entity_name: str, attributes: dict):
        # Merge new attributes into whatever is already known.
        self.entities.setdefault(entity_name, {}).update(attributes)

    def get(self, entity_name: str) -> dict:
        return self.entities.get(entity_name, {})

    def to_context(self) -> str:
        # Return an empty string when nothing is known so callers can skip it.
        if not self.entities:
            return ""
        lines = []
        for name, attrs in self.entities.items():
            details = ", ".join(f"{k}: {v}" for k, v in attrs.items())
            lines.append(f"{name}: {details}")
        return "Known entities:\n" + "\n".join(lines)
Usage: After NLU extracts entities from each turn, update the entity memory. Before generating a response, inject the entity context into the prompt.
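A minimal sketch of that loop. The NLU step is stubbed with regular expressions, the `extract_entities` helper and the `user` entity key are hypothetical, and a plain dict stands in for the entity memory so the example runs on its own.

```python
import re

entities: dict = {}  # stand-in for the entity memory's internal dict


def extract_entities(text: str) -> dict:
    """Toy NLU: picks up 'my name is X' and 'order #N' patterns only."""
    found: dict = {}
    if m := re.search(r"my name is (\w+)", text, re.IGNORECASE):
        found.setdefault("user", {})["name"] = m.group(1)
    if m := re.search(r"order #(\d+)", text, re.IGNORECASE):
        found.setdefault("user", {})["order_id"] = m.group(1)
    return found


def update(extracted: dict) -> None:
    # Merge each turn's entities into what is already known.
    for name, attrs in extracted.items():
        entities.setdefault(name, {}).update(attrs)


def entity_context() -> str:
    lines = [
        f"{name}: " + ", ".join(f"{k}: {v}" for k, v in attrs.items())
        for name, attrs in entities.items()
    ]
    return "Known entities:\n" + "\n".join(lines) if lines else ""


update(extract_entities("Hi, my name is Dana."))
update(extract_entities("I need help with order #4821."))

# Inject the entity context ahead of the current user message.
prompt = entity_context() + "\n\nuser: Where is my package?"
print(prompt)
```

The entity block stays a single short line however long the conversation runs, which is why entity memory pairs well with a sliding window.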
Vector-Retrieval Memory
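Store all turns as embeddings in a vector database. Before each response, retrieve the most relevant past turns. The example keeps the embeddings in an in-memory list and scans them all, which is enough to show the idea: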
import numpy as np
from sentence_transformers import SentenceTransformer


class VectorMemory:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2", top_k: int = 5):
        self.encoder = SentenceTransformer(model_name)
        self.turns: list[dict] = []
        self.embeddings: list[np.ndarray] = []
        self.top_k = top_k

    def add(self, role: str, content: str):
        self.turns.append({"role": role, "content": content})
        self.embeddings.append(self.encoder.encode(content))

    def retrieve(self, query: str) -> list[dict]:
        if not self.embeddings:
            return []
        query_emb = self.encoder.encode(query)
        matrix = np.vstack(self.embeddings)
        # Cosine similarity between the query and every stored turn.
        similarities = matrix @ query_emb / (
            np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_emb)
        )
        # Indices of the top_k highest scores, best match first.
        top_indices = np.argsort(similarities)[-self.top_k:][::-1]
        return [self.turns[i] for i in top_indices]

    def to_prompt(self, current_message: str) -> str:
        relevant = self.retrieve(current_message)
        if not relevant:
            return ""  # nothing stored yet, so there is no context to add
        context = "\n".join(f"{t['role']}: {t['content']}" for t in relevant)
        return f"Relevant past context:\n{context}"
When to use: Very long conversations or multi-session agents where the bot needs to recall specific past interactions. Useful for personal assistants and support bots with returning users.
Tradeoff: Retrieval may miss important context that is not semantically similar to the current query. Combining vector retrieval with a recent-turn window mitigates this.
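One way to combine the two is sketched below: the recent window is always included, and similarity search runs only over the turns the window no longer covers. To stay dependency-free, a bag-of-words cosine similarity stands in for a sentence encoder; `build_context` and the sample turns are illustrative.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    # Toy stand-in for a sentence encoder.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_context(turns: list, query: str, window: int = 2, top_k: int = 1) -> list:
    recent = turns[-window:]
    older = turns[:-window]  # search only what the window does not cover
    q = bag_of_words(query)
    ranked = sorted(
        range(len(older)),
        key=lambda i: cosine(bag_of_words(older[i]), q),
        reverse=True,
    )
    retrieved = [older[i] for i in sorted(ranked[:top_k])]  # chronological order
    return retrieved + recent


turns = [
    "user: my order number is 4821",
    "bot: thanks, I have noted it",
    "user: the weather is nice today",
    "bot: glad to hear that",
    "user: can you recommend a book",
    "bot: sure, try a mystery novel",
]
context = build_context(turns, "what was my order number")
print(context)
```

Restricting the search to older turns avoids spending the top-k budget on messages that are already in the prompt through the window.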
Hybrid Memory Architecture
Production systems combine multiple memory types:
class HybridMemory:
    def __init__(self):
        self.window = WindowMemory(max_turns=10)
        self.entity = EntityMemory()
        self.vector = VectorMemory(top_k=3)
        self.summary = ""  # placeholder; nothing in this class updates it

    def add_turn(self, role: str, content: str, entities: dict | None = None):
        self.window.add(role, content)
        self.vector.add(role, content)
        if entities:
            for name, attrs in entities.items():
                self.entity.update(name, attrs)

    def build_context(self, current_message: str) -> str:
        # Call this before add_turn() for the current message; otherwise
        # retrieval returns that message as its own best match.
        parts = []
        if self.summary:
            parts.append(f"Summary: {self.summary}")
        if self.entity.entities:
            parts.append(self.entity.to_context())
        if self.vector.turns:
            parts.append(self.vector.to_prompt(current_message))
        recent = "\n".join(
            f"{t['role']}: {t['content']}" for t in self.window.turns
        )
        parts.append(f"Recent conversation:\n{recent}")
        return "\n\n".join(parts)
Persistence Layer
Redis for Session Memory
import json

import redis


class RedisSessionStore:
    def __init__(self, host: str = "localhost", ttl: int = 3600):
        self.client = redis.Redis(host=host, decode_responses=True)
        self.ttl = ttl

    def save(self, session_id: str, memory: dict):
        self.client.setex(
            f"session:{session_id}",
            self.ttl,
            json.dumps(memory, default=str),
        )

    def load(self, session_id: str) -> dict | None:
        data = self.client.get(f"session:{session_id}")
        return json.loads(data) if data else None

    def delete(self, session_id: str):
        self.client.delete(f"session:{session_id}")
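The store takes and returns plain dicts, so memory objects need a serializable form. A minimal sketch of that round trip, assuming a snapshot with `turns`, `entities`, and `summary` keys; the `SessionState` class and the `restore` helper are illustrative and not prescribed by the store.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class SessionState:
    turns: list = field(default_factory=list)
    entities: dict = field(default_factory=dict)
    summary: str = ""


def restore(data: Optional[dict]) -> SessionState:
    """Rebuild state from a loaded dict; start fresh if the key has expired."""
    return SessionState(**data) if data else SessionState()


state = SessionState(
    turns=[{"role": "user", "content": "hello"}],
    entities={"user": {"name": "Dana"}},
)

# Simulate what save() followed by load() does to the data.
round_tripped = json.loads(json.dumps(asdict(state), default=str))

assert restore(round_tripped) == state
assert restore(None) == SessionState()  # expired or missing session
```

Handling the `None` case explicitly matters because a TTL means any session can vanish between two turns.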
PostgreSQL for Long-Term Memory
import json

import asyncpg


class PGMemoryStore:
    def __init__(self, pool: asyncpg.Pool):
        self.pool = pool

    async def save_turn(self, user_id: str, session_id: str, role: str,
                        content: str, entities: dict):
        await self.pool.execute("""
            INSERT INTO conversation_turns
                (user_id, session_id, role, content, entities, created_at)
            VALUES ($1, $2, $3, $4, $5::jsonb, NOW())
        """, user_id, session_id, role, content, json.dumps(entities))

    async def get_user_history(self, user_id: str, limit: int = 50) -> list[dict]:
        # Fetch the newest `limit` turns, then reverse into chronological order.
        # Without a custom type codec, asyncpg returns jsonb columns as strings.
        rows = await self.pool.fetch("""
            SELECT role, content, entities, created_at
            FROM conversation_turns
            WHERE user_id = $1
            ORDER BY created_at DESC LIMIT $2
        """, user_id, limit)
        return [dict(r) for r in reversed(rows)]
Memory in LangChain
LangChain provides ready-made memory classes that plug into chains:
from langchain.memory import ConversationSummaryBufferMemory
from langchain_openai import ChatOpenAI

memory = ConversationSummaryBufferMemory(
    llm=ChatOpenAI(model="gpt-4o-mini"),
    max_token_limit=1000,
    return_messages=True,
)
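This automatically summarizes old messages when the token count exceeds max_token_limit, keeping recent messages intact. It is a quick way to add memory to an LLM chatbot, though production systems often outgrow its defaults. Recent LangChain releases mark the classic memory classes as deprecated, so check the documentation for the version you target.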
Privacy and Data Retention
Conversation memory stores personal data. Production systems must:
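- Encrypt at rest: Use disk- or volume-level encryption for Redis persistence files (open-source Redis does not encrypt data at rest itself) and database-level encryption for stored conversation data.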
- Implement TTLs: Automatically expire session data after a defined period.
- Honor deletion requests: Provide a mechanism to delete all conversation data for a user (GDPR right to erasure).
- Anonymize logs: Strip PII before writing conversation data to analytics pipelines.
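As an illustration of the last point, a minimal redaction pass might look like the sketch below. The two patterns are deliberately simple assumptions that cover only emails and phone-like digit runs; real PII detection needs a dedicated tool and review.

```python
import re

# Simplistic patterns for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace emails and phone-like numbers before the text is logged."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


print(redact("Reach me at dana@example.com or +1 (555) 010-9999."))
# Reach me at [EMAIL] or [PHONE].
```

Running redaction at the point where turns are written to the analytics pipeline, rather than at read time, keeps raw PII out of downstream storage altogether.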
Performance Considerations
| Memory Type | Read Latency | Write Latency | Memory Cost | Token Cost |
|---|---|---|---|---|
| Buffer | O(1) | O(1) | Linear | Linear |
| Sliding Window | O(1) | O(1) | Bounded | Bounded |
| Summary | O(1) | O(n) + LLM | Bounded | Bounded + summarization |
| Entity | O(1) | O(1) | Proportional to entities | Low |
| Vector Retrieval | O(n) brute force; roughly O(log n) with an ANN index | O(1) + embedding | Linear | Fixed (top-k) |
For high-throughput bots (1000+ concurrent conversations), pre-compute embeddings asynchronously and cache entity memory in Redis to avoid database round-trips per turn.
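A minimal sketch of asynchronous embedding, assuming an asyncio-based service. `slow_encode` is a stand-in for a blocking encoder call, and `AsyncEmbeddingIndex` is a hypothetical name; the point is that `add` returns immediately while encoding runs in a worker thread.

```python
import asyncio
import time


def slow_encode(text: str) -> list:
    """Stand-in for a blocking encoder call."""
    time.sleep(0.05)
    return [float(len(text))]


class AsyncEmbeddingIndex:
    def __init__(self) -> None:
        self.embeddings: dict = {}
        self._pending: set = set()

    def add(self, turn_id: int, text: str) -> None:
        # Schedule encoding in a worker thread and return immediately.
        task = asyncio.create_task(self._encode(turn_id, text))
        self._pending.add(task)
        task.add_done_callback(self._pending.discard)

    async def _encode(self, turn_id: int, text: str) -> None:
        self.embeddings[turn_id] = await asyncio.to_thread(slow_encode, text)

    async def flush(self) -> None:
        # Wait for outstanding work, e.g. before retrieval or shutdown.
        if self._pending:
            await asyncio.gather(*self._pending)


async def main() -> dict:
    index = AsyncEmbeddingIndex()
    for i, text in enumerate(["hello", "my order is late", "thanks"]):
        index.add(i, text)  # does not block on encoding
    await index.flush()
    return index.embeddings


result = asyncio.run(main())
print(sorted(result.items()))  # [(0, [5.0]), (1, [16.0]), (2, [6.0])]
```

The tradeoff is that a turn may not be retrievable until its embedding finishes, which is usually acceptable because the most recent turns are already covered by the window.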
The one thing to remember: Production conversation memory is a hybrid system — combining a recent-turn window for immediate context, entity tracking for structured data, and either summarization or vector retrieval for long-term recall — all backed by persistent storage with proper TTLs and privacy controls.