Python Conversation Memory — Deep Dive

Implement conversation memory systems in Python — from sliding window buffers to summary compression and vector-retrieval augmented memory.

Why Memory Architecture Matters

A chatbot without memory is a stateless function: text in, text out. Every production chatbot needs memory, but the right memory strategy depends on conversation length, latency budget, cost constraints, and whether you are building a task-oriented bot or an open-ended conversational agent.

Memory Types in Practice

Buffer Memory

The simplest approach: store every turn and pass the full history to the model or dialog manager.

from dataclasses import dataclass, field

@dataclass
class BufferMemory:
    turns: list[dict] = field(default_factory=list)

    def add(self, role: str, content: str):
        self.turns.append({"role": role, "content": content})

    def get_history(self) -> list[dict]:
        return self.turns.copy()

    def to_prompt(self) -> str:
        return "\n".join(f"{t['role']}: {t['content']}" for t in self.turns)

When to use: Short conversations (under 20 turns) where token cost is not a concern. This is the default for most chatbot prototypes.

Limitation: Token count grows linearly. A 100-turn conversation with an average of 50 tokens per turn consumes 5,000 tokens just for history — leaving less room for the model’s response.
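
The arithmetic above can be checked with a rough estimator. This is a minimal sketch, assuming about four characters per token, which is only a rule of thumb for English text. The names `estimate_tokens` and `history_tokens` are illustrative, and a real system would count with the model's own tokenizer.

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate; the 4-chars-per-token ratio is an assumption."""
    return max(1, len(text) // chars_per_token)


def history_tokens(turns: list[dict]) -> int:
    """Sum the estimated tokens across every stored turn."""
    return sum(estimate_tokens(t["content"]) for t in turns)


# 100 turns of 200 characters each is roughly 50 tokens per turn.
turns = [{"role": "user", "content": "x" * 200} for _ in range(100)]
total = history_tokens(turns)
print(total)
```

Running the same estimate on live history before each model call gives an early signal for when to switch to one of the bounded strategies below.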

Sliding Window Memory

Keep only the most recent N turns:

@dataclass
class WindowMemory:
    max_turns: int = 20
    turns: list[dict] = field(default_factory=list)

    def add(self, role: str, content: str):
        self.turns.append({"role": role, "content": content})
        # Drop the oldest turns once the window is full.
        if len(self.turns) > self.max_turns:
            self.turns = self.turns[-self.max_turns:]

    def get_history(self) -> list[dict]:
        return self.turns.copy()

When to use: Medium-length conversations where you need bounded memory. Customer service bots typically use windows of 10-20 turns.

Limitation: Information from early turns is permanently lost. With the default window of 20, a name the user mentioned in turn 1 is dropped as soon as turn 21 is added.
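
The loss is easy to reproduce. The sketch below uses `collections.deque` with `maxlen` as a stand-in for the `WindowMemory` class above (an equivalent eviction policy, not the same implementation); the name "Dana" and the messages are made up for illustration.

```python
from collections import deque

# deque(maxlen=N) silently discards the oldest item once it is full.
window = deque(maxlen=20)

window.append({"role": "user", "content": "Hi, my name is Dana."})  # turn 1
for i in range(2, 22):  # turns 2 through 21
    window.append({"role": "user", "content": f"message {i}"})

# After 21 turns the window holds turns 2..21, so the name is gone.
name_still_present = any("Dana" in t["content"] for t in window)
oldest = window[0]["content"]
print(name_still_present, oldest)
```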

Summary Memory

Periodically compress older turns into a summary, keeping recent turns intact:

import openai

class SummaryMemory:
    def __init__(self, recent_window: int = 10, model: str = "gpt-4o-mini"):
        self.summary: str = ""
        self.recent: list[dict] = []
        self.recent_window = recent_window
        self.model = model

    def add(self, role: str, content: str):
        self.recent.append({"role": role, "content": content})
        if len(self.recent) > self.recent_window * 2:
            self._compress()

    def _compress(self):
        old_turns = self.recent[:-self.recent_window]
        old_text = "\n".join(f"{t['role']}: {t['content']}" for t in old_turns)

        prompt = (
            f"Current summary:\n{self.summary}\n\n"
            f"New conversation turns:\n{old_text}\n\n"
            "Write an updated summary capturing all important information, "
            "decisions, and context. Be concise."
        )

        response = openai.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=300,
        )
        self.summary = response.choices[0].message.content
        self.recent = self.recent[-self.recent_window:]

    def to_prompt(self) -> str:
        parts = []
        if self.summary:
            parts.append(f"Conversation summary: {self.summary}")
        parts.extend(f"{t['role']}: {t['content']}" for t in self.recent)
        return "\n".join(parts)

When to use: Long conversations (30+ turns) where you need to preserve key information without unbounded growth. Common in coaching bots, therapy bots, and complex customer journeys.

Tradeoff: Summarization adds latency and cost (an extra LLM call), and information is lossy — the summary may drop details that later become important.
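
To see how often that extra call fires, it can help to make the summarizer injectable. The following is a sketch under stated assumptions: `PluggableSummaryMemory` and `fake_summarizer` are hypothetical names, and the fake summarizer only counts lines instead of calling a model, so the compression schedule can be observed offline.

```python
from typing import Callable

# (current_summary, old_text) -> new summary
Summarizer = Callable[[str, str], str]


class PluggableSummaryMemory:
    """Hypothetical variant of SummaryMemory with an injectable summarizer."""

    def __init__(self, summarize: Summarizer, recent_window: int = 10):
        self.summarize = summarize
        self.recent_window = recent_window
        self.summary = ""
        self.recent: list[dict] = []
        self.compressions = 0  # how many summarization calls were made

    def add(self, role: str, content: str) -> None:
        self.recent.append({"role": role, "content": content})
        # Same trigger as above: compress once the buffer exceeds 2x the window.
        if len(self.recent) > self.recent_window * 2:
            old = self.recent[:-self.recent_window]
            old_text = "\n".join(f"{t['role']}: {t['content']}" for t in old)
            self.summary = self.summarize(self.summary, old_text)
            self.recent = self.recent[-self.recent_window:]
            self.compressions += 1


def fake_summarizer(current: str, old_text: str) -> str:
    """Stand-in for the LLM call: reports how many turns were folded in."""
    folded = len(old_text.splitlines())
    return f"{current} [{folded} turns folded]".strip()


memory = PluggableSummaryMemory(fake_summarizer, recent_window=2)
for i in range(1, 11):
    memory.add("user", f"turn {i}")
print(memory.compressions, memory.summary)
```

With `recent_window=2`, ten turns trigger two summarization calls; a larger window lowers the call rate at the price of a longer prompt.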

Entity Memory

Track specific entities across the conversation, independent of turn history:

class EntityMemory:
    def __init__(self):
        self.entities: dict[str, dict] = {}

    def update(self, entity_name: str, attributes: dict):
        if entity_name not in self.entities:
            self.entities[entity_name] = {}
        self.entities[entity_name].update(attributes)

    def get(self, entity_name: str) -> dict:
        return self.entities.get(entity_name, {})

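    def to_context(self) -> str:
        # Return an empty string when nothing is tracked so callers can skip the section.
        if not self.entities:
            return ""
        lines = []
        for name, attrs in self.entities.items():
            details = ", ".join(f"{k}: {v}" for k, v in attrs.items())
            lines.append(f"{name}: {details}")
        return "Known entities:\n" + "\n".join(lines)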

Usage: After NLU extracts entities from each turn, update the entity memory. Before generating a response, inject the entity context into the prompt.
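
The extract-then-update loop described above might look like the following. This is a toy sketch: `extract_entities` is a hypothetical regex stand-in for a real NLU component, a plain dict plays the role of the entity store, and the sample messages are invented.

```python
import re


def extract_entities(text: str) -> dict[str, dict]:
    """Toy stand-in for an NLU step; real systems would use a trained extractor."""
    found: dict[str, dict] = {}
    name = re.search(r"my name is (\w+)", text, re.IGNORECASE)
    if name:
        found.setdefault("user", {})["name"] = name.group(1)
    order = re.search(r"order #?(\d+)", text, re.IGNORECASE)
    if order:
        found.setdefault("order", {})["id"] = order.group(1)
    return found


def merge_entities(store: dict[str, dict], extracted: dict[str, dict]) -> None:
    """Merge new attributes into the store, overwriting stale values."""
    for name, attrs in extracted.items():
        store.setdefault(name, {}).update(attrs)


store: dict[str, dict] = {}
for message in ["Hi, my name is Dana", "I need help with order #4521"]:
    merge_entities(store, extract_entities(message))

# Render the store in the same "name: key: value" shape used for the prompt.
context = "\n".join(
    f"{name}: " + ", ".join(f"{k}: {v}" for k, v in attrs.items())
    for name, attrs in store.items()
)
print(context)
```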

Vector-Retrieval Memory

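Store every turn as an embedding and, before each response, retrieve the past turns most relevant to the current message. The example below keeps the embeddings in an in-memory list and scores them with NumPy; a production system would typically use a vector database instead: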

from sentence_transformers import SentenceTransformer
import numpy as np

class VectorMemory:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2", top_k: int = 5):
        self.encoder = SentenceTransformer(model_name)
        self.turns: list[dict] = []
        self.embeddings: list[np.ndarray] = []
        self.top_k = top_k

    def add(self, role: str, content: str):
        self.turns.append({"role": role, "content": content})
        embedding = self.encoder.encode(content)
        self.embeddings.append(embedding)

    def retrieve(self, query: str) -> list[dict]:
        if not self.embeddings:
            return []
        query_emb = self.encoder.encode(query)
        similarities = np.dot(self.embeddings, query_emb) / (
            np.linalg.norm(self.embeddings, axis=1) * np.linalg.norm(query_emb)
        )
        top_indices = np.argsort(similarities)[-self.top_k:][::-1]
        return [self.turns[i] for i in top_indices]

    def to_prompt(self, current_message: str) -> str:
        relevant = self.retrieve(current_message)
        if not relevant:
            return ""  # nothing stored yet, so emit no context section
        context = "\n".join(f"{t['role']}: {t['content']}" for t in relevant)
        return f"Relevant past context:\n{context}"

When to use: Very long conversations or multi-session agents where the bot needs to recall specific past interactions. Useful for personal assistants and support bots with returning users.

Tradeoff: Retrieval may miss important context that is not semantically similar to the current query. Combining vector retrieval with a recent-turn window mitigates this.
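
One way to combine the two is to always keep the last few turns and run similarity search only over the older ones. The sketch below is an assumption about how that could be wired, not a fixed recipe: `select_turns` is a hypothetical helper, and hand-written 2-D lists stand in for real embeddings so the example needs no model.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_turns(embeddings: list[list[float]], query: list[float],
                 recent_n: int = 2, top_k: int = 2) -> list[int]:
    """Return turn indices: top_k retrieved older turns plus the last recent_n."""
    n = len(embeddings)
    split = max(0, n - recent_n)
    recent = list(range(split, n))
    older = range(0, split)
    # Rank only turns outside the recent window so nothing is duplicated.
    ranked = sorted(older, key=lambda i: cosine(embeddings[i], query), reverse=True)
    retrieved = sorted(ranked[:top_k])  # restore chronological order
    return retrieved + recent


# Toy 2-D "embeddings" for five turns.
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
query = [1.0, 0.0]
chosen = select_turns(embeddings, query)
print(chosen)
```

Here turns 3 and 4 are kept because they are recent, while turns 0 and 2 are pulled in because they point in the same direction as the query.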

Hybrid Memory Architecture

Production systems combine multiple memory types:

class HybridMemory:
    def __init__(self):
        self.window = WindowMemory(max_turns=10)
        self.entity = EntityMemory()
        self.vector = VectorMemory(top_k=3)
        self.summary = ""  # not updated by this class; set it from a summarizer such as SummaryMemory

    def add_turn(self, role: str, content: str, entities: dict | None = None):
        self.window.add(role, content)
        self.vector.add(role, content)
        if entities:
            for name, attrs in entities.items():
                self.entity.update(name, attrs)

    def build_context(self, current_message: str) -> str:
        parts = []
        if self.summary:
            parts.append(f"Summary: {self.summary}")
        entity_ctx = self.entity.to_context()
        if entity_ctx:
            parts.append(entity_ctx)
        relevant = self.vector.to_prompt(current_message)
        if relevant:
            parts.append(relevant)
        recent = "\n".join(
            f"{t['role']}: {t['content']}" for t in self.window.get_history()
        )
        parts.append(f"Recent conversation:\n{recent}")
        return "\n\n".join(parts)

Persistence Layer

Redis for Session Memory

import redis
import json

class RedisSessionStore:
    def __init__(self, host: str = "localhost", ttl: int = 3600):
        self.client = redis.Redis(host=host, decode_responses=True)
        self.ttl = ttl

    def save(self, session_id: str, memory: dict):
        self.client.setex(
            f"session:{session_id}",
            self.ttl,
            json.dumps(memory, default=str),
        )

    def load(self, session_id: str) -> dict | None:
        data = self.client.get(f"session:{session_id}")
        return json.loads(data) if data else None

    def delete(self, session_id: str):
        self.client.delete(f"session:{session_id}")
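
The store above expects a plain dict, so the memory objects need a serializable snapshot. A minimal sketch of one possible shape follows; the `snapshot` and `restore` helpers and the `version` field are assumptions for illustration, and the JSON round trip mimics what `save()` and `load()` do without requiring a Redis server.

```python
import json
from typing import Optional


def snapshot(summary: str, turns: list[dict], entities: dict[str, dict]) -> dict:
    """Collect the pieces of memory state into one JSON-serializable dict."""
    return {"version": 1, "summary": summary, "turns": turns, "entities": entities}


def restore(data: Optional[dict]) -> tuple[str, list[dict], dict[str, dict]]:
    """Rebuild memory state; a missing or expired session yields empty state."""
    if not data:
        return "", [], {}
    return data.get("summary", ""), data.get("turns", []), data.get("entities", {})


state = snapshot(
    "User wants a refund.",
    [{"role": "user", "content": "hi"}],
    {"user": {"name": "Dana"}},
)
# The JSON round trip mirrors what save() and load() do around Redis.
round_tripped = json.loads(json.dumps(state, default=str))
summary, turns, entities = restore(round_tripped)
print(summary, turns, entities)
```

Handling `None` in `restore` matters because `load()` returns `None` once the TTL has expired.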

PostgreSQL for Long-Term Memory

import json

import asyncpg

class PGMemoryStore:
    def __init__(self, pool: asyncpg.Pool):
        self.pool = pool

    async def save_turn(self, user_id: str, session_id: str, role: str,
                        content: str, entities: dict):
        await self.pool.execute("""
            INSERT INTO conversation_turns
                (user_id, session_id, role, content, entities, created_at)
            VALUES ($1, $2, $3, $4, $5::jsonb, NOW())
        """, user_id, session_id, role, content, json.dumps(entities))

    async def get_user_history(self, user_id: str, limit: int = 50) -> list[dict]:
        rows = await self.pool.fetch("""
            SELECT role, content, entities, created_at
            FROM conversation_turns
            WHERE user_id = $1
            ORDER BY created_at DESC LIMIT $2
        """, user_id, limit)
        return [dict(r) for r in reversed(rows)]
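
The query pattern in `get_user_history` (newest N rows, then reversed into chronological order) can be tried without a PostgreSQL server. The sketch below swaps in the standard-library `sqlite3` module purely as a runnable stand-in; the table definition is inferred from the INSERT above, the column types are a guess, and `created_at` is a simple integer counter rather than a real timestamp.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
# Assumed table shape, inferred from the INSERT statement; types are a guess.
conn.execute("""
    CREATE TABLE conversation_turns (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id TEXT NOT NULL,
        session_id TEXT NOT NULL,
        role TEXT NOT NULL,
        content TEXT NOT NULL,
        entities TEXT NOT NULL,
        created_at INTEGER NOT NULL
    )
""")

for i in range(1, 6):
    conn.execute(
        "INSERT INTO conversation_turns "
        "(user_id, session_id, role, content, entities, created_at) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        ("u1", "s1", "user", f"turn {i}", json.dumps({}), i),
    )


def get_user_history(user_id: str, limit: int = 3) -> list[dict]:
    """Fetch the newest `limit` turns, then flip them into chronological order."""
    rows = conn.execute(
        "SELECT role, content, entities, created_at FROM conversation_turns "
        "WHERE user_id = ? ORDER BY created_at DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()
    return [dict(r) for r in reversed(rows)]


history = get_user_history("u1")
print([h["content"] for h in history])
```

Sorting descending with a limit keeps the query cheap on long histories, and the final reversal hands the model the turns in the order they were spoken.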

Memory in LangChain

LangChain provides ready-made memory classes that plug into chains:

from langchain.memory import ConversationSummaryBufferMemory
from langchain_openai import ChatOpenAI

memory = ConversationSummaryBufferMemory(
    llm=ChatOpenAI(model="gpt-4o-mini"),
    max_token_limit=1000,
    return_messages=True,
)

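This automatically summarizes old messages when the token count exceeds the limit, keeping recent messages intact. It is a quick way to add memory to an LLM chatbot, though production systems often outgrow its defaults. Recent LangChain releases mark the classes in langchain.memory as deprecated, so check the documentation for the version you have installed.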

Privacy and Data Retention

Conversation memory stores personal data. Production systems must:

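Encrypt at rest: Use database-level encryption for conversation data. Open-source Redis does not encrypt stored data itself, so rely on disk- or volume-level encryption or a managed service that provides it.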
Implement TTLs: Automatically expire session data after a defined period.
Honor deletion requests: Provide a mechanism to delete all conversation data for a user (GDPR right to erasure).
Anonymize logs: Strip PII before writing conversation data to analytics pipelines.
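
As an illustration of the last point, a redaction pass can run before turns reach analytics. This is a deliberately simple sketch: the two regular expressions and the `redact` and `anonymize_turn` helpers are assumptions, they cover only emails and phone-like numbers, and real deployments usually need a dedicated PII detection step.

```python
import re

# Deliberately simple patterns; they are assumptions and will miss many PII forms.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace emails and phone-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def anonymize_turn(turn: dict) -> dict:
    """Return a copy of a turn that is safer to ship to an analytics pipeline."""
    return {**turn, "content": redact(turn["content"])}


turn = {"role": "user", "content": "Email dana@example.com or call +1 555-010-9999."}
clean = anonymize_turn(turn)
print(clean["content"])
```

Returning a copy leaves the original turn untouched, so the live conversation still has the details it needs while the logged version does not.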

Performance Considerations

Memory Type	Read Latency	Write Latency	Memory Cost	Token Cost
Buffer	O(1)	O(1)	Linear	Linear
Sliding Window	O(1)	O(1)	Bounded	Bounded
Summary	O(1)	O(n) + LLM	Bounded	Bounded + summarization
Entity	O(1)	O(1)	Proportional to entities	Low
Vector Retrieval	O(n) brute force; roughly O(log n) with an ANN index	O(1) + embedding call	Linear	Fixed (top-k)

For high-throughput bots (1000+ concurrent conversations), pre-compute embeddings asynchronously and cache entity memory in Redis to avoid database round-trips per turn.
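
Asynchronous pre-computation could be sketched with an `asyncio` queue and a background worker. Everything here is illustrative: `slow_encode` is a placeholder for a blocking encoder call, the dict stands in for wherever embeddings are stored, and a real service would keep the worker running rather than cancelling it.

```python
import asyncio


def slow_encode(text: str) -> list[float]:
    """Placeholder for a blocking encoder call such as SentenceTransformer.encode."""
    return [float(len(text)), float(text.count(" "))]


async def embedding_worker(queue: asyncio.Queue, store: dict[int, list[float]]) -> None:
    """Drain (turn_id, text) jobs and encode them off the event loop."""
    while True:
        turn_id, text = await queue.get()
        try:
            # to_thread keeps the blocking encode from stalling other conversations.
            store[turn_id] = await asyncio.to_thread(slow_encode, text)
        finally:
            queue.task_done()


async def main() -> dict[int, list[float]]:
    queue: asyncio.Queue = asyncio.Queue()
    store: dict[int, list[float]] = {}
    worker = asyncio.create_task(embedding_worker(queue, store))
    for turn_id, text in enumerate(["hello there", "where is my order"]):
        queue.put_nowait((turn_id, text))  # the request handler returns immediately
    await queue.join()  # only for the demo: wait until all jobs are embedded
    worker.cancel()
    return store


embeddings = asyncio.run(main())
print(embeddings)
```

The request path only enqueues work, so response latency no longer includes the embedding step; the cost is that a turn becomes retrievable slightly after it is stored.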

The one thing to remember: Production conversation memory is a hybrid system — combining a recent-turn window for immediate context, entity tracking for structured data, and either summarization or vector retrieval for long-term recall — all backed by persistent storage with proper TTLs and privacy controls.
