Vector Store Patterns in Python — Deep Dive
Vector stores sit at the center of most retrieval-augmented generation (RAG) systems. Getting the architecture right affects latency, relevance, cost, and maintainability. This guide covers production patterns that go beyond basic insert-and-query.
1) Index algorithm selection
HNSW (Hierarchical Navigable Small World) is the default for most stores. It builds a multi-layer graph where each node connects to nearby vectors. Query time grows roughly logarithmically with collection size, and recall is high when the graph is well tuned.
Key tuning parameters:
- ef_construction: higher values build a better graph but slow indexing. Start at 200 for production. Qdrant names this parameter ef_construct.
- M: number of connections per node. Higher M improves recall but increases memory. 16-32 is typical.
- ef_search: higher values at query time improve recall at the cost of latency. Tune based on your latency budget.
import qdrant_client
from qdrant_client.models import Distance, HnswConfigDiff, VectorParams

client = qdrant_client.QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="articles",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(
        m=16,  # connections per node
        ef_construct=200,  # build-time search width (ef_construction)
        full_scan_threshold=10000,  # below this, scan exhaustively instead of using the graph
    ),
)
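IVF (Inverted File) partitions vectors into clusters and, at query time, scans only the clusters closest to the query. It is faster to build than HNSW but has lower recall unless you search many clusters. FAISS offers IVF variants, often combined with product quantization, for billion-scale datasets.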
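Flat (brute force) scans every vector. It gives perfect recall, but query time grows linearly with collection size, so it is practical only for small collections (roughly under 50k vectors). Good for testing and as a ground-truth baseline when measuring the recall of an approximate index.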
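To make the trade-off between flat and IVF search concrete, here is a toy, pure-Python sketch. It is not how FAISS implements IVF: the centroids are simply the first ten vectors (a real index trains them with k-means), and the helper names `flat_search`, `build_ivf`, `ivf_search`, and `recall` are hypothetical. It only illustrates that probing more clusters moves IVF results toward the exact answer.

```python
import math
import random

Vector = list[float]


def l2(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_search(query: Vector, vectors: list[Vector], k: int) -> list[int]:
    """Exact search: score every vector and return the k nearest indices."""
    order = sorted(range(len(vectors)), key=lambda i: l2(query, vectors[i]))
    return order[:k]


def build_ivf(vectors: list[Vector], centroids: list[Vector]) -> list[list[int]]:
    """Put each vector index into the inverted list of its nearest centroid."""
    lists: list[list[int]] = [[] for _ in centroids]
    for i, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        lists[nearest].append(i)
    return lists


def ivf_search(
    query: Vector,
    vectors: list[Vector],
    centroids: list[Vector],
    lists: list[list[int]],
    k: int,
    nprobe: int,
) -> list[int]:
    """Approximate search: scan only the nprobe clusters closest to the query."""
    by_distance = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [i for c in by_distance[:nprobe] for i in lists[c]]
    candidates.sort(key=lambda i: l2(query, vectors[i]))
    return candidates[:k]


def recall(approx: list[int], exact: list[int]) -> float:
    """Fraction of the exact neighbours that the approximate search found."""
    return len(set(approx) & set(exact)) / len(exact)


rng = random.Random(0)
vectors = [[rng.random() for _ in range(8)] for _ in range(500)]
centroids = vectors[:10]  # toy assumption; real IVF trains centroids with k-means
lists = build_ivf(vectors, centroids)
query = [rng.random() for _ in range(8)]
exact = flat_search(query, vectors, k=10)

for nprobe in (1, 3, len(centroids)):
    approx = ivf_search(query, vectors, centroids, lists, k=10, nprobe=nprobe)
    print(f"nprobe={nprobe} recall@10={recall(approx, exact):.2f}")
```

Probing every cluster scans the whole collection, so it matches the flat result; smaller `nprobe` values trade recall for less work. The same recall measurement applies when tuning `ef_search` for HNSW.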
2) Multi-tenancy patterns
Production systems serve multiple users or organizations. Three approaches:
Collection per tenant — each tenant gets a separate collection. Clean isolation but management overhead scales linearly.
Metadata filtering — single collection with a tenant_id field. Filter at query time. Simple to manage but requires index support for efficient filtering.
Namespace partitioning — some stores (Pinecone) support namespaces as a first-class concept. Vectors in different namespaces are fully isolated at the storage level.
# Metadata filtering approach with Chroma
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("shared_docs")

# Ingest with tenant metadata
collection.add(
    ids=["doc1"],
    documents=["Revenue grew 15% in Q3"],
    metadatas=[{"tenant_id": "acme_corp", "doc_type": "financial"}],
)

# Query with tenant filter
results = collection.query(
    query_texts=["quarterly revenue growth"],
    n_results=5,
    where={"tenant_id": "acme_corp"},
)
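With metadata filtering, the main risk is a query that forgets the tenant filter. One way to reduce that risk is to build every `where` clause through a single helper. The sketch below is an assumption about how you might structure this, not part of Chroma: `tenant_where` is a hypothetical name, and it relies on Chroma's `$and` operator for combining several conditions.

```python
from typing import Any, Optional


def tenant_where(tenant_id: str, extra: Optional[dict[str, Any]] = None) -> dict[str, Any]:
    """Build a Chroma-style `where` filter that always includes the tenant.

    Hypothetical helper: raises instead of silently querying across tenants.
    """
    if not tenant_id:
        raise ValueError("tenant_id is required for every query")

    conditions: list[dict[str, Any]] = [{"tenant_id": tenant_id}]
    for key, value in (extra or {}).items():
        if key == "tenant_id":
            raise ValueError("callers may not override tenant_id")
        conditions.append({key: value})

    # A single condition is passed as-is; several are combined with $and.
    if len(conditions) == 1:
        return conditions[0]
    return {"$and": conditions}


# Intended use with the collection from the example above:
#   collection.query(
#       query_texts=["quarterly revenue growth"],
#       n_results=5,
#       where=tenant_where("acme_corp", {"doc_type": "financial"}),
#   )
print(tenant_where("acme_corp", {"doc_type": "financial"}))
```

Routing all queries through one function also gives you a single place to add logging or access checks later.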
3) Hybrid retrieval with reciprocal rank fusion
Pure vector search misses exact keyword matches. Pure keyword search misses semantic similarity. Combine both:
def reciprocal_rank_fusion(
    vector_results: list[str],
    keyword_results: list[str],
    k: int = 60,
) -> list[str]:
    """Fuse two ranked lists of document IDs; the best combined score comes first."""
    scores: dict[str, float] = {}
    # enumerate() is 0-based, so rank + 1 is the 1-based position in each list
    for rank, doc_id in enumerate(vector_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    for rank, doc_id in enumerate(keyword_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
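RRF is parameter-light and works well in practice because it uses only rank positions, so the raw scores of the two retrievers never need to be normalized against each other. The constant k controls how much weight top positions get: a smaller k favors the top ranks more strongly, while a larger k flattens the differences. 60 is a common default from the original paper.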
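A small worked example can make the scoring easier to follow. The sketch below generalizes the function above to any number of ranked lists and adds optional per-list weights. The weighting is an extension that is not part of the original RRF formulation, and `fuse_rankings` is a hypothetical name.

```python
from typing import Optional


def fuse_rankings(
    rankings: list[list[str]],
    weights: Optional[list[float]] = None,
    k: int = 60,
) -> list[tuple[str, float]]:
    """Reciprocal rank fusion over several ranked lists of document IDs.

    Each list contributes weight / (k + position) per document, with
    positions starting at 1. Ties are broken by document ID so the
    output order is deterministic.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    if len(weights) != len(rankings):
        raise ValueError("need exactly one weight per ranking")

    scores: dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for position, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + position)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_hits = ["a", "b", "c"]
keyword_hits = ["c", "a", "d"]

# "a" scores 1/61 + 1/62 and "c" scores 1/63 + 1/61, so "a" narrowly wins.
for doc_id, score in fuse_rankings([vector_hits, keyword_hits]):
    print(doc_id, round(score, 5))
```

Documents that appear in both lists ("a" and "c") outrank documents that appear in only one, even when the single-list document was ranked higher in its own list.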
4) Re-ranking pipeline
Initial retrieval returns candidates. A re-ranker scores them more carefully:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def retrieve_and_rerank(query: str, collection, top_k: int = 20, final_k: int = 5):
    # Stage 1: fast vector retrieval
    candidates = collection.query(query_texts=[query], n_results=top_k)
    docs = candidates["documents"][0]
    if not docs:
        return []  # nothing retrieved, so nothing to re-rank

    # Stage 2: cross-encoder re-ranking
    pairs = [(query, doc) for doc in docs]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:final_k]]
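Cross-encoders are much slower than bi-encoders (often by one to two orders of magnitude) because every query-document pair needs its own forward pass, but they are usually more accurate. The two-stage pattern (fast retrieval → precise re-ranking) gives you both speed and relevance.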
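Loading a cross-encoder in unit tests is slow, so it can help to inject the scorer. The sketch below keeps the same two-stage shape but takes the scoring function as an argument. The names `rerank` and `token_overlap` are hypothetical, and the token-overlap scorer is only a stand-in for tests, not a substitute for a real re-ranker.

```python
from typing import Callable, Sequence

Pair = tuple[str, str]


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[list[Pair]], Sequence[float]],
    final_k: int = 5,
) -> list[tuple[str, float]]:
    """Score (query, doc) pairs with the injected scorer and keep the best final_k."""
    unique = list(dict.fromkeys(candidates))  # drop duplicate docs, keep order
    if not unique:
        return []
    scores = score_pairs([(query, doc) for doc in unique])
    ranked = sorted(zip(unique, scores), key=lambda pair: pair[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked[:final_k]]


def token_overlap(pairs: list[Pair]) -> list[float]:
    """Test stand-in: fraction of query tokens that also occur in the document."""
    results = []
    for query, doc in pairs:
        query_tokens = set(query.lower().split())
        doc_tokens = set(doc.lower().split())
        results.append(len(query_tokens & doc_tokens) / len(query_tokens) if query_tokens else 0.0)
    return results


docs = [
    "Revenue grew 15% in Q3",
    "The office moved to Berlin",
    "Revenue grew 15% in Q3",  # duplicate candidate
    "Q3 revenue summary",
]
for doc, score in rerank("q3 revenue", docs, token_overlap, final_k=2):
    print(f"{score:.2f}  {doc}")
```

In production you would pass `reranker.predict` from the example above as `score_pairs`; removing duplicates first avoids paying for the same forward pass twice.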
5) Embedding model selection and benchmarking
Do not assume one embedding model fits all use cases. Benchmark on your data:
def evaluate_retrieval(queries, relevant_docs, collection, model_name):
    """Return hit rate@10: the share of queries with at least one expected doc in the top 10.

    model_name is unused here; keep it to label results when comparing models.
    """
    if not queries:
        return 0.0
    hits = 0
    for query, expected in zip(queries, relevant_docs):
        results = collection.query(query_texts=[query], n_results=10)
        retrieved = results["ids"][0]
        if any(doc_id in retrieved for doc_id in expected):
            hits += 1
    return hits / len(queries)  # hit rate@10
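The function above counts a query as a success when any expected document shows up in the top 10, which is a hit rate. If each query has several relevant documents, recall@k and mean reciprocal rank (MRR) give a fuller picture. The sketch below computes all three from already-retrieved ID lists, so it does not depend on any particular store; the function names are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / position of the first relevant result, or 0.0 if none was retrieved."""
    for position, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def summarize(runs: list[tuple[list[str], set[str]]], k: int = 10) -> dict[str, float]:
    """Average hit rate, recall@k, and MRR over (retrieved_ids, relevant_ids) pairs."""
    if not runs:
        raise ValueError("need at least one query to evaluate")
    n = len(runs)
    return {
        "hit_rate": sum(1 for got, rel in runs if set(got[:k]) & rel) / n,
        "recall": sum(recall_at_k(got, rel, k) for got, rel in runs) / n,
        "mrr": sum(reciprocal_rank(got[:k], rel) for got, rel in runs) / n,
    }


runs = [
    (["d3", "d1", "d9"], {"d1", "d2"}),  # one of two relevant docs, found at position 2
    (["d7", "d8"], {"d4"}),              # relevant doc missed entirely
]
print(summarize(runs))
```

Comparing embedding models then means building one collection per model, collecting `results["ids"][0]` for each query, and passing the lists to `summarize`.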
Strong options for English at the time of writing: OpenAI text-embedding-3-large, Cohere embed-v3, or open-source bge-large-en-v1.5. For multilingual workloads, multilingual-e5-large performs well across roughly 100 languages.
6) Cost and scaling considerations
Vector stores have three cost axes: storage (per vector), compute (per query), and ingestion (per upsert).
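- FAISS on a single machine can hold ~10M vectors at 1536 dimensions, but raw float32 storage alone is about 61GB (10M × 1536 × 4 bytes); quantization or dimensionality reduction brings that down. Free but requires ops.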
- Pinecone charges per pod and per query. At scale (>1M vectors), costs can exceed $100/month but you get zero ops.
- pgvector is practically free if you already run PostgreSQL, but query performance degrades above ~500k vectors without careful indexing.
- Qdrant on disk supports large collections with limited RAM by memory-mapping the index. Good price/performance for 1M-100M vectors.
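Storage cost is easy to estimate before choosing a store, because raw vector size is just count × dimensions × bytes per value. The sketch below does that arithmetic; `raw_vector_bytes` is a hypothetical helper, and it deliberately ignores index overhead (for example HNSW graph links), metadata, and replication, which all add to the total.

```python
def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold the raw vectors only (no index, metadata, or replicas)."""
    return num_vectors * dims * bytes_per_value


def as_gb(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 1e9


# 10M vectors at 1536 dimensions under three common storage precisions.
for label, width in (("float32", 4), ("float16", 2), ("int8", 1)):
    size = raw_vector_bytes(10_000_000, 1536, width)
    print(f"{label}: {as_gb(size):.2f} GB")
# float32: 61.44 GB
# float16: 30.72 GB
# int8: 15.36 GB
```

Treat the result as a lower bound when sizing RAM or disk.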
7) Data lifecycle management
Production stores need:
- TTL or expiration — remove stale vectors. Some stores support TTL natively; others require a background cleanup job.
- Version tracking — store the embedding model version as metadata. When you upgrade models, re-embed and replace.
- Backup and restore — FAISS indexes are files you can snapshot. Managed services provide their own backup mechanisms.
- Monitoring — track query latency p50/p99, recall metrics on a test set, and index size growth.
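For stores without native TTL, expiration and version tracking can share one background job. The sketch below only decides which IDs to delete and which to re-embed; it is store-agnostic and makes assumptions you would need to adapt: the metadata fields `ingested_at` (Unix seconds) and `embedding_model` are hypothetical names, as is `select_stale_ids`.

```python
import time
from typing import Any, Optional


def select_stale_ids(
    records: dict[str, dict[str, Any]],
    *,
    current_model: str,
    ttl_seconds: float,
    now: Optional[float] = None,
) -> dict[str, list[str]]:
    """Split record IDs into those to delete (expired) and those to re-embed.

    `records` maps document ID to its metadata. Expired records are only
    deleted, never re-embedded, so no embedding work is spent on them.
    """
    now = time.time() if now is None else now
    expired: list[str] = []
    outdated: list[str] = []
    for doc_id, meta in records.items():
        if now - meta["ingested_at"] > ttl_seconds:
            expired.append(doc_id)
        elif meta.get("embedding_model") != current_model:
            outdated.append(doc_id)
    return {"delete": expired, "reembed": outdated}


records = {
    "a": {"ingested_at": 0, "embedding_model": "model-v2"},    # past its TTL
    "b": {"ingested_at": 900, "embedding_model": "model-v1"},  # fresh, old model
    "c": {"ingested_at": 950, "embedding_model": "model-v2"},  # fresh, current model
}
plan = select_stale_ids(records, current_model="model-v2", ttl_seconds=500, now=1000)
print(plan)
```

The job would then pass `plan["delete"]` to the store's delete call and queue `plan["reembed"]` for the embedding pipeline.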
8) Common production pitfalls
- Mixing embedding models — vectors from different models are incompatible. Always re-embed when switching models.
- Ignoring chunk overlap — without overlap, relevant information at chunk boundaries gets lost.
- No metadata filtering — relying purely on vector similarity leads to false matches from wrong contexts.
- Stale indexes — HNSW indexes do not automatically rebalance after large deletes. Rebuild periodically.
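The chunk-overlap pitfall is straightforward to avoid with a sliding window. The sketch below splits a token list into fixed-size chunks where each chunk repeats the last few tokens of the previous one. It assumes the text is already tokenized (here by whitespace, which is a simplification), and `chunk_with_overlap` is a hypothetical name.

```python
def chunk_with_overlap(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split tokens into chunks of up to chunk_size, sharing `overlap` tokens between neighbours."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the input
    return chunks


tokens = "a b c d e f g h i j".split()
for chunk in chunk_with_overlap(tokens, chunk_size=4, overlap=1):
    print(chunk)
# ['a', 'b', 'c', 'd']
# ['d', 'e', 'f', 'g']
# ['g', 'h', 'i', 'j']
```

A sentence that straddles the boundary between two chunks then appears whole in at least one of them, as long as it is shorter than the overlap.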
The one thing to remember: Production vector store architecture is a multi-stage pipeline — embedding model choice, index tuning, hybrid retrieval, re-ranking, and data lifecycle management all contribute to retrieval quality and must be benchmarked on your actual data.