Vector Store Patterns in Python — Deep Dive

Vector stores sit at the center of most retrieval-augmented generation (RAG) systems. Getting the architecture right affects latency, relevance, cost, and maintainability. This guide covers production patterns that go beyond basic insert-and-query.

1) Index algorithm selection

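HNSW (Hierarchical Navigable Small World) is the default index in most vector stores. It builds a multi-layer graph in which each node links to nearby vectors, and a query descends from the sparse top layer to the dense bottom layer. Query time grows roughly logarithmically with collection size, and recall is high but approximate, not exact.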

Key tuning parameters:

  • ef_construction — higher values build a better graph but slow indexing. Start at 200 for production.
  • M — number of connections per node. Higher M improves recall but increases memory. 16-32 is typical.
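  • ef_search — higher values at query time improve recall at the cost of latency. Tune based on your latency budget.

Parameter names vary by store. In Qdrant, for example, ef_construction is spelled ef_construct, and the query-time value is passed per search as hnsw_ef:

# Create a Qdrant collection with explicit HNSW settings
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, HnswConfigDiff, VectorParams

client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="articles",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(
        m=16,                       # connections per node
        ef_construct=200,           # Qdrant's name for ef_construction
        full_scan_threshold=10000,  # size threshold below which Qdrant prefers a full scan
    ),
)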
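One way to act on "tune based on your latency budget" is to benchmark several ef_search values offline and keep the most accurate setting that still fits the budget. The sketch below assumes such measurements already exist; the EfMeasurement record, the pick_ef_search helper, and the sample numbers are hypothetical and not part of any store's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EfMeasurement:
    """One offline benchmark run at a given ef_search (hypothetical record type)."""
    ef_search: int
    recall: float          # recall against exact (flat) search, 0.0-1.0
    p99_latency_ms: float  # 99th percentile query latency


def pick_ef_search(
    measurements: list[EfMeasurement],
    latency_budget_ms: float,
) -> Optional[EfMeasurement]:
    """Return the highest-recall measurement within the latency budget.

    Ties on recall go to the lower-latency setting. Returns None when
    no measured setting fits the budget.
    """
    within_budget = [m for m in measurements if m.p99_latency_ms <= latency_budget_ms]
    if not within_budget:
        return None
    return max(within_budget, key=lambda m: (m.recall, -m.p99_latency_ms))


# Illustrative numbers only; measure these on your own data and hardware.
runs = [
    EfMeasurement(ef_search=32, recall=0.90, p99_latency_ms=4.0),
    EfMeasurement(ef_search=64, recall=0.95, p99_latency_ms=7.0),
    EfMeasurement(ef_search=128, recall=0.98, p99_latency_ms=13.0),
    EfMeasurement(ef_search=256, recall=0.99, p99_latency_ms=26.0),
]
best = pick_ef_search(runs, latency_budget_ms=15.0)
```

With a 15 ms budget the helper picks the ef_search=128 run, because the 256 run is more accurate but too slow.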

IVF (Inverted File) partitions vectors into clusters. Faster to build than HNSW but lower recall unless you search many clusters. Used by FAISS for billion-scale datasets.

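Flat (brute force) compares the query against every vector. It gives perfect recall, but query time grows linearly with collection size, so it is usually practical only for small collections (roughly up to tens of thousands of vectors, depending on dimensions and hardware). Good for testing.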
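Because flat search is exact, it can also serve as ground truth when checking how much recall an HNSW or IVF index gives up. The following is a minimal pure-Python sketch of that idea; the function names and the tiny two-dimensional dataset are made up for illustration, and a real setup would use a vectorized library instead of Python loops.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def flat_search(query: list[float], vectors: dict[str, list[float]], k: int) -> list[str]:
    """Exact top-k by scoring every stored vector: O(n) comparisons per query."""
    ranked = sorted(
        vectors,
        key=lambda doc_id: cosine_similarity(query, vectors[doc_id]),
        reverse=True,
    )
    return ranked[:k]


def recall_at_k(approx_ids: list[str], exact_ids: list[str]) -> float:
    """Fraction of the exact top-k that the approximate index also returned."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Tiny hypothetical dataset; real vectors would come from an embedding model.
vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
exact = flat_search([1.0, 0.0], vectors, k=2)
# Pretend an approximate index returned "a" and "c" for the same query.
ann_recall = recall_at_k(["a", "c"], exact)
```

Here the exact top 2 is "a" then "b", so the pretend approximate result scores a recall of 0.5.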

2) Multi-tenancy patterns

Production systems serve multiple users or organizations. Three approaches:

Collection per tenant — each tenant gets a separate collection. Clean isolation but management overhead scales linearly.

Metadata filtering — single collection with a tenant_id field. Filter at query time. Simple to manage but requires index support for efficient filtering.

Namespace partitioning — some stores (Pinecone) support namespaces as a first-class concept. Vectors in different namespaces are fully isolated at the storage level.

# Metadata filtering approach with Chroma
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("shared_docs")

# Ingest with tenant metadata
collection.add(
    ids=["doc1"],
    documents=["Revenue grew 15% in Q3"],
    metadatas=[{"tenant_id": "acme_corp", "doc_type": "financial"}],
)

# Query with tenant filter
results = collection.query(
    query_texts=["quarterly revenue growth"],
    n_results=5,
    where={"tenant_id": "acme_corp"},
)
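With metadata filtering, a single query that forgets the filter can leak data across tenants. One hedge is a thin wrapper that injects the tenant filter on every call. The sketch below assumes a Chroma-style query method that takes a where argument and combines multiple conditions with "$and"; TenantScopedCollection and RecordingCollection are hypothetical names, and the latter is only a stand-in so the example runs without a database.

```python
from typing import Any, Optional


class TenantScopedCollection:
    """Hypothetical wrapper that forces a tenant_id filter onto every query."""

    def __init__(self, collection: Any, tenant_id: str) -> None:
        self._collection = collection
        self._tenant_id = tenant_id

    def query(self, where: Optional[dict] = None, **kwargs: Any) -> Any:
        tenant_filter = {"tenant_id": self._tenant_id}
        if where:
            # Combine the caller's filter with the tenant filter (Chroma-style "$and").
            combined = {"$and": [tenant_filter, where]}
        else:
            combined = tenant_filter
        return self._collection.query(where=combined, **kwargs)


class RecordingCollection:
    """Stand-in for a real collection; it only records the arguments it receives."""

    def __init__(self) -> None:
        self.last_call: dict = {}

    def query(self, **kwargs: Any) -> dict:
        self.last_call = kwargs
        return {"ids": [[]], "documents": [[]]}


backend = RecordingCollection()
acme = TenantScopedCollection(backend, tenant_id="acme_corp")
acme.query(
    query_texts=["quarterly revenue growth"],
    n_results=5,
    where={"doc_type": "financial"},
)
```

Application code then only ever holds the scoped object, so the tenant condition cannot be dropped by accident. A caller-supplied tenant_id for a different tenant is ANDed with the enforced one and therefore matches nothing.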

3) Hybrid retrieval with reciprocal rank fusion

Pure vector search misses exact keyword matches. Pure keyword search misses semantic similarity. Combine both:

def reciprocal_rank_fusion(
    vector_results: list[str],
    keyword_results: list[str],
    k: int = 60,
) -> list[str]:
    scores: dict[str, float] = {}
    for rank, doc_id in enumerate(vector_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    for rank, doc_id in enumerate(keyword_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

RRF is parameter-light and works well in practice. The constant k controls how much weight top positions get; 60 is a common default from the original paper.
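To see what k does, compare a document ranked third in both lists with documents ranked first in only one list. The sketch below repeats the fusion function in compact form so it runs on its own; the document IDs are made up.

```python
def reciprocal_rank_fusion(vector_results, keyword_results, k=60):
    """Sum 1 / (k + rank) over both lists, with rank starting at 1."""
    scores = {}
    for results in (vector_results, keyword_results):
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


vector_hits = ["a", "b", "c"]   # "c" is third in both lists
keyword_hits = ["d", "e", "c"]

fused_default = reciprocal_rank_fusion(vector_hits, keyword_hits)        # k=60
fused_sharp = reciprocal_rank_fusion(vector_hits, keyword_hits, k=0)
```

With k=60 the scores are nearly flat across ranks, so "c" wins by appearing in both lists (2/63 against 1/61 for "a" and "d"). With k=0 the top position dominates, and "a" and "d" (score 1.0 each) move ahead of "c" (score 2/3).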

4) Re-ranking pipeline

Initial retrieval returns candidates. A re-ranker scores them more carefully:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, collection, top_k: int = 20, final_k: int = 5):
    # Stage 1: fast vector retrieval
    candidates = collection.query(query_texts=[query], n_results=top_k)
    docs = candidates["documents"][0]

    # Stage 2: cross-encoder re-ranking
    pairs = [(query, doc) for doc in docs]
    scores = reranker.predict(pairs)

    ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:final_k]]

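Cross-encoders are 10-100x slower than bi-encoders, because they run a full model pass over every (query, document) pair at query time instead of comparing precomputed vectors, but they are significantly more accurate. The two-stage pattern (fast retrieval → precise re-ranking) gives you both speed and relevance.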

5) Embedding model selection and benchmarking

Do not assume one embedding model fits all use cases. Benchmark on your data:

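def evaluate_retrieval(queries, relevant_docs, collection, model_name):
    """Return hit rate@10: the share of queries with at least one expected doc in the top 10."""
    # model_name is not used in the computation; keep it as a label when comparing runs.
    if not queries:
        return 0.0
    hits = 0
    for query, expected in zip(queries, relevant_docs):
        results = collection.query(query_texts=[query], n_results=10)
        retrieved = set(results["ids"][0])
        if any(doc_id in retrieved for doc_id in expected):
            hits += 1
    return hits / len(queries)  # hit rate@10 (a query counts once any expected doc appears)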
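The function above counts a query as a success once any expected document is retrieved. Recall@k and mean reciprocal rank (MRR) add two further views: how many of the relevant documents come back, and how high the first one ranks. A minimal sketch with made-up document IDs follows; the helper names are illustrative, not from a library.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for position, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def mean(values: list[float]) -> float:
    """Arithmetic mean, defined as 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


# Hypothetical results for two queries: (retrieved IDs in rank order, relevant IDs).
runs = [
    (["d3", "d1", "d7"], {"d1", "d2"}),
    (["d5", "d9", "d4"], {"d4"}),
]
mean_recall = mean([recall_at_k(r, rel, k=3) for r, rel in runs])
mrr = mean([reciprocal_rank(r, rel) for r, rel in runs])
```

In this toy run the first query finds one of its two relevant documents (at rank 2) and the second finds its only relevant document at rank 3, giving a mean recall@3 of 0.75 and an MRR of 5/12.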

Current strong options for English: OpenAI text-embedding-3-large, Cohere embed-v3, or open-source bge-large-en-v1.5. For multilingual workloads, multilingual-e5-large performs well across 100+ languages.

6) Cost and scaling considerations

Vector stores have three cost axes: storage (per vector), compute (per query), and ingestion (per upsert).

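  • FAISS on a single machine can handle ~10M vectors at 1536 dimensions, but an uncompressed float32 index of that size needs about 61 GB of RAM (10M × 1536 × 4 bytes); fitting into ~25GB requires compression such as scalar or product quantization. Free but requires ops.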
  • Pinecone charges per pod and per query. At scale (>1M vectors), costs can exceed $100/month but you get zero ops.
  • pgvector is practically free if you already run PostgreSQL, but query performance degrades above ~500k vectors without careful indexing.
  • Qdrant on disk supports large collections with limited RAM by memory-mapping the index. Good price/performance for 1M-100M vectors.
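A quick back-of-the-envelope check helps when sizing any of these options: raw vector storage is simply count × dimensions × bytes per value. The sketch below is only that arithmetic; it ignores index overhead (for example HNSW graph links) and metadata, and the helper name is made up.

```python
def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold the raw vectors, excluding index overhead and metadata."""
    return num_vectors * dims * bytes_per_value


GB = 10**9  # decimal gigabytes

float32_gb = raw_vector_bytes(10_000_000, 1536) / GB                     # 4 bytes per value
float16_gb = raw_vector_bytes(10_000_000, 1536, bytes_per_value=2) / GB  # half precision
int8_gb = raw_vector_bytes(10_000_000, 1536, bytes_per_value=1) / GB     # 8-bit scalar quantization
```

For 10M vectors at 1536 dimensions this comes to 61.44 GB at float32, 30.72 GB at float16, and 15.36 GB at int8, so RAM figures well below 61 GB for that workload imply some form of compression.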

7) Data lifecycle management

Production stores need:

  • TTL or expiration — remove stale vectors. Some stores support TTL natively; others require a background cleanup job.
  • Version tracking — store the embedding model version as metadata. When you upgrade models, re-embed and replace.
  • Backup and restore — FAISS indexes are files you can snapshot. Managed services provide their own backup mechanisms.
  • Monitoring — track query latency p50/p99, recall metrics on a test set, and index size growth.
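For stores without native TTL, the cleanup job mostly comes down to selecting which IDs to delete or re-embed. The sketch below assumes each vector carries created_at and embedding_model metadata; those field names, the model labels, and the select_stale_ids helper are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def select_stale_ids(
    records: dict[str, dict],
    now: datetime,
    max_age: timedelta,
    current_model: str,
) -> list[str]:
    """Return IDs whose vectors are past their TTL or were built by another model.

    Each record is assumed to carry `created_at` (timezone-aware datetime) and
    `embedding_model` metadata; both field names are hypothetical.
    """
    stale = []
    for doc_id, meta in records.items():
        expired = now - meta["created_at"] > max_age
        outdated = meta["embedding_model"] != current_model
        if expired or outdated:
            stale.append(doc_id)
    return stale


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
records = {
    "doc1": {"created_at": now - timedelta(days=10), "embedding_model": "model-v2"},
    "doc2": {"created_at": now - timedelta(days=400), "embedding_model": "model-v2"},
    "doc3": {"created_at": now - timedelta(days=5), "embedding_model": "model-v1"},
}
to_refresh = select_stale_ids(
    records, now, max_age=timedelta(days=365), current_model="model-v2"
)
```

Here doc2 is selected for age and doc3 for its older model version. A background job would then pass the returned IDs to the store's delete or re-embed path.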

8) Common production pitfalls

  • Mixing embedding models — vectors from different models are incompatible. Always re-embed when switching models.
  • Ignoring chunk overlap — without overlap, relevant information at chunk boundaries gets lost.
  • No metadata filtering — relying purely on vector similarity leads to false matches from wrong contexts.
  • Stale indexes — HNSW indexes do not automatically rebalance after large deletes. Rebuild periodically.
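The chunk-overlap pitfall is easy to illustrate with a sliding window over words. This is a minimal sketch that assumes whitespace tokenization; production pipelines often split on tokens or sentences instead, and the function name is made up.

```python
def chunk_with_overlap(words: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split a word list into windows of chunk_size that share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


text = "revenue grew fifteen percent in the third quarter driven by cloud"
chunks = chunk_with_overlap(text.split(), chunk_size=5, overlap=2)
```

Each chunk repeats the last two words of the previous one, so a phrase such as "percent in the third quarter" stays intact in at least one chunk even though it straddles a boundary.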

The one thing to remember: Production vector store architecture is a multi-stage pipeline — embedding model choice, index tuning, hybrid retrieval, re-ranking, and data lifecycle management all contribute to retrieval quality and must be benchmarked on your actual data.
