Vector Databases — Deep Dive
Why This Problem Is Hard
Given N vectors in d-dimensional space, find the k nearest vectors to a query. Sounds like a solved problem. It’s not — and understanding why it’s hard shapes every architectural decision in the space.
The curse of dimensionality: in high-dimensional space (512, 768, 1536 dimensions), the concept of “close” breaks down. The ratio of the nearest neighbor distance to the farthest neighbor distance approaches 1 as dimensionality increases. Everything becomes equally distant. Spatial indexing structures like KD-trees, which work well in 2D-10D, degrade to linear scan in >20D.
Exact k-NN in 1536 dimensions on 100 million vectors isn’t viable at query time — not because computers are slow, but because there’s no mathematical shortcut that gets you the provably correct answer faster than checking every vector.
The entire field of ANN (approximate nearest neighbor) is about intelligently giving up exactness to gain speed.
HNSW: The Algorithm Running Most Production Search
Hierarchical Navigable Small World graphs, published by Yury Malkov and Dmitry Yashunin in 2016, are the reason modern vector search is fast. Understanding HNSW means understanding your database’s performance characteristics.
The Core Idea
HNSW is based on skip lists layered over proximity graphs.
A proximity graph connects each node to its nearest neighbors. Searching it means: start somewhere, greedily hop to whichever neighbor is closest to the query, repeat until you can’t improve. This is NSW (Navigable Small World). Problem: long-distance traversals are slow because every hop covers small distance.
HNSW’s fix: build multiple layers. The top layer has few nodes and only long-range connections (think: express highway). Lower layers are denser, with short-range connections. Search starts at the top layer, descends when the search stops improving at that level.
Layer 2 (sparse): 5 nodes, connects distant clusters
Layer 1 (medium): 50 nodes, regional connections
Layer 0 (dense): 500 nodes, local neighborhood
Insertion is probabilistic — each new vector is assigned a maximum layer drawn from an exponentially decaying distribution. Most vectors only live in layer 0. ~1/e end up in layer 1. ~1/e² in layer 2.
The Parameters That Actually Matter
- M (default 16): max connections per node. Higher M = better recall, more memory, slower build. For most embedding models, M=16 to 32 is the sweet spot.
- ef_construction (default 200): size of the candidate list during build. Higher = better index quality, slower inserts. Set once at index build time.
- ef (search-time): candidate list size during query. Unlike ef_construction, you can change this at query time. Larger ef = better recall, slower search. This is your recall/latency knob.
In Qdrant’s benchmarks, M=16 ef=128 gives ~99% recall at ~3ms latency on 1M vectors. Halving ef drops latency to ~1.5ms at ~96% recall. Those tradeoffs are real and you pick them explicitly.
HNSW’s Weakness
HNSW builds everything in memory. An index for 100M vectors at 1536 dimensions needs roughly 400GB RAM just for the vectors (4 bytes × 1536 × 100M), plus overhead for graph edges. This is why Pinecone started their serverless product — they had to split hot/cold storage to make large-scale HNSW economical.
Quantization: Making Vectors Small
Quantization reduces memory by representing each dimension with fewer bits.
Scalar quantization (SQ8): Instead of float32 (4 bytes/dim), use int8 (1 byte/dim). You map the floating point range to [−128, 127]. 4x memory reduction with <1% recall loss in practice. Almost everyone should enable this.
Product quantization (PQ): Split the vector into M sub-vectors, quantize each sub-vector independently using a learned codebook. A 1536-dim float32 vector (6144 bytes) becomes M × log2(k) bits — often 96-192 bytes. 30-60x compression. Recall loss is significant (~5-10%) but acceptable for first-pass retrieval in a re-ranking pipeline.
Binary quantization: Each dimension becomes a single bit. Used by Weaviate for HNSW with binary index. Only works well for embeddings trained to be binary-quantizable (Cohere’s embed-v3, Matryoshka embeddings). 32x compression, ~95% recall with oversampling.
The real-world pattern: use binary quantization or PQ for the initial ANN search (fast, cheap), then re-score the top-K candidates using the original float32 vectors (accurate). Qdrant calls this “quantization with rescoring” and it’s how you serve billion-scale search on commodity hardware.
Filtered ANN: Still Unsolved
This is the unglamorous hard problem nobody talks about enough.
You want: “Find the 10 documents most similar to this query, where author = 'alice' and date > 2024-01-01.”
Pre-filtering approach: apply the metadata filter first, then search the ANN index over the filtered subset. Problem: your HNSW index was built over all vectors. Searching a sparse subset breaks graph connectivity — you fall into disconnected regions and miss good results.
Post-filtering approach: run ANN search for top-K, then apply filters. Problem: if only 0.1% of your corpus matches the filter, you need to retrieve top-10000 to get 10 filtered results.
Solutions being used in production:
Qdrant’s HNSW with payload index: Maintains a separate inverted index on metadata fields. During graph traversal, nodes that don’t match the filter are skipped while still using them for navigation. Best of both worlds but complex to implement.
Weaviate’s hybrid storage: Uses both inverted index (for exact match + BM25) and HNSW. Filter evaluation happens at the inverted index level; vector search happens in parallel; results are merged.
Pinecone namespaces: Shard your data by the most common filter (e.g., one namespace per customer). Crude but effective — search within a shard is effectively pre-filtered without graph fragmentation.
Milvus partition pruning: Partition the collection on high-cardinality filter fields, search only relevant partitions.
None of these is perfect. Filtered ANN remains a research problem for the general case.
Embedding Model Choice Matters More Than Database Choice
Most engineers obsess over which vector database to use. The dirtiest secret in the space: your embedding model choice has a bigger impact on search quality than your database choice.
MTEB (Massive Text Embedding Benchmark, maintained by Hugging Face) is the standard benchmark. As of early 2026:
text-embedding-3-large(OpenAI, 3072 dims): strong general purpose, expensive at scaleembed-v3(Cohere): best-in-class for retrieval, supports binary quantization nativelygte-large-en-v1.5(Alibaba, open-source): 435M params, competitive with commercial modelsnomic-embed-text-v1.5(open-source): Matryoshka training — you can slice dimensions. Use 256 dims for 6x speedup with ~3% recall loss.
Matryoshka Representation Learning (MRL) is worth knowing: models trained with MRL embed information in a “nested” structure — the first 64 dimensions capture the most important semantics, the first 128 capture more, etc. You can truncate the embedding at query time. This is how OpenAI’s text-embedding-3 series works and why you can specify dimensions=256 in the API call.
Chunking Strategy Impacts Everything
People get the database right and blow up their search quality with bad chunking.
Fixed-size chunking (e.g., 512 tokens with 50-token overlap): simple, works okay. The overlap prevents cutting off semantically connected sentences.
Semantic chunking: split at sentence boundaries, then merge sentences into chunks until a similarity threshold drops. Langchain and LlamaIndex both implement this. Produces variable-size chunks that respect natural boundaries. 10-15% recall improvement on typical RAG benchmarks vs. fixed-size.
Document hierarchy (parent-child): embed small chunks (128 tokens) for retrieval precision, but retrieve the parent chunk (512 tokens) for the LLM context. Reduces noise in the retrieved context. Implemented natively in LlamaIndex as ParentDocumentRetriever.
The real failure mode: embedding entire long documents as a single vector. A 10,000-word document’s embedding is an average of everything in it — the specific relevant paragraph gets diluted to nothing. Always chunk.
Production Patterns
Multi-vector search
Instead of one vector per document, store multiple: one for the title, one per paragraph, one for the summary. Query returns whichever vector matched, plus the parent document. Used by ColBERT — every token gets its own vector, then MaxSim scoring finds the document where the best token matches are.
Hybrid search in practice
BM25 + vector search combined via Reciprocal Rank Fusion (RRF):
combined_score = Σ 1 / (k + rank_i)
where k=60 is a constant and rank_i is the item’s rank in each retrieval system. Simple, parameter-free, and outperforms both pure vector and pure keyword on most benchmarks by 5-10%.
When to reach for a vector database vs. not
Use a vector database when:
- Corpus > 100K documents
- Queries are semantic (meaning-based, not exact match)
- Real-time latency requirements (<100ms)
- You need metadata filtering at scale
Use pgvector when:
- Corpus < 100K documents
- You want to avoid operational complexity
- You’re already on Postgres and latency is acceptable
- You need SQL joins with your vectors
Use Elasticsearch/OpenSearch when:
- You’re already running Elasticsearch for keyword search
- You want hybrid search without a new service
- Your team knows ES operations
One Thing to Remember
HNSW is brilliant but it’s still a graph in RAM. When people talk about “billion-scale vector search,” they’re lying to you unless they’re also talking about quantization, disk-based indices, or sharding strategies — because the math doesn’t work otherwise.
See Also
- Cloud Computing Cloud computing explained without jargon: why your photos, files, and favorite apps actually live on someone else's computer — and why that's a good thing.
- Containerization Why does software that works on your computer break on everyone else's? Containers fix that — and they're why Netflix can deploy 100 updates a day without the site going down.
- Activation Functions Why neural networks need these tiny mathematical functions — and how ReLU's simplicity accidentally made deep learning possible.
- Ai Agents Architecture How AI systems go from answering questions to actually doing things — the design patterns that turn language models into autonomous agents that browse, code, and plan.
- Ai Agents ChatGPT answers questions. AI agents actually do things — browse the web, write code, send emails, and keep going until the job is done. Here's the difference.