Embeddings — Deep Dive

From Word2Vec's skip-gram objective to contrastive learning, matryoshka embeddings, and HNSW indexing — how embedding systems actually work at the level that matters for building with them.

The Mechanics You Actually Need to Know

Most articles about embeddings explain the intuition (similar things cluster together) and then stop. This one starts where those stop.

Word2Vec: The Training Objective That Changed Everything

The original Word2Vec paper (Mikolov et al., Google, 2013) proposed two architectures. Understanding them matters because they reveal why geometry emerges in the first place.

Skip-gram: Given a center word, predict the surrounding context words.
CBOW (Continuous Bag of Words): Given context words, predict the center word.

Skip-gram handles rare words better, and its objective is still the one most later embedding work is explained in terms of. The training process looks like this:

# Pseudocode for skip-gram training
for each word w in corpus:
    for each context word c within window of w:
        maximize P(c | w) using softmax over vocabulary
        update embedding vectors via gradient descent

The catch: softmax over a full vocabulary (say, 500,000 words) requires computing a normalization constant that sums over all words — which is catastrophically slow. The practical solution was negative sampling: instead of predicting all context words correctly, train the model to distinguish real context words from randomly sampled “noise” words.

# Negative sampling objective
maximize: log σ(v_c · v_w) + k·E[log σ(-v_noise · v_w)]

Where σ is the sigmoid function, v_c and v_w are the context and center word vectors, and k is the number of negative samples (typically 5-20). This turns an intractable softmax into a binary classification problem. It’s fast, and empirically it works.
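To make the objective concrete, here is a minimal NumPy sketch of one gradient step for a single (center, context) pair. The names sgns_step, W_in, and W_out and the learning rate are illustrative assumptions, not taken from the original implementation, and the caller supplies the negative ids (the original Word2Vec draws them from a smoothed unigram distribution).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W_in, W_out, center, context, negatives, lr=0.05):
    """One SGD step on the negative-sampling loss for one (center, context) pair.

    W_in holds center-word vectors, W_out holds context-word vectors.
    Returns the loss measured before the update.
    """
    v_w = W_in[center].copy()                     # center vector, shape (dim,)
    ids = np.concatenate(([context], negatives))  # real context first, then noise
    labels = np.zeros(len(ids))
    labels[0] = 1.0                               # 1 = real context, 0 = noise
    v_out = W_out[ids]                            # shape (1 + k, dim), a copy
    probs = sigmoid(v_out @ v_w)

    # Negated objective: -log s(v_c . v_w) - sum log s(-v_noise . v_w)
    loss = -np.log(probs[0]) - np.sum(np.log(1.0 - probs[1:]))

    grad = probs - labels                         # d loss / d (dot product)
    # np.add.at accumulates correctly even if a noise id is repeated
    np.add.at(W_out, ids, -lr * np.outer(grad, v_w))
    W_in[center] -= lr * (grad @ v_out)
    return loss
```

Each step touches only 1 + k rows of W_out instead of the whole vocabulary, which is where the speedup over a full softmax comes from.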

The vectors that maximize this objective end up encoding semantic and syntactic relationships because the only way to predict context well is to group words that share contexts, which means grouping words with similar meanings and grammatical roles.

GloVe’s Different Angle

Stanford’s GloVe (Pennington et al., 2014) took a completely different approach. Instead of training on a sliding window, it builds a global co-occurrence matrix X, where X_ij counts how often word j appears in the context of word i across the entire corpus.

The training objective minimizes:

J = Σ f(X_ij) (v_i · v_j + b_i + b_j - log X_ij)²

Where f(X_ij) is a weighting function that rises with the co-occurrence count and is capped at 1, so rare, noisy pairs count for less and very frequent co-occurrences (stop words) cannot dominate the sum. The key insight: word vector dot products (plus the bias terms) should approximate the log of the co-occurrence count. This gives GloVe embeddings a nice property: vector differences encode log probability ratios, which is why the king-queen arithmetic works so cleanly.
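A rough NumPy sketch of that loss follows. The defaults x_max=100 and alpha=0.75 are the values reported in the GloVe paper; the function names, the dense matrix X, and the separate context vectors and biases (W_ctx, b_ctx) are assumptions made for illustration.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): grows as (x / x_max)^alpha, then is capped at 1."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_ctx, b, b_ctx):
    """Weighted squared error over the nonzero cells of co-occurrence matrix X."""
    i, j = np.nonzero(X)              # skip zero counts, where log X is undefined
    counts = X[i, j]
    pred = np.sum(W[i] * W_ctx[j], axis=1) + b[i] + b_ctx[j]
    return np.sum(glove_weight(counts) * (pred - np.log(counts)) ** 2)
```

A pair seen once gets a weight of about 0.03, while anything seen 100 times or more gets exactly 1, so stop-word pairs with huge counts contribute no more than moderately frequent ones.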

In practice, Word2Vec and GloVe produce embeddings of similar quality. The field moved on anyway.

Contextual Embeddings: ELMo and Then Transformers

The fundamental limit of Word2Vec and GloVe is that each word gets one embedding regardless of context. “I deposited money at the bank” and “I fished on the river bank” give the word “bank” identical vectors. This is obviously wrong.

ELMo (Peters et al., 2018, Allen Institute for AI) solved this with a bidirectional LSTM that generates context-dependent embeddings. A word’s ELMo embedding is a function of the entire sentence. The name stands for “Embeddings from Language Models” and the core innovation was training on language modeling (predict the next word in one direction and the previous word in the other) and using the internal representations as embeddings.

Then BERT (Devlin et al., Google, 2018) arrived and ELMo became a footnote. BERT uses the transformer encoder architecture with two pre-training objectives:

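Masked Language Modeling (MLM): Select 15% of tokens as prediction targets (80% of those are replaced with [MASK], 10% with a random token, 10% left unchanged) and train the model to predict the originals.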
Next Sentence Prediction (NSP): Given two sentences, predict whether B follows A in the original text.
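The MLM corruption step is easy to sketch in plain Python. This follows the 80/10/10 split described in the BERT paper (of the selected tokens, 80% become [MASK], 10% become a random token, 10% stay as they are); the function name, the list-of-strings tokens, and the vocab argument are assumptions for illustration, and real pipelines work on token ids and skip special tokens.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels); labels[i] is None where nothing is predicted."""
    rng = rng or random.Random()
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                       # not selected as a prediction target
        labels[i] = tok                    # the model must recover the original
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK               # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return inputs, labels
```

The loss is computed only at positions where labels[i] is not None, which is why only a fraction of each sequence provides a training signal.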

The [CLS] token’s final hidden state became the standard “sentence embedding” — though later research (Reimers and Gurevych, 2019) showed this was actually a poor sentence embedding. The [CLS] token is good at classification tasks but terrible at semantic similarity.

Sentence-BERT and Contrastive Learning

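Sentence-BERT (SBERT, Reimers and Gurevych, 2019) fixed this by fine-tuning BERT in a siamese setup, where both sentences pass through the same weights and the pooled outputs are trained directly on similarity objectives. One of those objectives is a triplet loss: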

L = max(||s_a - s_p||² - ||s_a - s_n||² + margin, 0)

Where s_a is an anchor sentence, s_p is a semantically similar (positive) sentence, and s_n is a dissimilar (negative) one. The model learns to pull similar sentences together and push dissimilar ones apart in vector space, until the negative is at least the margin farther from the anchor than the positive.
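The triplet loss is only a few lines of NumPy. In this sketch the default uses squared distances to match the formula above; the squared flag, the function name, and margin=1.0 are illustrative assumptions (the SBERT paper itself states the loss with plain Euclidean distance).

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0, squared=True):
    """Hinge loss that is zero once the negative is `margin` farther than the positive."""
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    if squared:
        d_pos, d_neg = d_pos ** 2, d_neg ** 2
    return np.maximum(d_pos - d_neg + margin, 0.0)

a = np.array([0.0, 0.0])
p = np.array([0.0, 1.0])
easy = triplet_loss(a, p, np.array([3.0, 0.0]))  # negative is far away: loss is 0.0
hard = triplet_loss(a, p, np.array([1.0, 0.0]))  # negative as close as positive: loss is 1.0
```

The easy triplet contributes no gradient at all, which is the mechanical reason hard negatives matter so much.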

This contrastive approach is now the dominant paradigm for embedding models. OpenAI’s text-embedding-ada-002 (2022) and text-embedding-3-* models (2024) use variants of this. Cohere’s embed models, Voyage AI, and the open-source all-MiniLM-L6-v2 (from the sentence-transformers project, hosted on Hugging Face) all share this foundation.

The choice of positive/negative pairs is everything. Hard negatives — pairs that are superficially similar but semantically different — produce much better embeddings than random negatives. This is why training data curation is a bigger competitive moat than architecture for embedding providers.

Matryoshka Representation Learning (MRL)

A 2022 paper from Kusupati et al. introduced a training trick that’s quietly become standard: Matryoshka embeddings.

The idea: train the model so that the first d dimensions of a 1536-dimension embedding are themselves a useful d-dimensional embedding, for every d in a fixed set of nested sizes. Like Russian nesting dolls.

# MRL trains on multiple granularities simultaneously
loss = Σ_d∈{8,16,32,64,128,256,512,1024} L(f(x)[:d], f(y)[:d])

Why this matters practically: you can truncate embeddings to reduce storage and computation costs with minimal accuracy loss. OpenAI’s text-embedding-3-large produces 3072 dimensions by default and can be shortened through the API’s dimensions parameter, for example to 1024 or 256. That’s MRL. For high-volume applications, the cost difference is significant: storing 100M vectors at 256 dimensions instead of 3072 is a 12× storage reduction.
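If you truncate client-side, the usual recipe is to keep the leading dimensions and re-normalize so cosine and dot-product scores stay comparable. A small sketch, assuming float32 storage and a model that was actually trained with MRL (truncating an ordinary embedding this way generally loses much more accuracy); the function names are made up for illustration.

```python
import numpy as np

def truncate_embedding(vectors, d):
    """Keep the first d dimensions and rescale each vector to unit length."""
    head = np.asarray(vectors, dtype=np.float32)[..., :d]
    norms = np.linalg.norm(head, axis=-1, keepdims=True)
    return head / np.clip(norms, 1e-12, None)   # clip guards against zero vectors

def storage_gb(n_vectors, dims, bytes_per_value=4):
    """Raw vector storage in GB (10^9 bytes), ignoring index overhead."""
    return n_vectors * dims * bytes_per_value / 1e9

full = storage_gb(100_000_000, 3072)    # 1228.8 GB
small = storage_gb(100_000_000, 256)    # 102.4 GB
```

The ratio between the two is the 12× quoted above; index structures and metadata add their own overhead on top.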

Vector Databases and HNSW Indexing

Once you have embeddings, you need to search them. The naive approach (compute cosine similarity between your query vector and every vector in the database) costs O(n) per query and, depending on dimension and latency budget, becomes impractical for interactive use at around 10M vectors.
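For reference, the brute-force baseline is a single matrix-vector product. This sketch assumes the whole corpus fits in memory as a 2-D NumPy array; top_k_cosine is a name chosen for illustration.

```python
import numpy as np

def top_k_cosine(query, corpus, k=5):
    """Exact nearest neighbors by cosine similarity: O(n * dim) work per query."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = c @ q                                # one similarity per stored vector
    k = min(k, len(sims))
    idx = np.argpartition(-sims, k - 1)[:k]     # unordered top-k in linear time
    idx = idx[np.argsort(-sims[idx])]           # order only the k winners
    return idx, sims[idx]
```

It is worth keeping this around even after you adopt an approximate index: exact results on a sample of queries are the ground truth you measure the index's recall against.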

HNSW (Hierarchical Navigable Small World graphs), introduced by Malkov and Yashunin in 2018, is the index structure that made large-scale vector search practical. It builds a multi-layer graph:

Top layers: sparse, long-range connections (fast navigation)
Bottom layers: dense, local connections (precise search)

Search starts at the top, greedily descends to the query’s neighborhood, then does a fine-grained search at the bottom. Search cost drops from O(n) to roughly O(log n), and inserting a vector is also roughly O(log n). The results are approximate: a small fraction of true nearest neighbors can be missed. The other tradeoff is memory: HNSW stores the graph structure, which adds 64-128 bytes per vector on top of the vector itself.
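The layered shape comes from how each new node picks its top layer. The HNSW paper draws it as floor(-ln(U) · mL) with U uniform on (0, 1] and recommends mL = 1/ln(M). The sketch below only simulates that draw to show how quickly the layers thin out; it is not a working index, and layer_counts is a helper invented for illustration.

```python
import math
import random

def sample_level(M, rng):
    """Top layer for a new node: floor(-ln(U) * mL), with mL = 1 / ln(M)."""
    m_l = 1.0 / math.log(M)
    u = 1.0 - rng.random()          # in (0, 1], so the log is always defined
    return int(-math.log(u) * m_l)

def layer_counts(n, M=16, seed=0):
    """How many of n nodes appear in each layer (a node is in every layer up to its top)."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n):
        top = sample_level(M, rng)
        for layer in range(top + 1):
            counts[layer] = counts.get(layer, 0) + 1
    return counts

counts = layer_counts(100_000, M=16)
```

With M=16, every node is in layer 0, roughly 1 in 16 reaches layer 1, roughly 1 in 256 reaches layer 2, and so on. That geometric thinning is what gives the upper layers their long-range hops and keeps the number of layers logarithmic in n.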

Pinecone, Weaviate, Qdrant, and pgvector all use HNSW (or variants) under the hood. The parameters you tune:

M: connections per node (higher = better recall, more memory)
ef_construction: search breadth during build (higher = better quality index, slower build)
ef_search: search breadth at query time (tune this to trade recall vs. latency)

For most production use cases, M=16, ef_construction=200, ef_search=100 is a reasonable starting point. At 99% recall, HNSW on a 1M-vector dataset typically achieves <10ms latency on a single CPU core.

Quantization: Making It Cheaper

A float32 vector with 1536 dimensions takes 6,144 bytes (6KB). At 100M vectors, that’s ~600GB. In practice, most production systems use quantization:

Scalar quantization (SQ8): Map each float32 value to a uint8 (0-255). 4× compression, <1% accuracy loss.

Product quantization (PQ): Split the vector into subvectors, cluster each subvector, encode as cluster IDs. 32-64× compression, ~3-5% accuracy loss.

Binary quantization: Threshold each dimension to 0 or 1. 32× compression, 10-20% accuracy loss when used alone, but fast Hamming distance calculation. A 2024 Hugging Face blog post on embedding quantization reported that some embedding models retain ~96% of retrieval performance when binary-quantized and then rescored with higher-precision vectors, which surprised the field.
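The two simplest schemes above fit in a short NumPy sketch. The assumptions here: scalar quantization uses a per-dimension min/max range fitted on a sample, and binary quantization thresholds at zero (which suits roughly zero-centered embeddings). Production systems differ in how they calibrate both, and the function names are illustrative.

```python
import numpy as np

def sq8_fit(vectors):
    """Per-dimension offset and step size for mapping floats onto 0..255."""
    lo = vectors.min(axis=0)
    hi = vectors.max(axis=0)
    scale = np.where(hi > lo, (hi - lo) / 255.0, 1.0)  # avoid dividing by zero
    return lo, scale

def sq8_encode(vectors, lo, scale):
    return np.clip(np.round((vectors - lo) / scale), 0, 255).astype(np.uint8)

def sq8_decode(codes, lo, scale):
    return codes.astype(np.float32) * scale + lo

def binarize(vectors):
    """One bit per dimension (value > 0), packed 8 dimensions to a byte."""
    return np.packbits(vectors > 0, axis=-1)

def hamming_distances(query_code, codes):
    """Number of differing bits between one packed code and each row of codes."""
    diff = np.bitwise_xor(codes, query_code)
    return np.unpackbits(diff, axis=-1).sum(axis=-1)
```

A 1536-dimension float32 vector is 6,144 bytes; its SQ8 code is 1,536 bytes and its binary code is 192 bytes, which is where the 4× and 32× figures come from. A common pattern is to retrieve a generous candidate set with Hamming distance and then rescore those candidates with higher-precision vectors.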

Evaluating Embedding Quality

The standard benchmark is MTEB (Massive Text Embedding Benchmark, Muennighoff et al., 2022), whose English leaderboard covers 56 datasets across 8 task types (the full benchmark spans 58 datasets). The MTEB leaderboard is the de facto ranking.

But MTEB has a known problem: models can be fine-tuned on MTEB datasets and achieve inflated scores that don’t generalize. When evaluating for production:

Create domain-specific test sets — 200-500 query/document pairs from your actual data
Measure recall@k — what fraction of relevant documents appear in the top-k results
Test for failure modes — long documents (most models cap at 512 or 8192 tokens), multilingual content, numbers and dates
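Recall@k is simple enough to compute yourself on a domain test set. A minimal sketch, assuming results maps each query id to a ranked list of document ids and judgments maps each query id to the set of relevant ids (both structures are assumptions about how you store your test set):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that show up in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mean_recall_at_k(results, judgments, k):
    """Average recall@k over every query that has relevance judgments."""
    scores = [recall_at_k(results.get(q, []), rel, k) for q, rel in judgments.items()]
    return sum(scores) / len(scores)

results = {"q1": ["d1", "d7", "d3"], "q2": ["d9", "d2"]}
judgments = {"q1": {"d1", "d3"}, "q2": {"d4"}}
score = mean_recall_at_k(results, judgments, k=2)   # (0.5 + 0.0) / 2 = 0.25
```

Reporting the per-query scores alongside the mean makes it easier to spot the failure modes listed above, since they tend to show up as clusters of zero-recall queries.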

Embeddings as Features: The Fine-Tuning Consideration

Pre-trained embedding models are general-purpose. For specialized domains (legal, medical, code), fine-tuning typically yields significant gains. The standard approach:

Start from a strong base model (e.g., all-mpnet-base-v2 or similar)
Collect domain-specific positive pairs (documents that should be similar)
Mine hard negatives (similar-looking but semantically different)
Fine-tune with contrastive loss for 1-3 epochs
Evaluate on domain-specific test set
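The hard-negative mining step in that list can be sketched as a similarity search with the known positives masked out. This assumes you already have embeddings from the base model for the anchors and for a candidate pool; mine_hard_negatives and its arguments are illustrative names, not part of any library.

```python
import numpy as np

def mine_hard_negatives(anchor_vecs, candidate_vecs, positive_ids, n_neg=3):
    """For each anchor, pick the most similar candidates that are not labeled positive.

    positive_ids[i] is the collection of candidate indices known to match anchor i.
    """
    a = anchor_vecs / np.linalg.norm(anchor_vecs, axis=1, keepdims=True)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = a @ c.T                                  # cosine similarity matrix
    mined = []
    for row, positives in zip(sims, positive_ids):
        row = row.copy()
        for p in positives:
            row[p] = -np.inf                        # never pick a known positive
        order = np.argsort(-row)                    # most similar first
        mined.append([int(i) for i in order if np.isfinite(row[i])][:n_neg])
    return mined
```

One caveat: the closest unlabeled candidates are sometimes true matches that nobody labeled. Spot-checking a sample, or skipping the very top few candidates, helps avoid training the model to push genuine positives apart.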

Practically, this requires a few thousand labeled pairs and a few hours on a single GPU. The gains over a general model are often 5-15% on domain recall.

One Thing to Remember

The quality of embeddings for your use case depends more on training data alignment and hard negative mining than on model size or architecture — and a fine-tuned small model will usually beat a general-purpose large model on your specific domain.
