Contrastive Learning — Core Concepts
What Contrastive Learning Optimizes
Contrastive learning defines a learning signal through comparison: similar pairs should be represented nearby in embedding space; dissimilar pairs should be represented far apart.
Formally, given an encoder $f: \mathcal{X} \rightarrow \mathbb{R}^d$, contrastive learning optimizes: $$\mathcal{L} = \mathbb{E}[\ell(f(x), f(x_+), f(x_-))]$$
Where $x$ is the anchor, $x_+$ is a positive example (similar to the anchor), and $x_-$ is a negative example (dissimilar). The loss $\ell$ is large when the anchor’s embedding is close to the negative’s or far from the positive’s.
An early version: metric learning with triplet loss (Weinberger & Saul, 2009). Given anchor $a$, positive $p$, negative $n$: $$\mathcal{L}_{triplet} = \max(0, \|f(a) - f(p)\|_2^2 - \|f(a) - f(n)\|_2^2 + m)$$
The loss is zero only when the positive is closer to the anchor than the negative by at least the margin $m$, measured in squared distance. FaceNet (Google, 2015) used this loss for face verification.
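As a minimal sketch of the triplet loss above, the following NumPy function computes the hinge on squared Euclidean distances. The function name, the margin value of 0.2, and the toy 2-D embeddings are assumptions chosen for illustration, not values from the cited work.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on squared Euclidean distances, averaged over the batch."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)  # ||f(a) - f(p)||^2
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)  # ||f(a) - f(n)||^2
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))

# Toy embeddings (already "encoded"): one anchor, one positive, two candidate negatives.
a = np.array([[0.0, 0.0]])
p = np.array([[0.1, 0.0]])
n_far = np.array([[1.0, 0.0]])    # far from the anchor: margin satisfied
n_near = np.array([[0.2, 0.0]])   # close to the anchor: margin violated

loss_far = triplet_loss(a, p, n_far)
loss_near = triplet_loss(a, p, n_near)
```

The far negative contributes no loss, while the near negative yields a positive loss; only triplets that violate the margin produce a gradient.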
SimCLR: The Modern Framework
Chen et al. (2020) “A Simple Framework for Contrastive Learning of Visual Representations” simplified and scaled contrastive SSL to the point where a linear classifier trained on its features matched a supervised ResNet-50 on ImageNet (using a wider encoder).
Data augmentation pipeline: For each image $x$, apply two random augmentations sampled independently from $\mathcal{T}$: $\tilde{x}_i = t_i(x)$, $\tilde{x}_j = t_j(x)$ where $t_i, t_j \sim \mathcal{T}$.
Augmentations included: random crop + resize, random color jitter, random grayscale, Gaussian blur. The strength of augmentations matters significantly — too weak and the task is trivial; too strong and positive pairs become unrecognizable.
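The two-view sampling can be sketched in plain NumPy. This is an illustrative stand-in rather than the SimCLR pipeline: the crop size, the brightness range, the grayscale probability, and the `random_view` helper are all assumptions, and Gaussian blur is omitted for brevity.

```python
import numpy as np

def random_view(img, rng, crop=24):
    """One stochastic view of an HxWx3 float image in [0, 1]."""
    h, w, _ = img.shape
    # Random crop at a random location.
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    view = img[top:top + crop, left:left + crop]
    # Nearest-neighbour resize back to the original resolution.
    rows = np.arange(h) * crop // h
    cols = np.arange(w) * crop // w
    view = view[rows][:, cols]
    # Brightness scaling as a simple stand-in for full color jitter.
    view = np.clip(view * rng.uniform(0.6, 1.4), 0.0, 1.0)
    # Random grayscale with probability 0.2 (assumed value).
    if rng.random() < 0.2:
        view = np.repeat(view.mean(axis=2, keepdims=True), 3, axis=2)
    return view

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))        # placeholder for a real image
view_i = random_view(image, rng)       # t_i(x)
view_j = random_view(image, rng)       # t_j(x), sampled independently
```

Both views come from the same image and form a positive pair; views of other images in the batch act as negatives.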
Projection head: Representations $h = f_{encoder}(\tilde{x})$ are projected to a lower-dimensional space $z = g(h)$ through an MLP. The contrastive loss operates in $z$ space. After training, $g$ is discarded and $h$ (pre-projection) is used for downstream tasks.
Discarding the projection head looks counterintuitive. A plausible explanation is that the contrastive loss trains $z$ to be invariant to the augmentations, so $g$ may throw away information (for example color or orientation) that downstream tasks need while keeping only what the contrastive pretext task requires. The representation $h$, one layer earlier, retains broader information.
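To make the split between $h$ and $z$ concrete, here is a small sketch of a two-layer MLP projection head in NumPy. The layer sizes, the random weights, and the random stand-in for encoder outputs are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_hidden, d_z = 512, 512, 128   # assumed sizes; z is lower-dimensional than h

# Randomly initialised weights stand in for trained parameters of g.
W1 = rng.normal(0.0, 0.02, size=(d_h, d_hidden))
W2 = rng.normal(0.0, 0.02, size=(d_hidden, d_z))

def projection_head(h):
    """g(h): linear -> ReLU -> linear."""
    return np.maximum(h @ W1, 0.0) @ W2

h = rng.normal(size=(4, d_h))   # stand-in for encoder outputs f_encoder(x~)
z = projection_head(h)          # fed to the contrastive loss during training

# After pretraining, downstream tasks consume h; W1 and W2 are thrown away.
downstream_features = h
```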
NT-Xent (Normalized Temperature-scaled Cross Entropy): For a minibatch of $N$ images (generating $2N$ augmented views), for each positive pair $(i, j)$: $$\mathcal{L}_{i,j} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbf{1}_{k \neq i} \exp(\text{sim}(z_i, z_k)/\tau)}$$
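Here $\text{sim}(u, v) = u^\top v / (\|u\| \|v\|)$ is cosine similarity. The $2N - 2$ other views serve as negatives. Large batches matter because more negatives provide better contrast. SimCLR used batch size 4096 with 128-dimensional projections; its temperature ablation favored $\tau \approx 0.1$ (the value $\tau = 0.07$ is the MoCo default).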
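A minimal NumPy sketch of NT-Xent follows. It assumes the $2N$ embeddings are arranged so that rows $2k$ and $2k+1$ are the two views of image $k$; that layout, the function name, and the default temperature are illustrative choices.

```python
import numpy as np

def nt_xent(z, temperature=0.1):
    """NT-Xent over 2N embeddings; rows 2k and 2k+1 must be views of the same image."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit vectors -> cosine similarity
    sim = z @ z.T / temperature                        # (2N, 2N) scaled similarities
    np.fill_diagonal(sim, -np.inf)                     # implements the indicator 1[k != i]
    idx = np.arange(z.shape[0])
    pos = idx ^ 1                                      # partner index: 0<->1, 2<->3, ...
    # Numerically stable log of the denominator (log-sum-exp over each row).
    row_max = sim.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    # Average of L_{i,j} over all 2N ordered positive pairs.
    return float(np.mean(log_denom - sim[idx, pos]))

rng = np.random.default_rng(0)
base = rng.normal(size=(8, 16))
# Well-aligned views: each pair is the same vector plus small noise.
z_aligned = np.repeat(base, 2, axis=0) + 0.01 * rng.normal(size=(16, 16))
# Unaligned views: every row is unrelated to its supposed partner.
z_random = rng.normal(size=(16, 16))

loss_aligned = nt_xent(z_aligned)
loss_random = nt_xent(z_random)
```

When all embeddings are identical, every candidate is equally likely and the loss reduces to $\log(2N - 1)$, which is a useful sanity check.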
MoCo: Momentum Contrast
A fundamental limitation of in-batch negatives, as used in SimCLR: effective contrastive learning needs many negatives, but drawing them all from the current batch requires enormous batch sizes, which is expensive and bounded by accelerator memory.
He et al. (2020) “Momentum Contrast for Unsupervised Visual Representation Learning” addressed this with a queue of encoded keys that decouples the number of negatives from the batch size.
Architecture: Two encoders — online (query) $f_q$ and momentum (key) $f_k$. A FIFO queue stores encoded keys from recent batches (up to 65,536 keys).
Training: For each batch:
- Query encoder processes augmented view $\tilde{x}_q$ → query $q = f_q(\tilde{x}_q)$
- Momentum encoder processes augmented view $\tilde{x}_k$ → key $k_+ = f_k(\tilde{x}_k)$
- Contrast $q$ against $k_+$ (positive) and all 65,536 keys in queue (negatives)
- Update queue (enqueue current batch, dequeue oldest)
- Update momentum encoder: $\theta_k \leftarrow m \theta_k + (1-m) \theta_q$ (no gradient)
The momentum update (typically $m=0.999$) makes the key encoder evolve slowly. Keys in the queue were encoded by earlier versions of $f_k$, so a slowly changing $f_k$ keeps old and new keys comparable; if the key encoder changed rapidly, older keys would no longer be consistent with the current one.
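The training steps above can be sketched with a ring-buffer queue and an explicit momentum update. This is a simplified illustration, not the reference implementation: `KeyQueue`, `momentum_update`, and `moco_logits` are hypothetical names, parameters are plain arrays, the queue is tiny, and the temperature of 0.07 is an assumed default.

```python
import numpy as np

class KeyQueue:
    """Fixed-size FIFO of key embeddings, stored as a ring buffer."""
    def __init__(self, size, dim):
        self.keys = np.zeros((size, dim))
        self.ptr = 0

    def enqueue(self, batch_keys):
        n = batch_keys.shape[0]
        idx = (self.ptr + np.arange(n)) % self.keys.shape[0]
        self.keys[idx] = batch_keys                  # overwrites the oldest entries
        self.ptr = (self.ptr + n) % self.keys.shape[0]

def momentum_update(theta_k, theta_q, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q; theta_k receives no gradient."""
    return m * theta_k + (1.0 - m) * theta_q

def moco_logits(q, k_pos, queue_keys, temperature=0.07):
    """Column 0 holds the positive logit; the remaining columns are queue negatives.
    Assumes q, k_pos and queue_keys are already L2-normalised."""
    l_pos = np.sum(q * k_pos, axis=1, keepdims=True)   # (B, 1)
    l_neg = q @ queue_keys.T                           # (B, K)
    return np.concatenate([l_pos, l_neg], axis=1) / temperature

rng = np.random.default_rng(0)
queue = KeyQueue(size=8, dim=4)
queue.enqueue(rng.normal(size=(8, 4)))        # pretend earlier batches filled the queue
q = rng.normal(size=(2, 4))                   # queries from f_q
k_pos = rng.normal(size=(2, 4))               # keys from f_k
logits = moco_logits(q, k_pos, queue.keys)    # cross-entropy target is column 0
queue.enqueue(k_pos)                          # newest keys replace the oldest
```

Because the positive always sits in column 0, the loss is an ordinary cross-entropy with label 0 for every query.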
MoCo v3 combined the momentum encoder with ViT backbones and SimCLR-style augmentations. It dropped the memory queue and relied on large batches instead (4096 by default), achieving strong SSL performance at the time.
Hard Negative Mining
As contrastive training progresses, most negatives become “easy” — the encoder correctly places them far from the anchor without learning anything new. Performance plateaus because the gradient from easy negatives is negligible.
Supervised hard negatives: In metric learning for face recognition, hard negatives are faces of different people that look similar (semi-hard and hard triplets). Mining these in advance (offline) or within the batch (online) accelerates training.
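Online (in-batch) mining of semi-hard triplets can be sketched as follows. The selection rule used here, a negative farther than the positive but inside the margin, is one common definition of "semi-hard"; the function name, margin, and toy data are assumptions.

```python
import numpy as np

def mine_semi_hard(emb, labels, margin=0.2):
    """For each (anchor, positive) pair, pick the closest negative n
    with d(a, p) < d(a, n) < d(a, p) + margin, if one exists."""
    # Pairwise squared Euclidean distances, shape (B, B).
    d = np.sum((emb[:, None, :] - emb[None, :, :]) ** 2, axis=-1)
    triplets = []
    for a in range(len(emb)):
        negatives = np.flatnonzero(labels != labels[a])
        for p in np.flatnonzero(labels == labels[a]):
            if p == a:
                continue
            semi_hard = negatives[(d[a, negatives] > d[a, p]) &
                                  (d[a, negatives] < d[a, p] + margin)]
            if semi_hard.size:
                n = semi_hard[np.argmin(d[a, semi_hard])]
                triplets.append((a, int(p), int(n)))
    return triplets

# Two identities (labels 0 and 1); sample 2 lies close to identity 0.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [0.3, 0.0], [5.0, 0.0]])
labels = np.array([0, 0, 1, 1])
triplets = mine_semi_hard(emb, labels)
```

Anchors whose negatives are all far away yield no triplets, which is the point: easy negatives are skipped rather than spending gradient steps on them.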
Unsupervised hard negatives (MoCHi, 2020): Synthesize hard negatives by mixing the embeddings of the negatives closest to the query, creating new points in the query’s neighborhood. Without labels these are not guaranteed to be true negatives, so the hardest candidates may include false negatives.
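The mixing idea can be sketched as convex combinations of the hardest negatives for one query. This is a simplified illustration in the spirit of hard negative mixing, not the published procedure; `mix_hard_negatives`, `n_hard`, and `n_synth` are hypothetical names and values.

```python
import numpy as np

def mix_hard_negatives(q, negatives, n_hard=4, n_synth=3, rng=None):
    """Convex mixes of the negatives most similar to q, re-normalised to unit length."""
    if rng is None:
        rng = np.random.default_rng()
    sims = negatives @ q                               # similarity of each negative to q
    hard = negatives[np.argsort(-sims)[:n_hard]]       # the n_hard hardest negatives
    i = rng.integers(0, n_hard, size=n_synth)          # random pairs among the hard set
    j = rng.integers(0, n_hard, size=n_synth)
    alpha = rng.uniform(0.0, 1.0, size=(n_synth, 1))   # mixing coefficients
    mixed = alpha * hard[i] + (1.0 - alpha) * hard[j]
    return mixed / np.linalg.norm(mixed, axis=1, keepdims=True)

rng = np.random.default_rng(0)
negatives = rng.normal(size=(16, 8))
negatives /= np.linalg.norm(negatives, axis=1, keepdims=True)
q = rng.normal(size=8)
q /= np.linalg.norm(q)

synthetic = mix_hard_negatives(q, negatives, rng=rng)
# The synthetic points are appended to the real negatives when computing the loss.
all_negatives = np.concatenate([negatives, synthetic], axis=0)
```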
Debiased contrastive loss (Chuang et al., 2020): Randomly sampled negatives include false negatives, that is, samples that share the anchor’s latent class but are treated as negatives. The debiased loss estimates and corrects for this using a class prior (the probability that a random sample shares the anchor’s class).
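A hedged sketch of the correction: subtract the expected contribution of false negatives from the negative term, rescale, and clamp so the logarithm stays defined. The function below follows the general shape of the debiased estimator as I understand it; the name `debiased_contrastive_loss`, the prior `tau_plus`, and the default values are assumptions.

```python
import numpy as np

def debiased_contrastive_loss(pos_sim, neg_sims, tau_plus=0.1, temperature=0.5):
    """pos_sim: (B,) anchor-positive cosine similarities.
    neg_sims: (B, N) anchor-negative cosine similarities.
    tau_plus: assumed prior probability that a sampled 'negative' shares the anchor's class."""
    pos = np.exp(pos_sim / temperature)
    neg = np.exp(neg_sims / temperature)
    n = neg_sims.shape[1]
    # Remove the estimated false-negative mass from the mean negative term, then rescale.
    g = (neg.mean(axis=1) - tau_plus * pos) / (1.0 - tau_plus)
    # Clamp at the smallest value exp(sim / t) can take for cosine similarity.
    g = np.maximum(g, np.exp(-1.0 / temperature))
    return float(np.mean(-np.log(pos / (pos + n * g))))

pos_sim = np.array([0.8, 0.7])
neg_sims = np.array([[0.1, -0.2, 0.3],
                     [0.0, 0.2, -0.4]])
loss_debiased = debiased_contrastive_loss(pos_sim, neg_sims, tau_plus=0.1)
loss_standard = debiased_contrastive_loss(pos_sim, neg_sims, tau_plus=0.0)
```

Setting `tau_plus=0` recovers the standard loss with $N$ negatives, so the prior can be treated as a single extra hyperparameter.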
Applications Beyond Vision SSL
Contrastive learning underlies many production systems:
Sentence embeddings (SimCSE, Gao et al., 2021): Use dropout as data augmentation. The same sentence passed through BERT twice with different dropout masks creates a positive pair. SimCSE fine-tunes a pretrained encoder (BERT or RoBERTa) for sentence similarity using an NT-Xent-style loss with in-batch negatives. The resulting embeddings improve semantic search quality significantly.
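The dropout trick can be illustrated with a toy encoder: two forward passes over the same inputs, each with a fresh dropout mask, give two slightly different embeddings per sentence. The linear-plus-tanh encoder, the sizes, and the dropout rate below are stand-ins for a real Transformer, chosen only to show the mechanism.

```python
import numpy as np

def encode_with_dropout(x, W, rng, p=0.1):
    """Toy encoder: linear layer + tanh, then inverted dropout with a fresh mask per call."""
    h = np.tanh(x @ W)
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 32))                  # stand-in for 4 encoded sentences
W = rng.normal(size=(32, 64)) / np.sqrt(32)

z1 = encode_with_dropout(x, W, rng)           # first pass
z2 = encode_with_dropout(x, W, rng)           # second pass: same input, different mask

# Cosine similarity between every first-pass and second-pass embedding.
z1n = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
z2n = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
sim = z1n @ z2n.T                             # diagonal = positive pairs
nearest = sim.argmax(axis=1)                  # index of the best match for each sentence
```

The diagonal entries are the positives and every off-diagonal entry is an in-batch negative, so the similarity matrix feeds directly into a cross-entropy loss with the row index as the target.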
Recommendation systems: User-item interactions create positive pairs (user $u$ clicked item $i$); random items are negatives. Contrastive learning produces user and item embeddings for retrieval.
Multi-view learning: Sensor fusion in autonomous vehicles — different sensors (camera, LiDAR, radar) provide multiple “views” of the same scene. Contrastive learning aligns representations across sensor modalities.
CLIP (OpenAI, 2021): Image-text contrastive learning at scale, matching images with their captions. It was trained on 400M image-text pairs; the largest released model has roughly 0.4 billion parameters across its image and text encoders. The resulting cross-modal embeddings enabled zero-shot image classification and cross-modal search.
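The image-text objective can be sketched as a symmetric cross-entropy over a batch similarity matrix, where the $i$-th image and the $i$-th caption form the positive pair. This sketch uses a fixed temperature for simplicity (CLIP learns it during training), and random arrays stand in for encoder outputs; the function names are illustrative.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matched image-text pair."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature                       # (B, B), matches on the diagonal
    i2t = -np.mean(np.diag(log_softmax(logits, axis=1)))     # pick the right caption per image
    t2i = -np.mean(np.diag(log_softmax(logits, axis=0)))     # pick the right image per caption
    return float(0.5 * (i2t + t2i))

rng = np.random.default_rng(0)
img_emb = rng.normal(size=(8, 16))
txt_matched = img_emb + 0.01 * rng.normal(size=(8, 16))      # captions aligned with images
txt_shuffled = txt_matched[::-1]                             # every caption paired with the wrong image

loss_matched = clip_style_loss(img_emb, txt_matched)
loss_shuffled = clip_style_loss(img_emb, txt_shuffled)
```

Zero-shot classification reuses the same similarity: embed one text prompt per class and assign the image to the class whose prompt embedding scores highest.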
One thing to remember: Contrastive learning’s insight, that comparison is a more natural supervisory signal than categorical labels, enabled SSL at scale, and the same principle (pull similar things together, push dissimilar things apart) generalizes to any data type with a natural notion of similarity.
See Also
- Data Augmentation: How AI systems make do with less data by creating variations of what they have, the training trick that prevented ImageNet models from memorizing training examples.
- Few-Shot Learning: How AI learned to learn from just a handful of examples, the technique that lets AI generalize like humans instead of needing millions of training samples.
- LoRA Fine-Tuning: How AI companies adapt massive models to specific tasks by training only a tiny fraction of the parameters, the technique making custom AI affordable.
- Reinforcement Learning Fundamentals: How AI learns from trial, error, and rewards, the technique that beat the world chess champion, solved protein folding, and is now teaching robots to walk.
- Self-Supervised Learning: How AI learned to teach itself from unlabeled data, the technique that let GPT and BERT learn from the entire internet without any human labeling.