Self-Supervised Learning — Core Concepts
What Supervised Learning Requires
Supervised learning maps inputs to labeled outputs. For most tasks, human-labeled datasets are the bottleneck: ImageNet required 14 million images with human-annotated labels. SQuAD for reading comprehension required thousands of hours of crowd-worker annotation. Medical imaging datasets require expert radiologists.
Self-supervised learning bypasses this by generating supervision signals from the structure inherent in unlabeled data. The key insight: many data types have natural “prediction targets” that don’t require human annotation.
Pretext Tasks: The Core Mechanism
A pretext task is a learning objective whose labels are derived automatically from the data itself. It is designed so that solving it requires the model to learn useful representations.
Language: Masked Language Modeling (MLM)
BERT (Devlin et al., 2018) randomly selects 15% of the tokens in a sequence as prediction targets and trains the model to recover the original tokens. The selected tokens are corrupted as follows:
- 80%: replaced with [MASK]
- 10%: replaced with a random token
- 10%: left unchanged
Mixing in random and unchanged tokens prevents the model from producing useful predictions only at [MASK] positions, which matters because [MASK] never appears during fine-tuning. The model must use both left and right context (bidirectional) to predict the selected tokens.
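The 80/10/10 corruption scheme can be sketched in a few lines. This is a minimal illustration, not BERT's actual preprocessing code: the function name `mask_tokens`, the toy vocabulary, and the choice to select each position independently with probability 0.15 are all assumptions made for the example.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels).

    labels[i] holds the original token at selected positions and None
    elsewhere, so the loss is computed only where a prediction is required.
    Assumption: each position is selected independently with select_prob.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position not selected as a prediction target
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK                # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels

# Hypothetical toy input
sentence = ["the", "cat", "sat", "on", "the", "mat"]
toy_vocab = ["dog", "ran", "hat"]
corrupted, labels = mask_tokens(sentence, toy_vocab, rng=random.Random(0))
```

Keeping `labels` separate from `corrupted` is what lets the unchanged 10% still count as prediction targets even though the input looks untouched at those positions.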
Why it works: To predict a masked word, the model must understand the sentence’s meaning. Filling in “The trophy didn’t fit in the [MASK] because it was too big” with “suitcase” requires resolving the pronoun: “it” refers to the trophy, not the suitcase.
Language: Autoregressive (Causal) Language Modeling
GPT models predict each token from the preceding tokens only (left-to-right): $$\mathcal{L} = -\sum_t \log P(x_t | x_{<t})$$
This objective is simpler than MLM (no masking strategy needed) but restricted to unidirectional context. Despite this limitation, autoregressive LMs at scale (GPT-3, GPT-4) show strong generative capabilities, because a model trained to predict what comes next can generate text directly by sampling one token at a time.
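The loss above is just a sum of negative log-probabilities, one per position, where each prediction sees only the prefix. The sketch below makes that explicit. The `next_token_probs` callable is a hypothetical stand-in for a real model, and the uniform toy model exists only to make the arithmetic easy to check.

```python
import math

def causal_lm_loss(tokens, next_token_probs):
    """Sum of -log P(x_t | x_<t) over a sequence.

    next_token_probs(prefix) must return a dict mapping each candidate
    token to its probability (hypothetical model interface).
    """
    loss = 0.0
    for t, target in enumerate(tokens):
        probs = next_token_probs(tokens[:t])  # the model sees only x_<t
        loss -= math.log(probs[target])
    return loss

def uniform_model(prefix, vocab=("a", "b", "c", "d")):
    """Toy model that ignores the prefix and spreads mass evenly."""
    return {token: 1.0 / len(vocab) for token in vocab}

loss = causal_lm_loss(["a", "b", "a"], uniform_model)
# With 4 equally likely tokens, each position contributes log(4),
# so the total is 3 * log(4).
```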
Vision: Self-Supervised Pretext Tasks
Early computer vision pretext tasks:
- Rotation prediction: Rotate an image by 0°/90°/180°/270°, predict the rotation
- Jigsaw puzzles: Shuffle 9 patches of an image, predict the permutation that restores the original arrangement
- Colorization: Given grayscale image, predict color channels
These worked modestly but fell far short of supervised ImageNet pretraining quality. The breakthrough came from contrastive approaches.
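Rotation prediction shows how a pretext task manufactures its own labels. In the sketch below an image is a plain 2D list, and the label is the number of 90° clockwise turns that were applied. The function names are illustrative only.

```python
import random

def rotate90(image):
    """Rotate a 2D list 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def make_rotation_example(image, rng=None):
    """Return (rotated_image, label), with label in {0, 1, 2, 3}."""
    rng = rng or random.Random()
    k = rng.randrange(4)                    # number of 90-degree turns
    rotated = [list(row) for row in image]  # copy so the input is untouched
    for _ in range(k):
        rotated = rotate90(rotated)
    return rotated, k

image = [[1, 2],
         [3, 4]]
rotated, label = make_rotation_example(image, random.Random(0))
```

A classifier trained on `(rotated, label)` pairs never sees a human annotation: the label comes entirely from the transformation that was applied.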
Contrastive Self-Supervised Learning
SimCLR (Chen et al., 2020) established the framework that enabled vision SSL to match supervised performance.
Data augmentation pairs: For each image, apply two independently sampled augmentations (random crop + color jitter + grayscale + blur). The two augmented versions form a “positive pair” and should have similar representations.
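Contrastive loss (NT-Xent): In a batch of $N$ images ($2N$ augmented views), for each positive pair $(i, j)$: $$\mathcal{L}_{i,j} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbf{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k)/\tau)}$$ Here $z_i$ is the embedding of view $i$, $\text{sim}$ is cosine similarity, and $\tau$ is a temperature hyperparameter.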
The $2N - 2$ other views serve as negatives. The encoder learns to make positive pairs similar and negative pairs dissimilar in embedding space.
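The NT-Xent formula translates almost directly into code. The sketch below is written for readability rather than speed and makes two assumptions that are not in the text: views are laid out so that indices 2m and 2m+1 form a positive pair, and the default temperature is 0.5.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nt_xent_pair(z, i, j, tau=0.5):
    """L_{i,j}: anchor i, positive j, every other view as a negative."""
    numerator = math.exp(cosine(z[i], z[j]) / tau)
    denominator = sum(
        math.exp(cosine(z[i], z[k]) / tau)
        for k in range(len(z)) if k != i   # the indicator 1[k != i]
    )
    return -math.log(numerator / denominator)

def nt_xent(z, tau=0.5):
    """Average L_{i,j} and L_{j,i} over all positive pairs.

    Assumed layout: z[2m] and z[2m + 1] are two views of the same image.
    """
    n_pairs = len(z) // 2
    total = 0.0
    for m in range(n_pairs):
        i, j = 2 * m, 2 * m + 1
        total += nt_xent_pair(z, i, j, tau) + nt_xent_pair(z, j, i, tau)
    return total / (2 * n_pairs)

aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
misaligned = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
# nt_xent(aligned) is smaller than nt_xent(misaligned)
```

In the `aligned` batch each positive pair shares a direction and the negatives are orthogonal, so the loss is low. In the `misaligned` batch the positives are orthogonal and one negative is identical to the anchor, so the loss is high.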
SimCLR required large batch sizes (4096+) and many negatives for good performance. MoCo (He et al., 2020) introduced a momentum encoder and memory queue to provide many negatives without requiring large batches.
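The two MoCo ingredients named above can be sketched independently of any deep learning framework. Parameters are represented as flat lists of floats, the class and function names are hypothetical, and the momentum value 0.999 is an assumption about a typical setting rather than something stated in the text.

```python
from collections import deque

def momentum_update(key_params, query_params, m=0.999):
    """Move the key (momentum) encoder slowly toward the query encoder:
    key <- m * key + (1 - m) * query."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

class NegativeQueue:
    """Fixed-size FIFO of past key embeddings that serve as negatives."""

    def __init__(self, max_size):
        self.items = deque(maxlen=max_size)

    def enqueue(self, keys):
        # With maxlen set, the oldest entries are dropped automatically.
        self.items.extend(keys)

    def negatives(self):
        return list(self.items)

queue = NegativeQueue(max_size=3)
queue.enqueue(["k1", "k2"])
queue.enqueue(["k3", "k4"])   # "k1" falls out of the queue
```

Because the queue holds keys from earlier batches, the number of negatives is decoupled from the batch size. The slow momentum update keeps those older keys roughly consistent with the current encoder.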
Beyond Contrastive: Avoiding Collapse Without Negatives
Methods that only pull positive pairs together are prone to a failure mode called representation collapse: the model learns the trivial solution of mapping all inputs to the same representation (then all pairs are “similar”). Contrastive methods avoid this by pushing negatives apart.
Newer methods avoid collapse without explicit negatives:
BYOL (Bootstrap Your Own Latent, 2020): Two networks — online and target (momentum average of online). The online network predicts the target network’s representation of the same image with different augmentation. No negatives needed. Works because the momentum encoder evolves slowly, providing stable targets.
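BYOL's regression objective is commonly written as the squared distance between the L2-normalized online prediction and the L2-normalized target projection, which equals 2 − 2·cos of the angle between them. The sketch below follows that form. The function names are illustrative, and in a real implementation the target would be excluded from backpropagation, which plain Python cannot express.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def byol_loss(online_prediction, target_projection):
    """Squared distance between unit vectors; ranges from 0 to 4.

    The target is treated as a fixed regression target (no gradient).
    """
    q = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    return sum((a - b) ** 2 for a, b in zip(q, z))

same_direction = byol_loss([1.0, 0.0], [2.0, 0.0])   # approximately 0
orthogonal = byol_loss([1.0, 0.0], [0.0, 1.0])       # approximately 2
```

Note that nothing in this loss pushes different images apart. That is why the asymmetry between the online and target networks carries the burden of avoiding collapse.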
DINO (Self-DIstillation with NO Labels, 2021): Self-distillation with a teacher = momentum EMA of student. Uses multi-crop (global and local crops at different scales). Produces attention maps that cleanly segment semantic objects without any segmentation labels — a remarkable emergent property.
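A rough sketch of the self-distillation loss: the student is trained to match the teacher's output distribution via cross-entropy. Two details below are not in the text and are assumptions based on the DINO paper's description: the teacher output is centered (a running mean is subtracted) and sharpened with a lower temperature than the student. The specific temperature values are illustrative defaults.

```python
import math

def softmax(logits, temperature):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy from the teacher distribution to the student distribution.

    The teacher output is centered, then sharpened with a low temperature;
    it acts as a fixed target (no gradient flows through it).
    """
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher = softmax(centered, teacher_temp)
    student = softmax(student_logits, student_temp)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))

center = [0.0, 0.0, 0.0]
agree = dino_loss([2.0, 0.0, 0.0], [2.0, 0.0, 0.0], center)
disagree = dino_loss([0.0, 2.0, 0.0], [2.0, 0.0, 0.0], center)
# agree is much smaller than disagree
```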
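SimSiam (2021): Stop-gradient operation instead of momentum encoder. The authors show empirically that removing the stop-gradient leads to collapse, and hypothesize that the method behaves like an alternating optimization in which the stop-gradient holds one set of variables fixed at each step.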
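For reference, the symmetric SimSiam objective is usually written as below. The symbols are introduced here for illustration and do not appear in the text above: $p_1, p_2$ are the predictor outputs and $z_1, z_2$ the encoder outputs for the two augmented views of one image. $$\mathcal{D}(p, z) = -\frac{p}{\lVert p \rVert_2} \cdot \frac{z}{\lVert z \rVert_2}, \qquad \mathcal{L} = \tfrac{1}{2}\,\mathcal{D}\big(p_1, \text{stopgrad}(z_2)\big) + \tfrac{1}{2}\,\mathcal{D}\big(p_2, \text{stopgrad}(z_1)\big)$$ Because $\text{stopgrad}(\cdot)$ treats its argument as a constant, gradients in each term flow only through the predictor branch.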
Self-Supervised Learning for Speech and Video
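wav2vec 2.0 (Facebook, 2020): Masks spans of the latent speech representation, quantizes the latents into discrete targets, and trains the model to identify the correct quantized target for each masked time step among distractors (a contrastive task over quantized representations). Pre-trained on 960 hours of unlabeled LibriSpeech audio, or on a larger 53k-hour unlabeled set. Fine-tuning on as little as 10 minutes of labeled speech still gives a low word error rate (the paper reports 4.8/8.2 on LibriSpeech test-clean/test-other for the 53k-hour model).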
VideoMAE (2022): Masked Autoencoder for video. It masks 90% of video patches (higher than the 75% typically used for image MAE, because consecutive frames are similar) and reconstructs the original pixels. The high masking rate forces the model to learn temporal dynamics rather than just copying nearby frames.
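One way to stop a model from copying a masked patch out of a neighboring frame is to mask the same spatial positions in every frame, often called tube masking. The sketch below assumes that strategy. The function name and the patch-grid sizes are illustrative.

```python
import random

def tube_mask(num_frames, patches_per_frame, mask_ratio=0.9, rng=None):
    """Return a num_frames x patches_per_frame grid of booleans.

    True means "masked". The same spatial positions are masked in every
    frame, so a hidden patch cannot be recovered from an adjacent frame.
    """
    rng = rng or random.Random()
    num_masked = int(round(mask_ratio * patches_per_frame))
    masked_positions = set(rng.sample(range(patches_per_frame), num_masked))
    frame_mask = [p in masked_positions for p in range(patches_per_frame)]
    return [list(frame_mask) for _ in range(num_frames)]

# Hypothetical sizes: 8 frames, a 14 x 14 patch grid per frame
mask = tube_mask(num_frames=8, patches_per_frame=196, rng=random.Random(0))
visible_per_frame = mask[0].count(False)   # 20 of 196 patches stay visible
```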
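data2vec (Facebook, 2022): Unified SSL framework for text, audio, and vision using the same algorithm in each modality: a student sees a masked input and predicts the latent representations that an EMA teacher computes from the unmasked input.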
From Pretraining to Fine-Tuning
Self-supervised pretraining produces representations that transfer remarkably well. The pretrained encoder is then fine-tuned on downstream tasks with small labeled datasets.
For BERT: pretraining on a 3.3B-word unlabeled corpus → fine-tune with as few as 1000 labeled examples → can outperform earlier models trained on the full labeled dataset, depending on the task.
For SimCLR: pretrain on ImageNet without labels → linear probe (train only a linear classifier on frozen features) with ImageNet labels → 76.5% top-1 accuracy with a ResNet-50 (4×) encoder, matching a supervised ResNet-50.
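A linear probe can be illustrated with a tiny logistic-regression classifier trained on fixed feature vectors. Everything concrete below is an assumption for the example: the two-dimensional "frozen features" are made up, the task is binary rather than 1000-way, and the learning rate and epoch count are arbitrary.

```python
import math

def train_linear_probe(features, labels, epochs=200, lr=0.5):
    """Fit binary logistic regression with SGD on frozen features.

    Only the weights w and bias b are learned; the features themselves
    (the encoder's output) are never modified.
    """
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            logit = sum(wi * xi for wi, xi in zip(w, x)) + b
            prob = 1.0 / (1.0 + math.exp(-logit))
            error = prob - y                     # gradient of log loss w.r.t. logit
            w = [wi - lr * error * xi for wi, xi in zip(w, x)]
            b -= lr * error
    return w, b

def predict(w, b, x):
    """Predict class 1 if the logit is positive, else class 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Hypothetical frozen encoder outputs for four images
frozen_features = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.8]]
image_labels = [0, 0, 1, 1]
w, b = train_linear_probe(frozen_features, image_labels)
predictions = [predict(w, b, x) for x in frozen_features]
```

Because the classifier is linear and the features are frozen, probe accuracy measures how linearly separable the classes already are in the pretrained representation.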
One thing to remember: Self-supervised learning’s power comes from scale — the pretext task doesn’t need to be the final task, it just needs to force the model to learn useful representations from enormous amounts of freely available unlabeled data.
See Also
- Contrastive Learning: How AI learns what things are like each other — and what they're not — without any labels, creating the representations behind image search and face recognition.
- Data Augmentation: How AI systems make do with less data by creating variations of what they have — the training trick that prevented ImageNet models from memorizing training examples.
- Few Shot Learning: How AI learned to learn from just a handful of examples — the technique that lets AI generalize like humans instead of needing millions of training samples.
- LoRA Fine Tuning: How AI companies adapt massive models to specific tasks by training only a tiny fraction of the parameters — the technique making custom AI affordable.
- Reinforcement Learning Fundamentals: How AI learns from trial, error, and rewards — the technique that beat the world chess champion, solved protein folding, and is now teaching robots to walk.