Self-Supervised Learning — Core Concepts

Masked language modeling, contrastive learning, and autoregressive pretraining — the training paradigms behind BERT, GPT, SimCLR, and DINO.

What Supervised Learning Required

Supervised learning maps inputs to labeled outputs. For most tasks, human-labeled datasets are the bottleneck: ImageNet required 14 million images with human-annotated labels, SQuAD for reading comprehension required thousands of hours of crowd-worker annotation, and medical imaging datasets require expert radiologists.

Self-supervised learning bypasses this by generating supervision signals from the structure inherent in unlabeled data. The key insight: many data types have natural “prediction targets” that don’t require human annotation.

Pretext Tasks: The Core Mechanism

A pretext task is a training objective whose labels are derived from the data itself, chosen so that solving it forces the model to learn useful representations.

Language: Masked Language Modeling (MLM)

BERT (Devlin et al., 2018) randomly selects 15% of the tokens in a sequence as prediction targets and trains the model to recover the original tokens. The selected tokens are corrupted as follows:

80% of selected tokens: replaced with [MASK]
10%: replaced with a random token
10%: left unchanged

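Because [MASK] never appears in downstream inputs, the random-replacement and left-unchanged cases stop the model from relying on the [MASK] token alone: it has to build a useful representation of every input token. The model uses both left and right context (bidirectional) to predict the selected tokens.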
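The corruption rule above can be sketched in a few lines of Python. This is a minimal illustration, not BERT's actual preprocessing code: the function name `bert_mask`, the toy vocabulary, and the choice to select each token independently with probability 0.15 are assumptions made for the example.

```python
import random

MASK = "[MASK]"

def bert_mask(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted, targets); targets[i] is None where no loss is computed."""
    rng = rng if rng is not None else random.Random()
    corrupted = list(tokens)
    targets = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                      # not selected: untouched, not predicted
        targets[i] = tok                  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK           # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return corrupted, targets

tokens = "the trophy did not fit in the suitcase".split()
vocab = sorted(set(tokens))               # toy vocabulary (assumption)
corrupted, targets = bert_mask(tokens, vocab, rng=random.Random(0))
```

The loss is computed only at positions where `targets` is not `None`, including the positions whose token was left unchanged.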

Why it works: To predict a masked word, the model must understand the sentence’s meaning. Filling in “The trophy didn’t fit in the [MASK] because it was too big” with “suitcase” requires resolving the pronoun: “it” refers to the trophy, not the suitcase.

Language: Autoregressive (Causal) Language Modeling

GPT models predict each token from the preceding tokens only (left-to-right): $$\mathcal{L} = -\sum_t \log P(x_t \mid x_{<t})$$

This objective is simpler than MLM (no masking strategy is needed) but restricts the model to unidirectional context. Despite this limitation, autoregressive LMs at scale (GPT-3, GPT-4) show strong generative capabilities, because a model trained to predict what comes next can generate text directly by sampling one token at a time.
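As a small worked example of the loss above, the sketch below sums the negative log-probability of each token given its prefix. The `next_token_probs` callable and the `uniform_model` stand-in are hypothetical; a real language model would return a learned distribution instead.

```python
import math

def causal_lm_loss(token_ids, next_token_probs):
    """Sum of -log P(x_t | x_<t) over one sequence.

    next_token_probs(prefix) is a hypothetical model interface: it returns
    a dict mapping each candidate token to its probability given the prefix.
    """
    loss = 0.0
    for t, token in enumerate(token_ids):
        probs = next_token_probs(token_ids[:t])   # only tokens before position t
        loss -= math.log(probs[token])
    return loss

def uniform_model(prefix, vocab_size=4):
    """Toy model that ignores the prefix and assigns equal probability to all tokens."""
    return {tok: 1.0 / vocab_size for tok in range(vocab_size)}

loss = causal_lm_loss([0, 3, 1], uniform_model)
```

With a uniform model over 4 tokens, each of the 3 positions contributes log 4, so the total is 3 log 4. A trained model lowers this by putting more probability on the token that actually follows.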

Vision: Self-Supervised Pretext Tasks

Early computer vision pretext tasks:

Rotation prediction: Rotate an image by 0°/90°/180°/270°, predict the rotation
Jigsaw puzzles: Rearrange 9 patches, predict the correct arrangement
Colorization: Given grayscale image, predict color channels
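To make the idea of a "free" label concrete, here is a minimal sketch of how rotation-prediction training pairs could be generated. Images are plain nested lists for illustration, and the helper names `rotate90` and `rotation_example` are assumptions, not taken from any particular library.

```python
import random

def rotate90(image, k):
    """Rotate a 2-D list-of-lists image by k * 90 degrees counterclockwise."""
    for _ in range(k % 4):
        # transpose, then reverse the row order: one counterclockwise quarter turn
        image = [list(row) for row in zip(*image)][::-1]
    return image

def rotation_example(image, rng):
    """Return (rotated_image, label); the label comes from the transform, not a human."""
    label = rng.randrange(4)      # 0, 1, 2, 3 -> 0, 90, 180, 270 degrees
    return rotate90(image, label), label

image = [[1, 2],
         [3, 4]]
rotated, label = rotation_example(image, random.Random(0))
```

A classifier trained on such pairs has to recognize object orientation to succeed, which is where the representation learning comes from.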

These worked modestly but fell far short of supervised ImageNet pretraining quality. The breakthrough came from contrastive approaches.

Contrastive Self-Supervised Learning

SimCLR (Chen et al., 2020) established the framework that enabled vision SSL to match supervised performance.

Data augmentation pairs: For each image, sample two random augmentations (random crop, color jitter, grayscale conversion, blur). The two augmented views form a “positive pair”: they should have similar representations.

Contrastive loss (NT-Xent): In a batch of $N$ images ($2N$ augmented views), the loss for each positive pair $(i, j)$ is: $$\mathcal{L}_{i,j} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbf{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k)/\tau)}$$ Here $\text{sim}$ is the cosine similarity between embeddings and $\tau$ is a temperature hyperparameter.

The $2N - 2$ other views serve as negatives. The encoder learns to make positive pairs similar and negative pairs dissimilar in embedding space.
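The NT-Xent loss can be sketched in plain Python for a handful of toy embeddings. This is an illustration under stated assumptions rather than SimCLR's implementation: views `2m` and `2m + 1` are assumed to be the two augmentations of image `m`, the temperature `tau=0.5` is arbitrary, and the function names are made up for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nt_xent_pair(z, i, j, tau=0.5):
    """Loss for anchor i with positive j; every other view k != i is in the denominator."""
    numerator = math.exp(cosine(z[i], z[j]) / tau)
    denominator = sum(math.exp(cosine(z[i], z[k]) / tau)
                      for k in range(len(z)) if k != i)
    return -math.log(numerator / denominator)

def nt_xent(z, tau=0.5):
    """Average the pair loss over both orderings (i, j) and (j, i) of each positive pair."""
    total = 0.0
    for m in range(0, len(z), 2):
        total += nt_xent_pair(z, m, m + 1, tau) + nt_xent_pair(z, m + 1, m, tau)
    return total / len(z)

# Positive pairs point the same way in z_aligned and in different directions in z_mixed.
z_aligned = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
z_mixed = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
```

Comparing `nt_xent(z_aligned)` with `nt_xent(z_mixed)` shows the loss is lower when the two views of an image sit close together and far from the other images.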

SimCLR required large batch sizes (4096+) and many negatives for good performance. MoCo (He et al., 2020) introduced a momentum encoder and memory queue to provide many negatives without requiring large batches.
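The two MoCo ingredients mentioned above, a momentum-updated key encoder and a queue of past keys, can be sketched as follows. Parameters are flat lists of floats and the queue holds only 4 entries for readability; both are simplifications for illustration, and the stand-in key embeddings are hypothetical.

```python
from collections import deque

def momentum_update(key_params, query_params, m=0.999):
    """key <- m * key + (1 - m) * query, applied parameter by parameter."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

# Fixed-size FIFO of key embeddings reused as negatives across batches.
queue = deque(maxlen=4)

query_params = [1.0, -2.0]    # stand-in for the gradient-trained encoder weights
key_params = [0.0, 0.0]       # stand-in for the momentum (key) encoder weights

for step in range(6):
    key_params = momentum_update(key_params, query_params)
    batch_keys = [[float(step), 0.0]]   # hypothetical embeddings from the key encoder
    queue.extend(batch_keys)            # newest keys enter, oldest are dropped
```

Because the key encoder changes slowly, embeddings computed several steps ago remain roughly comparable with fresh ones, which is what allows the queue to supply many negatives from a small batch.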

Beyond Contrastive: Avoiding Collapse Without Negatives

Representation collapse is a failure mode in which the model learns the trivial solution of mapping all inputs to the same representation, so that every pair looks “similar”. Contrastive methods avoid it by pushing negatives apart.

Newer methods avoid collapse without explicit negatives:

BYOL (Bootstrap Your Own Latent, 2020): Two networks — online and target (momentum average of online). The online network predicts the target network’s representation of the same image with different augmentation. No negatives needed. Works because the momentum encoder evolves slowly, providing stable targets.

DINO (Self-DIstillation with NO Labels, 2021): Self-distillation with a teacher = momentum EMA of student. Uses multi-crop (global and local crops at different scales). Produces attention maps that cleanly segment semantic objects without any segmentation labels — a remarkable emergent property.

SimSiam (2021): Uses a stop-gradient operation instead of a momentum encoder. Both views pass through the same encoder, but gradients flow through only one branch at a time; the paper’s experiments show that removing the stop-gradient leads to collapse.
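For reference, the SimSiam objective can be written compactly. The notation here is introduced for illustration and does not appear in the text above: $z_1, z_2$ are the encoder outputs for the two augmented views, $p_1, p_2$ are the outputs of a small predictor head applied to them, and $D$ is the negative cosine similarity.

$$D(p, z) = -\frac{p}{\lVert p \rVert_2} \cdot \frac{z}{\lVert z \rVert_2}, \qquad \mathcal{L} = \tfrac{1}{2} D(p_1, \text{stopgrad}(z_2)) + \tfrac{1}{2} D(p_2, \text{stopgrad}(z_1))$$

The $\text{stopgrad}$ terms are treated as constants during backpropagation, so each branch is trained to predict the other without being able to move its target.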

Self-Supervised Learning for Speech and Video

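wav2vec 2.0 (Facebook, 2020): Masks spans of the latent speech representation, quantizes the speech representations into discrete codes, and trains the model to identify which quantized representation corresponds to each masked region among a set of distractors (a contrastive task over the quantized codebook). The base setup pretrains on 960 hours of unlabeled LibriSpeech audio. The paper also reports that, with pretraining on a larger 53k-hour unlabeled set, fine-tuning on only 10 minutes of labeled speech still reaches a low word error rate.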

VideoMAE (2022): Masked Autoencoder for video — masks 90% of video patches (much higher than for images because consecutive frames are similar), reconstructs original pixels. The high masking rate forces the model to learn temporal dynamics rather than just copying nearby frames.
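A 90% masking ratio can be sketched as below. The sketch assumes "tube" masking, where the same spatial positions are hidden in every frame so the model cannot copy a patch from a neighboring frame; the grid sizes and the function name `tube_mask` are illustrative choices, not values taken from the text.

```python
import random

def tube_mask(num_frames, num_patches_per_frame, mask_ratio=0.9, rng=None):
    """Return mask[t][p] == True where patch p of frame t is hidden.

    The same spatial positions are hidden in every frame (tube masking).
    """
    rng = rng if rng is not None else random.Random()
    num_masked = round(mask_ratio * num_patches_per_frame)
    masked_positions = set(rng.sample(range(num_patches_per_frame), num_masked))
    frame_mask = [p in masked_positions for p in range(num_patches_per_frame)]
    return [list(frame_mask) for _ in range(num_frames)]

mask = tube_mask(num_frames=8, num_patches_per_frame=100, rng=random.Random(0))
visible_per_frame = mask[0].count(False)   # patches the encoder actually sees
```

Only the visible patches are fed to the encoder, so a high masking ratio also makes pretraining cheaper per clip.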

Data2Vec (Facebook, 2022): Unified SSL framework for text, audio, and vision using the same algorithm — mask-and-predict within each modality.

From Pretraining to Fine-Tuning

Self-supervised pretraining produces representations that transfer remarkably well. The pretrained encoder is then fine-tuned on downstream tasks with small labeled datasets.

For BERT: pretrain on about 3.3B words of unlabeled text, then fine-tune on a small labeled set (on the order of 1,000 examples); for some tasks this can match or beat earlier models trained on the full labeled dataset.

For SimCLR: pretrain on ImageNet without labels, then run a linear probe (train only a linear classifier on frozen features) with ImageNet labels. The largest model reported, a ResNet-50 (4×), reaches 76.5% top-1 accuracy, on par with a supervised ResNet-50.
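A linear probe can be sketched with a toy example. Everything here is a stand-in chosen for illustration: `frozen_encoder` plays the role of a pretrained network whose weights are never updated, and the classifier is a logistic regression trained with plain stochastic gradient descent on four hand-made points.

```python
import math

def frozen_encoder(x):
    """Stand-in for a pretrained encoder; its parameters are never updated."""
    return [x[0] + x[1], x[0] - x[1]]

def train_linear_probe(inputs, labels, lr=0.5, epochs=200):
    """Fit only a linear layer (w, b) on top of fixed encoder features."""
    features = [frozen_encoder(x) for x in inputs]   # computed once, encoder stays fixed
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-z))           # predicted probability of class 1
            # logistic-loss gradient step; only w and b change
            w = [wi - lr * (p - y) * fi for wi, fi in zip(w, f)]
            b -= lr * (p - y)
    return w, b

def predict(w, b, x):
    f = frozen_encoder(x)
    return int(sum(wi * fi for wi, fi in zip(w, f)) + b > 0)

inputs = [[1.0, 0.0], [0.0, 1.0], [0.0, -1.0], [-1.0, 0.0]]
labels = [1, 0, 1, 0]        # label is 1 when x[0] > x[1]
w, b = train_linear_probe(inputs, labels)
predictions = [predict(w, b, x) for x in inputs]
```

Probe accuracy measures how linearly separable the classes already are in the frozen features, which is why it is used as a proxy for representation quality.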

One thing to remember: Self-supervised learning’s power comes from scale — the pretext task doesn’t need to be the final task, it just needs to force the model to learn useful representations from enormous amounts of freely available unlabeled data.
