Self-Supervised Learning — Deep Dive

Masked Autoencoders: The Asymmetric Approach

He et al. (2022), “Masked Autoencoders Are Scalable Vision Learners” (MAE), showed that a simple idea (mask most of an image and reconstruct it) can match or beat contrastive methods for vision SSL under end-to-end fine-tuning when combined with a Vision Transformer (ViT) backbone.

Key design choices:

  • High masking ratio (75%, vs. 15% for BERT): Random patches are masked. The high ratio makes the task nontrivial — you can’t fill in missing patches from immediately adjacent ones.
  • Asymmetric encoder-decoder: The encoder sees only visible patches (25% of image). The decoder is lightweight and sees encoded visible patches + mask tokens. Only the encoder is used after pretraining.
  • Pixel reconstruction: Reconstruct raw pixel values in masked patches (MSE loss), not semantic features.
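
The masking and loss described in the list above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `random_masking` and `masked_mse`, the 196 x 768 patch shape (a 224x224 image cut into 16x16 RGB patches), and the all-zero stand-in for the decoder output are assumptions made for the example.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """Split patches into a visible subset and a boolean mask (True = masked)."""
    rng = np.random.default_rng() if rng is None else rng
    num_patches = patches.shape[0]
    num_keep = int(num_patches * (1 - mask_ratio))
    perm = rng.permutation(num_patches)      # random patch order
    keep_idx = np.sort(perm[:num_keep])      # indices the encoder sees
    mask = np.ones(num_patches, dtype=bool)
    mask[keep_idx] = False                   # visible patches are not masked
    return patches[keep_idx], keep_idx, mask

def masked_mse(pred, target, mask):
    """Mean squared error averaged over the masked patches only."""
    per_patch = ((pred - target) ** 2).mean(axis=-1)
    return per_patch[mask].mean()

rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 768))        # assumed: 14x14 patches of 16*16*3 pixels
visible, keep_idx, mask = random_masking(patches, 0.75, rng)
pred = np.zeros_like(patches)                # hypothetical stand-in for decoder output
loss = masked_mse(pred, patches, mask)
print(visible.shape, int(mask.sum()))        # 49 visible patches, 147 masked
```

Only `visible` would be passed to the encoder; the decoder would receive the encoded visible patches plus mask tokens at the positions where `mask` is true.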

Why does pixel reconstruction work when early autoencoders failed? Scale + ViT architecture. ViT’s attention mechanism allows each visible patch to attend to every other visible patch globally, building rich semantic representations before reconstruction.

The asymmetric architecture gives a 3–4x training speedup: the encoder, which is the computationally expensive part, processes only 25% of the tokens.
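
A rough back-of-the-envelope calculation shows where the encoder savings come from. The token count of 196 (a 224x224 image with 16x16 patches) and the simple per-layer cost model are assumptions for illustration; they are not measurements.

```python
num_tokens = 196                      # assumed: 224x224 image, 16x16 patches
visible_fraction = 0.25               # 75% of patches are masked
visible_tokens = int(num_tokens * visible_fraction)

# Per-layer encoder cost relative to encoding every token.
linear_cost_ratio = visible_tokens / num_tokens             # MLP and projections scale as O(n)
attention_cost_ratio = (visible_tokens / num_tokens) ** 2   # attention scores scale as O(n^2)

print(visible_tokens, linear_cost_ratio, attention_cost_ratio)
# prints: 49 0.25 0.0625
```

The end-to-end speedup is smaller than these per-layer ratios suggest, because the decoder still processes the full set of tokens and fixed overheads such as data loading remain.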

MAE vs. other pretraining methods (ImageNet-1K top-1 accuracy after end-to-end fine-tuning, as reported by He et al.):

  • MAE ViT-L: 85.9%
  • BEiT ViT-L: 85.2%
  • MoCo v3 ViT-L: 84.1%

MAE’s advantages grow with model size — it scales better than contrastive approaches, making it preferred for large ViT models.

VICReg and Barlow Twins: Variance-Invariance-Covariance

VICReg (Bardes et al., 2022) and Barlow Twins (Zbontar et al., 2021) represent a different approach to avoiding collapse: directly regularizing the embedding space statistics.

Barlow Twins: Force the cross-correlation matrix between the embeddings of two augmented views to be close to the identity matrix: $$\mathcal{L}_{BT} = \sum_i (1 - C_{ii})^2 + \lambda \sum_i \sum_{j \neq i} C_{ij}^2$$

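Where $C_{ij} = \frac{\sum_b z_{b,i}^A z_{b,j}^B}{\sqrt{\sum_b (z_{b,i}^A)^2} \sqrt{\sum_b (z_{b,j}^B)^2}}$, with $b$ indexing the samples in the batch and $i$, $j$ indexing embedding dimensions.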

The first term enforces invariance (diagonal = 1: each component of the two views should be highly correlated). The second term enforces redundancy reduction (off-diagonal ≈ 0: different components should be uncorrelated — encoding different information).
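
The two terms can be written out directly in NumPy. This is a minimal sketch under stated assumptions: the function name `barlow_twins_loss` is made up for the example, the embeddings are mean-centered along the batch before the correlation is computed, and `lam=5e-3` is an illustrative value for $\lambda$ rather than a recommendation.

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lam=5e-3, eps=1e-12):
    """z_a, z_b: (batch, dim) embeddings of two augmented views."""
    z_a = z_a - z_a.mean(axis=0)                     # center each dimension over the batch
    z_b = z_b - z_b.mean(axis=0)
    norm_a = np.sqrt((z_a ** 2).sum(axis=0)) + eps
    norm_b = np.sqrt((z_b ** 2).sum(axis=0)) + eps
    c = (z_a.T @ z_b) / np.outer(norm_a, norm_b)     # (dim, dim) cross-correlation matrix
    diag = np.diag(c)
    invariance = ((1.0 - diag) ** 2).sum()           # push diagonal entries toward 1
    redundancy = (c ** 2).sum() - (diag ** 2).sum()  # push off-diagonal entries toward 0
    return invariance + lam * redundancy, c

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 8))
loss_same, c_same = barlow_twins_loss(z, z)                          # identical views
loss_diff, _ = barlow_twins_loss(z, rng.normal(size=(256, 8)))       # unrelated views
print(loss_same < loss_diff)
```

With identical views the diagonal of `c` is 1 and only the small redundancy term remains; with unrelated views the invariance term dominates.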

VICReg (Variance-Invariance-Covariance Regularization): $$\mathcal{L}_{VICReg} = \lambda s(Z, Z') + \mu[v(Z) + v(Z')] + \nu[c(Z) + c(Z')]$$

  • $s$: invariance, the MSE between $Z$ and $Z'$ (positive pair similarity)
  • $v$: variance, a hinge loss that keeps the standard deviation of each embedding dimension from collapsing below a threshold
  • $c$: covariance, which pushes the off-diagonal covariance terms toward zero (decorrelation)
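
The three terms above can be sketched in NumPy as follows. The function names, the threshold `gamma=1.0`, the small `eps`, and the weights `lam=25`, `mu=25`, `nu=1` are assumptions chosen for illustration, and the invariance term here averages over all entries, which is one of several possible normalizations.

```python
import numpy as np

def variance_term(z, gamma=1.0, eps=1e-4):
    """Hinge on the per-dimension standard deviation over the batch."""
    std = np.sqrt(z.var(axis=0, ddof=1) + eps)
    return np.maximum(0.0, gamma - std).mean()

def covariance_term(z):
    """Sum of squared off-diagonal covariance entries, scaled by dimension."""
    n, d = z.shape
    zc = z - z.mean(axis=0)
    cov = zc.T @ zc / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return (off_diag ** 2).sum() / d

def vicreg_loss(z, z_prime, lam=25.0, mu=25.0, nu=1.0):
    invariance = ((z - z_prime) ** 2).mean()
    variance = variance_term(z) + variance_term(z_prime)
    covariance = covariance_term(z) + covariance_term(z_prime)
    return lam * invariance + mu * variance + nu * covariance

rng = np.random.default_rng(0)
z = rng.normal(size=(512, 16))
healthy = vicreg_loss(z, z + 0.01 * rng.normal(size=z.shape))
collapsed = vicreg_loss(np.zeros((512, 16)), np.zeros((512, 16)))
print(healthy < collapsed)
```

A fully collapsed embedding has zero invariance and covariance cost, so it is the variance hinge alone that makes collapse expensive.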

These methods don’t require large batches (unlike SimCLR) or momentum encoders (unlike BYOL/DINO), making them more accessible for compute-constrained settings.

Scaling Laws for Self-Supervised Pretraining

Kaplan et al. (2020) established power-law scaling for language model training. Because next-token prediction is itself a self-supervised objective, these are scaling laws for SSL:

For autoregressive LM pretraining: $\mathcal{L}(N, D) \approx \frac{A}{N^\alpha} + \frac{B}{D^\beta}$ where $N$ is parameters and $D$ is training tokens.

Hoffmann et al. (2022) “Training Compute-Optimal Large Language Models” (Chinchilla) showed that for a given compute budget $C$, the optimal allocation balances model size and dataset size: $N_{opt} \propto C^{0.49}$, $D_{opt} \propto C^{0.51}$.
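
To make the exponents concrete, the short calculation below shows how the optimal parameter and token counts grow when the compute budget is multiplied by some factor. The function name `scale_allocation` is hypothetical, and the proportionality constants are ignored, so only the ratios are meaningful.

```python
def scale_allocation(compute_multiplier, n_exp=0.49, d_exp=0.51):
    """Factors by which optimal N and D grow when compute grows by compute_multiplier."""
    return compute_multiplier ** n_exp, compute_multiplier ** d_exp

n_factor, d_factor = scale_allocation(10.0)
print(f"10x compute -> {n_factor:.2f}x parameters, {d_factor:.2f}x tokens")
# prints: 10x compute -> 3.09x parameters, 3.24x tokens
```

Because the two exponents are nearly equal, extra compute is split almost evenly between a larger model and more data.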

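Implication: most large models of that period were undertrained, meaning they were too large relative to their training data. Under the common approximation $C \approx 6ND$, GPT-3 (175B params, 300B tokens) used about $3 \times 10^{23}$ FLOPs; the same budget would train a ~70B-parameter model on roughly 750B tokens, a split much closer to the compute-optimal ratio. Chinchilla (70B, 1.4T tokens) validated the idea by outperforming Gopher (280B params), which was trained with a similar compute budget.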
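
The compute comparison can be checked with simple arithmetic. The sketch assumes the widely used approximation that training costs about 6 FLOPs per parameter per token; the helper names are hypothetical, and the result is an order-of-magnitude estimate only.

```python
def train_flops(params, tokens):
    """Approximate training compute, assuming C ~ 6 * N * D."""
    return 6.0 * params * tokens

def tokens_for_budget(flops, params):
    """Tokens a model of the given size can see within a fixed compute budget."""
    return flops / (6.0 * params)

gpt3_flops = train_flops(175e9, 300e9)           # GPT-3 figures from the text
tokens_70b = tokens_for_budget(gpt3_flops, 70e9)

print(f"GPT-3 compute: {gpt3_flops:.2e} FLOPs")
print(f"70B model at the same compute: {tokens_70b / 1e9:.0f}B tokens")
```

Under this approximation a 70B-parameter model could see several hundred billion tokens for GPT-3's budget, which is the same order of magnitude as the figure discussed above.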

For vision SSL, the scaling behavior is similar: larger ViTs trained on more data consistently improve, with no clear plateau yet observed.

Emergent Capabilities in Self-Supervised Models

Some of SSL’s most striking properties emerge from scale rather than explicit training objectives.

DINO’s semantic segmentation: DINO’s self-attention maps cleanly segment objects by semantic category — background, object parts — without ever being trained on segmentation labels. This emerged purely from the self-supervised training objective. The attention heads specialized to different semantic aspects of the scene.

In-context learning (GPT-3): Autoregressive LMs at sufficient scale learn to perform new tasks from a few examples in the prompt — without any parameter updates. This wasn’t designed; it emerged from training on enough diverse text.

World models from video: V-JEPA (Bardes et al., Meta, 2024) was trained on video with a JEPA (Joint-Embedding Predictive Architecture) objective: predict the representations of masked spatio-temporal regions from the visible ones, in a learned latent space rather than in pixel space. The model developed internal representations that reflect scene dynamics without explicit physics supervision.

Linear representation hypothesis: Recent interpretability work (Park et al., 2023) formalizes the idea that concepts are represented as directions in the representation space of SSL-pretrained models. The classic word-embedding arithmetic (king - man + woman ≈ queen) is a familiar manifestation of this.
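
A toy example makes the "concepts as directions" idea concrete. The 2-D vectors below are hand-built for illustration (axis 0 stands for royalty, axis 1 for gender); they are not taken from any trained model, and the helper `nearest` is hypothetical.

```python
import numpy as np

# Hand-built toy embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
emb = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

def nearest(vec):
    """Return the vocabulary word whose embedding has the highest cosine similarity."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(emb, key=lambda w: cos(vec, emb[w]))

query = emb["king"] - emb["man"] + emb["woman"]   # move along the gender direction
result = nearest(query)
print(result)   # prints: queen
```

In learned embeddings the directions are not axis-aligned and the match is only approximate, but the same vector arithmetic applies.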

Joint-Embedding Predictive Architectures (JEPAs)

LeCun has argued that the future of SSL lies not in contrastive methods or reconstruction methods, but in predictive architectures in latent space.

The key distinction:

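  • Reconstruction (MAE, language MLM): Predict pixels or tokens, which forces the model to spend capacity on irrelevant low-level detail
  • Contrastive (SimCLR): Pull the embeddings of two views of the same input together and push other inputs apart; nothing is predicted explicitly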
  • JEPA: Predict the representation of a masked region, not the raw content

$$s_{xy} = D\big(g_\phi(f_\theta(x)),\ \bar{f}(y)\big)$$

The context encoder $f_\theta$ processes the visible regions $x$; a predictor $g_\phi$ predicts the representation of the masked regions $y$; a target encoder $\bar{f}$, which receives no gradient (stop-gradient), provides the prediction targets; and $D$ is a distance in representation space. By predicting in abstract representation space, the model is encouraged to discard unpredictable low-level detail and keep semantically meaningful features.
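
The roles of the three components can be sketched with linear maps in NumPy. This is a forward-pass illustration under heavy simplification, not the I-JEPA or V-JEPA implementation: the weight matrices, dimensions, and function names are hypothetical, the gradient step for the context encoder and predictor is omitted, and the target encoder is updated as an exponential moving average of the context encoder, which is one common way to realize the stop-gradient branch.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_rep = 32, 16

W_ctx = rng.normal(scale=0.1, size=(d_in, d_rep))    # context encoder (linear stand-in)
W_pred = rng.normal(scale=0.1, size=(d_rep, d_rep))  # predictor
W_tgt = W_ctx.copy()                                 # target encoder, never trained by gradient

def jepa_loss(x_context, y_target, W_ctx, W_pred, W_tgt):
    s_x = x_context @ W_ctx        # representation of the visible region
    s_y_hat = s_x @ W_pred         # predicted representation of the masked region
    s_y = y_target @ W_tgt         # target representation, treated as a constant
    return ((s_y_hat - s_y) ** 2).mean()

def ema_update(W_tgt, W_ctx, momentum=0.996):
    """Move the target encoder slowly toward the context encoder."""
    return momentum * W_tgt + (1.0 - momentum) * W_ctx

x = rng.normal(size=(8, d_in))     # features of visible patches
y = rng.normal(size=(8, d_in))     # features of masked patches
loss = jepa_loss(x, y, W_ctx, W_pred, W_tgt)

W_ctx_new = W_ctx + 1.0            # pretend a gradient step changed the context encoder
W_tgt_new = ema_update(W_tgt, W_ctx_new)
print(loss >= 0.0)
```

The loss compares representations only, so nothing in the objective asks the model to reproduce raw pixels.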

I-JEPA (Image JEPA, 2023) and V-JEPA (Video JEPA, 2024) from Meta demonstrated competitive performance while being significantly more efficient than pixel-reconstruction approaches — and producing representations with stronger abstraction.

One thing to remember: The conceptual arc of self-supervised learning is moving from pretext tasks (designed by humans) toward principled objectives (designed to force learning of hierarchical, predictive models of the world) — and the scaling laws suggest this direction will continue to dominate for the foreseeable future.
