Knowledge Distillation — Core Concepts

The Soft Target Insight

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean published “Distilling the Knowledge in a Neural Network” in 2015. The core observation: when a trained model outputs class probabilities, even the small probabilities contain useful information.

Consider a model trained to classify handwritten digits. When shown a “2”, it might output [0, 0, 0.95, 0.01, 0, 0, 0, 0, 0.04, 0], indexed by digit 0 through 9: 95% for “2”, 4% for “8” (which looks similar in shape), 1% for “3”. The 4% for “8” isn’t wrong; it’s the model encoding its understanding that 2s and 8s are more similar than 2s and 5s.

A hard label says “this is a 2.” A soft probability distribution says “this is a 2, but it somewhat resembles an 8 and looks almost nothing like a 5.” The second is far more informative for training a student model.

The Distillation Loss

Standard training minimizes cross-entropy between predictions and hard one-hot labels: $$\mathcal{L}_{hard} = -\sum_i y_i \log p_i$$

Distillation training minimizes a weighted combination: $$\mathcal{L}_{distill} = \alpha \cdot \mathcal{L}_{hard} + (1-\alpha) \cdot T^2 \cdot \mathcal{L}_{soft}$$

Where $\mathcal{L}_{soft}$ is the KL divergence between the student’s soft predictions and the teacher’s soft predictions, both computed with temperature $T$.
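The combined loss can be sketched in plain Python for a single example. This is a minimal illustration, not a reference implementation: the function names, the default `alpha=0.5` and `temperature=4.0`, and the choice of KL(teacher || student) as the direction of the divergence are all assumptions made here. The temperature-scaled softmax it relies on is the one defined in the next paragraph.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, softened by `temperature`."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Hard-label cross-entropy for one example: -log p[label]."""
    return -math.log(probs[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=4.0):
    """alpha * L_hard + (1 - alpha) * T^2 * L_soft (assumed defaults)."""
    # Hard term: student at T=1 against the one-hot label.
    hard = cross_entropy(softmax(student_logits), label)
    # Soft term: both models softened with the same temperature.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft = kl_divergence(teacher_soft, student_soft)  # assumed direction
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft
```

When the student's logits equal the teacher's, the soft term is zero and only the weighted hard-label term remains.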

Temperature scaling: The softmax at temperature $T$ sharpens or softens probabilities: $$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

At $T=1$ (standard): probabilities are peaked, low-probability classes get very small values. At $T=4$ (high temperature): probabilities are softer, more of the teacher’s knowledge about class relationships is accessible.
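To see the softening effect, the sketch below applies the temperature softmax to one set of logits at $T=1$ and $T=4$. The logits are hypothetical values chosen to mimic a teacher looking at a “2”; they do not come from a real model.

```python
import math

def softmax_with_temperature(logits, temperature):
    """p_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for digits 0..9 (index 2 is the true class).
logits = [1.0, 0.5, 6.0, 2.0, 0.0, 0.5, 1.0, 0.5, 3.0, 1.0]

sharp = softmax_with_temperature(logits, 1.0)  # peaked distribution
soft = softmax_with_temperature(logits, 4.0)   # flatter distribution

for digit, (p1, p4) in enumerate(zip(sharp, soft)):
    print(f"digit {digit}: T=1 -> {p1:.3f}   T=4 -> {p4:.3f}")
```

The predicted class stays the same at both temperatures, but at $T=4$ the mass on “2” drops and the mass on “8” rises, so the ranking of the runner-up classes contributes more to the loss.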

During distillation, both teacher and student use the same temperature $T > 1$ for the soft target loss. At inference, the student runs at $T=1$ as usual. The $T^2$ factor in the loss compensates for the scale change: gradients from the soft-target term shrink roughly as $1/T^2$, so multiplying by $T^2$ keeps the soft and hard terms in balance when $T$ is changed.
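A short derivation shows where the $1/T^2$ scaling comes from. The notation here is an assumption added for illustration: $z_i$ are the student's logits, $v_i$ the teacher's logits, $p_i^{S}$ and $p_i^{T}$ the corresponding softened probabilities, and $N$ the number of classes. The gradient of the soft loss with respect to a student logit is: $$\frac{\partial \mathcal{L}_{soft}}{\partial z_i} = \frac{1}{T}\left(p_i^{S} - p_i^{T}\right)$$

When $T$ is large compared with the logits, and assuming the logits of each model are zero-mean, $\exp(z_i/T) \approx 1 + z_i/T$, which gives: $$\frac{\partial \mathcal{L}_{soft}}{\partial z_i} \approx \frac{1}{N T^2}\left(z_i - v_i\right)$$

The gradient therefore carries a $1/T^2$ factor, which the $T^2$ multiplier in the loss cancels.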

Beyond Output Logits: Feature Distillation

Logit-level distillation transfers knowledge about final predictions. Feature distillation transfers knowledge about intermediate representations.

FitNets (Romero et al., 2014): Train the student’s intermediate activations to match the teacher’s. Add a “hint layer” loss: $$\mathcal{L}_{hint} = \| f_{teacher}(x) - g(f_{student}(x)) \|^2$$

Where $g$ is a regressor that handles dimension mismatches between teacher and student.
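As a rough sketch of the hint loss, the code below uses a plain linear map for $g$ to project a 2-dimensional student feature into a 3-dimensional teacher space. The linear regressor, the function names, and the toy values are assumptions for illustration; in practice $g$ is a learned layer trained jointly with the student.

```python
def linear_regressor(student_features, weights):
    """g: map student features to the teacher's dimension.

    `weights` has one row per teacher dimension (teacher_dim x student_dim).
    """
    return [sum(w * f for w, f in zip(row, student_features)) for row in weights]

def hint_loss(teacher_features, student_features, weights):
    """Squared L2 distance between teacher features and g(student features)."""
    projected = linear_regressor(student_features, weights)
    return sum((t - p) ** 2 for t, p in zip(teacher_features, projected))

# Hypothetical example: student dimension 2, teacher dimension 3.
weights = [[1.0, 0.0],
           [0.0, 1.0],
           [1.0, 1.0]]
student = [0.5, -1.0]
teacher = [1.5, -1.0, -0.5]

print(hint_loss(teacher, student, weights))
```

Only the first teacher coordinate differs from the projected student feature here, so the loss is the squared gap in that coordinate.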

Attention transfer (Zagoruyko & Komodakis, 2016): Transfer the spatial attention maps (where the model focuses in an image), not the full feature activations. Activations can be large; attention maps are compact.
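One common way to build such a map, sketched below, is to sum squared activations over the channel axis and L2-normalize the result. The nested-list layout (channels x height x width), the function names, and the use of a plain L2 distance as the loss are assumptions made for this illustration. Because channels are summed out, teacher and student may have different channel counts as long as the spatial size matches.

```python
import math

def attention_map(activations):
    """Collapse C x H x W activations into a flattened, L2-normalized H*W map."""
    height = len(activations[0])
    width = len(activations[0][0])
    # Sum of squared activations across channels at each spatial position.
    flat = [sum(channel[r][c] ** 2 for channel in activations)
            for r in range(height) for c in range(width)]
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0  # avoid dividing by zero
    return [v / norm for v in flat]

def attention_transfer_loss(teacher_acts, student_acts):
    """L2 distance between the normalized teacher and student attention maps."""
    t_map = attention_map(teacher_acts)
    s_map = attention_map(student_acts)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_map, s_map)))

# Hypothetical activations: the teacher has 3 channels, the student has 1.
teacher_acts = [[[1.0, 0.0], [0.0, 0.0]]] * 3
student_same = [[[2.0, 0.0], [0.0, 0.0]]]   # focuses on the same position
student_diff = [[[0.0, 2.0], [0.0, 0.0]]]   # focuses on a different position

print(attention_transfer_loss(teacher_acts, student_same))
print(attention_transfer_loss(teacher_acts, student_diff))
```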

Relational knowledge distillation (Park et al., 2019): Rather than matching individual features, match the relationships between examples. Transfer the distance matrix between samples in feature space.
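A minimal sketch of the distance-matching idea follows. It normalizes each pairwise distance matrix by its mean off-diagonal distance so that teacher and student spaces of different scale and dimension can be compared, then penalizes the difference with a mean squared error. The squared-error penalty and the function names are simplifying assumptions; other penalties can be substituted. The batch must contain at least two samples.

```python
import math

def pairwise_distances(features):
    """Euclidean distance between every pair of feature vectors in a batch."""
    n = len(features)
    return [[math.dist(features[i], features[j]) for j in range(n)]
            for i in range(n)]

def normalized_distances(features):
    """Distance matrix divided by its mean off-diagonal distance."""
    d = pairwise_distances(features)
    n = len(features)
    off_diag = [d[i][j] for i in range(n) for j in range(n) if i != j]
    mean = sum(off_diag) / len(off_diag) or 1.0  # guard against all-zero distances
    return [[v / mean for v in row] for row in d]

def relational_loss(teacher_features, student_features):
    """Mean squared difference between normalized distance matrices."""
    t = normalized_distances(teacher_features)
    s = normalized_distances(student_features)
    n = len(t)
    return sum((t[i][j] - s[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

# Hypothetical batch: 3-dimensional teacher features, 2-dimensional student features.
teacher_feats = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [6.0, 8.0, 0.0]]
student_feats = [[0.0, 0.0], [0.6, 0.8], [1.2, 1.6]]  # same geometry, smaller scale

print(relational_loss(teacher_feats, student_feats))
```

The student above reproduces the teacher's relative distances at a smaller scale, so the loss is zero up to floating-point error even though no individual feature matches.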

DistilBERT: A Case Study

Hugging Face’s DistilBERT (Sanh et al., 2019) distilled BERT-base (110M parameters) to a 6-layer, 66M parameter model. The distillation procedure:

  1. Soft target loss: Student’s token-level softmax probabilities trained against teacher’s (with temperature 4)
  2. Masked language modeling loss: Standard MLM loss on hard labels (same task as BERT pretraining)
  3. Cosine embedding loss: Align student’s hidden states with teacher’s hidden states (cosine similarity)
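The three terms above can be combined as a weighted sum. The sketch below shows a cosine alignment term for a single pair of hidden-state vectors and a generic combiner; the weights `w_soft`, `w_mlm`, and `w_cos` are placeholders chosen for illustration, not the values used to train DistilBERT.

```python
import math

def cosine_embedding_loss(student_hidden, teacher_hidden):
    """1 - cosine similarity between two hidden-state vectors."""
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)

def combined_loss(soft_loss, mlm_loss, cos_loss,
                  w_soft=1.0, w_mlm=1.0, w_cos=1.0):
    """Weighted sum of the three terms (weights are hypothetical)."""
    return w_soft * soft_loss + w_mlm * mlm_loss + w_cos * cos_loss

# Vectors pointing the same way give a loss near 0; orthogonal vectors give 1.
aligned = cosine_embedding_loss([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_embedding_loss([1.0, 0.0], [0.0, 1.0])
print(aligned, orthogonal)
```

The cosine term depends only on direction, so the student's hidden states can differ in magnitude from the teacher's without being penalized.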

Results: 40% smaller, 60% faster, and 97% of BERT’s performance on the GLUE benchmark. For practical NLP applications (sentiment analysis, text classification, NER), DistilBERT is often the right choice: near-BERT quality at mobile-friendly inference speeds.

GPT-2 distillation followed similar patterns. The challenge for generative models: the teacher’s output is a sequence, not a single softmax. Each token’s probability distribution is distilled separately, which works well but means the distillation loss is computationally similar to pretraining.
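Per-token distillation can be sketched as an average of per-position divergences. The code assumes teacher and student logits are already aligned position by position over the same vocabulary; the function names and the default temperature of 2.0 are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over one position's vocabulary logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sequence_distillation_loss(teacher_logits_seq, student_logits_seq,
                               temperature=2.0):
    """Average the per-token KL(teacher || student) over all positions."""
    per_token = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s in zip(teacher_logits_seq, student_logits_seq)
    ]
    return sum(per_token) / len(per_token)

# Hypothetical 2-token sequence over a 3-word vocabulary.
teacher_seq = [[2.0, 0.0, -1.0], [0.5, 1.5, 0.0]]
student_seq = [[2.0, 0.0, -1.0], [1.5, 0.5, 0.0]]  # disagrees at position 1

print(sequence_distillation_loss(teacher_seq, student_seq))
```

Every position requires a full softmax over the vocabulary for both models, which is why the cost per training step is comparable to pretraining.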

Practical Design Choices

Capacity gap: If the teacher is much larger than the student (e.g., GPT-4 → phone-sized model), direct distillation can fail — the student can’t accurately represent the teacher’s distributions. Progressive distillation (using intermediate-sized models as stepping stones) helps.

Data requirements: Distillation generally works best on data drawn from the same distribution the teacher was trained on, and in the simplest setup it reuses the teacher’s training data. Some approaches instead use teacher-generated data, where the teacher labels unlabeled inputs for the student; this is the basis for “self-training” and “pseudo-labeling” approaches.
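Teacher-generated training data can be sketched as a simple labeling loop. Everything named here is hypothetical: `teacher_predict` stands in for any function returning class probabilities, `toy_teacher` is a stand-in model, and the `min_confidence` filter is an optional design choice rather than a required step.

```python
def build_transfer_set(teacher_predict, unlabeled_inputs, min_confidence=0.0):
    """Pair each unlabeled input with the teacher's soft prediction.

    Inputs whose top teacher probability falls below `min_confidence`
    are skipped (an assumed, optional filter).
    """
    transfer_set = []
    for x in unlabeled_inputs:
        probs = teacher_predict(x)
        if max(probs) >= min_confidence:
            transfer_set.append((x, probs))
    return transfer_set

def toy_teacher(x):
    """Stand-in teacher: confident on positive inputs, unsure otherwise."""
    return [0.9, 0.1] if x > 0 else [0.4, 0.6]

unlabeled = [1.0, -1.0, 2.0]
everything = build_transfer_set(toy_teacher, unlabeled)
confident_only = build_transfer_set(toy_teacher, unlabeled, min_confidence=0.8)
print(len(everything), len(confident_only))
```

The student is then trained on the `(input, soft prediction)` pairs, so no human labels are needed for the transfer set.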

Task-specific vs. general distillation: Distilling on task-specific data works better for that task. General distillation (on diverse data like the pretraining corpus) produces a general-purpose student.

One thing to remember: Distillation works because a trained model’s probability distributions contain more information than its predicted labels. The “dark knowledge” in soft probabilities captures relationships between classes that hard one-hot labels can’t.
