Knowledge Distillation — Explain Like I'm 5

The Expert Who Wrote a Textbook

Imagine a world-class professor with 40 years of experience. They know everything about their subject — not just the answers, but all the subtle nuances, edge cases, and how topics connect to each other.

Now they write a textbook for undergraduates. The textbook can’t contain everything in their head, but it’s organized to pass on the most important insights in a compact, accessible form. A new student can learn from this textbook and become genuinely capable — not as experienced as the professor, but far better than if they’d learned from nothing.

That’s knowledge distillation. The professor is the “teacher model”: a huge, capable AI. The textbook stands for the condensed knowledge the teacher passes on, and the new student who learns from it is the “student model”: a smaller, more efficient AI. The student learns not just from raw data, but from the teacher’s organized understanding.

Why You Can’t Just Make Small Models

The obvious approach to getting a small, fast AI: just train a smaller model. The problem is that small models trained from scratch don’t learn as well as large ones, even on the same data. Large models discover rich intermediate representations that help them generalize; small models don’t have the capacity to find those on their own.

Knowledge distillation is a clever workaround. Instead of training the small model only on the correct answer (a hard label, where one category is marked right and every other is marked wrong), you train it to match the large model’s probability distribution over all possible answers.

For an image classifier, the large model might predict: “87% cat, 10% dog, 2% fox, 1% everything else.” That’s a much richer learning signal than just “cat.” The soft probabilities encode the model’s uncertainty and the relationships between categories — and these are what the small model learns to replicate.
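To make the difference concrete, here is a minimal sketch in Python that scores two made-up student predictions against both the hard label and the teacher’s soft probabilities from the example above. The student numbers and the function name are hypothetical, and cross-entropy against the teacher’s distribution is used here as one common choice of loss, not necessarily the one any particular system uses.

```python
import math

def cross_entropy(target_probs, predicted_probs):
    """Lower is better: how well predicted_probs matches target_probs."""
    eps = 1e-12  # avoids log(0)
    return -sum(t * math.log(p + eps)
                for t, p in zip(target_probs, predicted_probs))

# Class order: cat, dog, fox, everything else
hard_label = [1.0, 0.0, 0.0, 0.0]         # just "cat"
teacher_probs = [0.87, 0.10, 0.02, 0.01]  # the teacher's soft prediction from the text

# Two hypothetical students. Both give "cat" 80%, but they spread
# the remaining 20% very differently.
student_a = [0.80, 0.15, 0.03, 0.02]  # thinks a cat looks a bit like a dog
student_b = [0.80, 0.02, 0.03, 0.15]  # thinks a cat looks like "everything else"

hard_a = cross_entropy(hard_label, student_a)
hard_b = cross_entropy(hard_label, student_b)
soft_a = cross_entropy(teacher_probs, student_a)
soft_b = cross_entropy(teacher_probs, student_b)

print(f"hard-label loss: A={hard_a:.3f}  B={hard_b:.3f}")
print(f"soft-label loss: A={soft_a:.3f}  B={soft_b:.3f}")
```

The hard label cannot tell the two students apart, because it only looks at the probability given to “cat.” The teacher’s soft probabilities give student B a worse score, because B has not picked up that cats resemble dogs more than they resemble everything else.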

Where This Is Used

When a capable AI feature runs directly on your phone or laptop (Google Assistant, voice recognition, photo enhancement), there is a good chance the model behind it was compressed, and distillation is one common way to do that. A model on the scale of GPT-4 is far too large to run on a phone; a much smaller distilled model that stays nearly as capable on common tasks can.

DistilBERT, released by Hugging Face in 2019, is 40% smaller and 60% faster than BERT while retaining 97% of BERT’s performance on language-understanding benchmarks. It remains a widely used model in production.

One thing to remember: Knowledge distillation lets a small model learn how a large model thinks, not just what it predicts — the soft probability distributions encode far more information than hard right/wrong labels.


See Also

  • Model Pruning: How AI models lose weight without losing intelligence, by removing the neurons that don't actually do anything useful to make models faster and smaller.
  • Model Quantization: How AI models get shrunk to run on your phone, using the precision-tradeoff trick that makes 70-billion-parameter models fit on consumer hardware.
  • Speculative Decoding: The clever trick that makes large AI models generate text 2-4x faster, using a small 'draft' model to guess tokens that a big model then quickly verifies.
  • Activation Functions: Why neural networks need these tiny mathematical functions, and how ReLU's simplicity accidentally made deep learning possible.
  • AI Agents Architecture: How AI systems go from answering questions to actually doing things, through the design patterns that turn language models into autonomous agents that browse, code, and plan.