Knowledge Distillation — Explain Like I'm 5

How AI companies shrink massive models down to phone-sized ones without losing much intelligence — the teacher-student trick that powers on-device AI.

The Expert Who Wrote a Textbook

Imagine a world-class professor with 40 years of experience. They know everything about their subject — not just the answers, but all the subtle nuances, edge cases, and how topics connect to each other.

Now they write a textbook for undergraduates. The textbook can’t contain everything in their head, but it’s organized to pass on the most important insights in a compact, accessible form. A new student can learn from this textbook and become genuinely capable — not as experienced as the professor, but far better than if they’d learned from nothing.

That’s knowledge distillation. The professor is the “teacher model” — a huge, capable AI. The textbook is the “student model” — a smaller, more efficient AI. The student learns not just from raw data, but from the teacher’s organized understanding.

Why You Can’t Just Make Small Models

The obvious approach to getting a small, fast AI: just train a smaller model. The problem is that small models trained from scratch don’t learn as well as large ones, even on the same data. Large models discover rich intermediate representations that help them generalize; small models don’t have the capacity to find those on their own.

Knowledge distillation offers a clever workaround. Instead of training the small model only on the correct answer (a hard label: 1 for the right class, 0 for every other class), you train it to match the large model’s probability distribution over all possible answers.

For an image classifier, the large model might predict: “87% cat, 10% dog, 2% fox, 1% everything else.” That’s a much richer learning signal than just “cat.” The soft probabilities encode the model’s uncertainty and the relationships between categories — and these are what the small model learns to replicate.
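To make the "richer learning signal" concrete, here is a minimal sketch in plain Python. It reuses the teacher probabilities from the example above; the two student predictions (`student_a`, `student_b`) are made-up numbers for illustration, and cross-entropy is used as one common way to score how well a student matches a target, not necessarily the loss any particular system uses.

```python
import math

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy of predicted against target; lower means a closer match."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

# Class order: cat, dog, fox, everything else
hard_label = [1.0, 0.0, 0.0, 0.0]         # the plain "cat" label
teacher_probs = [0.87, 0.10, 0.02, 0.01]  # the teacher's soft prediction from the text

# Two hypothetical students (illustrative numbers, not from any real model).
# Both put 70% on "cat" but disagree about the likeliest mistake.
student_a = [0.70, 0.20, 0.06, 0.04]  # ranks "dog" second, like the teacher
student_b = [0.70, 0.04, 0.06, 0.20]  # ranks "everything else" second

hard_a = cross_entropy(hard_label, student_a)
hard_b = cross_entropy(hard_label, student_b)
soft_a = cross_entropy(teacher_probs, student_a)
soft_b = cross_entropy(teacher_probs, student_b)

print(f"hard-label loss: A={hard_a:.3f}  B={hard_b:.3f}")
print(f"soft-label loss: A={soft_a:.3f}  B={soft_b:.3f}")
```

Against the hard label the two students score the same, because only the probability on "cat" matters. Against the teacher's distribution, student A scores better, because it also agrees that a dog is the most plausible confusion. That extra distinction is the information the soft probabilities carry.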

Where This Is Used

When you run a capable AI model on your phone or laptop (Google Assistant, voice recognition, photo enhancement), there is a good chance it is a distilled or otherwise compressed model. A model on the scale of GPT-4 is far too large to run on a phone; a much smaller distilled model that is nearly as capable on common tasks can.

DistilBERT, released by Hugging Face in 2019, is 40% smaller and 60% faster than BERT while retaining about 97% of BERT’s language-understanding performance on the GLUE benchmark. It is widely used in production.

One thing to remember: Knowledge distillation lets a small model learn how a large model thinks, not just what it predicts — the soft probability distributions encode far more information than hard right/wrong labels.

knowledge-distillation, model-compression, efficiency, deep-learning, on-device-ai
