Knowledge Distillation in Python — ELI5

How a big expert AI teaches a tiny student AI to be almost as smart — like a professor writing a cheat sheet for an exam.

Imagine a brilliant professor who’s read every book in the library and can answer any question about history. Now imagine they need to help a student pass tomorrow’s exam. The professor doesn’t make the student read every book — instead, they create perfect study notes that capture the most important insights.

Knowledge distillation works the same way. You have a big, powerful AI model (the “teacher”) that’s really good at its job but too large to run on a phone or small device. You want a small, fast model (the “student”) that can run anywhere.

Here’s the clever part: instead of training the student only on the original labels in the training data, you also let the student learn from the teacher’s answers.
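To make “learning from the teacher’s answers” concrete, here is a minimal sketch of one common way to grade the student. It assumes the widely used recipe of mixing two penalties: one for missing the true label and one for disagreeing with the teacher. The function names, the example scores, and the `temperature` and `alpha` settings are illustrative assumptions, not something this article prescribes.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Log-probabilities from raw scores, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    log_total = top + math.log(sum(math.exp(x - top) for x in scaled))
    return [x - log_total for x in scaled]

def distillation_loss(student_logits, teacher_logits, true_index,
                      temperature=2.0, alpha=0.5):
    """Blend a hard-label penalty with a match-the-teacher penalty.

    alpha and temperature are hypothetical defaults chosen for illustration.
    """
    # Hard part: cross-entropy between the student and the true label.
    hard_loss = -log_softmax(student_logits)[true_index]

    # Soft part: KL divergence from the teacher's softened prediction
    # to the student's softened prediction.
    log_teacher = log_softmax(teacher_logits, temperature)
    log_student = log_softmax(student_logits, temperature)
    soft_loss = sum(math.exp(lt) * (lt - ls)
                    for lt, ls in zip(log_teacher, log_student))

    # Scaling by temperature**2 keeps the two parts on a comparable scale,
    # a convention from the original distillation recipe.
    return alpha * hard_loss + (1 - alpha) * temperature ** 2 * soft_loss

# Made-up scores for the classes [cat, lynx, tiger, dog]; the true label is "cat".
teacher_logits = [6.0, 2.5, 1.5, 0.8]
good_student = [5.0, 2.0, 1.0, 0.5]   # roughly agrees with the teacher
bad_student = [0.5, 3.0, 0.2, 0.1]    # thinks the picture is a lynx

good_loss = distillation_loss(good_student, teacher_logits, true_index=0)
bad_loss = distillation_loss(bad_student, teacher_logits, true_index=0)
print(f"good student loss: {good_loss:.3f}")
print(f"bad student loss:  {bad_loss:.3f}")
```

In a real training loop, a framework such as PyTorch or TensorFlow would compute this loss on batches and adjust the student’s weights to push it down. The sketch only shows what is being measured.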

Why is the teacher’s answer better than raw data? Because the teacher gives nuanced responses. If you show both models a picture of a cat, the raw label just says “cat.” But the teacher might say: “95% cat, 3% lynx, 1% tiger, 0.5% dog.” Those extra details — the “soft” predictions — contain hidden knowledge about which categories are similar to each other. The student absorbs all of this.
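The difference between a raw label and a “soft” prediction is easy to see in a few lines of Python. This is a minimal sketch with made-up teacher scores. The `temperature` knob is an assumption borrowed from common distillation practice rather than something stated above: dividing the scores by a temperature above 1 spreads the probabilities out, so the similar-looking classes become easier for a student to notice.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature gives a softer spread."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["cat", "lynx", "tiger", "dog"]
teacher_logits = [6.0, 2.5, 1.5, 0.8]      # hypothetical raw scores from a teacher

hard_label = [1.0, 0.0, 0.0, 0.0]          # the raw data only says "cat"
soft_t1 = softmax(teacher_logits, temperature=1.0)
soft_t4 = softmax(teacher_logits, temperature=4.0)

for name, hard, p1, p4 in zip(classes, hard_label, soft_t1, soft_t4):
    print(f"{name:>5}: label {hard:.0f}   teacher T=1 {p1:.3f}   teacher T=4 {p4:.3f}")
```

The hard label treats lynx and dog as equally wrong. The teacher’s probabilities rank lynx above tiger above dog, and that ranking is the extra information the student gets to learn from.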

The result can be remarkable: in favorable cases, a student model that’s 10-100× smaller than the teacher keeps 95-99% of its accuracy, though the exact numbers depend on the task and on how small you make the student. It’s the difference between carrying a 500-page textbook and carrying a perfect 5-page summary.
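To get a feel for what “smaller” means, you can count the adjustable numbers (parameters) in each model. The sketch below assumes both models are simple fully connected networks, and the layer sizes are invented for illustration. Real teachers and students vary widely.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the width of each layer, from input to output.
    """
    total = 0
    for inputs, outputs in zip(layer_sizes, layer_sizes[1:]):
        total += inputs * outputs   # one weight per input-output connection
        total += outputs            # one bias per output unit
    return total

# Hypothetical architectures: 784 inputs (a 28x28 image) and 10 output classes.
teacher_layers = [784, 1200, 1200, 10]
student_layers = [784, 64, 10]

teacher_params = count_parameters(teacher_layers)
student_params = count_parameters(student_layers)
ratio = teacher_params / student_params

print(f"teacher: {teacher_params:,} parameters")
print(f"student: {student_params:,} parameters")
print(f"the student is about {ratio:.0f}x smaller")
```

Fewer parameters means less memory and less arithmetic per prediction, which is what lets the student run on a phone.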

This technique shows up wherever a model has to be fast or fit on small hardware: search engines answering queries quickly, voice assistants running on your phone, and self-driving cars making real-time decisions.

The one thing to remember: Knowledge distillation trains a small, fast “student” model on a big, accurate “teacher” model’s detailed predictions, not just on raw labels, capturing much of the teacher’s expertise in a fraction of the size.
