Mixture of Experts — Explain Like I'm 5
The Hospital With Many Departments
When you walk into a hospital with a broken arm, you go to orthopedics. When you’re having a heart attack, you go to cardiology. You don’t see every doctor — just the specialist who knows your problem best.
A Mixture of Experts (MoE) model works the same way. Instead of every part of the network processing every piece of text, there are many “expert” networks that specialize in different things. A “router” reads your input and decides which experts should handle it.
For any given word or phrase, only 2–4 experts out of maybe 64 are actually activated. The others stay dormant, saving computation.
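The routing step can be sketched in a few lines of plain Python. This is a toy illustration, not any particular model's implementation: the expert count (64) and the choice of 2 active experts come from the numbers above, the router scores are random stand-ins for what a trained router would compute from the input, and the names `route` and `moe_forward` are made up for this example. Turning the chosen experts' scores into weights with a softmax is one common choice, assumed here.

```python
import math
import random

NUM_EXPERTS = 64  # assumed, matching the "maybe 64" in the text
TOP_K = 2         # assumed, the low end of "2-4 experts"


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_scores, k=TOP_K):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    top = ranked[:k]
    weights = softmax([router_scores[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, experts, router_scores, k=TOP_K):
    """Run only the chosen experts and blend their outputs by weight."""
    chosen = route(router_scores, k)
    output = sum(weight * experts[i](x) for i, weight in chosen)
    return output, [i for i, _ in chosen]


# Toy experts: expert number c simply multiplies its input by c.
experts = [lambda x, c=c: c * x for c in range(NUM_EXPERTS)]

# Random scores stand in for a learned router (hypothetical values).
rng = random.Random(0)
scores = [rng.gauss(0, 1) for _ in range(NUM_EXPERTS)]

output, used = moe_forward(1.0, experts, scores)
print(f"experts used: {used} ({len(used)} of {NUM_EXPERTS})")
```

The other 62 experts are never called, which is where the saved computation comes from.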
Why This Is Clever
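In a normal (dense) model, every input passes through every part of every layer. In an MoE model, the big feed-forward block inside each layer is replaced by a set of experts. Picture a model with 10 layers and 64 experts in each: it stores 640 expert blocks in total, but for any given input only 2 of the 64 experts in each layer fire. That is 20 expert blocks doing the work, or about 3% of what is stored.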
This means you get a model with the capacity of a huge network (it has lots of specialized knowledge stored in all those experts) but the speed of a much smaller network (most of it stays inactive).
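The arithmetic behind this can be checked with a short sketch. The layer and expert counts are the illustrative numbers used in this section (10 layers, 64 experts per layer, 2 active), not the configuration of any real model, and `expert_usage` is a hypothetical helper name.

```python
def expert_usage(num_layers, experts_per_layer, active_per_layer):
    """Count expert blocks stored vs. expert blocks run for one input."""
    stored = num_layers * experts_per_layer
    used = num_layers * active_per_layer
    return stored, used, used / stored


# Illustrative numbers only: 10 layers x 64 experts, 2 active per layer.
stored, used, fraction = expert_usage(10, 64, 2)
print(f"expert blocks stored: {stored}")
print(f"expert blocks run per input: {used}")
print(f"fraction of expert blocks used: {fraction:.3f}")
```

Capacity grows with the stored count, while the cost of processing each input grows with the used count.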
Who Uses This
OpenAI is reported, though not officially confirmed, to use MoE in GPT-4, which would help explain how it can be so capable yet still respond in seconds. Mistral AI's Mixtral 8x7B (2023) is an open-weight MoE that uses only about 13B active parameters per token but performs like a 70B model on many tasks. Google has said that Gemini 1.5 is also built on an MoE architecture.
The Mixtral numbers make this concrete:
- Total parameters: 46.7 billion
- Active parameters per token: 12.9 billion
- Performance: matches or beats Llama 2 70B on most benchmarks
- Speed: comparable to a 13B model
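You get 70B-quality intelligence at 13B-model speed. That’s the MoE promise. The catch is memory: all 46.7 billion parameters still have to be loaded, even though only 12.9 billion are used for each token.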
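A back-of-the-envelope sketch shows how the two parameter counts fit together, and why "8x7B" adds up to 46.7 billion rather than 56 billion. It assumes the model has 8 experts with 2 active per token, and that parameters split cleanly into a shared part (used by every token) and equal-sized experts. The results are rough estimates derived only from the two totals above, not official figures, and `split_parameters` is a hypothetical helper name.

```python
TOTAL_B = 46.7      # total parameters, in billions (from the list above)
ACTIVE_B = 12.9     # active parameters per token, in billions (from the list above)
NUM_EXPERTS = 8     # assumption: the "8x" in the model name
ACTIVE_EXPERTS = 2  # assumption: each token is routed to 2 experts


def split_parameters(total, active, num_experts, active_experts):
    """Estimate shared and per-expert parameter counts.

    Solves: total  = shared + num_experts * per_expert
            active = shared + active_experts * per_expert
    """
    per_expert = (total - active) / (num_experts - active_experts)
    shared = active - active_experts * per_expert
    return shared, per_expert


shared_b, per_expert_b = split_parameters(
    TOTAL_B, ACTIVE_B, NUM_EXPERTS, ACTIVE_EXPERTS)
active_fraction = ACTIVE_B / TOTAL_B

print(f"estimated shared parameters: {shared_b:.1f}B")
print(f"estimated parameters per expert: {per_expert_b:.1f}B")
print(f"share of the model used per token: {active_fraction:.0%}")
```

Under these assumptions, each token touches a bit more than a quarter of the model, because the experts share some parameters instead of each being a full, separate 7B model.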
One thing to remember: Mixture of Experts gives AI models the knowledge of a giant model with the speed of a small one — by only activating a relevant subset of expert networks for each piece of input.
See Also
- Neural Scaling Laws: Why bigger AI keeps getting better, and the mathematical relationships that let researchers predict how smart an AI will be before they finish building it.
- Sparse Attention: How AI models handle very long documents without running out of memory, and the tricks that let language models work with books, not just paragraphs.
- Activation Functions: Why neural networks need these tiny mathematical functions, and how ReLU's simplicity accidentally made deep learning possible.
- AI Agents Architecture: How AI systems go from answering questions to actually doing things, and the design patterns that turn language models into autonomous agents that browse, code, and plan.
- AI Agents: ChatGPT answers questions. AI agents actually do things: browse the web, write code, send emails, and keep going until the job is done. Here's the difference.