Mixture of Experts — Explain Like I'm 5

How models such as Mixtral, and reportedly GPT-4, use specialized sub-networks to handle different kinds of input: the architecture trick that lets AI be huge without being slow.

The Hospital With Many Departments

When you walk into a hospital with a broken arm, you go to orthopedics. When you’re having a heart attack, you go to cardiology. You don’t see every doctor — just the specialist who knows your problem best.

A Mixture of Experts (MoE) model works the same way. Instead of every part of the network processing every piece of text, there are many “expert” networks that specialize in different things. A “router” reads your input and decides which experts should handle it.

For each token (roughly a word or a piece of a word), only a few experts are actually activated, typically 2 to 4 out of maybe 64. The others stay dormant, saving computation.
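
The routing idea can be sketched in a few lines of Python. This is a toy illustration, not how any particular model implements it: the eight “experts” are simple multiplication functions, and the router scores are made-up numbers (a real router computes them from the token itself). The names `route` and `moe_layer` are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_scores, top_k=2):
    """Pick the top_k highest-scoring experts and renormalize their weights."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([router_scores[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_layer(x, experts, router_scores, top_k=2):
    """Run only the chosen experts and blend their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in route(router_scores, top_k))

# Eight toy experts: expert i just multiplies its input by i + 1.
experts = [lambda x, m=i + 1: x * m for i in range(8)]

# Made-up router scores for one token (assumption for illustration only).
scores = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]

picked = route(scores)                      # experts 1 and 4 win
output = moe_layer(10.0, experts, scores)   # the other six experts never run
print(picked, output)
```

Only the two chosen experts are ever called; the other six do no work for this token, which is where the savings come from.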

Why This Is Clever

A normal (dense) model runs every input through all of its parameters in every layer. An MoE model with 10 layers and 64 experts per layer holds 640 expert blocks in total, but for any given token only 2 of the 64 experts in each layer fire. Just 20 expert blocks do any work, so the cost is roughly that of a model with 20 of those blocks rather than 640.

This means you get a model with the capacity of a huge network (it has lots of specialized knowledge stored in all those experts) but the speed of a much smaller network (most of it stays inactive).
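
The counting behind that claim is simple arithmetic. The sketch below assumes a hypothetical model with 10 layers, 64 experts per layer, and 2 active experts per token in each layer; these numbers are illustrative, not taken from any specific model.

```python
# Hypothetical configuration (assumed for illustration).
layers = 10
experts_per_layer = 64
active_per_layer = 2

total_blocks = layers * experts_per_layer    # expert blocks stored in the model
active_blocks = layers * active_per_layer    # expert blocks that run per token
active_fraction = active_blocks / total_blocks

print(f"stored: {total_blocks}, used per token: {active_blocks}")
print(f"fraction of experts active: {active_fraction:.1%}")
```

In this setup the model stores 640 expert blocks but each token only pays for 20 of them, about 3% of the total.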

Who Uses This

OpenAI is widely reported to use MoE in GPT-4, although the company has not confirmed the architecture. Mistral AI’s Mixtral 8x7B (2023) is an open-weight MoE that uses only about 13B active parameters per token but performs like a 70B model on many tasks. Google has said that Gemini 1.5 also uses an MoE architecture.

The Mixtral numbers make this concrete:

Total parameters: 46.7 billion
Active parameters per token: 12.9 billion
Performance: beats Llama 2 70B on most benchmarks
Speed: compute per token comparable to a 13B dense model (all 46.7 billion parameters must still be loaded in memory)

You get 70B-quality intelligence at 13B-model speed. That’s the MoE promise.
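
As a quick check on those figures, the snippet below works out what share of Mixtral’s parameters is active for each token, using only the two parameter counts listed above. The variable names are illustrative.

```python
# Parameter counts quoted above, in billions.
total_params_b = 46.7
active_params_b = 12.9

active_fraction = active_params_b / total_params_b
idle_params_b = total_params_b - active_params_b

print(f"active per token: {active_fraction:.1%} of all parameters")
print(f"idle per token:   {idle_params_b:.1f} billion parameters")
```

Roughly a quarter of the model does the work for any one token, while the rest sits idle until a different token needs it.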

One thing to remember: Mixture of Experts gives AI models the knowledge of a giant model with the speed of a small one — by only activating a relevant subset of expert networks for each piece of input.

