Mixture of Experts — Deep Dive

Token Routing: Beyond Top-K Gating

Standard top-K routing assigns each token to exactly K experts. Several alternatives address its limitations:
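As a baseline for the alternatives that follow, here is a minimal sketch of top-K gating. It assumes the router has already produced one logit per (token, expert) pair; the function name `top_k_route` and the sizes are illustrative, and renormalizing the softmax over only the selected experts is one common convention rather than the only one.

```python
import numpy as np

def top_k_route(logits: np.ndarray, k: int):
    """Return (expert_ids, weights), each of shape (tokens, k)."""
    # Indices of the k largest logits per token (order within the k is arbitrary).
    expert_ids = np.argpartition(logits, -k, axis=-1)[:, -k:]
    top_logits = np.take_along_axis(logits, expert_ids, axis=-1)
    # Softmax over the selected logits only, so each token's weights sum to 1.
    top_logits = top_logits - top_logits.max(axis=-1, keepdims=True)
    weights = np.exp(top_logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return expert_ids, weights

rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 8))   # 16 tokens, 8 experts (hypothetical sizes)
expert_ids, weights = top_k_route(logits, k=2)
load = np.bincount(expert_ids.ravel(), minlength=8)  # tokens received per expert
print(load)
```

Every token gets exactly 2 experts, but nothing forces `load` to be even across experts. That unevenness is the limitation the schemes below try to address.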

Expert Choice Routing (Zhou et al., 2022): Instead of tokens choosing experts, experts choose tokens. Each expert selects its top-scoring tokens from the batch, up to a fixed per-expert budget. This guarantees perfect load balance (every expert processes exactly the same number of tokens), but the number of experts per token now varies: some tokens are picked by many experts, and some by none (“dropped”).

Training is more stable when every token receives some expert processing, and expert choice routing gives no such guarantee. Tokens that no expert selects need a “no-expert fallback”; usually the residual connection simply carries them past the MoE block.
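The selection step can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes a dense (tokens, experts) score matrix and a hypothetical per-expert budget called `capacity`, and it only builds the assignment mask.

```python
import numpy as np

def expert_choice_route(scores: np.ndarray, capacity: int) -> np.ndarray:
    """scores: (tokens, experts). Returns a boolean (tokens, experts) mask."""
    n_tokens, n_experts = scores.shape
    mask = np.zeros((n_tokens, n_experts), dtype=bool)
    for e in range(n_experts):
        # Each expert independently keeps its `capacity` highest-scoring tokens.
        chosen = np.argsort(scores[:, e])[-capacity:]
        mask[chosen, e] = True
    return mask

rng = np.random.default_rng(0)
scores = rng.normal(size=(16, 4))          # 16 tokens, 4 experts (hypothetical)
mask = expert_choice_route(scores, capacity=4)

tokens_per_expert = mask.sum(axis=0)       # always equal to `capacity`
experts_per_token = mask.sum(axis=1)       # varies from token to token
dropped = np.flatnonzero(experts_per_token == 0)  # tokens no expert picked
print(tokens_per_expert, experts_per_token, dropped)
```

The per-expert counts are identical by construction, while `experts_per_token` shows the variable coverage; any index in `dropped` is a token that would rely on the residual path.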

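Hash Routing: Use a deterministic hash of the token (or its position) to pick the expert, with no learned gating. Routing cost is negligible and there is no router to train; load is balanced to the extent that the hash spreads the token distribution evenly across experts. Quality is typically worse than with learned routing because the assignment does not respond to semantic content. Used mainly as a baseline and for ablations rather than in production.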
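A minimal sketch of the idea, assuming tokens are identified by integer vocabulary ids. The helper `hash_route` and the choice of CRC32 are illustrative assumptions; any fixed hash would serve.

```python
import zlib

def hash_route(token_id: int, num_experts: int) -> int:
    """Map a token id to an expert with a fixed, non-learned hash."""
    digest = zlib.crc32(token_id.to_bytes(8, "little"))
    return digest % num_experts

token_ids = [101, 2023, 2003, 1037, 2023, 102]   # hypothetical vocabulary ids
routes = [hash_route(t, num_experts=8) for t in token_ids]
print(routes)
```

The repeated id 2023 always lands on the same expert regardless of context, which is exactly why this scheme cannot adapt to meaning.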

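Soft MoE (Puigcerver et al., 2023): Instead of discrete routing, each expert processes a small, fixed number of “slots”, and each slot is a weighted average of all tokens (weighted by softmaxed gating scores). Each token’s output is then a weighted combination of all slot outputs. There is no load imbalance and no token dropping, and the layer is fully differentiable. The design trades sparse activation for soft mixing; it stays efficient because each expert runs only on its few slots rather than on every token.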
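The dispatch and combine steps can be sketched in a few lines. This is a simplified illustration under stated assumptions: `phi` stands for the learned slot parameters, the experts are stand-in `tanh` linear maps rather than real FFNs, and all sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe(x, phi, experts, slots_per_expert):
    """x: (tokens, d). phi: (d, num_experts * slots_per_expert)."""
    logits = x @ phi                       # (tokens, total_slots)
    dispatch = softmax(logits, axis=0)     # per slot: weights over all tokens
    combine = softmax(logits, axis=1)      # per token: weights over all slots
    slots = dispatch.T @ x                 # (total_slots, d) mixed inputs
    # Each expert processes only its own contiguous group of slots.
    slot_out = np.concatenate([
        expert(slots[i * slots_per_expert:(i + 1) * slots_per_expert])
        for i, expert in enumerate(experts)
    ], axis=0)
    return combine @ slot_out              # (tokens, d)

rng = np.random.default_rng(0)
tokens, d, num_experts, slots_per_expert = 10, 8, 4, 1
x = rng.normal(size=(tokens, d))
phi = rng.normal(size=(d, num_experts * slots_per_expert))
weights = [rng.normal(size=(d, d)) for _ in range(num_experts)]
experts = [lambda z, w=w: np.tanh(z @ w) for w in weights]  # stand-in experts
y = soft_moe(x, phi, experts, slots_per_expert)
print(y.shape)
```

Expert cost depends on the number of slots, not the number of tokens, and every token contributes to every slot, so nothing is dropped.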

Expert Capacity: Managing Token Overflow

With top-K routing over batches, experts may receive unequal token counts. If all $B$ tokens in a batch prefer expert $i$, it must process $B$ tokens while others process 0.

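Expert capacity $C$ limits tokens per expert: if more than $C$ tokens are routed to an expert, the excess are “dropped”. The MoE layer leaves their representations unmodified, and they bypass it via the residual connection (typically marked with an “overflow” flag).

The capacity factor $\alpha$ sets $C = \alpha \cdot B / N$, where $N$ is the number of experts. This form assumes top-1 routing; with top-K routing there are $K \cdot B$ assignments to distribute rather than $B$.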

  • $\alpha = 1$: Each expert can hold exactly $B/N$ tokens, its share under perfectly balanced routing. Any imbalance causes overflow, so some tokens are dropped.
  • $\alpha = 2$: Each expert can hold twice its balanced share, so only a severe imbalance drops tokens, at the cost of extra padded compute and memory.
  • Values between 1 and 2 are typical. Google’s Switch Transformer found $\alpha = 1.25$ sufficient for most settings.

Dropped tokens are handled via the residual: their representation passes through unchanged, so they miss that layer’s FFN computation. When routing is reasonably balanced, the dropped fraction at typical capacity factors is small, on the order of 1% of tokens or less.
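The capacity rule can be illustrated with a small sketch for top-1 routing. The helper `apply_capacity` is hypothetical, tokens are admitted first-come-first-served in sequence order, and rounding $C$ up with `ceil` is an assumption (implementations differ on rounding and on which tokens get priority).

```python
import math

def apply_capacity(assignments, num_experts, capacity_factor):
    """assignments: one expert id per token (top-1 routing), in token order."""
    capacity = math.ceil(capacity_factor * len(assignments) / num_experts)
    counts = [0] * num_experts
    kept, dropped = [], []
    for token_idx, expert in enumerate(assignments):
        if counts[expert] < capacity:
            counts[expert] += 1
            kept.append(token_idx)
        else:
            dropped.append(token_idx)   # this token uses the residual path only
    return capacity, kept, dropped

# 8 tokens, 4 experts; expert 0 is over-subscribed with 5 tokens.
assignments = [0, 0, 0, 0, 0, 1, 2, 3]
for alpha in (1.0, 1.25, 2.0):
    capacity, kept, dropped = apply_capacity(assignments, 4, alpha)
    print(f"alpha={alpha}: C={capacity}, dropped tokens={dropped}")
# alpha=1.0: C=2, dropped tokens=[2, 3, 4]
# alpha=1.25: C=3, dropped tokens=[3, 4]
# alpha=2.0: C=4, dropped tokens=[4]
```

Even at $\alpha = 2$ one token is dropped here because the toy routing is deliberately skewed; with a load-balancing loss, real routing is usually far less lopsided.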

DeepSeek-V2: Multi-Head Latent Attention + Fine-Grained MoE

DeepSeek-V2 (DeepSeek AI, 2024) combined architectural innovations that make it one of the most parameter-efficient models at its capability level:

Multi-Head Latent Attention (MLA): Instead of storing full KV pairs ($2 \times n_{heads} \times d_{head}$ values per token per layer in cache), DeepSeek-V2 compresses the KV to a single low-rank latent vector per token. The full K and V are reconstructed via learned projections at attention time.

KV cache reduction: from $2 \times 128 \times 128 = 32{,}768$ values per token per layer to $512$ values per token per layer, a 64x compression. At 128k context, that is roughly 4.3 billion cached values per layer versus 67 million; in 16-bit precision, about 8 GiB versus 128 MiB per layer per request. This is critical for serving long-context models efficiently.
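The arithmetic is easy to check. The sketch below uses the head count, head dimension, and latent size quoted above; the 16-bit element width and the choice of 128k = 131,072 tokens are assumptions, and the totals are for a single layer.

```python
def kv_values_per_token(n_heads: int, d_head: int) -> int:
    """Values cached per token per layer for standard multi-head attention."""
    return 2 * n_heads * d_head          # one K and one V vector per head

full = kv_values_per_token(n_heads=128, d_head=128)   # 32768
latent = 512                                          # compressed latent size
ratio = full // latent                                # 64

context = 128 * 1024          # 131072 tokens
bytes_per_value = 2           # assumption: 16-bit storage
full_bytes = full * context * bytes_per_value
latent_bytes = latent * context * bytes_per_value

print(ratio)                          # 64
print(full_bytes / 2**30, "GiB")      # 8.0 GiB per layer
print(latent_bytes / 2**20, "MiB")    # 128.0 MiB per layer
```

The absolute byte counts scale with element width and the number of layers, but the 64x ratio does not depend on either.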

Fine-Grained MoE Experts: Instead of 8 large experts (as in Mixtral), use 160 small routed experts with 6 active per token (DeepSeek-V2 also adds 2 shared experts that every token uses). Fine-grained experts create more routing flexibility and more consistent load distribution.

With 6 of 160 experts active, the active parameter count can be matched to an 8-expert, 2-active design by making each expert proportionally smaller. What changes is the granularity: the router chooses among vastly more expert combinations, which allows more specialized routing and reduces variance in load balancing.
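One way to quantify the extra routing flexibility is to count the distinct expert subsets a token can be sent to. This is a back-of-the-envelope illustration, not a measure of model quality.

```python
import math

coarse = math.comb(8, 2)      # 8 experts, 2 active
fine = math.comb(160, 6)      # 160 experts, 6 active

print(coarse)                 # 28
print(fine)                   # 21193254160
print(f"{fine / coarse:.2e}")
```

With 28 possible pairs, many unrelated tokens must share the same expert combination; with roughly 2 × 10^10 possible 6-subsets, the router has far more room to separate them.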

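Results: DeepSeek-V2 (236B total params, 21B active) is reported to be competitive with much larger dense models on standard benchmarks. Relative to the dense DeepSeek 67B, its authors report 42.5% lower training cost, a 93.3% smaller KV cache, and up to 5.76x higher maximum generation throughput.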

MoE Scaling Laws

Artetxe et al. (2022) “Efficient Large Scale Language Modeling with Mixtures of Experts” studied MoE scaling behavior.

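MoE scaling law: For a fixed training compute budget, the optimal MoE model is characterized by two choices:

  • Number of experts: adding experts increases total parameters while active parameters per token stay fixed
  • Size of each expert: set by the training FLOPs budget, because FLOPs scale with active parameters rather than total parameters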
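The split between total and active parameters can be made concrete with a small count for one MoE feed-forward layer. The helper `moe_ffn_params` is illustrative and assumes a two-matrix FFN per expert with no biases plus a linear router; the dimensions are hypothetical.

```python
def moe_ffn_params(d_model: int, d_ff: int, num_experts: int, active_experts: int):
    """Return (total, active) parameter counts for one MoE FFN layer."""
    per_expert = 2 * d_model * d_ff        # up- and down-projection, biases ignored
    router = d_model * num_experts         # one gating logit per expert
    total = num_experts * per_expert + router
    active = active_experts * per_expert + router
    return total, active

for n in (8, 32, 128):
    total, active = moe_ffn_params(d_model=1024, d_ff=4096,
                                   num_experts=n, active_experts=2)
    print(f"experts={n:4d}  total={total:,}  active={active:,}")
```

Total parameters grow almost linearly with the number of experts, while the active count changes only through the small router term. This is why per-token FLOPs stay roughly constant as experts are added.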

Empirically: at fixed inference compute, MoE models achieve lower perplexity than dense models. But the improvement decreases with scale — at very large active parameter counts, the advantage of more experts diminishes.

The granularity tradeoff: More, smaller experts vs. fewer, larger experts.

  • Many small experts: better routing specialization, more token selection flexibility, higher all-to-all communication overhead
  • Few large experts: less communication, less routing overhead, less specialization flexibility

Optimal expert size depends on hardware topology (communication cost vs. compute ratio). Google found that expert sizes of ~1B parameters each work well for TPU clusters with fast interconnects.

Data efficiency: MoE models learn faster (require fewer training tokens) than dense models of equivalent active parameters to reach the same perplexity. The additional parameters (inactive experts) provide additional effective capacity that accelerates learning.

Inference-time scaling: A commonly raised concern is that MoE models benefit less efficiently than dense models from “inference-time scaling” (chain-of-thought, longer generation). Generating more tokens does not bring more of the model to bear on each one: every additional token is still routed to only $K$ experts (2 in a Mixtral-style model), although different tokens may land on different experts. This is an active area of MoE research.

One thing to remember: MoE’s fundamental tension is between capacity (more experts = more specialized knowledge) and efficiency (routing, communication overhead, load balancing) — and the engineering of good MoE systems is largely about resolving this tension at production scale.
