Mixture of Experts — Deep Dive

Switch Transformer routing algorithms, expert capacity and token dropping, DeepSeek-V2's multi-head latent attention with MoE, fine-grained expert granularity, and MoE scaling laws.

Token Routing: Beyond Top-K Gating

Standard top-K routing assigns each token to exactly K experts. Several alternatives address its limitations:

Expert Choice Routing (Zhou et al., 2022): Instead of tokens choosing experts, experts choose tokens. Each expert selects its top-K tokens from the batch by gating score (here K is a per-expert token budget, not the per-token expert count of top-K gating). This guarantees perfect load balance, since every expert processes exactly K tokens, but the number of experts per token now varies: some tokens are picked by many experts and some by none (“dropped”).

Training is more stable when every token receives some expert processing, but expert choice routing leaves tokens that no expert selected without any. Those tokens need a “no-expert fallback”, usually the residual connection that bypasses the MoE block.
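To make the "experts choose tokens" idea concrete, here is a minimal sketch in plain Python. The function name `expert_choice_route` and the toy score matrix are hypothetical, chosen only for illustration; real implementations work on batched tensors and are not shown here.

```python
def expert_choice_route(scores, k):
    """scores[t][e] is the gating score of token t for expert e.

    Returns (assignments, dropped): assignments[e] lists the token indices
    picked by expert e; dropped lists tokens that no expert picked.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    assignments = []
    for e in range(num_experts):
        # Each expert ranks all tokens by its own score column and keeps the top k.
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        assignments.append(sorted(ranked[:k]))
    chosen = {t for picks in assignments for t in picks}
    dropped = [t for t in range(num_tokens) if t not in chosen]
    return assignments, dropped


# Toy gating scores: 4 tokens, 2 experts (hypothetical values).
scores = [
    [0.9, 0.8],  # token 0: high score for both experts
    [0.7, 0.6],  # token 1: high score for both experts
    [0.2, 0.1],  # token 2: low score everywhere
    [0.3, 0.4],  # token 3: low score everywhere
]
assignments, dropped = expert_choice_route(scores, k=2)
print(assignments, dropped)
```

In this toy batch both experts pick tokens 0 and 1, so load is perfectly balanced while tokens 2 and 3 receive no expert at all and would rely on the residual path.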

Hash Routing: Use a deterministic hash of the token (or its position) to pick the expert, with no learned gating. Routing overhead is negligible, and load is balanced to the extent that the hash spreads tokens evenly. Quality is worse because routing does not respond to semantic content, so hash routing is used mostly as an ablation baseline rather than in production.
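A minimal sketch of hash routing, assuming token IDs are integers and using CRC32 purely as a stand-in for whatever fixed hash a real system would use. The helper name `hash_route` is hypothetical.

```python
import zlib


def hash_route(token_id, num_experts):
    # CRC32 stands in for any fixed, deterministic hash; nothing is learned.
    return zlib.crc32(str(token_id).encode("utf-8")) % num_experts


token_ids = [17, 42, 17, 99, 42]
routes = [hash_route(t, num_experts=4) for t in token_ids]
print(routes)
```

The same token ID always lands on the same expert regardless of context, which is exactly why this scheme is cheap and also why it cannot adapt to meaning.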

Soft MoE (Puigcerver et al., 2023): Instead of discrete routing, each expert processes a small number of “slots”, and each slot is a weighted combination of all tokens (weighted by gating scores). There is no load imbalance and no token dropping, and the layer is fully differentiable. It trades sparse activation for soft mixing; cost stays bounded because each expert sees a fixed number of slots rather than every token.
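The soft mixing can be sketched as two softmaxes over the same token-slot logits: one over tokens to build each slot's input, one over slots to mix expert outputs back per token. This is a simplified illustration with one slot per expert and hand-written logits; the names `soft_moe`, `tokens`, `logits`, and the toy experts are assumptions, not code from the paper.

```python
import math


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [x / total for x in exps]


def soft_moe(tokens, logits, experts):
    """tokens: n vectors of dim d; logits[t][j]: affinity of token t to slot j;
    experts: one callable per slot (one slot per expert in this sketch)."""
    n, s, d = len(tokens), len(experts), len(tokens[0])
    # Dispatch: softmax over tokens, so each slot input is a convex mix of all tokens.
    dispatch = [softmax([logits[t][j] for t in range(n)]) for j in range(s)]
    slot_in = [
        [sum(dispatch[j][t] * tokens[t][i] for t in range(n)) for i in range(d)]
        for j in range(s)
    ]
    slot_out = [experts[j](slot_in[j]) for j in range(s)]
    # Combine: softmax over slots, so each token output mixes all slot outputs.
    combine = [softmax(logits[t]) for t in range(n)]
    return [
        [sum(combine[t][j] * slot_out[j][i] for j in range(s)) for i in range(d)]
        for t in range(n)
    ]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logits = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # hypothetical affinities
experts = [lambda v: [2.0 * x for x in v], lambda v: [x + 1.0 for x in v]]
outputs = soft_moe(tokens, logits, experts)
print(outputs)
```

Every token contributes to every slot and receives output from every slot, so nothing is dropped, while each expert still runs on only one vector per slot.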

Expert Capacity: Managing Token Overflow

With top-K routing over batches, experts may receive unequal token counts. If all $B$ tokens in a batch prefer expert $i$, it must process $B$ tokens while others process 0.

Expert capacity $C$ limits tokens per expert: if more than $C$ tokens are routed to an expert, the excess are “dropped”. The MoE layer does not modify their representations; they bypass it via the residual connection and are flagged as overflow.

The capacity factor $\alpha$ sets $C = \alpha \cdot B / N$, where $N$ is the number of experts:

$\alpha = 1$: Each expert's capacity equals its share under perfectly even routing, $B/N$ tokens. Any imbalance causes overflow and therefore some token dropping.
$\alpha = 2$: 2x buffer. An expert must receive more than twice its even share before any token is dropped.
Values between 1 and 2 are typical. Google’s Switch Transformer found $\alpha = 1.25$ sufficient for most settings.

Dropped tokens are handled via the residual: their representation passes through unchanged, potentially losing relevant FFN computation. In practice, <1% of tokens are dropped at typical capacity factors.
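The capacity rule can be sketched for top-1 routing as follows. The function name `route_with_capacity`, the first-come-first-served admission order, and rounding the capacity up with `ceil` are all assumptions made for this illustration; actual implementations differ in these details.

```python
import math


def route_with_capacity(preferred_expert, num_experts, capacity_factor):
    """Top-1 routing with a per-expert cap C = ceil(alpha * B / N).

    preferred_expert[t] is the expert chosen by token t. Tokens are admitted
    in sequence order (an assumption); overflow tokens are returned as dropped.
    """
    batch = len(preferred_expert)
    capacity = math.ceil(capacity_factor * batch / num_experts)
    load = [0] * num_experts
    kept, dropped = [], []
    for t, e in enumerate(preferred_expert):
        if load[e] < capacity:
            load[e] += 1
            kept.append(t)
        else:
            dropped.append(t)  # this token only passes through the residual
    return capacity, kept, dropped


# 8 tokens, 4 experts; expert 0 is over-subscribed (hypothetical routing).
prefs = [0, 0, 0, 0, 1, 2, 3, 0]
cap_1, kept_1, dropped_1 = route_with_capacity(prefs, 4, capacity_factor=1.0)
cap_2, kept_2, dropped_2 = route_with_capacity(prefs, 4, capacity_factor=2.0)
print(cap_1, dropped_1)
print(cap_2, dropped_2)
```

With five of eight tokens preferring expert 0, raising the capacity factor from 1 to 2 cuts the dropped tokens from three to one, at the cost of reserving twice as much buffer on every expert.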

DeepSeek-V2: Multi-Head Latent Attention + Fine-Grained MoE

DeepSeek-V2 (DeepSeek AI, 2024) combines two architectural changes aimed at making the model cheap to train and serve for its capability level:

Multi-Head Latent Attention (MLA): Instead of storing full KV pairs (2 × $n_{heads}$ × $d_{head}$ values per token in each layer's cache), DeepSeek-V2 compresses keys and values into a single low-rank latent vector per token. The full K and V are reconstructed via learned projections at attention time.

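KV cache reduction: from $2 \times 128 \times 128 = 32,768$ values per token per layer to $512$, a 64x compression. (DeepSeek-V2 also caches a small decoupled rotary-position key per token, so the reduction in practice is slightly smaller.) At a 128k-token context and one byte per cached value, this takes the cache from about 4 GiB to 64 MiB per layer per request; at 16-bit precision both figures double. This is critical for serving long-context models efficiently.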
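The cache arithmetic can be checked with a few lines of Python. The helper `kv_cache_bytes` is hypothetical, and the 2-byte value width, the single-layer scope, and reading "128k" as 128 × 1024 tokens are assumptions made here, not figures from the text.

```python
def kv_cache_bytes(values_per_token, context_len, bytes_per_value=2, num_layers=1):
    # Cache size = values stored per token x tokens x bytes per value x layers.
    return values_per_token * context_len * bytes_per_value * num_layers


n_heads, d_head, latent_dim = 128, 128, 512
context_len = 128 * 1024           # assumed meaning of "128k"
full = 2 * n_heads * d_head        # keys + values for every head

mha_bytes = kv_cache_bytes(full, context_len)
mla_bytes = kv_cache_bytes(latent_dim, context_len)
ratio = mha_bytes // mla_bytes

GiB = 1024 ** 3
MiB = 1024 ** 2
print(mha_bytes / GiB, mla_bytes / MiB, ratio)
```

Under these assumptions one layer needs 8 GiB with full KV pairs and 128 MiB with the latent vector; with `bytes_per_value=1` the same call gives 4 GiB and 64 MiB. The 64x ratio is independent of the byte width and layer count.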

Fine-Grained MoE Experts: Instead of 8 large experts (as in Mixtral), DeepSeek-V2 uses 160 small routed experts with 6 active per token, plus 2 shared experts that every token passes through. Fine-grained experts create more routing flexibility and more consistent load distribution.

With experts sized proportionally smaller, activating 6 of 160 experts can cost about the same active parameters per token as activating 2 of 8, but the finer granularity gives the router far more expert combinations to choose from and reduces variance in load balancing.
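One way to quantify the extra routing flexibility is to count how many distinct expert subsets a token can be sent to. This counting view is an illustration added here, not a metric from the DeepSeek-V2 report, and the variable names are hypothetical.

```python
import math

# Number of distinct expert subsets a single token can be routed to.
coarse = math.comb(8, 2)          # 8 experts, 2 active (Mixtral-style)
fine_grained = math.comb(160, 6)  # 160 routed experts, 6 active

print(coarse, fine_grained, fine_grained // coarse)
```

The coarse configuration offers 28 possible pairs, while the fine-grained one offers on the order of $10^{10}$ possible subsets, which is the sense in which routing becomes more flexible at similar active compute.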

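Results: DeepSeek-V2 (236B total params, 21B active) is reported to perform competitively with leading models on standard benchmarks while being substantially cheaper to serve than a dense model of comparable capability; the DeepSeek-V2 report cites up to 5.76x higher maximum generation throughput than the dense DeepSeek 67B.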

MoE Scaling Laws

Artetxe et al. (2022) “Efficient Large Scale Language Modeling with Mixtures of Experts” studied MoE scaling behavior.

MoE scaling law: For a fixed training compute budget, two trends shape the optimal MoE model:

More experts: total parameters grow while active parameters per token stay fixed.
Expert size: set by the training FLOPs budget, because FLOPs scale with active parameters.
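The split between total and active parameters can be written out directly. This sketch counts only the expert FFN parameters of one MoE layer and ignores attention, embeddings, and the router; the function name `moe_ffn_params` and the example sizes are hypothetical.

```python
def moe_ffn_params(num_experts, active_experts, params_per_expert):
    # Total grows with the number of experts; active depends only on how many fire.
    total = num_experts * params_per_expert
    active = active_experts * params_per_expert
    return total, active


params_per_expert = 100_000_000  # hypothetical expert size
table = {n: moe_ffn_params(n, 2, params_per_expert) for n in (8, 32, 128)}
for n, (total, active) in table.items():
    print(n, total, active)
```

Going from 8 to 128 experts multiplies the stored parameters by 16 while the per-token active parameters, and hence the per-token FLOPs, do not change.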

Empirically: at fixed inference compute, MoE models achieve lower perplexity than dense models. But the improvement decreases with scale — at very large active parameter counts, the advantage of more experts diminishes.

The granularity tradeoff: More, smaller experts vs. fewer, larger experts.

Many small experts: better routing specialization, more token selection flexibility, higher all-to-all communication overhead
Few large experts: less communication, less routing overhead, less specialization flexibility

Optimal expert size depends on hardware topology (the ratio of communication cost to compute). Google reportedly found that expert sizes of ~1B parameters each work well for TPU clusters with fast interconnects.

Data efficiency: MoE models learn faster (require fewer training tokens) than dense models of equivalent active parameters to reach the same perplexity. The additional parameters (inactive experts) provide additional effective capacity that accelerates learning.

Inference-time scaling: MoE models may not benefit from “inference-time scaling” (chain-of-thought, longer generation) as efficiently as dense models, because generating more tokens does not increase the number of experts active per token: each additional token still activates only K experts (for example, 2). This is an active area of MoE research.

One thing to remember: MoE’s fundamental tension is between capacity (more experts = more specialized knowledge) and efficiency (routing, communication overhead, load balancing) — and the engineering of good MoE systems is largely about resolving this tension at production scale.
