Model Pruning — Core Concepts

Why Pruning Works: Over-Parameterization

Deep neural networks typically train best with more parameters than they need at inference time. During training, the extra capacity:

  • Provides redundant gradient paths (reduces vanishing gradients)
  • Creates exploration diversity (different neurons specialize in different features)
  • Provides “insurance” — if one approach doesn’t work, another path does

After training, much of this capacity is redundant. Trained weight values cluster around zero, with a distribution often approximated as Laplacian: most weights are near zero, and a few have large magnitude. The near-zero weights contribute minimally to outputs.

Pruning removes these near-zero weights and fine-tunes the remaining network. The result: significantly smaller models with modest accuracy loss.

Pruning Criteria: What to Remove

Magnitude-based pruning: Remove the weights with the smallest absolute value. Simple, fast, surprisingly effective. Han et al. (2015) used a magnitude threshold: remove all weights where $|w_i| < \text{threshold}$, then retrain the remaining weights.
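A minimal NumPy sketch of magnitude pruning. Instead of a fixed threshold it takes a target sparsity and derives the cutoff from it; the function name `magnitude_prune` and the Laplace-distributed toy weights are assumptions for illustration, not taken from Han et al.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> tuple[np.ndarray, np.ndarray]:
    """Zero out the `sparsity` fraction of entries with the smallest |w|.

    Returns the pruned weights and a boolean mask (True = weight kept).
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    k = int(round(sparsity * weights.size))  # number of weights to remove
    mask = np.ones(weights.size, dtype=bool)
    if k > 0:
        flat = np.abs(weights).ravel()
        # argpartition places the k smallest magnitudes in the first k slots
        prune_idx = np.argpartition(flat, k - 1)[:k]
        mask[prune_idx] = False
    mask = mask.reshape(weights.shape)
    return weights * mask, mask

rng = np.random.default_rng(0)
W = rng.laplace(scale=0.05, size=(64, 64))   # toy weights, assumed Laplacian
W_pruned, mask = magnitude_prune(W, sparsity=0.8)
print(f"kept {mask.mean():.1%} of weights")
```

The returned mask is what a fine-tuning loop would reuse to keep pruned weights at zero.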

Gradient-based criteria: Importance $I_i = |w_i \cdot \nabla_{w_i} \mathcal{L}|$ considers both weight magnitude and gradient. It is a first-order Taylor estimate of how much the loss changes when $w_i$ is set to zero, so weights with the smallest $I_i$ are pruned first.
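A small sketch of this importance score on a toy linear regression, where the gradient can be written in closed form. The data, the "partially trained" weight vector, and the choice to prune half the weights are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 256, 8
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.5, 0.0, 0.0, 0.7, 0.0, 0.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Hypothetical partially trained weights
w = rng.normal(scale=0.5, size=d)

residual = X @ w - y
grad = X.T @ residual / n            # gradient of L = mean(residual**2) / 2
importance = np.abs(w * grad)        # I_i = |w_i * dL/dw_i|

# Prune the half of the weights with the lowest importance
order = np.argsort(importance)       # least important first
mask = np.ones(d, dtype=bool)
mask[order[: d // 2]] = False

print("importance:", np.round(importance, 3))
print("kept indices:", np.flatnonzero(mask))
```

In a real network the gradient would come from backpropagation on a batch, and the score is usually averaged over several batches to reduce noise.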

Optimal Brain Damage (OBD, LeCun 1989): Use second-order Taylor expansion to estimate the loss change when each weight is pruned: $$\delta \mathcal{L} \approx \frac{1}{2} h_{ii} \delta w_i^2$$

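Where $h_{ii} = \partial^2 \mathcal{L} / \partial w_i^2$ is the $i$-th diagonal entry of the Hessian. Pruning weight $i$ means $\delta w_i = -w_i$, and at a trained minimum the first-order term vanishes, so the saliency of weight $i$ is $\frac{1}{2} h_{ii} w_i^2$. Prune the weights with the lowest saliency. More accurate than magnitude pruning, but requires computing the Hessian diagonal.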
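The OBD saliency can be checked on a toy problem where the Hessian is known. The sketch below assumes a linear least-squares loss, for which the Hessian is $X^\top X / n$ and the second-order estimate is exact at the minimum; the data and feature scales are made up so that saliency and magnitude rank the weights differently.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 512, 6
scales = np.array([0.1, 0.5, 1.0, 2.0, 5.0, 10.0])   # per-feature input scale
X = rng.normal(size=(n, d)) * scales
w_true = np.array([2.0, 1.0, 0.5, 0.2, 0.1, 0.05])
y = X @ w_true + 0.1 * rng.normal(size=n)

def loss(w):
    r = X @ w - y
    return 0.5 * np.mean(r ** 2)

# "Trained" weights: the exact least-squares minimum
w, *_ = np.linalg.lstsq(X, y, rcond=None)

h_diag = np.mean(X ** 2, axis=0)       # diagonal of the Hessian X^T X / n
saliency = 0.5 * h_diag * w ** 2       # OBD estimate of the loss increase

def actual_increase(i):
    w_pruned = w.copy()
    w_pruned[i] = 0.0
    return loss(w_pruned) - loss(w)

actual = np.array([actual_increase(i) for i in range(d)])

print("saliency      :", np.round(saliency, 4))
print("actual change :", np.round(actual, 4))
print("lowest saliency :", int(np.argmin(saliency)))
print("lowest magnitude:", int(np.argmin(np.abs(w))))
```

Here the weight with the largest magnitude has the lowest saliency because its input feature has a small scale, which is the kind of case where magnitude pruning and OBD disagree. For real networks the loss is not quadratic, so the estimate is only approximate.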

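L1/L2 regularization during training: L1 regularization naturally drives many weights toward zero, while L2 only shrinks weights without making them sparse. Train with an L1 penalty, then prune the near-zero weights. Sparsity is then encouraged during training rather than imposed only afterward.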
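A toy comparison of training with and without an L1 penalty, followed by thresholding. This is a sketch on synthetic linear regression using plain (sub)gradient descent; the penalty strength, learning rate, and threshold are assumed values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:4] = [1.5, -2.0, 1.0, 0.8]      # only 4 of 20 features matter
y = X @ w_true + 1.0 * rng.normal(size=n)

def train(l1_strength, steps=2000, lr=0.01):
    """Gradient descent on mean squared error plus l1_strength * ||w||_1."""
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n + l1_strength * np.sign(w)
        w -= lr * grad
    return w

threshold = 0.02                        # assumed "near zero" cutoff
w_plain = train(l1_strength=0.0)
w_l1 = train(l1_strength=0.3)

near_zero_plain = int(np.sum(np.abs(w_plain) < threshold))
near_zero_l1 = int(np.sum(np.abs(w_l1) < threshold))
w_pruned = np.where(np.abs(w_l1) < threshold, 0.0, w_l1)

print("near-zero weights without L1:", near_zero_plain)
print("near-zero weights with L1   :", near_zero_l1)
```

Subgradient descent leaves the penalized weights oscillating very close to zero rather than exactly at zero, which is why the final thresholding step is still needed.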

Unstructured vs. Structured Pruning

Unstructured pruning: Remove individual weights anywhere in the network. The weight matrix becomes sparse (scattered zeros). Maximum flexibility — can remove any weight.

The problem: modern hardware gains little from unstructured sparsity. Sparse matrix multiplication on GPUs is usually not much faster than dense multiplication unless sparsity is extreme (>90%). In a dense layout, a pruned weight is stored as an explicit zero, so it still occupies memory and is still multiplied.
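The memory side of this can be estimated with simple arithmetic. The sketch below assumes float32 values and a CSR sparse layout with 32-bit indices; other formats and index widths change the break-even point.

```python
def dense_bytes(rows: int, cols: int, value_bytes: int = 4) -> int:
    """Bytes needed to store every entry, zeros included."""
    return rows * cols * value_bytes

def csr_bytes(rows: int, cols: int, sparsity: float,
              value_bytes: int = 4, index_bytes: int = 4) -> int:
    """Bytes for CSR: one value and one column index per nonzero, plus row pointers."""
    nnz = round(rows * cols * (1.0 - sparsity))
    return nnz * (value_bytes + index_bytes) + (rows + 1) * index_bytes

rows = cols = 4096
for s in (0.5, 0.8, 0.9, 0.95):
    ratio = csr_bytes(rows, cols, s) / dense_bytes(rows, cols)
    print(f"sparsity {s:.0%}: CSR uses {ratio:.2f}x the dense storage")
```

Under these assumptions a 50%-sparse matrix takes slightly more space in CSR than in dense form, because every surviving value carries an index with it.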

Structured pruning: Remove entire units — neurons, attention heads, convolutional filters, or transformer layers. The pruned network has the same dense structure, just smaller. Standard hardware efficiently handles smaller dense matrices.

Examples:

  • Filter pruning: Remove entire convolutional filters. The next layer’s corresponding input channels are also removed.
  • Attention head pruning: Remove entire attention heads. Michel et al. (2019) showed 20–40% of BERT’s attention heads can be pruned with minimal performance loss.
  • Layer pruning: Remove entire transformer layers. LayerDrop (Fan et al., 2019) trains with layer-level dropout so that layers, for example every other one, can be removed at inference time with limited accuracy loss.
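A sketch of filter pruning on raw weight arrays, showing how removing filters from one convolution also removes the matching input channels of the next. It assumes weights laid out as (out_channels, in_channels, kernel_h, kernel_w) and ranks filters by L1 norm, which is one common choice rather than the only one; `prune_filters` is a hypothetical helper.

```python
import numpy as np

def prune_filters(conv1_w, conv1_b, conv2_w, keep_ratio):
    """Keep the filters of conv1 with the largest L1 norm.

    conv1_w: (out1, in1, kh, kw), conv1_b: (out1,), conv2_w: (out2, out1, kh, kw)
    """
    n_filters = conv1_w.shape[0]
    n_keep = max(1, int(round(keep_ratio * n_filters)))
    norms = np.abs(conv1_w).sum(axis=(1, 2, 3))        # one L1 norm per filter
    keep = np.sort(np.argsort(norms)[-n_keep:])        # indices of kept filters
    # Drop output filters of conv1 and the matching input channels of conv2
    return conv1_w[keep], conv1_b[keep], conv2_w[:, keep], keep

rng = np.random.default_rng(0)
conv1_w = rng.normal(size=(16, 3, 3, 3))
conv1_b = rng.normal(size=16)
conv2_w = rng.normal(size=(32, 16, 3, 3))

p1_w, p1_b, p2_w, kept = prune_filters(conv1_w, conv1_b, conv2_w, keep_ratio=0.5)
print(p1_w.shape, p2_w.shape)   # (8, 3, 3, 3) (32, 8, 3, 3)
```

Both pruned arrays are ordinary dense tensors, so no sparse kernels are needed. In a real network any batch-norm parameters between the two layers would have to be sliced with the same indices.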

The Lottery Ticket Hypothesis

Frankle & Carbin (MIT, 2019) proposed that large networks contain small “winning ticket” subnetworks:

Hypothesis: A randomly initialized dense network contains a small subnetwork (the “winning ticket”) that, when trained in isolation from the same random initialization, matches the full network’s accuracy.

Finding the ticket: Iterative magnitude pruning (IMP):

  1. Train the network for $k$ steps
  2. Prune the p% lowest-magnitude remaining weights globally
  3. Reset the remaining weights to their original initialization values
  4. Repeat from step 1 with the reduced network

After multiple rounds, you find a sparse subnetwork that, trained from its original initialization, reaches accuracy comparable to the dense network.
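The IMP loop can be sketched on a tiny two-layer network in NumPy. Everything here is a toy assumption (the regression task, network size, 20% pruning per round, five rounds); it shows the train, prune, reset structure rather than reproducing the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]              # toy regression target

def init_params(rng):
    return {"W1": rng.normal(scale=0.3, size=(10, 32)),
            "W2": rng.normal(scale=0.3, size=(32, 1))}

def train(init, masks, steps=300, lr=0.05):
    """Train from `init` with pruned weights held at zero; returns weights and loss."""
    p = {name: init[name] * masks[name] for name in init}
    for _ in range(steps):
        h = np.tanh(X @ p["W1"])
        err = (h @ p["W2"] - y[:, None]) / len(y)     # dL/dpred for L = mean(e^2)/2
        g2 = h.T @ err
        g1 = X.T @ ((err @ p["W2"].T) * (1.0 - h ** 2))
        p["W1"] -= lr * g1 * masks["W1"]              # masked update
        p["W2"] -= lr * g2 * masks["W2"]
    pred = np.tanh(X @ p["W1"]) @ p["W2"]
    return p, float(0.5 * np.mean((pred - y[:, None]) ** 2))

def prune_global(trained, masks, fraction):
    """Remove `fraction` of the surviving weights, ranked by magnitude across layers."""
    alive = np.concatenate([np.abs(trained[k][masks[k]]) for k in trained])
    n_prune = int(fraction * alive.size)
    if n_prune == 0:
        return masks
    threshold = np.partition(alive, n_prune - 1)[n_prune - 1]
    return {k: masks[k] & (np.abs(trained[k]) > threshold) for k in trained}

init = init_params(rng)                               # saved original initialization
masks = {k: np.ones(w.shape, dtype=bool) for k, w in init.items()}
total = sum(m.size for m in masks.values())

for round_idx in range(5):
    trained, loss = train(init, masks)                # always restart from `init`
    alive = sum(int(m.sum()) for m in masks.values())
    print(f"round {round_idx}: density={alive / total:.2f} loss={loss:.4f}")
    masks = prune_global(trained, masks, fraction=0.2)

remaining = sum(int(m.sum()) for m in masks.values())
print(f"final ticket keeps {remaining} of {total} weights")
```

The key detail is that `train` always starts from the saved `init` values, with only the mask changing between rounds. Each round costs a full training run, which is where the expense of IMP comes from.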

Implications: The large network’s training value isn’t in its final weights — it’s in revealing which connections matter. The “winning ticket” concept suggests that neural network design could theoretically start with a small network if we could find the right initialization.

Limitations: IMP is very expensive (many training rounds). The tickets don’t transfer well across different datasets or architectures. At very large scale (billion-parameter models), the hypothesis is harder to verify.

Pruning Schedules and Gradual Pruning

Pruning to final sparsity all at once hurts accuracy significantly. Gradual pruning recovers most of this:

Gradual magnitude pruning (Zhu & Gupta, 2018): Increase sparsity gradually during training, allowing the network to continuously adapt:

$$s_t = s_f + (s_0 - s_f)\left(1 - \frac{t - t_0}{n\Delta t}\right)^3$$

Where $s_t$ is the sparsity at step $t$, $s_0$ is the initial sparsity (usually 0), $s_f$ is the final target sparsity, $t_0$ is the step at which pruning begins, $n$ is the number of pruning steps, and $\Delta t$ is the interval between them. The cubic decay allows fast initial pruning (many near-zero weights), slowing as fewer prunable weights remain.
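The schedule is a one-line function. The sketch below follows the formula directly and clamps the result outside the pruning window; the function name `gmp_sparsity` and the example settings are assumptions.

```python
def gmp_sparsity(t: int, s0: float, sf: float, t0: int, n: int, dt: int) -> float:
    """Target sparsity at step t for the cubic gradual pruning schedule."""
    if t < t0:
        return s0                                   # pruning has not started yet
    progress = min(1.0, (t - t0) / (n * dt))        # clamp after the last pruning step
    return sf + (s0 - sf) * (1.0 - progress) ** 3

# Example: prune from 0% to 90% over 10 pruning steps, one every 100 training steps
for step in range(0, 1001, 100):
    print(step, round(gmp_sparsity(step, s0=0.0, sf=0.9, t0=0, n=10, dt=100), 4))
```

Halfway through this example schedule the target is already 78.75%, which is 87.5% of the final sparsity, so most of the pruning happens early.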

Prune → Retrain → Prune cycle: Three-step cycle for each sparsity level:

  1. Prune to new sparsity target
  2. Fine-tune for $k$ steps
  3. Repeat

This “sparse fine-tuning” loop is a common approach for production models.
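A sketch of the prune and fine-tune cycle on a toy linear regression. Unlike the lottery-ticket procedure, the weights are not reset between rounds. The sparsity levels, step counts, and learning rate are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 512, 50
X = rng.normal(size=(n, d))
w_true = np.where(rng.random(d) < 0.2, rng.normal(scale=2.0, size=d), 0.0)
y = X @ w_true + 0.1 * rng.normal(size=n)

def fine_tune(w, mask, steps=200, lr=0.05):
    """Gradient descent on squared error; pruned weights are held at zero."""
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n
        w = (w - lr * grad) * mask
    return w

def prune_to(w, sparsity):
    """Mask that removes the `sparsity` fraction of weights with smallest |w|."""
    k = int(round(sparsity * w.size))
    mask = np.ones(w.size, dtype=bool)
    mask[np.argsort(np.abs(w))[:k]] = False
    return mask

# Dense training first, then step through increasing sparsity targets
w = fine_tune(np.zeros(d), np.ones(d, dtype=bool))
for target in (0.5, 0.7, 0.9):
    mask = prune_to(w, target)          # 1. prune to the new sparsity target
    w = fine_tune(w * mask, mask)       # 2. fine-tune the surviving weights
    loss = 0.5 * np.mean((X @ w - y) ** 2)
    print(f"sparsity {target:.0%}: kept {int(mask.sum())} weights, loss {loss:.4f}")
```

Because already-pruned weights are exactly zero, they sort first in the next round, so each new mask is a subset of the previous one.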

Practical Results at Scale

BERT pruning: Sanh et al. (2020) “Movement Pruning” pruned BERT-base to 95% sparsity while retaining 93% of F1 score on SQuAD. Key insight: weight magnitude is not the best criterion for fine-tuned models. Movement pruning instead keeps weights that move away from zero during fine-tuning and removes those that shrink toward zero, regardless of their absolute size.

CNN pruning: VGG-16 can be pruned to 90% sparsity with ~1% top-5 accuracy loss on ImageNet. ResNet-50 at 80% sparsity retains 97% of original accuracy.

LLM one-shot pruning: SparseGPT (Frantar & Alistarh, 2023) prunes 50-60% of OPT/BLOOM/GPT-J weights in one shot (no retraining) using approximate second-order pruning. The resulting sparsity is unstructured or semi-structured (for example a 2:4 pattern), so speedups depend on sparse kernels; pruned GPT models can run efficiently on CPU with sparse matrix libraries.

One thing to remember: Structured pruning (removing entire heads, filters, or layers) is the practical choice for deployment because it directly reduces compute and memory without requiring special sparse hardware support.

