Model Quantization — Deep Dive

Second-Order Optimization for Quantization

Optimal Brain Quantization (OBQ) frames quantization as a layer-wise optimization problem. It chooses quantized weights $\hat{W}$ that minimize the change in the layer's output:

$$\min_{\hat{W}} \|WX - \hat{W}X\|_F^2$$

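Here $X$ is a batch of representative calibration inputs to the layer. Because this objective is quadratic in $\hat{W}$, its second-order Taylor expansion is exact, and each row of $W$ can be treated independently.

The closed-form updates use the Hessian of this objective with respect to the weights of a single row, which is the same for every row:

$$H = 2XX^T$$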

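OBQ quantizes one weight at a time and updates the remaining weights to compensate. Let $F$ be the set of weights in the row that are still in full precision and $H_F$ the Hessian restricted to them. When the $q$-th weight is quantized with error $\epsilon_q = w_q - \hat{w}_q$, the optimal update to the remaining weights is:

$$\delta W_F = -\frac{\epsilon_q}{[H_F^{-1}]_{qq}} (H_F^{-1})_{:,q}$$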
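As a rough illustration, the single-step update can be sketched in NumPy for one row of weights. The function names, the uniform grid with step `scale`, and the toy shapes are assumptions made for this example. It shows only the first step, where $F$ is still the whole row, so $H_F = H$.

```python
import numpy as np

def quantize_rtn(w, scale):
    """Round-to-nearest onto a uniform grid with step `scale`."""
    return np.round(w / scale) * scale

def obq_step(w, H_inv, q, scale):
    """Quantize weight q of row w and compensate with the other weights."""
    w_hat_q = quantize_rtn(w[q], scale)
    eps = w[q] - w_hat_q
    # delta = -(eps / [H^-1]_qq) * (H^-1)_{:,q}
    delta = -(eps / H_inv[q, q]) * H_inv[:, q]
    w_new = w + delta
    w_new[q] = w_hat_q  # delta[q] equals -eps in exact arithmetic; pin it to the grid
    return w_new, eps

rng = np.random.default_rng(0)
d, n = 8, 64                      # toy sizes: 8 weights, 64 calibration samples
X = rng.normal(size=(d, n))
w = rng.normal(size=d)
H_inv = np.linalg.inv(2.0 * X @ X.T)

q_idx, step = 3, 0.25
w_new, eps = obq_step(w, H_inv, q_idx, step)

# Baseline: round the same weight but leave the others untouched.
w_naive = w.copy()
w_naive[q_idx] = quantize_rtn(w[q_idx], step)

loss_comp = np.sum((w @ X - w_new @ X) ** 2)
loss_naive = np.sum((w @ X - w_naive @ X) ** 2)
print("with compensation:   ", loss_comp)
print("without compensation:", loss_naive)
```

For this quadratic objective the compensated error equals $\epsilon_q^2 / (2[H^{-1}]_{qq})$, which is never larger than the error from rounding the weight and leaving the others as they were.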

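Each compensation step is optimal, but repeating it for every weight costs $O(d_{\text{row}} \cdot d_{\text{col}}^3)$ for a $d_{\text{row}} \times d_{\text{col}}$ weight matrix, which is roughly $O(n^4)$ for an $n \times n$ layer. GPTQ (2022) made this practical by:

  1. Shared column order: Quantize every row in the same column order, so a single inverse Hessian serves all rows
  2. Cholesky reformulation: Use a Cholesky decomposition to obtain the required $(H_F^{-1})_{:,q}$ terms up front instead of updating the inverse after every weight
  3. Lazy batch updates: Process weights in blocks of 128 columns, accumulating the updates to the rest of the matrix and applying them once per block

The resulting algorithm runs in $O(\max(d_{\text{row}} \cdot d_{\text{col}}^2, d_{\text{col}}^3))$ time, which is $O(n^3)$ for an $n \times n$ layer, enabling quantization of 175B-parameter models in hours rather than days.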
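The loop below is a simplified sketch of the GPTQ idea in NumPy, not the reference implementation. It assumes a symmetric per-row grid and a small diagonal damping term for numerical stability, and it omits the 128-column lazy batching, which changes memory traffic rather than the result. `gptq_sketch`, `bits`, and `damp` are names chosen for this example.

```python
import numpy as np

def gptq_sketch(W, X, bits=4, damp=0.01):
    """Quantize W (d_row x d_col) column by column with error compensation."""
    W = W.astype(np.float64).copy()
    d_col = W.shape[1]
    H = 2.0 * X @ X.T                                  # shared by every row
    H += damp * np.mean(np.diag(H)) * np.eye(d_col)    # damping (assumed value)
    # Upper Cholesky factor U of H^-1, so that H^-1 = U.T @ U.
    U = np.linalg.cholesky(np.linalg.inv(H)).T
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax  # per-row step size
    Q = np.zeros_like(W)
    for i in range(d_col):
        col = W[:, i:i + 1]
        q = np.clip(np.round(col / scale), -qmax - 1, qmax) * scale
        Q[:, i:i + 1] = q
        err = (col - q) / U[i, i]
        # Push the error onto the not-yet-quantized columns of every row.
        W[:, i + 1:] -= err @ U[i:i + 1, i + 1:]
    return Q, scale

rng = np.random.default_rng(0)
W0 = rng.normal(size=(16, 32))
X = rng.normal(size=(32, 256))                         # toy calibration inputs
Q, scale = gptq_sketch(W0, X)

# Round-to-nearest baseline on the same grid, without compensation.
rtn = np.clip(np.round(W0 / scale), -8, 7) * scale
print("GPTQ-style error:", np.linalg.norm(W0 @ X - Q @ X) ** 2)
print("RTN error:       ", np.linalg.norm(W0 @ X - rtn @ X) ** 2)
```

The first column is quantized before any compensation has been applied, so it matches plain round-to-nearest; every later column is rounded after absorbing the errors of the columns before it.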

Calibration data: The Hessian estimate $H = 2XX^T$ requires a calibration dataset $X$. GPTQ uses 128 random 2048-token segments from the C4 dataset. Different calibration data can meaningfully affect quantization quality: domain-matched calibration (e.g., code data for a coding model) often improves downstream task performance by 5–10%.
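In practice the calibration set is usually streamed through the layer in batches rather than held as one matrix. A minimal sketch of accumulating the Hessian estimate that way, assuming each batch arrives as a `(features, tokens)` array (the helper name `accumulate_hessian` is made up for this example):

```python
import numpy as np

def accumulate_hessian(batches):
    """Sum 2 * X_b @ X_b.T over calibration batches of shape (features, tokens)."""
    H = None
    for Xb in batches:
        contrib = 2.0 * Xb @ Xb.T
        H = contrib if H is None else H + contrib
    return H

rng = np.random.default_rng(0)
batches = [rng.normal(size=(8, 20)) for _ in range(4)]  # toy calibration batches
H_stream = accumulate_hessian(batches)

# Same quantity computed from all calibration tokens at once.
X_all = np.hstack(batches)
H_full = 2.0 * X_all @ X_all.T
print(np.allclose(H_stream, H_full))
```

Dividing the sum by the number of samples only rescales $H$; the scale cancels in the OBQ update, so either convention gives the same compensation.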

SmoothQuant: Migrating Quantization Difficulty

LLM.int8() handles activation outliers by mixed-precision computation. SmoothQuant (Xiao et al., 2022) avoids this by mathematically migrating quantization difficulty from activations to weights.

Key insight: Activation outliers are persistent. The same few channels consistently have large values across tokens. With a single per-tensor scale, those outlier channels dominate the quantization range, so the remaining channels are left with very few effective bits. Weights, by contrast, have a comparatively flat distribution and quantize easily.

Apply a per-channel scaling factor $s_j$ to “smooth” activations: $$Y = (X \cdot \text{diag}(s)^{-1}) \cdot (\text{diag}(s) \cdot W)$$

The activation is divided by $s$ (becomes smaller, easier to quantize). The weight is multiplied by $s$ (absorbed into weight, easy because weights are quantized offline).

Choose $s_j = \max(|X_j|)^\alpha / \max(|W_j|)^{1-\alpha}$, where $X_j$ is the $j$-th activation channel, $W_j$ is the matching row of weights, and $\alpha \in [0, 1]$ controls the migration degree. $\alpha = 1$ moves all of the difficulty onto the weights. $\alpha = 0$ moves it the other way, onto the activations. $\alpha = 0.5$ splits it evenly.
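The smoothing transform is easy to check numerically. The sketch below assumes activations of shape `(tokens, channels)` and weights of shape `(channels, out_features)`; the helper name `smooth_scales`, the `eps` guard, and the injected outlier channel are illustrative choices, not part of SmoothQuant's published code.

```python
import numpy as np

def smooth_scales(X, W, alpha=0.5, eps=1e-8):
    """Per-channel factors s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    act_max = np.maximum(np.abs(X).max(axis=0), eps)  # over tokens, per channel
    w_max = np.maximum(np.abs(W).max(axis=1), eps)    # over outputs, per channel
    return act_max ** alpha / w_max ** (1.0 - alpha)

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 16))
X[:, 3] *= 50.0                    # one persistent outlier channel
W = rng.normal(size=(16, 8))

s = smooth_scales(X, W, alpha=0.5)
X_s = X / s                        # X @ diag(s)^-1
W_s = W * s[:, None]               # diag(s) @ W

print("output unchanged:", np.allclose(X @ W, X_s @ W_s))
print("activation max before/after:", np.abs(X).max(), np.abs(X_s).max())
```

With $\alpha = 0.5$ the per-channel maxima of the smoothed activations and the smoothed weights come out equal, which is the sense in which the difficulty is split evenly.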

With $\alpha = 0.5$, SmoothQuant enables W8A8 (weights and activations both INT8) for OPT-175B with <1% accuracy degradation. This is hardware-friendly: modern accelerators have fast INT8 matrix multiply (A100 peak dense tensor-core rates: 624 TOPS INT8 vs. 312 TFLOPS FP16), giving up to 2x peak throughput.

1-Bit Quantization: BitNet

BitNet (Wang et al., 2023) pushed to the extreme: 1-bit weights (values of -1 or +1 only). The motivation: binary operations are dramatically faster than INT8 operations on specialized hardware.

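Training uses the straight-through estimator (STE) with the sign function as the binarization step. The forward pass uses the binarized weight, while the backward pass treats sign as the identity so that gradients reach the latent full-precision weights: $$w_b = \text{sign}(w) = \begin{cases} +1 & \text{if } w \geq 0 \\ -1 & \text{if } w < 0 \end{cases}$$

A single FP16 scale factor $\alpha$ per weight matrix preserves the magnitude: $W \approx \alpha W_b$.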

BitNet b1.58 (Ma et al., 2024) extended this to ternary weights: $\{-1, 0, +1\}$. Three states carry $\log_2 3 \approx 1.58$ bits per weight, hence the name. The 0 lets the model switch individual connections off, which acts like learned sparsity, and this extra state dramatically improves quality over strict binary.
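A minimal NumPy sketch of the two weight quantizers follows. It assumes the scale is the mean absolute weight, which is the least-squares optimal $\alpha$ for sign-binarized weights, and it assumes the ternary case rounds $W / \gamma$ and clips to $[-1, 1]$. The function names are invented for this example, and the STE is shown only as the identity it applies in the backward pass, since NumPy has no autograd.

```python
import numpy as np

def binarize(W):
    """1-bit weights: W is approximated by alpha * sign(W)."""
    alpha = np.abs(W).mean()                 # assumed scale: mean |w|
    W_b = np.where(W >= 0, 1.0, -1.0)        # sign with sign(0) = +1
    return alpha, W_b

def ternarize(W, eps=1e-8):
    """Ternary weights in {-1, 0, +1} with an assumed mean-|w| scale."""
    gamma = np.abs(W).mean() + eps
    W_t = np.clip(np.round(W / gamma), -1, 1)
    return gamma, W_t

def ste_backward(grad_wrt_quantized):
    """Straight-through estimator: pass the gradient to the latent weights as-is."""
    return grad_wrt_quantized

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(64, 64))
alpha, W_b = binarize(W)
gamma, W_t = ternarize(W)

def recon_error(a):
    return np.sum((W - a * W_b) ** 2)

print("bits per ternary weight:", np.log2(3))
print("fraction of zeros in ternary weights:", np.mean(W_t == 0))
```

Small weights round to 0 in the ternary case, which is where the sparsity-like behaviour comes from.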

Results: In the paper's evaluation, BitNet b1.58 at 3B parameters matches a full-precision (FP16) LLaMA-architecture baseline of the same size in perplexity and end-task accuracy, while:

  • Using 7x less memory
  • Running 5–10x faster on specialized 1-bit hardware
  • Consuming 73% less energy

BitNet requires training from scratch, so it cannot be applied to an existing pretrained model as a post-training step. On the software side, Microsoft released bitnet.cpp, an open-source inference framework for 1-bit LLMs, in 2024.

KV Cache Quantization for Long-Context LLMs

During LLM inference, attention key-value pairs are cached to avoid recomputation. At long context lengths such as 128k tokens, a single request's KV cache becomes very large:

$$\text{KV size} = 2 \times L \times H \times d_h \times T \times \text{dtype\_bytes}$$

Here $L$ is the number of layers, $H$ the number of KV heads, $d_h$ the head dimension, $T$ the number of cached tokens, and the leading 2 accounts for storing both keys and values.

For Llama-3-70B (80 layers, 8 KV heads, 128 head dim), at 128k tokens, FP16: $$2 \times 80 \times 8 \times 128 \times 128000 \times 2 \approx 42 \text{ GB}$$
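The arithmetic is easy to reproduce, and the same helper shows what lower-precision storage would save. The function name `kv_cache_bytes` is made up for this sketch, and the INT4 figure ignores the scale metadata a real quantized cache also stores.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, dtype_bytes):
    """KV cache size for one request; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * tokens * dtype_bytes

# Llama-3-70B figures from the text: 80 layers, 8 KV heads, head dim 128.
fp16_bytes = kv_cache_bytes(80, 8, 128, 128_000, 2)
int4_bytes = kv_cache_bytes(80, 8, 128, 128_000, 0.5)

print(f"FP16: {fp16_bytes / 1e9:.1f} GB")
print(f"INT4: {int4_bytes / 1e9:.1f} GB (before scale overhead)")
```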

Per request! This limits throughput dramatically.

KV cache quantization (KVQuant, 2024; KIVI, 2024): Quantize cached K/V tensors to INT4 or even INT2.

Challenges specific to KV quantization:

  • Keys and values are computed on the fly, not pre-computed
  • Distribution changes across token positions (early vs. late context)
  • Attention scores are sensitive to key precision (dot products amplify quantization error)

Per-vector quantization: Instead of per-channel scales (as for weights), use per-vector scales, so each key/value vector gets its own scale factor. Overhead: one FP16 scale per 128-value vector adds 16 bits to the 512 bits of INT4 data, about 3% extra storage (about 6% at INT2). Quality improvement: substantial vs. per-tensor.
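A minimal sketch of per-vector quantization, assuming symmetric absmax scaling and one FP16 scale per 128-element vector. The function names are illustrative, and the 4-bit codes are held unpacked in `int8` for readability rather than packed two per byte.

```python
import numpy as np

def quantize_per_vector(x, bits=4):
    """Symmetric quantization with one FP16 scale per vector (last axis)."""
    qmax = 2 ** (bits - 1) - 1
    scale = (np.abs(x).max(axis=-1, keepdims=True) / qmax).astype(np.float16)
    scale = np.where(scale == 0, np.float16(1.0), scale)  # guard all-zero vectors
    codes = np.round(x / scale.astype(np.float32))
    return np.clip(codes, -qmax, qmax).astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale.astype(np.float32)

rng = np.random.default_rng(0)
keys = rng.normal(size=(16, 128)).astype(np.float32)   # 16 cached key vectors
q, scale = quantize_per_vector(keys, bits=4)
keys_hat = dequantize(q, scale)

# Storage overhead of the scales: 16 bits per 128 four-bit values.
overhead = 16 / (128 * 4)
print("max abs error:", np.abs(keys - keys_hat).max())
print("scale overhead:", overhead)
```

Because each vector's own maximum sets its scale, the rounding error of every element is bounded by half of that vector's step size, regardless of how large other vectors in the cache are.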

KVQuant uses non-uniform quantization for keys specifically: the non-uniform, quantile-based grid is calibrated offline and applied at inference. This reduces the KV cache by about 4x with a perplexity increase of less than 0.1.

Quantization Sensitivity Analysis

Not all layers are equally sensitive to quantization. Empirically:

Most sensitive: First and last layers (embedding, unembedding). These are typically kept in FP16 even in aggressively quantized models.

Moderately sensitive: Attention projection weights (Q, K, V, O). Key weights slightly more sensitive than value weights.

Least sensitive: MLP layers. The exceptions are layers with strongly activated outlier features; MoE output layers, for example, are sensitive.

Layer position: Earlier transformer layers tend to be more sensitive than later ones for the same bit depth.

Mixed-precision quantization exploits this: the Q2_K preset in llama.cpp's naming scheme stores most tensors at 2 bits, organized in 256-weight superblocks with 4-bit block scales and minimums, while keeping a few of the more sensitive tensors at higher precision (the exact mix has varied across llama.cpp versions). The result is near-3-bit quality at close to 2-bit memory cost.
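A quick way to see how a mixed assignment sets the memory budget is to compute the parameter-weighted average bit width. The layer sizes and the 4-bit/2-bit split below are hypothetical, chosen only to illustrate the calculation, and they ignore the per-block scale metadata that real formats add.

```python
def average_bits(tensors):
    """tensors: list of (num_params, bits_per_weight) pairs."""
    total_params = sum(n for n, _ in tensors)
    total_bits = sum(n * b for n, b in tensors)
    return total_bits / total_params

# Hypothetical transformer block: hidden size 4096, MLP width 11008.
hidden, ffn = 4096, 11008
attention = (4 * hidden * hidden, 4.0)   # Q, K, V, O projections at 4 bits
mlp = (3 * hidden * ffn, 2.0)            # gate, up, down projections at 2 bits

avg = average_bits([attention, mlp])
print(f"average bits per weight: {avg:.2f}")
```

Because the MLP holds about two thirds of the parameters in this layout, keeping attention at 4 bits raises the average by well under one bit.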

One thing to remember: Quantization is not just a compression technique — it’s increasingly a design choice made at training time (QAT, BitNet) or architecture time (activation-friendly architectures), suggesting that future models will be designed with quantization in mind from the start rather than quantized as an afterthought.
