Model Quantization — Deep Dive
Second-Order Optimization for Quantization
Optimal Brain Quantization (OBQ) frames quantization as an optimization problem: choose quantized weights $\hat{W}$ to minimize the increase in output loss:
$$\min_{\hat{W}} \|WX - \hat{W}X\|_F^2$$
Where $X$ is representative calibration data. This is equivalent to minimizing the second-order expansion of the loss function change.
The exact solution uses the Hessian of this layer-wise squared error with respect to the weights of one output row (it is the same for every row):
$$H_W = 2X X^T$$
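OBQ quantizes one weight at a time, updating the remaining weights to compensate. Let $F$ denote the set of weights not yet quantized and $H_F$ the Hessian restricted to them. For the $q$-th weight quantized with error $\epsilon_q = w_q - \hat{w}_q$, the compensating update to the weights in $F$ is:
$$\delta W_F = -\frac{\epsilon_q}{[H_F^{-1}]_{qq}} (H_F^{-1})_{:,q}$$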
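This is exact but expensive: OBQ costs $O(d_{\text{row}} \cdot d_{\text{col}}^3)$, which is $O(n^4)$ for an $n \times n$ weight matrix. GPTQ (2022) made this practical by:
- Column-wise processing: Quantize all rows in the same column order, so every row shares one inverse Hessian; columns are handled in blocks of 128
- Cholesky reformulation: Use a Cholesky decomposition to obtain the needed $(H_F^{-1})_{:,q}$ information efficiently and with better numerical stability
- Lazy batch updates: Accumulate weight updates within a block and apply them to the remaining columns once per block rather than individually
The resulting algorithm runs in roughly $O(n^3)$ time for an $n \times n$ matrix, enabling quantization of 175B-parameter models in hours rather than days.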
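The single-weight update can be sketched in NumPy for one weight row. This is a minimal illustration under stated assumptions, not GPTQ itself: the helper name `obq_step`, the `quantize` callback, and the toy shapes are all hypothetical, and the blocking and Cholesky optimizations are left out.

```python
import numpy as np

def obq_step(w, H_inv, q, quantize):
    """Quantize weight q of one row and compensate the remaining weights.

    w        : 1-D array, current (partly quantized) weight row
    H_inv    : inverse Hessian restricted to the still-free weights
    quantize : hypothetical callback mapping a float to its grid value
    """
    w_hat_q = quantize(w[q])
    eps = w[q] - w_hat_q
    # delta = -(eps / [H^-1]_qq) * (H^-1)_{:,q}
    delta = -(eps / H_inv[q, q]) * H_inv[:, q]
    w_new = w + delta
    w_new[q] = w_hat_q  # equals w[q] - eps analytically; set it to avoid drift
    # Eliminate q from the inverse Hessian so later steps leave it untouched.
    H_inv_new = H_inv - np.outer(H_inv[:, q], H_inv[q, :]) / H_inv[q, q]
    return w_new, H_inv_new

# Toy example: 8 input features, 64 calibration samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 64))
w = rng.normal(size=8)
H_inv = np.linalg.inv(2 * X @ X.T)
grid = lambda v: np.round(v * 4) / 4  # uniform grid with step 0.25

w_comp, H_inv_next = obq_step(w, H_inv, 0, grid)
w_naive = w.copy()
w_naive[0] = grid(w[0])  # same rounding, but no compensation

def output_error(w_q):
    return float(np.sum((w @ X - w_q @ X) ** 2))

print(output_error(w_comp), "<=", output_error(w_naive))
```

Because the update is the minimizer of the quadratic error subject to fixing weight $q$, the compensated row can never do worse than rounding that weight alone.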
Calibration data: The Hessian estimate $H = 2XX^T$ requires a calibration dataset $X$. GPTQ uses 128 random text sequences from the C4 dataset. Different calibration data can meaningfully affect quantization quality: domain-matched calibration (e.g., code data for a coding model) often improves downstream task performance by 5–10%.
SmoothQuant: Migrating Quantization Difficulty
LLM.int8() handles activation outliers by mixed-precision computation. SmoothQuant (Xiao et al., 2022) avoids this by mathematically migrating quantization difficulty from activations to weights.
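Key insight: Activation outliers are persistent; the same channels consistently have large values. Under per-tensor activation quantization those outlier channels set the scale, so the remaining small-magnitude channels are left with very few effective bits, while weights have a much flatter distribution and quantize easily. Because the outlier channels are fixed, a static per-channel rescaling can shift part of the difficulty from activations onto weights.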
Apply a per-channel scaling factor $s_j$ to “smooth” activations: $$Y = (X \cdot \text{diag}(s)^{-1}) \cdot (\text{diag}(s) \cdot W)$$
The activation is divided by $s$ (becomes smaller, easier to quantize). The weight is multiplied by $s$ (absorbed into weight, easy because weights are quantized offline).
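Choose $s_j = \max(|X_j|)^\alpha / \max(|W_j|)^{1-\alpha}$ with $\alpha \in [0, 1]$ controlling the migration degree. $\alpha = 1$ gives $s_j = \max(|X_j|)$ and moves all of the activation difficulty onto the weights. $\alpha = 0$ gives $s_j = 1/\max(|W_j|)$ and moves difficulty the other way, from weights onto activations. $\alpha = 0.5$ balances the two.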
With $\alpha = 0.5$, SmoothQuant enables W8A8 (weights and activations both INT8) for OPT-175B with <1% accuracy degradation. This is hardware-friendly: modern accelerators have fast INT8 matrix multiply (A100: 624 TOPS INT8 vs. 312 TFLOPS FP16 on dense tensor cores), giving up to 2x peak throughput.
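The smoothing transform is small enough to sketch directly. The example below is an illustrative NumPy version under stated assumptions: the function name `smooth` and the toy matrices are hypothetical, $X$ is laid out as (tokens, input channels), and $W$ as (input channels, output channels).

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    """Return smoothed activations, rescaled weights and the scales s.

    X : (tokens, in_channels) activations
    W : (in_channels, out_channels) weights
    Assumes every channel has a nonzero max in both X and W; a real
    implementation would clamp these maxima to avoid division by zero.
    """
    act_max = np.abs(X).max(axis=0)   # max |X_j| over tokens
    w_max = np.abs(W).max(axis=1)     # max |W_j| over output channels
    s = act_max ** alpha / w_max ** (1 - alpha)
    return X / s, W * s[:, None], s

X = np.array([[1.0, -100.0, 0.5],
              [-2.0, 80.0, 1.0]])    # channel 1 is an outlier channel
W = np.array([[0.5, -1.0],
              [0.2, 0.1],
              [1.0, -0.5]])

X_s, W_s, s = smooth(X, W, alpha=0.5)
print(np.abs(X).max(), "->", np.abs(X_s).max())  # activation range shrinks
print(np.allclose(X_s @ W_s, X @ W))             # product is unchanged
```

At $\alpha = 0.5$ the per-channel maxima of the smoothed activations and the rescaled weights coincide, which is one way to read "balancing the difficulty".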
1-Bit Quantization: BitNet
BitNet (Wang et al., 2023) pushed to the extreme: 1-bit weights (values of -1 or +1 only). The motivation: binary operations are dramatically faster than INT8 operations on specialized hardware.
Training uses the straight-through estimator (STE) with a binary step function in the forward pass: $$w_b = \text{sign}(w) = \begin{cases} +1 & \text{if } w \geq 0 \\ -1 & \text{if } w < 0 \end{cases}$$
A single FP16 scale factor $\alpha$ per weight matrix preserves magnitude: $W \approx \alpha W_b$.
BitNet b1.58 (Ma et al., 2024) extended this to ternary weights: $\{-1, 0, +1\}$. Three states carry $\log_2 3 \approx 1.58$ bits per weight, hence the name. The 0 lets the model switch individual connections off, a built-in form of sparsity, and this extra state dramatically improves quality over strict binary.
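The forward-pass weight transforms can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the function names are hypothetical, the scale is taken to be the mean absolute weight, and the ternary rule (divide by the mean absolute value, round, clip to $[-1, 1]$) follows the "absmean" scheme as commonly described for b1.58. The STE backward pass is not shown.

```python
import numpy as np

def binarize(W):
    """1-bit weights: sign(W) plus one scale per matrix (assumed mean |W|)."""
    alpha = np.abs(W).mean()
    W_b = np.where(W >= 0, 1.0, -1.0)
    return alpha, W_b

def ternarize(W, eps=1e-8):
    """Ternary weights in {-1, 0, +1} using an absmean-style scale."""
    gamma = np.abs(W).mean()
    W_t = np.clip(np.round(W / (gamma + eps)), -1, 1)
    return gamma, W_t

W = np.array([[0.9, -0.05, -1.2],
              [0.0, 0.4, -0.6]])

alpha, W_b = binarize(W)
gamma, W_t = ternarize(W)
print(alpha * W_b)   # binary approximation of W
print(gamma * W_t)   # ternary approximation; small weights become 0
```

In the ternary case the small entries collapse to 0, which is the "switching off" effect that strict binary weights cannot express.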
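Results: Ma et al. report that at 3B parameters BitNet b1.58 matches an FP16 LLaMA-style baseline of the same size in perplexity and end-task accuracy, while:
- Using about 3.5x less GPU memory (the gap widens with model size, to roughly 7x at 70B)
- Decoding about 2.7x faster (about 4x at 70B)
- Spending far less energy on matrix-multiply arithmetic, since ternary weights replace most multiplications with additions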
BitNet requires training from scratch, so it cannot be applied to existing pretrained checkpoints. Dedicated runtime support is still emerging: Microsoft released bitnet.cpp, an inference framework for 1-bit models on CPUs, in 2024.
KV Cache Quantization for Long-Context LLMs
During LLM inference, attention key-value pairs are cached to avoid recomputation. For a model with context length 128k tokens, a single request’s KV cache can reach:
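$$\text{KV size} = 2 \times L \times H \times d_h \times T \times \text{dtype\_bytes}$$
Here $L$ is the number of layers, $H$ the number of KV heads, $d_h$ the head dimension, $T$ the number of cached tokens, and the leading 2 counts keys and values.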
For Llama-3-70B (80 layers, 8 KV heads, 128 head dim), at 128k tokens, FP16: $$2 \times 80 \times 8 \times 128 \times 128000 \times 2 \approx 4.2 \times 10^{10} \text{ bytes} \approx 42 \text{ GB}$$
Per request! This limits throughput dramatically.
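The same arithmetic as a small helper, which also shows how the footprint scales with bit width. The function name `kv_cache_bytes` is illustrative, and the figures ignore any per-vector scale metadata.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, dtype_bits=16):
    """Bytes needed to cache keys and values for one request."""
    values = 2 * layers * kv_heads * head_dim * tokens  # 2 = keys + values
    return values * dtype_bits // 8

# Llama-3-70B-style shape from the text: 80 layers, 8 KV heads, head dim 128.
for bits in (16, 8, 4, 2):
    size = kv_cache_bytes(80, 8, 128, 128_000, dtype_bits=bits)
    print(f"{bits:>2}-bit: {size / 1e9:.1f} GB")
```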
KV cache quantization (KVQuant, 2024; KIVI, 2024): Quantize cached K/V tensors to INT4 or even INT2.
Challenges specific to KV quantization:
- Keys and values are computed on the fly, not pre-computed
- Distribution changes across token positions (early vs. late context)
- Attention scores are sensitive to key precision (dot products amplify quantization error)
Per-vector quantization: Instead of per-channel scales (as for weights), use per-vector scales, so each key/value vector gets its own scale factor. Overhead: one FP16 scale per 128-value vector adds 16 bits to 512 bits of INT4 payload, about 3.1% extra storage (6.25% at INT2). Quality improvement: substantial vs. per-tensor.
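A minimal sketch of per-vector symmetric quantization, assuming one scale per 128-element vector, an INT4 payload, and a 16-bit scale in storage. The function names are hypothetical, and the scale is kept in full precision here for clarity.

```python
import numpy as np

def quantize_per_vector(K, bits=4):
    """Symmetric quantization with one scale per vector (last axis)."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for INT4
    scale = np.abs(K).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)         # guard all-zero vectors
    q = np.clip(np.round(K / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def scale_overhead(vector_len=128, bits=4, scale_bits=16):
    """Extra storage for the scales, as a fraction of the payload."""
    return scale_bits / (vector_len * bits)

rng = np.random.default_rng(0)
K = rng.normal(size=(16, 128))      # 16 cached key vectors, head_dim 128
q, scale = quantize_per_vector(K)
K_hat = q * scale                   # dequantize
print(np.abs(K - K_hat).max(), scale_overhead())
```

Each element's reconstruction error is bounded by half of its own vector's scale, so one large vector no longer inflates the error of all the others as it would with a single per-tensor scale.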
KVQuant uses non-uniform quantization: it calibrates a non-uniform grid offline and applies it at inference. Going from FP16 to 4 bits shrinks the KV cache by 4x, and the authors report less than 0.1 perplexity degradation even at 3 bits.
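One simple way to build a non-uniform grid is to place the levels at quantiles of calibration data, so that dense regions of the distribution get more levels. The sketch below is a plain quantile-grid stand-in with hypothetical function names; it is not KVQuant's actual calibration procedure.

```python
import numpy as np

def fit_quantile_grid(calib, bits=3):
    """Offline step: place 2**bits levels at evenly spaced quantiles."""
    n_levels = 2 ** bits
    probs = (np.arange(n_levels) + 0.5) / n_levels
    return np.quantile(calib, probs)

def quantize_to_grid(x, grid):
    """Inference step: map each value to the index of its nearest level."""
    idx = np.abs(x[..., None] - grid).argmin(axis=-1)
    return idx.astype(np.uint8), grid[idx]

rng = np.random.default_rng(0)
calib = rng.standard_normal(4096)          # stand-in for calibration keys
grid = fit_quantile_grid(calib, bits=3)

x = rng.standard_normal(256)               # values seen at inference
codes, x_hat = quantize_to_grid(x, grid)
print(grid)
print(np.abs(x - x_hat).mean())
```

Only the small integer codes are cached per token; the grid itself is shared and stored once.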
Quantization Sensitivity Analysis
Not all layers are equally sensitive to quantization. Empirically:
Most sensitive: First and last layers (embedding, unembedding). These are typically kept in FP16 even in aggressively quantized models.
Moderately sensitive: Attention projection weights (Q, K, V, O). Key weights slightly more sensitive than value weights.
Least sensitive: MLP layers, with exceptions such as MoE output layers, which are sensitive.
Layer position: Earlier transformer layers tend to be more sensitive than later ones for the same bit depth.
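Mixed-precision quantization exploits this: the Q2_K preset in llama.cpp’s naming scheme stores most tensors as 2-bit k-quants (256-weight super-blocks with per-block scales) but keeps a few sensitive tensors, such as the attention value projections, at 4 bits, so the file averages between 2 and 3 bits per weight while recovering much of the quality lost by uniform 2-bit quantization.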
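The effective bit width of such a mix is just a parameter-weighted average. The numbers below are hypothetical (a block where the 2-bit tensors hold twice as many weights as the 4-bit ones), and scale metadata is ignored.

```python
def average_bits(tensors):
    """tensors: iterable of (parameter_count, bits_per_weight) pairs."""
    total_params = sum(n for n, _ in tensors)
    total_bits = sum(n * b for n, b in tensors)
    return total_bits / total_params

# Hypothetical split, not taken from any real checkpoint.
mix = [
    (4_000_000, 4),   # sensitive tensors kept at 4 bits
    (8_000_000, 2),   # remaining tensors at 2 bits
]
print(f"{average_bits(mix):.2f} bits per weight")
```

Protecting a third of the weights at 4 bits raises the average by well under one bit, which is why selective precision is usually cheaper than moving the whole model up a bit width.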
One thing to remember: Quantization is not just a compression technique — it’s increasingly a design choice made at training time (QAT, BitNet) or architecture time (activation-friendly architectures), suggesting that future models will be designed with quantization in mind from the start rather than quantized as an afterthought.