Model Quantization — Core Concepts

Why Quantization Matters More Than Ever

A GPT-4 level model has hundreds of billions of parameters. At 4 bytes per parameter, storing them in FP32 takes hundreds of gigabytes or more, far beyond any single GPU’s memory. Even openly available models are increasingly large: Meta’s Llama 3.1 405B needs about 810 GB for its FP16 weights alone, which in practice means around 16 A100 80GB GPUs for FP16 inference.

Quantization makes large models accessible on consumer hardware and dramatically reduces inference costs for production deployments. The math is straightforward: at 4 bits per weight, INT4 quantization gives an 8x memory reduction vs. FP32 (32 bits) and a 4x reduction vs. FP16 (16 bits).
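The arithmetic behind those ratios can be checked in a few lines. This is a minimal sketch that counts weight storage only; the helper name `weight_memory_gb` is made up for illustration, and real deployments also need memory for quantization scales, activations, and the KV cache.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(num_params, dtype):
    """Weight storage in GB (1 GB = 1e9 bytes); scales, activations and KV cache are ignored."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


num_params = 405e9  # Llama 3.1 405B, the model mentioned above
for dtype in BYTES_PER_PARAM:
    gb = weight_memory_gb(num_params, dtype)
    ratio = weight_memory_gb(num_params, "FP32") / gb
    print(f"{dtype}: {gb:7.1f} GB ({ratio:.0f}x smaller than FP32)")
```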

At the same time, LLMs are substantially harder to quantize than CNNs. LLMs have activation outliers — extreme values in specific dimensions that cause standard quantization to catastrophically degrade quality. Understanding why requires looking at the math.

Quantization Fundamentals

Uniform quantization: Map a floating-point range $[a, b]$ to integers $[0, 2^k - 1]$ for k-bit quantization:

$$Q(x) = \text{round}\left(\frac{x - a}{b - a} \times (2^k - 1)\right)$$

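With scale $s = (b - a) / (2^k - 1)$, this is $Q(x) = \text{round}((x - a) / s)$. In practice the offset $a$ is folded into an integer zero point $z = -\text{round}(a / s)$, so that $Q(x) = \text{round}(x / s) + z$, and the dequantization is:

$$\hat{x} = s \times (Q(x) - z)$$

For values inside $[a, b]$, quantization error is bounded: $|x - \hat{x}| \leq s/2$. Values outside the range are clipped to the nearest endpoint and can incur larger error. The scale is the step size; smaller ranges give smaller steps and less error.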
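A small NumPy sketch makes the scale, zero point, and error bound concrete. It assumes the common form in which the offset is folded into the integer zero point, i.e. `round(x / s) + z`; the function names `quantize` and `dequantize` are illustrative and do not come from any particular library.

```python
import numpy as np


def quantize(x, a, b, k):
    """Affine k-bit quantization of x over the range [a, b]."""
    levels = 2**k - 1
    s = (b - a) / levels                 # scale (step size)
    z = -int(np.round(a / s))            # integer zero point
    q = np.clip(np.round(x / s) + z, 0, levels).astype(np.int64)
    return q, s, z


def dequantize(q, s, z):
    """Map integer codes back to approximate real values."""
    return s * (q - z)


rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 3.0, size=10_000)  # all values lie inside [a, b]
q, s, z = quantize(x, a=-1.0, b=3.0, k=8)
x_hat = dequantize(q, s, z)
max_err = np.max(np.abs(x - x_hat))
print(f"scale={s:.5f} zero_point={z} max_err={max_err:.5f} bound={s / 2:.5f}")
```

For in-range inputs the maximum error stays at or below half a step; widening `[a, b]` or lowering `k` increases the step and therefore the error.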

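Symmetric quantization: Zero point $z = 0$; the range is $[-m, m]$ with $m = \max |x|$, mapped to signed integers. Simpler, but for asymmetric distributions part of the integer range goes unused, so the effective resolution is lower.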

Asymmetric quantization: Range is $[a, b]$ with $z \neq 0$. Better for ReLU activations (range $[0, \text{max}]$).
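To see why the asymmetric variant suits ReLU outputs, the following sketch quantizes the same non-negative data both ways and compares the mean squared error. The two helper functions are illustrative, and the synthetic data stands in for real activations.

```python
import numpy as np


def asymmetric_roundtrip(x, k=8):
    """Quantize to unsigned k-bit integers over [min(x), max(x)], then dequantize."""
    a, b = x.min(), x.max()
    s = (b - a) / (2**k - 1)
    z = -np.round(a / s)
    q = np.clip(np.round(x / s) + z, 0, 2**k - 1)
    return s * (q - z)


def symmetric_roundtrip(x, k=8):
    """Quantize to signed k-bit integers over [-max|x|, max|x|] with z = 0, then dequantize."""
    qmax = 2 ** (k - 1) - 1
    s = np.abs(x).max() / qmax
    q = np.clip(np.round(x / s), -qmax, qmax)
    return s * q


rng = np.random.default_rng(0)
relu_out = np.maximum(rng.normal(size=50_000), 0.0)  # synthetic ReLU activations

mse_asym = np.mean((relu_out - asymmetric_roundtrip(relu_out)) ** 2)
mse_sym = np.mean((relu_out - symmetric_roundtrip(relu_out)) ** 2)
print(f"asymmetric MSE={mse_asym:.3e}  symmetric MSE={mse_sym:.3e}")
```

The symmetric version spends half of its integer codes on negative values that never occur, so its step is roughly twice as large.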

Per-channel vs. per-tensor: Per-tensor uses one scale for the entire weight matrix (simpler, less accurate). Per-channel uses one scale per output channel (more accurate, higher overhead).
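The accuracy gap between the two granularities shows up as soon as output channels differ in magnitude. The sketch below uses a synthetic weight matrix whose rows span three orders of magnitude; the helper `quantize_symmetric` and the chosen magnitudes are assumptions for illustration.

```python
import numpy as np


def quantize_symmetric(w, axis=None, k=8):
    """Symmetric k-bit round trip; axis=None uses one scale, axis=1 one scale per row."""
    qmax = 2 ** (k - 1) - 1
    max_abs = np.max(np.abs(w), axis=axis, keepdims=axis is not None)
    s = max_abs / qmax
    return s * np.clip(np.round(w / s), -qmax, qmax)


rng = np.random.default_rng(0)
# Four output channels (rows) with very different magnitudes.
w = rng.normal(size=(4, 256)) * np.array([[0.01], [0.1], [1.0], [10.0]])

err_tensor = np.mean((w - quantize_symmetric(w, axis=None)) ** 2, axis=1)
err_channel = np.mean((w - quantize_symmetric(w, axis=1)) ** 2, axis=1)
for row, (et, ec) in enumerate(zip(err_tensor, err_channel)):
    print(f"row {row}: per-tensor MSE={et:.2e}  per-channel MSE={ec:.2e}")
```

With a single scale, the largest row dictates the step size and the small rows collapse toward zero; per-channel scales avoid that at the cost of storing one scale per row.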

Post-Training Quantization (PTQ)

The most practical approach: train the full-precision model normally, then quantize without further training.

Weight-only quantization: Only weights are quantized; activations remain in FP16/BF16. Gives a 2–4x memory reduction vs. FP16 weights (4–8x vs. FP32) at 8 and 4 bits respectively. Inference quality degrades minimally at 8-bit and measurably but acceptably at 4-bit.

Weight + activation quantization: Both weights and activations are in INT8. This enables INT8 matrix-multiply hardware (tensor cores), whose peak throughput is typically about 2x that of FP16 on supported hardware; realized end-to-end speedups depend on the model and kernels. LLM.int8() (Dettmers et al., 2022) made this practical for LLMs.

LLM.int8(): Handling Activation Outliers

LLMs develop “emergent outliers”: hidden-state dimensions with values 100x larger than typical. They emerge as models scale up and are largely absent from small models. Standard INT8 quantization of activations fails because of them: if the max value is 100 and typical values are ~0.5, the scale is $s = 100 / 127 \approx 0.79$, so typical values all quantize to 0 or ±1 and nearly all information is lost.
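The effect of a single outlier on the scale can be reproduced numerically. This sketch uses synthetic activations with typical magnitude around 0.5 and one injected value of 100, matching the example above; `absmax_int8_roundtrip` is an illustrative helper.

```python
import numpy as np


def absmax_int8_roundtrip(x):
    """Symmetric INT8 quantization with the scale set by the largest magnitude."""
    s = np.abs(x).max() / 127
    q = np.clip(np.round(x / s), -127, 127)
    return q * s, s


rng = np.random.default_rng(0)
clean = rng.normal(scale=0.5, size=4096)   # typical activations
with_outlier = clean.copy()
with_outlier[0] = 100.0                    # one emergent outlier


def typical_rel_error(x):
    """Relative error on the non-outlier entries after the INT8 round trip."""
    x_hat, s = absmax_int8_roundtrip(x)
    return np.linalg.norm(x[1:] - x_hat[1:]) / np.linalg.norm(x[1:]), s


err_clean, s_clean = typical_rel_error(clean)
err_outlier, s_outlier = typical_rel_error(with_outlier)
print(f"no outlier:   scale={s_clean:.4f}  relative error={err_clean:.3f}")
print(f"with outlier: scale={s_outlier:.4f}  relative error={err_outlier:.3f}")
```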

LLM.int8()’s solution: decompose the matrix multiplication. Identify outlier dimensions (typically < 0.1% of features) and handle them in FP16; quantize the remaining 99.9%+ of dimensions to INT8.

$$Y = X_{\text{outlier}} W_{\text{outlier}}^T + X_{\text{int8}} W_{\text{int8}}^T$$

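The outlier product is computed in FP16, and the INT8 product is dequantized to FP16 before the two are added. Net result: ~zero accuracy degradation, 2x memory reduction vs. FP16, and at most a modest speed improvement (the decomposition adds overhead).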
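The decomposition can be sketched in NumPy as follows. This is an illustration of the idea rather than the actual LLM.int8() kernels: float64 stands in for FP16, the integer matmul is emulated with int32 arrays, and the outlier threshold of 6.0 and the function names are assumptions made for this example.

```python
import numpy as np


def int8_matmul(x, w):
    """Approximate x @ w.T using absmax INT8 quantization (one scale per row of x and of w)."""
    sx = np.abs(x).max(axis=1, keepdims=True) / 127
    sw = np.abs(w).max(axis=1, keepdims=True) / 127
    xq = np.round(x / sx).astype(np.int32)
    wq = np.round(w / sw).astype(np.int32)
    return (xq @ wq.T) * sx * sw.T           # integer matmul, then dequantize


def mixed_matmul(x, w, threshold=6.0):
    """Route feature dimensions containing an outlier through the high-precision path."""
    outlier_mask = np.abs(x).max(axis=0) >= threshold
    y_outlier = x[:, outlier_mask] @ w[:, outlier_mask].T
    y_int8 = int8_matmul(x[:, ~outlier_mask], w[:, ~outlier_mask])
    return y_outlier + y_int8, outlier_mask


rng = np.random.default_rng(0)
x = rng.normal(scale=0.5, size=(8, 512))                  # activations
x[:, [7, 300]] = rng.uniform(40.0, 80.0, size=(8, 2))     # two outlier dimensions
w = rng.normal(size=(64, 512))                            # weights, one row per output feature

y_ref = x @ w.T
y_mixed, outlier_mask = mixed_matmul(x, w)


def rel_err(y):
    return np.linalg.norm(y - y_ref) / np.linalg.norm(y_ref)


err_plain = rel_err(int8_matmul(x, w))
err_mixed = rel_err(y_mixed)
print(f"outlier dims: {np.flatnonzero(outlier_mask).tolist()}")
print(f"plain INT8 error={err_plain:.4f}  mixed-precision error={err_mixed:.4f}")
```

Only the two outlier columns take the high-precision path here, yet removing them from the INT8 path shrinks the activation scales enough to cut the error substantially.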

GPTQ: Optimal Brain Quantization for LLMs

GPTQ (Frantar et al., 2022) builds on Optimal Brain Quantization (OBQ), the second-order weight quantization method from the Optimal Brain Compression (OBC) framework, and makes it efficient enough to apply to LLMs.

For each weight matrix $W$, quantize weights sequentially. When quantizing weight $w_q$, update remaining unquantized weights to compensate:

$$\delta W_F = -\frac{w_q - \text{quant}(w_q)}{[H_F^{-1}]_{qq}} \cdot (H_F^{-1})_{:,q}$$

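Here $F$ is the set of weights not yet quantized and $H_F$ is the Hessian of the layer’s output error restricted to those weights. The update compensates for each quantization error before moving to the next weight, so the errors are absorbed rather than accumulated.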

GPTQ enables 3–4 bit quantization of models like Llama-2-70B with <1% perplexity increase. Runtime: quantizing a 70B model takes ~4 GPU hours.
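The sequential quantize-and-compensate loop can be illustrated on a small layer. This is a simplified sketch in the spirit of the update rule above, not the GPTQ implementation: it processes one row at a time with an explicit inverse Hessian (GPTQ uses a Cholesky-based, blocked formulation), and the data, the damping constant, the 4-bit grid, and all function names are assumptions for illustration.

```python
import numpy as np


def quantize_to_grid(v, s, qmax):
    """Round to the nearest point of a symmetric grid with step s."""
    return s * np.clip(np.round(v / s), -qmax, qmax)


def compensate_row(w, H, s, qmax):
    """Quantize one weight row left to right, pushing each error onto the remaining weights."""
    w = w.astype(np.float64).copy()
    Hinv = np.linalg.inv(H)
    for q in range(w.size):
        w_quant = quantize_to_grid(w[q], s, qmax)
        err = (w[q] - w_quant) / Hinv[q, q]
        w[q + 1:] -= err * Hinv[q + 1:, q]      # the update from the formula above
        w[q] = w_quant
        # Remove index q so Hinv restricted to the remaining indices is the inverse of H_F.
        Hinv -= np.outer(Hinv[:, q], Hinv[q, :]) / Hinv[q, q]
    return w


rng = np.random.default_rng(0)
d, n, rows, qmax = 32, 256, 16, 7               # 4-bit symmetric grid
# Correlated calibration inputs: a few latent factors plus noise.
X = rng.normal(size=(d, 4)) @ rng.normal(size=(4, n)) + 0.1 * rng.normal(size=(d, n))
W = rng.normal(size=(rows, d))

H = 2.0 * X @ X.T
H += 0.01 * np.mean(np.diag(H)) * np.eye(d)     # damping for numerical stability

scales = np.abs(W).max(axis=1, keepdims=True) / qmax
W_rtn = quantize_to_grid(W, scales, qmax)       # plain round-to-nearest baseline
W_comp = np.stack([compensate_row(W[i], H, scales[i, 0], qmax) for i in range(rows)])

err_rtn = np.sum(((W - W_rtn) @ X) ** 2)
err_comp = np.sum(((W - W_comp) @ X) ** 2)
print(f"round-to-nearest output error={err_rtn:.1f}  with compensation={err_comp:.1f}")
```

Both results use the same 4-bit grid; the difference comes only from letting later weights absorb the rounding error of earlier ones.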

AWQ: Activation-Aware Weight Quantization

AWQ (Lin et al., 2023) observes that weight importance is not uniform — some weights are far more important than others, determined by the input activation scale.

For a layer with activation $x$ and weight $w$: the product $|x_j| \times |w_j|$ determines the contribution of channel $j$. High-activation channels contribute more to the output — quantizing them aggressively causes disproportionate error.

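AWQ’s solution: scale important weights up by a factor $s > 1$ before quantization and scale the matching activations down by $s$. Net mathematical effect: zero (the two scalings cancel, since $(s \times w_j)(x_j / s) = w_j x_j$). Quantization effect: as long as the quantization step of the group barely changes, the rounding error on $w_j$ shrinks by a factor of $s$ once the activation scaling is undone, which is like giving that weight $\log_2(s)$ extra bits of precision.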
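The scaling trick can be tried on a toy layer. The sketch below is a simplified illustration, not the AWQ implementation: it derives per-channel scales from the mean activation magnitude raised to a power `alpha` and picks the best `alpha` by a small grid search on the calibration data. The layer sizes, the one-scale-per-row quantizer, and all names are assumptions for this example.

```python
import numpy as np


def quantize_rows(w, k=4):
    """Symmetric k-bit round-to-nearest with one scale per output row."""
    qmax = 2 ** (k - 1) - 1
    s = np.abs(w).max(axis=1, keepdims=True) / qmax
    return s * np.clip(np.round(w / s), -qmax, qmax)


def output_error(w, x, scales):
    """Output MSE when weight columns are scaled up and activations scaled down."""
    w_q = quantize_rows(w * scales)          # scale each input channel's weights
    y_q = w_q @ (x / scales[:, None])        # apply the inverse scale to activations
    return np.mean((w @ x - y_q) ** 2)


rng = np.random.default_rng(0)
d_in, d_out, n = 64, 32, 256
w = rng.normal(size=(d_out, d_in))
x = rng.normal(size=(d_in, n))
x[:4] *= 20.0                                # a few high-activation input channels

act_mag = np.abs(x).mean(axis=1)             # per-channel activation magnitude
alphas = np.linspace(0.0, 1.0, 11)           # alpha = 0 means no scaling at all
errors = [output_error(w, x, act_mag ** a) for a in alphas]

plain_err = errors[0]
best_idx = int(np.argmin(errors))
print(f"no scaling: MSE={plain_err:.4f}")
print(f"best alpha={alphas[best_idx]:.1f}: MSE={errors[best_idx]:.4f}")
```

Because `alpha = 0` is part of the search, the selected scaling can never be worse than plain quantization on the calibration data; how much it helps depends on how uneven the activation magnitudes are.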

AWQ is hardware-friendly (no mixed-precision decomposition like LLM.int8(); every weight is stored in the same low-bit integer format), faster to apply than GPTQ, and achieves comparable accuracy.

Quantization-Aware Training (QAT)

QAT simulates quantization during training, allowing the model to adapt to quantization noise:

  • Forward pass: quantize and dequantize weights/activations, which injects the quantization noise the deployed model will see.
  • Backward pass: use the “straight-through estimator” (STE), which passes gradients through the quantization step as if it were the identity function.
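A toy example shows the fake-quantize forward pass and the straight-through backward pass working together. This is a minimal sketch on a linear regression problem with hand-written gradients, under assumed settings (a fixed 4-bit grid with step 0.25, plain gradient descent); real QAT frameworks handle this through autograd and usually learn or calibrate the scale.

```python
import numpy as np

STEP, QMAX = 0.25, 7   # assumed 4-bit symmetric grid: multiples of 0.25 in [-1.75, 1.75]


def fake_quant(w):
    """Quantize then dequantize, so the forward pass sees quantization noise."""
    return STEP * np.clip(np.round(w / STEP), -QMAX, QMAX)


rng = np.random.default_rng(0)
w_true = np.array([1.3, -0.7, 0.45, -1.1, 0.9, -0.2, 0.6, -1.4])
X = rng.normal(size=(512, w_true.size))
y = X @ w_true


def loss_and_grad(w_latent):
    w_q = fake_quant(w_latent)               # forward: use quantized weights
    residual = X @ w_q - y
    loss = np.mean(residual ** 2)
    grad_wq = 2.0 * X.T @ residual / len(y)  # gradient w.r.t. the quantized weights
    return loss, grad_wq                     # STE: reuse it as the gradient for w_latent


w = np.zeros_like(w_true)                    # full-precision latent weights
initial_loss, _ = loss_and_grad(w)
for _ in range(300):
    _, grad = loss_and_grad(w)
    w -= 0.05 * grad
final_loss, _ = loss_and_grad(w)

print(f"initial loss={initial_loss:.3f}  final loss={final_loss:.3f}")
print("deployed weights:", fake_quant(w))
```

The latent weights stay in full precision during training; only their quantized copies are used in the forward pass and at deployment.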

QAT typically recovers 0.5–1% of quality relative to PTQ, at the cost of additional training. Most useful for:

  • Very aggressive quantization (2–3 bit) where PTQ fails
  • Models fine-tuned on specific tasks where quality is critical
  • Edge deployment where all inference happens on fixed hardware

GGUF and Consumer LLM Inference

GGUF (GPT-Generated Unified Format), the model file format used by llama.cpp, is the format that made running quantized LLMs accessible on consumer hardware.

llama.cpp (Gerganov, 2023) implemented efficient CPU inference for quantized LLMs, running on laptops without GPUs. Key innovation: GGUF supports mixed-precision quantization (e.g., Q5_K_M uses different bit depths for different layers based on their sensitivity).

By mid-2024, you could run Llama-3-8B quantized to Q4_K_M on an 8 GB MacBook Pro:

  • Memory: 4.5 GB
  • Speed: ~30 tokens/second
  • Quality: <2% perplexity increase vs. FP16
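A rough size estimate follows from the effective bits per weight of a block-quantized format. The sketch below assumes a deliberately simplified scheme (blocks of 32 weights, 4-bit codes, one FP16 scale per block); it is not the actual Q4_K_M layout, which mixes several block types, so treat the numbers as an approximation.

```python
import numpy as np

BLOCK = 32        # weights per block (assumed)
BITS = 4          # bits per quantized weight
SCALE_BITS = 16   # one FP16 scale per block (assumed)
QMAX = 2 ** (BITS - 1) - 1


def quantize_blocks(w):
    """Quantize a flat weight array block by block with one scale per block."""
    blocks = w.reshape(-1, BLOCK)
    scales = (np.abs(blocks).max(axis=1, keepdims=True) / QMAX).astype(np.float16)
    scales = scales.astype(np.float32)
    scales[scales == 0] = 1.0                 # guard against all-zero blocks
    q = np.clip(np.round(blocks / scales), -QMAX, QMAX).astype(np.int8)
    return q, scales


def dequantize_blocks(q, scales):
    return (q * scales).reshape(-1)


bits_per_weight = BITS + SCALE_BITS / BLOCK   # 4 + 16/32 = 4.5


def model_size_gb(num_params):
    """Approximate file size in GB (1 GB = 1e9 bytes) at this bits-per-weight."""
    return num_params * bits_per_weight / 8 / 1e9


rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
q, scales = quantize_blocks(w)
w_hat = dequantize_blocks(q, scales)

print(f"bits per weight: {bits_per_weight}")
print(f"8B parameters: about {model_size_gb(8e9):.1f} GB")
print(f"max round-trip error: {np.max(np.abs(w - w_hat)):.4f}")
```

Under these assumptions an 8B-parameter model lands near the memory figure quoted above; small blocks keep each scale local, which is what lets 4-bit codes stay usable.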

This represents a fundamental democratization — AI model inference no longer requires expensive cloud compute or specialized hardware.

One thing to remember: Quantization’s practical impact is enormous — it’s the primary reason large AI models are increasingly accessible on consumer devices, and the difference between “requires a data center” and “runs on a gaming PC” is often just an INT4 quantization step.

