PyTorch Quantization — Core Concepts

Understand post-training quantization and quantization-aware training in PyTorch: when each works and how they affect model quality.

What Quantization Actually Changes

Neural networks store weights and compute activations as 32-bit floating-point numbers (FP32). Each number takes 4 bytes and supports a huge range with high precision. Quantization maps these to lower-precision formats — typically 8-bit integers (INT8) or sometimes 4-bit integers.

The conversion uses a scale factor and zero-point:

quantized_value = round(real_value / scale) + zero_point
real_value ≈ (quantized_value - zero_point) × scale

The scale and zero-point are calibrated so the INT8 range covers the actual range of values in the tensor. Values outside this range are clipped, and values inside it are rounded to the nearest representable step. Both effects contribute to accuracy loss.
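
To make the two formulas concrete, here is a minimal pure-Python sketch of the mapping. The helper names (`choose_qparams`, `quantize`, `dequantize`) and the example range of -1.0 to 3.0 are assumptions for illustration, not PyTorch APIs, and the signed 8-bit range is just one common choice.

```python
def choose_qparams(min_val, max_val, qmin=-128, qmax=127):
    """Pick scale and zero_point so [min_val, max_val] maps onto [qmin, qmax]."""
    # Extend the range to include 0.0 so that real zero is exactly representable.
    min_val, max_val = min(min_val, 0.0), max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin)
    if scale == 0.0:  # degenerate all-zero range
        scale = 1.0
    zero_point = round(qmin - min_val / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))  # clip to the integer range


def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale


scale, zero_point = choose_qparams(-1.0, 3.0)
for x in (0.0, 0.5, 2.9, 10.0):
    q = quantize(x, scale, zero_point)
    print(f"{x:>5} -> q={q:>4} -> {dequantize(q, scale, zero_point):.4f}")
```

In this sketch, 0.5 comes back with a small rounding error, while 10.0 lies outside the calibrated range and is clipped to the largest representable value (just under 3.0).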

Three Quantization Approaches

Post-Training Dynamic Quantization

Weights are quantized ahead of time; activations are quantized on-the-fly during inference. No calibration data needed.

Pros: Easiest to apply, no training required
Cons: Activations quantized without calibration may lose accuracy
Best for: Models dominated by linear layers (LSTMs, Transformers for NLP)
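
As a rough sketch of what this looks like in code, the example below applies `quantize_dynamic` from `torch.ao.quantization` to a toy model. The model architecture and input shapes are made up for illustration, and the example assumes a PyTorch build that ships the `torch.ao.quantization` namespace with a CPU quantization backend.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

# Hypothetical toy model; any module containing nn.Linear layers would do.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()

# Swap nn.Linear modules for dynamically quantized versions with INT8 weights.
quantized_model = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(4, 128)
with torch.no_grad():
    out_fp32 = model(x)
    out_int8 = quantized_model(x)

print(quantized_model)
print("max abs difference:", (out_fp32 - out_int8).abs().max().item())
```

Comparing the two outputs on a handful of real inputs is a quick sanity check before measuring accuracy on a full validation set.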

Post-Training Static Quantization

Both weights and activations are quantized ahead of time. Requires a representative calibration dataset to determine the range of activations.

Pros: Faster inference than dynamic (no on-the-fly quantization overhead)
Cons: Needs calibration data, more setup
Best for: CNN-based vision models, any model where activation ranges are stable
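
The following is a minimal sketch of the eager-mode static workflow in `torch.ao.quantization`: mark where tensors enter and leave the quantized region, attach a qconfig, insert observers, run calibration batches, then convert. `SmallNet` is a hypothetical toy model and the random tensors are only a stand-in for a real calibration set.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (
    QuantStub, DeQuantStub, get_default_qconfig, prepare, convert,
)


class SmallNet(nn.Module):  # hypothetical toy model
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()      # FP32 -> INT8 at the input
        self.conv = nn.Conv2d(3, 8, kernel_size=3)
        self.relu = nn.ReLU()
        self.dequant = DeQuantStub()  # INT8 -> FP32 at the output

    def forward(self, x):
        return self.dequant(self.relu(self.conv(self.quant(x))))


model = SmallNet().eval()
# Use the default qconfig for whichever quantized engine this build selects.
model.qconfig = get_default_qconfig(torch.backends.quantized.engine)

prepared = prepare(model)  # returns a copy with observers inserted
with torch.no_grad():
    for _ in range(8):
        prepared(torch.randn(4, 3, 32, 32))  # stand-in calibration batches
quantized = convert(prepared)  # replaces modules with INT8 implementations

with torch.no_grad():
    out = quantized(torch.randn(1, 3, 32, 32))
print(out.shape, out.dtype)
```

In a real pipeline the calibration loop would iterate over representative samples rather than random noise, since the observed ranges determine the final scale and zero-point.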

Quantization-Aware Training (QAT)

Simulates quantization during training. The model learns to compensate for quantization error, producing weights that work better when actually quantized.

Pros: Highest accuracy, often within 0.1-0.5% of FP32
Cons: Requires retraining (or at least fine-tuning), more complex pipeline
Best for: When post-training quantization loses too much accuracy, or when deploying to very constrained hardware
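
The "simulation" is usually done with fake quantization: values are rounded and clipped to the integer grid in the forward pass but stay in floating point, and gradients pass through as if the rounding were not there. The sketch below shows this with `torch.fake_quantize_per_tensor_affine`; the scale of 0.01 and zero-point of 0 are arbitrary values chosen for illustration.

```python
import torch

x = torch.tensor([-2.0, -0.3, 0.004, 0.7, 5.0], requires_grad=True)

# Assumed parameters: scale 0.01, zero-point 0, signed 8-bit range,
# so representable values span roughly [-1.28, 1.27].
y = torch.fake_quantize_per_tensor_affine(x, 0.01, 0, -128, 127)
y.sum().backward()

print(y)       # values snapped to the INT8 grid, still stored as float32
print(x.grad)  # 1 where x was in range, 0 where it was clipped
```

In a full QAT workflow these fake-quantize operations are typically inserted for you (for example by `prepare_qat` in `torch.ao.quantization`) rather than called by hand.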

Accuracy vs. Speed Tradeoff

Method	Typical Accuracy Loss	Speedup	Effort
Dynamic (INT8)	0.5-2%	1.5-2×	Minutes
Static (INT8)	0.3-1%	2-3×	Hours (calibration)
QAT (INT8)	0.1-0.5%	2-3×	Days (retraining)
INT4 (weights only)	1-3%	2-4×	Hours

The right choice depends on your accuracy budget. If a 1% accuracy drop is acceptable, post-training quantization usually gets you there in minutes to hours. If you need near-original accuracy, invest in QAT.

What Quantizes Well (and What Doesn’t)

Quantizes well:

Large models with redundant parameters (the redundancy absorbs quantization noise)
Convolutional layers (weights tend to have narrow, stable distributions)
Models with batch normalization (normalizes activation ranges)

Quantizes poorly:

Small models with few parameters (no redundancy to absorb error)
Layers whose activations are hard to cover with a single range (attention softmax outputs, for example, are bounded but concentrated near zero with a few large values)
Models with skip connections that accumulate quantization error across depth

Common Misconception

Many people think quantization is only for edge deployment on phones and IoT devices. In reality, server-side quantization is increasingly important. Running a 7B-parameter LLM with INT4 weights instead of FP16 cuts weight memory from about 14 GB to about 3.5 GB and, because LLM decoding is often limited by memory bandwidth, can raise throughput substantially. For companies serving millions of requests, that translates into a noticeably smaller GPU bill.
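
The memory figures follow from simple arithmetic on the parameter count. A small sketch, counting weights only (activations, KV cache, and per-group scale metadata are ignored) and using decimal gigabytes; the helper name `weight_memory_gb` is made up for this example.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory for weights only, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(7e9, bits):.1f} GB")
```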

The Calibration Step

Static quantization needs a calibration dataset — a small, representative sample (typically 100-1000 examples from your training set). During calibration, the model runs in FP32 while observers record the range of values at each layer. These ranges set the scale and zero-point for INT8 conversion.

Bad calibration data leads to bad quantization. If your calibration set doesn’t cover the range of real inputs, some activations will be clipped, causing accuracy degradation.
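
The effect of an unrepresentative calibration set can be seen in a toy pure-Python sketch. `RangeObserver` and `round_trip` are hypothetical stand-ins written for this example (PyTorch's real observers are more elaborate), and the activation values are invented.

```python
class RangeObserver:
    """Hypothetical stand-in for an observer: tracks the running min and max."""

    def __init__(self):
        self.min_val, self.max_val = float("inf"), float("-inf")

    def observe(self, values):
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))

    def qparams(self, qmin=0, qmax=255):
        # Extend the range to include 0.0 so zero is exactly representable.
        lo, hi = min(self.min_val, 0.0), max(self.max_val, 0.0)
        scale = (hi - lo) / (qmax - qmin)
        zero_point = round(qmin - lo / scale)
        return scale, zero_point


def round_trip(x, scale, zero_point, qmin=0, qmax=255):
    """Quantize x with clipping, then dequantize it again."""
    q = max(qmin, min(qmax, round(x / scale) + zero_point))
    return (q - zero_point) * scale


narrow, wide = RangeObserver(), RangeObserver()
narrow.observe([0.0, 0.5, 1.0])     # calibration set misses large activations
wide.observe([0.0, 0.5, 1.0, 6.0])  # calibration set covers them

x = 4.0  # an activation seen at inference time
print("narrow calibration:", round_trip(x, *narrow.qparams()))
print("wide calibration:  ", round_trip(x, *wide.qparams()))
```

With the narrow calibration set the activation 4.0 is clipped down to the observed maximum of 1.0, while the wider set reproduces it to within one quantization step.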

Hardware Considerations

Quantization benefits depend on hardware support:

x86 CPUs: INT8 via AVX-512 VNNI — major speedups
ARM CPUs: INT8 via NEON — essential for mobile
NVIDIA GPUs: INT8 via Tensor Cores (A100, H100) — 2× vs FP16
Apple Neural Engine: INT8/INT4 natively — how on-device models run

Without hardware support for low-precision arithmetic, quantized models don’t run faster — they may actually be slower due to dequantization overhead.
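
For CPU inference, one quick check is to ask PyTorch which quantized backends the installed build supports. A minimal sketch; the exact list depends on the platform and on how PyTorch was compiled.

```python
import torch

# Quantized CPU backends this PyTorch build was compiled with.
engines = torch.backends.quantized.supported_engines
# Backend currently selected for quantized operators.
current = torch.backends.quantized.engine

print("supported engines:", engines)
print("current engine:   ", current)
```

This only reports what the PyTorch build supports, so measuring latency on the target machine is still the reliable way to confirm an actual speedup.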

The one thing to remember: Quantization is the single most practical optimization for deployment — it reduces model size and latency dramatically, and the right approach (dynamic, static, or QAT) depends on how much accuracy you can afford to lose.
