PyTorch Quantization — Core Concepts
What Quantization Actually Changes
Neural networks typically store weights and compute activations as 32-bit floating-point numbers (FP32). Each number takes 4 bytes and supports a huge range with high precision. Quantization maps these to lower-precision formats, typically 8-bit integers (INT8) or sometimes 4-bit integers (INT4).
The conversion uses a scale factor and zero-point:
quantized_value = round(real_value / scale) + zero_point
real_value ≈ (quantized_value - zero_point) × scale
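The scale and zero-point are calibrated so the INT8 range covers the actual range of values in the tensor. Rounding to the nearest integer step introduces a small error in every value, and values outside the calibrated range get clipped to the nearest endpoint. Together, these two effects are where accuracy loss comes from.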
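As a minimal sketch of the two formulas above, the following pure-Python snippet picks a scale and zero-point for a given value range and round-trips a few numbers. The function names (`choose_qparams`, `quantize`, `dequantize`) and the example range are illustrative assumptions, not PyTorch APIs.

```python
def choose_qparams(min_val, max_val, qmin=-128, qmax=127):
    """Pick scale and zero-point so [min_val, max_val] maps onto [qmin, qmax]."""
    # Extend the range to include 0.0 so that zero is exactly representable.
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero range
    zero_point = round(qmin - min_val / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(real_value, scale, zero_point, qmin=-128, qmax=127):
    """quantized_value = round(real_value / scale) + zero_point, then clip."""
    q = round(real_value / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(quantized_value, scale, zero_point):
    """real_value is approximately (quantized_value - zero_point) * scale."""
    return (quantized_value - zero_point) * scale


# Hypothetical tensor range of [-1.0, 3.0].
scale, zero_point = choose_qparams(-1.0, 3.0)

in_range = dequantize(quantize(0.5, scale, zero_point), scale, zero_point)
clipped = dequantize(quantize(10.0, scale, zero_point), scale, zero_point)

print("scale:", scale, "zero_point:", zero_point)
print("0.5 round-trips to", in_range)   # small rounding error only
print("10.0 round-trips to", clipped)   # clipped to the top of the range
```

For an in-range value the round-trip error is at most half a quantization step (`scale / 2`), while the out-of-range value collapses to the largest representable number.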
Three Quantization Approaches
Post-Training Dynamic Quantization
Weights are quantized ahead of time; activations are quantized on-the-fly during inference. No calibration data needed.
Pros: Easiest to apply, no training required
Cons: Activations quantized without calibration may lose accuracy
Best for: Models dominated by linear layers (LSTMs, Transformers for NLP)
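A hedged sketch of dynamic quantization using `quantize_dynamic` from `torch.ao.quantization`. The toy model and its layer sizes are made up for illustration, and the example assumes a PyTorch build with a quantized CPU backend (such as fbgemm or qnnpack) available.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

# Hypothetical toy model; the layer sizes are arbitrary.
float_model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 4),
).eval()

# Swap nn.Linear modules for dynamically quantized versions:
# weights are converted to INT8 now, activations are quantized on each call.
quantized_model = quantize_dynamic(float_model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(2, 16)
with torch.no_grad():
    float_out = float_model(x)
    quant_out = quantized_model(x)

print(type(quantized_model[0]))
print("max abs difference:", (float_out - quant_out).abs().max().item())
```

No calibration loop appears anywhere, which is what makes this the quickest approach to try first.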
Post-Training Static Quantization
Both weights and activations are quantized ahead of time. Requires a representative calibration dataset to determine the range of activations.
Pros: Faster inference than dynamic (no on-the-fly quantization overhead)
Cons: Needs calibration data, more setup
Best for: CNN-based vision models, any model where activation ranges are stable
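One way to do this in PyTorch is the eager-mode workflow in `torch.ao.quantization` (prepare, calibrate, convert); other workflows exist. The sketch below is illustrative only: `TinyConvNet` is a hypothetical model, random tensors stand in for a real calibration set, and a quantized CPU backend is assumed to be available.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (
    DeQuantStub,
    QuantStub,
    convert,
    get_default_qconfig,
    prepare,
)


class TinyConvNet(nn.Module):
    """Hypothetical model; the stubs mark where tensors enter and leave INT8."""

    def __init__(self):
        super().__init__()
        self.quant = QuantStub()
        self.conv = nn.Conv2d(3, 8, kernel_size=3)
        self.relu = nn.ReLU()
        self.dequant = DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.conv(x))
        return self.dequant(x)


model = TinyConvNet().eval()
# Use the default config for whichever quantized engine this build selects.
model.qconfig = get_default_qconfig(torch.backends.quantized.engine)

prepared = prepare(model)  # inserts observers, model still runs in FP32

# Calibration: random data is a placeholder for representative real inputs.
with torch.no_grad():
    for _ in range(8):
        prepared(torch.randn(4, 3, 16, 16))

quantized = convert(prepared)  # replaces modules with INT8 versions

with torch.no_grad():
    out = quantized(torch.randn(4, 3, 16, 16))
print(type(quantized.conv), out.shape)
```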
Quantization-Aware Training (QAT)
Simulates quantization during training. The model learns to compensate for quantization error, producing weights that work better when actually quantized.
Pros: Highest accuracy, often within 0.1-0.5% of FP32
Cons: Requires retraining (or at least fine-tuning), more complex pipeline
Best for: When post-training quantization loses too much accuracy, or when deploying to very constrained hardware
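A common way to simulate quantization during training is a quantize-then-dequantize ("fake quantize") step in the forward pass, with gradients passed straight through for values that were not clipped. The sketch below shows only that single step using `torch.fake_quantize_per_tensor_affine`; the scale, zero-point, and weight values are made up, and a full QAT pipeline involves considerably more than this.

```python
import torch

# Hypothetical quantization parameters for an INT8 range of [-128, 127].
scale, zero_point = 0.1, 0

# Toy "weights"; the last one lies far outside the representable range.
weights = torch.tensor([0.26, -0.04, 50.0], requires_grad=True)

# Forward pass: values are rounded to the INT8 grid and mapped back to float,
# so the rest of the network sees the error that quantization will introduce.
fake_quantized = torch.fake_quantize_per_tensor_affine(
    weights, scale, zero_point, -128, 127
)

# Backward pass: gradients flow through in-range values unchanged and are
# zeroed for values that were clipped.
fake_quantized.sum().backward()

print("fake-quantized:", fake_quantized.detach())  # 50.0 is clipped to 127 * 0.1
print("gradient:", weights.grad)
```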
Accuracy vs. Speed Tradeoff
| Method | Typical Accuracy Loss | Speedup | Effort |
|---|---|---|---|
| Dynamic (INT8) | 0.5-2% | 1.5-2× | Minutes |
| Static (INT8) | 0.3-1% | 2-3× | Hours (calibration) |
| QAT (INT8) | 0.1-0.5% | 2-3× | Days (retraining) |
| INT4 (weights only) | 1-3% | 2-4× | Hours |
The right choice depends on your accuracy budget. If a 1% accuracy drop is acceptable, post-training quantization usually gets you there in minutes to hours. If you need near-original accuracy, invest in QAT.
What Quantizes Well (and What Doesn’t)
Quantizes well:
- Large models with redundant parameters (the redundancy absorbs quantization noise)
- Convolutional layers (weights tend to have narrow, stable distributions)
- Models with batch normalization (normalizes activation ranges)
Quantizes poorly:
- Small models with few parameters (no redundancy to absorb error)
- Layers with extreme or highly skewed activation ranges (the logits fed into an attention softmax, for example)
- Models with skip connections that accumulate quantization error across depth
Common Misconception
Many people think quantization is only for edge deployment on phones and IoT devices. In reality, server-side quantization is increasingly important. Storing the weights of a 7B-parameter LLM in INT4 instead of FP16 cuts weight memory from about 14 GB to about 3.5 GB. Because LLM inference is often limited by memory bandwidth, throughput can rise substantially as well, roughly doubling in favorable cases. For companies serving millions of requests, that can mean a much smaller GPU bill.
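The memory figures follow from simple arithmetic on the weight count. A small sketch, assuming only weights are counted (activations, KV cache, and per-group scale metadata are ignored) and using a hypothetical helper name:

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


params_7b = 7_000_000_000

fp16_gb = weight_memory_gb(params_7b, 16)
int8_gb = weight_memory_gb(params_7b, 8)
int4_gb = weight_memory_gb(params_7b, 4)

print(f"FP16: {fp16_gb} GB, INT8: {int8_gb} GB, INT4: {int4_gb} GB")
```

Real INT4 checkpoints are usually somewhat larger than this estimate because they also store scales and zero-points.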
The Calibration Step
Static quantization needs a calibration dataset — a small, representative sample (typically 100-1000 examples from your training set). During calibration, the model runs in FP32 while observers record the range of values at each layer. These ranges set the scale and zero-point for INT8 conversion.
Bad calibration data leads to bad quantization. If your calibration set doesn’t cover the range of real inputs, some activations will be clipped, causing accuracy degradation.
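To make the clipping effect concrete, here is a pure-Python sketch of a min/max observer. The class `RangeObserver`, the helper `roundtrip`, and all sample values are hypothetical; PyTorch's own observers are more sophisticated.

```python
class RangeObserver:
    """Tracks the running min and max of the values it sees."""

    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, values):
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))

    def qparams(self, qmin=-128, qmax=127):
        lo = min(self.min_val, 0.0)
        hi = max(self.max_val, 0.0)
        scale = (hi - lo) / (qmax - qmin) or 1.0
        zero_point = round(qmin - lo / scale)
        return scale, zero_point


def roundtrip(x, scale, zero_point, qmin=-128, qmax=127):
    """Quantize to INT8 with clipping, then map back to a real value."""
    q = max(qmin, min(qmax, round(x / scale) + zero_point))
    return (q - zero_point) * scale


# Calibration set that misses the large activations seen in production.
narrow = RangeObserver()
narrow.observe([-0.5, 0.2, 1.0])

# Calibration set that covers the real range.
wide = RangeObserver()
wide.observe([-0.5, 0.2, 1.0])
wide.observe([-2.0, 3.5, 6.0])

real_activation = 5.0
narrow_result = roundtrip(real_activation, *narrow.qparams())
wide_result = roundtrip(real_activation, *wide.qparams())

print("narrow calibration:", narrow_result)  # clipped to the observed max
print("wide calibration:  ", wide_result)    # close to the true value
```

The wider range costs a little resolution for small values, which is the tradeoff a calibration set has to balance.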
Hardware Considerations
Quantization benefits depend on hardware support:
- x86 CPUs: INT8 via AVX-512 VNNI, with major speedups
- ARM CPUs: INT8 via NEON, essential for mobile
- NVIDIA GPUs: INT8 via Tensor Cores (A100, H100), up to about 2× the peak throughput of FP16
- Apple Neural Engine: INT8/INT4 natively, which is how on-device models run
Without hardware support for low-precision arithmetic, quantized models don’t run faster — they may actually be slower due to dequantization overhead.
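In PyTorch, a quick first check is which quantized CPU backends the installed build supports. This sketch only reads `torch.backends.quantized`; it says nothing about how fast those kernels are on a given machine, so it is no substitute for benchmarking.

```python
import torch

# Quantized kernel backends compiled into this PyTorch build,
# for example fbgemm/x86 on x86 CPUs or qnnpack on ARM.
engines = torch.backends.quantized.supported_engines

# The backend that quantized operators will currently dispatch to.
current = torch.backends.quantized.engine

print("supported engines:", engines)
print("active engine:", current)
```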
The one thing to remember: Quantization is the single most practical optimization for deployment — it reduces model size and latency dramatically, and the right approach (dynamic, static, or QAT) depends on how much accuracy you can afford to lose.