PyTorch Gradient Checkpointing — ELI5

How PyTorch trades a little extra time for massive memory savings when training huge neural networks.

Imagine you’re baking a really complicated cake with 50 steps. Normally, you’d keep notes from every single step so you can fix mistakes later. But your kitchen notebook is tiny — you can’t fit 50 pages of notes.

Gradient checkpointing is like only writing down notes for every 10th step. When you need to fix step 23, you go back to your notes from step 20 and redo steps 21–23. It takes a little extra time, but your tiny notebook can handle it.
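To make the notebook idea concrete, here is a small plain-Python sketch (no PyTorch yet). The recipe step `run_step` and the helpers `bake` and `redo` are made up for illustration. Only the pattern matters: keep a note every 10th step, then replay the few steps after the nearest note.

```python
NUM_STEPS = 50   # steps in the recipe
EVERY = 10       # keep a note only every 10th step


def run_step(step, value):
    """Hypothetical stand-in for one recipe step; any deterministic function works."""
    return (value * 3 + step) % 1000


def bake(start):
    """Run all 50 steps, writing notes only at the start and every 10th step."""
    notes = {0: start}
    value = start
    for step in range(1, NUM_STEPS + 1):
        value = run_step(step, value)
        if step % EVERY == 0:
            notes[step] = value
    return notes


def redo(notes, step):
    """Recover the result of `step` by replaying from the nearest earlier note."""
    base = (step // EVERY) * EVERY
    value = notes[base]
    for s in range(base + 1, step + 1):
        value = run_step(s, value)
    steps_redone = step - base
    return value, steps_redone


notes = bake(start=7)
value_23, steps_redone = redo(notes, 23)
print(f"notes kept: {len(notes)} instead of {NUM_STEPS}")
print(f"steps redone to reach step 23: {steps_redone}")
```

In this toy version the notebook holds 6 entries rather than 50, and getting back to step 23 costs three redone steps (21, 22 and 23).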

In PyTorch, training a neural network means the computer remembers everything it calculated going forward (the “forward pass”) so it can learn from mistakes going backward. With really big models — think GPT-sized networks — remembering everything eats up all the GPU memory.
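You can peek at this remembering with a short sketch. It assumes a reasonably recent PyTorch in which `torch.autograd.graph.saved_tensors_hooks` is available; the tiny model and the helper names `remember` and `recall` are invented for illustration.

```python
import torch
from torch import nn

remembered_sizes = []  # number of elements in each tensor autograd keeps


def remember(tensor):
    """Called each time autograd saves a tensor for the backward pass."""
    remembered_sizes.append(tensor.numel())
    return tensor


def recall(tensor):
    """Called when the backward pass asks for a saved tensor back."""
    return tensor


torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
x = torch.randn(16, 64)

with torch.autograd.graph.saved_tensors_hooks(remember, recall):
    loss = model(x).sum()   # forward pass: the notes get written here

loss.backward()             # backward pass: the notes get read back here

print(f"tensors kept for backward: {len(remembered_sizes)}")
print(f"total elements kept: {sum(remembered_sizes)}")
```

The counts cover whatever autograd chose to hold on to, which can include references to weights as well as intermediate results, so treat them as a rough picture rather than an exact memory measurement. The point is that the list grows with every layer, and in a huge model that adds up fast.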

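Gradient checkpointing says: “Don’t remember everything. Save only a few key snapshots. When you need the stuff in between, just recalculate it.” The GPU then holds far fewer intermediate results (savings of 60–70% are often quoted, though the exact figure depends on the model and on how many snapshots you keep), while training typically takes about 20–30% longer because parts of the forward pass run a second time. Only the intermediate results shrink; the model’s weights take up the same space as before.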
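In PyTorch, the usual entry point for this is `torch.utils.checkpoint.checkpoint`. The sketch below is a minimal illustration, not a recipe for real training: `Block`, `TinyModel`, and `grads_for` are made-up names, and it assumes a PyTorch version that accepts the `use_reentrant=False` argument. It wraps each block so that only the block's input is kept, and then checks that the gradients come out the same either way.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    """One chunk of the network (a made-up residual block)."""

    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.net(x)


class TinyModel(nn.Module):
    def __init__(self, dim=32, depth=4, use_checkpointing=False):
        super().__init__()
        self.blocks = nn.ModuleList([Block(dim) for _ in range(depth)])
        self.use_checkpointing = use_checkpointing

    def forward(self, x):
        for block in self.blocks:
            if self.use_checkpointing:
                # Keep only the block's input; its insides are recomputed during backward.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x


def grads_for(model, x):
    """Run one forward and backward pass and return a copy of every gradient."""
    model.zero_grad()
    model(x).sum().backward()
    return [p.grad.clone() for p in model.parameters()]


torch.manual_seed(0)
model = TinyModel()
x = torch.randn(8, 32)

plain = grads_for(model, x)            # remember everything
model.use_checkpointing = True
checkpointed = grads_for(model, x)     # remember snapshots, recompute the rest

same = all(torch.allclose(a, b) for a, b in zip(plain, checkpointed))
print(f"gradients match: {same}")
```

A model this small will not show a meaningful memory difference. What the sketch does show is that checkpointing changes how the gradients are obtained, not what they are, so the model learns the same thing.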

This is why researchers can sometimes fit a model on a single GPU that would otherwise need several expensive ones. It’s a deliberate tradeoff: spend time to save memory.

The one thing to remember: Gradient checkpointing forgets intermediate calculations on purpose, then recalculates them when needed — trading time for memory so bigger models can train on smaller hardware.
