Notebook Testing — Core Concepts
Why notebook testing matters
Notebooks are code, but they rarely get treated like code. A Python module has unit tests, linting, and CI. A notebook often has none of these. The result is “notebook rot” — analyses that worked three months ago now crash because a dependency updated, a data source changed, or cells were reordered during an ad-hoc demo.
Testing notebooks catches rot early. It also builds confidence: when a stakeholder asks “can I trust these numbers?”, a passing test suite is a concrete answer.
The testing spectrum
Notebook testing is not one thing. It spans several levels:
| Level | What it checks | Tool |
|---|---|---|
| Smoke test | Every cell runs without error | nbval, papermill |
| Output regression | Outputs match saved versions | nbval with --nbval-sanitize-with |
| Unit test | Individual functions work correctly | testbook, extract-and-test |
| Linting | Code quality, unused imports | nbqa + ruff/flake8 |
Most teams start with smoke tests and add layers as the notebook matures.
Smoke testing with papermill
Papermill executes a notebook programmatically and saves the result:
papermill input.ipynb output.ipynb -p dataset "sales_q1.csv"
The -p flag injects parameters. If any cell raises an exception, papermill exits with a non-zero code — perfect for CI. Teams at Netflix popularised this approach for running hundreds of analytical notebooks nightly.
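The same smoke test can be driven from Python through papermill's `execute_notebook` function, which is convenient when a test runner needs to loop over many notebooks. The sketch below is a minimal illustration, not a recommended layout: the helper names `output_path_for` and `smoke_test` and the `.executed.ipynb` naming scheme are assumptions made for this example.

```python
from pathlib import Path


def output_path_for(notebook: Path, out_dir: Path) -> Path:
    """Map e.g. notebooks/sales.ipynb to <out_dir>/sales.executed.ipynb (hypothetical scheme)."""
    return out_dir / f"{notebook.stem}.executed.ipynb"


def smoke_test(notebook: Path, out_dir: Path, parameters: dict) -> bool:
    """Execute one notebook; return True if no cell raised an exception."""
    # Imported lazily so the path helper above works without papermill installed.
    import papermill as pm
    from papermill.exceptions import PapermillExecutionError

    out_dir.mkdir(parents=True, exist_ok=True)
    try:
        pm.execute_notebook(
            str(notebook),
            str(output_path_for(notebook, out_dir)),
            parameters=parameters,  # same role as -p on the command line
        )
    except PapermillExecutionError as err:
        # ename and evalue carry the exception type and message from the failing cell
        print(f"{notebook}: {err.ename}: {err.evalue}")
        return False
    return True
```

Calling `smoke_test(Path("input.ipynb"), Path("executed"), {"dataset": "sales_q1.csv"})` would correspond to the command shown above, with the executed copy kept for inspection when a run fails.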
Output regression with nbval
nbval is a pytest plugin. It re-executes every cell and compares outputs to the saved versions in the .ipynb file:
pytest --nbval my_notebook.ipynb
Floating-point noise and timestamps break naive comparisons, so nbval supports a sanitise config that strips known-variable content before comparison. This catches genuine output changes (a model accuracy that dropped) while ignoring irrelevant ones (a different execution timestamp).
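To make the idea concrete, the sketch below mimics what sanitising does: apply the same regex substitutions to the saved and the fresh output, then compare. It reads rules in the section / `regex:` / `replace:` layout that nbval documents for its sanitize files, but the rule names, patterns, and sample strings are invented for illustration, and the code is not nbval's own implementation.

```python
import configparser
import re

# Hypothetical sanitize rules, written in the regex/replace layout nbval documents.
SANITIZE_CFG = r"""
[timestamp]
regex: \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}
replace: TIMESTAMP

[long_float]
regex: (\d+\.\d{3})\d+
replace: \1
"""


def load_rules(cfg_text):
    """Parse the config into (compiled_pattern, replacement) pairs, in file order."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return [
        (re.compile(parser[section]["regex"]), parser[section]["replace"])
        for section in parser.sections()
    ]


def sanitize(text, rules):
    """Apply every substitution in order."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


rules = load_rules(SANITIZE_CFG)
saved = "Run at 2024-01-05 09:30:00, accuracy 0.912345678"
fresh = "Run at 2024-03-18 14:02:11, accuracy 0.912345701"

# Timestamp and trailing float digits no longer cause a mismatch.
assert sanitize(saved, rules) == sanitize(fresh, rules)
```

The float rule truncates rather than rounds, so two values that straddle a third-decimal boundary would still differ; a real configuration needs patterns chosen for the outputs the notebook actually produces.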
Unit testing with testbook
testbook lets you call specific notebook functions from a standard pytest test:
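from testbook import testbook

# execute=True runs the notebook in a kernel before the test body starts
@testbook("analysis.ipynb", execute=True)
def test_clean_data(tb):
    # tb.ref returns a handle to the clean_data function defined in the notebook
    clean = tb.ref("clean_data")
    result = clean([1, None, 3])
    assert result == [1, 3]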
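This is powerful because it tests notebook logic in isolation, one function at a time. Note that execute=True still runs the whole notebook before the test body; to avoid running every cell, pass execute a list of cell tags or a slice so that only the cells defining the function are run. It also encourages refactoring: if a function is important enough to test, it is probably important enough to move to a module eventually.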
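The extract-and-test route listed in the spectrum table can be approximated with the standard library alone, since an .ipynb file is JSON. The sketch below is one possible approach under stated assumptions: the cell tag `testable`, the helper `load_tagged_code`, and the inline two-cell notebook are all hypothetical, and cells containing IPython magics would need extra handling before `exec`.

```python
import json


def load_tagged_code(notebook_json: str, tag: str) -> str:
    """Return the source of every code cell carrying `tag`, joined in notebook order."""
    nb = json.loads(notebook_json)
    chunks = []
    for cell in nb["cells"]:
        if cell["cell_type"] != "code":
            continue
        if tag in cell.get("metadata", {}).get("tags", []):
            source = cell["source"]
            # nbformat stores source either as one string or as a list of lines
            chunks.append(source if isinstance(source, str) else "".join(source))
    return "\n".join(chunks)


# Hypothetical notebook: the first cell would fail, but only the tagged cell is run.
NOTEBOOK = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "code", "metadata": {}, "outputs": [], "execution_count": None,
         "source": "raise RuntimeError('slow data load')"},
        {"cell_type": "code", "metadata": {"tags": ["testable"]}, "outputs": [],
         "execution_count": None,
         "source": ["def clean_data(values):\n",
                    "    return [v for v in values if v is not None]\n"]},
    ],
})

namespace = {}
exec(load_tagged_code(NOTEBOOK, "testable"), namespace)
assert namespace["clean_data"]([1, None, 3]) == [1, 3]
```

Once a function is being pulled out this way on a regular basis, that is usually a sign it is ready to live in an importable module.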
Linting with nbqa
nbqa applies standard Python linters to notebook cells:
nbqa ruff my_notebook.ipynb --fix
nbqa black my_notebook.ipynb
It extracts code cells, runs the tool, and writes fixes back into the .ipynb JSON. This catches unused imports, undefined names, and style issues that manual review misses.
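The extract-then-check pattern can be illustrated with a toy checker built on the standard library. This is only a sketch of the idea, not how nbqa or ruff work internally: the function `unused_imports` is hypothetical, it ignores scoping and `__all__`, and it assumes the cells contain plain Python without magics.

```python
import ast
import json


def unused_imports(notebook_json: str) -> set:
    """Return top-level imported names that are never referenced in any code cell."""
    nb = json.loads(notebook_json)
    # "".join works whether source is a single string or a list of lines
    code = "\n".join(
        "".join(cell["source"]) for cell in nb["cells"] if cell["cell_type"] == "code"
    )
    imported, used = set(), set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # "import a.b" binds the name "a"; "import a as x" binds "x"
                imported.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.Name):
            used.add(node.id)
    return imported - used


# Hypothetical notebook with one unused import spread across two code cells.
demo = json.dumps({"cells": [
    {"cell_type": "code", "source": ["import os\n", "import json\n"]},
    {"cell_type": "markdown", "source": "# notes"},
    {"cell_type": "code", "source": "print(json.dumps({}))"},
]})
assert unused_imports(demo) == {"os"}
```

A real linter covers far more cases, which is the reason to delegate to nbqa and an established tool rather than maintain a check like this by hand.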
Common misconception
Many people believe testing notebooks means converting them entirely to scripts. That is one option, but it is not the only one. Modern tools test notebooks as notebooks, preserving the interactive narrative that makes them valuable in the first place. The goal is to add confidence, not to kill the format.
Putting it in CI
A minimal GitHub Actions workflow:
- name: Test notebooks
  run: |
    pip install papermill nbval
    papermill analysis.ipynb /dev/null -p test_mode true
    pytest --nbval visualization.ipynb --nbval-sanitize-with sanitize.cfg
Run this on every pull request. Broken notebooks get caught before merge.
One thing to remember: Notebook testing is not all-or-nothing. Start with a smoke test that just runs every cell. That single step catches a large share of notebook rot with very little effort.