Notebook Testing — Core Concepts

Why notebook testing matters

Notebooks are code, but they rarely get treated like code. A Python module has unit tests, linting, and CI. A notebook often has none of these. The result is “notebook rot” — analyses that worked three months ago now crash because a dependency updated, a data source changed, or cells were reordered during an ad-hoc demo.

Testing notebooks catches rot early. It also builds confidence: when a stakeholder asks “can I trust these numbers?”, a passing test suite is a concrete answer.

The testing spectrum

Notebook testing is not one thing. It spans several levels:

Level | What it checks | Tool
Smoke test | Every cell runs without error | nbval (--nbval-lax), papermill
Output regression | Outputs match saved versions | nbval (--nbval, with --nbval-sanitize-with)
Unit test | Individual functions work correctly | testbook, extract-and-test
Linting | Code quality, unused imports | nbqa + ruff/flake8

Most teams start with smoke tests and add layers as the notebook matures.

Smoke testing with papermill

Papermill executes a notebook programmatically and saves the result:

papermill input.ipynb output.ipynb -p dataset "sales_q1.csv"

The -p flag injects parameters, overriding the defaults set in the notebook cell tagged parameters. If any cell raises an exception, papermill exits with a non-zero code, which is exactly what a CI job needs. Teams at Netflix popularised this approach for running hundreds of analytical notebooks nightly.
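
For teams that would rather drive the smoke test from Python (for example inside a pytest suite) than from a shell step, the same CLI call can be wrapped in a small helper. This is a minimal sketch, not an official papermill recipe: build_papermill_command and run_smoke_test are hypothetical names, and the code assumes the papermill executable is installed and on the PATH.

```python
import subprocess


def build_papermill_command(input_path, output_path, parameters=None):
    """Build the papermill CLI call; each parameter becomes a -p NAME VALUE pair."""
    command = ["papermill", input_path, output_path]
    for name, value in (parameters or {}).items():
        command += ["-p", name, str(value)]
    return command


def run_smoke_test(input_path, output_path, parameters=None):
    """Execute the notebook and report whether every cell ran without error.

    Assumes the papermill executable is installed and on the PATH.
    """
    command = build_papermill_command(input_path, output_path, parameters)
    # papermill exits with a non-zero code if any cell raises an exception.
    completed = subprocess.run(command)
    return completed.returncode == 0
```

A test can then be a single line such as assert run_smoke_test("input.ipynb", "output.ipynb", {"dataset": "sales_q1.csv"}). Keeping the command construction separate from the subprocess call makes the helper itself easy to check without executing a notebook.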

Output regression with nbval

nbval is a pytest plugin. It re-executes every cell and compares outputs to the saved versions in the .ipynb file:

pytest --nbval my_notebook.ipynb

Floating-point noise and timestamps break naive comparisons, so nbval accepts a sanitize file, passed on the command line, that lists regex and replacement pairs. Each pattern is applied to the outputs before comparison, swapping known-variable content for a fixed placeholder. This catches genuine output changes (a model accuracy that dropped) while ignoring irrelevant ones (a different execution timestamp).
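
To make the sanitising step concrete, here is a sketch of the idea; it is not nbval's actual implementation. The config text follows the section / regex / replace layout used in nbval sanitize files, while the function names, the two sample patterns, and the sample outputs are illustrative assumptions.

```python
import configparser
import re

# Same layout as an nbval sanitize file: one section per pattern.
# The patterns themselves are examples chosen for this sketch.
SANITIZE_CFG = r"""
[timestamp]
regex: \d{2}:\d{2}:\d{2}
replace: TIME-STAMP

[memory_address]
regex: 0x[0-9a-fA-F]+
replace: ADDRESS
"""


def load_patterns(cfg_text):
    """Parse the config text into (compiled_regex, replacement) pairs."""
    parser = configparser.RawConfigParser()
    parser.read_string(cfg_text)
    return [
        (re.compile(parser.get(section, "regex")), parser.get(section, "replace"))
        for section in parser.sections()
    ]


def sanitize(text, patterns):
    """Replace every match of every pattern with its placeholder."""
    for regex, replacement in patterns:
        text = regex.sub(replacement, text)
    return text


def outputs_match(saved, fresh, patterns):
    """Compare two cell outputs after sanitising both sides."""
    return sanitize(saved, patterns) == sanitize(fresh, patterns)
```

With these patterns, two outputs that differ only in their timestamp compare equal, while a changed accuracy figure still counts as a mismatch.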

Unit testing with testbook

testbook lets you call specific notebook functions from a standard pytest test:

from testbook import testbook

@testbook("analysis.ipynb", execute=True)
def test_clean_data(tb):
    clean = tb.ref("clean_data")
    result = clean([1, None, 3])
    assert result == [1, 3]

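This is powerful because it tests notebook logic in isolation from the rest of the analysis. Note that execute=True still runs every cell before the test body; to run only the cells that define the function, pass execute a list of cell tags or indices, or a slice, instead. The approach also encourages refactoring: if a function is important enough to test, it is probably important enough to move to a module eventually.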
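
The extract-and-test route listed in the table needs nothing beyond the standard library, because an .ipynb file is JSON. The sketch below shows one hypothetical way to do it: load_definitions is an invented helper, the in-memory notebook stands in for a real file, and the approach assumes the selected cells contain plain Python (no IPython magics) and are safe to execute.

```python
import json


def load_definitions(notebook_json, marker="def "):
    """Execute only the code cells that define functions; return their namespace.

    Cells are selected with a naive substring check (an assumption of this
    sketch); tagging cells in the notebook metadata would be more robust.
    """
    notebook = json.loads(notebook_json)
    namespace = {}
    for cell in notebook["cells"]:
        if cell["cell_type"] != "code":
            continue
        # The notebook format stores source as a string or a list of lines.
        source = cell["source"]
        if isinstance(source, list):
            source = "".join(source)
        if marker in source:
            exec(source, namespace)
    return namespace


# Stand-in for the contents of analysis.ipynb (hypothetical).
NOTEBOOK_JSON = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Analysis"]},
        {
            "cell_type": "code",
            "metadata": {},
            "outputs": [],
            "execution_count": None,
            "source": [
                "def clean_data(values):\n",
                "    return [v for v in values if v is not None]\n",
            ],
        },
        {
            "cell_type": "code",
            "metadata": {},
            "outputs": [],
            "execution_count": None,
            "source": ["raise RuntimeError('slow cell that the test skips')"],
        },
    ],
})


def test_clean_data():
    clean = load_definitions(NOTEBOOK_JSON)["clean_data"]
    assert clean([1, None, 3]) == [1, 3]
```

The trade-off against testbook is that nothing runs in a real kernel, so functions that depend on state built up by earlier cells need those cells selected as well.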

Linting with nbqa

nbqa applies standard Python linters and formatters to notebook cells:

nbqa ruff my_notebook.ipynb --fix
nbqa black my_notebook.ipynb

It extracts code cells, runs the tool, and writes fixes back into the .ipynb JSON. This catches unused imports, undefined names, and style issues that manual review misses.
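
As a rough illustration of the kind of check a linter runs once the cells are extracted, the sketch below flags imports that no cell ever uses. It is a deliberately simplified stand-in for a real rule such as ruff's unused-import check (F401), not how nbqa or ruff is implemented; unused_imports is a hypothetical helper, and it assumes the cells hold plain Python without IPython magics.

```python
import ast


def unused_imports(cell_sources):
    """Return imported names that are never referenced in any cell.

    Simplified on purpose: all cells are joined into one module and any
    use of a name anywhere counts, regardless of scope or cell order.
    """
    tree = ast.parse("\n".join(cell_sources))
    imported = set()
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # "import a.b" binds the top-level name "a".
                imported.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                if alias.name != "*":  # star imports bind unknown names
                    imported.add(alias.asname or alias.name)
        elif isinstance(node, ast.Name):
            used.add(node.id)
    return sorted(imported - used)
```

Joining the cells first matters: an import in the first cell that is only used in the last one is not unused, which is why notebook-aware tooling looks at the notebook as a whole.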

Common misconception

Many people believe testing notebooks means converting them entirely to scripts. That is one option, but it is not the only one. Modern tools test notebooks as notebooks, preserving the interactive narrative that makes them valuable in the first place. The goal is to add confidence, not to kill the format.

Putting it in CI

A minimal GitHub Actions workflow:

- name: Test notebooks
  run: |
    pip install papermill nbval
    papermill analysis.ipynb /dev/null -p test_mode True
    pytest --nbval visualization.ipynb --nbval-sanitize-with sanitize.cfg

Run this on every pull request. Broken notebooks get caught before merge.

One thing to remember: Notebook testing is not all-or-nothing. Start with a smoke test that just runs every cell. That single step catches a large share of notebook rot (broken imports, missing data, cells that depend on stale state) with almost zero effort.

