Notebook Testing — Deep Dive
Why notebooks need a testing strategy
Notebooks occupy an awkward middle ground: they are code, documentation, and data exploration rolled into one. Traditional testing frameworks assume a clean separation between source and output. Notebooks violate that assumption — every .ipynb file contains code cells interleaved with rich outputs (HTML tables, images, JSON). A testing strategy for notebooks must account for this hybrid nature.
Execution-based testing in depth
Papermill internals
Papermill works by opening the notebook JSON, injecting a new cell (tagged injected-parameters) directly after the cell tagged parameters, launching a Jupyter kernel, and executing cells sequentially via the Jupyter messaging protocol. It captures each cell’s outputs into a new notebook file.
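The injection step can be pictured with a few lines of plain Python operating on the notebook JSON. This is a minimal sketch, not papermill's own code: the function name inject_parameters is hypothetical, and it assumes the notebook has already been loaded into a dict (for example with json.load).

```python
import copy


def inject_parameters(notebook: dict, params: dict) -> dict:
    """Return a copy of `notebook` with an extra cell that assigns `params`.

    The new cell is placed directly after the first cell tagged `parameters`,
    or at the top of the notebook when no such cell exists.
    """
    nb = copy.deepcopy(notebook)
    # One assignment per parameter, e.g. threshold = 0.8
    source = "\n".join(f"{name} = {value!r}" for name, value in params.items())
    new_cell = {
        "cell_type": "code",
        "metadata": {"tags": ["injected-parameters"]},
        "source": source,
        "outputs": [],
        "execution_count": None,
    }
    # Index of the first cell tagged `parameters`, or -1 if there is none.
    index = next(
        (
            i
            for i, cell in enumerate(nb["cells"])
            if "parameters" in cell.get("metadata", {}).get("tags", [])
        ),
        -1,
    )
    nb["cells"].insert(index + 1, new_cell)
    return nb
```

Because the injected cell runs after the defaults, its assignments override them without the original cell being edited.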
Key operational details:
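- Kernel selection: Papermill respects the notebook’s kernelspec metadata. If the notebook specifies python3, it starts that kernel. For testing across Python versions, override with --kernel python3.11 (assuming a kernel is registered under that name).
- Timeout per cell: --execution-timeout 300 fails any cell that runs longer than five minutes (300 seconds). Essential for CI, where a hung cell would otherwise block the pipeline indefinitely.
- Error behaviour: By default, papermill stops at the first error and still writes the output notebook up to the failing cell. Note that --report-mode only hides cell inputs in the output notebook; it does not change error handling. To execute every cell and collect all errors for a full report of everything that is broken, run the notebook with jupyter nbconvert --to notebook --execute --allow-errors instead.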
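Once a run has finished, the errors recorded in the output notebook can be listed without any extra tooling, because each failed cell stores an output of type error. The helper below is a sketch under that assumption; the name collect_errors is hypothetical, and it expects the executed notebook as a dict loaded with json.load.

```python
def collect_errors(notebook: dict) -> list:
    """Return (cell_index, exception_name, message) for every error output."""
    errors = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells have no outputs
        for output in cell.get("outputs", []):
            if output.get("output_type") == "error":
                errors.append(
                    (index, output.get("ename", ""), output.get("evalue", ""))
                )
    return errors
```

A CI step could print this list and fail the job when it is non-empty.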
Parameterised test matrices
Papermill’s parameter injection enables testing a notebook across multiple configurations:
import papermill as pm
import pytest

configs = [
    {"dataset": "small.csv", "threshold": 0.5},
    {"dataset": "large.csv", "threshold": 0.8},
    {"dataset": "edge_cases.csv", "threshold": 0.0},
]


@pytest.mark.parametrize("params", configs)
def test_analysis(params, tmp_path):
    output = tmp_path / "output.ipynb"
    pm.execute_notebook(
        "analysis.ipynb",
        str(output),
        parameters=params,
        kernel_name="python3",
    )
    # If execution completes without error, the test passes
This pattern catches edge cases that a single hardcoded run would miss.
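When every dataset should be tried with every threshold, the list of configurations can be generated rather than written by hand. The sketch below assumes a full cross product is wanted, which the hand-picked list above does not require; the threshold values are reused from that list purely for illustration.

```python
import itertools

datasets = ["small.csv", "large.csv", "edge_cases.csv"]
thresholds = [0.0, 0.5, 0.8]

# Every dataset paired with every threshold: 3 x 3 = 9 configurations.
configs = [
    {"dataset": dataset, "threshold": threshold}
    for dataset, threshold in itertools.product(datasets, thresholds)
]

# Readable labels that can be passed to pytest.mark.parametrize via `ids=`.
ids = [f"{c['dataset']}-{c['threshold']}" for c in configs]
```

The trade-off is run time: each added value multiplies the number of full notebook executions.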
Output validation strategies
Snapshot testing with nbval
nbval compares cell outputs against stored values. Configuration is critical:
# sanitize.cfg
# Each rule lives in its own uniquely named section

[timestamps]
# Replace timestamps with a fixed token
regex: \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}
replace: <TIMESTAMP>

[long_floats]
# Mask floats with three or more decimal places
regex: (\d+\.\d{3,})
replace: <FLOAT>
Without sanitisation, tests break on every run due to timestamps, memory addresses, and floating-point noise.
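The effect of such rules is easy to check in isolation before committing them. The sketch below applies the two patterns from sanitize.cfg with Python's re module; the memory-address rule and the sanitise helper are illustrative additions, not part of the config above or of nbval itself.

```python
import re

# (pattern, replacement) pairs, applied in order.
RULES = [
    (re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"), "<TIMESTAMP>"),
    (re.compile(r"0x[0-9a-fA-F]+"), "<ADDRESS>"),  # illustrative extra rule
    (re.compile(r"\d+\.\d{3,}"), "<FLOAT>"),
]


def sanitise(text: str) -> str:
    """Replace volatile fragments so two runs produce comparable text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

Note that masking a float hides its value entirely, so a rule like this trades sensitivity for stability: a metric that changes from 0.912 to 0.412 would no longer fail the comparison.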
When to use: Output regression is most valuable for notebooks that produce specific numbers stakeholders rely on — financial reports, model metrics, KPI dashboards. For exploratory notebooks, smoke testing is sufficient.
Assertion cells
Embed test assertions directly in the notebook:
# Cell tagged 'test'
assert accuracy > 0.85, f"Model accuracy dropped to {accuracy}"
assert len(predictions) == len(test_data), "Prediction count mismatch"
A failing assertion raises AssertionError inside the kernel, which both nbval and papermill report as a failed cell. This approach is lightweight and keeps tests close to the code they validate.
Function-level testing with testbook
Advanced testbook patterns
testbook can execute only specific cells, reducing test time:
from testbook import testbook


@testbook("pipeline.ipynb", execute=["imports", "helpers"])
def test_normalize(tb):
    normalize = tb.ref("normalize_scores")
    result = normalize([100, 200, 300])
    assert result == [0.0, 0.5, 1.0]
Cell tags (imports, helpers) control which cells run during setup. This avoids expensive data-loading cells when you only want to test a utility function.
Testing classes and stateful objects
from testbook import testbook


@testbook("model.ipynb", execute=True)
def test_model_predict(tb):
    # The injected code runs inside the notebook's kernel, where `model` exists
    tb.inject("""
        test_input = [[1.0, 2.0, 3.0]]
        prediction = model.predict(test_input)
        assert prediction.shape == (1,), f"Unexpected shape: {prediction.shape}"
    """)
tb.inject() runs arbitrary code inside the notebook’s kernel. This is useful for testing objects that are expensive to reconstruct outside the notebook context.
Linting and static analysis
nbqa deep integration
nbqa works by extracting code cells to a temporary .py file, running the linter, and mapping diagnostics back to cell numbers:
# Type checking
nbqa mypy my_notebook.ipynb --ignore-missing-imports
# Security scanning
nbqa bandit my_notebook.ipynb -ll
# Import sorting
nbqa isort my_notebook.ipynb --diff
Combine these in a pre-commit hook:
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/nbQA-dev/nbQA
    rev: 1.8.5
    hooks:
      - id: nbqa-ruff
      - id: nbqa-black
      - id: nbqa-mypy
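The extract-and-map-back step described at the start of this section can be sketched in plain Python. This is an illustration of the idea only, not nbqa's implementation; the names extract_code and locate are hypothetical, and the notebook is assumed to be a dict loaded with json.load.

```python
def extract_code(notebook: dict):
    """Concatenate code cells and record where each line came from.

    Returns (source, line_map), where line_map[i] is the pair
    (code_cell_number, line_in_cell) for line i + 1 of the combined source.
    """
    lines = []
    line_map = []
    cell_number = 0
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        cell_number += 1
        source = cell.get("source", "")
        if isinstance(source, list):  # .ipynb may store source as a list of lines
            source = "".join(source)
        for offset, line in enumerate(source.splitlines(), start=1):
            lines.append(line)
            line_map.append((cell_number, offset))
    return "\n".join(lines), line_map


def locate(line_map, line_number: int):
    """Translate a 1-based line in the combined source to (cell, line)."""
    return line_map[line_number - 1]
```

A linter message reported for line 3 of the temporary file can then be shown to the author as a cell number and a line within that cell.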
Detecting execution order issues
A common notebook bug: cells work when run interactively (out of order) but fail when run top-to-bottom. Combining nbstripout with a CI smoke test catches this: strip outputs before commit, then run all cells from a fresh kernel in CI. If CI fails on a notebook that ran cleanly in the author’s interactive session, the likely cause is a dependency on execution order or on state left behind by a cell that has since been edited or deleted.
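A cheaper, complementary heuristic is to look at the execution counts stored in the notebook before outputs are stripped: counts that do not increase from top to bottom suggest the cells were run out of order. The helper below is a sketch with a hypothetical name, and it cannot replace the CI run, since a notebook can have tidy counts and still depend on deleted cells.

```python
def out_of_order_cells(notebook: dict) -> list:
    """Return indices of code cells whose execution count is not greater
    than that of the previous executed code cell."""
    suspicious = []
    previous = 0
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        count = cell.get("execution_count")
        if count is None:  # never run, or outputs already stripped
            continue
        if count <= previous:
            suspicious.append(index)
        previous = count
    return suspicious
```

This check has to run before nbstripout, which clears the execution counts along with the outputs.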
CI pipeline architecture
Multi-stage notebook CI
Stage 1: Lint (nbqa ruff + black --check)
↓
Stage 2: Smoke test (papermill, no output validation)
↓
Stage 3: Output regression (nbval on critical notebooks)
↓
Stage 4: Function tests (testbook / pytest)
Each stage is a separate CI job. Cheap stages run first, so obvious failures surface quickly and the more expensive output regression only runs on notebooks that already lint and execute cleanly.
Handling large data dependencies
Notebooks that load gigabytes of data cannot run in CI as-is. Solutions:
- Fixture data: Create a small representative sample (tests/fixtures/sample_10k.csv) and parameterise the notebook to use it.
- Mocking: Use testbook’s tb.patch, which applies unittest.mock.patch inside the notebook kernel, to stub expensive I/O calls.
- Caching: Use CI caching (GitHub Actions actions/cache) for downloaded datasets.
- Skip tags: Tag cells that require production data with ci-skip and remove them before execution. Papermill has no built-in option for skipping cells by tag, so this needs a preprocessing step such as nbconvert’s TagRemovePreprocessor.
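One way to implement the skip-tag idea is a small preprocessing step that removes tagged cells before the notebook is handed to the executor. The sketch below works on the notebook JSON directly; the function name drop_tagged_cells is hypothetical, and it assumes later cells do not depend on variables defined in the removed ones.

```python
import copy


def drop_tagged_cells(notebook: dict, tag: str = "ci-skip") -> dict:
    """Return a copy of `notebook` without the cells that carry `tag`."""
    nb = copy.deepcopy(notebook)
    nb["cells"] = [
        cell
        for cell in nb.get("cells", [])
        if tag not in cell.get("metadata", {}).get("tags", [])
    ]
    return nb
```

In CI, the filtered dict would be written to a temporary .ipynb with json.dump and that file passed to the test run.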
Kernel management in CI
Install the kernel explicitly in CI:
pip install ipykernel
python -m ipykernel install --user --name=project-kernel
papermill notebook.ipynb output.ipynb --kernel project-kernel
This prevents “kernel not found” errors that occur when the CI environment’s kernel name doesn’t match the notebook’s kernelspec.
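The mismatch can also be detected before any kernel is started by comparing the name in the notebook metadata with the kernels installed in the environment. The helper below is a sketch with a hypothetical name; in practice the set of available names could come from jupyter_client's KernelSpecManager().find_kernel_specs(), which is left out here to keep the example dependency-free.

```python
def missing_kernel(notebook: dict, available: set):
    """Return the notebook's kernel name if it is not installed, else None."""
    name = notebook.get("metadata", {}).get("kernelspec", {}).get("name")
    if name is not None and name not in available:
        return name
    return None
```

Failing fast on a non-None result gives a clearer message than a kernel start-up error halfway through the pipeline.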
Tradeoffs
| Approach | Confidence | Maintenance cost | Speed |
|---|---|---|---|
| Smoke test only | Low-medium | Very low | Fast |
| Smoke + assertions | Medium | Low | Fast |
| Output regression | High | Medium (sanitise rules) | Medium |
| Function-level tests | High | Medium-high | Fast |
| Full pipeline (all above) | Very high | High | Slow |
The right choice depends on the notebook’s role. An exploratory scratch notebook needs only smoke tests. A notebook that generates board-level financial reports deserves the full pipeline.
Emerging tools
- nbmake — A pytest plugin similar to nbval but focused on execution-only testing (no output comparison). Simpler configuration, faster adoption.
- Ploomber — Turns notebooks into pipeline DAGs with built-in testing hooks.
- Jupyter Scheduler — A JupyterLab extension for running notebooks as one-off or scheduled jobs, which can double as a smoke-test runner when paired with alerting on failed jobs.
One thing to remember: The hardest part of notebook testing is not the tooling — it is the cultural shift from treating notebooks as throwaway scratch pads to treating them as versioned, tested artefacts. Start with one smoke test in CI and let the habit grow from there.