Notebook Testing — Deep Dive

Why notebooks need a testing strategy

Notebooks occupy an awkward middle ground: they are code, documentation, and data exploration rolled into one. Traditional testing frameworks assume a clean separation between source and output. Notebooks violate that assumption — every .ipynb file contains code cells interleaved with rich outputs (HTML tables, images, JSON). A testing strategy for notebooks must account for this hybrid nature.

Execution-based testing in depth

Papermill internals

Papermill works by opening the notebook JSON, inserting a new cell (tagged injected-parameters) that holds the supplied values directly after the cell tagged parameters, launching a Jupyter kernel, and executing cells sequentially via the Jupyter messaging protocol. It captures each cell’s outputs into a new notebook file.
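
The injection step can be pictured with a small sketch that works on the notebook JSON as a plain dictionary. This is not papermill's own code: the helper name inject_parameters and the sample notebook are hypothetical, and the real tool also handles details such as replacing a previously injected cell and rendering parameters for other kernel languages.

```python
import copy


def inject_parameters(notebook, parameters):
    """Return a copy of `notebook` with a new code cell that assigns `parameters`."""
    nb = copy.deepcopy(notebook)
    cells = nb["cells"]

    # Find the first cell tagged "parameters"; with no such cell, insert at the top.
    insert_at = 0
    for index, cell in enumerate(cells):
        if "parameters" in cell.get("metadata", {}).get("tags", []):
            insert_at = index + 1
            break

    # Render each parameter as a plain Python assignment.
    source = "\n".join(f"{name} = {value!r}" for name, value in parameters.items())
    injected = {
        "cell_type": "code",
        "metadata": {"tags": ["injected-parameters"]},
        "execution_count": None,
        "outputs": [],
        "source": source,
    }
    cells.insert(insert_at, injected)
    return nb


# Hypothetical three-cell notebook used only for illustration.
notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "# Analysis"},
        {"cell_type": "code", "metadata": {"tags": ["parameters"]},
         "execution_count": None, "outputs": [], "source": "threshold = 0.5"},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": "print(threshold)"},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

patched = inject_parameters(notebook, {"threshold": 0.8, "dataset": "large.csv"})
```

Because the injected cell runs after the defaults cell, its assignments override the defaults without editing the author's original values.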

Key operational details:

  • Kernel selection: Papermill respects kernelspec metadata. If the notebook specifies python3, it starts that kernel. For testing across Python versions, override with --kernel python3.11.
  • Timeout per cell: --execution-timeout 300 kills any cell that runs longer than five minutes. Essential for CI where a hung cell would block the pipeline indefinitely.
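  • Error behaviour: By default, papermill stops at the first error. The --report-mode flag does not change this; it only hides cell inputs in the output notebook. To execute every cell and collect all errors in one pass, which is useful for a full report of everything that is broken, run the notebook with jupyter nbconvert --to notebook --execute --allow-errors instead.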
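
Once a run has been allowed to continue past failures, the executed notebook stores each failure as an output of type error. The sketch below shows one way to turn that into a report. The function name collect_errors and the sample notebook are hypothetical; the output fields (output_type, ename, evalue) follow the nbformat v4 schema.

```python
def collect_errors(notebook):
    """List every error output in an executed notebook, in cell order."""
    errors = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        for output in cell.get("outputs", []):
            if output.get("output_type") == "error":
                errors.append({
                    "cell": index,  # 0-based position in the notebook
                    "ename": output.get("ename", ""),
                    "evalue": output.get("evalue", ""),
                })
    return errors


# Hypothetical executed notebook with two failing cells.
executed = {
    "cells": [
        {"cell_type": "code", "outputs": []},
        {"cell_type": "code", "outputs": [
            {"output_type": "error", "ename": "KeyError",
             "evalue": "'price'", "traceback": []},
        ]},
        {"cell_type": "markdown", "source": "Notes"},
        {"cell_type": "code", "outputs": [
            {"output_type": "stream", "name": "stdout", "text": "ok\n"},
            {"output_type": "error", "ename": "ValueError",
             "evalue": "empty frame", "traceback": []},
        ]},
    ]
}

report = collect_errors(executed)
```

In CI, the executed notebook would be loaded with json.load and the job failed whenever the report is non-empty.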

Parameterised test matrices

Papermill’s parameter injection enables testing a notebook across multiple configurations:

import papermill as pm
import pytest

configs = [
    {"dataset": "small.csv", "threshold": 0.5},
    {"dataset": "large.csv", "threshold": 0.8},
    {"dataset": "edge_cases.csv", "threshold": 0.0},
]

@pytest.mark.parametrize("params", configs)
def test_analysis(params, tmp_path):
    output = tmp_path / "output.ipynb"
    pm.execute_notebook(
        "analysis.ipynb",
        str(output),
        parameters=params,
        kernel_name="python3",
    )
    # If execution completes without error, the test passes

This pattern catches edge cases that a single hardcoded run would miss.
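
When the configurations are combinations of independent settings, the list can be generated rather than written by hand. The sketch below assumes that every dataset should be tested at every threshold, which the hand-written list above does not do; the values are reused from that list.

```python
from itertools import product

datasets = ["small.csv", "large.csv", "edge_cases.csv"]
thresholds = [0.0, 0.5, 0.8]

# One parameter dict per (dataset, threshold) combination.
configs = [
    {"dataset": dataset, "threshold": threshold}
    for dataset, threshold in product(datasets, thresholds)
]

# Readable test IDs so a failure names the combination that broke.
ids = [f"{config['dataset']}-{config['threshold']}" for config in configs]
```

The two lists plug into the existing test as pytest.mark.parametrize("params", configs, ids=ids). A full cross product grows quickly, so it is worth keeping only the combinations that exercise distinct behaviour.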

Output validation strategies

Snapshot testing with nbval

nbval compares cell outputs against stored values. Configuration is critical:

# sanitize.cfg
# Each rule needs its own uniquely named section.
[regex1]
# Replace timestamps such as 2024-01-05 09:30:00
regex: \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}
replace: <TIMESTAMP>

[regex2]
# Mask floats that have three or more decimal places
regex: \d+\.\d{3,}
replace: <FLOAT>

Without sanitisation, tests break on every run due to timestamps, memory addresses, and floating-point noise.
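
The effect of sanitisation can be shown with plain regular expressions. The sketch below is not nbval's implementation: it applies the same kind of substitutions to a stored and a fresh output before comparing them. The memory-address rule and the sample strings are assumptions added for illustration.

```python
import re

# (pattern, replacement) pairs, applied in order.
RULES = [
    (re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"), "<TIMESTAMP>"),
    (re.compile(r"0x[0-9a-fA-F]+"), "<ADDRESS>"),  # assumed extra rule
    (re.compile(r"\d+\.\d{3,}"), "<FLOAT>"),
]


def sanitize(text):
    """Replace volatile fragments so that only stable content is compared."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


stored = "Run at 2024-01-05 09:30:00, loss=0.123456, model=<Model at 0x7f3a2c1b>"
fresh = "Run at 2024-02-11 17:45:12, loss=0.123457, model=<Model at 0x7f9d4e00>"

outputs_match = sanitize(stored) == sanitize(fresh)
```

Masking floats entirely also hides genuine numeric regressions, so for numbers that matter an explicit assertion with a tolerance is a safer complement to the snapshot.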

When to use: Output regression is most valuable for notebooks that produce specific numbers stakeholders rely on — financial reports, model metrics, KPI dashboards. For exploratory notebooks, smoke testing is sufficient.

Assertion cells

Embed test assertions directly in the notebook:

# Cell tagged 'test'
assert accuracy > 0.85, f"Model accuracy dropped to {accuracy}"
assert len(predictions) == len(test_data), "Prediction count mismatch"

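Neither nbval nor papermill needs special support for these assertions: a failing assert raises AssertionError, which both tools treat as a cell error and report as a failure. The test tag is a convention that makes such cells easy to find, not something either tool requires. This approach is lightweight and keeps tests close to the code they validate.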

Function-level testing with testbook

Advanced testbook patterns

testbook can execute only specific cells, reducing test time:

from testbook import testbook

@testbook("pipeline.ipynb", execute=["imports", "helpers"])
def test_normalize(tb):
    normalize = tb.ref("normalize_scores")
    result = normalize([100, 200, 300])
    assert result == [0.0, 0.5, 1.0]

Cell tags (imports, helpers) control which cells run during setup. This avoids expensive data-loading cells when you only want to test a utility function.

Testing classes and stateful objects

@testbook("model.ipynb", execute=True)
def test_model_predict(tb):
    tb.inject("""
        test_input = [[1.0, 2.0, 3.0]]
        prediction = model.predict(test_input)
        assert prediction.shape == (1,), f"Unexpected shape: {prediction.shape}"
    """)

tb.inject() runs arbitrary code inside the notebook’s kernel. This is useful for testing objects that are expensive to reconstruct outside the notebook context.

Linting and static analysis

nbqa deep integration

nbqa works by extracting code cells to a temporary .py file, running the linter, and mapping diagnostics back to cell numbers:

# Type checking
nbqa mypy my_notebook.ipynb --ignore-missing-imports

# Security scanning
nbqa bandit my_notebook.ipynb -ll

# Import sorting
nbqa isort my_notebook.ipynb --diff
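
The mapping from linter diagnostics back to cells can be pictured with a short sketch. This is a simplified illustration, not nbQA's actual implementation (which also deals with magics and other details): it joins code cells into one script and records where each script line came from. The function name and the sample notebook are hypothetical, and cell sources are assumed to be single strings.

```python
def flatten_code_cells(notebook):
    """Join code cells into one script and record the origin of every line."""
    lines = []
    # 1-based script line -> (1-based code cell number, 1-based line within that cell)
    line_to_cell = {}
    cell_number = 0
    for cell in notebook["cells"]:
        if cell["cell_type"] != "code":
            continue
        cell_number += 1
        for offset, line in enumerate(cell["source"].splitlines(), start=1):
            lines.append(line)
            line_to_cell[len(lines)] = (cell_number, offset)
    return "\n".join(lines), line_to_cell


# Hypothetical notebook: one markdown cell and two code cells.
notebook = {
    "cells": [
        {"cell_type": "markdown", "source": "# Setup"},
        {"cell_type": "code", "source": "import os\nimport sys"},
        {"cell_type": "code", "source": "x = 1\nprint(x)"},
    ]
}

script, line_to_cell = flatten_code_cells(notebook)
```

A diagnostic reported on line 4 of the temporary script can then be shown to the user as the second line of the second code cell.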

Combine these in a pre-commit hook:

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/nbQA-dev/nbQA
    rev: 1.8.5
    hooks:
      - id: nbqa-ruff
      - id: nbqa-black
      - id: nbqa-mypy

Detecting execution order issues

A common notebook bug: cells work when run interactively (out of order) but fail when run top-to-bottom. The tool nbstripout combined with a CI smoke test catches this: strip outputs before commit, then run all cells from top to bottom in a fresh kernel in CI. If the notebook worked for the author but fails in CI, it most likely depends on execution order or on state left over in the author’s kernel.
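
A complementary static check is possible before outputs are stripped, because each executed code cell records an execution_count. The sketch below is an assumption about how such a check could look, not a feature of nbstripout: it accepts a notebook only when the counts read 1, 2, 3, ... from top to bottom, which is what a clean restart-and-run-all produces.

```python
def executed_in_order(notebook):
    """True when executed code cells carry counts 1, 2, 3, ... from top to bottom."""
    counts = [
        cell.get("execution_count")
        for cell in notebook["cells"]
        if cell["cell_type"] == "code" and cell.get("execution_count") is not None
    ]
    return counts == list(range(1, len(counts) + 1))


def make_notebook(counts):
    """Build a minimal notebook dict whose code cells have the given counts."""
    return {"cells": [{"cell_type": "code", "execution_count": c} for c in counts]}


clean = executed_in_order(make_notebook([1, 2, 3]))
shuffled = executed_in_order(make_notebook([3, 1, 2]))
rerun = executed_in_order(make_notebook([1, 5, 6]))
stripped = executed_in_order(make_notebook([None, None]))
```

The check is deliberately strict: it also flags notebooks where a cell was re-run or deleted mid-session. A stripped notebook has no counts and passes trivially, so the check has to run before nbstripout in the hook order.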

CI pipeline architecture

Multi-stage notebook CI

Stage 1: Lint (nbqa ruff + black --check)

Stage 2: Smoke test (papermill, no output validation)

Stage 3: Output regression (nbval on critical notebooks)

Stage 4: Function tests (testbook / pytest)

Each stage is a separate CI job. Lint and smoke tests run first because they are cheap and catch obvious failures quickly; output regression and function tests follow once those pass.

Handling large data dependencies

Notebooks that load gigabytes of data cannot run in CI as-is. Solutions:

  1. Fixture data: Create a small representative sample (tests/fixtures/sample_10k.csv) and parameterise the notebook to use it.
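  2. Mocking: Use testbook’s tb.patch, which applies unittest.mock.patch inside the notebook kernel, to stub expensive I/O calls. Patching in the test process alone has no effect on code running in the kernel.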
  3. Caching: Use CI caching (GitHub Actions actions/cache) for downloaded datasets.
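  4. Skip tags: Tag cells that require production data with ci-skip. Papermill does not skip tagged cells out of the box, so remove them in a preprocessing step (for example with nbconvert’s TagRemovePreprocessor) before execution.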
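
One way to realise the skip-tag option is to filter the notebook before handing it to the runner. The sketch below works on the notebook JSON as a dictionary; the helper name drop_tagged_cells and the sample notebook are hypothetical.

```python
import copy


def drop_tagged_cells(notebook, tag="ci-skip"):
    """Return a copy of `notebook` without the cells that carry `tag`."""
    nb = copy.deepcopy(notebook)
    nb["cells"] = [
        cell for cell in nb["cells"]
        if tag not in cell.get("metadata", {}).get("tags", [])
    ]
    return nb


# Hypothetical notebook with one cell that needs production data.
notebook = {
    "cells": [
        {"cell_type": "code", "metadata": {}, "source": "import pandas as pd"},
        {"cell_type": "code", "metadata": {"tags": ["ci-skip"]},
         "source": "df = load_production_table()"},
        {"cell_type": "code", "metadata": {"tags": ["parameters"]},
         "source": "dataset = 'sample.csv'"},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

ci_notebook = drop_tagged_cells(notebook)
```

The filtered dictionary can be written to a temporary .ipynb file with json.dump and executed in place of the original. Later cells must not depend on names defined only in the skipped cells, otherwise the CI run fails for a reason unrelated to the code under test.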

Kernel management in CI

Install the kernel explicitly in CI:

pip install ipykernel
python -m ipykernel install --user --name=project-kernel
papermill notebook.ipynb output.ipynb --kernel project-kernel

This prevents “kernel not found” errors that occur when the CI environment’s kernel name doesn’t match the notebook’s kernelspec.

Tradeoffs

Approach                    Confidence   Maintenance cost          Speed
Smoke test only             Low-medium   Very low                  Fast
Smoke + assertions          Medium       Low                       Fast
Output regression           High         Medium (sanitise rules)   Medium
Function-level tests        High         Medium-high               Fast
Full pipeline (all above)   Very high    High                      Slow

The right choice depends on the notebook’s role. An exploratory scratch notebook needs only smoke tests. A notebook that generates board-level financial reports deserves the full pipeline.

Emerging tools

  • nbmake: A pytest plugin similar to nbval but focused on execution-only testing (no output comparison). Its configuration is simpler, which makes it quicker to adopt.
  • Ploomber: Turns notebooks into pipeline DAGs with built-in testing hooks.
  • Jupyter Scheduler: A JupyterLab extension for running notebooks as one-off or scheduled jobs, which can double as a smoke-test runner when paired with alerting.

One thing to remember: The hardest part of notebook testing is not the tooling — it is the cultural shift from treating notebooks as throwaway scratch pads to treating them as versioned, tested artefacts. Start with one smoke test in CI and let the habit grow from there.
