Python Data Quality Checks — Core Concepts

Learn the categories of data quality checks and how Python frameworks enforce them across pipelines.

Data quality checks validate that data meets defined expectations before it reaches consumers. Without them, dashboards show wrong numbers, ML models train on garbage, and business decisions go sideways. Python offers multiple frameworks and patterns for embedding quality checks directly into data pipelines.

Categories of checks

Schema checks

Verify the structure of data before examining content:

Expected columns exist.
Column data types match (integer, string, date).
No unexpected extra columns appeared.

Schema drift—when a source quietly adds or removes columns—is one of the most common causes of pipeline failures. Catching it early is cheap; discovering it in a dashboard is expensive.
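The three schema checks above can be sketched in plain Python over a list of dict rows. This is a minimal, dependency-free illustration, not any framework's API: `EXPECTED_SCHEMA`, `check_schema`, and the sample rows are hypothetical names invented for the example.

```python
from datetime import date

# Hypothetical expected schema: column name -> Python type.
EXPECTED_SCHEMA = {"order_id": int, "email": str, "order_date": date}


def check_schema(rows, expected):
    """Return a list of human-readable schema problems (empty means OK)."""
    problems = []
    for i, row in enumerate(rows):
        missing = expected.keys() - row.keys()
        extra = row.keys() - expected.keys()
        if missing:
            problems.append(f"row {i}: missing columns {sorted(missing)}")
        if extra:
            problems.append(f"row {i}: unexpected columns {sorted(extra)}")
        for column, expected_type in expected.items():
            value = row.get(column)
            # None is left to the value checks; only wrong types are flagged here.
            if value is not None and not isinstance(value, expected_type):
                problems.append(
                    f"row {i}: {column} is {type(value).__name__}, "
                    f"expected {expected_type.__name__}"
                )
    return problems


good = [{"order_id": 1, "email": "a@example.com", "order_date": date(2024, 1, 5)}]
# Simulated drift: order_id arrives as text, order_date is gone, coupon is new.
drifted = [{"order_id": "1", "email": "a@example.com", "coupon": "X1"}]

schema_ok = check_schema(good, EXPECTED_SCHEMA)
schema_drift = check_schema(drifted, EXPECTED_SCHEMA)
```

With a DataFrame library the same idea usually reduces to comparing the frame's column names and dtypes against an expected mapping once, rather than row by row.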

Value checks

Validate individual cell values:

Check	Example
Not null	`email` must never be empty
Range	`age` between 0 and 150
Pattern	`phone` matches regex `^\+\d{10,15}$`
Allowed values	`status` in `{active, inactive, suspended}`
Uniqueness	`order_id` has no duplicates
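As a rough illustration, each row of the table maps to one condition. The sketch below uses only the standard library; `check_values` and the sample rows are hypothetical, and the column names are taken from the table.

```python
import re
from collections import Counter

PHONE_PATTERN = re.compile(r"^\+\d{10,15}$")
ALLOWED_STATUSES = {"active", "inactive", "suspended"}


def check_values(rows):
    """Return (row_index, column, problem) tuples for every failed value check."""
    failures = []
    id_counts = Counter(row["order_id"] for row in rows)
    for i, row in enumerate(rows):
        if not row["email"]:  # None or empty string
            failures.append((i, "email", "null or empty"))
        if not 0 <= row["age"] <= 150:
            failures.append((i, "age", "out of range"))
        if not PHONE_PATTERN.match(row["phone"]):
            failures.append((i, "phone", "bad format"))
        if row["status"] not in ALLOWED_STATUSES:
            failures.append((i, "status", "not an allowed value"))
        if id_counts[row["order_id"]] > 1:
            failures.append((i, "order_id", "duplicate"))
    return failures


rows = [
    {"order_id": 1, "email": "a@example.com", "age": 34,
     "phone": "+14155550100", "status": "active"},
    {"order_id": 2, "email": "", "age": 200,
     "phone": "555-0100", "status": "deleted"},
    {"order_id": 1, "email": "c@example.com", "age": 51,
     "phone": "+442071234567", "status": "inactive"},
]
value_failures = check_values(rows)
```

Returning every failure, rather than stopping at the first, makes it easier to see whether a problem is isolated or systematic.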

Aggregate checks

Validate properties of the dataset as a whole:

Row count within expected range.
Sum of the `amount` column within 10% of yesterday’s total.
Null rate below a threshold (e.g., fewer than 1% of `email` values are null).
No duplicate rows based on a composite key.
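The four aggregate checks can be expressed as one function that returns a pass/fail flag per check. This is a sketch under assumptions: `check_aggregates`, the composite key of `customer_id` plus `order_date`, and the row-count bounds are all hypothetical choices made for the example.

```python
def check_aggregates(rows, yesterday_total, min_rows, max_rows, max_null_rate=0.01):
    """Return {check_name: passed} for dataset-level checks."""
    total = sum(row["amount"] for row in rows if row["amount"] is not None)
    null_rate = sum(row["email"] is None for row in rows) / len(rows) if rows else 0.0
    # Composite key assumed for illustration: (customer_id, order_date).
    keys = [(row["customer_id"], row["order_date"]) for row in rows]
    return {
        "row_count_in_range": min_rows <= len(rows) <= max_rows,
        "amount_within_10pct": abs(total - yesterday_total) <= 0.10 * abs(yesterday_total),
        "email_null_rate_ok": null_rate < max_null_rate,
        "no_duplicate_keys": len(keys) == len(set(keys)),
    }


# 200 synthetic rows totalling 2000.0, with a single null email (0.5% null rate).
rows = [
    {
        "customer_id": i,
        "order_date": "2024-01-05",
        "amount": 10.0,
        "email": None if i == 0 else f"user{i}@example.com",
    }
    for i in range(200)
]

today_ok = check_aggregates(rows, yesterday_total=1900.0, min_rows=100, max_rows=500)
today_off = check_aggregates(rows, yesterday_total=3000.0, min_rows=100, max_rows=500)
```

Against a yesterday total of 1900.0 every check passes; against 3000.0 only the 10% comparison fails, which is the kind of signal a row-level check would never surface.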

Cross-dataset checks

Compare two datasets for consistency:

Row count in the Silver table matches Bronze minus known rejections.
Foreign keys in the orders table all exist in the customers table.
Revenue totals in the Gold aggregate match the sum of Silver detail rows.
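A minimal sketch of the three comparisons, assuming both datasets fit in memory as lists of dicts. The function names and the 0.01 tolerance are hypothetical; in a warehouse these would more likely be SQL anti-joins and aggregate queries.

```python
import math


def orphaned_foreign_keys(child_rows, parent_rows, key):
    """Return key values present in child_rows but absent from parent_rows."""
    parent_keys = {row[key] for row in parent_rows}
    return sorted({row[key] for row in child_rows} - parent_keys)


def counts_reconcile(bronze_count, silver_count, rejected_count):
    """Silver should hold every Bronze row except the known rejections."""
    return silver_count == bronze_count - rejected_count


def totals_match(detail_rows, aggregate_total, tolerance=0.01):
    """Compare summed detail amounts to an aggregate, allowing float rounding."""
    detail_total = sum(row["amount"] for row in detail_rows)
    return math.isclose(detail_total, aggregate_total, abs_tol=tolerance)


customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"customer_id": 1, "amount": 19.99},
    {"customer_id": 2, "amount": 5.01},
    {"customer_id": 7, "amount": 75.00},  # no matching customer
]

orphans = orphaned_foreign_keys(orders, customers, "customer_id")
```

The tolerance in `totals_match` is there because floating-point sums rarely match exactly; for money, storing amounts as integers or `Decimal` values would allow an exact comparison instead.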

Where checks run

Checks should run between pipeline stages, not only at the end:

After ingestion — is the raw data the right shape and size?
After transformation — did cleaning introduce anomalies?
Before publishing — does the final output meet consumer SLAs?

If a check fails after transformation, the pipeline stops before writing bad data to the Gold layer. This is called a quality gate.
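One simple way to build a quality gate is to raise an exception, so the write that follows is never reached. The sketch below is illustrative only: `QualityGateError`, `quality_gate`, and `write_gold` are hypothetical names, and a plain list stands in for the Gold table.

```python
class QualityGateError(Exception):
    """Raised when a check fails, so the pipeline stops before the next write."""


def quality_gate(stage, checks):
    """Raise if any named boolean check in `checks` is False."""
    failed = [name for name, passed in checks.items() if not passed]
    if failed:
        raise QualityGateError(f"{stage}: failed checks {failed}")


def write_gold(silver_rows, gold_table):
    quality_gate("after transformation", {
        "has_rows": len(silver_rows) > 0,
        "no_null_amounts": all(row["amount"] is not None for row in silver_rows),
    })
    # Only reached when every check above passed.
    gold_table.extend(silver_rows)


gold_good = []
write_gold([{"amount": 12.5}], gold_good)

gold_bad = []
try:
    write_gold([{"amount": None}], gold_bad)
    gate_message = ""
except QualityGateError as error:
    gate_message = str(error)
```

After the failed run `gold_bad` is still empty: the gate stopped the pipeline before anything was written.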

Python frameworks

Great Expectations

The most established Python data quality framework. You define “expectations” that describe what valid data looks like:

expect_column_to_exist("email")
expect_column_values_to_not_be_null("order_id")
expect_column_mean_to_be_between("amount", 10, 500)

Validation results are stored as JSON documents and can be rendered as HTML data quality reports.

Pandera

Schema-focused validation, especially for pandas and polars DataFrames. You define a schema class and validate DataFrames against it. Integrates well with type hints and static analysis.

Soda Core

SQL-based checks that work with warehouses and lakes. You write checks in YAML and Soda translates them to SQL queries. Python SDK available for programmatic use.

Custom assertions

For simple pipelines, plain Python assertions work:

# Assumes df is a polars DataFrame; these Series methods are polars, not pandas.
assert df["order_id"].is_unique().all(), "Duplicate order IDs found"
assert df["amount"].null_count() == 0, "Null amounts detected"
assert len(df) > 1000, f"Suspiciously low row count: {len(df)}"

Raw assertions have real downsides: no reporting, no history, and no HTML dashboards. Python also skips `assert` statements entirely when run with the `-O` flag. They are a starting point, not a production solution.
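A small step up from bare assertions is to record each check's outcome instead of stopping at the first failure. The sketch below is one possible shape, using only the standard library; `run_checks` and the report layout are hypothetical, not taken from any framework.

```python
import json
from datetime import datetime, timezone


def run_checks(checks):
    """Turn (name, passed, detail) tuples into a JSON-ready report dict."""
    results = [
        {"check": name, "passed": bool(passed), "detail": detail}
        for name, passed, detail in checks
    ]
    return {
        "run_at": datetime.now(timezone.utc).isoformat(),
        "passed": all(result["passed"] for result in results),
        "results": results,
    }


orders = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": 40.0},
]
ids = [row["order_id"] for row in orders]
duplicates = len(ids) - len(set(ids))
null_amounts = sum(row["amount"] is None for row in orders)

report = run_checks([
    ("order_id unique", duplicates == 0, f"{duplicates} duplicate(s)"),
    ("amount not null", null_amounts == 0, f"{null_amounts} null(s)"),
    ("row count > 0", len(orders) > 0, f"{len(orders)} row(s)"),
])
report_json = json.dumps(report, indent=2)
```

Appending `report_json` to a file or table on every run gives a basic history of check results, which bare assertions cannot provide.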

How it works in practice

A typical pipeline integrates checks as discrete steps:

Ingest raw data to Bronze.
Check: row count > 0, expected columns present.
Transform to Silver (clean, deduplicate, type-cast).
Check: null rates below thresholds, no duplicate keys, value ranges valid.
Aggregate to Gold.
Check: totals consistent with Silver, freshness within SLA.
Publish to dashboard or API.

Each check either passes (continue) or fails (halt and alert).
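The sequence above can be sketched end to end with in-memory lists standing in for the Bronze, Silver, and Gold layers. Everything here is an assumption made for illustration: `run_pipeline`, `PipelineHalted`, the `alert` callback, the column names, and the 0 to 10,000 amount range.

```python
import math


class PipelineHalted(Exception):
    """Raised when a check fails; later stages never run."""


def run_pipeline(raw_rows, alert=print):
    def require(step, passed):
        # Pass: continue. Fail: alert, then halt.
        if not passed:
            alert(f"quality check failed: {step}")
            raise PipelineHalted(step)

    # Ingest raw data to Bronze, then check shape and size.
    bronze = list(raw_rows)
    require("bronze: row count > 0", len(bronze) > 0)
    required = {"order_id", "customer_id", "amount"}
    require("bronze: expected columns", all(required.issubset(row) for row in bronze))

    # Transform to Silver: type-cast amounts and deduplicate on order_id.
    silver = list({
        row["order_id"]: {**row, "amount": float(row["amount"])} for row in bronze
    }.values())
    require("silver: amounts in range", all(0 <= row["amount"] <= 10_000 for row in silver))

    # Aggregate to Gold: revenue per customer.
    gold = {}
    for row in silver:
        gold[row["customer_id"]] = gold.get(row["customer_id"], 0.0) + row["amount"]
    silver_total = sum(row["amount"] for row in silver)
    require("gold: totals match silver", math.isclose(sum(gold.values()), silver_total))

    return gold  # Only now is the result safe to publish.


raw = [
    {"order_id": 1, "customer_id": "c1", "amount": "20.5"},
    {"order_id": 1, "customer_id": "c1", "amount": "20.5"},  # duplicate order
    {"order_id": 2, "customer_id": "c2", "amount": "4.5"},
]
published = run_pipeline(raw)

alerts = []
try:
    run_pipeline([], alert=alerts.append)
    halted = False
except PipelineHalted:
    halted = True
```

In an orchestrator each `require` call would more likely be its own task, so a failed check shows up as a failed step rather than an exception buried inside a larger job.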

Common misconception

“Data quality checks slow down the pipeline.” The overhead of scanning a dataset for null rates or duplicate keys is tiny compared to the cost of publishing wrong data. A quality check that takes 30 seconds can prevent hours of incident response and lost trust.

Alerting and remediation

Checks are only useful if failures reach the right people:

Immediate alerts: Slack/PagerDuty for critical failures (empty dataset, schema change).
Daily reports: email digest of warning-level issues (null rate crept up but is still below threshold).
Dead letter tables: rows that fail validation are written to a separate table for manual review instead of being silently dropped.
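The dead letter pattern can be sketched as a split that keeps rejected rows together with the reasons they failed. `split_valid_and_rejected`, the rule names, and the sample rows are hypothetical, and plain lists stand in for the two tables.

```python
def split_valid_and_rejected(rows, validators):
    """Route each row to the valid list or to the dead letter list with reasons."""
    valid, dead_letter = [], []
    for row in rows:
        failed_rules = [name for name, is_valid in validators.items() if not is_valid(row)]
        if failed_rules:
            # Keep the original row and record which rules it broke.
            dead_letter.append({**row, "rejection_reasons": failed_rules})
        else:
            valid.append(row)
    return valid, dead_letter


validators = {
    "email_present": lambda row: bool(row.get("email")),
    "amount_non_negative": lambda row: row.get("amount") is not None and row["amount"] >= 0,
}

incoming = [
    {"order_id": 1, "email": "a@example.com", "amount": 12.0},
    {"order_id": 2, "email": "", "amount": -3.0},
    {"order_id": 3, "email": "c@example.com", "amount": None},
]
valid_rows, dead_letter_rows = split_valid_and_rejected(incoming, validators)
```

Because every input row lands in exactly one of the two lists, the counts can be reconciled afterwards: valid plus rejected should equal the number of rows received.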

One thing to remember: the best time to catch bad data is before it reaches anyone who depends on it—embed quality checks between every stage of your pipeline, not just at the end.
