Python Data Quality Checks — Core Concepts
Data quality checks validate that data meets defined expectations before it reaches consumers. Without them, dashboards show wrong numbers, ML models train on garbage, and business decisions go sideways. Python offers multiple frameworks and patterns for embedding quality checks directly into data pipelines.
Categories of checks
Schema checks
Verify the structure of data before examining content:
- Expected columns exist.
- Column data types match (integer, string, date).
- No unexpected extra columns appeared.
Schema drift—when a source quietly adds or removes columns—is one of the most common causes of pipeline failures. Catching it early is cheap; discovering it in a dashboard is expensive.
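The three schema checks above can be sketched in plain Python. This is a minimal illustration, not a framework API: it assumes rows arrive as dictionaries, and the names `EXPECTED_SCHEMA` and `check_schema` are hypothetical.

```python
from datetime import date

# Hypothetical expected schema: column name -> Python type.
EXPECTED_SCHEMA = {"order_id": int, "email": str, "order_date": date}


def check_schema(rows, expected):
    """Return a list of human-readable schema problems (empty means OK)."""
    problems = []
    for i, row in enumerate(rows):
        missing = expected.keys() - row.keys()
        extra = row.keys() - expected.keys()
        if missing:
            problems.append(f"row {i}: missing columns {sorted(missing)}")
        if extra:
            problems.append(f"row {i}: unexpected columns {sorted(extra)}")
        for column, expected_type in expected.items():
            value = row.get(column)
            # None is left to the not-null value check; only wrong types are flagged.
            if value is not None and not isinstance(value, expected_type):
                problems.append(
                    f"row {i}: {column} is {type(value).__name__}, "
                    f"expected {expected_type.__name__}"
                )
    return problems


good = [{"order_id": 1, "email": "a@example.com", "order_date": date(2024, 1, 5)}]
# Simulated drift: order_id became a string, order_date vanished, discount appeared.
drifted = [{"order_id": "1", "email": "a@example.com", "discount": 0.1}]

print(check_schema(good, EXPECTED_SCHEMA))
for problem in check_schema(drifted, EXPECTED_SCHEMA):
    print(problem)
```

Returning a list of problems rather than raising on the first one means a single run reports every drifted column at once.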
Value checks
Validate individual cell values:
| Check | Example |
|---|---|
| Not null | email must never be empty |
| Range | age between 0 and 150 |
| Pattern | phone matches regex ^\+\d{10,15}$ |
| Allowed values | status in {active, inactive, suspended} |
| Uniqueness | order_id has no duplicates |
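As a rough sketch, each row of the table maps to a few lines of plain Python. The function name `check_values` and the sample rows are made up for illustration; the column names and rules come from the table.

```python
import re
from collections import Counter

PHONE_PATTERN = re.compile(r"^\+\d{10,15}$")
ALLOWED_STATUSES = {"active", "inactive", "suspended"}


def check_values(rows):
    """Return (row_index, column, message) tuples for every failed value check."""
    failures = []
    for i, row in enumerate(rows):
        if not row.get("email"):
            failures.append((i, "email", "must not be null or empty"))
        age = row.get("age")
        if age is None or not 0 <= age <= 150:
            failures.append((i, "age", "must be between 0 and 150"))
        if not PHONE_PATTERN.fullmatch(row.get("phone") or ""):
            failures.append((i, "phone", "must match the phone pattern"))
        if row.get("status") not in ALLOWED_STATUSES:
            failures.append((i, "status", "not an allowed value"))
    # Uniqueness depends on the whole column, so it runs after the per-row checks.
    counts = Counter(row.get("order_id") for row in rows)
    for i, row in enumerate(rows):
        if counts[row.get("order_id")] > 1:
            failures.append((i, "order_id", "duplicate value"))
    return failures


rows = [
    {"order_id": 1, "email": "a@example.com", "age": 34,
     "phone": "+14155550100", "status": "active"},
    {"order_id": 1, "email": "", "age": 200,
     "phone": "555-0100", "status": "deleted"},
]

failures = check_values(rows)
for failure in failures:
    print(failure)
```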
Aggregate checks
Validate properties of the dataset as a whole:
- Row count within expected range.
- Sum of the amount column within 10% of yesterday’s total.
- Null rate below a threshold (e.g., less than 1% of email is null).
- No duplicate rows based on a composite key.
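The aggregate checks listed above might look like this in plain Python. The thresholds, the function name `check_aggregates`, and the composite key `(customer_id, order_date)` are assumptions chosen for the example.

```python
def check_aggregates(rows, yesterday_total, min_rows=1, max_rows=1_000_000,
                     max_null_rate=0.01, tolerance=0.10):
    """Return a dict of check name -> bool for dataset-level properties."""
    n = len(rows)
    total = sum(row["amount"] for row in rows if row["amount"] is not None)
    null_emails = sum(1 for row in rows if row["email"] is None)
    # Assumed composite key for this example: (customer_id, order_date).
    keys = [(row["customer_id"], row["order_date"]) for row in rows]
    return {
        "row_count_in_range": min_rows <= n <= max_rows,
        "amount_within_tolerance":
            abs(total - yesterday_total) <= tolerance * abs(yesterday_total),
        "email_null_rate_ok": n > 0 and null_emails / n < max_null_rate,
        "composite_key_unique": len(set(keys)) == len(keys),
    }


# 200 synthetic rows with one null email (a 0.5% null rate).
rows = [
    {"customer_id": i, "order_date": "2024-01-05",
     "amount": 10.0, "email": f"user{i}@example.com"}
    for i in range(200)
]
rows[0]["email"] = None

print(check_aggregates(rows, yesterday_total=1900.0))
print(check_aggregates(rows, yesterday_total=1000.0))
```

Comparing against yesterday's total catches sudden swings, but it also flags legitimate spikes, so the tolerance usually needs tuning per dataset.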
Cross-dataset checks
Compare two datasets for consistency:
- Row count in the Silver table matches Bronze minus known rejections.
- Foreign keys in the orders table all exist in the customers table.
- Revenue totals in the Gold aggregate match the sum of Silver detail rows.
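A hedged sketch of the three comparisons above, using in-memory lists in place of real tables. The function name `check_cross_dataset` is hypothetical, and amounts are stored as integer cents (an assumption made here so totals can be compared exactly).

```python
def check_cross_dataset(bronze_count, rejected_count, silver_rows,
                        customers, gold_revenue_cents):
    """Compare Silver against Bronze, the customers table, and the Gold total."""
    customer_ids = {c["customer_id"] for c in customers}
    # Foreign keys in Silver that have no matching customer.
    orphans = sorted({r["customer_id"] for r in silver_rows} - customer_ids)
    silver_revenue_cents = sum(r["amount_cents"] for r in silver_rows)
    return {
        "row_count_reconciles": len(silver_rows) == bronze_count - rejected_count,
        "orphan_customer_ids": orphans,
        "revenue_reconciles": gold_revenue_cents == silver_revenue_cents,
    }


customers = [{"customer_id": 1}, {"customer_id": 2}]
silver = [
    {"order_id": 10, "customer_id": 1, "amount_cents": 2500},
    {"order_id": 11, "customer_id": 2, "amount_cents": 1000},
    {"order_id": 12, "customer_id": 3, "amount_cents": 500},  # no such customer
]

result = check_cross_dataset(
    bronze_count=4, rejected_count=1, silver_rows=silver,
    customers=customers, gold_revenue_cents=4000,
)
print(result)
```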
Where checks run
Checks should run between pipeline stages, not only at the end:
- After ingestion — is the raw data the right shape and size?
- After transformation — did cleaning introduce anomalies?
- Before publishing — does the final output meet consumer SLAs?
If a check fails after transformation, the pipeline stops before writing bad data to the Gold layer. This is called a quality gate.
Python frameworks
Great Expectations
One of the most established Python data quality frameworks. You define “expectations” that describe what valid data looks like:
expect_column_to_exist("email")
expect_column_values_to_not_be_null("order_id")
expect_column_mean_to_be_between("amount", 10, 500)
Validation results are stored as JSON documents and can be rendered as HTML data quality reports.
Pandera
Schema-focused validation, especially for pandas and polars DataFrames. You define a schema class and validate DataFrames against it. Integrates well with type hints and static analysis.
Soda Core
SQL-based checks that work with warehouses and lakes. You write checks in YAML and Soda translates them to SQL queries. Python SDK available for programmatic use.
Custom assertions
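For simple pipelines, plain Python assertions work. The calls below use the polars API (in pandas, is_unique is a property and null_count does not exist):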
assert df["order_id"].is_unique().all(), "Duplicate order IDs found"
assert df["amount"].null_count() == 0, "Null amounts detected"
assert len(df) > 1000, f"Suspiciously low row count: {len(df)}"
Raw assertions provide no reporting, no history, and no HTML dashboards, and Python skips them entirely when run with the -O flag. They are a starting point, not a production solution.
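One small step beyond bare assertions is to record each outcome as structured data instead of raising. The sketch below is an illustration under assumptions: `CheckResult` and `run_check` are hypothetical names, and the sample lists stand in for DataFrame columns.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class CheckResult:
    name: str
    passed: bool
    observed: object
    checked_at: str


def run_check(name, observed, predicate):
    """Evaluate predicate(observed) and record the outcome instead of raising."""
    return CheckResult(
        name=name,
        passed=bool(predicate(observed)),
        observed=observed,
        checked_at=datetime.now(timezone.utc).isoformat(),
    )


order_ids = [1, 2, 2, 3]
amounts = [10.0, None, 7.5, 3.0]

results = [
    run_check("order_id_unique", len(order_ids) - len(set(order_ids)),
              lambda duplicates: duplicates == 0),
    run_check("amount_not_null", sum(a is None for a in amounts),
              lambda nulls: nulls == 0),
    run_check("row_count_min", len(order_ids), lambda n: n > 1000),
]

# One JSON line per check; appending these to a file or table builds a history.
report = "\n".join(json.dumps(asdict(r)) for r in results)
print(report)
```

Because every result carries the observed value and a timestamp, the same records can feed both an alert and a trend chart later.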
How it works in practice
A typical pipeline integrates checks as discrete steps:
- Ingest raw data to Bronze.
- Check: row count > 0, expected columns present.
- Transform to Silver (clean, deduplicate, type-cast).
- Check: null rates below thresholds, no duplicate keys, value ranges valid.
- Aggregate to Gold.
- Check: totals consistent with Silver, freshness within SLA.
- Publish to dashboard or API.
Each check either passes (continue) or fails (halt and alert).
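The flow above can be sketched as a single function with a gate after each stage. Everything here is illustrative: `QualityGateError`, `gate`, and `run_pipeline` are hypothetical names, in-memory lists stand in for the Bronze, Silver, and Gold tables, and the freshness check and publish step are left out.

```python
class QualityGateError(Exception):
    """Raised when a check fails, so later stages never run."""


def gate(stage, checks):
    """Halt the pipeline if any named check in `checks` is False."""
    failed = [name for name, ok in checks.items() if not ok]
    if failed:
        raise QualityGateError(f"{stage}: failed {failed}")


def run_pipeline(raw_rows):
    # Ingest to Bronze, then check shape and size.
    bronze = list(raw_rows)
    gate("bronze", {
        "row_count_positive": len(bronze) > 0,
        "expected_columns": all(
            row.keys() >= {"order_id", "amount"} for row in bronze
        ),
    })

    # Transform to Silver: drop null amounts, deduplicate on order_id, cast to float.
    deduped = {}
    for row in bronze:
        if row["amount"] is not None:
            deduped.setdefault(row["order_id"], float(row["amount"]))
    silver = [{"order_id": k, "amount": v} for k, v in deduped.items()]
    gate("silver", {
        "no_duplicate_keys": len({r["order_id"] for r in silver}) == len(silver),
        "amounts_non_negative": all(r["amount"] >= 0 for r in silver),
    })

    # Aggregate to Gold, then check consistency with Silver.
    gold = {"order_count": len(silver),
            "revenue": sum(r["amount"] for r in silver)}
    gate("gold", {
        "order_count_matches_silver": gold["order_count"] == len(silver),
        "revenue_non_negative": gold["revenue"] >= 0,
    })
    return gold


print(run_pipeline([
    {"order_id": 1, "amount": "10.5"},
    {"order_id": 1, "amount": "10.5"},  # duplicate, removed in Silver
    {"order_id": 2, "amount": 4},
]))

try:
    run_pipeline([{"order_id": 3, "amount": -5}])
except QualityGateError as exc:
    print(f"halted: {exc}")  # an alert would be sent here
```

Raising an exception is one way to implement "halt and alert": an orchestrator typically marks the task as failed and skips everything downstream.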
Common misconception
“Data quality checks slow down the pipeline.” The overhead of scanning a dataset for null rates or duplicate keys is tiny compared to the cost of publishing wrong data. A quality check that takes 30 seconds can prevent hours of incident response and lost trust.
Alerting and remediation
Checks are only useful if failures reach the right people:
- Immediate alerts: Slack/PagerDuty for critical failures (empty dataset, schema change).
- Daily reports: email digest of warning-level issues (null rate crept up but is still below threshold).
- Dead letter tables: rows that fail validation are written to a separate table for manual review instead of being silently dropped.
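A minimal sketch of dead-letter routing and severity-based alert routing, assuming rows are dictionaries. The function names, the `_rejection_reasons` field, and the channel labels are placeholders, not part of any real alerting API.

```python
def split_valid_and_dead_letter(rows, validators):
    """Send rows that fail any validator to a dead-letter list, with reasons."""
    valid, dead_letter = [], []
    for row in rows:
        reasons = [name for name, is_valid in validators.items()
                   if not is_valid(row)]
        if reasons:
            # Keep the original row and record why it was rejected.
            dead_letter.append({**row, "_rejection_reasons": reasons})
        else:
            valid.append(row)
    return valid, dead_letter


def route_alert(severity):
    """Map a failure severity to a delivery channel (labels are placeholders)."""
    channels = {"critical": "pager", "warning": "daily_digest"}
    return channels.get(severity, "daily_digest")


validators = {
    "email_present": lambda r: bool(r.get("email")),
    "amount_non_negative":
        lambda r: r.get("amount") is not None and r["amount"] >= 0,
}

rows = [
    {"order_id": 1, "email": "a@example.com", "amount": 20.0},
    {"order_id": 2, "email": None, "amount": -3.0},
    {"order_id": 3, "email": "c@example.com", "amount": 5.0},
]

valid, dead_letter = split_valid_and_dead_letter(rows, validators)
print(len(valid), "valid rows")
print(dead_letter)
print(route_alert("critical"), route_alert("warning"))
```

Storing the rejection reasons alongside each row makes manual review faster, because the reviewer does not have to rerun the checks to see what failed.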
One thing to remember: the best time to catch bad data is before it reaches anyone who depends on it—embed quality checks between every stage of your pipeline, not just at the end.