Python Incremental Processing — Core Concepts
Incremental processing means handling only the data that has changed since the last pipeline run, rather than reprocessing the entire dataset. For pipelines operating on millions or billions of rows, this is the difference between a job that takes minutes and one that takes hours.
Why incremental matters
Cost
Reprocessing everything every run wastes compute. A daily pipeline that full-scans a 500 GB table when only 2 GB changed is burning money on the other 498 GB.
Speed
Incremental runs are faster because they touch less data. This means fresher results and shorter feedback loops.
Scalability
As data grows, full reprocessing eventually hits a wall. Incremental processing scales with the volume of changes, not the total data size.
Change detection strategies
Watermark (high-water mark)
Track the maximum value of a monotonically increasing column (timestamp or auto-incrementing ID) from the previous run. Next run, query everything above that value.
How it works:
- Read the watermark from the previous run: last_processed_id = 50000.
- Query the source: SELECT * FROM orders WHERE id > 50000.
- Process the results.
- Update the watermark to the new maximum: last_processed_id = 50347.
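The steps above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a production reader: the orders table layout, the amount column, and the fetch_new_rows helper are assumptions made for the example.

```python
import sqlite3


def fetch_new_rows(conn, last_processed_id):
    """Return rows with id above the watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, amount FROM orders WHERE id > ? ORDER BY id",
        (last_processed_id,),
    ).fetchall()
    # If nothing is new, keep the old watermark unchanged.
    new_watermark = rows[-1][0] if rows else last_processed_id
    return rows, new_watermark


# Demo with an in-memory database (table and column names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders (id, amount) VALUES (?, ?)",
    [(49999, 10.0), (50000, 12.5), (50001, 7.0), (50347, 3.25)],
)

rows, watermark = fetch_new_rows(conn, last_processed_id=50000)
print(rows)       # [(50001, 7.0), (50347, 3.25)]
print(watermark)  # 50347
```

Calling fetch_new_rows again with the returned watermark yields an empty result and leaves the watermark where it was.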
Strengths: Simple, works with any database.
Weaknesses: Misses records that arrive with IDs lower than the watermark (late-arriving data). Also misses updates to existing records (only catches inserts).
Modified timestamp
Similar to watermark but uses an updated_at column. Catches both inserts and updates.
Strengths: Catches modifications, not just new records.
Weaknesses: Requires the source to maintain updated_at reliably. Clock skew between servers can cause records to be missed or double-processed.
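One way to soften the clock-skew problem is to re-read a small overlap window behind the watermark. The sketch below shows that idea on plain dictionaries. The overlap parameter and both helper names are assumptions for illustration, not something the approach requires. Re-reading an overlap means some records are processed twice, so it only makes sense when the writes are idempotent.

```python
from datetime import datetime, timedelta, timezone


def changed_since(records, last_seen, overlap=timedelta(0)):
    """Return records whose updated_at is later than last_seen minus overlap."""
    cutoff = last_seen - overlap
    return [r for r in records if r["updated_at"] > cutoff]


def next_watermark(batch, last_seen):
    """Advance the watermark to the newest updated_at seen, never backwards."""
    return max([r["updated_at"] for r in batch] + [last_seen])


t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # watermark from the last run
records = [
    {"id": 1, "updated_at": t0 - timedelta(hours=1)},     # old, already processed
    {"id": 2, "updated_at": t0 - timedelta(seconds=30)},  # stamped by a lagging clock
    {"id": 3, "updated_at": t0 + timedelta(minutes=5)},   # clearly new
]

strict = changed_since(records, t0)
lenient = changed_since(records, t0, overlap=timedelta(minutes=2))
print([r["id"] for r in strict])   # [3]
print([r["id"] for r in lenient])  # [2, 3]
```

The strict filter skips record 2, while a two-minute overlap picks it up at the cost of possibly reprocessing rows that were already handled.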
Change Data Capture (CDC)
The database itself tracks changes and exposes them as a stream of events: inserts, updates, and deletes. Tools like Debezium read the database transaction log and publish change events.
Strengths: Captures all changes including deletes. No polling overhead.
Weaknesses: Requires infrastructure (Debezium, Kafka). More complex to set up and operate.
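To make the event stream concrete, here is a small sketch of applying change events to a target keyed by primary key. The event shape (op, before, after) is loosely modeled on Debezium's change-event envelope but heavily simplified. Reading from Kafka is left out, and the apply_change helper and sample rows are hypothetical.

```python
def apply_change(target, event):
    """Apply one change event to an in-memory table keyed by primary key."""
    op = event["op"]
    if op in ("c", "u"):  # create or update: store the new row image
        row = event["after"]
        target[row["id"]] = row
    elif op == "d":       # delete: remove the row named by the old image
        target.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unknown op: {op!r}")


events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"},
     "after": {"id": 1, "status": "paid"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]

table = {}
for event in events:
    apply_change(table, event)
print(table)  # {1: {'id': 1, 'status': 'paid'}}
```

Because each event overwrites or removes a row by key, replaying the same sequence in order leaves the table in the same final state.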
File-based detection
For file-based sources (S3, GCS), track which files have been processed. New files trigger processing.
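A minimal sketch of this bookkeeping follows. The listing would normally come from the object store's list operation. Here it is a plain Python list, and the file names and the find_unprocessed helper are made up for the example.

```python
def find_unprocessed(available, processed):
    """Return files present in the source that have not been processed yet."""
    return sorted(set(available) - set(processed))


processed = {"orders/2024-01-01.csv", "orders/2024-01-02.csv"}
listing = [
    "orders/2024-01-01.csv",
    "orders/2024-01-02.csv",
    "orders/2024-01-03.csv",
]

for key in find_unprocessed(listing, processed):
    # ... process the file here ...
    processed.add(key)  # record the file only after processing succeeds

print(sorted(processed))
```

Tracking names alone would not notice a file that is overwritten in place. Comparing a checksum or modification time as well is one way to catch that.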
| Method | Detects inserts | Detects updates | Detects deletes | Complexity |
|---|---|---|---|---|
| Watermark (ID) | ✅ | ❌ | ❌ | Low |
| Modified timestamp | ✅ | ✅ | ❌ | Low |
| CDC | ✅ | ✅ | ✅ | High |
| File listing | ✅ | Partial | ❌ | Low |
Idempotent writes
Incremental processing must be safe to rerun. If the pipeline crashes and restarts, it should produce the same result, not duplicate records. Common patterns:
- Upsert (MERGE): Insert new records, update existing ones based on a key.
- Partition overwrite: Replace the entire output partition for the affected time window.
- Delete-then-insert: Delete the output for the incremental range, then insert fresh.
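As an illustration of the delete-then-insert pattern, the sketch below replaces one day's output inside a single sqlite3 transaction. The daily_sales table, its columns, and the write_day helper are assumptions for the example. A warehouse would use its own transaction or partition-overwrite mechanism.

```python
import sqlite3


def write_day(conn, day, rows):
    """Replace all output rows for one day inside a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM daily_sales WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_sales (day, product, total) VALUES (?, ?, ?)",
            [(day, product, total) for product, total in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT, product TEXT, total REAL)")

batch = [("widget", 120.0), ("gadget", 75.5)]
write_day(conn, "2024-01-03", batch)
write_day(conn, "2024-01-03", batch)  # simulated rerun after a crash

count = conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
print(count)  # 2
```

Running the same batch twice leaves two rows rather than four, which is the property that makes a rerun safe.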
Watermark storage
The watermark needs to persist between runs. Options:
- File on disk or object storage: a JSON file with the last watermark value.
- Database table: a pipeline_state table with one row per pipeline.
- Orchestrator metadata: Airflow XComs, Prefect task results.
The watermark must be updated only after successful processing, not before. Otherwise, a crash after updating the watermark but before finishing the write means records are silently skipped.
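The ordering rule can be sketched with a JSON state file. This is a simplified example under stated assumptions: the file layout, the last_processed_id key, and the three helper functions are hypothetical, and a list stands in for the real output sink.

```python
import json
import os
import tempfile


def load_watermark(path, default=0):
    """Read the stored watermark, or return default on the first run."""
    try:
        with open(path) as f:
            return json.load(f)["last_processed_id"]
    except FileNotFoundError:
        return default


def save_watermark(path, value):
    """Write to a temp file, then rename, so readers never see a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"last_processed_id": value}, f)
    os.replace(tmp, path)


def run_once(path, source_ids, sink):
    """One incremental run: write output first, then advance the watermark."""
    last = load_watermark(path)
    new_ids = sorted(i for i in source_ids if i > last)
    if not new_ids:
        return last
    sink.extend(new_ids)               # the write; an exception here skips the save
    save_watermark(path, new_ids[-1])  # reached only after the write succeeded
    return new_ids[-1]


state_path = os.path.join(tempfile.mkdtemp(), "watermark.json")
sink = []
run_once(state_path, [1, 2, 3], sink)
run_once(state_path, [1, 2, 3, 4, 5], sink)
print(sink)  # [1, 2, 3, 4, 5]
```

If a crash happens after the write but before save_watermark, the next run re-reads the same records instead of skipping them. The list append used here would then duplicate them, which is why a real sink should be idempotent.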
Common misconception
“Incremental processing is always better than full reprocessing.” Not true. Incremental processing is faster but more complex. You need to handle late-arriving data, schema changes, and crash recovery. For small datasets (under a few GB), full reprocessing is simpler and fast enough. Switch to incremental when full reprocessing becomes too slow or too expensive.
Python ecosystem
- Airflow: The logical date (formerly the execution date) and data intervals provide natural watermarks.
- Dagster: Software-defined assets with partitions let a run materialize only the affected partitions.
- Delta Lake: MERGE for idempotent upserts.
- polars / pandas: Efficient filtering on timestamp columns for watermark-based reads.
- Debezium + Kafka: Python consumers read CDC events for real-time incremental processing.
One thing to remember: incremental processing trades simplicity for efficiency. Adopt it only when full reprocessing is too slow or too expensive, and always make your writes idempotent so reruns are safe.