Python Incremental Processing — Core Concepts

Learn watermark tracking, change detection, and idempotent upserts for efficient Python data pipelines.

Incremental processing means handling only the data that has changed since the last pipeline run, rather than reprocessing the entire dataset. For pipelines operating on millions or billions of rows, this is the difference between a job that takes minutes and one that takes hours.

Why incremental matters

Cost

Reprocessing everything every run wastes compute. A daily pipeline that full-scans a 500 GB table when only 2 GB changed is burning money on the other 498 GB.

Speed

Incremental runs are faster because they touch less data. This means fresher results and shorter feedback loops.

Scalability

As data grows, full reprocessing eventually hits a wall. Incremental processing scales with the volume of changes, not the total data size.

Change detection strategies

Watermark (high-water mark)

Track the maximum value of a monotonically increasing column (timestamp or auto-incrementing ID) from the previous run. Next run, query everything above that value.

How it works:

1. Read the watermark from the previous run: last_processed_id = 50000.
2. Query the source: SELECT * FROM orders WHERE id > 50000.
3. Process the results.
4. Update the watermark to the new maximum: last_processed_id = 50347.
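The four steps above can be sketched with the standard-library sqlite3 module. This is a minimal illustration, not a reference implementation: the table names (orders, processed_orders, pipeline_state), the pipeline name orders_sync, and the "processing" step (a plain copy) are all assumptions made for the example.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with hypothetical source, output, and state tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
        CREATE TABLE processed_orders (id INTEGER PRIMARY KEY, amount REAL);
        CREATE TABLE pipeline_state (pipeline TEXT PRIMARY KEY, last_processed_id INTEGER);
        """
    )
    return conn


def read_watermark(conn, pipeline):
    """Step 1: read the stored watermark, defaulting to 0 on the first run."""
    row = conn.execute(
        "SELECT last_processed_id FROM pipeline_state WHERE pipeline = ?",
        (pipeline,),
    ).fetchone()
    return row[0] if row else 0


def run_incremental(conn, pipeline="orders_sync"):
    """Process rows above the watermark and return the ids that were handled."""
    last_id = read_watermark(conn, pipeline)
    # Step 2: query only rows above the watermark.
    rows = conn.execute(
        "SELECT id, amount FROM orders WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    if not rows:
        return []
    # Steps 3 and 4 share one transaction, so the output and the new
    # watermark are committed together or not at all.
    with conn:
        conn.executemany(
            "INSERT INTO processed_orders (id, amount) VALUES (?, ?)", rows
        )
        conn.execute(
            "INSERT OR REPLACE INTO pipeline_state (pipeline, last_processed_id) "
            "VALUES (?, ?)",
            (pipeline, rows[-1][0]),  # rows are ordered, so the last id is the maximum
        )
    return [row[0] for row in rows]
```

Calling run_incremental twice in a row without new source rows returns an empty list the second time, because the watermark already covers everything in orders.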

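Strengths: Simple, and works with any source that exposes a sortable, ever-increasing column. Weaknesses: Misses late-arriving data, that is, records that become visible with a value below the watermark (for example, a row whose transaction commits after a higher ID has already been read). With an ID column it also misses updates to existing records, because only inserts produce new IDs.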

Modified timestamp

Works like the ID watermark, but tracks the maximum of an updated_at column instead. Because the source bumps updated_at on every write, this catches both inserts and updates.

Strengths: Catches modifications, not just new records. Weaknesses: Requires the source to maintain updated_at reliably. Clock skew between servers can cause records to be missed or double-processed.
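One common way to tolerate small amounts of clock skew is to re-read a short overlap window behind the watermark. The sketch below assumes an orders table with an updated_at column stored as "YYYY-MM-DD HH:MM:SS" text, and a five-minute lookback; both are illustrative choices, not recommendations.

```python
import sqlite3
from datetime import datetime, timedelta

# Assumed tolerance for clock skew; tune to the source system.
LOOKBACK = timedelta(minutes=5)


def fetch_changed(conn, watermark):
    """Return rows modified after (watermark - LOOKBACK), oldest first."""
    cutoff = (watermark - LOOKBACK).isoformat(sep=" ", timespec="seconds")
    # Text timestamps in this fixed format sort the same way as the times they encode.
    return conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (cutoff,),
    ).fetchall()


def next_watermark(rows, current):
    """Advance to the newest updated_at seen; never move the watermark backwards."""
    if not rows:
        return current
    newest = max(datetime.fromisoformat(row[2]) for row in rows)
    return max(newest, current)
```

The overlap means some rows are read more than once, so this approach only works when the downstream write is idempotent.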

Change Data Capture (CDC)

The database itself tracks changes and exposes them as a stream of events: inserts, updates, and deletes. Tools like Debezium read the database transaction log and publish change events.

Strengths: Captures all changes including deletes. No polling overhead. Weaknesses: Requires infrastructure (Debezium, Kafka). More complex to set up and operate.
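To make the event stream concrete, here is a sketch of applying change events to a target table held in a Python dict. The event shape is a simplified assumption loosely modeled on Debezium's envelope (an op code plus before and after row images); real events carry more fields, and a real consumer would read them from Kafka rather than from a list.

```python
def apply_change_event(target, event, key="id"):
    """Apply one simplified change event to an in-memory table keyed by `key`."""
    op = event["op"]
    if op in ("c", "r", "u"):  # create, snapshot read, update
        row = event["after"]
        target[row[key]] = row  # overwrite by key, so a repeated event is harmless
    elif op == "d":  # delete
        target.pop(event["before"][key], None)  # ignore keys that are already gone
    else:
        raise ValueError(f"unsupported op: {op!r}")


# Hypothetical events for illustration only.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"}, "after": {"id": 1, "status": "paid"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]

table = {}
for event in events:
    apply_change_event(table, event)
print(table)  # {1: {'id': 1, 'status': 'paid'}}
```

Because each event is applied by key, replaying the same sequence from the start leaves the table in the same final state.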

File-based detection

For file-based sources (S3, GCS), track which files have been processed. New files trigger processing.
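A minimal sketch of this idea keeps a manifest of processed file names and compares it against the current listing. A local directory stands in for the bucket here; the manifest format (a JSON list of names) and the function names are assumptions for the example.

```python
import json
from pathlib import Path


def load_manifest(manifest_path):
    """Return the set of file names recorded as processed (empty on first run)."""
    if manifest_path.exists():
        return set(json.loads(manifest_path.read_text()))
    return set()


def find_new_files(source_dir, manifest_path, pattern="*.csv"):
    """List files in source_dir that are not yet in the manifest, sorted by path."""
    processed = load_manifest(manifest_path)
    return sorted(p for p in source_dir.glob(pattern) if p.name not in processed)


def mark_processed(manifest_path, files):
    """Record files as processed; call this only after they were handled successfully."""
    processed = load_manifest(manifest_path) | {p.name for p in files}
    manifest_path.write_text(json.dumps(sorted(processed)))
```

Tracking names alone will not notice a file that is overwritten in place. Recording a size, modification time, or checksum next to each name is one way to catch that case.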

Method	Detects inserts	Detects updates	Detects deletes	Complexity
Watermark (ID)	✅	❌	❌	Low
Modified timestamp	✅	✅	❌	Low
CDC	✅	✅	✅	High
File listing	✅	Partial	❌	Low

Idempotent writes

Incremental processing must be safe to rerun. If the pipeline crashes and restarts, it should produce the same result, not duplicate records. Common patterns:

Upsert (MERGE): Insert new records, update existing ones based on a key.
Partition overwrite: Replace the entire output partition for the affected time window.
Delete-then-insert: Delete the output for the incremental range, then insert fresh.
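Two of these patterns can be sketched with sqlite3. The table names (orders_out, daily_totals) and columns are hypothetical, and the upsert uses SQLite's ON CONFLICT clause, which needs SQLite 3.24 or newer; other databases spell the same idea as MERGE or a similar statement.

```python
import sqlite3

# Hypothetical output tables for the example.
SCHEMA = """
CREATE TABLE orders_out (id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE daily_totals (day TEXT, customer TEXT, total REAL);
"""


def upsert_orders(conn, rows):
    """Upsert: insert new ids, overwrite the status of ids that already exist."""
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT INTO orders_out (id, status) VALUES (?, ?) "
            "ON CONFLICT(id) DO UPDATE SET status = excluded.status",
            rows,
        )


def replace_day(conn, day, rows):
    """Delete-then-insert: rebuild one day's output inside a single transaction."""
    with conn:
        conn.execute("DELETE FROM daily_totals WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_totals (day, customer, total) VALUES (?, ?, ?)",
            rows,
        )
```

Running either function twice with the same input leaves the table in the same state as running it once, which is the property a restarted pipeline relies on. Wrapping the delete and the insert in one transaction matters: a crash between them would otherwise leave the day empty.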

Watermark storage

The watermark needs to persist between runs. Options:

File on disk or object storage — a JSON file with the last watermark value.
Database table — a pipeline_state table with one row per pipeline.
Orchestrator metadata — Airflow XComs, Prefect task results.

The watermark must be updated only after successful processing, not before. Otherwise, a crash after updating the watermark but before finishing the write means records are silently skipped.
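The ordering rule can be sketched for the JSON-file option as follows. The file layout, the last_processed_id key, and the fetch and write callables are assumptions for illustration; the point is only that the watermark is saved after the write returns, and that the file is replaced atomically so a crash cannot leave it half-written.

```python
import json
import os
import tempfile
from pathlib import Path


def load_watermark(path, default=0):
    """Read the stored watermark, or return `default` if no state file exists yet."""
    try:
        return json.loads(Path(path).read_text())["last_processed_id"]
    except FileNotFoundError:
        return default


def save_watermark(path, value):
    """Write to a temp file in the same directory, then atomically swap it in."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump({"last_processed_id": value}, handle)
    os.replace(tmp_name, path)


def run_once(path, fetch, write):
    """Fetch rows above the watermark, write them, and only then advance it."""
    last_id = load_watermark(path)
    rows = fetch(last_id)
    if rows:
        write(rows)  # if this raises, the watermark below is never saved
        save_watermark(path, max(row["id"] for row in rows))
    return len(rows)
```

If write fails, the next run starts from the old watermark and fetches the same rows again, which is safe as long as the write itself is idempotent.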

Common misconception

“Incremental processing is always better than full reprocessing.” Not true. Incremental processing is faster but more complex. You need to handle late-arriving data, schema changes, and crash recovery. For small datasets (under a few GB), full reprocessing is simpler and fast enough. Switch to incremental when full reprocessing becomes too slow or too expensive.

Python ecosystem

Airflow: The logical date (formerly the execution date) and the data interval of each run provide natural watermarks.
Dagster: Software-defined assets with partitions let a run materialize only the affected partitions instead of the whole asset.
Delta Lake: MERGE for idempotent upserts.
polars / pandas: Efficient filtering on timestamp columns for watermark-based reads.
Debezium + Kafka: Python consumers read CDC events for real-time incremental processing.

One thing to remember: incremental processing trades simplicity for efficiency—only adopt it when full reprocessing is too slow, and always ensure your writes are idempotent so reruns are safe.
