Python Change Data Capture — Core Concepts

What Change Data Capture Is

Change Data Capture (CDC) detects and delivers database changes (inserts, updates, deletes) as a stream of events. Instead of periodically querying a database for modifications, CDC monitors the database’s internal mechanisms to capture changes at the source.

This enables real-time data synchronization, event-driven architectures, and audit trails without modifying the application that writes to the database.

CDC Approaches

Log-Based CDC

Databases maintain a write-ahead log (WAL in PostgreSQL, binlog in MySQL) for crash recovery and replication. Log-based CDC reads this log to capture every committed change, in commit order. This is generally the most reliable and least intrusive method: it reads the log instead of querying tables, so it adds very little load to the database's query engine.

Debezium is the most widely used log-based CDC tool. It runs as a Kafka Connect connector, reading database logs and publishing change events to Kafka topics. Python applications then consume these Kafka topics.
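
As a rough illustration of what "runs as a Kafka Connect connector" means in practice, the sketch below builds the JSON document that is typically sent to Kafka Connect's REST API (POST /connectors) to register a Debezium PostgreSQL connector. The property names follow Debezium 2.x as far as I know them; the hostname, credentials, database name, and the build_connector_config helper are placeholders invented for this example, so check them against the Debezium documentation for your version.

```python
import json


def build_connector_config(name: str, topic_prefix: str, tables: list[str]) -> dict:
    """Return a Kafka Connect registration payload (hypothetical helper)."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",          # PostgreSQL's built-in logical decoding plugin
            "database.hostname": "postgres",    # placeholder host
            "database.port": "5432",
            "database.user": "cdc_user",        # placeholder credentials
            "database.password": "change-me",
            "database.dbname": "shop",          # placeholder database name
            "topic.prefix": topic_prefix,       # topics become <prefix>.<schema>.<table>
            "table.include.list": ",".join(tables),
        },
    }


payload = build_connector_config("orders-connector", "dbserver", ["public.orders"])
print(json.dumps(payload, indent=2))
```

With the prefix "dbserver" and the table "public.orders", change events would land on a topic named dbserver.public.orders.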

Query-Based CDC

A simpler but less reliable approach: poll the database for rows whose modified_at timestamp is newer than the last check. This misses hard deletes (a deleted row is no longer there to be queried, unless the application soft-deletes), collapses multiple changes to the same row between polls into the latest state only, and adds query load to the database.
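
The limitations above are easy to reproduce. The following is a minimal sketch that uses an in-memory SQLite database as a stand-in for the source database; the orders table, its columns, and the poll_changes helper are assumptions made for this example, and integer timestamps are used to keep it deterministic.

```python
import sqlite3


def poll_changes(conn, last_seen):
    """Return rows modified after last_seen, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, status, modified_at FROM orders "
        "WHERE modified_at > ? ORDER BY modified_at",
        (last_seen,),
    ).fetchall()
    new_last_seen = rows[-1][2] if rows else last_seen
    return rows, new_last_seen


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, modified_at INTEGER)"
)
conn.execute("INSERT INTO orders VALUES (1, 'new', 100)")
conn.execute("INSERT INTO orders VALUES (2, 'new', 101)")

first_batch, watermark = poll_changes(conn, 0)

# Between two polls: order 1 changes twice and order 2 is deleted.
conn.execute("UPDATE orders SET status = 'paid', modified_at = 102 WHERE id = 1")
conn.execute("UPDATE orders SET status = 'shipped', modified_at = 103 WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 2")

second_batch, watermark = poll_changes(conn, watermark)
print(second_batch)  # [(1, 'shipped', 103)]
```

The second poll sees only the final state of order 1. The intermediate 'paid' state and the deletion of order 2 are never observed.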

Trigger-Based CDC

Database triggers fire on each change and write to a separate changelog table. Python reads the changelog table. This captures all changes but adds write overhead to every operation and couples the CDC mechanism to the database schema.
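
A small sketch of the trigger approach, again using in-memory SQLite so it runs anywhere. The orders_changelog table, the trigger names, and the read_changelog helper are invented for this example; trigger syntax differs in PostgreSQL and MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE orders_changelog (
    seq        INTEGER PRIMARY KEY AUTOINCREMENT,
    op         TEXT NOT NULL,
    order_id   INTEGER NOT NULL,
    old_status TEXT,
    new_status TEXT
);
CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
    INSERT INTO orders_changelog (op, order_id, new_status)
    VALUES ('c', NEW.id, NEW.status);
END;
CREATE TRIGGER orders_upd AFTER UPDATE ON orders BEGIN
    INSERT INTO orders_changelog (op, order_id, old_status, new_status)
    VALUES ('u', NEW.id, OLD.status, NEW.status);
END;
CREATE TRIGGER orders_del AFTER DELETE ON orders BEGIN
    INSERT INTO orders_changelog (op, order_id, old_status)
    VALUES ('d', OLD.id, OLD.status);
END;
""")

# Ordinary application writes; the triggers record each one.
conn.execute("INSERT INTO orders VALUES (1, 'new')")
conn.execute("UPDATE orders SET status = 'paid' WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 1")


def read_changelog(conn, after_seq=0):
    """Return changelog rows with a sequence number greater than after_seq."""
    return conn.execute(
        "SELECT seq, op, order_id, old_status, new_status "
        "FROM orders_changelog WHERE seq > ? ORDER BY seq",
        (after_seq,),
    ).fetchall()


changes = read_changelog(conn)
for change in changes:
    print(change)
```

Every application write now performs a second write to the changelog table, which is the overhead described above, and the trigger definitions must be updated whenever the orders schema changes.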

The Debezium + Python Pipeline

The typical architecture:

PostgreSQL → Debezium (Kafka Connect) → Kafka → Python Consumer

Debezium captures changes from PostgreSQL’s WAL and publishes them as JSON events to Kafka. A Python consumer processes these events:

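from confluent_kafka import Consumer
import json

consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "cdc-processor",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["dbserver.public.orders"])

# index_new_order, update_search_index, and remove_from_search are
# application-defined handlers, not part of confluent_kafka.
try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue  # no message arrived within the timeout
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        if msg.value() is None:
            continue  # tombstone record that Debezium emits after a delete

        # Assumes the JSON converter with schemas enabled, which wraps
        # the change event in a "payload" field.
        event = json.loads(msg.value())
        operation = event["payload"]["op"]  # c=create, u=update, d=delete, r=snapshot read
        after = event["payload"].get("after")    # None for deletes
        before = event["payload"].get("before")  # None for creates

        if operation in ("c", "r"):
            index_new_order(after)
        elif operation == "u":
            update_search_index(before, after)
        elif operation == "d":
            remove_from_search(before)
finally:
    consumer.close()  # leave the consumer group cleanly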

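Each event includes the operation type, the row’s state before the change, and the state after. This allows precise reactions to every kind of modification. With PostgreSQL, how much of the before state is populated depends on the table’s REPLICA IDENTITY setting.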
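
To make the before/after structure concrete, here is a small sketch that works on a hand-written sample event rather than a live Kafka message. The sample shows only the payload fields used in this article (a real Debezium event carries more metadata), and changed_fields is a hypothetical helper.

```python
import json

# Hand-written sample in the shape used above; not captured from a real system.
sample_message = json.dumps({
    "payload": {
        "op": "u",
        "before": {"id": 7, "status": "new", "total": 40},
        "after": {"id": 7, "status": "paid", "total": 40},
    }
})


def changed_fields(event: dict) -> dict:
    """Map each changed column to an (old, new) tuple."""
    payload = event["payload"]
    before = payload.get("before") or {}  # None on creates
    after = payload.get("after") or {}    # None on deletes
    return {
        key: (before.get(key), after.get(key))
        for key in sorted(before.keys() | after.keys())
        if before.get(key) != after.get(key)
    }


diff = changed_fields(json.loads(sample_message))
print(diff)  # {'status': ('new', 'paid')}
```

A consumer could use a diff like this to skip work when none of the columns it cares about have changed.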

The Outbox Pattern

Sometimes you need to publish an event and update a database atomically. The outbox pattern solves this:

  1. Write the business data and an event record to the same database in one transaction.
  2. CDC captures the event record from the outbox table.
  3. A separate service reads the captured event and publishes it to a message broker.

This avoids the dual-write problem, where writing to a database and a message broker as two separate operations can leave them inconsistent if one write succeeds and the other fails.
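
Step 1 is the part the application owns, and it can be sketched with a single transaction. This example uses in-memory SQLite; the table layout, the OrderPlaced event name, and the place_order helper are assumptions made for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    id       INTEGER PRIMARY KEY,
    customer TEXT NOT NULL,
    total    INTEGER NOT NULL
);
CREATE TABLE outbox (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    aggregate_id INTEGER NOT NULL,
    event_type   TEXT NOT NULL,
    payload      TEXT NOT NULL
);
""")


def place_order(conn, order_id, customer, total, event_type="OrderPlaced"):
    """Write the order and its outbox event in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT INTO orders VALUES (?, ?, ?)", (order_id, customer, total)
        )
        conn.execute(
            "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, ?, ?)",
            (order_id, event_type, json.dumps({"customer": customer, "total": total})),
        )


place_order(conn, 1, "alice", 40)

# Force the outbox insert to fail (NULL event_type) after the order insert
# has already run; the whole transaction is rolled back.
try:
    place_order(conn, 2, "bob", 15, event_type=None)
except sqlite3.IntegrityError:
    pass

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
outbox_count = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
print(order_count, outbox_count)  # 1 1
```

Because both rows commit or roll back together, an order can never exist without its event, and CDC on the outbox table then takes care of delivery.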

Common Use Cases

  • Search index synchronization — keep Elasticsearch in sync with a PostgreSQL source of truth.
  • Cache invalidation — clear or update Redis cache entries when the underlying data changes.
  • Cross-service data replication — sync data between microservices without direct API calls.
  • Audit logging — capture every database change for compliance.
  • Real-time analytics — stream changes to a data warehouse without batch ETL.
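
As one illustration of these use cases, the sketch below applies change events to a cache. A plain dict stands in for Redis so the example is self-contained; the event shape matches the consumer shown earlier, and apply_to_cache is a hypothetical helper.

```python
def apply_to_cache(cache: dict, event: dict) -> None:
    """Keep a keyed cache in step with a stream of change events."""
    payload = event["payload"]
    op = payload["op"]
    if op in ("c", "u", "r"):
        row = payload["after"]
        cache[row["id"]] = row  # insert or overwrite with the latest state
    elif op == "d":
        cache.pop(payload["before"]["id"], None)  # drop the deleted row


cache = {}
events = [
    {"payload": {"op": "c", "before": None, "after": {"id": 1, "status": "new"}}},
    {"payload": {"op": "c", "before": None, "after": {"id": 2, "status": "new"}}},
    {"payload": {"op": "u", "before": {"id": 1, "status": "new"},
                 "after": {"id": 1, "status": "paid"}}},
    {"payload": {"op": "d", "before": {"id": 2, "status": "new"}, "after": None}},
]
for event in events:
    apply_to_cache(cache, event)

print(cache)  # {1: {'id': 1, 'status': 'paid'}}
```

The same dispatch shape would apply to a search index or a replica table: upsert on create and update, remove on delete.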

Common Misconception

A common misconception is that CDC replaces an event-driven architecture. It does not; it complements one. CDC captures what changed in the database, but it cannot capture intent. An “order placed” event carries business meaning. A CDC event saying “row inserted in orders table” carries only structural meaning. Ideally, services publish domain events explicitly and use CDC as a safety net for data synchronization.

The one thing to remember: Log-based CDC reads the database’s own change log to deliver reliable, near-real-time change events to Python with minimal extra query load and no changes to application code.
