Python Slowly Changing Dimensions — Core Concepts

Master SCD Types 1, 2, and 3 to handle changing reference data in Python data warehouses correctly.

Slowly Changing Dimensions (SCD) solve a fundamental problem in data warehousing: how do you track entities whose attributes change over time? A customer moves, a product changes price, an employee switches departments. Your dimension tables need a strategy for handling these changes.

The SCD types

Type 1 — Overwrite

Replace the old value with the new one. No history is preserved.

customer_id	name	city
101	Alice	Chicago ← was New York

When to use: The old value has no analytical value. Correcting a typo in a name, updating a phone format.

Risk: Historical reports change retroactively. A sales report from last quarter now shows Alice in Chicago even though she was in New York when the sale happened.

Type 2 — Add a new row with date tracking

Keep the old row and insert a new one. Date columns mark when each version was valid.

surrogate_key	customer_id	name	city	valid_from	valid_to	is_current
1001	101	Alice	New York	2023-01-15	2026-03-01	false
1002	101	Alice	Chicago	2026-03-01	9999-12-31	true

When to use: You need to report historical data accurately. “What was Alice’s city when she placed order #5842?”

Trade-off: Table size grows with every change. Joins become slightly more complex because you need to match on date ranges.

Type 3 — Add a column for the previous value

customer_id	name	current_city	previous_city
101	Alice	Chicago	New York

When to use: You only need one level of history and the change is infrequent.

Limitation: Only tracks one prior state. A second move overwrites previous_city.

How it works in practice

Most production warehouses use Type 2 for important dimensions (customers, products, employees) and Type 1 for corrections and non-critical attributes.

The typical Python workflow:

Load incoming data — new or updated records from a source system.
Look up existing dimension records — find current rows by business key.
Compare attributes — detect which fields actually changed.
Apply the SCD strategy:
- Type 1: UPDATE the existing row.
- Type 2: Close the old row (set valid_to, is_current = false) and INSERT a new row.
- Type 3: UPDATE the row, shifting the current value to the previous column.

Key concepts

Business key vs. surrogate key: The business key (like customer_id) identifies the entity in the source system. The surrogate key (auto-incrementing or UUID) uniquely identifies each version in the warehouse. Type 2 needs surrogate keys because the same business key appears in multiple rows.

Effective dating: valid_from and valid_to columns define when a row was the “truth.” Use a far-future date (like 9999-12-31) for current records to simplify range queries.

Change detection: Comparing every column is expensive. Hash the tracked columns and compare hashes instead. If the hash differs, something changed.

Common misconception

“SCD Type 2 means you version every column.” Not necessarily. You choose which columns trigger a new version. A customer’s email change might be Type 1 (overwrite), while their city change is Type 2 (new version). Mixing types within one table is normal and practical.

Python ecosystem

pandas / polars — merge and compare DataFrames for change detection.
SQLAlchemy — execute UPDATE and INSERT statements against warehouse tables.
Delta Lake MERGE — handles Type 2 upserts declaratively.
dbt snapshots — automate Type 2 SCD with configuration, backed by Python models.
Great Expectations — validate that SCD logic produces correct date ranges and no gaps.

One thing to remember: choose your SCD type based on whether historical accuracy matters for that specific attribute—not every column needs the same treatment.

pythonscddata-warehousing