Python Slowly Changing Dimensions — Core Concepts
Slowly Changing Dimensions (SCD) solve a fundamental problem in data warehousing: how do you track entities whose attributes change over time? A customer moves, a product changes price, an employee switches departments. Your dimension tables need a strategy for handling these changes.
The SCD types
Type 1 — Overwrite
Replace the old value with the new one. No history is preserved.
| customer_id | name | city |
|---|---|---|
| 101 | Alice | Chicago ← was New York |
When to use: The old value has no analytical value. Correcting a typo in a name, updating a phone format.
Risk: Historical reports change retroactively. A sales report from last quarter now shows Alice in Chicago even though she was in New York when the sale happened.
Type 2 — Add a new row with date tracking
Keep the old row and insert a new one. Date columns mark when each version was valid.
| surrogate_key | customer_id | name | city | valid_from | valid_to | is_current |
|---|---|---|---|---|---|---|
| 1001 | 101 | Alice | New York | 2023-01-15 | 2026-03-01 | false |
| 1002 | 101 | Alice | Chicago | 2026-03-01 | 9999-12-31 | true |
When to use: You need to report historical data accurately. “What was Alice’s city when she placed order #5842?”
Trade-off: Table size grows with every change. Joins become slightly more complex because you need to match on date ranges.
Type 3 — Add a column for the previous value
| customer_id | name | current_city | previous_city |
|---|---|---|---|
| 101 | Alice | Chicago | New York |
When to use: You only need one level of history and the change is infrequent.
Limitation: Only tracks one prior state. A second move overwrites previous_city.
How it works in practice
Most production warehouses use Type 2 for important dimensions (customers, products, employees) and Type 1 for corrections and non-critical attributes.
The typical Python workflow:
- Load incoming data — new or updated records from a source system.
- Look up existing dimension records — find current rows by business key.
- Compare attributes — detect which fields actually changed.
- Apply the SCD strategy:
- Type 1: UPDATE the existing row.
- Type 2: Close the old row (set
valid_to,is_current = false) and INSERT a new row. - Type 3: UPDATE the row, shifting the current value to the previous column.
Key concepts
Business key vs. surrogate key: The business key (like customer_id) identifies the entity in the source system. The surrogate key (auto-incrementing or UUID) uniquely identifies each version in the warehouse. Type 2 needs surrogate keys because the same business key appears in multiple rows.
Effective dating: valid_from and valid_to columns define when a row was the “truth.” Use a far-future date (like 9999-12-31) for current records to simplify range queries.
Change detection: Comparing every column is expensive. Hash the tracked columns and compare hashes instead. If the hash differs, something changed.
Common misconception
“SCD Type 2 means you version every column.” Not necessarily. You choose which columns trigger a new version. A customer’s email change might be Type 1 (overwrite), while their city change is Type 2 (new version). Mixing types within one table is normal and practical.
Python ecosystem
- pandas / polars — merge and compare DataFrames for change detection.
- SQLAlchemy — execute UPDATE and INSERT statements against warehouse tables.
- Delta Lake MERGE — handles Type 2 upserts declaratively.
- dbt snapshots — automate Type 2 SCD with configuration, backed by Python models.
- Great Expectations — validate that SCD logic produces correct date ranges and no gaps.
One thing to remember: choose your SCD type based on whether historical accuracy matters for that specific attribute—not every column needs the same treatment.
See Also
- Python Adaptive Learning Systems How Python builds learning apps that adjust to each student like a personal tutor who knows exactly what you need next.
- Python Airflow Learn Airflow as a timetable manager that makes sure data tasks run in the right order every day.
- Python Altair Learn Altair through the idea of drawing charts by describing rules, not by hand-placing every visual element.
- Python Automated Grading How Python grades homework and exams automatically, from simple answer keys to understanding written essays.
- Python Batch Vs Stream Processing Batch processing is like doing laundry once a week; stream processing is like a self-cleaning shirt that cleans itself constantly.