Python Slowly Changing Dimensions — Core Concepts

Slowly Changing Dimensions (SCD) solve a fundamental problem in data warehousing: how do you track entities whose attributes change over time? A customer moves, a product changes price, an employee switches departments. Your dimension tables need a strategy for handling these changes.

The SCD types

Type 1 — Overwrite

Replace the old value with the new one. No history is preserved.

customer_idnamecity
101AliceChicago ← was New York

When to use: The old value has no analytical value. Correcting a typo in a name, updating a phone format.

Risk: Historical reports change retroactively. A sales report from last quarter now shows Alice in Chicago even though she was in New York when the sale happened.

Type 2 — Add a new row with date tracking

Keep the old row and insert a new one. Date columns mark when each version was valid.

surrogate_keycustomer_idnamecityvalid_fromvalid_tois_current
1001101AliceNew York2023-01-152026-03-01false
1002101AliceChicago2026-03-019999-12-31true

When to use: You need to report historical data accurately. “What was Alice’s city when she placed order #5842?”

Trade-off: Table size grows with every change. Joins become slightly more complex because you need to match on date ranges.

Type 3 — Add a column for the previous value

customer_idnamecurrent_cityprevious_city
101AliceChicagoNew York

When to use: You only need one level of history and the change is infrequent.

Limitation: Only tracks one prior state. A second move overwrites previous_city.

How it works in practice

Most production warehouses use Type 2 for important dimensions (customers, products, employees) and Type 1 for corrections and non-critical attributes.

The typical Python workflow:

  1. Load incoming data — new or updated records from a source system.
  2. Look up existing dimension records — find current rows by business key.
  3. Compare attributes — detect which fields actually changed.
  4. Apply the SCD strategy:
    • Type 1: UPDATE the existing row.
    • Type 2: Close the old row (set valid_to, is_current = false) and INSERT a new row.
    • Type 3: UPDATE the row, shifting the current value to the previous column.

Key concepts

Business key vs. surrogate key: The business key (like customer_id) identifies the entity in the source system. The surrogate key (auto-incrementing or UUID) uniquely identifies each version in the warehouse. Type 2 needs surrogate keys because the same business key appears in multiple rows.

Effective dating: valid_from and valid_to columns define when a row was the “truth.” Use a far-future date (like 9999-12-31) for current records to simplify range queries.

Change detection: Comparing every column is expensive. Hash the tracked columns and compare hashes instead. If the hash differs, something changed.

Common misconception

“SCD Type 2 means you version every column.” Not necessarily. You choose which columns trigger a new version. A customer’s email change might be Type 1 (overwrite), while their city change is Type 2 (new version). Mixing types within one table is normal and practical.

Python ecosystem

  • pandas / polars — merge and compare DataFrames for change detection.
  • SQLAlchemy — execute UPDATE and INSERT statements against warehouse tables.
  • Delta Lake MERGE — handles Type 2 upserts declaratively.
  • dbt snapshots — automate Type 2 SCD with configuration, backed by Python models.
  • Great Expectations — validate that SCD logic produces correct date ranges and no gaps.

One thing to remember: choose your SCD type based on whether historical accuracy matters for that specific attribute—not every column needs the same treatment.

pythonscddata-warehousing

See Also

  • Python Adaptive Learning Systems How Python builds learning apps that adjust to each student like a personal tutor who knows exactly what you need next.
  • Python Airflow Learn Airflow as a timetable manager that makes sure data tasks run in the right order every day.
  • Python Altair Learn Altair through the idea of drawing charts by describing rules, not by hand-placing every visual element.
  • Python Automated Grading How Python grades homework and exams automatically, from simple answer keys to understanding written essays.
  • Python Batch Vs Stream Processing Batch processing is like doing laundry once a week; stream processing is like a self-cleaning shirt that cleans itself constantly.