Python Medallion Architecture — Core Concepts

Medallion architecture organizes data processing into three progressive layers—Bronze, Silver, and Gold—each with a distinct responsibility. Popularized by Databricks, the pattern is now used across cloud data platforms and can be implemented entirely in Python.

The three layers

Bronze (Raw)

Bronze ingests data exactly as it arrives, with no transformations and no deduplication. Each record is appended to storage together with ingestion metadata: an ingestion timestamp, a source identifier, and a batch ID.
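A minimal sketch of Bronze ingestion, using only the standard library. The function name to_bronze, the underscore-prefixed metadata keys, and the sample orders_api source are assumptions made for illustration. The point is that the payload is stored untouched while the metadata sits beside it.

```python
import json
import uuid
from datetime import datetime, timezone


def to_bronze(raw_records, source_id, batch_id=None):
    """Wrap raw payloads with ingestion metadata without altering them."""
    batch_id = batch_id or uuid.uuid4().hex
    ingested_at = datetime.now(timezone.utc).isoformat()
    return [
        {
            "payload": record,            # stored exactly as received
            "_ingested_at": ingested_at,  # ingestion timestamp (UTC)
            "_source": source_id,         # source identifier
            "_batch_id": batch_id,        # batch ID shared by the whole load
        }
        for record in raw_records
    ]


# Hypothetical raw input: two identical events, both kept (no deduplication).
raw = [
    '{"order_id": "1", "amount": "19.99"}',
    '{"order_id": "1", "amount": "19.99"}',
]
bronze_rows = to_bronze(raw, source_id="orders_api", batch_id="batch-001")
bronze_lines = [json.dumps(row) for row in bronze_rows]  # one JSON line per record
```

Keeping the payload as an opaque string means a malformed record still lands in Bronze and can be inspected later, rather than being lost at ingest time.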

The purpose is preservation. If a downstream bug corrupts Silver data, Bronze lets you replay from the original. If a new use case appears a year later, the raw events are still available.

Silver (Cleaned)

Silver applies cleaning and validation rules to produce a trusted, queryable dataset:

  • Deduplication — remove exact or near-duplicate records.
  • Type casting — convert string dates to proper timestamps, enforce numeric types.
  • Null handling — fill defaults, flag missing values, or reject invalid rows.
  • Schema enforcement — ensure every record conforms to an expected shape.
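The four steps above can be sketched in plain Python so the example runs without dependencies. The same logic maps onto pandas or polars operations. The field names in EXPECTED_FIELDS, the "unknown" default, and the choice to deduplicate on order_id are assumptions, not rules from the pattern itself.

```python
from datetime import datetime

EXPECTED_FIELDS = {"order_id", "amount", "ordered_at", "region"}  # hypothetical schema


def to_silver(rows):
    """Return (clean_rows, rejected_rows) after applying the four steps."""
    clean, rejected, seen = [], [], set()
    for row in rows:
        # Schema enforcement: reject rows whose keys don't match the expected shape.
        if set(row) != EXPECTED_FIELDS:
            rejected.append(row)
            continue
        # Deduplication: keep only the first accepted row per order_id.
        if row["order_id"] in seen:
            continue
        try:
            typed = {
                "order_id": str(row["order_id"]),
                # Type casting: string amounts and ISO dates become typed values.
                "amount": float(row["amount"]),
                "ordered_at": datetime.fromisoformat(row["ordered_at"]),
                # Null handling: fill a default for a missing region.
                "region": row["region"] or "unknown",
            }
        except (TypeError, ValueError):
            rejected.append(row)  # invalid values are rejected, not guessed
            continue
        seen.add(row["order_id"])
        clean.append(typed)
    return clean, rejected


rows = [
    {"order_id": "1", "amount": "19.99", "ordered_at": "2024-01-05T10:00:00", "region": "EU"},
    {"order_id": "1", "amount": "19.99", "ordered_at": "2024-01-05T10:00:00", "region": "EU"},
    {"order_id": "2", "amount": "5", "ordered_at": "2024-01-05T11:30:00", "region": None},
    {"order_id": "3", "amount": "n/a", "ordered_at": "2024-01-06T09:00:00", "region": "US"},
    {"order_id": "4", "amount": "7.50"},  # missing fields
]
clean, rejected = to_silver(rows)
```

Returning the rejected rows instead of silently dropping them gives a quarantine set that can be inspected or counted.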

Silver is where most engineering effort lives. A single Bronze source might feed multiple Silver tables optimized for different consumers.

Gold (Curated)

Gold contains pre-aggregated, business-ready datasets: daily revenue by region, customer churn metrics, inventory snapshots. These are optimized for fast dashboards and reporting.
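As an illustration of the first example, daily revenue by region can be computed from Silver rows with a grouped sum. This is a sketch under assumed column names (order_date, region, amount). In practice the same aggregation would more likely be a GROUP BY in duckdb or a group-by in polars.

```python
from collections import defaultdict
from datetime import date


def daily_revenue_by_region(silver_rows):
    """Sum amounts per (order_date, region) and return sorted Gold rows."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[(row["order_date"], row["region"])] += row["amount"]
    return [
        {"order_date": day, "region": region, "revenue": round(total, 2)}
        for (day, region), total in sorted(totals.items())
    ]


# Hypothetical Silver rows, already typed and deduplicated.
silver = [
    {"order_date": date(2024, 1, 5), "region": "EU", "amount": 19.99},
    {"order_date": date(2024, 1, 5), "region": "EU", "amount": 5.01},
    {"order_date": date(2024, 1, 5), "region": "US", "amount": 7.50},
    {"order_date": date(2024, 1, 6), "region": "EU", "amount": 12.00},
]
gold = daily_revenue_by_region(silver)
```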

Gold tables are narrow in scope. Each one answers a specific set of questions and is documented with an owner, freshness SLA, and known limitations.
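One way to keep that documentation next to the pipeline code is a small record per Gold table. The GoldTableSpec class, its field names, and the sample values below are hypothetical. Teams often keep the same information in a data catalog instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class GoldTableSpec:
    name: str
    owner: str
    freshness_sla: timedelta       # maximum acceptable age of the data
    known_limitations: tuple = ()  # free-text caveats for consumers

    def is_fresh(self, last_refreshed, now=None):
        """True if the last refresh falls within the freshness SLA."""
        now = now or datetime.now(timezone.utc)
        return now - last_refreshed <= self.freshness_sla


spec = GoldTableSpec(
    name="gold_daily_revenue_by_region",
    owner="analytics-team",  # hypothetical owner
    freshness_sla=timedelta(hours=24),
    known_limitations=("Refunds are not netted out.",),
)
now = datetime(2024, 1, 6, 12, 0, tzinfo=timezone.utc)
fresh = spec.is_fresh(datetime(2024, 1, 6, 2, 0, tzinfo=timezone.utc), now=now)
stale = spec.is_fresh(datetime(2024, 1, 4, 2, 0, tzinfo=timezone.utc), now=now)
```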

How data moves between layers

Transition       | Trigger                           | Python tools
Source → Bronze  | Scheduled ingest or event-driven  | boto3, requests, Airflow sensors
Bronze → Silver  | Post-ingest task                  | pandas, polars, pyspark, Great Expectations
Silver → Gold    | Business schedule (hourly, daily) | duckdb, polars, dbt-core with Python models

Each transition is a separate, idempotent pipeline step: rerunning it on the same input produces the same output, so a failed run can be retried without duplicating rows or causing other side effects.
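One common way to get that rerun safety, sketched here with local files, is to derive the output path from the batch ID and overwrite it on every run. The write_partition helper and the directory layout are assumptions. On object storage the equivalent is writing to a deterministic key.

```python
import json
import os
import tempfile
from pathlib import Path


def write_partition(base_dir, layer, batch_id, rows):
    """Write one batch to a path derived from its batch ID, replacing any earlier attempt."""
    target = Path(base_dir) / layer / f"batch_id={batch_id}.jsonl"
    target.parent.mkdir(parents=True, exist_ok=True)
    tmp = target.with_suffix(".tmp")
    with tmp.open("w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")
    os.replace(tmp, target)  # swap into place with a single rename
    return target


with tempfile.TemporaryDirectory() as base:
    rows = [{"order_id": "1"}, {"order_id": "2"}]
    first = write_partition(base, "silver", "batch-001", rows)
    second = write_partition(base, "silver", "batch-001", rows)  # simulated rerun
    line_count = len(second.read_text(encoding="utf-8").splitlines())
    file_count = len(list((Path(base) / "silver").iterdir()))
```

Because the rerun overwrites the same file rather than appending, the layer holds two rows in one file after both calls, not four.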

Why not two layers or four?

Two layers (raw + final) work for simple projects but collapse when:

  • Multiple teams need different cleaned views of the same raw data.
  • Schema changes in the source need to be absorbed without breaking dashboards.
  • Debugging requires isolating whether the problem is in ingestion or transformation.

Four or more layers add overhead without proportional benefit for most teams. Three layers hit a practical sweet spot between flexibility and simplicity.

Common misconception

“Bronze data is temporary and can be deleted once Silver is built.” This defeats the main benefit. Bronze is your insurance policy. Deleting it means you lose the ability to reprocess historical data when logic changes. Archive it to cold storage if cost is a concern, but do not delete it.

Practical considerations

  • Naming conventions: Use prefixes like bronze_, silver_, gold_ or directory paths that make the layer obvious.
  • Access control: Restrict Gold to read-only for analysts. Only pipeline service accounts write to Silver and Gold.
  • Monitoring: Track row counts, null rates, and freshness at each layer transition. Alert when metrics drift beyond thresholds.
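The monitoring bullet can be sketched as two small functions: one computes row count, null rates, and staleness for a batch, the other compares them to thresholds. The function names, the _ingested_at field, and the threshold values are assumptions. A real pipeline would send the alerts to its own alerting system.

```python
from datetime import datetime, timedelta, timezone


def layer_metrics(rows, columns, timestamp_field, now=None):
    """Compute row count, per-column null rate, and staleness for one layer."""
    now = now or datetime.now(timezone.utc)
    row_count = len(rows)
    null_rates = {
        col: (sum(1 for r in rows if r.get(col) is None) / row_count) if row_count else 0.0
        for col in columns
    }
    newest = max((r[timestamp_field] for r in rows), default=None)
    return {
        "row_count": row_count,
        "null_rates": null_rates,
        "staleness": (now - newest) if newest is not None else None,
    }


def check_thresholds(metrics, min_rows, max_null_rate, max_staleness):
    """Return a list of human-readable alerts; an empty list means healthy."""
    alerts = []
    if metrics["row_count"] < min_rows:
        alerts.append(f"row count {metrics['row_count']} below {min_rows}")
    for col, rate in metrics["null_rates"].items():
        if rate > max_null_rate:
            alerts.append(f"null rate for {col} is {rate:.0%}")
    if metrics["staleness"] is None or metrics["staleness"] > max_staleness:
        alerts.append("data is stale or missing")
    return alerts


now = datetime(2024, 1, 6, 12, 0, tzinfo=timezone.utc)
rows = [
    {"region": "EU", "_ingested_at": datetime(2024, 1, 6, 9, 0, tzinfo=timezone.utc)},
    {"region": None, "_ingested_at": datetime(2024, 1, 6, 10, 0, tzinfo=timezone.utc)},
]
metrics = layer_metrics(rows, columns=["region"], timestamp_field="_ingested_at", now=now)
alerts = check_thresholds(metrics, min_rows=1, max_null_rate=0.2,
                          max_staleness=timedelta(hours=1))
```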

One thing to remember: medallion architecture trades storage cost for operational safety—keeping raw data intact means you can always recover from bugs, schema changes, or new requirements.

