Python Data Lake Patterns — Deep Dive
Building a production data lake with Python means making deliberate choices about storage layout, file formats, metadata management, and lifecycle policies. Get these right and the lake scales to petabytes. Get them wrong and you pay for storage nobody can query.
1) Storage layout design
Zone hierarchy
A three-zone model (raw → cleaned → curated) is the industry baseline, but implementation details matter:
s3://company-lake/
raw/
source=shopify_orders/
year=2026/month=03/day=28/
batch_20260328T060000Z.json.gz
cleaned/
shopify_orders/
year=2026/month=03/day=28/
part-00000.parquet
curated/
daily_revenue/
year=2026/month=03/
part-00000.parquet
Key principles:
- Raw is immutable. Never modify or delete raw files. They are your audit trail and reprocessing source.
- Cleaned is idempotent. Rerunning the cleaning pipeline for a given partition overwrites the same output path deterministically.
- Curated is consumption-ready. Schemas are stable, documentation exists, SLAs are defined.
Partition key selection
Choose partition keys based on query patterns, not ingest patterns:
import pyarrow as pa
import pyarrow.parquet as pq
table = pa.table({
"order_id": [1, 2, 3],
"amount": [99.50, 230.00, 15.75],
"order_date": ["2026-03-28", "2026-03-28", "2026-03-27"],
"region": ["us-east", "eu-west", "us-east"],
})
pq.write_to_dataset(
table,
root_path="s3://company-lake/cleaned/orders/",
partition_cols=["order_date", "region"],
)
Over-partitioning creates a “small files problem”—millions of tiny files that degrade query performance and increase S3 LIST costs. A useful rule of thumb: target partition files between 64 MB and 256 MB each.
2) Table format selection
Parquet alone
Plain Parquet directories work for read-heavy, append-only workloads. No transaction support, no schema evolution guarantees. Fine for raw zone dumps.
Delta Lake
Delta Lake wraps Parquet with a JSON-based transaction log (_delta_log/). Python access via the deltalake package:
from deltalake import DeltaTable, write_deltalake
import pyarrow as pa
# Write
write_deltalake(
"s3://company-lake/cleaned/orders_delta/",
table,
mode="overwrite",
partition_by=["order_date"],
)
# Read with time travel
dt = DeltaTable("s3://company-lake/cleaned/orders_delta/")
print(dt.version()) # current version
historical = dt.load_as_version(2)
Delta supports MERGE (upsert), Z-ordering for multi-column pruning, and VACUUM for removing old files.
Apache Iceberg
Iceberg stores metadata in manifest files that track every data file’s partition values, column statistics, and file size. Python access via pyiceberg:
from pyiceberg.catalog import load_catalog
catalog = load_catalog("glue", **{"type": "glue"})
table = catalog.load_table("analytics.orders")
# Scan with partition pruning happens automatically
scan = table.scan(row_filter="order_date = '2026-03-28'")
df = scan.to_pandas()
Iceberg’s hidden partitioning means consumers query by column values, not by physical directory structure. Schema evolution (add, rename, reorder columns) is a metadata-only operation.
3) Compaction and maintenance
Small files accumulate when streaming jobs or frequent micro-batches write to the lake. Compaction merges them:
# Delta Lake compaction
dt = DeltaTable("s3://company-lake/cleaned/orders_delta/")
dt.optimize.compact()
dt.vacuum(retention_hours=168) # keep 7 days of history
For Iceberg, use rewrite_data_files via Spark or the REST catalog API. Schedule compaction as a daily Airflow task during off-peak hours.
Lifecycle policies
Set S3/GCS lifecycle rules to:
- Transition raw zone to Glacier/Coldline after 90 days.
- Delete temporary staging files after 7 days.
- Keep cleaned/curated in Standard tier for active querying.
import boto3
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
Bucket="company-lake",
LifecycleConfiguration={
"Rules": [
{
"ID": "archive-raw-90d",
"Filter": {"Prefix": "raw/"},
"Status": "Enabled",
"Transitions": [
{"Days": 90, "StorageClass": "GLACIER"},
],
},
],
},
)
4) Metadata and cataloging
Automated schema registration
After every pipeline run, register or update the schema in your catalog:
import pyarrow.parquet as pq
import json
schema = pq.read_schema("s3://company-lake/cleaned/orders/order_date=2026-03-28/part-00000.parquet")
catalog_entry = {
"table": "orders",
"zone": "cleaned",
"columns": [
{"name": f.name, "type": str(f.type), "nullable": f.nullable}
for f in schema
],
"updated_at": "2026-03-28T06:30:00Z",
"row_count": 150_000,
"partition_keys": ["order_date", "region"],
}
with open("catalog/orders.json", "w") as fh:
json.dump(catalog_entry, fh, indent=2)
For production, use a real catalog (Glue, Hive Metastore, DataHub) and push updates from your orchestrator’s post-task hooks.
Data quality gates
Quality checks belong between zones. A cleaned-zone writer should validate before promoting:
import pyarrow.parquet as pq
table = pq.read_table("staging/orders.parquet")
assert table.num_rows > 0, "Empty batch"
assert table.schema.field("order_id").type == pa.int64(), "order_id type drift"
null_rate = table.column("amount").null_count / table.num_rows
assert null_rate < 0.01, f"amount null rate {null_rate:.2%} exceeds 1% threshold"
5) Access patterns and query engines
Different consumers need different engines:
| Consumer | Engine | Why |
|---|---|---|
| Data scientists (ad hoc) | DuckDB, Polars | Fast local queries on Parquet |
| BI dashboards | Trino, Athena | Federated SQL across zones |
| ML pipelines | Spark, Ray | Distributed processing at scale |
| Compliance audits | Delta/Iceberg time travel | Point-in-time snapshots |
Python unifies access across all of these. A data scientist can prototype with DuckDB locally, then the same query logic scales to Trino in production.
import duckdb
conn = duckdb.connect()
result = conn.execute("""
SELECT region, SUM(amount) as total_revenue
FROM read_parquet('s3://company-lake/curated/daily_revenue/**/*.parquet')
WHERE year = 2026 AND month = 3
GROUP BY region
ORDER BY total_revenue DESC
""").fetchdf()
6) Cost management
Data lake costs sneak up through three channels:
- Storage: Raw data grows indefinitely. Use lifecycle policies and compression.
- API calls: S3 LIST and GET requests are billed per-call. Reduce small files, use manifest-based formats (Delta, Iceberg) to avoid listing.
- Compute: Scanning unpartitioned data wastes CPU. Partition pruning and column projection are your first line of defense.
Track costs per zone and per team using bucket-level tags and query engine cost attribution.
Production checklist
- Three-zone layout with immutable raw zone
- Columnar format (Parquet minimum) for cleaned and curated zones
- Partition keys aligned to dominant query filters
- Compaction job scheduled (daily or after high-ingest windows)
- Metadata catalog updated automatically after each pipeline run
- Lifecycle policies for cold storage and cleanup
- Quality gates between zone promotions
- Access controls per zone (raw = engineering, curated = analysts)
- Cost monitoring dashboards by zone and team
One thing to remember: a data lake’s value comes from the metadata layer—without catalogs, quality gates, and lifecycle policies, raw storage is just an expensive backup nobody queries.
See Also
- Python Adaptive Learning Systems How Python builds learning apps that adjust to each student like a personal tutor who knows exactly what you need next.
- Python Airflow Learn Airflow as a timetable manager that makes sure data tasks run in the right order every day.
- Python Altair Learn Altair through the idea of drawing charts by describing rules, not by hand-placing every visual element.
- Python Automated Grading How Python grades homework and exams automatically, from simple answer keys to understanding written essays.
- Python Batch Vs Stream Processing Batch processing is like doing laundry once a week; stream processing is like a self-cleaning shirt that cleans itself constantly.