Python Data Lake Patterns — Deep Dive

Building a production data lake with Python means making deliberate choices about storage layout, file formats, metadata management, and lifecycle policies. Get these right and the lake scales to petabytes. Get them wrong and you pay for storage nobody can query.

1) Storage layout design

Zone hierarchy

A three-zone model (raw → cleaned → curated) is the industry baseline, but implementation details matter:

s3://company-lake/
  raw/
    source=shopify_orders/
      year=2026/month=03/day=28/
        batch_20260328T060000Z.json.gz
  cleaned/
    shopify_orders/
      year=2026/month=03/day=28/
        part-00000.parquet
  curated/
    daily_revenue/
      year=2026/month=03/
        part-00000.parquet

Key principles:

  • Raw is immutable. Never modify or delete raw files. They are your audit trail and reprocessing source.
  • Cleaned is idempotent. Rerunning the cleaning pipeline for a given partition overwrites the same output path deterministically.
  • Curated is consumption-ready. Schemas are stable, documentation exists, SLAs are defined.

Partition key selection

Choose partition keys based on query patterns, not ingest patterns:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "order_id": [1, 2, 3],
    "amount": [99.50, 230.00, 15.75],
    "order_date": ["2026-03-28", "2026-03-28", "2026-03-27"],
    "region": ["us-east", "eu-west", "us-east"],
})

pq.write_to_dataset(
    table,
    root_path="s3://company-lake/cleaned/orders/",
    partition_cols=["order_date", "region"],
)

Over-partitioning creates a “small files problem”—millions of tiny files that degrade query performance and increase S3 LIST costs. A useful rule of thumb: target partition files between 64 MB and 256 MB each.

2) Table format selection

Parquet alone

Plain Parquet directories work for read-heavy, append-only workloads. No transaction support, no schema evolution guarantees. Fine for raw zone dumps.

Delta Lake

Delta Lake wraps Parquet with a JSON-based transaction log (_delta_log/). Python access via the deltalake package:

from deltalake import DeltaTable, write_deltalake
import pyarrow as pa

# Write
write_deltalake(
    "s3://company-lake/cleaned/orders_delta/",
    table,
    mode="overwrite",
    partition_by=["order_date"],
)

# Read with time travel
dt = DeltaTable("s3://company-lake/cleaned/orders_delta/")
print(dt.version())  # current version
historical = dt.load_as_version(2)

Delta supports MERGE (upsert), Z-ordering for multi-column pruning, and VACUUM for removing old files.

Apache Iceberg

Iceberg stores metadata in manifest files that track every data file’s partition values, column statistics, and file size. Python access via pyiceberg:

from pyiceberg.catalog import load_catalog

catalog = load_catalog("glue", **{"type": "glue"})
table = catalog.load_table("analytics.orders")

# Scan with partition pruning happens automatically
scan = table.scan(row_filter="order_date = '2026-03-28'")
df = scan.to_pandas()

Iceberg’s hidden partitioning means consumers query by column values, not by physical directory structure. Schema evolution (add, rename, reorder columns) is a metadata-only operation.

3) Compaction and maintenance

Small files accumulate when streaming jobs or frequent micro-batches write to the lake. Compaction merges them:

# Delta Lake compaction
dt = DeltaTable("s3://company-lake/cleaned/orders_delta/")
dt.optimize.compact()
dt.vacuum(retention_hours=168)  # keep 7 days of history

For Iceberg, use rewrite_data_files via Spark or the REST catalog API. Schedule compaction as a daily Airflow task during off-peak hours.

Lifecycle policies

Set S3/GCS lifecycle rules to:

  • Transition raw zone to Glacier/Coldline after 90 days.
  • Delete temporary staging files after 7 days.
  • Keep cleaned/curated in Standard tier for active querying.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="company-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-90d",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            },
        ],
    },
)

4) Metadata and cataloging

Automated schema registration

After every pipeline run, register or update the schema in your catalog:

import pyarrow.parquet as pq
import json

schema = pq.read_schema("s3://company-lake/cleaned/orders/order_date=2026-03-28/part-00000.parquet")

catalog_entry = {
    "table": "orders",
    "zone": "cleaned",
    "columns": [
        {"name": f.name, "type": str(f.type), "nullable": f.nullable}
        for f in schema
    ],
    "updated_at": "2026-03-28T06:30:00Z",
    "row_count": 150_000,
    "partition_keys": ["order_date", "region"],
}

with open("catalog/orders.json", "w") as fh:
    json.dump(catalog_entry, fh, indent=2)

For production, use a real catalog (Glue, Hive Metastore, DataHub) and push updates from your orchestrator’s post-task hooks.

Data quality gates

Quality checks belong between zones. A cleaned-zone writer should validate before promoting:

import pyarrow.parquet as pq

table = pq.read_table("staging/orders.parquet")
assert table.num_rows > 0, "Empty batch"
assert table.schema.field("order_id").type == pa.int64(), "order_id type drift"

null_rate = table.column("amount").null_count / table.num_rows
assert null_rate < 0.01, f"amount null rate {null_rate:.2%} exceeds 1% threshold"

5) Access patterns and query engines

Different consumers need different engines:

ConsumerEngineWhy
Data scientists (ad hoc)DuckDB, PolarsFast local queries on Parquet
BI dashboardsTrino, AthenaFederated SQL across zones
ML pipelinesSpark, RayDistributed processing at scale
Compliance auditsDelta/Iceberg time travelPoint-in-time snapshots

Python unifies access across all of these. A data scientist can prototype with DuckDB locally, then the same query logic scales to Trino in production.

import duckdb

conn = duckdb.connect()
result = conn.execute("""
    SELECT region, SUM(amount) as total_revenue
    FROM read_parquet('s3://company-lake/curated/daily_revenue/**/*.parquet')
    WHERE year = 2026 AND month = 3
    GROUP BY region
    ORDER BY total_revenue DESC
""").fetchdf()

6) Cost management

Data lake costs sneak up through three channels:

  1. Storage: Raw data grows indefinitely. Use lifecycle policies and compression.
  2. API calls: S3 LIST and GET requests are billed per-call. Reduce small files, use manifest-based formats (Delta, Iceberg) to avoid listing.
  3. Compute: Scanning unpartitioned data wastes CPU. Partition pruning and column projection are your first line of defense.

Track costs per zone and per team using bucket-level tags and query engine cost attribution.

Production checklist

  • Three-zone layout with immutable raw zone
  • Columnar format (Parquet minimum) for cleaned and curated zones
  • Partition keys aligned to dominant query filters
  • Compaction job scheduled (daily or after high-ingest windows)
  • Metadata catalog updated automatically after each pipeline run
  • Lifecycle policies for cold storage and cleanup
  • Quality gates between zone promotions
  • Access controls per zone (raw = engineering, curated = analysts)
  • Cost monitoring dashboards by zone and team

One thing to remember: a data lake’s value comes from the metadata layer—without catalogs, quality gates, and lifecycle policies, raw storage is just an expensive backup nobody queries.

pythondata-lakedata-engineering

See Also

  • Python Adaptive Learning Systems How Python builds learning apps that adjust to each student like a personal tutor who knows exactly what you need next.
  • Python Airflow Learn Airflow as a timetable manager that makes sure data tasks run in the right order every day.
  • Python Altair Learn Altair through the idea of drawing charts by describing rules, not by hand-placing every visual element.
  • Python Automated Grading How Python grades homework and exams automatically, from simple answer keys to understanding written essays.
  • Python Batch Vs Stream Processing Batch processing is like doing laundry once a week; stream processing is like a self-cleaning shirt that cleans itself constantly.