Python Data Lake Patterns — Deep Dive

Architect production data lakes in Python with partitioning, table formats, compaction, and cost controls.

Building a production data lake with Python means making deliberate choices about storage layout, file formats, metadata management, and lifecycle policies. Get these right and the lake scales to petabytes. Get them wrong and you pay for storage nobody can query.

1) Storage layout design

Zone hierarchy

A three-zone model (raw → cleaned → curated) is the industry baseline, but implementation details matter:

s3://company-lake/
  raw/
    source=shopify_orders/
      year=2026/month=03/day=28/
        batch_20260328T060000Z.json.gz
  cleaned/
    shopify_orders/
      year=2026/month=03/day=28/
        part-00000.parquet
  curated/
    daily_revenue/
      year=2026/month=03/
        part-00000.parquet

Key principles:

Raw is immutable. Never modify or delete raw files. They are your audit trail and reprocessing source.
Cleaned is idempotent. Rerunning the cleaning pipeline for a given partition overwrites the same output path deterministically.
Curated is consumption-ready. Schemas are stable, documentation exists, SLAs are defined.

Partition key selection

Choose partition keys based on query patterns, not ingest patterns:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "order_id": [1, 2, 3],
    "amount": [99.50, 230.00, 15.75],
    "order_date": ["2026-03-28", "2026-03-28", "2026-03-27"],
    "region": ["us-east", "eu-west", "us-east"],
})

pq.write_to_dataset(
    table,
    root_path="s3://company-lake/cleaned/orders/",
    partition_cols=["order_date", "region"],
)

Over-partitioning creates a “small files problem”—millions of tiny files that degrade query performance and increase S3 LIST costs. A useful rule of thumb: target partition files between 64 MB and 256 MB each.

2) Table format selection

Parquet alone

Plain Parquet directories work for read-heavy, append-only workloads. No transaction support, no schema evolution guarantees. Fine for raw zone dumps.

Delta Lake

Delta Lake wraps Parquet with a JSON-based transaction log (_delta_log/). Python access via the deltalake package:

from deltalake import DeltaTable, write_deltalake
import pyarrow as pa

# Write
write_deltalake(
    "s3://company-lake/cleaned/orders_delta/",
    table,
    mode="overwrite",
    partition_by=["order_date"],
)

# Read with time travel
dt = DeltaTable("s3://company-lake/cleaned/orders_delta/")
print(dt.version())  # current version
historical = dt.load_as_version(2)

Delta supports MERGE (upsert), Z-ordering for multi-column pruning, and VACUUM for removing old files.

Apache Iceberg

Iceberg stores metadata in manifest files that track every data file’s partition values, column statistics, and file size. Python access via pyiceberg:

from pyiceberg.catalog import load_catalog

catalog = load_catalog("glue", **{"type": "glue"})
table = catalog.load_table("analytics.orders")

# Scan with partition pruning happens automatically
scan = table.scan(row_filter="order_date = '2026-03-28'")
df = scan.to_pandas()

Iceberg’s hidden partitioning means consumers query by column values, not by physical directory structure. Schema evolution (add, rename, reorder columns) is a metadata-only operation.

3) Compaction and maintenance

Small files accumulate when streaming jobs or frequent micro-batches write to the lake. Compaction merges them:

# Delta Lake compaction
dt = DeltaTable("s3://company-lake/cleaned/orders_delta/")
dt.optimize.compact()
dt.vacuum(retention_hours=168)  # keep 7 days of history

For Iceberg, use rewrite_data_files via Spark or the REST catalog API. Schedule compaction as a daily Airflow task during off-peak hours.

Lifecycle policies

Set S3/GCS lifecycle rules to:

Transition raw zone to Glacier/Coldline after 90 days.
Delete temporary staging files after 7 days.
Keep cleaned/curated in Standard tier for active querying.

import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="company-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-90d",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            },
        ],
    },
)

4) Metadata and cataloging

Automated schema registration

After every pipeline run, register or update the schema in your catalog:

import pyarrow.parquet as pq
import json

schema = pq.read_schema("s3://company-lake/cleaned/orders/order_date=2026-03-28/part-00000.parquet")

catalog_entry = {
    "table": "orders",
    "zone": "cleaned",
    "columns": [
        {"name": f.name, "type": str(f.type), "nullable": f.nullable}
        for f in schema
    ],
    "updated_at": "2026-03-28T06:30:00Z",
    "row_count": 150_000,
    "partition_keys": ["order_date", "region"],
}

with open("catalog/orders.json", "w") as fh:
    json.dump(catalog_entry, fh, indent=2)

For production, use a real catalog (Glue, Hive Metastore, DataHub) and push updates from your orchestrator’s post-task hooks.

Data quality gates

Quality checks belong between zones. A cleaned-zone writer should validate before promoting:

import pyarrow.parquet as pq

table = pq.read_table("staging/orders.parquet")
assert table.num_rows > 0, "Empty batch"
assert table.schema.field("order_id").type == pa.int64(), "order_id type drift"

null_rate = table.column("amount").null_count / table.num_rows
assert null_rate < 0.01, f"amount null rate {null_rate:.2%} exceeds 1% threshold"

5) Access patterns and query engines

Different consumers need different engines:

Consumer	Engine	Why
Data scientists (ad hoc)	DuckDB, Polars	Fast local queries on Parquet
BI dashboards	Trino, Athena	Federated SQL across zones
ML pipelines	Spark, Ray	Distributed processing at scale
Compliance audits	Delta/Iceberg time travel	Point-in-time snapshots

Python unifies access across all of these. A data scientist can prototype with DuckDB locally, then the same query logic scales to Trino in production.

import duckdb

conn = duckdb.connect()
result = conn.execute("""
    SELECT region, SUM(amount) as total_revenue
    FROM read_parquet('s3://company-lake/curated/daily_revenue/**/*.parquet')
    WHERE year = 2026 AND month = 3
    GROUP BY region
    ORDER BY total_revenue DESC
""").fetchdf()

6) Cost management

Data lake costs sneak up through three channels:

Storage: Raw data grows indefinitely. Use lifecycle policies and compression.
API calls: S3 LIST and GET requests are billed per-call. Reduce small files, use manifest-based formats (Delta, Iceberg) to avoid listing.
Compute: Scanning unpartitioned data wastes CPU. Partition pruning and column projection are your first line of defense.

Track costs per zone and per team using bucket-level tags and query engine cost attribution.

Production checklist

Three-zone layout with immutable raw zone
Columnar format (Parquet minimum) for cleaned and curated zones
Partition keys aligned to dominant query filters
Compaction job scheduled (daily or after high-ingest windows)
Metadata catalog updated automatically after each pipeline run
Lifecycle policies for cold storage and cleanup
Quality gates between zone promotions
Access controls per zone (raw = engineering, curated = analysts)
Cost monitoring dashboards by zone and team

One thing to remember: a data lake’s value comes from the metadata layer—without catalogs, quality gates, and lifecycle policies, raw storage is just an expensive backup nobody queries.

pythondata-lakedata-engineering