Feature Store Design in Python — Deep Dive

Feast: The Open-Source Feature Store

Feast (Feature Store) is the most widely adopted open-source feature store. It provides feature definition as code, materialization to online/offline stores, and consistent retrieval APIs.

Project Setup

pip install feast
feast init my_feature_repo
cd my_feature_repo

The generated structure:

my_feature_repo/
├── feature_store.yaml    # Infrastructure configuration
├── features.py           # Feature definitions
└── data/                 # Sample data sources

Defining Features

# features.py
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64, String
from datetime import timedelta

# Entity: the thing features describe
customer = Entity(
    name="customer_id",
    join_keys=["customer_id"],
    description="Unique customer identifier"
)

# Data source
customer_stats_source = FileSource(
    path="data/customer_stats.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created_timestamp"
)

# Feature view: a group of related features
customer_stats_fv = FeatureView(
    name="customer_stats",
    entities=[customer],
    ttl=timedelta(days=1),
    schema=[
        Field(name="purchase_count_30d", dtype=Int64),
        Field(name="avg_order_value", dtype=Float32),
        Field(name="days_since_last_purchase", dtype=Int64),
        Field(name="preferred_category", dtype=String),
    ],
    source=customer_stats_source,
    online=True,
    tags={"team": "growth", "owner": "ml-platform"}
)

Infrastructure Configuration

# feature_store.yaml
project: fraud_detection
provider: gcp
registry: gs://my-bucket/feast-registry.pb
online_store:
  type: redis
  connection_string: redis://10.0.0.1:6379
offline_store:
  type: bigquery
  dataset: feast_offline

Materialization

# Load historical data into the offline store and latest values into online
feast apply          # Register feature definitions
feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")

Materialization reads from the source, computes the latest value per entity, and writes to the online store. Running it incrementally processes only new data since the last materialization.

Training Data Retrieval with Point-in-Time Joins

from feast import FeatureStore
import pandas as pd

store = FeatureStore(repo_path=".")

# Entity dataframe: events we want features for
entity_df = pd.DataFrame({
    "customer_id": [1001, 1002, 1003, 1001],
    "event_timestamp": pd.to_datetime([
        "2025-06-15 10:00:00",
        "2025-06-15 11:00:00",
        "2025-06-16 09:00:00",
        "2025-06-17 14:00:00",  # Same customer, different time
    ])
})

# Point-in-time correct retrieval
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "customer_stats:purchase_count_30d",
        "customer_stats:avg_order_value",
        "customer_stats:days_since_last_purchase",
    ]
).to_df()

Each row gets feature values from the most recent feature row before its event timestamp. Customer 1001 appears twice with different timestamps and may receive different feature values — reflecting their state at each point in time.

Online Serving

# Low-latency retrieval for real-time inference
feature_vector = store.get_online_features(
    features=[
        "customer_stats:purchase_count_30d",
        "customer_stats:avg_order_value",
        "customer_stats:preferred_category",
    ],
    entity_rows=[{"customer_id": 1001}]
).to_dict()

# Returns: {"customer_id": [1001], "purchase_count_30d": [42], ...}

Online reads hit the Redis (or DynamoDB) store and typically return in 1-5ms — fast enough for serving paths that need sub-50ms total latency.

Feature Services: Bundling for Models

from feast import FeatureService

fraud_detection_service = FeatureService(
    name="fraud_detection_v2",
    features=[
        customer_stats_fv[["purchase_count_30d", "avg_order_value"]],
        transaction_stats_fv[["txn_amount_zscore", "merchant_risk_score"]],
    ],
    description="Features for fraud detection model v2",
    tags={"model": "fraud-detector-v2"}
)

Feature services pin the exact features a model needs. When the model gets retrained or updated, the feature service version documents which features it consumed — creating a reproducibility link.

Streaming Features

Batch materialization introduces latency — features are only as fresh as the last materialization run. For real-time use cases (fraud detection, dynamic pricing), streaming features update the online store continuously:

from feast import StreamFeatureView
from feast.data_sources import KafkaSource

txn_stream = KafkaSource(
    name="transaction_events",
    kafka_bootstrap_servers="broker:9092",
    topic="transactions",
    timestamp_field="event_timestamp",
    message_format=AvroFormat(schema_json=AVRO_SCHEMA)
)

realtime_txn_stats = StreamFeatureView(
    name="realtime_txn_stats",
    entities=[customer],
    ttl=timedelta(hours=1),
    schema=[
        Field(name="txn_count_last_hour", dtype=Int64),
        Field(name="txn_total_last_hour", dtype=Float32),
    ],
    source=txn_stream,
    aggregations=[
        Aggregation(column="amount", function="count", time_window=timedelta(hours=1)),
        Aggregation(column="amount", function="sum", time_window=timedelta(hours=1)),
    ],
    online=True
)

Design Decisions and Tradeoffs

Offline Store Choice

StoreLatencyCostBest For
BigQuery / RedshiftSecondsPay-per-queryLarge-scale historical joins
Parquet on S3MinutesStorage onlySmall teams, prototyping
Spark / DatabricksSeconds-minutesCluster costExisting Spark infrastructure

Online Store Choice

StoreP99 LatencyCostScaling
Redis<2msMemory-boundVertical + cluster
DynamoDB<5msPay-per-requestAutomatic
Bigtable<10msProvisionedNode-based
SQLite<1msFreeSingle-machine only

Feature Computation: Push vs Pull

Pull model — the feature store reads from source tables on a schedule. Simple but introduces staleness.

Push model — producers write features directly to the store when events happen. Lower latency but requires producers to know the feature schema.

Most production systems use pull for batch features and push for streaming features.

Monitoring Feature Health

Feature stores need observability:

# Check feature freshness
from datetime import datetime, timedelta

def check_feature_freshness(store, feature_view_name, max_age: timedelta):
    """Alert if features are staler than max_age."""
    fv = store.get_feature_view(feature_view_name)
    last_materialized = fv.materialization_intervals[-1].end_date

    age = datetime.utcnow() - last_materialized
    if age > max_age:
        return {
            "status": "stale",
            "feature_view": feature_view_name,
            "age_hours": age.total_seconds() / 3600,
            "threshold_hours": max_age.total_seconds() / 3600
        }
    return {"status": "fresh", "feature_view": feature_view_name}

Key metrics to track:

  • Freshness — time since last materialization
  • Coverage — percentage of entities with non-null values
  • Distribution drift — statistical distance from training-time distributions
  • Serving latency — online store read times

Real-World Architecture: DoorDash

DoorDash processes millions of delivery orders daily and uses a feature store to power models for delivery time estimation, fraud detection, and merchant ranking. Their architecture:

  1. Raw events stream through Kafka
  2. Flink jobs compute real-time features (running delivery count, recent cancellation rate)
  3. Features land in Redis for online serving and S3 for offline training
  4. A metadata layer tracks feature ownership, SLAs, and schema versions

The key insight: they compute features once in Flink and serve them to 50+ models through the same online store — eliminating per-model computation and guaranteeing consistency.

One thing to remember: A well-designed feature store separates feature computation from feature consumption, ensuring every model — whether training offline or serving in real time — sees identical, point-in-time correct data.

pythonfeature-storemlopsmachine-learning

See Also