Model Monitoring and Drift Detection in Python — Core Concepts

Detect data drift, concept drift, and prediction drift using statistical tests, monitoring dashboards, and automated alerting pipelines.

Why Models Degrade

A deployed model is a frozen snapshot of patterns learned from historical data. The real world does not freeze. Customer behavior shifts, market conditions change, upstream data pipelines break, and new categories appear. Without monitoring, these changes silently erode model accuracy.

LinkedIn reported that some of their models lost 10-20% accuracy within weeks of deployment simply because user behavior evolved faster than expected.

Types of Drift

Data Drift (Covariate Shift)

The distribution of input features changes. The model sees data that looks different from its training data.

Example: A loan approval model trained on applications from employed professionals starts receiving applications from gig workers with different income patterns.
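As a rough illustration of checking a single feature for data drift, the sketch below computes the two-sample Kolmogorov-Smirnov statistic and an approximate p-value using only the standard library. The function name ks_two_sample, the synthetic income figures, and the 0.01 cutoff are assumptions made for this example; in practice scipy.stats.ks_2samp is the more common choice.

```python
import math
import random


def ks_two_sample(reference, current):
    """Return (D, approximate p-value) for the two-sample KS test."""
    ref, cur = sorted(reference), sorted(current)
    n, m = len(ref), len(cur)
    i = j = 0
    d = 0.0
    # Walk both sorted samples and track the largest gap between
    # the two empirical CDFs.
    while i < n and j < m:
        x = min(ref[i], cur[j])
        while i < n and ref[i] <= x:
            i += 1
        while j < m and cur[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    # Asymptotic Kolmogorov approximation; reasonable for large samples only.
    n_eff = n * m / (n + m)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    if lam < 0.2:
        return d, 1.0  # the series is unstable here; the p-value is ~1
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2) for k in range(1, 101)
    )
    return d, min(max(p, 0.0), 1.0)


# Synthetic stand-ins for the loan example: training-time incomes versus
# incomes seen in production (all numbers are invented for illustration).
random.seed(0)
employed_income = [random.gauss(60_000, 8_000) for _ in range(2_000)]
gig_income = [random.gauss(45_000, 15_000) for _ in range(2_000)]

d_stat, p_value = ks_two_sample(employed_income, gig_income)
drift_detected = p_value < 0.01  # assumed significance cutoff
print(f"KS statistic={d_stat:.3f}, p-value={p_value:.3g}, drift={drift_detected}")
```

With many features, running one test per feature inflates false alarms, so teams often tighten the cutoff or combine the p-value with an effect-size threshold on the statistic itself.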

Concept Drift

The relationship between inputs and outputs changes. Even if the features look the same, what they mean has shifted.

Example: During COVID-19, the relationship between restaurant location and revenue changed fundamentally, as downtown locations went from a premium to a liability.

Prediction Drift

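The distribution of model outputs changes, even if each input feature’s distribution looks stable on its own. This can signal a shift in how features interact that per-feature checks miss, or a change in label definitions that has reached the model through retraining.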
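One way to put a number on prediction drift is the Population Stability Index computed over binned model scores. The sketch below is a minimal standard-library version; the function name psi, the ten equal-width bins, the synthetic score distributions, and the 0.1 / 0.25 cutoffs (a widely quoted rule of thumb, not a statistical guarantee) are all assumptions for illustration.

```python
import math
import random


def psi(reference, current, n_bins=10, lo=0.0, hi=1.0, eps=1e-6):
    """Population Stability Index over equal-width bins on [lo, hi]."""

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / (hi - lo) * n_bins)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values
        # eps keeps empty bins from causing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Synthetic probability scores: a reference window and a shifted live window.
random.seed(1)
train_scores = [random.betavariate(2, 5) for _ in range(5_000)]
live_scores = [random.betavariate(3, 4) for _ in range(5_000)]

psi_value = psi(train_scores, live_scores)
if psi_value < 0.1:
    status = "stable"
elif psi_value < 0.25:
    status = "moderate shift"
else:
    status = "significant shift"
print(f"PSI={psi_value:.3f} ({status})")
```

The same function can be applied to a numeric input feature by passing that feature's value range as lo and hi.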

Label Drift

The distribution of ground truth labels changes over time. A fraud detection model trained when fraud was 1% of transactions may degrade when fraud rises to 3%.
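Once labels do arrive, a change in the positive rate can be checked with a two-proportion z-test. This is a sketch under assumed counts that mirror the 1% and 3% fraud rates above; the function name label_rate_shift and the sample sizes are hypothetical.

```python
import math


def label_rate_shift(ref_pos, ref_n, cur_pos, cur_n):
    """Two-proportion z-test for a change in positive-label rate.

    Returns (reference rate, current rate, z statistic, two-sided p-value).
    """
    p_ref = ref_pos / ref_n
    p_cur = cur_pos / cur_n
    pooled = (ref_pos + cur_pos) / (ref_n + cur_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / ref_n + 1 / cur_n))
    z = (p_cur - p_ref) / se if se > 0 else 0.0
    # Two-sided p-value under the normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_ref, p_cur, z, p_value


# Assumed counts: 100 frauds in 10,000 training-era transactions (1%)
# versus 150 frauds in 5,000 recent labelled transactions (3%).
p_ref, p_cur, z, p_value = label_rate_shift(100, 10_000, 150, 5_000)
print(f"rate {p_ref:.1%} -> {p_cur:.1%}, z={z:.2f}, p={p_value:.3g}")
```

The normal approximation behind this test assumes reasonably large counts in both windows; for very rare labels an exact test is a safer choice.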

How to Detect Drift

Method	What It Detects	Speed and Notes
Population Stability Index (PSI)	Distribution shift in features or predictions	Fast, simple
Kolmogorov-Smirnov test	Statistical difference between two distributions	Fast, per-feature
Jensen-Shannon divergence	Symmetric distance between distributions	Fast
Performance monitoring	Accuracy/F1 drop when labels are available	Slow (needs ground truth)
Page-Hinkley test	Abrupt changes in a streaming metric	Real-time
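Two of the methods in the table are short enough to sketch directly. The block below shows a Jensen-Shannon divergence over binned histograms and a Page-Hinkley detector for upward shifts in a streaming metric. The class and function names, the delta and threshold defaults, and the synthetic histograms and error stream are assumptions for illustration, not tuned values.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""

    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mid = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


class PageHinkley:
    """Page-Hinkley test for an increase in the mean of a stream."""

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta          # tolerated drift magnitude
        self.threshold = threshold  # alarm level (lambda)
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0
        self.min_cum = 0.0

    def update(self, x):
        """Feed one observation; return True when a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n  # running mean
        self.cum += x - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return self.cum - self.min_cum > self.threshold


# Binned feature histograms expressed as fractions that sum to 1.
train_hist = [0.10, 0.20, 0.40, 0.20, 0.10]
live_hist = [0.05, 0.10, 0.30, 0.30, 0.25]
print(f"JS divergence: {js_divergence(train_hist, live_hist):.4f} bits")

# A synthetic error metric that jumps from 0.1 to 0.9 at step 200.
error_stream = [0.1] * 200 + [0.9] * 200
detector = PageHinkley()
alarm_at = None
for step, err in enumerate(error_stream):
    if detector.update(err):
        alarm_at = step
        break
print(f"Page-Hinkley alarm at step {alarm_at}")
```

Note that scipy.spatial.distance.jensenshannon returns the Jensen-Shannon distance, which is the square root of the divergence computed here, so thresholds are not interchangeable between the two.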

Monitoring Architecture

A typical monitoring system has three layers:

Data collection — log every prediction request (inputs, outputs, timestamps) to a store
Analysis — periodically compare recent data distributions against a reference (training data or a recent “known good” window)
Alerting — trigger notifications when drift metrics exceed thresholds
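The three layers can be sketched in a few lines of Python. This is a toy, in-memory version: PredictionLog, analyze, and alert are hypothetical names, the drift metric is a deliberately simple mean shift measured in reference standard deviations, and the 0.5 threshold is an assumption. A real system would write to a database or log stream and run the analysis on a schedule.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class PredictionLog:
    """Layer 1: data collection (kept in memory for this sketch)."""

    records: deque = field(default_factory=lambda: deque(maxlen=10_000))

    def log(self, features: dict, prediction: float) -> None:
        self.records.append(
            {"ts": time.time(), "features": features, "prediction": prediction}
        )


def analyze(log, reference_mean, reference_std, window=500):
    """Layer 2: shift of the recent mean prediction, in reference std units."""
    recent = [r["prediction"] for r in list(log.records)[-window:]]
    if not recent:
        return 0.0
    return abs(mean(recent) - reference_mean) / reference_std


def alert(metric, threshold=0.5, notify=print):
    """Layer 3: notify when the drift metric exceeds the threshold."""
    if metric > threshold:
        notify(f"Drift alert: metric {metric:.2f} exceeds threshold {threshold}")
        return True
    return False


# Simulated traffic: live predictions sit well above the reference mean.
store = PredictionLog()
for _ in range(500):
    store.log({"amount": 120.0}, prediction=0.8)

shift = analyze(store, reference_mean=0.3, reference_std=0.2)
fired = alert(shift)
```

Passing notify as a parameter keeps the alerting channel (email, chat webhook, pager) swappable without touching the analysis code.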

What to Monitor

Input feature distributions — per-feature summary statistics and histograms
Prediction distributions — mean, variance, and shape of model outputs
Missing value rates — sudden increases signal upstream pipeline issues
Latency — serving time spikes may indicate infrastructure or data issues
Business metrics — click-through rate, conversion rate, or other downstream KPIs
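Of these signals, the missing value rate is among the cheapest to check. The sketch below compares per-feature missing rates between a reference batch and a current batch; the function names, the feature names, and the 5 percentage point tolerance are assumptions for illustration.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_rate(rows, feature):
    """Fraction of rows where the feature is absent, None, or NaN."""
    if not rows:
        return 0.0
    return sum(is_missing(row.get(feature)) for row in rows) / len(rows)


def check_missing_rates(reference_rows, current_rows, features, max_increase=0.05):
    """Return {feature: (reference rate, current rate)} for suspicious jumps."""
    alerts = {}
    for name in features:
        ref = missing_rate(reference_rows, name)
        cur = missing_rate(current_rows, name)
        if cur - ref > max_increase:
            alerts[name] = (ref, cur)
    return alerts


# Synthetic batches: "income" starts arriving empty for 30% of live requests.
reference_rows = [{"income": 50_000.0, "age": 40} for _ in range(100)]
current_rows = [{"income": None, "age": 35} for _ in range(30)] + [
    {"income": 52_000.0, "age": 41} for _ in range(70)
]

alerts = check_missing_rates(reference_rows, current_rows, ["income", "age"])
for name, (ref, cur) in alerts.items():
    print(f"{name}: missing rate rose from {ref:.0%} to {cur:.0%}")
```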

The Ground Truth Delay Problem

The hardest part of monitoring is that ground truth labels often arrive late. A churn prediction model needs months to know if a customer actually churned. A credit risk model may wait years for default data.

This is why input monitoring (data drift) matters so much — it catches problems without waiting for labels.

Common Misconception

Many teams think monitoring means checking model accuracy on a test set once a month. That is evaluation, not monitoring. Monitoring is continuous, automated, and operates on live production data — catching degradation in hours or days, not months.

One thing to remember: Monitor your model’s inputs and outputs continuously because accuracy measured at training time says nothing about how the model performs as the world changes around it.
