Model Monitoring and Drift Detection in Python — Core Concepts
Why Models Degrade
A deployed model is a frozen snapshot of patterns learned from historical data. The real world does not freeze. Customer behavior shifts, market conditions change, upstream data pipelines break, and new categories appear. Without monitoring, these changes silently erode model accuracy.
LinkedIn reported that some of their models lost 10-20% accuracy within weeks of deployment simply because user behavior evolved faster than expected.
Types of Drift
Data Drift (Covariate Shift)
The distribution of input features changes. The model sees data that looks different from its training data.
Example: A loan approval model trained on applications from employed professionals starts receiving applications from gig workers with different income patterns.
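A shift like this can be checked per feature with a two-sample Kolmogorov-Smirnov test. The sketch below is a minimal pure-Python version, assuming a numeric feature and the standard large-sample critical value; the function names and the synthetic income numbers are illustrative, not taken from any real dataset. In practice `scipy.stats.ks_2samp` is the usual library choice.

```python
import math
import random


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    n, m = len(ref), len(cur)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(ref[i], cur[j])
        # Step past every value equal to x in both samples (handles ties).
        while i < n and ref[i] == x:
            i += 1
        while j < m and cur[j] == x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_drifted(reference, current, alpha=0.05):
    """Flag drift when the statistic exceeds the asymptotic critical value."""
    n, m = len(reference), len(current)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))  # about 1.358 for alpha=0.05
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    d = ks_statistic(reference, current)
    return d > critical, d


# Synthetic monthly incomes (hypothetical numbers, for illustration only).
random.seed(0)
salaried = [random.gauss(5000, 500) for _ in range(1000)]
gig = [random.gauss(3500, 1500) for _ in range(1000)]

drifted, statistic = ks_drifted(salaried, gig)
print(f"drifted={drifted}, D={statistic:.3f}")
```

The critical value is a large-sample approximation, so with very small windows an exact test is the safer option.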
Concept Drift
The relationship between inputs and outputs changes. Even if the features look the same, what they mean has shifted.
Example: During COVID-19, the relationship between restaurant location and revenue fundamentally changed — downtown locations went from premium to liability.
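Once labels arrive, concept drift usually shows up as a rise in the model's error. One way to watch for that in a stream is the Page-Hinkley test, which raises an alarm when the mean of a metric increases. The class below is a minimal sketch; the `delta` and `threshold` defaults are illustrative assumptions and would need tuning for a real metric.

```python
class PageHinkley:
    """Page-Hinkley detector for an increase in the mean of a stream."""

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta          # tolerated drift per observation
        self.threshold = threshold  # alarm level (often written as lambda)
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0              # cumulative deviation from the running mean
        self.min_cum = 0.0          # smallest cumulative deviation seen so far

    def update(self, x):
        """Feed one observation; return True if a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return self.cum - self.min_cum > self.threshold


def first_alarm(stream, **kwargs):
    """Return the index of the first alarm, or None if no alarm fires."""
    detector = PageHinkley(**kwargs)
    for i, x in enumerate(stream):
        if detector.update(x):
            return i
    return None


# Per-prediction error indicator: always correct for 200 steps, then always wrong.
errors = [0.0] * 200 + [1.0] * 50
print(first_alarm(errors))
```

A stream that never changes produces no alarm, and the alarm on the stream above fires a few observations after the change point at index 200.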
Prediction Drift
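The distribution of model outputs changes, even if each input feature looks stable on its own. This can signal a change in how features combine (a shift in their joint distribution that per-feature checks miss) or a silent change in upstream preprocessing.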
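Prediction drift can be quantified by comparing a histogram of recent scores with a reference histogram. The sketch below uses Jensen-Shannon divergence with base-2 logarithms, so the value lies between 0 (identical) and 1 (no overlap). It assumes scores in the range 0 to 1; the helper names and example scores are hypothetical.

```python
import math


def histogram(values, bins=10, lo=0.0, hi=1.0):
    """Bin values into equal-width buckets and return proportions."""
    counts = [0] * bins
    for v in values:
        idx = int((v - lo) / (hi - lo) * bins)
        counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
    return [c / len(values) for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x == 0 contribute nothing; y is never 0 where x > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical model scores: a reference window and a recent window.
reference_scores = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45]
recent_scores = [0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90]

score = js_divergence(histogram(reference_scores), histogram(recent_scores))
print(f"JS divergence: {score:.3f}")
```

Note that `scipy.spatial.distance.jensenshannon` returns the Jensen-Shannon distance, which is the square root of this divergence, so thresholds are not interchangeable between the two.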
Label Drift
The distribution of ground truth labels changes over time. A fraud detection model trained when fraud was 1% of transactions may degrade when fraud rises to 3%.
How to Detect Drift
| Method | What It Detects | Speed and Notes |
|---|---|---|
| Population Stability Index (PSI) | Distribution shift in features or predictions | Fast, simple |
| Kolmogorov-Smirnov test | Statistical difference between two distributions | Fast, per-feature |
| Jensen-Shannon divergence | Symmetric distance between distributions | Fast |
| Performance monitoring | Accuracy/F1 drop when labels are available | Slow (needs ground truth) |
| Page-Hinkley test | Abrupt changes in a streaming metric | Real-time |
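As an illustration of the first row, PSI sums `(current - reference) * ln(current / reference)` over the bins of a histogram. The sketch below is one simple way to compute it, assuming equal-width bins taken from the reference range and a small floor for empty bins; both choices are assumptions, and quantile-based bins are a common alternative.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index of `current` against `reference`."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant features

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp values outside the range
        # Floor empty bins so the logarithm below stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


# Hypothetical feature values: the current window is shifted upward by 50.
reference = [float(v) for v in range(100)]
current = [v + 50.0 for v in reference]

print(f"PSI vs itself:  {psi(reference, reference):.3f}")
print(f"PSI vs shifted: {psi(reference, current):.3f}")
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift. Those cutoffs are conventions rather than statistical guarantees, so they are best calibrated against the feature's normal week-to-week variation.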
Monitoring Architecture
A typical monitoring system has three layers:
- Data collection — log every prediction request (inputs, outputs, timestamps) to a store
- Analysis — periodically compare recent data distributions against a reference (training data or a recent “known good” window)
- Alerting — trigger notifications when drift metrics exceed thresholds
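The three layers can be sketched in a few lines. Everything here is a hypothetical stand-in: `PredictionLog` replaces a real store, `mean_shift` is a placeholder for a proper drift metric such as PSI, and `notify` would be a pager or chat integration rather than `print`.

```python
import time
from dataclasses import dataclass, field


@dataclass
class PredictionLog:
    """Layer 1: an in-memory stand-in for a real prediction store."""
    records: list = field(default_factory=list)

    def log(self, features: dict, prediction: float) -> None:
        self.records.append(
            {"ts": time.time(), "features": features, "prediction": prediction}
        )

    def recent(self, n: int) -> list:
        return self.records[-n:]


def mean_shift(reference: list, current: list) -> float:
    """Placeholder drift metric: absolute difference between the two means."""
    return abs(sum(current) / len(current) - sum(reference) / len(reference))


def analyze(log: PredictionLog, reference: dict, window: int, metric=mean_shift) -> dict:
    """Layer 2: score each feature in the recent window against its reference."""
    recent = log.recent(window)
    return {
        name: metric(ref_values, [r["features"][name] for r in recent])
        for name, ref_values in reference.items()
    }


def alert(scores: dict, thresholds: dict, notify=print) -> list:
    """Layer 3: notify for every feature whose score exceeds its threshold."""
    fired = [name for name, score in scores.items() if score > thresholds[name]]
    for name in fired:
        notify(f"drift alert: {name} score={scores[name]:.3f}")
    return fired


# Usage: log ten requests, then run one analysis pass.
log = PredictionLog()
for _ in range(10):
    log.log({"age": 50.0}, prediction=0.5)

scores = analyze(log, reference={"age": [30.0, 30.0]}, window=10)
alert(scores, thresholds={"age": 5.0})
```

Keeping the metric pluggable means the analysis layer can switch between PSI, a KS test, or another measure without touching logging or alerting. In a real system the analysis pass would run on a schedule and would need to handle an empty window.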
What to Monitor
- Input feature distributions — per-feature summary statistics and histograms
- Prediction distributions — mean, variance, and shape of model outputs
- Missing value rates — sudden increases signal upstream pipeline issues
- Latency — serving time spikes may indicate infrastructure or data issues
- Business metrics — click-through rate, conversion rate, or other downstream KPIs
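For the first three items, a per-feature snapshot is often enough to start with. The sketch below computes a missing-value rate and basic summary statistics for one window of a numeric feature, treating both `None` and NaN as missing. The function names and the 5-point tolerance are illustrative assumptions.

```python
import math
import statistics


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def feature_snapshot(values):
    """Summarize one window of a numeric feature."""
    present = [v for v in values if not is_missing(v)]
    return {
        "count": len(values),
        "missing_rate": 1 - len(present) / len(values) if values else 0.0,
        "mean": statistics.fmean(present) if present else None,
        "stdev": statistics.pstdev(present) if present else None,
    }


def missing_rate_spiked(baseline_rate, current_rate, tolerance=0.05):
    """Flag a jump in missing values larger than `tolerance` (absolute)."""
    return current_rate - baseline_rate > tolerance


# Hypothetical windows: the current one lost half its values upstream.
baseline = feature_snapshot([1.0, 2.0, 3.0, 4.0])
current = feature_snapshot([1.0, None, 3.0, float("nan")])

print(current)
print(missing_rate_spiked(baseline["missing_rate"], current["missing_rate"]))
```

Storing these snapshots per time window also gives the analysis step a cheap history to compare against before running heavier distribution tests.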
The Ground Truth Delay Problem
The hardest part of monitoring is that ground truth labels often arrive late. A churn prediction model needs months to know if a customer actually churned. A credit risk model may wait years for default data.
This is why input monitoring (data drift) matters so much — it catches problems without waiting for labels.
Common Misconception
Many teams think monitoring means checking model accuracy on a test set once a month. That is evaluation, not monitoring. Monitoring is continuous, automated, and operates on live production data — catching degradation in hours or days, not months.
One thing to remember: Monitor your model’s inputs and outputs continuously because accuracy measured at training time says nothing about how the model performs as the world changes around it.
See Also
- Python A/B Testing ML Models: Why taste-testing two cookie recipes with different friends is the fairest way to pick a winner.
- Python Feature Store Design: Why a shared ingredient pantry saves every cook in the kitchen from buying the same spices over and over.
- Python ML Pipeline Orchestration: Why a factory assembly line needs a foreman to make sure every step happens in the right order at the right time.
- Python MLflow Experiment Tracking: Find out why writing down every cooking experiment helps you recreate the perfect recipe every time.
- Python Model Explainability SHAP: How asking 'why did you pick that answer?' turns a mysterious black box into something you can actually trust.