Model Monitoring and Drift Detection in Python — Core Concepts

Why Models Degrade

A deployed model is a frozen snapshot of patterns learned from historical data. The real world does not freeze. Customer behavior shifts, market conditions change, upstream data pipelines break, and new categories appear. Without monitoring, these changes silently erode model accuracy.

LinkedIn reported that some of their models lost 10-20% accuracy within weeks of deployment simply because user behavior evolved faster than expected.

Types of Drift

Data Drift (Covariate Shift)

The distribution of input features changes. The model sees data that looks different from its training data.

Example: A loan approval model trained on applications from employed professionals starts receiving applications from gig workers with different income patterns.
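A shift like this can be quantified per feature. The sketch below computes the two-sample Kolmogorov-Smirnov statistic with the standard library only. The function name `ks_statistic` and the income figures are hypothetical, chosen to mirror the loan example. In practice, `scipy.stats.ks_2samp` reports the same statistic together with a p-value.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical CDFs of the two samples (0 = identical, 1 = fully separated)."""
    ref = sorted(reference)
    cur = sorted(current)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")
    max_gap = 0.0
    # Both CDFs are step functions, so the gap can only change at observed values.
    for x in set(ref) | set(cur):
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


# Hypothetical monthly incomes: training-era applicants vs. recent applicants.
training_income = [4200, 4500, 4800, 5000, 5200, 5500, 6000, 6500]
recent_income = [900, 1500, 2100, 2600, 3400, 4100, 5100, 7000]
income_shift = ks_statistic(training_income, recent_income)
```

A value near 0 means the two samples look alike. How large a value should trigger an alert depends on sample size and on how sensitive the model is to that feature.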

Concept Drift

The relationship between inputs and outputs changes. Even if the features look the same, what they mean has shifted.

Example: During COVID-19, the relationship between restaurant location and revenue fundamentally changed — downtown locations went from premium to liability.

Prediction Drift

The distribution of model outputs changes, even if each input feature's distribution seems stable on its own. This can signal shifts in feature interactions that per-feature checks miss, or label definition changes.
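One way to track this is to bin the model's scores and compare the recent histogram with a reference one. The sketch below uses the Jensen-Shannon divergence and assumes scores lie in [0, 1]. The helper names `histogram` and `js_divergence` are illustrative, not from any library. Note that `scipy.spatial.distance.jensenshannon` returns the square root of this quantity (the distance), so the two are not directly interchangeable.

```python
import math


def histogram(scores, bins=10):
    """Share of scores in each of `bins` equal-width buckets over [0, 1]."""
    if not scores:
        raise ValueError("scores must be non-empty")
    counts = [0] * bins
    for s in scores:
        idx = min(int(s * bins), bins - 1)  # a score of exactly 1.0 goes in the last bucket
        counts[idx] += 1
    return [c / len(scores) for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical distributions,
    1 for distributions with no overlap."""
    def kl(a, b):
        # Terms with a == 0 contribute nothing; b is the mixture, so b > 0 whenever a > 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)
```

Because the divergence is symmetric and bounded, the same threshold can be reused across models, which is harder to do with an unbounded measure.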

Label Drift

The distribution of ground truth labels changes over time. A fraud detection model trained when fraud was 1% of transactions may degrade when fraud rises to 3%.
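Whether a change in the positive rate is more than noise can be checked with a pooled two-proportion z-test. This is a minimal sketch: the function name `label_rate_z` and the window sizes of 10,000 transactions are assumptions made for illustration, matching the 1% and 3% rates in the example.

```python
import math


def label_rate_z(ref_pos, ref_total, cur_pos, cur_total):
    """z-score for the difference between two positive-label rates
    (pooled two-proportion z-test)."""
    p_ref = ref_pos / ref_total
    p_cur = cur_pos / cur_total
    pooled = (ref_pos + cur_pos) / (ref_total + cur_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / ref_total + 1 / cur_total))
    if se == 0:
        return 0.0  # both windows are entirely positive or entirely negative
    return (p_cur - p_ref) / se


# Hypothetical counts: 1% fraud in the reference window, 3% in the current one.
z = label_rate_z(ref_pos=100, ref_total=10_000, cur_pos=300, cur_total=10_000)
```

With windows this large the z-score comes out around 10, far beyond what sampling noise would explain. With only a few hundred labels per window, the same rates would give a much weaker signal.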

How to Detect Drift

Method                            What It Detects                                   Speed
Population Stability Index (PSI)  Distribution shift in features or predictions     Fast, simple
Kolmogorov-Smirnov test           Statistical difference between two distributions  Fast, per-feature
Jensen-Shannon divergence         Symmetric distance between distributions          Fast
Performance monitoring            Accuracy/F1 drop when labels are available        Slow (needs ground truth)
Page-Hinkley test                 Abrupt changes in a streaming metric              Real-time
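As an illustration of the first method listed, here is a minimal PSI sketch using only the standard library. It assumes equal-width bins over the reference range (quantile bins are another common choice) and a small `eps` floor for empty bins. The function name `psi` and both defaults are choices made for this example.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    PSI = sum over bins of (current_share - reference_share)
          * ln(current_share / reference_share)
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant feature

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range are clamped into the end bins.
            counts[max(0, min(idx, bins - 1))] += 1
        # The eps floor keeps empty bins from causing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable, 0.1 to 0.25 as a moderate shift, and above 0.25 as a significant one. These cutoffs are conventions rather than statistical guarantees, and the result depends on the binning.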

Monitoring Architecture

A typical monitoring system has three layers:

  1. Data collection — log every prediction request (inputs, outputs, timestamps) to a store
  2. Analysis — periodically compare recent data distributions against a reference (training data or a recent “known good” window)
  3. Alerting — trigger notifications when drift metrics exceed thresholds
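The three layers above can be sketched in a few functions. Everything here is a simplified assumption: the store is an in-memory list rather than a database, the drift score is a plain mean shift rather than one of the tests listed earlier, and the names `log_prediction`, `mean_shift`, and `check_and_alert` are hypothetical.

```python
import statistics
import time


def log_prediction(store, features, prediction):
    """Layer 1: append one prediction record to a store (here, a plain list)."""
    store.append({"ts": time.time(), "features": features, "prediction": prediction})


def mean_shift(reference, recent):
    """Layer 2: a deliberately simple drift score, the shift in the mean
    measured in reference standard deviations."""
    spread = statistics.pstdev(reference) or 1.0  # avoid dividing by zero
    return abs(statistics.fmean(recent) - statistics.fmean(reference)) / spread


def check_and_alert(store, reference, feature, threshold=0.5, notify=print):
    """Layer 3: compare logged values with the reference and notify on a breach."""
    recent = [rec["features"][feature] for rec in store]
    if not recent:
        return 0.0  # nothing logged yet, nothing to compare
    score = mean_shift(reference, recent)
    if score > threshold:
        notify(f"drift alert: {feature} score={score:.2f} exceeds {threshold}")
    return score
```

In a real deployment the analysis step would run on a schedule over a recent time window, and `notify` would post to a paging or chat system instead of printing.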

What to Monitor

  • Input feature distributions — per-feature summary statistics and histograms
  • Prediction distributions — mean, variance, and shape of model outputs
  • Missing value rates — sudden increases signal upstream pipeline issues
  • Latency — serving time spikes may indicate infrastructure or data issues
  • Business metrics — click-through rate, conversion rate, or other downstream KPIs
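Streaming metrics such as latency or missing value rate suit the Page-Hinkley test listed among the detection methods. The class below is a minimal sketch for detecting an upward shift in a stream's mean. The `delta` and `threshold` defaults are placeholders that would need tuning to the scale of the metric being watched.

```python
class PageHinkley:
    """Page-Hinkley test for an upward shift in the mean of a stream."""

    def __init__(self, delta=0.005, threshold=50.0):
        self.delta = delta          # tolerated magnitude of change
        self.threshold = threshold  # alarm level (often written as lambda)
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0      # cumulative deviation from the running mean
        self.cum_min = 0.0  # smallest cumulative deviation seen so far

    def update(self, x):
        """Feed one observation; return True when a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.threshold


# Hypothetical latency stream in milliseconds: steady at 100, then a jump to 130.
detector = PageHinkley(delta=1.0, threshold=50.0)
stream = [100.0] * 50 + [130.0] * 10
alarms = [i for i, value in enumerate(stream) if detector.update(value)]
```

A larger `threshold` reduces false alarms at the cost of slower detection. After an alarm, the detector would normally be re-created so it can track the new level.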

The Ground Truth Delay Problem

The hardest part of monitoring is that ground truth labels often arrive late. A churn prediction model needs months to know if a customer actually churned. A credit risk model may wait years for default data.

This is why input monitoring (data drift) matters so much — it catches problems without waiting for labels.
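When labels do arrive, they can be joined back to the logged predictions by request ID. The sketch below reports accuracy over the labelled subset only, plus the share of predictions that have a label so far. The function name `delayed_accuracy`, the dictionary layout, and the sample IDs are assumptions for illustration.

```python
def delayed_accuracy(predictions, labels):
    """Accuracy over the predictions whose labels have arrived so far.

    predictions: {request_id: predicted_label}
    labels:      {request_id: true_label}, filled in as ground truth arrives
    Returns (accuracy or None, share of predictions that have a label).
    """
    matched = [rid for rid in predictions if rid in labels]
    if not matched:
        return None, 0.0
    correct = sum(predictions[rid] == labels[rid] for rid in matched)
    return correct / len(matched), len(matched) / len(predictions)


# Hypothetical churn predictions; only two customers have a confirmed outcome.
predictions = {"a": 1, "b": 0, "c": 1, "d": 0}
labels = {"a": 1, "b": 1}
accuracy, coverage = delayed_accuracy(predictions, labels)
```

Reporting coverage next to accuracy matters because early labels are rarely a random sample. Customers who churn quickly, for example, are labelled sooner than those who stay.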

Common Misconception

Many teams think monitoring means checking model accuracy on a test set once a month. That is evaluation, not monitoring. Monitoring is continuous, automated, and operates on live production data — catching degradation in hours or days, not months.

One thing to remember: Monitor your model’s inputs and outputs continuously because accuracy measured at training time says nothing about how the model performs as the world changes around it.
