Shadow Deployment for ML Models in Python — Core Concepts

What Is Shadow Deployment?

Shadow deployment (also called dark launching or shadow mode) runs a new model alongside the production model on the same live traffic. The production model’s predictions are served to users. The shadow model’s predictions are logged for analysis but never served. This gives you real-world performance data without exposing any user to the new model’s output.

Why Not Just Use A/B Testing?

A/B tests expose real users to the new model. For high-stakes applications such as medical diagnosis, fraud detection, or autonomous driving, even a small percentage of bad predictions can cause real harm. Shadow deployment removes that exposure because the shadow model’s output never reaches a user.

Shadow deployment also surfaces operational issues that an A/B test would reveal only after users have been exposed to them:

  • Latency problems — the new model might be too slow under production load
  • Infrastructure failures — memory leaks, GPU errors, or serialization bugs
  • Data pipeline mismatches — features missing or formatted differently in production

How It Works

User Request → Production Model → Response to User
     ↓ (copy)
Shadow Model → Logged (not served)
     ↓
Comparison Engine → Metrics Dashboard

Both models receive identical input. Only the production model’s output reaches the user. The comparison engine logs both outputs for analysis.

What to Compare

Metric                     What It Reveals
Output agreement rate      How often shadow matches production
Prediction distribution    Whether shadow skews differently
Latency (p50, p95, p99)    Whether shadow meets serving SLAs
Error rate                 Crashes, timeouts, invalid outputs
Memory and CPU usage       Resource requirements at production scale
Feature completeness       Missing or null features in shadow path
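Several of these metrics can be computed directly from logged shadow records. The sketch below is illustrative only: the record fields (prod_output, shadow_output, shadow_error, shadow_latency_ms, features) are an assumed log schema, and percentiles use the simple nearest-rank method rather than interpolation. Prediction distribution and memory or CPU usage are left out because they depend on the output type and the monitoring stack.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile (p in 1..100) of a sorted, non-empty list."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(records):
    """Summarize shadow records that follow the assumed schema above."""
    if not records:
        raise ValueError("no shadow records to summarize")
    total = len(records)
    succeeded = [r for r in records if r["shadow_error"] is None]
    # Agreement is measured only over calls where the shadow returned a value.
    agreed = sum(r["shadow_output"] == r["prod_output"] for r in succeeded)
    # Latency covers every call, since timeouts also count against the SLA.
    latencies = sorted(r["shadow_latency_ms"] for r in records)
    # A record counts as incomplete if any feature value is None.
    null_features = sum(
        any(v is None for v in r["features"].values()) for r in records
    )
    return {
        "agreement_rate": agreed / len(succeeded) if succeeded else None,
        "error_rate": (total - len(succeeded)) / total,
        "latency_ms": {f"p{p}": percentile(latencies, p) for p in (50, 95, 99)},
        "null_feature_rate": null_features / total,
    }
```

Exact-match agreement suits classifiers; for regression or ranking outputs a tolerance or rank-correlation measure would be a more sensible comparison.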

When to Use Shadow Deployment

  • Before the first A/B test — validate that the model works at all in production before sending it real traffic
  • Major model architecture changes — switching from logistic regression to a deep learning model
  • Regulated environments — healthcare, finance, where any production change needs evidence
  • Critical systems — search ranking, fraud detection, ad bidding

Limitations

Shadow deployment cannot measure user behavior changes. If the new model would show different recommendations, you cannot measure whether users would click on them — because users never see the shadow output. For behavioral metrics, you eventually need an A/B test.

Shadow deployment also roughly doubles inference cost during the testing period, since two models process every request. The exact overhead depends on how the shadow model’s resource needs compare with production’s.

Common Misconception

Shadow deployment is not the same as canary deployment. A canary sends real traffic to a new model for a small percentage of users — they actually see the new model’s output. Shadow deployment never serves the shadow model’s output to anyone. Canary tests the deployment; shadow tests the model.
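The difference is easiest to see in routing code. The sketch below is a simplified contrast, not a reference implementation: in_canary, serve_canary, and serve_shadow are hypothetical names, the hash-based bucketing is one common way to assign users and is assumed here, and the shadow call is made inline for brevity where a real service would usually run it off the request path.

```python
import hashlib


def in_canary(user_id, percent):
    """Deterministically place about `percent`% of users in the canary group."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100 < percent


def serve_canary(user_id, features, prod_model, new_model, percent=5):
    """Canary: users in the canary group are served the new model's output."""
    model = new_model if in_canary(user_id, percent) else prod_model
    return model(features)


def serve_shadow(features, prod_model, new_model, log):
    """Shadow: every user gets the production output; the new output is only logged."""
    response = prod_model(features)
    try:
        log.append({"prod": response, "shadow": new_model(features)})
    except Exception as exc:  # a shadow failure must not change the response
        log.append({"prod": response, "shadow_error": repr(exc)})
    return response
```

In serve_canary the return value can come from either model; in serve_shadow it always comes from prod_model, whatever the new model does.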

One thing to remember: Shadow deployment is the safest way to test a new model against real production traffic because no user ever sees or is affected by the shadow model’s predictions.

