Shadow Deployment for ML Models in Python — Core Concepts
What Is Shadow Deployment?
Shadow deployment (also called dark launching or shadow mode) runs a new model alongside the production model on the same live traffic. The production model’s predictions are served to users. The shadow model’s predictions are logged for analysis but never returned to users. This gives you real-world performance data without exposing users to the new model’s predictions.
Why Not Just Use A/B Testing?
A/B tests expose real users to the new model. For high-stakes applications such as medical diagnosis, fraud detection, or autonomous driving, even a small percentage of bad predictions can cause real harm. Shadow deployment removes that exposure because the shadow model’s output is never served, provided the shadow path is isolated so that its failures or slowness cannot affect the production response.
Shadow deployment also surfaces issues that an A/B test would reveal only by exposing users to them:
- Latency problems — the new model might be too slow under production load
- Infrastructure failures — memory leaks, GPU errors, or serialization bugs
- Data pipeline mismatches — features missing or formatted differently in production
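A data pipeline mismatch of the kind listed above can be surfaced with a simple check on the features that reach the shadow path. The following is a minimal sketch, not a production validator: the function name `check_features` and the example schema are hypothetical, and a real system would likely rely on a dedicated schema or validation library.

```python
from typing import Any, Mapping


def check_features(features: Mapping[str, Any],
                   expected: Mapping[str, type]) -> list[str]:
    """Return a list of problems found in one request's features.

    `expected` maps each required feature name to the Python type the
    model was trained on (a hypothetical, simplified schema).
    """
    problems = []
    for name, expected_type in expected.items():
        if name not in features or features[name] is None:
            # Feature is absent or null in the shadow path.
            problems.append(f"{name}: missing")
        elif not isinstance(features[name], expected_type):
            # Feature is present but formatted differently than in training.
            actual = type(features[name]).__name__
            problems.append(
                f"{name}: expected {expected_type.__name__}, got {actual}"
            )
    return problems


# Example schema and request (illustrative values only).
EXPECTED = {"amount": float, "country": str}
issues = check_features({"amount": "12.5"}, EXPECTED)
```

Here `issues` lists one type mismatch (`amount` arrived as a string) and one missing feature (`country`). Logging these per request gives an early signal of pipeline problems before the model's predictions are ever compared.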
How It Works
User Request → Production Model → Response to User
      ↓ (copy)
Shadow Model → Logged (not served)
      ↓
Comparison Engine → Metrics Dashboard
Both models receive identical input. Only the production model’s output reaches the user. The comparison engine logs both outputs for analysis.
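The flow above can be sketched in a few lines of Python. This is one possible arrangement under stated assumptions, not a reference implementation: models are treated as plain callables, `handle_request`, `score_shadow`, and `shadow_pool` are hypothetical names, and the log record fields are invented for illustration. The shadow model runs on a background thread pool so that its latency and failures stay off the request path.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Mapping

Model = Callable[[Mapping[str, Any]], Any]
LogFn = Callable[[dict], None]

# Hypothetical worker pool: shadow scoring runs here, off the request path.
shadow_pool = ThreadPoolExecutor(max_workers=4)


def score_shadow(shadow_model: Model, features: Mapping[str, Any],
                 production_output: Any, log: LogFn) -> None:
    """Run the shadow model and log its output next to production's."""
    start = time.perf_counter()
    shadow_output, error = None, None
    try:
        shadow_output = shadow_model(features)
    except Exception as exc:  # a shadow failure must never reach the user
        error = repr(exc)
    log({
        "production": production_output,
        "shadow": shadow_output,
        "shadow_latency_ms": (time.perf_counter() - start) * 1000,
        "error": error,
    })


def handle_request(features: Mapping[str, Any], production_model: Model,
                   shadow_model: Model, log: LogFn) -> Any:
    """Serve the production prediction; score the shadow in the background."""
    production_output = production_model(features)
    # Copy the input so both models see identical features.
    shadow_pool.submit(score_shadow, shadow_model, dict(features),
                       production_output, log)
    # Only the production output is returned to the caller.
    return production_output
```

In a real service the `log` callable would write to a durable store that the comparison engine reads. Running the shadow asynchronously is a design choice: a synchronous call would be simpler but would add the shadow model's latency to every user request.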
What to Compare
| Metric | What It Reveals |
|---|---|
| Output agreement rate | How often shadow matches production |
| Prediction distribution | Whether shadow skews differently |
| Latency (p50, p95, p99) | Whether shadow meets serving SLAs |
| Error rate | Crashes, timeouts, invalid outputs |
| Memory and CPU usage | Resource requirements at production scale |
| Feature completeness | Missing or null features in shadow path |
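Several of the metrics in the table can be computed directly from the logged comparison records. The sketch below assumes each record is a dict with the hypothetical fields `production`, `shadow`, `shadow_latency_ms`, and `error`; it uses a nearest-rank percentile for simplicity, whereas a monitoring system may interpolate differently.

```python
import math
from typing import Any, Iterable, Mapping


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(records: Iterable[Mapping[str, Any]]) -> dict:
    """Summarize agreement, error rate, and shadow latency."""
    records = list(records)
    # Requests where the shadow model produced an output without error.
    ok = [r for r in records if r["error"] is None]
    agree = sum(1 for r in ok if r["shadow"] == r["production"])
    latencies = [r["shadow_latency_ms"] for r in ok]
    return {
        "requests": len(records),
        "error_rate": (len(records) - len(ok)) / len(records) if records else 0.0,
        "agreement_rate": agree / len(ok) if ok else 0.0,
        "latency_ms": {p: percentile(latencies, p) for p in (50, 95, 99)}
        if latencies else {},
    }


# Illustrative records, not real measurements.
sample = [
    {"production": 1, "shadow": 1, "shadow_latency_ms": 10.0, "error": None},
    {"production": 0, "shadow": 1, "shadow_latency_ms": 20.0, "error": None},
    {"production": 1, "shadow": 1, "shadow_latency_ms": 30.0, "error": None},
    {"production": 1, "shadow": None, "shadow_latency_ms": 5.0, "error": "Timeout"},
]
summary = summarize(sample)
```

Exact equality works for class labels. For scores or probabilities, agreement would more plausibly be defined with a tolerance or by comparing the two prediction distributions.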
When to Use Shadow Deployment
- Before the first A/B test — validate that the model works at all in production before sending it real traffic
- Major model architecture changes — switching from logistic regression to a deep learning model
- Regulated environments — healthcare, finance, where any production change needs evidence
- Critical systems — search ranking, fraud detection, ad bidding
Limitations
Shadow deployment cannot measure user behavior changes. If the new model would show different recommendations, you cannot measure whether users would click on them — because users never see the shadow output. For behavioral metrics, you eventually need an A/B test.
Shadow deployment also roughly doubles model-serving cost during the testing period, since two models process every request.
Common Misconception
Shadow deployment is not the same as canary deployment. A canary sends real traffic to a new model for a small percentage of users — they actually see the new model’s output. Shadow deployment never serves the shadow model’s output to anyone. Canary tests the deployment; shadow tests the model.
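The difference is easiest to see in the routing logic. The following sketch contrasts the two patterns under simplifying assumptions: models are plain callables, all function names are hypothetical, and the canary bucket is chosen by hashing a user ID, which is one common approach among several.

```python
import hashlib
from typing import Any, Callable, Mapping

Model = Callable[[Mapping[str, Any]], Any]


def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically place a user in the canary bucket (0..100 percent)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def serve_canary(user_id: str, features: Mapping[str, Any],
                 production_model: Model, new_model: Model,
                 percent: int) -> Any:
    # Canary: a slice of users actually receives the new model's output.
    model = new_model if in_canary(user_id, percent) else production_model
    return model(features)


def serve_shadow(features: Mapping[str, Any], production_model: Model,
                 new_model: Model, log: Callable[[dict], None]) -> Any:
    # Shadow: both models score the request, but only production is returned.
    served = production_model(features)
    shadow_output, error = None, None
    try:
        shadow_output = new_model(features)
    except Exception as exc:  # keep shadow failures away from the user
        error = repr(exc)
    log({"production": served, "shadow": shadow_output, "error": error})
    return served
```

In `serve_canary` the return value depends on which bucket the user falls into. In `serve_shadow` the return value is always the production model's output, regardless of what the new model predicts or whether it fails. The shadow call is synchronous here only to keep the contrast short.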
One thing to remember: Shadow deployment is the safest way to test a new model against real production traffic because no user ever sees or is affected by the shadow model’s predictions.