MLOps — Deep Dive
Hidden Technical Debt in ML Systems
Sculley et al. (2015) “Hidden Technical Debt in Machine Learning Systems” is the seminal paper on why ML systems accumulate complexity that software best practices don’t address.
The key observation: the ML model code is a small fraction of the overall system. The paper introduced the “CACE principle” (Changing Anything Changes Everything) — in an ML system, changing data, hyperparameters, preprocessing, evaluation metrics, or infrastructure can all affect model behavior in non-obvious ways.
Specific debt patterns:
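Glue code: In the paper’s sense, the supporting code written to move data into and out of general-purpose packages, which often dwarfs the model code itself. A common variant: data scientists prototype in notebooks, then a different team productionizes in a different codebase, and discrepancies accumulate (different preprocessing, different handling of edge cases).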
Pipeline jungles: Multiple data pipelines for different model versions, feature groups, or teams. No clear ownership; duplicated logic; inconsistent results.
Dead experimental code paths: A/B test code, experimental features, and old model versions never get cleaned up. As a result, roughly 10% of the code stays on the production path with unclear impact.
Undeclared consumers: Other systems quietly depend on a model’s output structure. When the model is updated, these break silently.
Feedback loops: Model outputs affect user behavior, which generates future training data. Closed feedback loops can cause models to reinforce biases (e.g., a content recommender that shows increasingly extreme content because engagement is slightly higher).
Train-Serve Skew: The Silent Killer
Train-serve skew occurs when the features computed at inference time differ from those used during training. This can cause model performance to degrade dramatically without obvious errors.
Common causes:
- Training data was preprocessed in Python (Pandas); inference feature computation is in Java
- Null/missing value handling differs between training and serving code
- Feature computation uses different time windows (7-day average in training vs. trailing 24 hours in serving)
- Normalization statistics (mean, std) are computed on training data, but the production distribution has shifted by serving time
Detection: Log model inputs at serving time. Periodically compare the distribution of serving inputs to training inputs using statistical tests (KS test, PSI). Alert when divergence exceeds thresholds.
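The comparison step described above can be sketched in a few lines. This is a minimal, standard-library illustration of the two-sample KS statistic (the largest gap between the two empirical CDFs); the function names and the 0.2 alert threshold are assumptions chosen for the example, not values from any particular monitoring tool.

```python
from bisect import bisect_right


def ks_statistic(train_values, serving_values):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    train_sorted = sorted(train_values)
    serving_sorted = sorted(serving_values)
    n, m = len(train_sorted), len(serving_sorted)
    max_gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in set(train_sorted) | set(serving_sorted):
        cdf_train = bisect_right(train_sorted, x) / n
        cdf_serving = bisect_right(serving_sorted, x) / m
        max_gap = max(max_gap, abs(cdf_train - cdf_serving))
    return max_gap


def skew_alert(train_values, serving_values, threshold=0.2):
    """Return (statistic, should_alert). The threshold is a placeholder."""
    stat = ks_statistic(train_values, serving_values)
    return stat, stat > threshold


# Hypothetical feature samples: serving values are offset from training values.
training_sample = list(range(100))
serving_sample = [x + 50 for x in range(100)]
statistic, should_alert = skew_alert(training_sample, serving_sample)
```

In practice a library routine that also returns a p-value would usually replace the hand-rolled statistic; the sketch only shows what is being compared.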
Prevention: Keep a single feature computation code path, using the same code (or code generated from the same specification) for both training and serving. Feature stores enforce this when their offline (training) and online (serving) SDKs read from the same feature definitions, so one definition generates both training and serving features.
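A minimal sketch of the single-code-path idea follows, assuming a hypothetical event format with `recent_amounts` and `country` fields. The point is only that the training job and the serving handler both call the same function, so null handling and aggregation cannot drift apart.

```python
from statistics import mean


def compute_features(raw_event):
    """Single source of truth for feature logic, shared by training and serving."""
    amounts = raw_event.get("recent_amounts") or []
    return {
        # Missing or empty history is handled in exactly one place.
        "avg_amount_7d": mean(amounts) if amounts else 0.0,
        "txn_count_7d": float(len(amounts)),
        "country_known": 0.0 if raw_event.get("country") is None else 1.0,
    }


def build_training_rows(historical_events):
    """Offline path: batch over logged events."""
    return [compute_features(event) for event in historical_events]


def serve_features(request_event):
    """Online path: one event per request, same function underneath."""
    return compute_features(request_event)
```

When the two paths must live in different languages, generating both from one specification serves the same purpose as sharing the function.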
Google’s TFX (TensorFlow Extended) enforces schema consistency: features are validated against a schema at training time, and the same validation runs at serving time. Schema violations cause explicit failures rather than silent performance degradation.
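The fail-loudly behavior can be illustrated without any framework. The sketch below is not the TFX API; the `SCHEMA` dictionary, the `SchemaViolation` exception, and the feature names are all hypothetical, and the same `validate` function is assumed to be called from both the training pipeline and the serving path.

```python
# Hypothetical schema: expected type, optional range, and whether it is required.
SCHEMA = {
    "age": {"type": float, "min": 0.0, "max": 120.0, "required": True},
    "country": {"type": str, "required": True},
    "referrer": {"type": str, "required": False},
}


class SchemaViolation(ValueError):
    """Raised so that bad inputs fail explicitly instead of degrading silently."""


def validate(features, schema=SCHEMA):
    for name, rule in schema.items():
        value = features.get(name)
        if value is None:
            if rule["required"]:
                raise SchemaViolation(f"missing required feature: {name}")
            continue
        # Strict type check: an int passed for a float feature is rejected.
        if not isinstance(value, rule["type"]):
            raise SchemaViolation(f"{name}: expected {rule['type'].__name__}")
        if "min" in rule and value < rule["min"]:
            raise SchemaViolation(f"{name}: {value} below {rule['min']}")
        if "max" in rule and value > rule["max"]:
            raise SchemaViolation(f"{name}: {value} above {rule['max']}")
    unexpected = sorted(set(features) - set(schema))
    if unexpected:
        raise SchemaViolation(f"unexpected features: {unexpected}")
    return features
```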
Shadow Mode Testing
Before replacing a production model, shadow mode (or “dark launch”) validates the new model on real traffic without affecting users.
Architecture:
- All requests go to the primary (current production) model
- Requests are also asynchronously forwarded to the shadow (new candidate) model
- Only the primary model’s predictions are returned to users
- Shadow model predictions are logged for comparison
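The four steps above can be sketched as a small router. This is an illustrative, in-process version using a thread pool; `ShadowRouter`, its method names, and the in-memory `comparison_log` are assumptions for the example, and a real deployment would more likely forward traffic over the network and write to a log pipeline.

```python
from concurrent.futures import ThreadPoolExecutor


class ShadowRouter:
    """Serve from the primary model; run the shadow model off the request path."""

    def __init__(self, primary, shadow, max_workers=4):
        self.primary = primary
        self.shadow = shadow
        self.comparison_log = []  # stand-in for a real logging sink
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def _run_shadow(self, request_id, features, primary_prediction):
        try:
            shadow_prediction, error = self.shadow(features), None
        except Exception as exc:  # shadow failures must never reach users
            shadow_prediction, error = None, repr(exc)
        self.comparison_log.append({
            "request_id": request_id,
            "primary": primary_prediction,
            "shadow": shadow_prediction,
            "shadow_error": error,
        })

    def predict(self, request_id, features):
        primary_prediction = self.primary(features)
        # Fire and forget: the caller does not wait for the shadow model.
        self._pool.submit(self._run_shadow, request_id, features, primary_prediction)
        return primary_prediction

    def drain(self):
        """Wait for pending shadow calls; the router accepts no requests afterwards."""
        self._pool.shutdown(wait=True)
```

Only the primary prediction is ever returned, and an exception in the shadow model is recorded rather than raised, which is what keeps the candidate from affecting users.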
This enables:
- Correctness comparison: Do the shadow model’s predictions match or improve on the primary’s?
- Performance validation: Does shadow meet latency SLOs under production traffic patterns?
- Edge case discovery: Does production traffic expose input patterns not seen in testing?
- Cost estimation: What are the actual inference costs at production scale?
Disaggregated logging allows analysis by slice: does the shadow model perform differently on specific user segments, input types, or time-of-day patterns?
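As a rough illustration of slice-level analysis, the sketch below computes the primary/shadow disagreement rate per segment from logged comparison records. The record fields (`primary`, `shadow`, and a slice key such as `segment`) are hypothetical.

```python
from collections import defaultdict


def disagreement_by_slice(records, slice_key):
    """Fraction of requests per slice where shadow and primary predictions differ."""
    totals = defaultdict(int)
    disagreements = defaultdict(int)
    for record in records:
        slice_value = record[slice_key]
        totals[slice_value] += 1
        if record["primary"] != record["shadow"]:
            disagreements[slice_value] += 1
    return {s: disagreements[s] / totals[s] for s in totals}


# Hypothetical logged comparisons.
logged = [
    {"segment": "mobile", "primary": 1, "shadow": 1},
    {"segment": "mobile", "primary": 0, "shadow": 1},
    {"segment": "desktop", "primary": 1, "shadow": 1},
]
rates = disagreement_by_slice(logged, "segment")
```

An aggregate disagreement rate can look acceptable while one segment carries most of the differences, which is the case this breakdown is meant to surface.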
Uber’s Michelangelo enforces shadow mode as a required step before any production model promotion. The comparison data from shadow testing becomes evidence for or against promotion decisions.
LLMOps: Specific Challenges for Large Language Models
Deploying LLMs raises MLOps challenges distinct from those of traditional ML:
Prompt management: LLM behavior is heavily influenced by prompts, which can be tuned without model retraining. Prompt versioning, A/B testing prompts, and rollback mechanisms are needed.
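One way to picture prompt versioning, rollback, and A/B assignment is a small in-memory registry. Everything here is a hypothetical sketch (`PromptRegistry`, `assign_variant`, the prompt names); a production setup would persist versions and record which version served each request.

```python
import hashlib


class PromptRegistry:
    """Append-only prompt versions with an active pointer that can be rolled back."""

    def __init__(self):
        self._versions = {}  # prompt name -> list of templates
        self._active = {}    # prompt name -> index of the active version

    def publish(self, name, template):
        versions = self._versions.setdefault(name, [])
        versions.append(template)
        self._active[name] = len(versions) - 1
        return self._active[name]

    def rollback(self, name):
        if self._active[name] == 0:
            raise ValueError(f"no earlier version of {name!r}")
        self._active[name] -= 1
        return self._active[name]

    def get(self, name):
        index = self._active[name]
        template = self._versions[name][index]
        digest = hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]
        # Version and hash are returned so they can be logged with each request.
        return {"name": name, "version": index, "sha256_prefix": digest, "template": template}


def assign_variant(user_id, variants):
    """Deterministic A/B bucket: the same user always sees the same variant."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return variants[int.from_bytes(digest[:8], "big") % len(variants)]
```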
Latency vs. quality tradeoff: Token-by-token LLM decoding is typically bound by memory bandwidth rather than compute, because the model weights are read from GPU memory for every generated token. Techniques:
- KV-cache: Cache the attention key and value tensors of earlier tokens so each decoding step processes only the new token instead of recomputing the whole sequence. The savings grow with context length.
- Continuous batching (Orca, 2022): Instead of waiting for all batch members to complete (which creates idle time when fast requests finish before slow ones), continuously add new requests as slots open. Significantly improves GPU utilization.
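- Speculative decoding: A small draft model proposes several tokens ahead; the large model verifies them in a single parallel pass and keeps the accepted prefix. Reported speedups are around 2–3x and depend on how often the draft is accepted.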
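To make the continuous batching point concrete, here is a toy scheduler simulation. It is not Orca's implementation: it assumes every active request produces exactly one token per step, that each request needs at least one token, and it ignores prefill cost and memory limits. It only counts decoding steps under the two policies.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes; short ones sit idle."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Refill free slots from the queue before every decoding step."""
    queue = deque(lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1  # one decoding step for every active request
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Hypothetical output lengths (in tokens) with two batch slots.
output_lengths = [2, 8, 2, 8]
static_steps = static_batching_steps(output_lengths, batch_size=2)
continuous_steps = continuous_batching_steps(output_lengths, batch_size=2)
```

With these lengths the static policy needs 16 steps and the continuous policy 12, because the slot freed by each short request is reused immediately instead of waiting for the long one.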
Model quantization: INT8/INT4 quantization reduces weight memory from 2 bytes/param (FP16) to 1 or 0.5 bytes/param respectively, which lets larger models fit in available GPU memory. AWQ and GPTQ are widely used post-training quantization methods.
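The bytes-per-parameter figures translate directly into weight memory. The helper below is a back-of-the-envelope sketch: the 7-billion-parameter model is a hypothetical example, and the estimate covers weights only, leaving out the KV-cache, activations, and the small overhead of quantization scales.

```python
# Bytes per parameter as stated in the text.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, precision):
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# Hypothetical 7B-parameter model.
footprint = {p: weight_memory_gb(7e9, p) for p in BYTES_PER_PARAM}
# fp16 is about 14 GB, int8 about 7 GB, int4 about 3.5 GB
```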
Hallucination monitoring: Monitor for factual inconsistencies using:
- Self-consistency checks (multiple samples, check agreement)
- Retrieval-augmented generation with source citation verification
- LLM-as-judge scoring pipelines
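The first item, self-consistency, can be sketched as an agreement score over several sampled answers. The helper names and the 0.6 threshold are assumptions, and exact-match after normalization is a deliberately crude notion of agreement; free-form answers would need semantic comparison instead. High agreement does not prove an answer is correct; low agreement is simply a useful signal for review.

```python
from collections import Counter


def normalize(answer):
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(answer.lower().split())


def self_consistency(samples):
    """Return the most common answer and the fraction of samples that agree with it."""
    counts = Counter(normalize(sample) for sample in samples)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(samples)


def flag_low_agreement(samples, min_agreement=0.6):
    """True when the samples disagree enough to warrant a closer look."""
    _, agreement = self_consistency(samples)
    return agreement < min_agreement
```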
Evaluation pipelines: For open-ended LLM output, traditional accuracy metrics often don’t apply because there is no single correct answer to match against. Evaluation uses:
- LLM-judged evaluations (GPT-4 or Llama-3 as judge)
- Task-specific benchmarks run automatically on each model version
- Human evaluation for quality assurance sampling
Building Observable ML Systems
Observability for ML systems extends software observability (logs, metrics, traces) with ML-specific signals.
The four golden signals for ML:
- Prediction volume: Request rate and latency distribution
- Input feature distributions: Statistical summary of features arriving at serving time
- Output distributions: Distribution of predicted scores/classes — shifts indicate behavior change
- Ground truth performance: Where labels are available (with delay), track accuracy/precision/recall
Structured logging for ML: Log not just outputs but inputs, feature values, model version, A/B test assignment, and request context. This enables retrospective analysis when issues are discovered.
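A structured prediction log might look like the sketch below: one JSON object per request containing the fields listed above. The field names and the `emit` helper are illustrative assumptions rather than a standard format.

```python
import json
import time
import uuid


def build_prediction_log(features, prediction, model_version, ab_assignment, request_context):
    """Assemble one self-describing record per prediction."""
    return {
        "event": "prediction",
        "request_id": request_context.get("request_id") or str(uuid.uuid4()),
        "timestamp": time.time(),
        "model_version": model_version,
        "ab_assignment": ab_assignment,
        "features": features,        # inputs exactly as the model saw them
        "prediction": prediction,
        "context": request_context,
    }


def emit(record, sink=print):
    """Write the record as a single JSON line; the sink is swappable for tests."""
    sink(json.dumps(record, sort_keys=True))
```

Logging the features as the model received them is what makes later skew and drift analysis possible without re-deriving inputs.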
Real-time feature monitoring with Evidently or Arize: set alerts for PSI (Population Stability Index) > 0.2 on critical features. PSI < 0.1 is stable; 0.1–0.2 is minor shift; > 0.2 is major shift requiring investigation.
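$$\mathrm{PSI} = \sum_i (\mathrm{expected}_i - \mathrm{actual}_i) \times \ln\left(\frac{\mathrm{expected}_i}{\mathrm{actual}_i}\right)$$
Here $\mathrm{expected}_i$ and $\mathrm{actual}_i$ are the fractions of the reference (training) sample and the current (serving) sample that fall into bin $i$.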
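The formula can be computed directly from per-bin counts. The sketch below is a minimal version: the small `eps` floor that avoids division by zero for empty bins is an assumption of this example, and the labels follow the thresholds quoted above.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index from per-bin counts over the same bin edges."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both samples must use the same bins")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    total = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Convert counts to fractions; floor at eps so empty bins do not blow up.
        e_frac = max(expected / expected_total, eps)
        a_frac = max(actual / actual_total, eps)
        total += (e_frac - a_frac) * math.log(e_frac / a_frac)
    return total


def classify_shift(value):
    """Map a PSI value to the bands used in the text."""
    if value < 0.1:
        return "stable"
    if value <= 0.2:
        return "minor shift"
    return "major shift"


# Hypothetical two-bin feature: training split 50/50, serving split 80/20.
example_psi = psi([50, 50], [80, 20])
```

For the example counts the index comes out near 0.42, well into the range that calls for investigation.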
Chaos engineering for ML: Deliberately introduce distribution shift or upstream data failures to verify that monitoring and fallback systems work as intended. This applies the approach of Netflix’s Chaos Monkey to ML pipelines.
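A chaos-style check can be as simple as corrupting a copy of live data and asserting that the monitors fire. The sketch below uses two deliberately simple, hypothetical monitors (a mean-shift check and a null-rate check) and two fault injectors; the thresholds are placeholders.

```python
from math import sqrt
from statistics import mean, pstdev


def mean_shift_alert(baseline, live, max_sigma=3.0):
    """Alert when the live mean moves more than max_sigma standard errors."""
    observed = [v for v in live if v is not None]
    if not observed:
        return True  # nothing arrived at all: treat as an incident
    standard_error = pstdev(baseline) / sqrt(len(observed))
    return abs(mean(observed) - mean(baseline)) > max_sigma * standard_error


def null_rate_alert(live, max_null_rate=0.05):
    """Alert when too many feature values are missing."""
    return sum(v is None for v in live) / len(live) > max_null_rate


def inject_shift(values, offset):
    """Fault injection: simulate distribution shift by offsetting every value."""
    return [v + offset for v in values]


def inject_nulls(values, every_nth):
    """Fault injection: simulate an upstream failure by dropping every n-th value."""
    return [None if i % every_nth == 0 else v for i, v in enumerate(values)]


# Deterministic stand-in data so the drill is repeatable.
baseline = [i % 10 for i in range(1000)]
live = [i % 10 for i in range(500)]
drill_results = {
    "clean_is_quiet": not mean_shift_alert(baseline, live) and not null_rate_alert(live),
    "shift_detected": mean_shift_alert(baseline, inject_shift(live, 2)),
    "nulls_detected": null_rate_alert(inject_nulls(live, 4)),
}
```

A drill passes only if clean data stays quiet and every injected fault raises its alert; a monitor that never fires is as much a failure as one that always does.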
One thing to remember: The goal of MLOps is not process for process’s sake — it’s making the relationship between “this is my model” and “this is what users experience” observable, controllable, and trustworthy, so teams can iterate with confidence rather than hope.