MLOps — Deep Dive

ML system design patterns, train-serve skew, shadow mode testing, LLMOps specifics, the hidden technical debt paper, and building observable ML systems.

Hidden Technical Debt in ML Systems

Sculley et al. (2015), “Hidden Technical Debt in Machine Learning Systems,” is the seminal paper on why ML systems accumulate complexity that standard software engineering practices don’t fully address.

The key observation: the ML model code is a small fraction of the overall system. The paper introduced the “CACE principle” (Changing Anything Changes Everything) — in an ML system, changing data, hyperparameters, preprocessing, evaluation metrics, or infrastructure can all affect model behavior in non-obvious ways.

Specific debt patterns:

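Glue code: In the paper, glue code is the supporting code written to move data into and out of general-purpose packages, and it can dwarf the ML code itself. A closely related pattern: data scientists prototype in notebooks, then a different team productionizes in a different codebase. Discrepancies accumulate, such as different preprocessing and different handling of edge cases.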

Pipeline jungles: Multiple data pipelines for different model versions, feature groups, or teams. No clear ownership; duplicated logic; inconsistent results.

Dead experimental code paths: A/B test code, experimental features, and old model versions never get cleaned up. These rarely exercised branches stay in the production code, where their interactions are hard to test and their impact is unclear.

Undeclared consumers: Other systems quietly depend on a model’s output structure. When the model is updated, these break silently.

Feedback loops: Model outputs affect user behavior, which generates future training data. Closed feedback loops can cause models to reinforce biases (e.g., a content recommender that shows increasingly extreme content because engagement is slightly higher).

Train-Serve Skew: The Silent Killer

Train-serve skew occurs when the features computed at inference time differ from those used during training. This can cause model performance to degrade dramatically without obvious errors.

Common causes:

Training data was preprocessed in Python (Pandas); inference feature computation is in Java
Null/missing value handling differs between training and serving code
Feature computation uses different time windows (7-day average in training vs. trailing 24 hours in serving)
Normalization statistics (mean, std) are frozen from the training data, but the production distribution has since shifted, so normalized inputs no longer look like what the model saw in training

Detection: Log model inputs at serving time. Periodically compare the distribution of serving inputs to training inputs using statistical measures such as the Kolmogorov-Smirnov (KS) test or the Population Stability Index (PSI). Alert when divergence exceeds a threshold chosen per feature.
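The detection step can be sketched in a few lines. This is a minimal, standard-library illustration of the two-sample KS statistic (the largest gap between the two empirical CDFs); the function names and the 0.2 threshold are assumptions for illustration, not recommended values.

```python
from bisect import bisect_right

def ks_statistic(train, serve):
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    a, b = sorted(train), sorted(serve)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )

def skew_alert(train, serve, threshold=0.2):
    # threshold is a hypothetical, per-feature tuning knob
    return ks_statistic(train, serve) > threshold

# Toy data: one feature logged at training time and at serving time.
train_values = [0.1 * i for i in range(100)]
serve_similar = [0.1 * i + 0.05 for i in range(100)]
serve_shifted = [0.1 * i + 5.0 for i in range(100)]

print(skew_alert(train_values, serve_similar))
print(skew_alert(train_values, serve_shifted))
```

In practice a library implementation with p-values would likely replace the hand-rolled statistic; the point here is only that the comparison runs on logged serving inputs against a stored training sample.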

Prevention: Use a single feature computation code path, meaning the same code (or code generated from the same specification) for both training and serving. Feature stores help enforce this: one set of feature definitions produces both the offline training features and the online serving features.
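A minimal sketch of the single-code-path idea, assuming a hypothetical spending feature: both the training row builder and the serving path call the same function, so the time window and the null handling cannot diverge. All names here are illustrative.

```python
from statistics import mean

def spend_features(daily_amounts, window=7):
    """Single feature definition shared by training and serving.

    Uses the last `window` daily amounts; missing days (None) count as 0.0.
    """
    recent = [0.0 if a is None else a for a in daily_amounts[-window:]]
    return {
        "avg_spend": mean(recent) if recent else 0.0,
        "n_days": len(recent),
    }

def build_training_row(history, label):
    # Offline path: features plus the label.
    return {**spend_features(history), "label": label}

def serve_features(history):
    # Online path: exactly the same feature code, no label.
    return spend_features(history)

history = [10.0, None, 20.0]
print(build_training_row(history, label=1))
print(serve_features(history))
```

The design choice is that the window size and the missing-value rule live in one place, which rules out the mismatched-window and mismatched-null-handling causes listed above.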

Google’s TFX (TensorFlow Extended) supports schema consistency: features are validated against a schema at training time, and the same schema can be used to check serving data. Schema violations then cause explicit failures rather than silent performance degradation.
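The fail-loudly idea can be illustrated without any framework. The sketch below is plain Python and is not the TFX or TensorFlow Data Validation API; the schema layout, feature names, and exception class are all assumptions made for the example.

```python
class SchemaViolation(ValueError):
    """Raised when a record does not match the expected feature schema."""

# Hypothetical schema: feature name -> type, optional bounds, required flag.
SCHEMA = {
    "age": {"type": int, "min": 0, "max": 120, "required": True},
    "country": {"type": str, "required": True},
    "avg_spend": {"type": float, "min": 0.0, "required": False},
}

def validate(record, schema=SCHEMA):
    """Return the record unchanged if valid; raise SchemaViolation otherwise."""
    for name, rule in schema.items():
        value = record.get(name)
        if value is None:
            if rule["required"]:
                raise SchemaViolation(f"missing required feature: {name}")
            continue
        if not isinstance(value, rule["type"]):
            raise SchemaViolation(
                f"{name}: expected {rule['type'].__name__}, "
                f"got {type(value).__name__}"
            )
        if "min" in rule and value < rule["min"]:
            raise SchemaViolation(f"{name}={value} below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            raise SchemaViolation(f"{name}={value} above maximum {rule['max']}")
    return record

# The same validate() call would run in the training pipeline and in the
# serving request path.
print(validate({"age": 34, "country": "DE", "avg_spend": 12.5}))
```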

Shadow Mode Testing

Before replacing a production model, shadow mode (or “dark launch”) validates the new model on real traffic without affecting users.

Architecture:

All requests go to the primary (current production) model
Requests are also asynchronously forwarded to the shadow (new candidate) model
Only the primary model’s predictions are returned to users
Shadow model predictions are logged for comparison
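The four steps above can be sketched with asyncio. This is an illustrative toy, not a production router: the two model functions are stand-ins, the in-memory `shadow_log` list stands in for a real logging sink, and all names are hypothetical.

```python
import asyncio

shadow_log = []  # stand-in for a real comparison log sink

async def primary_model(request):
    return {"score": 0.8}  # placeholder for the production model

async def shadow_model(request):
    await asyncio.sleep(0.01)  # the candidate may be slower
    return {"score": 0.7}

async def run_shadow(request, primary_result):
    # Shadow failures are logged and never reach the user.
    try:
        shadow_result = await shadow_model(request)
        shadow_log.append(
            {"request": request, "primary": primary_result, "shadow": shadow_result}
        )
    except Exception as exc:
        shadow_log.append(
            {"request": request, "primary": primary_result, "shadow_error": repr(exc)}
        )

async def handle(request, background):
    result = await primary_model(request)
    # Fire-and-forget: the response does not wait for the shadow model.
    task = asyncio.create_task(run_shadow(request, result))
    background.add(task)  # keep a reference so the task is not garbage-collected
    task.add_done_callback(background.discard)
    return result

async def main():
    background = set()
    response = await handle({"user_id": 1}, background)
    await asyncio.gather(*background)  # demo only: drain before exiting
    return response

response = asyncio.run(main())
print(response, shadow_log)
```

Only the primary result is returned from `handle`; the shadow call runs as a background task so its latency and errors stay invisible to users.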

This enables:

Correctness comparison: Do the shadow model’s predictions match or improve on the primary’s?
Performance validation: Does shadow meet latency SLOs under production traffic patterns?
Edge case discovery: Does production traffic expose input patterns not seen in testing?
Cost estimation: What are the actual inference costs at production scale?

Disaggregated logging allows analysis by slice: does the shadow model perform differently on specific user segments, input types, or time-of-day patterns?
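As a rough illustration of slice-level analysis, the sketch below groups logged primary/shadow score pairs by a slice key and reports the agreement rate per slice. The record layout, the `device` slice key, and the 0.05 tolerance are assumptions for the example.

```python
from collections import defaultdict

def agreement_by_slice(records, slice_key, tolerance=0.05):
    """Fraction of requests per slice where shadow is within tolerance of primary."""
    totals, agree = defaultdict(int), defaultdict(int)
    for r in records:
        s = r[slice_key]
        totals[s] += 1
        if abs(r["primary"] - r["shadow"]) <= tolerance:
            agree[s] += 1
    return {s: agree[s] / totals[s] for s in totals}

# Hypothetical shadow log entries.
records = [
    {"device": "mobile", "primary": 0.80, "shadow": 0.82},
    {"device": "mobile", "primary": 0.40, "shadow": 0.41},
    {"device": "desktop", "primary": 0.90, "shadow": 0.50},
    {"device": "desktop", "primary": 0.30, "shadow": 0.31},
]
print(agreement_by_slice(records, "device"))
```

An aggregate agreement rate of 75% here would hide that all disagreement sits in one slice, which is the reason to log the slice keys alongside the predictions.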

Uber’s Michelangelo enforces shadow mode as a required step before any production model promotion. The comparison data from shadow testing becomes evidence for or against promotion decisions.

LLMOps: Specific Challenges for Large Language Models

Deploying LLMs raises operational challenges beyond those of traditional ML:

Prompt management: LLM behavior is heavily influenced by prompts, which can be tuned without model retraining. Prompt versioning, A/B testing prompts, and rollback mechanisms are needed.
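A minimal sketch of prompt versioning with rollback, assuming an in-memory registry. The class and method names are hypothetical; a real system would persist versions and record which version served each request.

```python
class PromptRegistry:
    """Toy in-memory prompt store with 1-based version numbers."""

    def __init__(self):
        self._versions = {}  # prompt name -> list of templates
        self._active = {}    # prompt name -> index of the active template

    def publish(self, name, template):
        versions = self._versions.setdefault(name, [])
        versions.append(template)
        self._active[name] = len(versions) - 1
        return len(versions)  # version number just published

    def rollback(self, name, version):
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise KeyError(f"unknown version {version} for prompt {name!r}")
        self._active[name] = version - 1

    def render(self, name, **variables):
        idx = self._active[name]
        # Return the version alongside the text so it can be logged per request.
        return idx + 1, self._versions[name][idx].format(**variables)

registry = PromptRegistry()
registry.publish("summarize", "Summarize: {text}")
registry.publish("summarize", "Summarize in one sentence: {text}")
print(registry.render("summarize", text="hello"))
registry.rollback("summarize", 1)
print(registry.render("summarize", text="hello"))
```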

Latency vs. quality tradeoff: Autoregressive decoding is usually memory-bandwidth bound, because the model weights must be read from GPU memory for every generated token. Common techniques:

KV-cache: Cache the attention keys and values of earlier tokens so each new token does not recompute them. This trades memory for compute, and the savings grow with context length.
Continuous batching (Orca, 2022): Instead of waiting for all batch members to complete (which creates idle time when fast requests finish before slow ones), continuously add new requests as slots open. Significantly improves GPU utilization.
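Speculative decoding: A small draft model proposes several tokens ahead; the large model verifies them in parallel and keeps the accepted prefix. Reported speedups are roughly 2–3x, depending on how often draft tokens are accepted.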
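To make the continuous batching argument concrete, here is a toy simulation that counts decoding steps under both policies. It assumes every active request produces one token per step and ignores prefill, memory limits, and scheduling overhead; the function names and the example lengths are illustrative.

```python
from collections import deque

def static_batching_steps(lengths, slots):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_batching_steps(lengths, slots):
    """Free slots are refilled from the queue at every decoding step."""
    queue, active, steps = deque(lengths), [], 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())
        # One token per active request; requests with one token left finish now.
        active = [remaining - 1 for remaining in active if remaining > 1]
        steps += 1
    return steps

# Output lengths (in tokens) of four queued requests, with two batch slots.
lengths = [2, 8, 2, 8]
print(static_batching_steps(lengths, slots=2))
print(continuous_batching_steps(lengths, slots=2))
```

With static batching, the short requests leave their slots idle while the long request in the same batch finishes; with continuous batching, the next queued request takes the slot immediately.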

Model quantization: INT4/INT8 quantization reduces weight memory from 2 bytes/param (FP16) to 0.5–1 byte/param, usually at some cost in output quality. This allows larger models to fit in available GPU memory. AWQ and GPTQ are widely used post-training quantization methods.
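The bytes-per-parameter figures translate directly into weight memory. The sketch below does that arithmetic for a hypothetical 7-billion-parameter model; it counts weights only and ignores the KV-cache, activations, and the small overhead of quantization scales.

```python
# Bytes per parameter, taken from the figures in the text above.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params, precision):
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

num_params = 7e9  # hypothetical 7B-parameter model
for precision in ("fp16", "int8", "int4"):
    print(precision, weight_memory_gb(num_params, precision), "GB")
```

For this example the weights drop from about 14 GB in FP16 to about 7 GB in INT8 and 3.5 GB in INT4.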

Hallucination monitoring: Monitor for factual inconsistencies using:

Self-consistency checks (multiple samples, check agreement)
Retrieval-augmented generation with source citation verification
LLM-as-judge scoring pipelines
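The first of these checks can be sketched as follows: sample the same question several times, normalize the answers, and flag the response when the most common answer falls below an agreement threshold. The exact-match normalization and the 0.6 threshold are simplifying assumptions; free-form answers would need a semantic comparison instead.

```python
from collections import Counter

def self_consistency(samples, min_agreement=0.6):
    """Flag a question when sampled answers disagree too much."""
    normalized = [s.strip().lower() for s in samples]
    answer, count = Counter(normalized).most_common(1)[0]
    agreement = count / len(normalized)
    return {
        "answer": answer,
        "agreement": agreement,
        "flag": agreement < min_agreement,  # True means route for review
    }

# Hypothetical samples drawn from the same prompt at non-zero temperature.
print(self_consistency(["Paris", "paris ", "Paris", "Lyon", "Paris"]))
print(self_consistency(["1912", "1915", "1921"]))
```

Agreement is only a proxy: a model can be consistently wrong, so this check complements rather than replaces citation verification and judge-based scoring.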

Evaluation pipelines: For open-ended generation, traditional accuracy metrics often don’t apply directly. Evaluation typically combines:

LLM-judged evaluations (GPT-4 or Llama-3 as judge)
Task-specific benchmarks run automatically on each model version
Human evaluation for quality assurance sampling

Building Observable ML Systems

Observability for ML systems extends software observability (logs, metrics, traces) with ML-specific signals.

The four golden signals for ML:

Prediction traffic: Request rate and latency distribution
Input feature distributions: Statistical summary of features arriving at serving time
Output distributions: Distribution of predicted scores/classes — shifts indicate behavior change
Ground truth performance: Where labels are available (with delay), track accuracy/precision/recall

Structured logging for ML: Log not just outputs but inputs, feature values, model version, A/B test assignment, and request context. This enables retrospective analysis when issues are discovered.
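A minimal sketch of such a log record, emitted as one JSON line per prediction. The field names are assumptions for illustration; the point is that inputs, model version, and experiment assignment travel together with the output.

```python
import json
import time
import uuid

def prediction_log_record(features, prediction, model_version,
                          experiment_arm, request_context):
    """Build one JSON log line describing a single prediction."""
    return json.dumps(
        {
            "event": "prediction",
            "request_id": str(uuid.uuid4()),
            "timestamp": time.time(),
            "model_version": model_version,
            "experiment_arm": experiment_arm,  # A/B test assignment
            "features": features,              # inputs as seen by the model
            "prediction": prediction,
            "context": request_context,
        },
        sort_keys=True,
    )

line = prediction_log_record(
    features={"avg_spend": 12.5, "country": "DE"},
    prediction={"score": 0.83},
    model_version="2024-05-01-a",  # hypothetical version label
    experiment_arm="treatment",
    request_context={"endpoint": "/score"},
)
print(line)
```

Keeping the record flat and machine-readable lets later analysis join predictions with delayed ground-truth labels by `request_id`.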

Real-time feature monitoring with tools such as Evidently or Arize: set alerts for PSI (Population Stability Index) > 0.2 on critical features. A common rule of thumb: PSI < 0.1 is stable; 0.1–0.2 is a minor shift; > 0.2 is a major shift requiring investigation.

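$$\mathrm{PSI} = \sum_i (\text{expected}_i - \text{actual}_i) \times \ln\left(\frac{\text{expected}_i}{\text{actual}_i}\right)$$

Here expected_i and actual_i are the fractions of the reference (training) sample and the current (serving) sample that fall into bin i.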
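The PSI formula above can be computed directly from binned proportions. This is a standard-library sketch under stated assumptions: the bin edges are supplied by the caller, and empty bins are floored at a small `eps` to avoid taking the log of zero (a common convention, not part of the formula itself).

```python
import math

def psi(expected, actual, bin_edges, eps=1e-6):
    """Population Stability Index between a reference and a current sample."""
    def proportions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # Bin index = number of edges at or below the value.
            counts[sum(v >= edge for edge in bin_edges)] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ei - ai) * math.log(ei / ai) for ei, ai in zip(e, a))

# Toy feature: the reference is uniform over four bins, current skews high.
edges = [25, 50, 75]
reference = list(range(100))
current = [5] * 10 + [30] * 20 + [60] * 30 + [90] * 40

print(psi(reference, reference, edges))
print(psi(reference, current, edges))
```

For this toy data the second value lands above the 0.2 threshold quoted above, so it would trigger the major-shift alert.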

Chaos engineering for ML: Deliberately introduce distribution shift or upstream data failures to verify that monitoring and fallback systems work as intended. This applies the idea behind Netflix’s Chaos Monkey to ML pipelines.
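One way to picture such an experiment: inject missing values into a copy of a feature batch and confirm that the null-rate monitor fires. The helper names, the 5% alert threshold, and the injection rate are all hypothetical.

```python
import random

def inject_nulls(rows, feature, rate, seed=0):
    """Return a copy of rows with `feature` set to None at roughly `rate`."""
    rng = random.Random(seed)
    return [
        {**row, feature: None} if rng.random() < rate else dict(row)
        for row in rows
    ]

def null_rate(rows, feature):
    return sum(row[feature] is None for row in rows) / len(rows)

def null_rate_alert(rows, feature, max_rate=0.05):
    # max_rate is an illustrative threshold for the monitor under test
    return null_rate(rows, feature) > max_rate

clean = [{"avg_spend": float(i)} for i in range(1000)]
broken = inject_nulls(clean, "avg_spend", rate=0.5)

print(null_rate_alert(clean, "avg_spend"))
print(null_rate_alert(broken, "avg_spend"))
```

The experiment passes only if the alert stays quiet on the clean batch and fires on the corrupted one; the original rows are never mutated.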

One thing to remember: The goal of MLOps is not process for process’s sake. It is to make the relationship between “this is my model” and “this is what users experience” observable, controllable, and trustworthy, so teams can iterate with confidence rather than hope.
