MLOps — Core Concepts

The full ML lifecycle in production: experiment tracking, model registries, CI/CD for ML, data drift detection, and why Uber's Michelangelo changed how companies think about ML infrastructure.

The ML-Software Engineering Gap

Software engineering has decades of practices: version control, automated testing, continuous integration, infrastructure as code. These practices exist because deploying software reliably at scale is hard.

ML systems have all of software’s challenges, plus new ones:

Data versioning: Code + model + data together determine behavior. Versioning only code misses most of what defines an ML system.
Experiment tracking: A data scientist might run 500 experiments with different hyperparameters, feature sets, and architectures. Keeping track of what produced what result requires tooling.
Non-determinism: Data shuffling, unseeded random initialization, and GPU parallelism mean two training runs on identical code and data may produce slightly different models.
Silent failures: A software bug usually crashes loudly. An ML model that’s degraded due to data drift keeps running while quietly getting worse.
Feedback loops: Production predictions can change user behavior, which changes future training data.
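To make the non-determinism point concrete, here is a toy sketch using only the standard library. `train_toy_model` is a hypothetical stand-in for a training job, not a real trainer: it shuffles data and draws initial weights at random. Fixing the seed makes this toy reproducible; on real hardware, GPU kernels can still introduce small differences that a seed alone does not remove.

```python
import random


def train_toy_model(seed=None):
    """Stand-in for training: shuffle the data and draw random initial weights."""
    rng = random.Random(seed)  # seed=None falls back to system entropy
    data = list(range(10))
    rng.shuffle(data)          # order in which examples would be visited
    weights = [rng.gauss(0.0, 1.0) for _ in range(3)]
    return data, weights


# Two runs with the same seed give the same shuffle order and weights.
same_seed_matches = train_toy_model(seed=42) == train_toy_model(seed=42)
```

Recording the seed alongside the code and data version is what lets a run be replayed later.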

MLOps emerged around 2017–2018 as organizations like Uber, Airbnb, and Facebook published details of their internal ML infrastructure. The term itself came into common use around 2018 and is now standard industry vocabulary.

The ML Lifecycle Components

Experiment Tracking

During model development, data scientists run many experiments varying:

Model architecture and hyperparameters
Feature selection and preprocessing
Training data slices

MLflow (Databricks, 2018) is the most widely used open-source experiment tracking tool. For each experiment run, MLflow tracks:

Code version and parameters
Metrics (loss, accuracy, F1)
Artifacts (model file, plots, serialized feature transformers)
Environment (Python packages, hardware)
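The items above amount to one structured record per run. The following is a minimal sketch of that idea using only the standard library; it is not MLflow's API. The names `log_run` and `best_run`, and the one-JSON-file-per-run layout, are assumptions made for illustration.

```python
import json
import platform
import sys
import time
import uuid
from pathlib import Path


def log_run(root, params, metrics, artifacts=None, code_version="unknown"):
    """Write one experiment run to <root>/<run_id>/run.json and return the run id."""
    run_id = uuid.uuid4().hex[:8]
    run_dir = Path(root) / run_id
    run_dir.mkdir(parents=True)
    record = {
        "run_id": run_id,
        "timestamp": time.time(),
        "code_version": code_version,  # e.g. a Git commit hash
        "params": params,
        "metrics": metrics,
        "artifacts": artifacts or [],  # paths to model files, plots, transformers
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
    }
    (run_dir / "run.json").write_text(json.dumps(record, indent=2))
    return run_id


def best_run(root, metric, higher_is_better=True):
    """Return the stored run record with the best value of `metric`."""
    runs = [json.loads(p.read_text()) for p in Path(root).glob("*/run.json")]
    runs = [r for r in runs if metric in r["metrics"]]
    pick = max if higher_is_better else min
    return pick(runs, key=lambda r: r["metrics"][metric])
```

A real tracker adds a UI, artifact storage, and concurrent access, but the comparison view is essentially a query over records like these.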

Hosted alternatives: Weights & Biases (W&B), Neptune.ai, Comet ML. All provide the same core capability: a database of experiments with comparison views.

Data Versioning and Pipelines

DVC (Data Version Control): Works alongside Git to track large datasets and model files. The files themselves live in external storage (S3, GCS, Azure Blob), while small metadata files that point to them are committed to Git. Running git checkout v1.3 followed by dvc checkout restores the exact data, code, and model configuration from that point.
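The split between large files in external storage and small metadata in Git can be sketched as a content-addressed cache plus a pointer file. This is a simplified illustration, not DVC's actual file format or hashing scheme; `track`, `restore`, and the `.ptr.json` suffix are hypothetical names, and a local directory stands in for remote storage.

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_hash(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large datasets fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_path, cache_dir):
    """Copy the file into the cache under its hash and write a small pointer file."""
    data_path = Path(data_path)
    content_hash = file_hash(data_path)
    cache_file = Path(cache_dir) / content_hash
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    if not cache_file.exists():
        shutil.copyfile(data_path, cache_file)
    pointer = data_path.with_name(data_path.name + ".ptr.json")
    pointer.write_text(json.dumps({"path": data_path.name, "hash": content_hash}))
    return pointer  # this small file is what would be committed to Git


def restore(pointer_path, cache_dir):
    """Recreate the data file named in the pointer from the cache."""
    pointer_path = Path(pointer_path)
    meta = json.loads(pointer_path.read_text())
    target = pointer_path.with_name(meta["path"])
    shutil.copyfile(Path(cache_dir) / meta["hash"], target)
    return target
```

Checking out an old commit brings back the old pointer file, and `restore` then brings back the matching data.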

Feature stores: Centralize feature engineering and computation. Rather than each team computing age, income_ratio, or last_7d_purchase_count independently (often inconsistently), a feature store computes them once and serves them to training and inference.

Key feature store concepts:

Online store (low latency, e.g., Redis): Features for real-time inference
Offline store (high throughput, e.g., Hive or BigQuery): Features for training
Consistency: Same features at training time and inference time (preventing train-serve skew)
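A rough sketch of these three concepts follows, with in-memory structures standing in for the real stores. `TinyFeatureStore` and its methods are hypothetical; the point it illustrates is that a single feature definition feeds both the offline and online paths, which is what prevents train-serve skew.

```python
from datetime import datetime, timedelta


def last_7d_purchase_count(purchase_times, as_of):
    """Single feature definition shared by the training and serving paths."""
    window_start = as_of - timedelta(days=7)
    return sum(1 for t in purchase_times if window_start < t <= as_of)


class TinyFeatureStore:
    def __init__(self):
        self.offline = []  # append-only rows (user_id, as_of, value) for training
        self.online = {}   # user_id -> latest value for low-latency lookup

    def materialize(self, user_id, purchase_times, as_of):
        """Compute the feature once and write it to both stores."""
        value = last_7d_purchase_count(purchase_times, as_of)
        self.offline.append((user_id, as_of, value))
        self.online[user_id] = value
        return value

    def get_online(self, user_id):
        return self.online[user_id]

    def get_training_rows(self):
        return list(self.offline)


store = TinyFeatureStore()
purchases = [datetime(2024, 1, 1), datetime(2024, 1, 5), datetime(2024, 1, 9)]
store.materialize("u1", purchases, as_of=datetime(2024, 1, 10))
```

Keeping the `as_of` timestamp in the offline rows is what lets training reconstruct the value a model would have seen at prediction time.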

Feast (open source, originally developed at Gojek with Google Cloud) and Tecton (managed) are leading tools. Uber’s Michelangelo platform popularized the feature store concept in 2017.

Model Registry

A model registry is a versioned catalog of production-ready models. It bridges model development (experimentation) and deployment.

MLflow Model Registry provides:

Staging/Production/Archived lifecycle stages (recent MLflow releases deprecate stages in favor of model version aliases and tags)
Version history (who deployed what, when)
Model metadata (training run, dataset version, metrics)
Approval workflows (require sign-off before promoting to production)

When a model is promoted to production, deployment pipelines pull from the registry — not from a data scientist’s local directory.
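The registry idea can be sketched as a small in-memory catalog. This is an illustration of the concept only, not MLflow's client API; `TinyRegistry`, `ModelVersion`, and the rule that promoting a version archives the previous production version are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass
class ModelVersion:
    version: int
    artifact_uri: str   # where the serialized model lives
    metrics: dict
    stage: str = "None"


class TinyRegistry:
    STAGES = ("None", "Staging", "Production", "Archived")

    def __init__(self):
        self.versions = {}  # model name -> list of ModelVersion

    def register(self, name, artifact_uri, metrics):
        versions = self.versions.setdefault(name, [])
        mv = ModelVersion(len(versions) + 1, artifact_uri, metrics)
        versions.append(mv)
        return mv

    def transition(self, name, version, stage):
        if stage not in self.STAGES:
            raise ValueError(f"unknown stage: {stage}")
        if stage == "Production":
            # Only one production version at a time: archive the previous one.
            for mv in self.versions[name]:
                if mv.stage == "Production" and mv.version != version:
                    mv.stage = "Archived"
        target = self.versions[name][version - 1]
        target.stage = stage
        return target

    def production(self, name):
        """What a deployment pipeline would ask for."""
        for mv in self.versions.get(name, []):
            if mv.stage == "Production":
                return mv
        return None
```

The deployment side only ever calls `production(name)`, so it never needs to know which training run or directory produced the artifact.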

CI/CD for ML

Traditional CI/CD automates: code change → test → build → deploy. ML CI/CD adds:

Data validation: Check that new training data matches expected schema and distributions
Model training: Trigger retraining when code or data changes
Model evaluation: Compare new model against current production model on a holdout set
Shadow deployment: Run new model in parallel with old model, compare predictions without serving new model to users
Canary deployment: Gradually shift traffic to new model (1% → 5% → 25% → 100%)
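Two of these steps reduce to small, testable functions: the evaluation gate and the canary traffic split. The sketch below is one possible shape for them; `should_promote`, `canary_bucket`, `route`, and the `min_gain` threshold are hypothetical names. Routing hashes the request id so that the same id always lands in the same bucket, which keeps the ramp from 1% to 100% monotonic.

```python
import hashlib


def should_promote(candidate_metric, production_metric, min_gain=0.0):
    """Evaluation gate: promote only if the candidate beats production by min_gain."""
    return candidate_metric >= production_metric + min_gain


def canary_bucket(request_id):
    """Map a request id to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(request_id, canary_percent):
    """Send roughly canary_percent of ids to the candidate model."""
    if canary_bucket(request_id) < canary_percent:
        return "candidate"
    return "production"
```

Because buckets are fixed per id, anyone routed to the candidate at 5% stays on the candidate at 25%, which avoids users flipping between models during rollout.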

Tools: GitHub Actions + custom scripts, Kubeflow Pipelines, Vertex AI Pipelines, ZenML.

Model Serving Infrastructure

A trained model needs to be wrapped in a service that can handle concurrent requests efficiently.

Batch inference: Process records offline (e.g., nightly). Suitable when latency tolerance is high — personalized email recommendations, fraud scoring on completed transactions.

Online inference: Real-time, per-request. Requires low latency (< 100ms typically) and high throughput. Trade-offs:

Model size vs. latency
Accuracy vs. hardware cost
Single model vs. ensemble complexity

Model servers: TorchServe, TensorFlow Serving, Triton Inference Server (NVIDIA) handle batching, versioning, and multi-model serving. Triton optimizes GPU utilization through dynamic batching — waiting briefly to collect multiple requests, then processing them in one GPU batch.
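Dynamic batching as described above can be sketched with a standard-library queue. This is a simplified single-threaded illustration of the idea, not how Triton implements it; `collect_batch`, `serve`, and the default batch size and wait time are assumptions.

```python
import queue
import time


def collect_batch(requests, max_batch_size, max_wait_s):
    """Block for the first request, then gather more until full or the wait expires."""
    batch = [requests.get()]
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived in time; run with what we have
    return batch


def serve(requests, model_fn, max_batch_size=8, max_wait_s=0.005):
    """Drain the queue, calling model_fn once per batch instead of once per request."""
    outputs = []
    while not requests.empty():
        batch = collect_batch(requests, max_batch_size, max_wait_s)
        outputs.extend(model_fn(batch))
    return outputs
```

The wait time is the knob: a longer wait fills batches better and raises throughput, at the cost of added latency for the first request in each batch.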

Serving latency components:

Network latency: 1–50ms
Data preprocessing: 1–20ms (can dominate for feature-heavy models)
Model inference: 1ms (small models) to 2s (large LLMs)
Postprocessing: typically <5ms
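Adding these ranges up shows where a latency budget goes. The sketch below uses the figures listed above, with two assumptions: a lower bound of 0 ms for postprocessing, and a hypothetical small model whose inference takes at most 10 ms.

```python
# (low, high) latency in milliseconds per component, taken from the list above.
LATENCY_MS = {
    "network": (1, 50),
    "preprocessing": (1, 20),
    "inference": (1, 2000),
    "postprocessing": (0, 5),  # assumed lower bound of 0
}


def total_range(components):
    """Best-case and worst-case end-to-end latency."""
    low = sum(lo for lo, _ in components.values())
    high = sum(hi for _, hi in components.values())
    return low, high


def fits_budget(components, budget_ms):
    """True if even the worst case stays within the budget."""
    return total_range(components)[1] <= budget_ms


# Hypothetical small model: inference capped at 10 ms.
small_model = dict(LATENCY_MS, inference=(1, 10))
```

With these numbers the small model's worst case is 85 ms, inside a 100 ms budget, and more than half of that worst case is network plus preprocessing rather than the model itself.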

Monitoring and Drift Detection

Production models degrade over time as the world changes. Two formal terms describe this: data drift (the input distribution changes) and concept drift (the relationship between input and output changes).

Types of drift:

Covariate shift: P(X) changes, P(Y|X) stays the same. (Users start submitting more photos from phones, so the image quality distribution shifts)
Label shift: P(Y) changes, P(X|Y) stays the same. (Fraud becomes a larger share of all transactions, while individual fraudulent transactions look the same as before)
Concept drift: P(Y|X) changes. (What “good” credit risk means changes due to economic shifts)

Detection methods:

Statistical tests: Kolmogorov-Smirnov test, Population Stability Index (PSI) for distribution shift
Embedding drift: Compare embeddings of recent data to training data
Performance monitoring: Track model accuracy against delayed ground truth labels
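The two statistical tests can be sketched in plain Python. This version of PSI uses equal-width bins over the combined range and a small floor to avoid dividing by zero; both are choices of this sketch, and production tools often bin by quantiles of the training data instead. The KS function returns only the statistic (the largest gap between the two empirical CDFs), not a p-value.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference sample and a recent sample."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against a zero-width range

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        # Floor at eps so empty bins do not produce log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d
```

A commonly cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, though the right thresholds depend on the feature and the cost of a false alarm.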

Tools: Evidently AI, Arize AI, WhyLabs, and built-in monitoring in cloud platforms (Vertex AI, SageMaker).

One thing to remember: The “80% of ML work is data and infrastructure” observation holds up in practice. MLOps exists to handle that 80% reliably at scale, so data scientists can spend more time on the 20% that’s actually modeling.
