MLOps — Core Concepts
The ML-Software Engineering Gap
Software engineering has decades of practices: version control, automated testing, continuous integration, infrastructure as code. These practices exist because deploying software reliably at scale is hard.
ML systems have all of software’s challenges, plus new ones:
- Data versioning: Code + model + data together determine behavior. Versioning only code misses most of what defines an ML system.
- Experiment tracking: A data scientist might run 500 experiments with different hyperparameters, feature sets, and architectures. Keeping track of what produced what result requires tooling.
- Non-determinism: Data shuffling, unfixed random seeds, and nondeterministic GPU parallelism mean that two otherwise identical training runs may produce slightly different models.
- Silent failures: A software bug usually crashes loudly. An ML model that’s degraded due to data drift keeps running while quietly getting worse.
- Feedback loops: Production predictions can change user behavior, which changes future training data.
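The non-determinism point can be made concrete with a small sketch. It uses only Python's standard `random` module, and the helper name `shuffled_order` is hypothetical. Real training code would also need to seed its numerical libraries, and GPU runs may still differ slightly even then.

```python
import random


def shuffled_order(n_examples, seed=None):
    """Return a shuffled list of example indices.

    With seed=None the generator is seeded from a system source, so the
    order differs between runs; with a fixed seed it is reproducible.
    """
    rng = random.Random(seed)  # private generator, does not touch global state
    order = list(range(n_examples))
    rng.shuffle(order)
    return order


# Two runs with the same seed visit the data in the same order.
run_a = shuffled_order(10, seed=42)
run_b = shuffled_order(10, seed=42)
print(run_a == run_b)
```

Recording the seed alongside the code and data version is what makes a run repeatable later.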
MLOps emerged around 2017–2018 as organizations like Uber, Airbnb, and Facebook published details of their internal ML infrastructure. The term itself came into common use around 2018 and is now standard industry vocabulary.
The ML Lifecycle Components
Experiment Tracking
During model development, data scientists run many experiments varying:
- Model architecture and hyperparameters
- Feature selection and preprocessing
- Training data slices
MLflow (Databricks, 2018) is one of the most widely used open-source experiment tracking tools. For each experiment run, MLflow tracks:
- Code version and parameters
- Metrics (loss, accuracy, F1)
- Artifacts (model file, plots, serialized feature transformers)
- Environment (Python packages, hardware)
Hosted alternatives: Weights & Biases (W&B), Neptune.ai, Comet ML. All provide the same core capability: a database of experiments with comparison views.
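As a rough illustration of "a database of experiments with comparison views", here is a minimal in-memory sketch. The `Run` and `ExperimentLog` classes are hypothetical and are not the API of MLflow or any hosted tool; they only show the shape of the data being tracked.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One experiment run: what was tried and what came out."""
    run_id: int
    params: dict
    metrics: dict = field(default_factory=dict)


class ExperimentLog:
    """Hypothetical in-memory stand-in for a tracking database."""

    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics):
        run = Run(run_id=len(self.runs) + 1, params=dict(params), metrics=dict(metrics))
        self.runs.append(run)
        return run

    def best_run(self, metric, higher_is_better=True):
        """Return the run with the best value for `metric`, ignoring runs without it."""
        candidates = [r for r in self.runs if metric in r.metrics]
        if not candidates:
            raise ValueError(f"no run logged metric {metric!r}")
        pick = max if higher_is_better else min
        return pick(candidates, key=lambda r: r.metrics[metric])


log = ExperimentLog()
log.log_run({"lr": 0.1, "depth": 4}, {"f1": 0.81})
log.log_run({"lr": 0.01, "depth": 8}, {"f1": 0.86})
best = log.best_run("f1")
print(best.run_id, best.params)
```

A real tracker adds persistence, artifact storage, and a UI, but the core record (parameters in, metrics out, one row per run) looks much like this.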
Data Versioning and Pipelines
DVC (Data Version Control): Extends Git to track large datasets and model files in external storage (S3, GCS, Azure Blob), while tracking only small metadata files in Git. Running git checkout v1.3 followed by dvc checkout restores the exact data, code, and model configuration from that point (dvc pull first fetches any files missing from the local cache).
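The idea behind this split (large files in external storage, small pointers in Git) can be sketched with a content-addressed store. This is an illustration only: `ContentStore`, the pointer layout, and the choice of SHA-256 are assumptions made for the example, not DVC's actual file format or API.

```python
import hashlib


class ContentStore:
    """Hypothetical stand-in for external storage, keyed by content hash."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self._blobs[digest] = data  # identical content is stored only once
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]


store = ContentStore()

# The small pointer records are what would live in Git, one per tagged version.
pointers = {
    "v1.2": {"path": "train.csv", "hash": store.put(b"id,label\n1,0\n")},
    "v1.3": {"path": "train.csv", "hash": store.put(b"id,label\n1,0\n2,1\n")},
}

# "Checking out" v1.3 means resolving its pointer back to the exact bytes.
restored = store.get(pointers["v1.3"]["hash"])
print(restored.decode())
```

Because the pointer is just a hash, Git history stays small while every tagged version still maps to one exact dataset.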
Feature stores: Centralize feature engineering and computation. Rather than each team computing age, income_ratio, or last_7d_purchase_count independently (often inconsistently), a feature store computes them once and serves them to training and inference.
Key feature store concepts:
- Online store (low latency, e.g., Redis): Features for real-time inference
- Offline store (high throughput, e.g., Hive or BigQuery): Features for training
- Consistency: Same features at training time and inference time (preventing train-serve skew)
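A minimal sketch of the consistency idea: one feature definition feeds both stores, so training and inference cannot disagree. The raw field names and the definition of `income_ratio` as debt divided by income are assumptions made for illustration. The dictionaries stand in for real offline and online stores.

```python
def compute_features(raw):
    """Single definition of each feature, shared by training and serving."""
    return {
        "income_ratio": raw["debt"] / raw["income"] if raw["income"] else 0.0,
        "last_7d_purchase_count": sum(1 for d in raw["purchase_days_ago"] if d < 7),
    }


raw_users = {
    "u1": {"income": 5000.0, "debt": 1000.0, "purchase_days_ago": [1, 3, 10]},
    "u2": {"income": 0.0, "debt": 200.0, "purchase_days_ago": [8]},
}

# Offline store: full table of feature rows, read in bulk to build training sets.
offline_store = [{"user_id": uid, **compute_features(raw)} for uid, raw in raw_users.items()]

# Online store: key-value lookup by entity id, read one row at a time at inference.
online_store = {row["user_id"]: row for row in offline_store}

print(online_store["u1"])
```

Train-serve skew typically appears when these two paths are implemented separately, for example SQL for training and application code for serving.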
Feast (open-source, originally developed at Gojek) and Tecton (managed) are leading tools. Uber’s Michelangelo platform popularized the feature store concept in 2017.
Model Registry
A model registry is a versioned catalog of production-ready models. It bridges model development (experimentation) and deployment.
MLflow Model Registry provides:
- Staging/Production/Archived lifecycle stages (newer MLflow releases deprecate fixed stages in favor of model version aliases and tags)
- Version history (who deployed what, when)
- Model metadata (training run, dataset version, metrics)
- Approval workflows (require sign-off before promoting to production)
When a model is promoted to production, deployment pipelines pull from the registry — not from a data scientist’s local directory.
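The registry-as-source-of-truth idea can be sketched in a few lines. `ModelRegistry` below is a hypothetical in-memory class, not the MLflow API. The rule that only one version holds Production at a time, and the example bucket paths, are assumptions made for the sketch.

```python
class ModelRegistry:
    """Hypothetical in-memory registry; not the MLflow API."""

    STAGES = ("None", "Staging", "Production", "Archived")

    def __init__(self):
        self._versions = {}  # model name -> list of version records

    def register(self, name, artifact_uri, metrics):
        versions = self._versions.setdefault(name, [])
        record = {
            "version": len(versions) + 1,
            "artifact_uri": artifact_uri,
            "metrics": dict(metrics),
            "stage": "None",
        }
        versions.append(record)
        return record["version"]

    def promote(self, name, version, stage):
        if stage not in self.STAGES:
            raise ValueError(f"unknown stage {stage!r}")
        for record in self._versions[name]:
            # Assumed rule: archive whichever version currently holds Production.
            if stage == "Production" and record["stage"] == "Production":
                record["stage"] = "Archived"
        self._versions[name][version - 1]["stage"] = stage

    def production_uri(self, name):
        """What a deployment pipeline would ask for: the current Production artifact."""
        for record in self._versions.get(name, []):
            if record["stage"] == "Production":
                return record["artifact_uri"]
        raise LookupError(f"no Production version of {name!r}")


registry = ModelRegistry()
v1 = registry.register("churn", "s3://example-bucket/churn/1", {"auc": 0.81})
v2 = registry.register("churn", "s3://example-bucket/churn/2", {"auc": 0.84})
registry.promote("churn", v1, "Production")
registry.promote("churn", v2, "Production")
print(registry.production_uri("churn"))
```

The deployment side only ever calls something like `production_uri`, so rolling back is a registry change rather than a code change.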
CI/CD for ML
Traditional CI/CD automates: code change → test → build → deploy. ML CI/CD adds:
- Data validation: Check that new training data matches expected schema and distributions
- Model training: Trigger retraining when code or data changes
- Model evaluation: Compare new model against current production model on a holdout set
- Shadow deployment: Run new model in parallel with old model, compare predictions without serving new model to users
- Canary deployment: Gradually shift traffic to new model (1% → 5% → 25% → 100%)
Tools: GitHub Actions + custom scripts, Kubeflow Pipelines, Vertex AI Pipelines, ZenML.
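The canary step can be sketched as deterministic, hash-based traffic splitting. The function names and the 10,000-bucket granularity are assumptions for illustration; production systems usually do this in a load balancer or service mesh rather than in application code.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user id to a stable bucket in [0, 10000)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 10_000


def route(user_id: str, canary_percent: float) -> str:
    """Return 'canary' for roughly canary_percent of users, else 'stable'."""
    return "canary" if bucket(user_id) < canary_percent * 100 else "stable"


users = [f"user-{i}" for i in range(10_000)]
for percent in (1, 5, 25, 100):
    share = sum(route(u, percent) == "canary" for u in users) / len(users)
    print(f"{percent:>3}% target -> {share:.1%} of users on canary")
```

Hashing the user id, rather than drawing a random number per request, keeps each user on the same model between requests, and a user who lands on the canary at 5% stays there as the rollout widens to 25%.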
Model Serving Infrastructure
A trained model needs to be wrapped in a service that can handle concurrent requests efficiently.
Batch inference: Process records offline (e.g., nightly). Suitable when latency tolerance is high — personalized email recommendations, fraud scoring on completed transactions.
Online inference: Real-time, per-request. Requires low latency (< 100ms typically) and high throughput. Trade-offs:
- Model size vs. latency
- Accuracy vs. hardware cost
- Single model vs. ensemble complexity
Model servers such as TorchServe, TensorFlow Serving, and Triton Inference Server (NVIDIA) handle batching, versioning, and multi-model serving. Triton improves GPU utilization through dynamic batching: it waits briefly to collect multiple requests, then processes them in one GPU batch.
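The batching policy described here can be illustrated with an offline simulation over request arrival times. This is a sketch of the general idea, not Triton's scheduler. The function name and the parameters `max_batch_size` and `max_wait_ms` are hypothetical.

```python
def dynamic_batches(arrival_ms, max_batch_size, max_wait_ms):
    """Group request arrival times (sorted, in ms) into batches.

    A batch opens when its first request arrives and closes when it is full
    or when max_wait_ms has elapsed since it opened, whichever comes first.
    """
    batches = []
    current = []
    for t in arrival_ms:
        if current and (len(current) == max_batch_size or t - current[0] > max_wait_ms):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


arrivals = [0, 1, 2, 9, 10, 11, 12, 13, 40]
for batch in dynamic_batches(arrivals, max_batch_size=4, max_wait_ms=5):
    print(batch)
```

The trade-off is visible in the two parameters: a longer wait fills batches better and raises throughput, but every request in the batch pays that wait as extra latency.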
Serving latency components:
- Network latency: 1–50ms
- Data preprocessing: 1–20ms (can dominate for feature-heavy models)
- Model inference: 1ms (small models) to 2s (large LLMs)
- Postprocessing: typically <5ms
Monitoring and Drift Detection
Production models degrade over time as the world changes. The formal terms are data drift (the input distribution changes) and concept drift (the relationship between input and output changes).
Types of drift:
- Covariate shift: P(X) changes, P(Y|X) stays same. (Users start submitting more photos from phones — image quality distribution shifts)
- Label shift: P(Y) changes, P(X|Y) stays same. (Fraud becomes more common overall, while individual fraudulent transactions look the same as before)
- Concept drift: P(Y|X) changes. (What “good” credit risk means changes due to economic shifts)
Detection methods:
- Statistical tests: Kolmogorov-Smirnov test, Population Stability Index (PSI) for distribution shift
- Embedding drift: Compare embeddings of recent data to training data
- Performance monitoring: Track model accuracy against delayed ground truth labels
Tools: Evidently AI, Arize AI, WhyLabs, and built-in monitoring in cloud platforms (Vertex AI, SageMaker).
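As a concrete example of a distribution-shift statistic, here is a minimal pure-Python sketch of the Population Stability Index. The equal-width binning over the reference range, the bin count, and the `eps` floor for empty bins are assumptions of this sketch; libraries differ in how they bin.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference and a recent sample.

    Bins are equal-width over the reference sample's range; values outside
    that range fall into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant data

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [i / 100 for i in range(1000)]   # training-time feature values
same = list(reference)                       # recent data, unchanged
shifted = [x + 3.0 for x in reference]       # recent data, moved to the right

print(round(psi(reference, same), 4))
print(round(psi(reference, shifted), 4))
```

A commonly cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, though suitable thresholds depend on the feature and the bin choice.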
One thing to remember: The “80% of ML work is data and infrastructure” observation is a rule of thumb rather than a measurement, but it matches practice. MLOps exists to handle that work reliably at scale, so data scientists can spend more time on the 20% that’s actually modeling.