MLflow Experiment Tracking in Python — Deep Dive
MLflow Architecture
MLflow Tracking has three main components:
- Backend store: Stores experiment and run metadata (parameters, metrics, tags). Can be a local file system or a database (SQLite, PostgreSQL, MySQL).
- Artifact store: Stores large files (models, plots, datasets). Can be local or cloud (S3, GCS, Azure Blob).
- Tracking server: An HTTP server that exposes a REST API and the web UI.
Starting a Tracking Server
# Local development (SQLite + local artifacts)
mlflow server \
  --backend-store-uri sqlite:///mlflow.db \
  --default-artifact-root ./mlartifacts \
  --host 0.0.0.0 \
  --port 5000

# Production (PostgreSQL + S3)
mlflow server \
  --backend-store-uri postgresql://user:pass@db-host:5432/mlflow \
  --default-artifact-root s3://my-bucket/mlflow-artifacts \
  --host 0.0.0.0 \
  --port 5000
Point your client code to the server:
import mlflow
mlflow.set_tracking_uri("http://tracking-server:5000")
Logging Runs Manually
Basic Run
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, accuracy_score
from sklearn.model_selection import train_test_split

mlflow.set_experiment("churn-prediction")

# X and y are the feature matrix and binary labels, loaded elsewhere
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run(run_name="rf-baseline"):
    # Log parameters
    params = {"n_estimators": 100, "max_depth": 10, "random_state": 42}
    mlflow.log_params(params)

    # Train
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    # Evaluate and log metrics
    y_pred = model.predict(X_test)
    mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("f1", f1_score(y_test, y_pred))

    # Log the model
    mlflow.sklearn.log_model(model, "model")

    # Log artifacts
    mlflow.log_artifact("data/feature_config.json")

    # Tags
    mlflow.set_tag("author", "alice")
    mlflow.set_tag("dataset_version", "v2.3")
Logging Metric History
Track metrics across training steps:
with mlflow.start_run():
    for epoch in range(100):
        train_loss = train_one_epoch(model, train_loader)
        val_loss = evaluate(model, val_loader)
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_loss", val_loss, step=epoch)
The UI renders these as line charts, making it easy to spot overfitting or convergence issues.
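The same histories can be read back in code. The sketch below is illustrative only: `first_overfit_step` and its `patience` rule are a hypothetical heuristic, not something MLflow provides, and `fetch_history` is an assumed helper name wrapping `MlflowClient.get_metric_history`, with MLflow imported lazily so the heuristic can be tried on plain lists.

```python
def first_overfit_step(val_losses, patience=3):
    """Return the index of the lowest validation loss if the loss then failed
    to improve for `patience` consecutive steps; otherwise return None."""
    best_idx = 0
    for i, loss in enumerate(val_losses):
        if loss < val_losses[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            return best_idx
    return None


def fetch_history(run_id, key):
    """Read one metric's logged values from the tracking server, ordered by step."""
    from mlflow import MlflowClient  # lazy import: only needed for real runs

    history = MlflowClient().get_metric_history(run_id, key)
    return [m.value for m in sorted(history, key=lambda m: m.step)]
```

For a finished run, `first_overfit_step(fetch_history(run_id, "val_loss"))` would point at the step after which validation loss stopped improving, assuming the metric was logged with `step=epoch` as above.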
Autologging
MLflow can automatically log parameters, metrics, and models for supported frameworks:
import mlflow
# Scikit-learn
mlflow.sklearn.autolog()
# PyTorch Lightning
mlflow.pytorch.autolog()
# XGBoost
mlflow.xgboost.autolog()
# TensorFlow / Keras
mlflow.tensorflow.autolog()
With autolog enabled, simply calling model.fit() records everything:
mlflow.sklearn.autolog()
with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=200, max_depth=8)
    model.fit(X_train, y_train)
    # Parameters, training metrics, and the fitted model are logged automatically
Autolog captures hyperparameters, training metrics, and the model with its signature. What else gets recorded depends on the framework: input examples are logged only when log_input_examples=True, and feature importance plots come from the XGBoost and LightGBM integrations rather than from scikit-learn.
Searching and Comparing Runs
Programmatic Search
runs = mlflow.search_runs(
    experiment_names=["churn-prediction"],
    filter_string="metrics.f1 > 0.85 AND params.n_estimators = '200'",
    order_by=["metrics.f1 DESC"],
    max_results=10,
)
print(runs[["run_id", "params.n_estimators", "metrics.f1"]])
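Note that the filter compares `params.n_estimators` against the quoted string `'200'`: MLflow stores parameter values as strings, while metrics are numeric. A minimal sketch of a helper that assembles such filters is shown here; `build_filter` is a hypothetical name, and it assumes simple keys and values that need no escaping.

```python
def build_filter(metric_floors=None, params=None):
    """Build an MLflow search filter string.

    metric_floors maps metric name -> lower bound (numeric comparison).
    params maps parameter name -> value (compared as a quoted string).
    Assumes names and values contain no quotes or special characters.
    """
    clauses = []
    for name, floor in (metric_floors or {}).items():
        clauses.append(f"metrics.{name} > {floor}")
    for name, value in (params or {}).items():
        clauses.append(f"params.{name} = '{value}'")
    return " AND ".join(clauses)


print(build_filter({"f1": 0.85}, {"n_estimators": 200}))
# prints: metrics.f1 > 0.85 AND params.n_estimators = '200'
```

The result can be passed as `filter_string` to `mlflow.search_runs`.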
Loading a Previous Run’s Model
best_run = runs.iloc[0]
model_uri = f"runs:/{best_run.run_id}/model"
loaded_model = mlflow.sklearn.load_model(model_uri)
predictions = loaded_model.predict(X_new)
Model Registry
The model registry adds versioning and staging to logged models:
import mlflow
from mlflow import MlflowClient

# Register a model from a run; the returned object carries the new version number
model_uri = f"runs:/{run_id}/model"
registered = mlflow.register_model(model_uri, "churn-classifier")

# Note: stage transitions are deprecated since MLflow 2.9 in favor of model version aliases
client = MlflowClient()

# Transition to staging
client.transition_model_version_stage(
    name="churn-classifier",
    version=registered.version,
    stage="Staging",
)

# Promote to production after validation
client.transition_model_version_stage(
    name="churn-classifier",
    version=registered.version,
    stage="Production",
)

# Load production model by stage
model = mlflow.pyfunc.load_model("models:/churn-classifier/Production")
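Recent MLflow releases (2.9 and later) deprecate stages in favor of model version aliases, which are set with `MlflowClient.set_registered_model_alias` and resolved through URIs of the form `models:/<name>@<alias>`. The following is a rough sketch of the equivalent flow; the helper names `registry_uri` and `promote` and the alias `champion` are illustrative choices, not names from the original workflow.

```python
def registry_uri(name, version=None, alias=None):
    """Build a registry URI for either a fixed version or an alias."""
    if (version is None) == (alias is None):
        raise ValueError("pass exactly one of version or alias")
    if alias is not None:
        return f"models:/{name}@{alias}"
    return f"models:/{name}/{version}"


def promote(name, version, alias="champion"):
    """Point `alias` at `version` and return the URI consumers should load."""
    from mlflow import MlflowClient  # lazy import: requires a reachable registry

    MlflowClient().set_registered_model_alias(name=name, alias=alias, version=version)
    return registry_uri(name, alias=alias)


print(registry_uri("churn-classifier", alias="champion"))
# prints: models:/churn-classifier@champion
```

Serving code can then call `mlflow.pyfunc.load_model(registry_uri("churn-classifier", alias="champion"))` and pick up whichever version the alias currently points to.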
Custom MLflow Models
Wrap any prediction logic in a custom PythonModel:
import mlflow
import mlflow.pyfunc


class ChurnPredictor(mlflow.pyfunc.PythonModel):
    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def load_context(self, context):
        # Called once when the model is loaded; artifacts resolve to local paths
        import joblib
        self.model = joblib.load(context.artifacts["sklearn_model"])
        self.scaler = joblib.load(context.artifacts["scaler"])

    def predict(self, context, model_input):
        scaled = self.scaler.transform(model_input)
        probas = self.model.predict_proba(scaled)[:, 1]
        return (probas >= self.threshold).astype(int)


# Log the custom model
with mlflow.start_run():
    artifacts = {
        "sklearn_model": "models/rf_model.joblib",
        "scaler": "models/scaler.joblib",
    }
    mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=ChurnPredictor(threshold=0.45),
        artifacts=artifacts,
    )
Integration with PyTorch
import mlflow
import mlflow.pytorch
import torch

# mlflow.pytorch.autolog() instruments PyTorch Lightning training, so a
# hand-written loop like this one logs its parameters and metrics explicitly.
with mlflow.start_run():
    model = MyNeuralNet()  # your torch.nn.Module
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    mlflow.log_param("lr", 0.001)
    for epoch in range(50):
        train_loss = train_epoch(model, optimizer, train_loader)
        val_metrics = evaluate(model, val_loader)
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_accuracy", val_metrics["accuracy"], step=epoch)
    # Log final model
    mlflow.pytorch.log_model(model, "model")
CI/CD Integration
Run experiments in CI and promote the best model automatically:
# .github/workflows/train.yml
name: Train and Evaluate
on:
  push:
    paths: ["src/model/**", "data/features/**"]
jobs:
  train:
    runs-on: ubuntu-latest
    env:
      # Set at job level so both steps can reach the tracking server
      MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_URI }}
    steps:
      - uses: actions/checkout@v4
      - name: Train model
        run: |
          python src/model/train.py
      - name: Check improvement
        run: |
          python scripts/check_improvement.py --threshold 0.02
The check_improvement.py script compares the new run against the current production model and promotes it only if the improvement exceeds the threshold.
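The script itself is not shown in this article. A minimal sketch of what it might look like follows, assuming the experiment name `churn-prediction`, the registered model `churn-classifier`, and the `f1` metric used earlier; the function names are hypothetical, and the lookup of the production model relies on the stage-based `get_latest_versions` call to match the registry example above.

```python
import argparse


def should_promote(candidate, production, threshold):
    """Promote when there is no production model yet, or when the candidate
    beats the production score by more than `threshold`."""
    if production is None:
        return True
    return (candidate - production) > threshold


def fetch_scores(experiment, model_name, metric):
    """Return (latest run's score, production model's score or None)."""
    import mlflow  # lazy import: needs MLFLOW_TRACKING_URI to be set
    from mlflow import MlflowClient

    runs = mlflow.search_runs(
        experiment_names=[experiment],
        order_by=["attributes.start_time DESC"],
        max_results=1,
    )
    if runs.empty:
        raise SystemExit(f"no runs found in experiment {experiment!r}")
    candidate = float(runs.iloc[0][f"metrics.{metric}"])

    client = MlflowClient()
    versions = client.get_latest_versions(model_name, stages=["Production"])
    if not versions:
        return candidate, None
    prod_run = client.get_run(versions[0].run_id)
    return candidate, prod_run.data.metrics.get(metric)


def main(argv=None):
    parser = argparse.ArgumentParser(description="Gate promotion on metric improvement")
    parser.add_argument("--threshold", type=float, default=0.02)
    args = parser.parse_args(argv)

    candidate, production = fetch_scores("churn-prediction", "churn-classifier", "f1")
    decision = should_promote(candidate, production, args.threshold)
    print("promote" if decision else "keep current production model")
    return decision
```

In a real script, `main()` would be called from an `if __name__ == "__main__":` guard, and the promotion itself (registering the new version and moving it to production) would follow a positive decision. Keeping `should_promote` free of MLflow calls makes the gating rule easy to unit test.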
Best Practices
- One experiment per project: Keep related runs together for easy comparison.
- Descriptive run names: Use names like rf-depth10-lr0.01 instead of auto-generated IDs.
- Log environment info: Capture Python version, package versions, and Git hash.
- Use model signatures: Define input/output schemas so consumers know what the model expects.
- Clean up old runs: Archive or delete failed/obsolete runs to keep the UI navigable.
- Separate tracking from training code: Create a utility module that handles all MLflow calls, keeping training scripts clean.
# utils/tracking.py
import mlflow


def start_tracked_run(experiment_name, run_name, params):
    mlflow.set_experiment(experiment_name)
    run = mlflow.start_run(run_name=run_name)
    mlflow.log_params(params)
    return run
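The "log environment info" practice from the list above could live in the same utility module. This is one possible sketch using only the standard library to gather the values; `collect_environment` and `log_environment` are hypothetical names, the package list is an assumption, and the tag keys are arbitrary.

```python
import platform
import subprocess
from importlib import metadata


def collect_environment(packages=("mlflow", "scikit-learn")):
    """Gather Python version, selected package versions, and the Git commit."""
    info = {"python_version": platform.python_version()}
    for name in packages:
        try:
            info[f"pkg.{name}"] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[f"pkg.{name}"] = "not installed"
    try:
        info["git_commit"] = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], stderr=subprocess.DEVNULL, text=True
        ).strip()
    except (subprocess.CalledProcessError, OSError):
        # Not a Git checkout, or git is not available on this machine
        info["git_commit"] = "unknown"
    return info


def log_environment():
    """Attach the collected values as tags on the active MLflow run."""
    import mlflow  # lazy import so collect_environment works without MLflow

    mlflow.set_tags(collect_environment())
```

Calling `log_environment()` inside a `with mlflow.start_run():` block records the values alongside the run's parameters and metrics.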
Alternatives Comparison
| Tool | Strengths | Weaknesses |
|---|---|---|
| MLflow | Open source, framework-agnostic, model registry | UI is basic, scaling requires setup |
| Weights & Biases | Beautiful UI, collaboration features | Cloud-hosted (cost), vendor lock-in |
| Neptune.ai | Real-time monitoring, easy integration | Paid beyond free tier |
| DVC | Data versioning + experiments, Git-native | Steeper learning curve |
One thing to remember: MLflow turns ad-hoc experimentation into a structured, reproducible process — the earlier you adopt it in a project, the less time you waste recreating results and debugging “what changed.”