MLflow Experiment Tracking in Python — Deep Dive

MLflow Architecture

MLflow Tracking has three main components:

  1. Backend store: Stores experiment and run metadata (parameters, metrics, tags). Can be a local file system or a database (SQLite, PostgreSQL, MySQL).
  2. Artifact store: Stores large files (models, plots, datasets). Can be local or cloud (S3, GCS, Azure Blob).
  3. Tracking server: An HTTP server that exposes a REST API and the web UI.

Starting a Tracking Server

# Local development (SQLite + local artifacts)
mlflow server \
    --backend-store-uri sqlite:///mlflow.db \
    --default-artifact-root ./mlartifacts \
    --host 0.0.0.0 \
    --port 5000

# Production (PostgreSQL + S3)
mlflow server \
    --backend-store-uri postgresql://user:pass@db-host:5432/mlflow \
    --default-artifact-root s3://my-bucket/mlflow-artifacts \
    --host 0.0.0.0 \
    --port 5000

Point your client code to the server:

import mlflow
mlflow.set_tracking_uri("http://tracking-server:5000")
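MLflow also reads the MLFLOW_TRACKING_URI environment variable on its own, so the URI does not have to be hard-coded. The following is a minimal sketch of a helper that prefers the environment variable and falls back to a default. The names resolve_tracking_uri and configure_tracking are hypothetical, and the default simply reuses the host from the example above.

```python
import os

# Assumption: same placeholder host as in the example above.
DEFAULT_TRACKING_URI = "http://tracking-server:5000"


def resolve_tracking_uri(env=None, default=DEFAULT_TRACKING_URI):
    """Return MLFLOW_TRACKING_URI if it is set and non-empty, else the default."""
    env = os.environ if env is None else env
    uri = env.get("MLFLOW_TRACKING_URI", "").strip()
    return uri or default


def configure_tracking():
    """Apply the resolved URI to the MLflow client and return it."""
    import mlflow  # imported lazily so resolve_tracking_uri has no dependencies

    uri = resolve_tracking_uri()
    mlflow.set_tracking_uri(uri)
    return uri
```

Calling configure_tracking() once at the start of a training script keeps local runs and CI runs on the same code path, with only the environment differing.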

Logging Runs Manually

Basic Run

import mlflow
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, accuracy_score
from sklearn.model_selection import train_test_split

mlflow.set_experiment("churn-prediction")

# X (feature matrix) and y (binary churn labels) are assumed to be loaded already
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run(run_name="rf-baseline"):
    # Log parameters
    params = {"n_estimators": 100, "max_depth": 10, "random_state": 42}
    mlflow.log_params(params)
    
    # Train
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)
    
    # Evaluate and log metrics
    y_pred = model.predict(X_test)
    mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("f1", f1_score(y_test, y_pred))
    
    # Log the model
    mlflow.sklearn.log_model(model, "model")
    
    # Log artifacts
    mlflow.log_artifact("data/feature_config.json")
    
    # Tags
    mlflow.set_tag("author", "alice")
    mlflow.set_tag("dataset_version", "v2.3")

Logging Metric History

Track metrics across training steps:

with mlflow.start_run():
    for epoch in range(100):
        train_loss = train_one_epoch(model, train_loader)
        val_loss = evaluate(model, val_loader)
        
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_loss", val_loss, step=epoch)

The UI renders these as line charts, making it easy to spot overfitting or convergence issues.
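The same history can be inspected in code. The sketch below assumes the metric was logged with step=epoch as above and that lower values are better (as for a loss). The helper names are hypothetical; MlflowClient.get_metric_history is the only MLflow call used.

```python
def best_step(history):
    """Return the (step, value) pair with the lowest value; ties go to the earliest step."""
    return min(history, key=lambda sv: (sv[1], sv[0]))


def steps_since_best(history):
    """Number of steps between the best value and the last logged step.

    A large number suggests validation loss stopped improving a while ago.
    Raises ValueError on an empty history.
    """
    ordered = sorted(history)
    step_of_best, _ = best_step(ordered)
    last_step = ordered[-1][0]
    return last_step - step_of_best


def fetch_history(run_id, key="val_loss"):
    """Fetch (step, value) pairs for one metric of one run from the tracking server."""
    from mlflow import MlflowClient  # lazy import keeps the helpers above dependency-free

    client = MlflowClient()
    return [(m.step, m.value) for m in client.get_metric_history(run_id, key)]
```

For example, steps_since_best(fetch_history(run_id)) could feed a simple "stopped improving N epochs ago" check in a report or a CI gate.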

Autologging

MLflow can automatically log parameters, metrics, and models for supported frameworks:

import mlflow

# Scikit-learn
mlflow.sklearn.autolog()

# PyTorch Lightning
mlflow.pytorch.autolog()

# XGBoost
mlflow.xgboost.autolog()

# TensorFlow / Keras
mlflow.tensorflow.autolog()

With autolog enabled, calling model.fit() records the estimator's parameters, training metrics, and the fitted model:

mlflow.sklearn.autolog()

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=200, max_depth=8)
    model.fit(X_train, y_train)
    # Parameters, training metrics, and the fitted model are logged automatically

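Autolog captures hyperparameters, training metrics, and the fitted model with its signature. Input examples are logged only when you pass log_input_examples=True, and feature importance plots come from the XGBoost and LightGBM integrations rather than the scikit-learn one.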

Searching and Comparing Runs

runs = mlflow.search_runs(
    experiment_names=["churn-prediction"],
    filter_string="metrics.f1 > 0.85 AND params.n_estimators = '200'",
    order_by=["metrics.f1 DESC"],
    max_results=10,
)
print(runs[["run_id", "params.n_estimators", "metrics.f1"]])
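Filter strings are easy to get subtly wrong because parameters and tags are stored as strings and must be quoted, while metrics are compared as numbers. The helper below is a hypothetical sketch (build_filter is not part of MLflow) that assembles a filter from simple conditions. It covers only metrics, params, and tags with basic comparison operators.

```python
NUMERIC_OPS = {"=", "!=", ">", ">=", "<", "<="}
STRING_OPS = {"=", "!="}


def build_filter(conditions):
    """Build an MLflow filter string from (key, operator, value) triples.

    Metric values are rendered as numbers; param and tag values are quoted.
    """
    clauses = []
    for key, op, value in conditions:
        if key.startswith("metrics."):
            if op not in NUMERIC_OPS:
                raise ValueError(f"unsupported operator for a metric: {op}")
            clauses.append(f"{key} {op} {float(value)}")
        elif key.startswith(("params.", "tags.")):
            if op not in STRING_OPS:
                raise ValueError(f"unsupported operator for a param or tag: {op}")
            text = str(value)
            if "'" in text:
                # Quote escaping is out of scope for this sketch.
                raise ValueError("values containing single quotes are not supported")
            clauses.append(f"{key} {op} '{text}'")
        else:
            raise ValueError(f"unsupported key prefix: {key}")
    return " AND ".join(clauses)
```

The result can be passed directly as filter_string to mlflow.search_runs.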

Loading a Previous Run’s Model

best_run = runs.iloc[0]
model_uri = f"runs:/{best_run.run_id}/model"
loaded_model = mlflow.sklearn.load_model(model_uri)
predictions = loaded_model.predict(X_new)  # X_new: unseen rows with the same feature columns as training

Model Registry

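The model registry adds versioning and stage transitions to logged models. Stages are deprecated as of MLflow 2.9 in favor of model version aliases; the stage-based calls below still work on releases that support them: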

# Register a model from a run (run_id is the ID of the run that logged it, e.g. best_run.run_id)
model_uri = f"runs:/{run_id}/model"
mlflow.register_model(model_uri, "churn-classifier")

# Transition to staging
from mlflow import MlflowClient
client = MlflowClient()
client.transition_model_version_stage(
    name="churn-classifier",
    version=1,
    stage="Staging",
)

# Promote to production after validation
client.transition_model_version_stage(
    name="churn-classifier",
    version=1,
    stage="Production",
)

# Load production model by stage
model = mlflow.pyfunc.load_model("models:/churn-classifier/Production")
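On newer MLflow releases, model version aliases are the suggested replacement for stages. The sketch below shows the same promotion flow with an alias. The alias name "champion" and the helper names are assumptions for illustration; set_registered_model_alias and the models:/name@alias URI form are part of MLflow 2.3 and later.

```python
def alias_uri(name, alias):
    """Build a model URI that resolves to whichever version currently holds the alias."""
    return f"models:/{name}@{alias}"


def promote(client, name, version, alias="champion"):
    """Point the alias at the given model version and return the URI to load it.

    client is expected to be an mlflow.MlflowClient instance.
    """
    client.set_registered_model_alias(name, alias, str(version))
    return alias_uri(name, alias)
```

Usage would look like uri = promote(MlflowClient(), "churn-classifier", 1) followed by mlflow.pyfunc.load_model(uri). Reassigning the alias to a later version changes what consumers load without touching their code.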

Custom MLflow Models

Wrap any prediction logic in a custom PythonModel:

class ChurnPredictor(mlflow.pyfunc.PythonModel):
    def __init__(self, threshold=0.5):
        self.threshold = threshold
    
    def load_context(self, context):
        import joblib
        self.model = joblib.load(context.artifacts["sklearn_model"])
        self.scaler = joblib.load(context.artifacts["scaler"])
    
    def predict(self, context, model_input):
        scaled = self.scaler.transform(model_input)
        probas = self.model.predict_proba(scaled)[:, 1]
        return (probas >= self.threshold).astype(int)

# Log the custom model
with mlflow.start_run():
    artifacts = {
        "sklearn_model": "models/rf_model.joblib",
        "scaler": "models/scaler.joblib",
    }
    mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=ChurnPredictor(threshold=0.45),
        artifacts=artifacts,
    )

Integration with PyTorch

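import mlflow
import mlflow.pytorch
import torch

# Note: mlflow.pytorch.autolog() only instruments PyTorch Lightning's Trainer.fit(),
# so a hand-written training loop like this one has to log explicitly.

with mlflow.start_run():
    model = MyNeuralNet()  # your torch.nn.Module
    lr = 0.001
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    mlflow.log_param("lr", lr)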
    
    for epoch in range(50):
        train_loss = train_epoch(model, optimizer, train_loader)
        val_metrics = evaluate(model, val_loader)
        
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_accuracy", val_metrics["accuracy"], step=epoch)
    
    # Log final model
    mlflow.pytorch.log_model(model, "model")

CI/CD Integration

Run experiments in CI and promote the best model automatically:

# .github/workflows/train.yml
name: Train and Evaluate
on:
  push:
    paths: ["src/model/**", "data/features/**"]

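jobs:
  train:
    runs-on: ubuntu-latest
    # Job-level env so both the training and the check step can reach the tracking server
    env:
      MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_URI }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"  # assumption: adjust to your project
      - name: Install dependencies
        # Assumes a requirements.txt at the repository root
        run: pip install -r requirements.txt
      - name: Train model
        run: |
          python src/model/train.py
      - name: Check improvement
        run: |
          python scripts/check_improvement.py --threshold 0.02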

The check_improvement.py script compares the new run against the current production model and promotes it only if the improvement exceeds the threshold.
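The article does not show check_improvement.py, so the following is only a sketch of what such a script might look like. It assumes the newest run in the churn-prediction experiment is the candidate, that f1 is the comparison metric, that the model is registered as churn-classifier, and that a missing production version means the candidate is promoted. All of these are illustrative choices, not the author's implementation.

```python
import argparse


def should_promote(new_score, prod_score, threshold):
    """Promote when there is no production model yet or the gain exceeds the threshold."""
    if prod_score is None:
        return True
    return (new_score - prod_score) > threshold


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Promote the newest run if it improves enough.")
    parser.add_argument("--threshold", type=float, default=0.02)
    parser.add_argument("--metric", default="f1")
    parser.add_argument("--experiment", default="churn-prediction")
    parser.add_argument("--model-name", default="churn-classifier")
    return parser.parse_args(argv)


def main(argv=None):
    import mlflow  # lazy import so the decision logic can be tested without MLflow
    from mlflow import MlflowClient

    args = parse_args(argv)
    client = MlflowClient()

    # Candidate: the most recently started run in the experiment.
    runs = mlflow.search_runs(
        experiment_names=[args.experiment],
        order_by=["attributes.start_time DESC"],
        max_results=1,
    )
    candidate = runs.iloc[0]
    new_score = candidate[f"metrics.{args.metric}"]

    # Baseline: the metric of the run behind the current Production version, if any.
    # Note: this call fails if the registered model does not exist yet.
    prod_versions = client.get_latest_versions(args.model_name, stages=["Production"])
    prod_score = None
    if prod_versions:
        prod_run = client.get_run(prod_versions[0].run_id)
        prod_score = prod_run.data.metrics.get(args.metric)

    if should_promote(new_score, prod_score, args.threshold):
        version = mlflow.register_model(f"runs:/{candidate.run_id}/model", args.model_name)
        client.transition_model_version_stage(args.model_name, version.version, "Production")
        print(f"Promoted version {version.version}")
    else:
        print("No promotion: improvement below threshold")
```

In a real script, main() would be called under the usual `if __name__ == "__main__":` guard. Keeping the decision in should_promote makes the promotion rule unit-testable without a tracking server.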

Best Practices

  1. One experiment per project: Keep related runs together for easy comparison.
  2. Descriptive run names: Use names like rf-depth10-trees100 instead of auto-generated IDs.
  3. Log environment info: Capture Python version, package versions, and Git hash.
  4. Use model signatures: Define input/output schemas so consumers know what the model expects.
  5. Clean up old runs: Archive or delete failed/obsolete runs to keep the UI navigable.
  6. Separate tracking from training code: Create a utility module that handles all MLflow calls, keeping training scripts clean.

# utils/tracking.py
import mlflow

def start_tracked_run(experiment_name, run_name, params):
    """Select the experiment, start a run, and log its parameters.

    Returns the active run; use it as a context manager so the run is closed.
    """
    mlflow.set_experiment(experiment_name)
    run = mlflow.start_run(run_name=run_name)
    mlflow.log_params(params)
    return run
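Practice 3 (logging environment info) fits naturally into the same utility module. The sketch below collects the Python version, platform, Git commit, and selected package versions using only the standard library. The function names and the env.* tag keys are hypothetical naming choices.

```python
import platform
import subprocess
from importlib import metadata


def git_commit():
    """Return the current Git commit hash, or "unknown" outside a repository."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def package_version(name):
    """Return the installed version of a package, or "not-installed"."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not-installed"


def environment_tags(packages=("mlflow", "scikit-learn")):
    """Collect environment details as a flat dict suitable for run tags."""
    tags = {
        "env.python": platform.python_version(),
        "env.platform": platform.platform(),
        "env.git_commit": git_commit(),
    }
    for name in packages:
        tags[f"env.pkg.{name}"] = package_version(name)
    return tags
```

Inside an active run, mlflow.set_tags(environment_tags()) attaches the result so the values can be searched and compared later.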

Alternatives Comparison

| Tool | Strengths | Weaknesses |
|------|-----------|------------|
| MLflow | Open source, framework-agnostic, model registry | UI is basic, scaling requires setup |
| Weights & Biases | Beautiful UI, collaboration features | Cloud-hosted (cost), vendor lock-in |
| Neptune.ai | Real-time monitoring, easy integration | Paid beyond free tier |
| DVC | Data versioning + experiments, Git-native | Steeper learning curve |

One thing to remember: MLflow turns ad-hoc experimentation into a structured, reproducible process — the earlier you adopt it in a project, the less time you waste recreating results and debugging “what changed.”
