Model Versioning with MLflow and DVC
Definition
Model versioning is the practice of systematically tracking and managing the versions of a machine learning model throughout its lifecycle, much as Git tracks code changes. MLflow is an open-source platform for the ML lifecycle, covering experiment tracking, model packaging, and a model registry. DVC (Data Version Control) extends Git to handle large data files, ML models, and pipelines that do not fit well in traditional version control. The two solve complementary problems: MLflow manages the model lifecycle from experimentation to deployment, while DVC versions datasets and tracks data lineage. Capturing the exact code, data, and parameters behind each model version makes training reproducible, which in turn enables rollback to previous versions, A/B testing between model variants, and running multiple model versions in production. The registry pattern separates model development from deployment: data scientists push new versions while ops teams decide which version serves production.
Intuition
Imagine you're writing a novel, but instead of saving each draft, you just overwrite the same file. You'd lose all your previous work and couldn't compare versions. Now imagine you also had external research files (datasets) too large to email. Model versioning is like having a smart filing system that: saves every draft (model versions), tracks which research files (data) went into each draft, remembers the writing conditions (hyperparameters), and lets publishers (production) pick which draft to print while you keep writing new ones.
Mathematical Formula
\[ M_v = f(C_v, D_v, P_v) \]
Step-by-Step Explanation:
- \(M_v\): model version v (reproducible artifact)
- \(C_v\): code version at time of training
- \(D_v\): dataset version used for training
- \(P_v\): hyperparameters and configuration
- \(f\): training function (assumed deterministic given fixed random seeds)
- Git tracks \(C_v\), DVC tracks \(D_v\), and MLflow tracks \(P_v\) and \(M_v\)
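The definitions above say that a model version is determined by its code, data, and parameters. One way to make that concrete is to derive an identifier from those three inputs, so that identical inputs always map to the same identifier. The sketch below is illustrative only: `version_fingerprint` and its arguments are hypothetical names, not part of MLflow or DVC, and the commit and data hashes are placeholder strings.

```python
import hashlib
import json


def version_fingerprint(code_rev: str, data_hash: str, params: dict) -> str:
    """Return a short, stable identifier for the triple (C_v, D_v, P_v)."""
    # sort_keys makes the serialization independent of dict insertion order.
    payload = json.dumps(
        {"code": code_rev, "data": data_hash, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Placeholder values standing in for a Git commit and a dataset hash.
fp_a = version_fingerprint("abc123", "d41d8cd9", {"lr": 0.001, "epochs": 100})
fp_b = version_fingerprint("abc123", "d41d8cd9", {"epochs": 100, "lr": 0.001})
fp_c = version_fingerprint("abc123", "d41d8cd9", {"lr": 0.01, "epochs": 100})

print(fp_a == fp_b)  # same inputs, different key order
print(fp_a == fp_c)  # one hyperparameter changed
```

If any of the three inputs changes, the identifier changes, which is the property that lets a version be traced back to exactly what produced it.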
Real-World Use Cases
Fraud model rollback: When a deployed fraud detection model starts underperforming after a data drift event, quickly roll back to the previous stable version while investigating the issue, limiting business impact.
Regulatory audit trail: Maintain complete version history of diagnostic models with exact dataset versions and training parameters for FDA approval and compliance reviews.
A/B testing recommendations: Run two model versions simultaneously, with v2.1 serving 90% of users and experimental v2.2 serving 10%, using the MLflow registry to manage the canary deployment.
Safety-critical model updates: Version every perception model with associated training data, sensor calibration parameters, and validation results for traceability in incident investigations.
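The A/B testing use case above needs a stable way to split traffic between two versions. The registry stores the versions, but the routing decision usually lives in application code. A minimal sketch, assuming users are identified by a string ID; `assign_version` and the `user-N` IDs are hypothetical, and the version labels are taken from the example above.

```python
import hashlib


def assign_version(user_id: str, canary_percent: int = 10) -> str:
    """Deterministically route a user to the stable or canary model version."""
    # Hash the user ID so the same user always lands in the same bucket.
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in 0..99
    return "v2.2" if bucket < canary_percent else "v2.1"


assignments = [assign_version(f"user-{i}") for i in range(10_000)]
canary_share = assignments.count("v2.2") / len(assignments)
print(f"canary share: {canary_share:.3f}")
```

Hashing rather than drawing a random number per request keeps each user on one version for the whole experiment, which makes the comparison between versions cleaner.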
Implementation
Manual Implementation (No Libraries)
# Manual model versioning - brittle
import pickle
import json
import os
from datetime import datetime
# Create version directory
version = 'v1.2.3'
os.makedirs(f'models/{version}', exist_ok=True)
# Save model ('model' is assumed to be an already-trained estimator)
with open(f'models/{version}/model.pkl', 'wb') as f:
    pickle.dump(model, f)
# Save metadata manually
metadata = {
    'version': version,
    'created': datetime.now().isoformat(),
    'accuracy': 0.95,
    'training_data': 'data/train_v2.csv',
    'hyperparameters': {'lr': 0.001, 'epochs': 100},
    'git_commit': 'abc123'
}
with open(f'models/{version}/metadata.json', 'w') as f:
    json.dump(metadata, f)
# Problems:
# - No central registry
# - Manual metadata tracking is error-prone
# - No data versioning
# - Can't track large files in Git
# - No model promotion workflow
# - Hard to reproduce exactly
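Several of the problems listed above come from typing metadata in by hand. A minimal sketch, using only the standard library, of capturing the dataset hash and runtime details automatically instead. `file_sha256` and `build_metadata` are hypothetical helper names, and the temporary CSV stands in for a real training set.

```python
import hashlib
import json
import platform
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_metadata(version: str, data_path: Path, params: dict) -> dict:
    """Collect metadata from the environment instead of typing it in."""
    return {
        "version": version,
        "training_data": str(data_path),
        "training_data_sha256": file_sha256(data_path),
        "hyperparameters": params,
        "python_version": platform.python_version(),
    }


with tempfile.TemporaryDirectory() as tmp:
    data_path = Path(tmp) / "train.csv"  # stand-in dataset
    data_path.write_text("x,y\n1,0\n2,1\n")
    metadata = build_metadata("v1.2.3", data_path, {"lr": 0.001, "epochs": 100})
    print(json.dumps(metadata, indent=2))
```

This removes some of the manual error, but it still leaves the central registry, promotion workflow, and large-file storage unsolved.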
Using Libraries (mlflow, dvc)
# MLflow Model Tracking and Registry
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
# Start MLflow tracking server:
# mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./artifacts
# Set tracking URI
mlflow.set_tracking_uri('http://localhost:5000')
mlflow.set_experiment('customer-churn')
# Training with MLflow logging
# X_train, y_train, X_val, y_val are assumed to come from an earlier train_test_split
params = {'n_estimators': 100, 'max_depth': 10, 'min_samples_split': 5}
with mlflow.start_run():
    # Log parameters (one dict, so logged values cannot drift from the ones used)
    mlflow.log_params(params)
    # Train model
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)
    # Log metrics
    train_acc = model.score(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    mlflow.log_metric('train_accuracy', train_acc)
    mlflow.log_metric('val_accuracy', val_acc)
    # Log model to registry
    mlflow.sklearn.log_model(
        model,
        artifact_path='model',
        registered_model_name='churn-predictor'
    )
    # Log artifacts (assumes these files were written earlier in the run)
    mlflow.log_artifact('confusion_matrix.png')
    mlflow.log_artifact('feature_importance.csv')
# Model Registry Operations
from mlflow.tracking import MlflowClient
client = MlflowClient()
# List registered models
for registered in client.search_registered_models():
    print(f'Model: {registered.name}')
# Get the latest version in each stage
# (stage-based APIs are deprecated in MLflow 2.9+ in favor of model version aliases)
for mv in client.get_latest_versions('churn-predictor'):
    print(f'Version {mv.version} ({mv.current_stage}): {mv.status}')
# Transition model stage
client.transition_model_version_stage(
    name='churn-predictor',
    version=3,
    stage='Staging'
)
# Load specific version
model_uri = 'models:/churn-predictor/3'
# or 'models:/churn-predictor/Staging'
model = mlflow.sklearn.load_model(model_uri)
# DVC for Data Versioning
# Initialize DVC
# dvc init
# Track dataset
# dvc add data/train.csv
# git add data/train.csv.dvc .gitignore
# git commit -m 'Add training data v1'
# Python DVC integration
import dvc.api
# Resolve the remote storage URL of a specific data version
data_url = dvc.api.get_url(
    path='data/train.csv',
    repo='https://github.com/org/repo',
    rev='v1.0'  # Git tag or commit
)
# Or use the DVC Repo class (internal API; may change between DVC releases)
from dvc.repo import Repo
repo = Repo('.')
# Pull the data file for the currently checked-out revision
repo.pull(targets=['data/train.csv'])
# DVC Pipeline Definition (dvc.yaml)
"""
stages:
  prepare:
    cmd: python src/prepare.py data/raw.csv data/prepared.csv
    deps:
      - data/raw.csv
      - src/prepare.py
    outs:
      - data/prepared.csv
  train:
    cmd: python src/train.py data/prepared.csv model.pkl
    deps:
      - data/prepared.csv
      - src/train.py
    outs:
      - model.pkl
    metrics:
      - metrics.json:
          cache: false
"""
# Run DVC pipeline
# dvc repro
# Push to remote storage
# dvc remote add -d myremote s3://mybucket/dvc
# dvc push
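The `deps` and `outs` entries in the pipeline definition above are what allow `dvc repro` to skip stages whose inputs have not changed. The sketch below illustrates that idea in plain Python. It is a simplified model, not DVC's actual implementation: `stale_stages`, the `pipeline` dictionary, and the short hash strings are all hypothetical.

```python
def stale_stages(pipeline: dict, recorded: dict, current: dict) -> list:
    """Return the stage names that need rerunning, in pipeline order.

    pipeline: stage name -> {"deps": [...], "outs": [...]}, upstream stages first
    recorded: path -> hash saved after the last successful run
    current:  path -> hash of the file as it is now
    """
    stale = []
    dirty_outputs = set()  # outputs of stages already marked stale
    for name, stage in pipeline.items():
        changed = any(recorded.get(dep) != current.get(dep) for dep in stage["deps"])
        upstream_dirty = any(dep in dirty_outputs for dep in stage["deps"])
        if changed or upstream_dirty:
            stale.append(name)
            dirty_outputs.update(stage["outs"])
    return stale


pipeline = {
    "prepare": {"deps": ["data/raw.csv", "src/prepare.py"], "outs": ["data/prepared.csv"]},
    "train": {"deps": ["data/prepared.csv", "src/train.py"], "outs": ["model.pkl"]},
}
recorded = {
    "data/raw.csv": "a1", "src/prepare.py": "b1",
    "data/prepared.csv": "c1", "src/train.py": "d1",
}

# Only the training script changed: prepare can be skipped.
print(stale_stages(pipeline, recorded, {**recorded, "src/train.py": "d2"}))  # ['train']
# The raw data changed: prepare reruns, and train follows because its input is regenerated.
print(stale_stages(pipeline, recorded, {**recorded, "data/raw.csv": "a2"}))  # ['prepare', 'train']
```

In this toy model the recorded hashes play the role that a lock file plays in a real pipeline tool: they describe the state of every dependency at the last successful run.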
When to Use
✅ Appropriate Use Cases:
- Multiple model versions in production (A/B testing)
- Need for model rollback capabilities
- Team collaboration on model development
- Regulatory compliance requiring audit trails
- Large datasets that don't fit in Git
- Reproducible ML pipelines
- Model promotion workflows (dev → staging → prod)
- Tracking data lineage alongside models
- Sharing models across teams or organizations
- Automated model deployment pipelines
- Experiment tracking with artifact storage
❌ Avoid When:
- Single model with no versioning needs
- Very small datasets (Git handles fine)
- Prototypes with no production path
- When a managed platform (SageMaker, Vertex AI) or an existing ML platform already provides versioning
- Teams with simpler needs (just pickle + Git LFS)
- Proof-of-concept projects without collaboration
- Simple scripts with no data dependencies
Common Pitfalls
- Not versioning data alongside models (reproducibility gap)
- Using MLflow without setting up artifact storage
- Forgetting to log model dependencies
- Not setting up DVC remote for team collaboration
- Overwriting production models without staging
- Not tagging Git commits when registering models
- Ignoring data drift between model versions
- Not backing up DVC cache
- Conflating experiment tracking with the model registry (registering every experimental run instead of only promotion candidates)
- Not cleaning up old model versions (storage costs)
- Missing dependency logging (Python version, libraries)
- Not testing that a model loads correctly before registering it
- Hardcoding model paths instead of using registry URIs
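The last pitfall is cheap to avoid: build model references from the registered name plus a version or alias rather than from a filesystem path. A minimal sketch follows. `registry_uri` is a hypothetical helper and `champion` is an illustrative alias name; the `models:/<name>/<version>` and `models:/<name>@<alias>` forms follow MLflow's model URI schemes, with the alias form assuming an MLflow release that supports model version aliases.

```python
from typing import Optional, Union


def registry_uri(
    name: str,
    version: Optional[Union[int, str]] = None,
    alias: Optional[str] = None,
) -> str:
    """Build a registry URI from a model name and exactly one of version or alias."""
    if (version is None) == (alias is None):
        raise ValueError("pass exactly one of version or alias")
    if alias is not None:
        return f"models:/{name}@{alias}"
    return f"models:/{name}/{version}"


print(registry_uri("churn-predictor", version=3))
print(registry_uri("churn-predictor", alias="champion"))
```

Serving code that loads through an alias can pick up a newly promoted version without a code change, whereas a hardcoded path has to be edited on every release.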