Model Deployment Basics
Definition
Model deployment is the process of putting a trained machine learning model into a production environment, where it receives real-time or batch data and returns predictions to users or other systems. Deployment turns a model from a research artifact into a usable service. Key concepts include model serialization (saving model weights and architecture), API creation (exposing HTTP endpoints for inference), containerization (packaging the model with its dependencies, typically using Docker), and serving infrastructure (managing model instances, load balancing, and scaling). Common deployment patterns are batch prediction (processing large datasets offline), real-time inference (low-latency API responses), and edge deployment (running models on devices). Many modern deployments follow a microservices architecture, in which each model is a containerized service that can be scaled and updated independently. Key considerations include latency requirements, throughput capacity, model size, dependency management, monitoring, and rollback capabilities. Popular tools include Flask and FastAPI for APIs, Docker for containerization, and Kubernetes for orchestration.
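The batch and real-time patterns described above can share one code path, so that a single request is just a batch of size one. The following is a minimal sketch under that assumption; `chunked`, `predict_batch`, `predict_one`, and `toy_model` are hypothetical names used for illustration, and `toy_model` stands in for a real trained model.

```python
from typing import Callable, Iterable, Iterator, List, Sequence

Row = Sequence[float]


def chunked(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield consecutive lists of at most `size` rows."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def predict_batch(predict_fn: Callable[[List[Row]], list],
                  rows: Iterable[Row], chunk_size: int = 1000) -> list:
    """Batch pattern: score a large dataset chunk by chunk to bound memory use."""
    results: list = []
    for batch in chunked(rows, chunk_size):
        results.extend(predict_fn(batch))
    return results


def predict_one(predict_fn: Callable[[List[Row]], list], row: Row):
    """Real-time pattern: a single request reuses the batch code path."""
    return predict_batch(predict_fn, [row], chunk_size=1)[0]


def toy_model(batch: List[Row]) -> list:
    """Hypothetical stand-in for model.predict: returns the sum of each row."""
    return [sum(row) for row in batch]


rows = [[1, 2], [3, 4], [5, 6]]
batch_result = predict_batch(toy_model, rows, chunk_size=2)
single_result = predict_one(toy_model, [3, 4])
print(batch_result, single_result)
```

Routing both patterns through the same function is one way to keep single and batch predictions consistent, at the cost of a small per-request overhead.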
Intuition
Imagine you trained a world-class chef (your model) in a private kitchen (training environment). Deployment is moving that chef to a restaurant where customers can order food. You need to: give them the recipes (model weights), set up their station (dependencies), create a menu system (API), make sure the kitchen can handle rush hour (scaling), and ensure food safety standards (monitoring). Just having a great chef isn't enough - they need to serve actual customers reliably.
Mathematical Formula
\[ P(\text{latency} \le L_{target}) \ge SLO_{threshold} \]
Step-by-Step Explanation:
- latency: response time from request to prediction
- \(L_{target}\): target latency (e.g., 100ms for real-time)
- \(SLO_{threshold}\): service level objective, the required fraction of requests that meet the target (e.g., 0.99, meaning the 99th-percentile latency must not exceed \(L_{target}\))
- \(P\): probability that latency meets target
- Deployment must optimize this probability through caching, batching, and scaling
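The probability that latency stays within \(L_{target}\) can be estimated empirically as the share of observed requests that met the target. The sketch below assumes latencies are collected in milliseconds; `slo_attainment` and `meets_slo` are hypothetical helper names, and the sample values are illustrative, not measured data.

```python
from typing import Sequence


def slo_attainment(latencies_ms: Sequence[float], target_ms: float) -> float:
    """Empirical estimate of P(latency <= L_target): share of requests within target."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    within = sum(1 for latency in latencies_ms if latency <= target_ms)
    return within / len(latencies_ms)


def meets_slo(latencies_ms: Sequence[float], target_ms: float,
              slo_threshold: float) -> bool:
    """True when the estimated probability reaches the SLO threshold."""
    return slo_attainment(latencies_ms, target_ms) >= slo_threshold


# Hypothetical latency samples in milliseconds (illustrative only).
samples = [42.0, 55.0, 61.0, 48.0, 97.0, 120.0, 50.0, 73.0, 66.0, 58.0]

attainment = slo_attainment(samples, target_ms=100.0)
print(attainment)                        # 0.9 (9 of 10 requests within 100 ms)
print(meets_slo(samples, 100.0, 0.99))   # False: 0.9 is below 0.99
```

With only ten samples the estimate is coarse; a real check would use far more requests per evaluation window.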
Real-World Use Cases
Real-time recommendation API: Deploy a collaborative filtering model behind a FastAPI endpoint that returns personalized product recommendations in under 50ms for millions of users browsing the site.
Fraud detection service: Deploy an XGBoost model as a microservice that scores credit card transactions for fraud probability and integrates with payment processing pipelines.
Medical imaging inference: Containerize a TensorFlow CNN model for radiology image classification and deploy it on a GPU-enabled Kubernetes cluster for hospital PACS integration.
Edge deployment on factory floor: Deploy a lightweight quality control model directly on industrial cameras using TensorRT optimization, processing images locally without cloud connectivity.
Implementation
Manual Implementation (No Libraries)
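# Simple model saving and loading with the standard library
# Assumes `model` is an already-trained estimator and `X_new` holds new feature rows
import pickle
# Save model
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
# Load and predict (only unpickle files you trust: pickle can execute arbitrary code)
with open('model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
predictions = loaded_model.predict(X_new)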
# Problems with this approach:
# - No API for remote access
# - Tight coupling to Python version
# - No dependency management
# - No scaling or load balancing
# - No monitoring or logging
# - No containerization
# - Manual deployment only
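One way to soften the Python-version coupling noted above, while still using only the standard library, is to store a small metadata file next to the pickle. The sketch below is an assumption about how that could look, not an established convention: `save_with_metadata` and `load_checked` are hypothetical helpers, and a plain dictionary stands in for a trained model.

```python
import hashlib
import json
import pickle
import platform
import tempfile
import warnings
from pathlib import Path


def save_with_metadata(model, model_path, version: str) -> dict:
    """Pickle the model and write a JSON sidecar describing how it was saved."""
    model_path = Path(model_path)
    payload = pickle.dumps(model)
    model_path.write_bytes(payload)
    metadata = {
        "model_version": version,
        "python_version": platform.python_version(),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    model_path.with_suffix(".json").write_text(json.dumps(metadata, indent=2))
    return metadata


def load_checked(model_path):
    """Load the model only if the file matches the checksum in its sidecar."""
    model_path = Path(model_path)
    metadata = json.loads(model_path.with_suffix(".json").read_text())
    payload = model_path.read_bytes()
    if hashlib.sha256(payload).hexdigest() != metadata["sha256"]:
        raise ValueError("model file does not match the recorded checksum")
    if metadata["python_version"] != platform.python_version():
        warnings.warn("model was saved under a different Python version")
    return pickle.loads(payload), metadata


# Hypothetical stand-in for a trained model object.
toy_model = {"weights": [0.1, 0.2, 0.3], "bias": 0.5}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"
    saved = save_with_metadata(toy_model, path, version="1.0.0")
    restored, meta = load_checked(path)

print(meta["model_version"], restored == toy_model)
```

The checksum detects a corrupted or swapped file, but it does not make unpickling untrusted data safe.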
Using Libraries (flask, fastapi, gunicorn, docker, joblib)
# Complete deployment pipeline with Flask, Docker, and Gunicorn
# 1. Save model using joblib (better for sklearn)
import joblib
joblib.dump(model, 'app/model.pkl')
# 2. Create Flask API (app/app.py, next to model.pkl so the Dockerfile's COPY app/ picks up both)
"""
from flask import Flask, request, jsonify
import joblib
import numpy as np
import logging
from datetime import datetime
app = Flask(__name__)
# Load model once at startup, not on every request
model = joblib.load('model.pkl')
# Setup logging
logging.basicConfig(level=logging.INFO)
@app.route('/health', methods=['GET'])
def health():
    return jsonify({'status': 'healthy', 'timestamp': datetime.now().isoformat()})
@app.route('/predict', methods=['POST'])
def predict():
    try:
        # Get input data: a single sample as a flat list of feature values
        data = request.get_json()
        features = np.array(data['features']).reshape(1, -1)
        # Log request
        app.logger.info(f'Received prediction request: {len(features)} samples')
        # Make prediction
        prediction = model.predict(features)
        probability = model.predict_proba(features) if hasattr(model, 'predict_proba') else None
        # Build response
        response = {
            'prediction': prediction.tolist(),
            'model_version': '1.0.0',
            'timestamp': datetime.now().isoformat()
        }
        if probability is not None:
            response['probability'] = probability.max(axis=1).tolist()
        return jsonify(response)
    except (KeyError, TypeError, ValueError) as e:
        # Malformed or missing input is a client error, not a server error
        app.logger.warning(f'Bad prediction request: {str(e)}')
        return jsonify({'error': str(e)}), 400
    except Exception as e:
        app.logger.error(f'Prediction error: {str(e)}')
        return jsonify({'error': str(e)}), 500
@app.route('/predict/batch', methods=['POST'])
def predict_batch():
    try:
        # Expects a list of samples, each a list of feature values
        data = request.get_json()
        features = np.array(data['features'])
        # Batch prediction
        predictions = model.predict(features)
        return jsonify({
            'predictions': predictions.tolist(),
            'count': len(predictions),
            'timestamp': datetime.now().isoformat()
        })
    except (KeyError, TypeError, ValueError) as e:
        return jsonify({'error': str(e)}), 400
    except Exception as e:
        app.logger.error(f'Batch prediction error: {str(e)}')
        return jsonify({'error': str(e)}), 500
if __name__ == '__main__':
    # Development server only; the container runs the app under gunicorn
    app.run(host='0.0.0.0', port=5000)
"""
# 3. requirements.txt
"""
flask==2.3.3
gunicorn==21.2.0
joblib==1.3.2
numpy==1.24.3
scikit-learn==1.3.0
"""
# 4. Dockerfile
"""
FROM python:3.9-slim
WORKDIR /app
# Copy requirements
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy model and app
COPY app/ .
# Expose port
EXPOSE 5000
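# Health check (python:3.9-slim ships without curl, so probe with the Python standard library)
HEALTHCHECK --interval=30s --timeout=30s --start-period=5s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:5000/health')" || exit 1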
# Run with gunicorn
CMD ["gunicorn", "-w", "4", "-b", "0.0.0.0:5000", "app:app"]
"""
# 5. Build and run Docker container
"""
# Build image
docker build -t ml-model:v1 .
# Run container
docker run -p 5000:5000 ml-model:v1
# Test API
curl -X POST http://localhost:5000/predict \
  -H 'Content-Type: application/json' \
  -d '{"features": [5.1, 3.5, 1.4, 0.2]}'
"""
# 6. Production deployment with Docker Compose
"""
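# docker-compose.yml
version: '3.8'
services:
  ml-api:
    build: .
    # No fixed host port here: three replicas cannot all bind host port 5000.
    # nginx reaches the replicas over the Compose network instead.
    expose:
      - '5000'
    environment:
      - MODEL_VERSION=1.0.0
      - LOG_LEVEL=INFO
    deploy:
      replicas: 3
      resources:
        limits:
          cpus: '2'
          memory: 4G
    healthcheck:
      # The slim Python image has no curl, so probe with the standard library
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:5000/health')"]
      interval: 30s
      timeout: 10s
      retries: 3
  nginx:
    image: nginx:alpine
    ports:
      - '80:80'
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - ml-api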
"""
# 7. Using FastAPI (modern alternative to Flask)
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import joblib
import numpy as np
from typing import List, Optional
app = FastAPI(title='ML Model API', version='1.0.0')
# Load model once at import time
model = joblib.load('model.pkl')
class PredictionRequest(BaseModel):
    features: List[float]
class PredictionResponse(BaseModel):
    prediction: List
    probability: Optional[float] = None
    model_version: str
@app.get('/health')
async def health():
    return {'status': 'healthy'}
# Plain `def`: FastAPI runs it in a thread pool, so the blocking predict call does not stall the event loop
@app.post('/predict', response_model=PredictionResponse)
def predict(request: PredictionRequest):
    try:
        features = np.array(request.features).reshape(1, -1)
        prediction = model.predict(features)
        # Convert the NumPy scalar to a built-in float for the response model
        prob = float(model.predict_proba(features).max()) if hasattr(model, 'predict_proba') else None
        return PredictionResponse(
            prediction=prediction.tolist(),
            probability=prob,
            model_version='1.0.0'
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
# Run: uvicorn app:app --host 0.0.0.0 --port 5000
# 8. Model optimization with ONNX
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
# Convert to ONNX, which is often faster at inference time
# The 4 is the number of input features; adjust it to match your model
initial_type = [('float_input', FloatTensorType([None, 4]))]
onnx_model = convert_sklearn(model, initial_types=initial_type)
with open('model.onnx', 'wb') as f:
    f.write(onnx_model.SerializeToString())
# Load with ONNX Runtime for optimized inference
import onnxruntime as ort
session = ort.InferenceSession('model.onnx')
input_name = session.get_inputs()[0].name
# Assumes `X_test` is a NumPy array of shape (n_samples, 4)
# session.run returns a list of outputs; for a classifier the first entry holds the predicted labels
preds = session.run(None, {input_name: X_test.astype(np.float32)})
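Whether an optimization such as the ONNX conversion actually reduces latency is worth measuring rather than assuming. The following is a minimal timing sketch using only the standard library; `measure_latency_ms` and `dummy_predict` are hypothetical names, and `dummy_predict` stands in for a real predict call.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency_ms(predict_fn: Callable, payload, n_runs: int = 200,
                       warmup: int = 10) -> Dict[str, float]:
    """Time repeated calls to predict_fn and summarize latency in milliseconds."""
    # Warm-up calls keep one-time costs (lazy imports, caches) out of the numbers.
    for _ in range(warmup):
        predict_fn(payload)
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict_fn(payload)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    # Approximate 99th percentile by rank in the sorted timings.
    p99_index = min(len(timings) - 1, int(0.99 * len(timings)))
    return {
        "p50_ms": statistics.median(timings),
        "p99_ms": timings[p99_index],
        "max_ms": timings[-1],
    }


def dummy_predict(features):
    """Hypothetical stand-in for a model's predict call."""
    return [sum(features)]


stats = measure_latency_ms(dummy_predict, [5.1, 3.5, 1.4, 0.2])
print(stats)
```

To compare two serving paths, the same harness could be called once with a function wrapping `model.predict` and once with a function wrapping `session.run`, using identical input in both runs.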
When to Use
✅ Appropriate Use Cases:
- Models need to serve predictions to other applications
- Batch processing jobs require scheduled inference
- Real-time predictions needed for user-facing features
- Model updates must be deployed without downtime
- Multiple teams need access to the same model
- Edge deployment for offline or low-latency scenarios
- A/B testing different model versions
- Integration with existing microservices architecture
- Regulatory requirements mandate versioned model serving
❌ Avoid When:
- Research and experimentation only
- Single user ad-hoc analysis
- When managed services suffice (SageMaker, Vertex AI)
- Simple notebooks for one-time predictions
- When team lacks DevOps/infrastructure expertise
- Very low-scale prototypes (<10 requests/day)
- When model changes too frequently for versioning
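For the A/B testing case listed among the appropriate uses, requests have to be split between model versions in a stable way. One common approach, sketched here as an assumption rather than a prescribed method, hashes a user identifier into a bucket so that the same user always reaches the same version. `assign_variant` and the variant labels `model_a` and `model_b` are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, treatment_share: float = 0.1) -> str:
    """Deterministically route a user to 'model_b' with probability ~treatment_share."""
    if not 0.0 <= treatment_share <= 1.0:
        raise ValueError("treatment_share must be between 0 and 1")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "model_b" if bucket < treatment_share else "model_a"


# Hypothetical user IDs, used only to show the resulting split.
assignments = [assign_variant(f"user-{i}", treatment_share=0.1) for i in range(10000)]
share_b = assignments.count("model_b") / len(assignments)
print(round(share_b, 3))
```

Because the assignment depends only on the ID and the share, it needs no stored state and survives service restarts.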
Common Pitfalls
- Not versioning the model artifact separately from code
- Loading model on every request (performance hit)
- Not handling model input validation
- Ignoring memory leaks in long-running services
- Not setting up health checks and monitoring
- Hardcoding model paths instead of configuration
- Forgetting to pin dependency versions
- Not testing the containerized version locally
- No logging for debugging production issues
- Not handling batch vs single prediction consistently
- Ignoring cold start latency
- Not setting up proper error responses
- Deploying without load testing
- Forgetting to handle model input/output serialization
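Several of these pitfalls, input validation and proper error responses in particular, can be handled by checking the request body before it reaches the model. The sketch below uses only the standard library and assumes the single-sample payload shape `{"features": [...]}`; `validate_features` and `InvalidInput` are hypothetical names.

```python
import math
from typing import Any, List


class InvalidInput(ValueError):
    """Raised for a malformed request; an API layer could map it to HTTP 400."""


def validate_features(payload: Any, n_features: int) -> List[float]:
    """Return the features as floats, or raise InvalidInput with a clear message."""
    if not isinstance(payload, dict) or "features" not in payload:
        raise InvalidInput("body must be a JSON object with a 'features' field")
    features = payload["features"]
    if not isinstance(features, list) or len(features) != n_features:
        raise InvalidInput(f"'features' must be a list of {n_features} numbers")
    cleaned = []
    for value in features:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise InvalidInput("every feature must be a number")
        try:
            number = float(value)
        except OverflowError:
            raise InvalidInput("feature value is too large") from None
        if not math.isfinite(number):
            raise InvalidInput("features must be finite numbers")
        cleaned.append(number)
    return cleaned


valid = validate_features({"features": [5.1, 3.5, 1.4, 0.2]}, n_features=4)
print(valid)

try:
    validate_features({"features": [5.1, "oops", 1.4, 0.2]}, n_features=4)
except InvalidInput as exc:
    error_message = str(exc)
    print(error_message)
```

Keeping validation in one function also helps the single and batch endpoints apply the same rules.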