Model Deployment Basics

Definition

Model deployment is the process of putting a trained machine learning model into a production environment where it can receive real-time or batch data and return predictions to users or other systems. Deployment turns a model from a research artifact into a usable service.

Key concepts include model serialization (saving model weights and architecture), API creation (exposing HTTP endpoints for inference), containerization (packaging the model with its dependencies, typically with Docker), and serving infrastructure (managing model instances, load balancing, and scaling). Common deployment patterns are batch prediction (processing large datasets offline), real-time inference (low-latency API responses), and edge deployment (running models on devices).

Modern deployments often follow a microservices architecture, in which each model runs as a containerized service that can be scaled and updated independently. Key considerations include latency requirements, throughput capacity, model size, dependency management, monitoring, and rollback capabilities. Popular tools include Flask and FastAPI for APIs, Docker for containerization, and Kubernetes for orchestration.

Intuition

Imagine you trained a world-class chef (your model) in a private kitchen (training environment). Deployment is moving that chef to a restaurant where customers can order food. You need to: give them the recipes (model weights), set up their station (dependencies), create a menu system (API), make sure the kitchen can handle rush hour (scaling), and ensure food safety standards (monitoring). Just having a great chef isn't enough - they need to serve actual customers reliably.

Mathematical Formula

Service Level Objective:

\(P(\text{latency} < L_{target}) \geq SLO_{threshold}\)

Step-by-Step Explanation:

latency: response time from request to prediction
\(L_{target}\): target latency (e.g., 100 ms for real-time)
\(SLO_{threshold}\): required fraction of requests that meet the target (e.g., 0.99, which means the 99th-percentile latency stays below \(L_{target}\))
P: probability that a request's latency meets the target
Deployment must keep this probability at or above the threshold through caching, batching, and scaling
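
As a rough illustration of the objective above, the sketch below estimates P(latency < L_target) from a sample of observed latencies and compares it with the threshold. The function names `slo_compliance` and `meets_slo`, the sample values, and the 100 ms / 0.99 settings are illustrative assumptions, and the threshold is treated as inclusive.

```python
def slo_compliance(latencies_ms, target_ms):
    """Empirical estimate of P(latency < target_ms) from observed samples."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    within = sum(1 for latency in latencies_ms if latency < target_ms)
    return within / len(latencies_ms)


def meets_slo(latencies_ms, target_ms=100.0, threshold=0.99):
    """True if the share of requests under target_ms is at least threshold."""
    return slo_compliance(latencies_ms, target_ms) >= threshold


# Hypothetical sample: 98 fast requests and 2 slow ones
samples = [20.0] * 98 + [250.0, 400.0]
compliance = slo_compliance(samples, target_ms=100.0)
print(compliance)          # 0.98
print(meets_slo(samples))  # False, because 0.98 is below the 0.99 threshold
```

In practice the samples would come from request logs or a metrics system, and the check would run over a rolling time window rather than a fixed list.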

Real-World Use Cases

E-commerce

Real-time recommendation API: Deploy a collaborative filtering model behind a FastAPI endpoint that returns personalized product recommendations in under 50ms for millions of users browsing the site.

Finance

Fraud detection service: Deploy an XGBoost model as a microservice that scores credit card transactions for fraud probability, integrated with payment processing pipelines.

Healthcare

Medical imaging inference: Containerize a TensorFlow CNN model for radiology image classification, deployed on GPU-enabled Kubernetes cluster for hospital PACS integration.

Manufacturing

Edge deployment on factory floor: Deploy lightweight quality control model directly on industrial cameras using TensorRT optimization, processing images locally without cloud connectivity.

Implementation

Manual Implementation (No Libraries)

Simple pickle serialization works for local use but is not enough for production. Real deployment also requires containerization, API design, dependency management, and infrastructure orchestration.

# Simple model saving and loading
# Assumes `model` is a trained estimator and `X_new` holds new feature rows
import pickle

# Save model
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Load and predict (only unpickle files from a trusted source)
with open('model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
predictions = loaded_model.predict(X_new)

# Problems with this approach:
# - No API for remote access
# - Tight coupling to Python version
# - No dependency management
# - No scaling or load balancing
# - No monitoring or logging
# - No containerization
# - Manual deployment only
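
One small step beyond a bare pickle file is to record what produced it. The sketch below writes a JSON sidecar next to the artifact with the Python version, a version label, and a SHA-256 checksum, and refuses to load an artifact whose checksum does not match. The helper names `save_with_metadata` and `load_verified` and the sidecar layout are hypothetical, chosen only for illustration.

```python
import hashlib
import json
import pickle
import platform
import tempfile
from pathlib import Path


def save_with_metadata(model, path, model_version):
    """Pickle `model` to `path` and write a JSON sidecar describing the artifact."""
    path = Path(path)
    payload = pickle.dumps(model)
    path.write_bytes(payload)
    metadata = {
        "model_version": model_version,
        "python_version": platform.python_version(),
        "pickle_protocol": pickle.DEFAULT_PROTOCOL,
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    # model.pkl -> model.json in the same directory
    path.with_suffix(".json").write_text(json.dumps(metadata, indent=2))
    return metadata


def load_verified(path):
    """Load the pickle only if its checksum matches the sidecar."""
    path = Path(path)
    metadata = json.loads(path.with_suffix(".json").read_text())
    payload = path.read_bytes()
    if hashlib.sha256(payload).hexdigest() != metadata["sha256"]:
        raise ValueError("model artifact does not match its recorded checksum")
    return pickle.loads(payload), metadata


# Stand-in object; a trained estimator would be pickled the same way.
demo_model = {"weights": [0.1, 0.2, 0.3]}
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "model.pkl"
    saved = save_with_metadata(demo_model, artifact, model_version="1.0.0")
    restored, loaded_meta = load_verified(artifact)
```

The checksum catches accidental corruption or a mismatched file. It does not make an untrusted pickle safe to load, since unpickling can execute arbitrary code.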

Using Libraries (flask, fastapi, gunicorn, docker, joblib)

# Complete deployment pipeline with Flask, Docker, and Gunicorn

# 1. Save model using joblib (better for sklearn)
import joblib
joblib.dump(model, 'app/model.pkl')

# 2. Create Flask API (app.py)
"""
from flask import Flask, request, jsonify
import joblib
import numpy as np
import logging
from datetime import datetime

app = Flask(__name__)

# Load model at startup
model = joblib.load('model.pkl')

# Setup logging
logging.basicConfig(level=logging.INFO)

@app.route('/health', methods=['GET'])
def health():
    return jsonify({'status': 'healthy', 'timestamp': datetime.now().isoformat()})

@app.route('/predict', methods=['POST'])
def predict():
    try:
        # Get input data
        data = request.get_json()
        features = np.array(data['features']).reshape(1, -1)
        
        # Log request
        app.logger.info(f'Received prediction request: {len(features)} samples')
        
        # Make prediction
        prediction = model.predict(features)
        probability = model.predict_proba(features) if hasattr(model, 'predict_proba') else None
        
        # Build response
        response = {
            'prediction': prediction.tolist(),
            'model_version': '1.0.0',
            'timestamp': datetime.now().isoformat()
        }
        
        if probability is not None:
            response['probability'] = probability.max(axis=1).tolist()
        
        return jsonify(response)
        
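    except (KeyError, TypeError, ValueError) as e:
        # Missing or malformed input is a client error, not a server fault
        app.logger.warning(f'Bad request: {str(e)}')
        return jsonify({'error': str(e)}), 400
    except Exception as e:
        app.logger.error(f'Prediction error: {str(e)}')
        return jsonify({'error': str(e)}), 500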

@app.route('/predict/batch', methods=['POST'])
def predict_batch():
    try:
        data = request.get_json()
        features = np.array(data['features'])
        
        # Batch prediction
        predictions = model.predict(features)
        
        return jsonify({
            'predictions': predictions.tolist(),
            'count': len(predictions),
            'timestamp': datetime.now().isoformat()
        })
        
    except Exception as e:
        return jsonify({'error': str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
"""

# 3. requirements.txt
"""
flask==2.3.3
gunicorn==21.2.0
joblib==1.3.2
numpy==1.24.3
scikit-learn==1.3.0
"""

# 4. Dockerfile
"""
FROM python:3.9-slim

WORKDIR /app

# Copy requirements
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model and app
COPY app/ .

# Expose port
EXPOSE 5000

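# Health check (python:3.9-slim does not ship curl, so use the standard library)
HEALTHCHECK --interval=30s --timeout=30s --start-period=5s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:5000/health')" || exit 1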

# Run with gunicorn
CMD ["gunicorn", "-w", "4", "-b", "0.0.0.0:5000", "app:app"]
"""

# 5. Build and run Docker container
"""
# Build image
docker build -t ml-model:v1 .

# Run container
docker run -p 5000:5000 ml-model:v1

# Test API
curl -X POST http://localhost:5000/predict \
  -H 'Content-Type: application/json' \
  -d '{"features": [5.1, 3.5, 1.4, 0.2]}'
"""

# 6. Production deployment with Docker Compose
"""
# docker-compose.yml
version: '3.8'

services:
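  ml-api:
    build: .
    # No host port mapping here: with `docker compose up`, three replicas
    # cannot all bind host port 5000. nginx reaches them on the internal network.
    expose:
      - '5000'
    environment:
      - MODEL_VERSION=1.0.0
      - LOG_LEVEL=INFO
    deploy:
      replicas: 3
      resources:
        limits:
          cpus: '2'
          memory: 4G
    healthcheck:
      # python:3.9-slim has no curl, so probe with the standard library
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:5000/health')"]
      interval: 30s
      timeout: 10s
      retries: 3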
      
  nginx:
    image: nginx:alpine
    ports:
      - '80:80'
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - ml-api
"""

# 7. Using FastAPI (modern alternative to Flask)
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import joblib
import numpy as np
from typing import List, Optional

app = FastAPI(title='ML Model API', version='1.0.0')

# Load model once at startup
model = joblib.load('model.pkl')

class PredictionRequest(BaseModel):
    features: List[float]

class PredictionResponse(BaseModel):
    prediction: List
    # Optional so that models without predict_proba can return None
    probability: Optional[float] = None
    model_version: str

@app.get('/health')
async def health():
    return {'status': 'healthy'}

@app.post('/predict', response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    try:
        features = np.array(request.features).reshape(1, -1)
        prediction = model.predict(features)
        # Cast the NumPy scalar to a plain float so it serializes cleanly
        prob = float(model.predict_proba(features).max()) if hasattr(model, 'predict_proba') else None
        
        return PredictionResponse(
            prediction=prediction.tolist(),
            probability=prob,
            model_version='1.0.0'
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

# Run: uvicorn app:app --host 0.0.0.0 --port 5000

# 8. Model optimization with ONNX
import onnx
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# Convert to ONNX for faster inference
initial_type = [('float_input', FloatTensorType([None, 4]))]
onnx_model = convert_sklearn(model, initial_types=initial_type)

with open('model.onnx', 'wb') as f:
    f.write(onnx_model.SerializeToString())

# Load with ONNX Runtime for optimized inference
import onnxruntime as ort
session = ort.InferenceSession('model.onnx')
input_name = session.get_inputs()[0].name
preds = session.run(None, {input_name: X_test.astype(np.float32)})

When to Use

✅ Appropriate Use Cases:

Models need to serve predictions to other applications
Batch processing jobs require scheduled inference
Real-time predictions needed for user-facing features
Model updates must be deployed without downtime
Multiple teams need access to the same model
Edge deployment for offline or low-latency scenarios
A/B testing different model versions
Integration with existing microservices architecture
Regulatory requirements mandate versioned model serving
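
For the A/B testing case in the list above, one common approach is to assign each user to a model version deterministically, so the same user always sees the same variant. The sketch below hashes a user ID into a bucket. The function name `assign_variant`, the 10% treatment share, and the version labels are assumptions made for illustration.

```python
import hashlib


def assign_variant(user_id, treatment_share=0.1):
    """Deterministically map a user to 'control' or 'treatment'."""
    if not 0.0 <= treatment_share <= 1.0:
        raise ValueError("treatment_share must be between 0 and 1")
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64)
    bucket = int.from_bytes(digest[:8], "big")
    return "treatment" if bucket < treatment_share * 2**64 else "control"


# Hypothetical version labels for the two arms
MODEL_VERSIONS = {"control": "1.0.0", "treatment": "1.1.0"}

variant = assign_variant("user-42")
print(variant, MODEL_VERSIONS[variant])
```

Hashing avoids storing an assignment table, and raising `treatment_share` keeps users who were already in the treatment arm there, because their bucket value does not change.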

❌ Avoid When:

Research and experimentation only
Single user ad-hoc analysis
When managed services suffice (SageMaker, Vertex AI)
Simple notebooks for one-time predictions
When team lacks DevOps/infrastructure expertise
Very low-scale prototypes (<10 requests/day)
When model changes too frequently for versioning

Common Pitfalls

Not versioning the model artifact separately from code
Loading model on every request (performance hit)
Not handling model input validation
Ignoring memory leaks in long-running services
Not setting up health checks and monitoring
Hardcoding model paths instead of configuration
Forgetting to pin dependency versions
Not testing the containerized version locally
No logging for debugging production issues
Not handling batch vs single prediction consistently
Ignoring cold start latency
Not setting up proper error responses
Deploying without load testing
Forgetting to handle model input/output serialization
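
Several of the pitfalls above (loading the model on every request, skipping input validation, hardcoded paths, and unclear error responses) can be addressed together. The sketch below is framework-agnostic and uses a stub in place of a real model so that it runs on its own. `StubModel`, `get_model`, `validate_features`, `handle_predict`, and the `MODEL_PATH` environment variable are hypothetical names, and the four-feature shape simply mirrors the example request used earlier.

```python
import os
from functools import lru_cache

# Read the artifact location from the environment instead of hardcoding it
MODEL_PATH = os.environ.get("MODEL_PATH", "model.pkl")
N_FEATURES = 4  # assumed input width, matching the four-feature example request


class StubModel:
    """Stand-in for a real estimator so the sketch runs without a model file."""

    def predict(self, rows):
        return [sum(row) for row in rows]


@lru_cache(maxsize=1)
def get_model():
    """Load the model once per process; later calls reuse the cached object."""
    # A real service would deserialize MODEL_PATH here, e.g. with joblib.load.
    return StubModel()


def validate_features(payload, n_features=N_FEATURES):
    """Return a list of floats, or raise ValueError with a client-facing message."""
    if not isinstance(payload, dict) or "features" not in payload:
        raise ValueError("body must be a JSON object with a 'features' key")
    features = payload["features"]
    if not isinstance(features, list) or len(features) != n_features:
        raise ValueError(f"'features' must be a list of {n_features} numbers")
    for value in features:
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError("every feature must be a number")
    return [float(value) for value in features]


def handle_predict(payload):
    """Map a request body to (response, HTTP status), using 400 for bad input."""
    try:
        features = validate_features(payload)
    except ValueError as exc:
        return {"error": str(exc)}, 400
    prediction = get_model().predict([features])
    return {"prediction": prediction}, 200


print(handle_predict({"features": [5.1, 3.5, 1.4, 0.2]})[1])  # 200
print(handle_predict({"features": [5.1, "oops"]})[1])         # 400
```

A Flask or FastAPI route could call `handle_predict` with the parsed JSON body and pass the returned status code through unchanged.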