Deploying ML Models

Taking models from notebook to production.

Training a model is only half the battle. Deploying it, that is, making it available to real users in production, is where the real engineering happens. A model sitting in a Jupyter notebook helps nobody.

Deployment means turning your model into a service that applications can call reliably, quickly, and at scale.

Model Formats

Before deploying, you need to save your model in a format suitable for production:

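Pickle/joblib: Python's built-in serialization (pickle) and the joblib library built on top of it. Simple, but tied to Python and sensitive to library version changes. Never load a pickle file from an untrusted source: unpickling can execute arbitrary code.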

ONNX (Open Neural Network Exchange): Framework-agnostic format. Train in PyTorch, then run the exported model with ONNX Runtime from Python, C++, or another supported language. Great for interoperability.

TorchScript / SavedModel: Framework-specific optimized formats. PyTorch and TensorFlow respectively. Best performance within their ecosystems.

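TensorRT: NVIDIA's inference optimizer and runtime for its GPUs. It compiles a model into an engine tuned for specific hardware. Speedups of 2-5x over the unoptimized model are commonly reported, but the gain depends on the model, the precision used, and the GPU.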
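
As a minimal sketch of the pickle route and its main caveat, the snippet below saves a set of model parameters and refuses to unpickle a file whose SHA-256 digest does not match the one recorded at save time. The helper names `save_model` and `load_model` and the parameter dictionary are hypothetical, and the check only helps if the digest itself is stored somewhere you trust.

```python
import hashlib
import pickle
from pathlib import Path


def save_model(params: dict, path: Path) -> str:
    """Pickle the parameters to `path` and return the file's SHA-256 digest."""
    data = pickle.dumps(params, protocol=pickle.HIGHEST_PROTOCOL)
    path.write_bytes(data)
    return hashlib.sha256(data).hexdigest()


def load_model(path: Path, expected_sha256: str) -> dict:
    """Unpickle only if the file still matches the digest recorded at save time."""
    data = path.read_bytes()
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        # Unpickling can run arbitrary code, so stop before pickle.loads.
        raise ValueError(f"checksum mismatch for {path}; refusing to unpickle")
    return pickle.loads(data)
```

A real model object from scikit-learn or a similar library can be stored the same way, but it will only load in an environment with compatible library versions.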


    Model Export Flow
    ┌─────────────────────────────────────────────┐
    │                                             │
    │  Training Framework                         │
    │  ┌──────────────────┐                       │
    │  │ PyTorch / TF     │                       │
    │  │ Train model      │                       │
    │  └────────┬─────────┘                       │
    │           │                                 │
    │           ▼ Export                          │
    │  ┌──────────────────┐                       │
    │  │ .pt / .h5 / .pb  │  ← native format      │
    │  └────────┬─────────┘                       │
    │           │                                 │
    │           ▼ Convert (optional)              │
    │  ┌──────────────────┐                       │
    │  │ .onnx / .trt     │  ← optimized format   │
    │  └──────────────────┘                       │
    └─────────────────────────────────────────────┘
    

Flask API

Flask is the simplest way to turn a model into a web API. A few lines of Python and you have a prediction endpoint.


    Flask Deployment Architecture
    ┌─────────────────────────────────────────────┐
    │                                             │
    │  Client Request                             │
    │       │                                     │
    │       ▼                                     │
    │  ┌──────────┐    ┌──────────────┐           │
    │  │  Flask   │───▶│ Load Model   │           │
    │  │  Server  │    │ (once at     │           │
    │  │  :5000   │    │  startup)    │           │
    │  └──────────┘    └──────────────┘           │
    │       │                                     │
    │       ▼                                     │
    │  ┌─────────────┐                            │
    │  │ Process     │                            │
    │  │ Input       │                            │
    │  │ Preprocess  │                            │
    │  │ Predict     │                            │
    │  │ Postprocess │                            │
    │  └─────────────┘                            │
    │       │                                     │
    │       ▼                                     │
    │  JSON Response                              │
    │  {"prediction": "cat", "confidence": 0.95}  │
    └─────────────────────────────────────────────┘
    

Key considerations: load the model once at startup, not per request. Use JSON for input/output. Add input validation. Handle errors gracefully. Add logging for debugging.
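
The considerations above might look roughly like this in code. This is a sketch under assumptions, not a reference implementation: the three-weight logistic "model", the `cat`/`dog` labels, the `/predict` and `/health` routes, and the helper names are all made up for illustration. Validation lives in a plain function so it can be tested without a running server, and Flask is imported only inside `create_app`.

```python
import math
from typing import Any

# Hypothetical stand-in for a real model. It is created once at import time,
# not inside the request handler.
MODEL_WEIGHTS = [0.8, -0.4, 0.2]


def predict_score(features: list) -> float:
    """Toy logistic model; swap in your real model's predict call."""
    z = sum(w * x for w, x in zip(MODEL_WEIGHTS, features))
    # Numerically stable sigmoid (avoids overflow for large |z|).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def is_number(value: Any) -> bool:
    """True for finite ints and floats, excluding booleans."""
    return (isinstance(value, (int, float))
            and not isinstance(value, bool)
            and math.isfinite(value))


def handle_predict(payload: Any) -> tuple:
    """Validate a decoded JSON payload; return (response body, HTTP status)."""
    if not isinstance(payload, dict) or "features" not in payload:
        return {"error": "expected a JSON object with a 'features' list"}, 400
    features = payload["features"]
    if (not isinstance(features, list)
            or len(features) != len(MODEL_WEIGHTS)
            or not all(is_number(x) for x in features)):
        message = f"'features' must be a list of {len(MODEL_WEIGHTS)} numbers"
        return {"error": message}, 400
    score = predict_score(features)
    label = "cat" if score >= 0.5 else "dog"
    confidence = round(max(score, 1.0 - score), 4)
    return {"prediction": label, "confidence": confidence}, 200


def create_app():
    """Wire the handler into Flask (imported lazily, so tests don't need it)."""
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        # silent=True yields None instead of raising on malformed JSON.
        body, status = handle_predict(request.get_json(silent=True))
        app.logger.info("predict status=%s", status)
        return jsonify(body), status

    @app.route("/health")
    def health():
        return jsonify({"status": "ok"}), 200

    return app
```

To try it locally, call `create_app().run(port=5000)` and POST a body such as `{"features": [1.0, 0.0, 0.0]}` to `/predict`.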

Docker Containerization

Docker packages your model and all its dependencies into a container that runs identically everywhere. No more "it works on my machine" problems.


    Without Docker:           With Docker:
    ┌────────────────────────┬──────────────────┐
    │ Dev: Python 3.8        │ ┌──────────────┐ │
    │ Prod: Python 3.6       │ │ Container    │ │
    │ Result: BROKEN         │ │ ┌──────────┐ │ │
    │                        │ │ │ Your App │ │ │
    │ "Works on my machine!" │ │ │ Python   │ │ │
    │                        │ │ │ Libs     │ │ │
    │                        │ │ └──────────┘ │ │
    │                        │ └──────────────┘ │
    │                        │   Runs anywhere  │
    └────────────────────────┴──────────────────┘
    

A Dockerfile for ML is straightforward: start from a Python base image, install dependencies, copy your model and code, expose the port, and run the Flask server.
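
Those steps could translate into a Dockerfile along these lines. The file names `app.py`, `model.pkl`, and `requirements.txt` and the Python version are assumptions for illustration; adjust them to your project.

```dockerfile
# Assumed project layout: app.py (Flask server), model.pkl, requirements.txt.
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so this layer is reused when only code changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the model artifact and the serving code.
COPY model.pkl app.py ./

# Port the Flask server listens on.
EXPOSE 5000

CMD ["python", "app.py"]
```

Flask's development server listens on 127.0.0.1 by default, so for this to be reachable from outside the container, `app.py` would need to bind to `0.0.0.0`. For real traffic, a production WSGI server such as gunicorn is the usual replacement for the built-in server.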

Docker Compose helps when you need multiple services: model server, database, cache, monitoring. Define everything in one file and spin up the whole stack with one command.

Model Serving Alternatives

Flask is great for prototyping, but production often needs more:

FastAPI: Like Flask, but built around async request handling, which often makes it faster under concurrent load. It generates API documentation automatically and validates request data with Pydantic. Increasingly popular for ML APIs.

TensorFlow Serving: Purpose-built for TF models. Handles versioning, batching, and GPU optimization out of the box.

TorchServe: PyTorch's equivalent of TF Serving. Model versioning, multi-model serving, and metrics.

Triton Inference Server: NVIDIA's serving system for models from multiple frameworks, with an emphasis on GPU inference.

KServe: Kubernetes-native model serving. Scales automatically, supports multiple frameworks.


    Serving Options Comparison
    ┌──────────────────┬──────────────────────────┐
    │ Option           │ Best For                 │
    ├──────────────────┼──────────────────────────┤
    │ Flask            │ Prototyping, small scale │
    │ FastAPI          │ Production APIs          │
    │ TF Serving       │ TensorFlow at scale      │
    │ TorchServe       │ PyTorch at scale         │
    │ Triton           │ Multi-framework, GPU     │
    │ KServe           │ Kubernetes deployments   │
    └──────────────────┴──────────────────────────┘
    

Monitoring & Maintenance

Deployment isn't the finish line; it's the starting line. Models degrade over time as real-world data diverges from training data.

Data drift: The distribution of incoming data changes. Your model trained on 2023 data struggles with 2024 data. Monitor feature distributions and trigger retraining when drift exceeds thresholds.

Concept drift: The relationship between features and targets changes. Spam patterns evolve. Fraud techniques adapt. Your model needs to keep up.

Performance monitoring: Track prediction latency, throughput, and error rates in real time. Track accuracy metrics as ground-truth labels arrive, which is often with a delay. Set alerts for anomalies.
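
One common way to put a number on data drift for a single numeric feature is the Population Stability Index (PSI), which compares binned proportions between a reference sample and recent traffic. The sketch below is one possible implementation using equal-width bins; the function name, the bin count, and the 0.2 alert threshold are assumptions (thresholds around 0.2 to 0.25 are a frequently quoted rule of thumb, not a standard).

```python
import math
from typing import Sequence

DRIFT_THRESHOLD = 0.2  # assumed rule-of-thumb cutoff; tune for your data


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between two samples, with bin edges taken from the reference."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference feature is constant; PSI is undefined")
    width = (hi - lo) / bins

    def proportions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range fall into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def needs_retraining(reference: Sequence[float], current: Sequence[float]) -> bool:
    """Flag the feature when PSI exceeds the assumed threshold."""
    return population_stability_index(reference, current) > DRIFT_THRESHOLD
```

In practice this would run per feature on a schedule, with the reference sample drawn from the training data.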


    Production ML Pipeline
    ┌─────────────────────────────────────────────────┐
    │                                                 │
    │  Data     ──▶ Feature  ──▶ Model    ──▶ Serving │
    │  Pipeline     Store        Registry     API     │
    │                                                 │
    │  ┌─────────────────────────────────────────┐    │
    │  │            Monitoring Layer             │    │
    │  │  - Data drift detection                 │    │
    │  │  - Model performance tracking           │    │
    │  │  - Latency & error rate metrics         │    │
    │  │  - Alerting & retraining triggers       │    │
    │  └─────────────────────────────────────────┘    │
    └─────────────────────────────────────────────────┘
    

Scaling Considerations

When your model gets popular, you need to handle load:

Horizontal scaling: Run multiple instances behind a load balancer. Simple, effective, works with any model.

Batching: Group multiple predictions into one batch. GPUs are built for parallel work, so processing 32 images in one batch often takes little longer than processing a single image.

Caching: Store predictions for repeated inputs. If the same image comes in twice, return the cached result instantly.

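Model optimization: Quantize, prune, or distill your model for faster inference. Quantizing 32-bit floats to 8-bit integers shrinks the model by about 4x and often speeds up inference noticeably. Measure the accuracy drop on your own data before deciding it is acceptable.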
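
Batching and caching combine naturally in a thin wrapper around the model. The sketch below is illustrative only: `CachedBatchPredictor`, `chunks`, and the toy model are hypothetical names, inputs are assumed to arrive as raw bytes, and the cache is an unbounded dictionary that a real service would cap or move to an external store.

```python
import hashlib
from typing import Callable, Iterator, Sequence


def chunks(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


class CachedBatchPredictor:
    """Wraps a batch prediction function with deduplication and a cache."""

    def __init__(self, predict_batch: Callable[[list], list], batch_size: int = 32):
        self._predict_batch = predict_batch
        self._batch_size = batch_size
        self._cache: dict = {}  # unbounded here; cap it in a real service
        self.model_calls = 0    # number of batches actually sent to the model

    @staticmethod
    def _key(raw: bytes) -> str:
        """Cache key: SHA-256 of the raw input bytes."""
        return hashlib.sha256(raw).hexdigest()

    def predict(self, inputs: Sequence[bytes]) -> list:
        keys = [self._key(raw) for raw in inputs]
        # Collect inputs that are neither cached nor duplicated in this call.
        pending: dict = {}
        for key, raw in zip(keys, inputs):
            if key not in self._cache:
                pending.setdefault(key, raw)
        # Send the remaining inputs to the model in fixed-size batches.
        for chunk in chunks(list(pending), self._batch_size):
            outputs = self._predict_batch([pending[k] for k in chunk])
            self.model_calls += 1
            self._cache.update(zip(chunk, outputs))
        # Answer every input, in the original order, from the cache.
        return [self._cache[k] for k in keys]


def toy_predict_batch(batch: list) -> list:
    """Hypothetical model: 'predicts' the byte length of each input."""
    return [len(raw) for raw in batch]
```

Wrapping the toy model as `CachedBatchPredictor(toy_predict_batch, batch_size=2)` and asking for four inputs, two of them identical, results in two model calls instead of four, and repeating any of those inputs afterwards costs no model call at all.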

Deployment Checklist

Before going live, ensure you have:

Input validation (reject bad data gracefully), error handling (don't crash on unexpected inputs), logging (for debugging and auditing), health checks (know if the service is up), versioning (rollback if something breaks), and A/B testing capability (compare model versions).
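
For the A/B testing item, one simple approach is to assign each user to a model version by hashing a stable identifier, so the same user always sees the same version. This is a sketch under assumptions: the function name, the `stable` and `candidate` labels, and the 10% default share are all hypothetical.

```python
import hashlib


def choose_variant(user_id: str, candidate_share: float = 0.1) -> str:
    """Deterministically route a user to the 'stable' or 'candidate' model."""
    if not 0.0 <= candidate_share <= 1.0:
        raise ValueError("candidate_share must be between 0 and 1")
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "candidate" if bucket < candidate_share else "stable"
```

Setting `candidate_share` to 0 sends all traffic back to the stable version, which doubles as a quick rollback switch.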

Start simple with Flask + Docker, then optimize as needed. Over-engineering early is a common and expensive mistake.
