Deployment & Serving

Chapter 13: ML Systems & Production · Section 4 of 9

I avoided deployment for an embarrassingly long time. I could train models, tune hyperparameters, hit great metrics on hold-out sets. But every time someone asked "so how do we get this in front of users?" I'd mumble something about Flask and change the subject. My models lived in notebooks and died in notebooks. Finally the discomfort of watching other people ship the same models I'd built — faster and cheaper — became too much, and I made myself learn deployment properly. This chapter is that dive.

Deployment is the practice of taking a trained model out of your development environment and putting it somewhere that can serve predictions to real users, real systems, or real devices. It covers everything from converting your model into a portable format and wrapping it in a server, to squeezing it down to fit on a phone, to figuring out how to replace it with a better version without waking anyone up at 3 AM. The field has matured enormously in the last few years, with dedicated serving frameworks, hardware-aware compilers, and container orchestration systems designed specifically for ML workloads.

Before we start, a heads-up. We're going to touch on Docker, Kubernetes, gRPC, GPU memory management, and some infrastructure concepts that might sound intimidating. You don't need to know any of them beforehand. We'll add the concepts we need one at a time, with explanation.

This isn't a short journey, but I hope you'll be glad you came.

Contents

The restaurant that explains deployment

Getting the model out of Python — serialization

When does the cooking happen? — batch vs. real-time vs. streaming

The kitchen equipment — serving frameworks

Shrinking the recipe — optimization for production

Rest stop

Packing the kitchen into a box — containerization

Taking the kitchen to the customer — edge deployment

Opening more locations — scaling and auto-scaling

Speed of service — latency optimization

Changing the menu without closing the restaurant — deployment strategies

Resources and credits

The Restaurant That Explains Deployment

Imagine we're opening a small restaurant. We have one chef (our model), a kitchen (our server), and a menu of three dishes: pasta, salad, and soup. Customers walk in, order a dish, and the chef prepares it. That's model serving in its most basic form — a request comes in, a prediction goes out.

We'll keep coming back to this restaurant throughout. Every deployment concept maps to something a restaurant owner has to figure out: Do we prep food ahead of time or cook to order? How do we handle a dinner rush? What if we want to replace the chef without closing the restaurant? How do we open a food truck that serves the same menu in a parking lot?

For now, our restaurant has three dishes, one chef, and no customers yet. Let's start with the most basic problem: how do we write down the chef's recipes in a format that any kitchen can use?

Getting the Model Out of Python — Serialization

Here's a problem I didn't appreciate for a long time. Your trained model lives inside a Python process. It has weights, a computation graph, and a specific framework runtime (PyTorch, TensorFlow) keeping it alive. But your production server might not be Python. Your phone definitely isn't. Your browser isn't. To serve a model anywhere, you need to capture everything — the architecture, the weights, the computation steps — into a portable file. This process is called serialization, and the format you choose determines where your model can run, how fast it runs, and which serving tools you can use.

Think of it this way: your chef knows how to make pasta. But if you want to open a second restaurant in another city, you can't ship the chef. You need to write the recipe down precisely enough that a different chef — one who's never seen your kitchen — can reproduce the dish exactly. The format of that recipe book matters. Some formats only work in Italian kitchens. Others work everywhere but lose some nuance.

ONNX — The Universal Recipe Book

ONNX stands for Open Neural Network Exchange. It's a framework-agnostic format: you can export from PyTorch or TensorFlow, then run the model with ONNX Runtime, TensorRT, OpenVINO, or nearly any production serving framework. ONNX represents your model as a graph of operators — each operator being a mathematical operation like matrix multiply, convolution, or ReLU.

Let's export our restaurant's simple dish-classifier model to ONNX. The model takes an image of food and classifies it as pasta, salad, or soup.

import torch

model = load_trained_dish_classifier()  # Our 3-class model
model.eval()

# ONNX export needs a sample input to trace the computation graph
sample_image = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    sample_image,
    "dish_classifier.onnx",
    opset_version=17,
    input_names=["food_image"],
    output_names=["dish_scores"],
    dynamic_axes={"food_image": {0: "batch_size"}}
)

A few things are happening here that deserve attention. The opset_version is the version of the ONNX operator specification — think of it as the edition of the recipe book. Version 17 is broadly supported in 2024. Using an older opset limits which operations you can export; using a too-new opset might not work with your target runtime. The dynamic_axes parameter tells ONNX that the batch dimension can change — without it, the model would be locked to processing exactly one image at a time, which would be like a restaurant that can only serve one customer at a time regardless of demand.

Now we can serve this model with ONNX Runtime, which applies graph optimizations — fusing operations, eliminating redundant computations, planning memory layouts — to squeeze out 2-3× the speed of raw PyTorch.

import onnxruntime as ort
import numpy as np

session = ort.InferenceSession(
    "dish_classifier.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

food_photo = preprocess_image("customer_photo.jpg")
scores = session.run(None, {"food_image": food_photo})
# scores[0] → array([[0.85, 0.10, 0.05]])  → probably pasta

I'll be honest — ONNX export doesn't always go smoothly. Some custom PyTorch operations don't have ONNX equivalents. Dynamic control flow (if-statements that depend on tensor values) can trip up the tracer. When export fails, the error messages are cryptic at best. I still run onnx.checker.check_model and then onnx-simplifier on every export as a sanity check, and I recommend you do the same.

TorchScript — Staying in the PyTorch Family

If your entire serving stack is PyTorch, TorchScript is PyTorch's own serialization format. It comes in two flavors. torch.jit.trace records the operations that happen during one forward pass with a sample input — it's fast and works well for models with no if-statements or loops that change based on the input. torch.jit.script actually parses your Python code and converts it into a statically typed representation — it handles control flow, but not all Python constructs survive the translation.
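
Here's a minimal sketch of both flavors, reusing the assumed load_trained_dish_classifier helper from earlier; scripting only succeeds if the model's forward uses constructs TorchScript understands.

import torch

model = load_trained_dish_classifier()
model.eval()

# Tracing: record the operations executed for one sample input
sample = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, sample)
traced.save("dish_classifier_traced.pt")

# Scripting: compile the Python source itself, preserving control flow
scripted = torch.jit.script(model)
scripted.save("dish_classifier_scripted.pt")

# Either artifact loads without the original Python class definition
restored = torch.jit.load("dish_classifier_traced.pt")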

More recently, PyTorch introduced torch.export and torch.compile, which are gradually replacing TorchScript for many use cases. torch.compile is particularly notable because you can often get a 1.5-3× speedup by adding a single line of code to your existing model, with no format conversion at all.
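
That single line looks like this, assuming PyTorch 2.x; the first call is slow because it triggers compilation, and later calls reuse the optimized kernels.

import torch

model = load_trained_dish_classifier().eval()

# Wrap the model; no export, no format conversion
compiled_model = torch.compile(model)

with torch.no_grad():
    scores = compiled_model(torch.randn(1, 3, 224, 224))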

TensorFlow SavedModel — The TF Ecosystem's Format

If your model is built in TensorFlow or Keras, SavedModel is your format. It bundles the computation graph, weights, and signatures into a directory structure that TF Serving, TFLite, and TensorFlow.js all know how to consume. There's no compelling reason to use it outside the TensorFlow ecosystem, but within it, the integration is seamless.

The limitation of all these formats: they freeze your model at a point in time. The recipe is written down, but it can't adapt. If you want to change anything — update the architecture, add a new output — you need to re-export. This feels obvious, but I've seen teams debug production issues for hours before realizing they were serving a stale export.

When Does the Cooking Happen? — Batch vs. Real-Time vs. Streaming

Back to our restaurant. There are fundamentally different ways to serve food, and the same is true for model predictions. The distinction comes down to one question: when does the model run relative to when someone needs the answer?

Batch — The Buffet

A buffet prepares all the food before any customer arrives. When someone shows up, they grab a plate from what's already been cooked. In ML terms, you run your model over all known inputs ahead of time, store the predictions in a database, and look them up when a user makes a request. No model server needed at serving time — the serving layer is a database read.

import json
import redis

def run_nightly_scoring():
    """Score every known user, cache results for tomorrow's traffic."""
    model = load_dish_recommender()
    users = load_all_user_profiles()

    cache = redis.Redis(host="redis-server", port=6379)
    pipe = cache.pipeline()

    for user in users:
        recommendation = model.predict(user.features)
        # Redis values must be str/bytes; assume predict() returns
        # something JSON-serializable (a list of item ids, say)
        pipe.set(f"recs:{user.id}", json.dumps(recommendation), ex=86400)

    pipe.execute()  # Write all predictions at once

The beauty of batch is its low operational burden. There's no model server to keep running, no GPU to keep warm, no tail latency to worry about. A nightly cron job, a database, done. You'd be surprised how far this stretches. Precompute recommendations for your top 80% of active users; fall back to popular items for the rest.

The buffet analogy reveals the limitation too. What if a customer wants something that wasn't prepped? A brand-new user has no precomputed recommendation. Fraud detection can't wait for tonight's batch — the fraudster is swiping the card right now. When freshness matters, the buffet fails.

Real-Time — The Made-to-Order Kitchen

A customer orders. The chef cooks. The dish goes out. In ML terms, a request hits an endpoint, the model runs inference on that specific input, and the response comes back — all within a latency budget, typically under 100 milliseconds. This is the hardest paradigm to operate. It demands a live model server, GPU management, health monitoring, and auto-scaling. But when predictions need to reflect the most recent context — "is this particular credit card transaction, right now, fraudulent?" — it's the only option.

Streaming — The Conveyor Belt Sushi Bar

In a conveyor belt sushi restaurant, the kitchen continuously prepares dishes and sends them down the belt. Nobody is standing at the counter waiting for their specific plate. The food flows continuously, and diners grab what's relevant.

Streaming inference works the same way. Events flow in continuously — from Kafka topics, IoT sensors, clickstreams — and the model processes each one as it arrives, emitting predictions downstream. No client sits waiting for a response. The system is event-driven.

from kafka import KafkaConsumer, KafkaProducer
import json

consumer = KafkaConsumer(
    "order-events",
    bootstrap_servers=["kafka:9092"],
    value_deserializer=lambda m: json.loads(m)
)
producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],
    value_serializer=lambda v: json.dumps(v).encode()
)
model = load_fraud_model()

for message in consumer:
    event = message.value
    features = extract_features(event)
    score = model.predict(features)
    producer.send("fraud-scores", value={
        "order_id": event["id"],
        "score": float(score)
    })

The critical distinction: real-time is request-response (the customer is standing at the counter waiting). Streaming is fire-and-forget (the sushi goes on the belt whether anyone grabs it or not). Streaming fits continuous monitoring, anomaly detection, and anything where you maintain running state over time windows.

One paradigm I should mention before we move on: serverless inference (AWS Lambda, Google Cloud Functions). You upload your model and code, the cloud handles scaling. You pay per invocation. It's good for bursty, low-traffic workloads. It's terrible for anything latency-sensitive, because when the function hasn't been called recently, the first request triggers a cold start — loading the model from scratch — which can add 2 to 30 seconds of delay. That's an eternity in most serving contexts.

The Kitchen Equipment — Serving Frameworks

So we need a real-time kitchen. Our first instinct is to wrap the model in a web framework we already know. FastAPI, Flask, something familiar.

from fastapi import FastAPI
from pydantic import BaseModel
import numpy as np

app = FastAPI()
model = load_dish_classifier()

class DishRequest(BaseModel):
    image_url: str

@app.post("/predict")
async def predict(req: DishRequest):
    image = download_and_preprocess(req.image_url)
    scores = model.predict(image)
    dish = ["pasta", "salad", "soup"][np.argmax(scores)]
    return {"dish": dish, "confidence": float(np.max(scores))}

@app.get("/health")
async def health():
    return {"status": "ok"}

For a demo, an internal tool, or a model that gets ten requests per minute, this works. But as our restaurant grows, we'll run into problems that a web framework was never designed to solve.

The first problem is GPU utilization. A GPU is like a massive griddle — it can cook 32 burgers at once, but cooking one burger uses the same amount of energy and nearly the same amount of time. When requests arrive one at a time and we process them individually, we're heating the entire griddle for a single patty. What we need is dynamic batching: collecting requests that arrive within a short window (say 5 milliseconds), bundling them together, and running inference on the whole batch. Throughput goes way up. Per-request cost goes way down.

The second problem is multi-model management. Our restaurant now serves breakfast, lunch, and dinner — three different menus, meaning three different models. We need to load and unload them, version them, route traffic to the right one, and monitor each independently. A web framework gives us none of this.

The third problem is GPU memory management. Models are big. A single transformer can eat 4 GB of GPU memory. Two models on one GPU requires careful coordination. Loading a new model version while still serving the old one requires even more care.

This is why dedicated model-serving frameworks exist. They handle batching, versioning, GPU scheduling, health checks, and protocol support (both REST and gRPC) out of the box. Let's look at the main ones.

Triton Inference Server — The Industrial Kitchen

NVIDIA's Triton is the Swiss Army knife. It serves models from any framework — PyTorch, TensorFlow, ONNX, TensorRT, even plain Python — on the same GPU, simultaneously. Its killer feature is concurrent model execution. If your NLP model uses 40% of a GPU, Triton schedules a vision model on the remaining 60%. In multi-model deployments, this can cut your GPU bill in half.

I'll be honest — Triton's configuration is verbose. Each model lives in a directory with a config.pbtxt file specifying input/output shapes, batching parameters, and scheduling policy. It's not something you set up in an afternoon the first time. But once configured, it's the most capable serving infrastructure I've worked with.
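
To make "verbose" concrete, here is roughly what a minimal config.pbtxt for our ONNX dish classifier could look like; the field values are illustrative, and the Triton model configuration reference is the authority on the full schema.

name: "dish_classifier"
platform: "onnxruntime_onnx"
max_batch_size: 32

input [
  {
    name: "food_image"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "dish_scores"
    data_type: TYPE_FP32
    dims: [ 3 ]
  }
]

dynamic_batching {
  max_queue_delay_microseconds: 5000
}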

TorchServe — The PyTorch Kitchen

If your stack is all PyTorch, TorchServe provides deep integration. You write a handler class that defines three steps — preprocess, inference, postprocess — and package everything into a .mar archive.

from ts.torch_handler.base_handler import BaseHandler
from torchvision import transforms
import torch

class DishClassifierHandler(BaseHandler):
    def initialize(self, context):
        super().initialize(context)  # BaseHandler loads the model as self.model
        self.transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
        ])

    def preprocess(self, requests):
        images = []
        for req in requests:
            img = decode_image(req["body"])
            images.append(self.transform(img))
        return torch.stack(images)

    def inference(self, batch):
        with torch.no_grad():
            return self.model(batch)

    def postprocess(self, outputs):
        probs = torch.softmax(outputs, dim=1)
        dishes = ["pasta", "salad", "soup"]
        return [
            {dishes[i]: float(p[i]) for i in range(3)}
            for p in probs
        ]

TorchServe handles batching, model versioning, and health monitoring. The tradeoff compared to Triton: it only serves PyTorch models, and it doesn't do concurrent multi-model GPU sharing as effectively.

TF Serving, BentoML, and vLLM

TF Serving is TensorFlow's production server. Mature, battle-tested, tight integration with the TF ecosystem. If your models are TensorFlow, it's a reliable choice, though it only serves TensorFlow models.

BentoML takes a different approach: framework-agnostic. You define your serving logic in a Python class, and BentoML builds a production container from it — complete with batching, API routes, and dependency management. The appeal is that you write Python, and it handles the infrastructure plumbing.

vLLM is purpose-built for large language models. Its core innovation is PagedAttention, which manages the GPU memory used by the key-value cache during text generation the way an operating system manages virtual memory — in pages that can be allocated, freed, and shared. Combined with continuous batching (starting new requests as soon as a slot opens, rather than waiting for a whole batch to finish), vLLM achieves 2-4× the throughput of naive LLM serving. If you're serving LLMs, it's the tool to reach for.
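
For a sense of the API, here is a minimal offline-generation sketch; the model id is just a placeholder, and vLLM also ships an OpenAI-compatible HTTP server for online serving.

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model id
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(
    ["Describe the perfect bowl of pasta in one sentence."],
    params
)
print(outputs[0].outputs[0].text)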

A Quick Note on Protocols

Most serving frameworks expose two interfaces: REST (JSON over HTTP) and gRPC (Protocol Buffers over HTTP/2). REST is simple and debuggable — you can test it with curl. gRPC serializes data in a compact binary format and is 2-5× faster for high-throughput internal services. Many production systems use both: gRPC between microservices, REST for external APIs, with a gateway translating between them.

Shrinking the Recipe — Optimization for Production

Our restaurant is running, but the head chef (our model) is expensive. A transformer at 200ms per prediction, serving 100 requests per second, needs roughly 20 GPUs running flat out (call it 30-35 once you provision headroom for traffic peaks); at $2/hour per GPU, that's on the order of $50,000 per month. Your VP of engineering wants that to be $5,000. Now we need to make the recipe faster, cheaper, and smaller — without ruining the dish.

Every optimization technique trades some accuracy for speed or size. The art is finding the sweet spot where you hit your latency and cost targets while staying within an acceptable accuracy budget. Let's walk through the three major techniques.

Quantization — Using Less Precise Measurements

Imagine our chef is measuring ingredients with a precision scale that shows four decimal places. Does it matter whether you use 2.3847 grams of salt or 2.4 grams? The dish tastes the same. Quantization is the same idea applied to model weights: reduce the numerical precision from 32-bit floating point (FP32, 4 bytes per weight) to 16-bit (FP16, 2 bytes), 8-bit integers (INT8, 1 byte), or even 4-bit (INT4, half a byte).

Why does this work? Because a weight of 0.23847291 and a weight of 0.23828125 produce essentially the same prediction. The network learned patterns and relationships, not decimal places.

There are two approaches. Post-Training Quantization (PTQ) takes a trained model, converts the weights to lower precision, and calls it done. It needs a small calibration dataset — a few hundred samples — to determine the best mapping from floating point to integer ranges. No retraining required.

import torch
from torch.quantization import quantize_dynamic

model_fp32 = load_dish_classifier()

# Dynamic quantization: convert Linear layers to INT8
model_int8 = quantize_dynamic(
    model_fp32,
    {torch.nn.Linear},
    dtype=torch.qint8
)

The result: the model is roughly 4× smaller. On CPU, inference is 2-4× faster. On GPU, FP16 is effectively free (every modern GPU has FP16 hardware), and INT8 support is broad across both CPUs and recent GPUs.

The other approach is Quantization-Aware Training (QAT). Instead of quantizing after training, QAT simulates quantization during training by inserting fake quantization nodes into the forward pass. The model learns to tolerate the precision loss. It's more work, but it recovers 50-100% of the accuracy that PTQ drops. For models where that last fraction of a percent matters — medical imaging, autonomous driving — QAT is worth the effort.
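
Here's a rough sketch of the eager-mode QAT workflow in PyTorch, with the fine-tuning loop elided; real models typically also need layer fusion before this step, so treat it as the shape of the process rather than a recipe.

import torch
from torch.ao.quantization import get_default_qat_qconfig, prepare_qat, convert

model = load_dish_classifier()
model.train()

# Attach fake-quantization observers for weights and activations
model.qconfig = get_default_qat_qconfig("fbgemm")  # x86 server backend
model_prepared = prepare_qat(model)

# ... fine-tune model_prepared for a few epochs as usual ...

# Convert the fake-quantized model into a real INT8 model for inference
model_int8 = convert(model_prepared.eval())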

I'm still developing my intuition for when PTQ is "good enough" versus when QAT is worth the extra training time. As a rough guide: for non-LLM models, INT8 PTQ is almost free accuracy-wise and is a worthwhile first step on any deployment. For large language models, specialized quantization methods like GPTQ (which uses Hessian information to protect the most important weights) and AWQ (which protects the 1% of weights that handle salient activations) enable 4-bit quantization with minimal quality loss, cutting serving costs 4-8×.

Pruning — Removing Unnecessary Ingredients

Research on the lottery ticket hypothesis revealed something surprising: large neural networks are massively over-parameterized. Inside a big network, there's a much smaller network that would have performed nearly as well if you'd trained it from scratch. Pruning finds and removes the weights that don't contribute meaningfully to the output.

There are two flavors, and the distinction matters more than it might seem. Unstructured pruning zeroes out individual weights wherever they appear. You might prune 90% of weights this way. The problem: the resulting weight matrices are sparse (mostly zeros), and standard GPU hardware doesn't know how to exploit that sparsity. The matrix is still the same shape; it's full of zeros that still get multiplied. You save storage, but not compute time — unless you have specialized sparse-compute hardware.

Structured pruning removes entire neurons, channels, or attention heads. The compression ratio is lower — maybe 30-50% rather than 90% — but the result is a regular, dense model that runs faster on any hardware. No special sparse libraries needed. This is usually what you want for deployment.

import torch.nn.utils.prune as prune

# Remove 30% of channels from every Conv2d layer
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.ln_structured(
            module, name="weight",
            amount=0.3, n=2, dim=0
        )

# Make the pruning permanent (remove the mask, bake zeros into weights)
for name, module in model.named_modules():
    if hasattr(module, "weight_orig"):
        prune.remove(module, "weight")

The most effective recipe is iterative: prune 10-20%, retrain for a few epochs to let the model recover, prune again. Each cycle, the model adapts. This reaches much higher compression than one-shot pruning.
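
A sketch of that loop, reusing the structured-pruning call from above; fine_tune here is a stand-in for whatever training loop you already have.

import torch
import torch.nn.utils.prune as prune

def prune_step(model, amount=0.15):
    # Remove a slice of output channels from every Conv2d layer
    for module in model.modules():
        if isinstance(module, torch.nn.Conv2d):
            prune.ln_structured(module, name="weight",
                                amount=amount, n=2, dim=0)

for round_idx in range(4):
    prune_step(model, amount=0.15)
    fine_tune(model, epochs=2)  # hypothetical helper: your normal training loop

# Bake the masks into the weights once you're done iterating
for module in model.modules():
    if hasattr(module, "weight_orig"):
        prune.remove(module, "weight")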

A trap I've fallen into: getting excited about "90% sparse!" numbers without benchmarking actual inference time. NVIDIA's Ampere and Hopper GPUs support 2:4 structured sparsity (every group of 4 weights has at most 2 non-zeros) with dedicated hardware. Other sparsity patterns need software support that may or may not exist for your setup. Always measure wall-clock time, not parameter count.

Distillation — Teaching a Smaller Chef

Here's a different approach. Instead of shrinking the existing chef, we hire a new, cheaper chef and train them by watching the expensive one cook.

Knowledge distillation trains a small "student" model to mimic a large "teacher" model. The key insight is what the student learns from. Normally, a model trains on hard labels: this image is pasta (1), not salad (0), not soup (0). But the teacher's output is richer. The teacher might say: this is probably pasta (0.85), but it has some salad-like qualities (0.12), and a tiny bit of soup resemblance (0.03). Those soft probabilities carry inter-class relationships that hard labels completely miss.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, true_labels,
                      temperature=4.0, alpha=0.7):
    # Soften both distributions with temperature
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence between soft distributions
    distill_loss = F.kl_div(soft_student, soft_teacher,
                            reduction="batchmean")
    distill_loss *= temperature ** 2  # Scale gradient magnitude

    # Standard cross-entropy with hard labels
    hard_loss = F.cross_entropy(student_logits, true_labels)

    # Blend: mostly learn from teacher, some from ground truth
    return alpha * distill_loss + (1 - alpha) * hard_loss

The temperature parameter controls how soft the probabilities become. At temperature 1, they're the teacher's raw output. At temperature 4, the distribution is flattened: the small probabilities the teacher assigns to the non-target classes grow relative to the top class, which makes those inter-class relationships easier for the student to learn from. The alpha parameter balances learning from the teacher versus learning from the ground truth labels.

The results are remarkable. DistilBERT is 40% smaller and 60% faster than BERT, while retaining 97% of its language understanding. TinyBERT achieves 7.5× compression through a more aggressive two-stage distillation that also matches the teacher's intermediate representations. These aren't research curiosities — they're running in production at companies processing millions of requests per day.

Stacking It All Together — Compilation

One more lever: graph compilation. Tools like TensorRT, ONNX Runtime, and torch.compile analyze your model's computation graph and fuse operations. A sequence of MatMul → Add → ReLU becomes a single fused kernel that reads memory once and writes once, instead of three separate operations that each read and write independently. Since memory bandwidth is typically the inference bottleneck, this matters enormously.
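
With ONNX Runtime, for example, the optimization level is a session option. A small sketch, reusing the export from earlier:

import onnxruntime as ort

opts = ort.SessionOptions()
# Enable all graph optimizations, including operator fusion
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Optionally write the optimized graph to disk to inspect what got fused
opts.optimized_model_filepath = "dish_classifier_optimized.onnx"

session = ort.InferenceSession(
    "dish_classifier.onnx",
    sess_options=opts,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)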

The real power comes from stacking techniques. Distill the model → quantize to INT8 → compile with TensorRT. A model that started at 2 GB and 200ms per inference can end up at 50 MB and 5ms. That's a 40× improvement on both size and speed. Each technique is roughly multiplicative with the others, which is why production ML teams routinely apply all three.

Rest Stop

Congratulations on making it this far. You can stop if you want.

At this point you have a useful mental model for deployment. You know that models need to be serialized into a portable format (ONNX being the most universal). You know the three fundamental serving paradigms — buffet (batch), made-to-order (real-time), and conveyor belt (streaming). You know that dedicated serving frameworks like Triton and TorchServe exist because web frameworks can't handle GPU batching or multi-model scheduling. And you know the three main tricks for making models smaller and faster: quantization (less precise numbers), pruning (fewer weights), and distillation (smaller model that mimics a larger one).

That's a solid foundation. If someone in a meeting says "we need to deploy this model with canary rollout on Triton with INT8 quantization," you can follow the conversation and ask the right questions.

But there's more to the story. We haven't talked about how to package your model so it runs the same way on every machine (containerization). We haven't talked about putting models on phones and IoT devices (edge deployment). We haven't talked about what happens when traffic spikes 10× (auto-scaling). And we haven't talked about how to replace a live model without anyone noticing (deployment strategies).

If the discomfort of not knowing what's underneath is nagging at you, read on.

Packing the Kitchen Into a Box — Containerization

Our restaurant metaphor reaches its most literal form here. Imagine shipping your entire kitchen — stove, ingredients, recipe book, the specific brand of olive oil the chef insists on — in a standardized shipping container. You drop it anywhere in the world, open the doors, and the kitchen works exactly as it did at home.

That's Docker for ML. The problem it solves is the dependency nightmare. Your model needs a specific version of PyTorch (say 2.2.1), a specific CUDA version (12.1), cuDNN (8.9), Python 3.11, and forty other packages in exact versions. If any one of these doesn't match between your development machine and production, things break in mysterious ways. "It works on my machine" is not a deployment strategy.

FROM python:3.11-slim AS base

RUN apt-get update && apt-get install -y --no-install-recommends \
    libgomp1 curl && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY models/ /app/models/
COPY src/ /app/src/
WORKDIR /app
EXPOSE 8080

HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD curl -f http://localhost:8080/health || exit 1

CMD ["uvicorn", "src.server:app", "--host", "0.0.0.0", "--port", "8080"]

There's a subtle but important ordering in this Dockerfile. Notice that requirements.txt is copied and installed before the application code. Docker caches each layer. If you change your Python code but not your dependencies, the package installation layer is cached and skipped on rebuild. This turns a 10-minute rebuild into a 30-second one. Putting application code first would invalidate the cache every time and reinstall everything from scratch.

For GPU models, you'll start from an nvidia/cuda base image instead of python:3.11-slim. These images are big — often 8 GB or more — because they include CUDA libraries, cuDNN, and GPU drivers. Pin exact versions everywhere. torch==2.2.1, not torch>=2.0. When a Tuesday morning rebuild grabs a new minor version that's subtly incompatible with your CUDA driver, you'll be grateful for the pin.

Once your kitchen is in a container, Kubernetes is the system that decides how many copies of that container to run, restarts ones that crash, and routes traffic to healthy ones. We'll come back to this in the scaling section.

Taking the Kitchen to the Customer — Edge Deployment

Sometimes you can't send the customer's order to a central kitchen. The customer is in a car with spotty cell service. Or the customer is a security camera that needs to detect intruders in 5 milliseconds, and a 200-millisecond round trip to the cloud means the intruder is already inside. Or the customer is a hospital, and sending patient images to an external server violates privacy regulations.

Edge deployment means running the model on the device itself — a phone, a camera, a car, an industrial sensor. Three forces drive it: latency (5ms on-device versus 50-200ms round-trip to the cloud), privacy (data never leaves the device), and offline capability (cars and drones can't depend on WiFi).

The challenge is that edge devices are constrained. A phone's neural processing unit has a fraction of a cloud GPU's power. RAM is measured in gigabytes, not hundreds of gigabytes. Battery matters. So the models need to be small and fast, which is where all the optimization techniques we covered — quantization, pruning, distillation — become essential rather than optional.

Each platform has its preferred format. TFLite (TensorFlow Lite) is the go-to for Android, with deep integration into the NNAPI hardware acceleration layer. Core ML dominates iOS, where Apple's Neural Engine provides remarkably low latency for models that fit its constraints. ONNX Runtime Mobile is the cross-platform option — it can use NNAPI on Android and Core ML on iOS as backend delegates.

import tensorflow as tf

# Convert our dish classifier to TFLite with INT8 quantization
converter = tf.lite.TFLiteConverter.from_saved_model("dish_classifier/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Provide representative data so the converter can calibrate quantization
def representative_dataset():
    for image in calibration_images[:100]:
        yield [image.numpy()]

converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS_INT8
]
tflite_model = converter.convert()

with open("dish_classifier.tflite", "wb") as f:
    f.write(tflite_model)

In practice, most edge deployments are hybrid. A small, fast model on the device handles the common cases — your keyboard's next-word prediction, your camera's face detection. Hard or unusual cases get routed to a more powerful cloud model. The device model acts as a first pass: fast for the 95% of inputs it can handle confidently, with a fallback for the rest.
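
A sketch of that first-pass routing, with an arbitrary confidence threshold and both model handles hypothetical:

import numpy as np

CONFIDENCE_THRESHOLD = 0.9  # tune on a validation set

def classify_dish(image):
    # First pass: the small on-device model
    scores = on_device_model.predict(image)  # hypothetical local model
    if np.max(scores) >= CONFIDENCE_THRESHOLD:
        return ["pasta", "salad", "soup"][int(np.argmax(scores))]
    # Low confidence: fall back to the bigger cloud model
    return cloud_classify(image)  # hypothetical remote call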

I haven't figured out a great way to convey how different the debugging experience is on-device versus in the cloud. In the cloud, you can SSH in, attach a debugger, check GPU utilization. On a phone, your model is a black box inside an app. Logging is limited. Profiling is awkward. Test on real devices early and often — the simulator lies about performance characteristics.

Opening More Locations — Scaling and Auto-Scaling

Our restaurant is popular. At noon, there's a line out the door. At 2 AM, the dining room is empty. We need to scale up during rush hour and scale back down to avoid paying staff to stand around.

Vertical scaling means hiring a faster chef (bigger GPU). It's the simplest approach, but there's a ceiling — GPUs only get so big. Horizontal scaling means opening more kitchens — running more replicas of your model server behind a load balancer. This is how production systems scale, and Kubernetes makes it possible with the Horizontal Pod Autoscaler (HPA).

Standard HPA scales on CPU or memory utilization. For ML serving, that's usually the wrong metric. What you care about is request latency or queue depth. If latency is creeping above your target — say, p95 latency exceeds 100ms — that's the signal to add replicas. Kubernetes supports custom metrics for this, often piped in from Prometheus.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: dish-classifier-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: dish-classifier
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_latency_p95_ms
      target:
        type: AverageValue
        averageValue: "80"

For GPU workloads, there's an additional nuance: GPU sharing. GPUs are expensive, and a single model often doesn't saturate one. Triton's concurrent model execution helps here — running multiple models on one GPU, as we discussed. Another approach is MIG (Multi-Instance GPU) on NVIDIA A100/H100 hardware, which physically partitions a single GPU into isolated slices, each with its own memory and compute. One physical GPU becomes, say, seven independent mini-GPUs.

Load Balancing for ML

Standard round-robin load balancing doesn't work well for ML. Not all requests are equally expensive. Classifying a 10-word text takes 10ms; processing a 10-page document takes 200ms. Round-robin might send three expensive requests to the same server while others sit idle.

Least-connections routing is better: send each new request to the server with the fewest active requests. It naturally adapts to varying request cost. For multi-model GPU deployments, you may also need model-aware routing — sending requests to the server that actually has the right model loaded in GPU memory, because loading a model on-the-fly adds seconds of latency.
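
The policy itself is simple. A toy sketch of model-aware, least-connections selection; real load balancers track the in-flight counts for you.

from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    active_requests: int
    loaded_models: set

def pick_replica(replicas, model_name):
    # Prefer replicas that already have the model loaded in GPU memory;
    # fall back to any replica if none does
    candidates = [r for r in replicas if model_name in r.loaded_models]
    pool = candidates if candidates else replicas
    # Least-connections: route to the replica with the fewest in-flight requests
    return min(pool, key=lambda r: r.active_requests)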

Speed of Service — Latency Optimization

A customer who waits too long leaves. In production serving, every millisecond costs you either users (higher latency → fewer conversions) or money (you buy bigger hardware to meet latency targets). Let's walk through the main levers, in roughly the order I'd try them.

Dynamic Batching

We've touched on this, but it deserves a closer look. A GPU processing 1 request takes, say, 15ms. Processing 32 requests takes 18ms. That's not a typo — the GPU's massive parallelism means batch processing is nearly free. The trick is that requests don't arrive in neat batches. They trickle in one at a time.

A dynamic batcher solves this by holding incoming requests in a queue for a short window — 5ms is typical — and bundling whatever has accumulated into a single batch. There's a tension: wait longer, get bigger batches, better throughput, but each individual request waits longer. Wait less, smaller batches, lower throughput, but faster per-request response. The max_wait_ms parameter is where you dial this tradeoff.

import asyncio
from collections import deque

class DynamicBatcher:
    def __init__(self, model, max_batch=32, max_wait_ms=5):
        self.model = model
        self.max_batch = max_batch
        self.max_wait_ms = max_wait_ms
        self.queue = deque()

    async def predict(self, input_data):
        """Each caller awaits their individual result."""
        loop = asyncio.get_running_loop()
        future = loop.create_future()
        self.queue.append((input_data, future))

        if len(self.queue) >= self.max_batch:
            self._flush()
        else:
            # Schedule a flush after max_wait_ms (no-op if already flushed)
            loop.call_later(self.max_wait_ms / 1000, self._flush)
        return await future

    def _flush(self):
        if not self.queue:
            return
        inputs, futures = [], []
        while self.queue and len(inputs) < self.max_batch:
            inp, fut = self.queue.popleft()
            inputs.append(inp)
            futures.append(fut)
        results = self.model.batch_predict(inputs)
        for fut, result in zip(futures, results):
            fut.set_result(result)

Warm-Up and Caching

Two quick wins that are easy to overlook. Model warm-up: the first inference after loading a model is always slow because the GPU needs to allocate memory, compile kernels, and populate caches. Run a few dummy predictions at startup. A few lines of defensive code prevent your first real user from experiencing 10× the normal latency.
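
Something like this at startup is usually enough; the input shape matches our dish classifier.

import torch

def warm_up(model, runs=5):
    """Run dummy inferences so the first real request isn't the slow one."""
    dummy = torch.randn(1, 3, 224, 224)
    with torch.no_grad():
        for _ in range(runs):
            model(dummy)

warm_up(model)  # call once after loading, before accepting traffic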

Response caching: if the same input appears frequently, cache the prediction. A Redis layer in front of your model server can absorb 30-60% of traffic for many workloads. Hash the input features, check the cache, only call the model on cache misses. For our dish classifier, if people keep photographing the same restaurant's menu items, caching pays for itself quickly.
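
A sketch of that cache-aside pattern with Redis, assuming the features and the prediction are JSON-serializable; the key scheme and TTL are arbitrary choices.

import hashlib
import json
import redis

cache = redis.Redis(host="redis-server", port=6379)

def cached_predict(features):
    # Hash the input features to build a stable cache key
    key = "pred:" + hashlib.sha256(
        json.dumps(features, sort_keys=True).encode()
    ).hexdigest()

    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)

    prediction = model.predict(features)             # cache miss: run the model
    cache.set(key, json.dumps(prediction), ex=3600)  # keep for an hour
    return prediction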

Circuit Breakers — When the Kitchen Catches Fire

In a real restaurant, if the stove malfunctions, you stop sending orders to that station. A circuit breaker does the same thing for a failing model server. After N consecutive failures (say, 5), the circuit "opens" — no more traffic goes to that server. After a cooling-off period (say, 60 seconds), it sends one test request. If it succeeds, the circuit "closes" and normal traffic resumes. If it fails again, the circuit stays open.

This prevents a single sick server from cascading failures across the whole system. Without a circuit breaker, the load balancer keeps sending requests to the failing server, those requests time out, the timeouts pile up, back-pressure builds, and suddenly all your servers are slow.
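
A toy version of the idea, using the thresholds from above:

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=60):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.consecutive_failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.cooldown_seconds:
                raise RuntimeError("circuit open: not calling the server")
            # Cooling-off period over: let this one test request through

        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = time.time()  # open (or keep open) the circuit
            raise
        else:
            self.consecutive_failures = 0
            self.opened_at = None  # success closes the circuit
            return result

# Usage: breaker.call(model_server.predict, features) instead of calling directly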

In production, service meshes like Istio and Linkerd provide circuit breakers, retries, timeouts, and load balancing out of the box. Building your own is a good learning exercise; relying on it in production is inviting heartache.

Changing the Menu Without Closing the Restaurant

Our chef developed a better pasta recipe. How do we start serving it without risking a dinner rush disaster if the new recipe turns out to be terrible? This is the core challenge of deployment strategies: getting a new model in front of users without breaking things.

Canary Deployment — One Table Gets the New Dish

Name one table in the restaurant the "canary table." Serve them the new pasta. Everyone else gets the old pasta. Watch closely. If the canary table loves it, gradually expand: two tables, then half the restaurant, then everyone. If they get food poisoning, pull it back immediately.

In ML terms: route 5% of production traffic to the new model, 95% to the old one. Monitor for errors, latency spikes, and metric degradation. If everything looks clean, ramp: 10%, 25%, 50%, 100%. If something goes wrong, instantly route everything back. The blast radius is contained to 5% of users.

# Istio VirtualService: 5% canary traffic split
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: dish-classifier
spec:
  hosts:
    - dish-classifier
  http:
    - route:
        - destination:
            host: dish-classifier
            subset: stable
          weight: 95
        - destination:
            host: dish-classifier
            subset: canary
          weight: 5

Blue-Green — Two Complete Kitchens

Run two identical environments. Blue serves the current model. Green has the new model loaded and ready. When you're confident, flip the switch — all traffic moves from blue to green at once. If green fails, flip back. The advantage over canary: the switch is instant in both directions. The disadvantage: it's all-or-nothing. There's no gradual ramp-up, so you're going from 0% to 100% in one step.
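
In Kubernetes, the "switch" is often just a label selector on the Service that sits in front of your two deployments. A sketch; the slot label is a naming convention, not a Kubernetes requirement.

apiVersion: v1
kind: Service
metadata:
  name: dish-classifier
spec:
  selector:
    app: dish-classifier
    slot: green        # flip between "blue" and "green" to move all traffic
  ports:
    - port: 80
      targetPort: 8080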

Shadow (Dark Launch) — The Secret Kitchen

Route production traffic to both old and new models. Serve the old model's response to users. Log the new model's response for comparison. Zero user impact — you're testing the new model with real traffic without exposing anyone to its predictions. The tradeoff: you pay double the compute cost for the duration of the shadow test.
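
A sketch of shadowing inside an async request handler; the model handles and logging helper are hypothetical.

import asyncio

async def predict_with_shadow(request):
    # The user only ever sees the stable model's answer
    primary = await stable_model.predict(request)

    # Fire-and-forget: send the same request to the candidate model
    # and log both answers for offline comparison
    async def shadow():
        try:
            candidate = await candidate_model.predict(request)
            log_comparison(request, primary, candidate)
        except Exception:
            pass  # shadow failures must never affect the user

    asyncio.create_task(shadow())
    return primary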

Shadow deployments are particularly valuable for high-stakes models — fraud detection, medical diagnosis — where you want extensive evidence that the new model behaves correctly before any user sees its output.

A/B Testing — Measuring if the New Dish is Actually Better

Here's a distinction that I still occasionally see teams get wrong. Canary deployment tests whether the new model is safe — it doesn't crash, latency is reasonable, no error spikes. A/B testing measures whether the new model is better — it improves a business metric like conversion rate, click-through rate, or revenue.

For A/B testing, you split users (not requests) into control and treatment groups. The same user must always see the same version, or you'll contaminate your experiment.

import hashlib

def assign_group(user_id: str, experiment: str) -> str:
    """Deterministic assignment: same user always gets same group."""
    key = f"{user_id}:{experiment}"
    hash_val = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return "treatment" if hash_val % 100 < 10 else "control"

A/B tests require statistical rigor to mean anything. Define your hypothesis and minimum detectable effect before the experiment starts. Calculate the required sample size. Run for the full planned duration — don't peek at intermediate results and stop early when they look good, because that inflates your false positive rate. I've seen teams declare a winner after two days of a planned two-week test because the numbers "looked really good." The numbers looked really good because of random fluctuation, and the model turned out to perform worse in the long run.
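
Sample size is the step most often skipped. A quick sketch of the calculation with statsmodels, using made-up numbers for the baseline rate and the lift you care about:

from math import ceil
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Hypothesis: the new model lifts conversion from 10.0% to 10.5%
effect_size = proportion_effectsize(0.105, 0.100)

n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,   # false positive rate
    power=0.8,    # chance of detecting the lift if it's real
)
print(ceil(n_per_group))  # users needed in EACH group; small lifts need a lot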

Wrap-Up

If you're still with me, thank you. I hope it was worth it.

We started with a model trapped in a Python notebook and walked through every step of getting it to users. We serialized it into a portable format (ONNX, TorchScript, SavedModel). We chose a serving paradigm — buffet, made-to-order, or conveyor belt — based on when predictions are needed. We picked a serving framework that handles GPU batching and multi-model management. We squeezed the model with quantization, pruning, and distillation until it fit our budget. We packed the whole thing into a Docker container, learned how to put it on phones and IoT devices, scaled it with Kubernetes, optimized every millisecond of latency, and figured out how to swap in a new model without anyone noticing.

My hope is that the next time someone asks "how do we get this model into production?", instead of the mumbling and subject-changing that used to be my response, you'll have a pretty good mental map of the terrain — the tradeoffs, the tools, and the strategies — and a solid idea of where to start.

Resources and Credits

Chip Huyen, "Designing Machine Learning Systems" (O'Reilly, 2022) — The best single book on production ML. Chapter on deployment is wildly practical and opinionated in all the right ways.

NVIDIA Triton Inference Server documentation — Dense but comprehensive. The "QuickStart" guide is the place to begin; the model configuration reference is what you'll keep coming back to.

ONNX Runtime documentation and performance tuning guide — Insightful coverage of graph optimizations, execution providers, and quantization. The benchmarking notebooks are eye-opening once you see the speedup numbers.

The Lottery Ticket Hypothesis (Frankle & Carbin, 2019) — The O.G. paper that showed large networks contain small, trainable subnetworks. Changed how we think about pruning.

Hinton et al., "Distilling the Knowledge in a Neural Network" (2015) — The foundational distillation paper. Short, readable, and the temperature-scaling trick is still the core of every distillation pipeline today.

vLLM project and PagedAttention paper — If you're serving LLMs, this is required reading. The analogy to virtual memory management is elegant and the throughput gains are real.