Infrastructure & Cloud
I avoided thinking about ML infrastructure for an embarrassingly long time. Every time someone mentioned GPU types, cloud pricing tiers, or cluster orchestration, I'd nod along and silently pray nobody asked me a follow-up question. I was an ML person, not an ops person. The model was my job. The metal it ran on was someone else's problem.
Then I accidentally left four A100 GPUs running over a long weekend. The bill was $2,300 for a fine-tuning job that had finished Friday afternoon. That's when the discomfort of not understanding what's underneath grew too sharp to ignore. Here is that dive.
ML infrastructure is the collection of hardware, cloud services, and orchestration tools that turn your model.fit() call into something that actually runs — fast enough, cheap enough, and reliably enough to matter. GPUs became the dominant hardware for deep learning around 2012, cloud ML platforms emerged around 2017–2018, and the landscape has been accelerating since the LLM explosion of 2022–2023. Today, the difference between a well-chosen and a poorly-chosen infrastructure setup can be a factor of 10× in cost for the same work.
Before we start, a heads-up. We're going to be talking about GPU memory hierarchies, distributed computing strategies, cloud platform trade-offs, and cost arithmetic. You don't need to know any of it beforehand. We'll build each concept from the ground up, one piece at a time, with explanation.
This isn't a short journey, but I hope you'll be glad you came.
Why hardware matters at all
The GPU as a kitchen: VRAM, compute, and bandwidth
The GPU landscape: T4 through H100 and beyond
The VRAM calculation you'll do a hundred times
Rest stop
When one GPU isn't enough: scaling up and scaling out
Multi-GPU strategies: data, tensor, and pipeline parallelism
The three cloud kingdoms: SageMaker, Vertex AI, and Azure ML
Spot instances: 70% savings with a catch
The on-prem vs. cloud decision
Serverless inference: when you don't want GPUs at all
Cluster management: Kubernetes, Ray, or managed platforms
Wrap-up
Resources
Why Hardware Matters at All
Here's a scenario we'll carry through this entire section. Imagine you're building PetSnap — a mobile app that identifies dog breeds from photos. You've got a ResNet-50 that works beautifully on your laptop. Accuracy is great. Latency is fine. You show it to your team and everyone's excited.
Then someone asks: "Can we serve this to 10,000 users?" And everything changes.
Your laptop GPU has 8 GB of memory. The model needs 100 milliseconds per image, which is about ten predictions per second on one GPU. At 10,000 requests per hour that's fine on average, but traffic arrives in bursts, and peak load can easily outrun a single GPU. And if someone later says "What if we used a larger vision-language model instead?" — now the model itself might not fit in 8 GB at all.
This is the fundamental tension of ML infrastructure. The model is the science. The hardware is the physics. You can design the most elegant architecture in the world, but it has to physically exist somewhere — on silicon that has a fixed amount of memory, a fixed number of arithmetic units, and a fixed rate at which data can flow between them. Ignoring those physical constraints is how teams end up with models that work in a notebook and fail in production.
Think of it this way. If training a model is like cooking a meal, the GPU is your stove. The stove's size limits how many pots you can run simultaneously. The burner heat determines how fast each pot boils. And the counter space — that's your VRAM — limits how many ingredients you can have out at once. You can be the best chef in the world, but if your kitchen is too small for the recipe, the food doesn't get made.
The GPU as a Kitchen: VRAM, Compute, and Bandwidth
A GPU has three numbers that matter for ML. Understanding what each one does — and which one will bottleneck you first — is the single most useful piece of infrastructure intuition you can develop.
VRAM (Video RAM, also called GPU memory) is the counter space in our kitchen. It's where the model weights, the input data, the intermediate activations, and the gradients all have to fit simultaneously. If your model plus its working data exceeds VRAM, the job crashes. There's no graceful fallback — it's an out-of-memory error, and your training run dies.
For our PetSnap ResNet-50, the model weights in FP32 (32-bit floating point, meaning each parameter takes 4 bytes) occupy about 25 million parameters × 4 bytes ≈ 100 MB. That's nothing. An 8 GB GPU handles it easily. But a 7-billion parameter LLM in FP16 (2 bytes per parameter) needs 14 GB for the weights alone. Add gradients, optimizer states, and activations during training, and you're looking at 80–100 GB. That doesn't fit on any single consumer GPU.
Compute (measured in TFLOPS — trillions of floating-point operations per second) is the heat of the burners. More TFLOPS means more matrix multiplications per second, which means faster training and faster inference. Modern GPUs achieve their highest TFLOPS using specialized hardware called Tensor Cores, which are circuits designed specifically for the matrix multiply-accumulate operations that dominate deep learning.
Memory bandwidth is how fast ingredients travel from the fridge to the counter. It's measured in GB/s and determines how quickly data moves between VRAM and the compute units. I'll be honest — I underestimated this number for years. I assumed TFLOPS was everything. But for LLM inference, the model spends most of its time reading weights from memory, not doing arithmetic. The bottleneck is bandwidth, not compute. This is called being memory-bandwidth-bound, and it's why two GPUs with similar TFLOPS can have wildly different inference speeds if their bandwidth differs.
Here's the punchline. Training is usually compute-bound — you want maximum TFLOPS to chew through billions of tokens. Inference, especially for large language models, is usually bandwidth-bound — you want fast memory reads to stream weights through the compute units. Picking hardware without knowing which bottleneck you're solving is how teams end up paying H100 prices for a job that a cheaper GPU could handle at the same speed.
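One way to sanity-check which regime you're in is to compare the workload's arithmetic intensity (FLOPs performed per byte read from VRAM) against the GPU's compute-to-bandwidth ratio. Here's a rough sketch using A100 spec-sheet numbers; the workload intensities are illustrative assumptions, not measurements.

# Roofline-style sanity check: compute-bound or bandwidth-bound?
peak_tflops = 312          # A100 BF16 tensor compute, from the spec sheet
peak_bandwidth_gbs = 2039  # A100 HBM bandwidth

# The "ridge point": FLOPs the GPU can perform per byte it reads from VRAM.
ridge = (peak_tflops * 1e12) / (peak_bandwidth_gbs * 1e9)  # ~153 FLOPs/byte

# Rough, assumed arithmetic intensities for two workloads:
workloads = {
    "LLM decoding, batch size 1": 1,    # ~2 FLOPs per FP16 weight, 2 bytes read per weight
    "LLM training, large batch": 300,   # weights reused across many tokens per read
}
for name, intensity in workloads.items():
    regime = "bandwidth-bound" if intensity < ridge else "compute-bound"
    print(f"{name}: ~{intensity} FLOPs/byte vs ridge {ridge:.0f} -> {regime}")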
The GPU Landscape: T4 Through H100 and Beyond
Let's make this concrete. Here are the GPUs you'll actually encounter in cloud ML today, ordered from least to most expensive. I'll keep returning to our PetSnap example to show where each one fits.
The NVIDIA T4 is the budget workhorse. It has 16 GB of VRAM, delivers about 65 TFLOPS at FP16, and costs around $0.50–1.00 per hour on cloud. It's based on the Turing architecture from 2018. For PetSnap's ResNet-50 inference, a T4 is more than enough. The model fits in memory with room to spare, and the compute is sufficient for hundreds of predictions per second. Where it struggles: anything larger than about a 3-billion parameter model, and any serious training job.
The NVIDIA L4 is the T4's successor, built on the Ada Lovelace architecture (2023). It bumps VRAM to 24 GB, compute to about 121 FP16 TFLOPS, and adds fourth-generation Tensor Cores. Cloud pricing runs $0.80–1.20 per hour. For PetSnap, if we upgrade to a vision-language model around 3–7B parameters with INT8 quantization, the L4 is the sweet spot. It's the inference GPU I'd recommend for most production workloads that don't involve massive models.
The NVIDIA A100 changed ML infrastructure when it launched in 2020. With 80 GB of HBM2e memory, about 312 TFLOPS at BF16, and 2,039 GB/s memory bandwidth, it became the training workhorse of the industry. Cloud pricing runs $3–5 per hour. For PetSnap, the A100 is overkill for inference, but if we're fine-tuning a 7B language model to understand pet breed descriptions, a single A100 can handle it with QLoRA. Two A100s can handle full fine-tuning of models up to about 13B parameters.
The NVIDIA H100 (Hopper architecture, 2022) is the current frontier. Same 80 GB of VRAM as the A100, but roughly 1.6× the memory bandwidth (3,350 GB/s via HBM3), over 3× the compute (989 TFLOPS at BF16), and native FP8 support that the A100 lacks entirely. Cloud pricing runs $8–12 per hour. Benchmarks show the H100 finishing the same training job 3–4× faster than the A100 — and because it finishes sooner, the cost per training run can actually be lower despite the higher hourly rate. I didn't believe this when I first read it. The math checks out.
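The arithmetic behind that counterintuitive result fits in a few lines. The job length and prices below are illustrative assumptions; what matters is the shape of the comparison.

# Same training job priced on A100 vs H100, using mid-range rates from above.
a100_rate, h100_rate = 4.0, 10.0       # $/GPU-hour (illustrative)
a100_hours = 30.0                      # assumed job length on an A100
speedup = 3.0                          # H100 finishing ~3x faster, per the benchmarks above

a100_cost = a100_rate * a100_hours                # $120
h100_cost = h100_rate * (a100_hours / speedup)    # $100
print(f"A100: ${a100_cost:.0f}   H100: ${h100_cost:.0f}")
# The pricier GPU wins whenever its speedup exceeds its price ratio (here 3.0 vs 2.5).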
Beyond NVIDIA's lineup, there are two other players worth knowing. Google TPUs (Tensor Processing Units) are custom silicon designed for matrix multiplications. TPU v5e offers 16 GB HBM at about $1.20–2.00 per hour — competitive pricing, but you're locked into Google's JAX/TensorFlow ecosystem and XLA compilation. The TPU v5p, with 95 GB HBM and 459 TFLOPS, competes with the A100 for large training runs on GCP. AWS custom chips — Trainium2 for training and Inferentia2 for inference — offer competitive performance at potentially lower prices, but require Amazon's Neuron SDK, which means rewriting parts of your training code.
I'm still developing my intuition for when custom silicon like TPUs or Trainium is worth the ecosystem lock-in versus sticking with NVIDIA's universally-supported CUDA stack. The safe answer: if you're already deep in GCP with JAX, TPUs make sense. If you're already on AWS and running standard PyTorch, stay with NVIDIA GPUs unless cost pressure forces you to explore alternatives.
The VRAM Calculation You'll Do a Hundred Times
Every infrastructure decision starts with one question: how much GPU memory do I need? Let me walk through the arithmetic, because once you see it, every GPU choice becomes straightforward.
Back to PetSnap. Suppose we've decided to upgrade from ResNet-50 (about 25 million parameters) to a 7-billion parameter vision-language model so users can ask questions about their pets in natural language. How much VRAM does this need?
For inference, the dominant cost is the model weights. Each parameter occupies a number of bytes determined by the numerical precision we store it in. FP32 (full precision) uses 4 bytes per parameter. FP16 and BF16 (half precision) use 2 bytes. INT8 uses 1 byte. INT4 uses half a byte. So our 7B model in FP16 takes 7 billion × 2 bytes = 14 GB. In INT4, that drops to 3.5 GB. We also need overhead for the KV cache (the memory used to store previous tokens during text generation) — typically an extra 20–30% on top. So FP16 inference needs roughly 18 GB. INT4 inference needs about 4.5 GB.
That means INT4 inference fits on an L4 (24 GB) with room to spare. FP16 inference, at roughly 18 GB, squeezes onto a 24 GB card if we're careful with batch sizes, or runs comfortably on an A100 (80 GB). Already, the hardware choice becomes concrete: L4 for quantized serving, A100 for full precision.
For training, the story is harsher. You need memory for the model weights, plus the gradients (same size as the weights), plus the optimizer states. The Adam optimizer — the most common choice — stores two additional FP32 copies of every parameter (the first and second moment estimates). So training a 7B model in BF16 requires roughly: 14 GB (weights in BF16) + 14 GB (gradients in BF16) + 56 GB (optimizer states in FP32, which is 7B × 4 bytes × 2 copies). That's about 84 GB before accounting for activations. Add activation memory (highly variable depending on batch size and sequence length, but often 20–50 GB), and you're well over 100 GB. No single GPU handles this. You need multiple GPUs working together.
def estimate_vram_gb(params_billions, precision="fp16", mode="inference"):
    """The back-of-envelope VRAM calculation every ML engineer needs."""
    bytes_per_param = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}
    bp = bytes_per_param[precision]
    model_mem = params_billions * bp  # weights in GB, matching the prose arithmetic above
    if mode == "inference":
        # Model weights + KV cache overhead (~20-30%)
        return model_mem * 1.3
    else:
        # Weights + gradients + Adam optimizer (2 × FP32 copies) + activations
        optimizer_mem = params_billions * 4 * 2  # two FP32 moments per parameter
        grad_mem = model_mem
        activation_mem = model_mem  # rough lower bound; grows with batch size and sequence length
        return model_mem + optimizer_mem + grad_mem + activation_mem

# PetSnap's 7B vision-language model:
# INT4 inference: ~4.5 GB → fits on an L4 (24 GB) with room to spare
# BF16 training: ~100 GB as a lower bound → needs multiple GPUs (e.g., 2× A100 80 GB with FSDP)
There's an important asymmetry here. Inference memory scales linearly with model size and precision — predictable, easy to plan for. Training memory has that extra optimizer-states term that makes it grow much faster. This is why teams can often serve a model on cheaper hardware than they used to train it.
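To see that asymmetry in numbers, here's a quick sweep using the estimate_vram_gb sketch above (same rough assumptions as that function):

# Inference vs. training memory for a few model sizes, via the sketch above.
for size_b in (1, 7, 13, 70):
    inf = estimate_vram_gb(size_b, precision="int4", mode="inference")
    trn = estimate_vram_gb(size_b, precision="bf16", mode="training")
    print(f"{size_b:>3}B params: INT4 inference ~{inf:.0f} GB, BF16 training ~{trn:.0f} GB")
# 7B: ~5 GB to serve quantized vs. ~100 GB to train. The optimizer states dominate.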
Rest Stop
Congratulations on making it this far. You can stop here if you want.
You now have a working mental model for GPU selection: you know that VRAM is the first constraint, that compute and bandwidth determine speed, and you can do the arithmetic to figure out which GPU fits your workload. For PetSnap, you know a T4 handles the ResNet, an L4 handles quantized LLM inference, and training a 7B model needs A100s or H100s. That's a genuinely useful framework — it'll carry you through most hardware conversations.
What it doesn't cover is what happens when one GPU isn't enough. How do you split a training job across multiple GPUs? How do the three major cloud platforms differ in practice? How do you avoid hemorrhaging money on idle compute? And when does it make sense to buy your own hardware instead of renting it?
The short version: multi-GPU training uses three flavors of parallelism (data, tensor, pipeline), cloud platforms matter less than where your data already lives, spot instances save 60–90% but require checkpoint discipline, and on-prem breaks even at around 40–50% utilization over three years. There. You're 60% of the way through.
But if the discomfort of not knowing what's underneath is nagging at you, read on.
When One GPU Isn't Enough: Scaling Up and Scaling Out
Our PetSnap 7B model needs ~105 GB for training. An A100 has 80 GB. The model doesn't fit. Now what?
There are two directions to go, and they have very different trade-offs. Think of it as the difference between getting a bigger kitchen versus opening a second kitchen across town.
Scaling up (also called vertical scaling) means adding more GPUs within the same physical machine. A server with 8× A100 GPUs connected via NVLink — a high-speed interconnect that moves data between GPUs at 600 GB/s on A100s and 900 GB/s on H100s — acts like one giant kitchen with multiple stoves sharing the same counter. Data flows between GPUs almost as fast as it flows within a single GPU. This is the right first move for most teams. Our PetSnap training job needs 105 GB across two A100s on one node? That works well. The GPUs communicate over NVLink, and the overhead is minimal.
Scaling out (also called horizontal scaling) means adding more machines. Going from one 8×A100 node to four of them. Now the kitchens are in different buildings. Data travels between machines over InfiniBand (50–100 GB/s) or, in cheaper setups, plain Ethernet (~12.5 GB/s for 100 Gigabit). That's 10–70× slower than NVLink. Every time GPUs on different machines need to exchange information — and in training, they need to exchange gradient updates after every batch — they pay this communication tax.
The hierarchy matters: NVLink within a node at 900 GB/s, InfiniBand between nodes at 50–100 GB/s, Ethernet at 12.5 GB/s. If an NVLink transfer takes one second at a certain data size, the same transfer takes roughly 10 seconds over InfiniBand and 70 seconds over Ethernet. This isn't a minor detail — it determines which parallelism strategies are even feasible.
For inference scaling, life is easier. Each model replica serves requests independently, so you can add machines without any communication between them. Stick a load balancer in front, add replicas as traffic grows. The challenge is auto-scaling: GPUs are expensive to keep idle but slow to cold-start — loading a large model takes 30–60 seconds. For PetSnap, this means pre-warming at least a few replicas during expected traffic spikes rather than scaling from zero.
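Sizing that replica fleet is simple arithmetic: peak throughput times per-request latency, plus headroom. The traffic figures below are assumptions for PetSnap, not measurements.

import math

peak_rps = 25                # assumed peak requests per second
latency_s = 0.1              # ~100 ms per image on one GPU
per_replica_rps = 1 / latency_s                         # ~10 requests/second per GPU replica
replicas = math.ceil(peak_rps / per_replica_rps * 1.3)  # +30% headroom for bursts
print(f"Keep ~{replicas} replicas warm at peak")        # → 4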
Multi-GPU Strategies: Data, Tensor, and Pipeline Parallelism
When training needs multiple GPUs, the question becomes: how do you split the work? There are three strategies, each with different trade-offs. I'll be honest — the distinction between tensor and pipeline parallelism confused me for a long time, and I still have to think carefully about which to use when. Walking through a concrete example helped it click.
Let's say PetSnap's 7B model has 32 transformer layers, and we have 4 GPUs on one machine.
Data parallelism is the most intuitive approach. Every GPU gets a complete copy of the model. The training data is split into 4 chunks — GPU 0 processes batch chunk 0, GPU 1 processes batch chunk 1, and so on. After each forward and backward pass, the GPUs exchange their gradient updates (via an operation called AllReduce) so every copy stays synchronized. This is what PyTorch's DistributedDataParallel (DDP) does. It's the easiest to set up and works well when the model fits on a single GPU. But PetSnap's 7B model needs 105 GB for training — it doesn't fit on one A100. Data parallelism alone can't help us here.
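Even though plain data parallelism can't rescue our 7B model, it's worth seeing how little code it takes, because FSDP below is an evolution of the same pattern. A minimal sketch, assuming the script is launched with torchrun so each process owns one GPU, and with build_model() standing in for whatever constructs PetSnap's network:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes launch via `torchrun --nproc_per_node=4 train.py`
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = build_model().cuda(local_rank)       # build_model() is a placeholder
model = DDP(model, device_ids=[local_rank])
# Each process reads a different shard of the data (via DistributedSampler);
# gradients are AllReduced automatically during backward().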
That's where Fully Sharded Data Parallelism (FSDP) comes in. Instead of replicating the entire model on every GPU, FSDP shards the parameters, gradients, and optimizer states across GPUs. Each GPU only holds 1/N of the model's memory footprint (where N is the number of GPUs). When a layer needs the full parameters for a forward or backward pass, the GPUs gather the relevant shard, compute, and release it. Our 105 GB training job across 4 GPUs becomes about 26 GB per GPU — each A100 handles that comfortably.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy

# FSDP shards everything — each GPU holds 1/N of the memory.
# Assumes the process group is already initialized (e.g., launched via torchrun).
model = FSDP(
    pet_vision_model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
)
# 105 GB model → ~26 GB per GPU across 4 A100s
# The trade-off: more communication between GPUs, but the model fits
Tensor parallelism takes a different slice. Instead of splitting the data or the memory footprint, it splits individual layers. Imagine one of PetSnap's transformer layers has a weight matrix of shape [4096, 16384]. Tensor parallelism splits that matrix into 4 columns — GPU 0 handles columns 0–4095, GPU 1 handles 4096–8191, and so on. Each GPU computes part of the layer's output, and they combine results. This is what NVIDIA's Megatron-LM framework does. The catch: this requires extremely fast communication between GPUs, because they need to exchange partial results within every single layer. NVLink is fast enough. InfiniBand is not. So tensor parallelism is viable only within a single node.
Pipeline parallelism splits the model by layers. GPU 0 gets layers 0–7, GPU 1 gets layers 8–15, GPU 2 gets 16–23, GPU 3 gets 24–31. Data flows through GPU 0, then GPU 1, then GPU 2, then GPU 3 — like an assembly line. The communication happens only between adjacent stages, and it's less data per transfer than tensor parallelism requires. This tolerates higher-latency interconnects, making it suitable for multi-node training. The downside: pipeline bubbles. While GPU 0 processes micro-batch 2, GPU 3 is waiting for micro-batch 1 to finish the earlier stages. There's inherent idle time unless you interleave micro-batches carefully.
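The bubble has a handy rule-of-thumb formula: with p pipeline stages and m micro-batches per batch, the idle fraction is roughly (p − 1) / (m + p − 1). A quick check for a 4-stage pipeline:

# Pipeline bubble: approximate fraction of time stages sit idle.
def bubble_fraction(stages, micro_batches):
    return (stages - 1) / (micro_batches + stages - 1)

for m in (1, 4, 16, 64):
    print(f"4 stages, {m:>2} micro-batches: {bubble_fraction(4, m):.0%} idle")
# 1 micro-batch → 75% idle; 64 micro-batches → under 5%. Interleaving matters.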
In practice, large-scale training combines all three. 3D parallelism — data parallelism across nodes, tensor parallelism within a node, and pipeline parallelism across groups of nodes — is how frontier labs train models with hundreds of billions of parameters. DeepSpeed and Megatron-LM provide the machinery for this. For PetSnap's 7B model, FSDP across 4 GPUs on one node is all we need. The 3D approach becomes necessary at the 70B+ scale.
An important rule of thumb: doubling GPUs rarely halves training time. Communication overhead eats into the gains. Getting 75% scaling efficiency (8 GPUs delivering 6× the throughput of 1 GPU) is considered good. Below 60%, you're burning money on communication. Always benchmark before committing to larger clusters.
The Three Cloud Kingdoms: SageMaker, Vertex AI, and Azure ML
Let's say PetSnap has grown. We need to train models regularly, serve them to millions of users, and we've decided to use the cloud rather than buying hardware. Three platforms dominate the ML cloud: AWS SageMaker, GCP Vertex AI, and Azure ML. I spent weeks trying to find the "best" one before realizing the question was wrong. The right question is: where does your data already live?
AWS SageMaker is the kitchen-sink approach. It has a service for everything — training jobs, real-time endpoints, batch inference, feature stores, pipelines, experiment tracking, AutoML (called Autopilot), and bias detection (Clarify). For PetSnap on AWS, we'd upload our training data to S3, define a training job with instance type and count, and SageMaker handles provisioning the GPU cluster, running the job, and tearing it down afterward. We pay only for the training time — no idle GPUs.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    role=sagemaker_execution_role,   # your IAM execution role ARN (placeholder variable)
    framework_version="2.1",         # PyTorch container version
    py_version="py310",
    instance_count=2,
    instance_type="ml.g5.2xlarge",   # 1× A10G per node, 2 nodes total
    distribution={"torch_distributed": {"enabled": True}},
    hyperparameters={"epochs": 10, "lr": 1e-4},
)
estimator.fit({"train": "s3://petsnap-data/images/"})
# Cluster spins up → trains → saves model to S3 → tears down
The trade-off: SageMaker instances cost 20–40% more than the same raw EC2 GPU instances. That premium buys you cluster management, auto-teardown, and integration with the rest of the AWS ML stack. For teams that want to focus on models rather than infrastructure, the premium is worth it. For teams with strong DevOps skills, raw EC2 plus custom orchestration can be cheaper.
GCP Vertex AI has two genuine differentiators. First, TPU access — Google's custom AI chips aren't available anywhere else. For teams using JAX, TPUs can be 30–50% cheaper per training-FLOP than equivalent NVIDIA GPUs. Second, Model Garden provides one-click fine-tuning and deployment of foundation models (Gemini, Llama, Mistral). Vertex AI Pipelines run on Kubeflow under the hood, which is open-source and somewhat portable to other clouds. The Feature Store is backed by BigQuery, which is natural if your data warehouse is already there.
Azure ML wins on enterprise governance. The Responsible AI Dashboard — bundling fairness assessment, interpretability, error analysis, and counterfactual explanations in one interface — is the most comprehensive of the three. Azure OpenAI Service provides enterprise-grade GPT-4o access with data residency guarantees, VNet isolation, and content filtering that regulated industries (healthcare, finance) require. If PetSnap were a veterinary AI company bound by data regulations, Azure would be a strong choice for that reason alone.
Here's the honest assessment: all three platforms can do all the things. The feature checklists are converging. What actually determines the right choice is data gravity — a term for the observation that your ML infrastructure will orbit your data. If PetSnap's images are already in S3, we'd use SageMaker, because moving terabytes of data to another cloud costs both money ($0.09 per GB of egress) and time. If our analytics pipeline runs on BigQuery, Vertex AI. If we're a Microsoft shop with Active Directory and Power BI, Azure ML. That's the whole decision.
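To put a number on that gravity, here's the back-of-envelope version; the corpus size is made up, and the $0.09/GB rate is the standard egress price mentioned above.

# One-time cost of moving PetSnap's image corpus to another cloud.
corpus_tb = 5                    # assumed corpus size
egress_rate = 0.09               # $/GB of internet egress
egress_cost = corpus_tb * 1024 * egress_rate
print(f"Moving {corpus_tb} TB out: ~${egress_cost:,.0f}")
# ~$460 once is survivable; paying it every retraining cycle is not.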
Spot Instances: 70% Savings with a Catch
Cloud GPU instances at on-demand prices are expensive. Running our PetSnap training job on two A10G instances for 20 hours costs about $60 at on-demand rates. Not bad. But training a 70B model on a node with 8 H100s for a week costs over $13,000. At that scale, cost optimization stops being nice-to-have.
Spot instances (AWS's term) or preemptible VMs (GCP's term) are cloud GPUs offered at 60–90% discounts because they use the provider's spare capacity. The catch: the cloud can reclaim them with as little as 30 seconds' notice. Your training job gets terminated mid-step. If you haven't saved your progress, everything since the last checkpoint is lost.
I remember the first time a spot instance was reclaimed during a training run. It was 14 hours in, no checkpointing configured, and I had to restart from scratch. That was the last time I ran a training job without the checkpoint-and-resume pattern.
The pattern is straightforward. Save checkpoints to durable storage (S3, GCS) every N steps. When the job restarts — either because spot reclaimed the instance or because of any other interruption — load the latest checkpoint and continue from where you left off. You lose at most N steps of work.
import torch

# Durable storage that survives instance death. Writing to S3 needs a helper
# (e.g., a thin boto3 wrapper); torch.save can't write to s3:// paths directly.
CHECKPOINT_DIR = "s3://petsnap-checkpoints/"

def save_checkpoint(model, optimizer, epoch, step, path):
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "epoch": epoch,
        "step": step,
    }, path)

def resume_from_checkpoint(model, optimizer, checkpoint_dir):
    # list_s3_objects / download are placeholder helpers around your storage client.
    checkpoints = sorted(list_s3_objects(checkpoint_dir))
    if not checkpoints:
        return 0, 0  # Fresh start
    latest = torch.load(download(checkpoints[-1]), map_location="cpu")
    model.load_state_dict(latest["model"])
    optimizer.load_state_dict(latest["optimizer"])
    return latest["epoch"], latest["step"]

# Every 500 steps, save. If spot reclaims the instance, we lose at most 500 steps.
# At a 60-90% discount, the savings easily justify the occasional re-work.
For PetSnap's training, spot instances with checkpointing every 500 steps would reduce our compute bill by 60–70%. The job might get interrupted 2–3 times, losing perhaps 20 minutes of work total across a 20-hour run. The math works overwhelmingly in our favor.
For inference, spot is riskier. A reclaimed instance means dropped requests — users staring at a loading screen. Use spot for batch inference (processing a queue of images overnight) where latency doesn't matter. Keep on-demand or reserved instances for real-time serving. Reserved instances (AWS) and committed use discounts (GCP) lock you into 1–3 year contracts at 30–60% savings over on-demand. The rule: commit only for your baseline load — the minimum compute you'll always need. Handle spikes with on-demand, experiments with spot.
The On-Prem vs. Cloud Decision
At some point, every ML team doing serious work asks: should we buy our own GPUs?
The arithmetic is more nuanced than either side admits. Cloud advocates say "you can't beat the flexibility." On-prem advocates say "we saved 60% by buying hardware." Both are right, for their specific situations. Let me walk through the numbers.
Suppose PetSnap has grown to the point where we're running GPU workloads 18 hours a day, 7 days a week. That's about 6,500 GPU-hours per year per GPU. On cloud, a single A100 at $4/hour costs us $26,000/year. Buying an A100-based server (around $150,000 for an 8-GPU system, or about $19,000 per GPU slot) plus $2,000/year in electricity and cooling works out to roughly $25,000 per GPU over three years; after that, the ongoing cost drops to just the $2,000/year to run it. Over the same three years, cloud costs $78,000 per GPU. The savings are real.
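Spelled out as code, with the same rough, illustrative prices:

# Three-year cloud vs. on-prem comparison for one always-busy GPU.
gpu_hours_per_year = 18 * 365                    # ~6,570 hours at 18 h/day
cloud_per_year = 4.00 * gpu_hours_per_year       # A100 on-demand ≈ $4/hour → ~$26k/year
cloud_3yr = cloud_per_year * 3                   # ≈ $78k per GPU

capex_per_gpu = 150_000 / 8                      # 8-GPU server ≈ $19k per GPU slot
opex_per_year = 2_000                            # power + cooling (rough)
on_prem_3yr = capex_per_gpu + opex_per_year * 3  # ≈ $25k per GPU
print(f"cloud: ${cloud_3yr:,.0f}   on-prem: ${on_prem_3yr:,.0f}")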
But those numbers hide important costs. On-prem means you own capacity planning — if PetSnap's traffic doubles overnight, you can't spin up new GPUs in minutes. You own maintenance — when a GPU fails (and at scale, they do), you need spare hardware and people who know how to swap it. You own cooling — a server room full of H100s generates enormous heat. And you own the depreciation risk — the $150,000 server you buy today may be outperformed by hardware that costs half as much in 18 months.
The break-even point that most analyses converge on: if you're consistently using GPU capacity at 40–50% utilization or higher, sustained over 3+ years, on-prem becomes cheaper. Below that, or for unpredictable workloads, cloud wins on flexibility.
The pattern that works for many teams: on-prem for steady-state (the inference fleet that runs 24/7), cloud for burst (large training runs, experiments, handling traffic spikes). PetSnap might own 4 GPUs for its always-on inference service and rent cloud GPUs for monthly retraining runs. This captures the cost advantages of both without being fully committed to either.
Serverless Inference: When You Don't Want GPUs at All
There's a question that kept nagging me as I was learning all this GPU and cloud machinery: what if I don't want to manage any of it?
Serverless ML inference means you deploy a model, and the platform handles everything — provisioning GPUs when requests arrive, scaling to zero when there's no traffic, billing by the millisecond of actual compute. You never see a GPU instance. You never configure an auto-scaler. The model runs when called and sleeps otherwise.
For PetSnap's early days — sporadic traffic, unpredictable usage patterns, a team of three people who'd rather work on the model than on infrastructure — serverless is compelling. AWS SageMaker Serverless Inference, Google Cloud Run with GPU support, and third-party platforms like Modal and Replicate all offer this model.
The trade-off is cold start latency. When no requests have come in for a while, the platform has deallocated the GPU. The next request triggers reallocation and model loading — anywhere from 5 seconds for a small model to 60+ seconds for a large one. For PetSnap, if a user uploads a photo and waits 30 seconds for a breed prediction, that's a bad experience. Serverless works when cold starts are acceptable (batch processing, internal tools, low-traffic APIs) and falls apart when consistent low latency matters.
The other limitation: serverless platforms give you less control over the GPU type, quantization settings, batching strategy, and memory configuration. If you need to squeeze every ounce of performance from your serving setup, you'll eventually outgrow serverless and need dedicated instances with custom configurations.
I still use serverless for prototypes and internal tools. For production-facing inference with latency requirements, dedicated GPU instances with a proper auto-scaler remain the more reliable path.
Cluster Management: Kubernetes, Ray, or Managed Platforms
As PetSnap grows further — multiple models, multiple teams, a mix of training and inference workloads — someone has to decide how all those GPU resources get shared and orchestrated. This is the cluster management question, and it has three main answers.
Managed platforms (SageMaker, Vertex AI, Azure ML) are the path of least resistance. You don't manage a cluster at all — you submit jobs and deploy endpoints, and the platform handles scheduling, scaling, and GPU allocation. For PetSnap with a team of 5–15 ML engineers, this is the right starting point. The overhead of running your own cluster isn't justified until you hit specific pain points: the platform's auto-scaling doesn't match your traffic pattern, or its container constraints block your model architecture, or you need fine-grained control over GPU scheduling priorities between teams.
Kubernetes (K8s) is the industry standard for running your own cluster. It lets you request GPUs as a resource (like CPU or memory), schedule jobs across a pool of machines, enforce resource quotas per team, and auto-scale serving deployments based on metrics like queue depth or latency. Kubeflow adds ML-specific capabilities on top: pipeline orchestration, hyperparameter tuning (via Katib), and model serving (via KServe). The investment is real, though — installing, configuring, and maintaining Kubeflow requires dedicated platform engineers. It's best suited for organizations that already have Kubernetes expertise and need multi-cloud portability.
Ray takes a different approach. Instead of building on Kubernetes abstractions, Ray provides its own distributed runtime where you write Python and Ray handles distribution across a cluster. Ray Train orchestrates distributed training, Ray Serve handles model serving with request batching and multi-model composition, and Ray Data manages distributed preprocessing. Ray can run on Kubernetes (via the KubeRay operator) or on bare VMs. Its strength is composability — if PetSnap needs a pipeline where the breed classifier feeds into a health assessment model which feeds into a recommendation engine, Ray Serve lets you wire those together in Python rather than YAML.
from ray import serve

@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 1})
class BreedClassifier:
    def __init__(self):
        # load_model is a placeholder for however PetSnap loads its weights
        self.model = load_model("petsnap-breed-v3")

    async def __call__(self, request):
        image = (await request.json())["image"]
        return self.model.predict(image)

# Each replica gets its own GPU. Ray handles load balancing and scaling.
app = BreedClassifier.bind()
serve.run(app)  # serves over HTTP on the Ray cluster (port 8000 by default)
My honest assessment: start with managed platforms. Move to Ray when you need multi-model composition or have Python-heavy teams that don't want to learn Kubernetes. Move to Kubeflow/KServe when you have dedicated platform engineers and need Kubernetes-native infrastructure that spans multiple clouds or on-prem clusters.
Wrap-Up
If you're still with me, thank you. I hope it was worth the detour into ops territory.
We started from the bare physics of it — VRAM, compute, and bandwidth as the three constraints that determine what any GPU can do. We walked through the GPU landscape from the $0.50/hr T4 to the $12/hr H100, learned the VRAM arithmetic that drives every hardware choice, and followed PetSnap as it scaled from a single laptop GPU to multi-GPU training. We compared the three major cloud platforms and discovered that data gravity matters more than feature checklists. We learned to save 60–90% with spot instances (at the cost of checkpoint discipline), worked through the on-prem break-even math, explored serverless as an alternative to managing GPUs at all, and ended with the cluster management spectrum from managed platforms to Kubernetes to Ray.
My hope is that the next time someone mentions GPU types or cloud pricing in a meeting, instead of nodding along and hoping no one asks a follow-up question, you'll be the person who can do the VRAM math on the back of a napkin and suggest the right GPU before anyone else has opened a pricing calculator.
Resources
A curated list of things I found genuinely helpful while learning this material.
- NVIDIA's GPU architecture whitepapers (Ampere, Hopper, Ada Lovelace) — dry reading, but the definitive source for understanding what Tensor Cores actually do. The Hopper whitepaper's section on the Transformer Engine is particularly insightful.
- Stas Bekman's "Machine Learning Engineering" book (free online) — the most practical guide to ML infrastructure I've found. The chapters on multi-GPU training and memory estimation are wildly helpful.
- The Hugging Face documentation on FSDP and DeepSpeed — best hands-on tutorial for distributed training with actual code you can run. Start here before touching the PyTorch docs.
- cloud-gpu-benchmarks.com — a community-maintained comparison of GPU pricing and performance across cloud providers. The numbers change monthly, so always check the latest.
- Ray documentation's "Patterns" section — the design patterns for multi-model serving pipelines are well-written and solve problems I didn't know I'd have until I had them.
- "The GPU Poor" blog posts by various authors — a genre of practical writing about doing ML on limited budgets. Grounding and honest about what you can accomplish without H100 clusters.